Preface
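Python can be used to crawl the resource files of any web page, such as images, audio, and video. The usual approach is to request the page's HTML and extract the resources you want with XPath or regular expressions. I have built a crawler tool that grabs these media files with one click. Note, however, that it only collects files already present in the HTML. Resources that require a second request, such as those on the Kugou Music playback page, cannot be crawled, because the tool is meant to be generic and work across different sites.
The main focus is image crawling: if you need image assets, enter a URL and crawl them with one click.
When crawling video, magnet links are collected as well, and these can be downloaded with a third-party download tool.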
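As a rough illustration of the "fetch the HTML, then pull links out with a regular expression" approach, the sketch below uses only the standard library and works on an HTML string that has already been downloaded. The helper name `find_image_urls`, the extension list, and the sample HTML are made up for this example and are not part of the tool.

```python
import re

# Hypothetical pattern: src/href attribute values ending in a common image extension.
IMAGE_URL_RE = re.compile(
    r'''(?:src|href)\s*=\s*["']([^"']+?\.(?:png|jpe?g|gif|webp))["']''',
    re.IGNORECASE,
)

def find_image_urls(page_html):
    """Return image URLs found in attribute values, in order, without duplicates."""
    found = []
    for match in IMAGE_URL_RE.finditer(page_html):
        url = match.group(1)
        if url not in found:
            found.append(url)
    return found

sample_html = '''
<img src="https://example.com/a.png">
<a href='https://example.com/b.JPG'>photo</a>
<img src="https://example.com/a.png">
<script src="https://example.com/app.js"></script>
'''
print(find_image_urls(sample_html))
```

A regular expression like this is quick to write but only sees what is literally in the HTML, which is the same limitation the tool has for resources loaded by a second request.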
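To make the magnet-link case concrete, here is a minimal sketch of picking `magnet:?` and `ed2k://` links out of a page with a regular expression. The function name `find_download_links` and the sample links are assumptions for illustration; the tool itself finds them through the same quote-splitting logic it uses for other resources.

```python
import html
import re

# A link starts with "magnet:?" or "ed2k://" and runs until a quote,
# whitespace, or angle bracket.
DOWNLOAD_LINK_RE = re.compile(r'''(?:magnet:\?|ed2k://)[^\s"'<>]+''')

def find_download_links(page):
    """Return magnet and ed2k links in the order they appear in the page."""
    # HTML attributes usually escape '&' as '&amp;', so unescape first.
    return DOWNLOAD_LINK_RE.findall(html.unescape(page))

sample_page = (
    '<a href="magnet:?xt=urn:btih:0123456789abcdef0123456789abcdef01234567&amp;dn=demo">m</a> '
    "<a href='ed2k://|file|demo.mp4|1024|0123456789ABCDEF0123456789ABCDEF|/'>e</a>"
)
for link in find_download_links(sample_page):
    print(link)
```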
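Code
Crawling resource files
One thing to note: some image resources are not URL links but are embedded in data:image format, and these need to be converted before they can be stored.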
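The conversion mentioned here can be sketched as follows: split the data URI at `base64,`, read the image type from the header, decode the payload, and write the bytes to a file. The helper name `decode_data_image`, the sample URI, and the output path are assumptions for illustration only.

```python
import base64
import os
import tempfile

def decode_data_image(data_uri):
    """Split a data:image/<type>;base64,<payload> URI into (extension, bytes)."""
    header, _, payload = data_uri.partition('base64,')
    # header looks like "data:image/png;", so the type follows the slash.
    extension = header.split('/')[-1].rstrip(';')
    # Restore any stripped '=' padding; -len % 4 yields 0 to 3 characters.
    payload += '=' * (-len(payload) % 4)
    return extension, base64.b64decode(payload)

# Build a sample URI from the 8-byte PNG signature, with padding removed
# to mimic the unpadded payloads that sometimes appear in pages.
png_signature = b'\x89PNG\r\n\x1a\n'
sample_uri = ('data:image/png;base64,'
              + base64.b64encode(png_signature).decode('ascii').rstrip('='))

extension, raw_bytes = decode_data_image(sample_uri)
out_path = os.path.join(tempfile.gettempdir(), 'inline_image.' + extension)
with open(out_path, 'wb') as image_file:
    image_file.write(raw_bytes)
print(out_path)
```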
import base64

def getResourceUrlList(url, isImage, isAudio, isVideo):
    # imgType_list, audioType_list, videoType_list, requestsDataBase,
    # repairUrl and zfjTools are defined elsewhere in the tool.
    imageUrlList = []
    audioUrlList = []
    videoUrlList = []
    url = url.rstrip().rstrip('/')
    htmlStr = str(requestsDataBase(url))
    # Keep a copy of the fetched HTML on disk for inspection.
    with open('reptileHtml.txt', 'w', encoding='utf-8') as htmlFile:
        htmlFile.write(htmlStr)
    segmenterStr = '"'
    for line in htmlStr.splitlines():
        # Normalise single quotes so every attribute value is delimited by '"'.
        line = line.replace("'", segmenterStr)
        for partLine in line.split(segmenterStr):
            if isImage:
                # Inline images: data:image/<type>;base64,<payload>
                if 'data:image' in partLine and 'base64,' in partLine:
                    base64List = partLine.split('base64,')
                    payload = base64List[-1]
                    # Restore missing '=' padding (0 to 3 characters) before decoding.
                    imgData = base64.urlsafe_b64decode(payload + '=' * (-len(payload) % 4))
                    base64ImgType = base64List[0].split('/')[-1].rstrip(';')
                    imageName = zfjTools.getTimestamp() + '.' + base64ImgType
                    # imgData holds the decoded bytes; saving them under
                    # imageName is not shown in this excerpt.
                    imageUrlList.append(imageName + '$==$' + base64ImgType)
                # Images referenced by URL
                for imageType in imgType_list:
                    if imageType in partLine:
                        imgUrl = partLine[:partLine.find(imageType) + len(imageType)]
                        # Repair relative or incomplete URLs
                        imgUrl = repairUrl(imgUrl, url)
                        # Drop the '_{size}' placeholder some sites put in image URLs.
                        imgUrl = imgUrl.replace('_{size}', '').strip()
                        # Parenthesised scheme check: the original 'a or b and c'
                        # skipped the duplicate test for http:// URLs.
                        if imgUrl.startswith(('http://', 'https://')) and imgUrl not in imageUrlList:
                            imageUrlList.append(imgUrl)
            if isAudio:
                # Audio files
                for audioType in audioType_list:
                    if audioType in partLine or audioType.lower() in partLine:
                        matchedType = audioType.lower() if audioType.lower() in partLine else audioType
                        audioUrl = partLine[:partLine.find(matchedType) + len(matchedType)]
                        # Repair relative or incomplete URLs
                        audioUrl = repairUrl(audioUrl, url)
                        if audioUrl.startswith(('http://', 'https://')) and audioUrl not in audioUrlList:
                            audioUrlList.append(audioUrl)
            if isVideo:
                # Video files, including magnet, ed2k and ftp links
                for videoType in videoType_list:
                    if videoType in partLine or videoType.lower() in partLine:
                        matchedType = videoType.lower() if videoType.lower() in partLine else videoType
                        videoUrl = partLine[:partLine.find(matchedType) + len(matchedType)]
                        # Repair relative or incomplete URLs
                        videoUrl = repairUrl(videoUrl, url)
                        videoSchemes = ('http://', 'https://', 'ed2k://', 'magnet:?', 'ftp://')
                        if videoUrl.startswith(videoSchemes) and videoUrl not in videoUrlList:
                            videoUrlList.append(videoUrl)
    return (imageUrlList, audioUrlList, videoUrlList)
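The function above calls `repairUrl`, whose body is not shown in this post. Purely as an assumption about what such a helper might do, the sketch below resolves relative and protocol-relative links against the page URL with the standard library's `urljoin`. The name `repair_url` and the example URLs are hypothetical, and the real helper may behave differently.

```python
from urllib.parse import urljoin

def repair_url(resource_url, page_url):
    """Resolve a possibly relative resource link against the page it came from."""
    # urljoin returns absolute URLs unchanged and fills in the scheme,
    # host, and path for relative or protocol-relative ones.
    return urljoin(page_url, resource_url.strip())

page = 'https://example.com/gallery/index.html'
print(repair_url('img/a.png', page))
print(repair_url('/static/b.jpg', page))
print(repair_url('//cdn.example.com/c.gif', page))
print(repair_url('https://other.example/d.mp3', page))
```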
Crawling custom nodes
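from lxml import etree  # etree.HTML is provided by lxml

# Generic node crawling
def getNoteInfors(url, fatherNode, childNode):
    # requestsDataBase is defined elsewhere in the tool.
    url = url.rstrip().rstrip('/')
    htmlStr = requestsDataBase(url)
    # Keep a copy of the fetched HTML on disk for inspection.
    with open('reptileHtml.txt', 'w', encoding='utf-8') as htmlFile:
        htmlFile.write(htmlStr)
    html_etree = etree.HTML(htmlStr)
    dataArray = []
    if html_etree is not None:
        # fatherNode selects the repeating parent nodes; childNode is
        # evaluated relative to each parent.
        for k_value in html_etree.xpath(fatherNode):
            partValue = k_value.xpath(childNode)
            if len(partValue) > 0:
                # Keep only the first match under each parent.
                dataArray.append(partValue[0])
    return dataArray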
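The parent/child pattern used by `getNoteInfors` can be tried without any network access. The sketch below swaps in the standard library's `xml.etree.ElementTree` in place of lxml so that it runs without extra dependencies. That module supports only a subset of XPath and needs well-formed markup, so it is a stand-in for illustration, not a replacement. The function name `get_node_infos` and the sample markup are made up.

```python
import xml.etree.ElementTree as ET

def get_node_infos(markup, father_path, child_path):
    """For each parent node, collect the text of its first matching child."""
    root = ET.fromstring(markup.strip())
    results = []
    for father in root.findall(father_path):
        child = father.find(child_path)
        # Parents without a matching child are skipped, as in getNoteInfors.
        if child is not None and child.text:
            results.append(child.text.strip())
    return results

sample_markup = """
<html><body>
  <div class="item"><h2>First post</h2><p>intro</p></div>
  <div class="item"><p>no title here</p></div>
  <div class="item"><h2>Second post</h2></div>
</body></html>
"""
print(get_node_infos(sample_markup, ".//div[@class='item']", "h2"))
```

With lxml, the equivalent parent and child expressions would typically be full XPath strings such as `//div[@class="item"]` and `./h2/text()`.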
Software
Download: https://gitee.com/zfj1128/ZFJObsLib_dmg
Tutorial videos
Resource crawling: link: https://pan.baidu.com/s/1xa9ruF_hMcN49716BJUx2w password: 1zpg
Node crawling: link: https://pan.baidu.com/s/1ebWWYtjoKkiH9mqakR6EMQ password: cosa
Screenshot of the tool in use: WX20190802-162443@2x.png
Closing remarks
Comments and suggestions are very welcome!