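When working on web page content classification or NLP research, researchers often have to build their own text datasets to train their models. Stallions is deeply optimized for crawling Chinese web pages and takes care of the problems that typically come up while collecting data, which makes it a handy tool.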
Installation:
pip install stallions
Note: Python 3 only.
Usage
from stallions import extract
url = "https://www.163.com/"
article = extract(url=url)
# Extract the title
print("title", article.title)
# Extract the h1
print("h1", article.h1)
# Extract meta_keywords
print("meta_keywords", article.meta_keywords)
# Extract meta_description
print("meta_description", article.meta_description)
# Extract the full content of the page
print(article.content)
title, h1, keywords, and description are the tag contents that most influence web page classification.
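As a rough illustration of how these fields might feed a classifier, the sketch below flattens an article into one labeled record. `build_record` and its `label` argument are hypothetical names, not part of Stallions; the code only assumes an object exposing the five attributes used in the usage example. A `SimpleNamespace` with made-up values stands in for a real `extract()` result so the snippet runs offline.

```python
from types import SimpleNamespace


def build_record(article, label):
    """Collect the classification-relevant fields of an article into a dict.

    Hypothetical helper: `article` is any object with the attributes used in
    the usage example (title, h1, meta_keywords, meta_description, content).
    Missing or empty attributes become empty strings.
    """
    fields = ("title", "h1", "meta_keywords", "meta_description", "content")
    record = {name: (getattr(article, name, None) or "") for name in fields}
    record["label"] = label
    return record


# Stand-in for the object returned by extract(); the values are made up.
fake_article = SimpleNamespace(
    title="网易",
    h1="",
    meta_keywords="新闻,体育",
    meta_description="门户网站",
    content="正文内容",
)
record = build_record(fake_article, label="portal")
print(record["title"], record["label"])
```

With a real crawl, the same function could be called on each `extract(url=...)` result and the records written out as one row per page.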
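In practice, crawling Chinese pages usually runs into two headaches:
1. Chinese web pages do not share a single encoding, so garbled text shows up frequently.
2. <script> tags and comments in a page can contain Chinese text, which is hard to strip out during extraction.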
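The first problem is easy to reproduce without any network access. The sketch below is only an illustration using Python's built-in codecs: it encodes a short string as GBK, then decodes the same bytes once with the wrong codec and once with the right one.

```python
text = "中文网页"

# Bytes as a GBK-encoded page would be sent over the wire.
raw = text.encode("gbk")

# Decoding with the wrong codec does not raise, it silently yields garbage.
wrong = raw.decode("iso-8859-1")

# Decoding with the codec the page was written in restores the text.
right = raw.decode("gbk")

print(repr(wrong))
print(right)
```

Because ISO-8859-1 maps every byte to some character, the wrong decode never fails, which is why the damage only becomes visible when someone reads the output.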
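import re

def clean_content(content):
    r"""Replace every run of non-Chinese characters (\r, \n, spaces, digits, Latin letters, markup) with one space."""
    # Keep only characters in the CJK range U+4E00 to U+9FA5
    return re.sub(r'[^\u4e00-\u9fa5]+', ' ', content)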
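At first, the author brute-forced the Chinese text out of the page with the regular expression above, but this throws away the digits and English words embedded in the text.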
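To make the loss concrete, the sketch below runs the Chinese-only pattern next to a slightly wider one that also keeps ASCII letters and digits. `clean_content_keep_alnum` and `chinese_only` are hypothetical names used for comparison; this is not how Stallions itself handles the problem.

```python
import re


def chinese_only(content):
    # Same pattern as the brute-force version: keep only U+4E00 to U+9FA5.
    return re.sub(r'[^\u4e00-\u9fa5]+', ' ', content)


def clean_content_keep_alnum(content):
    """Collapse every run of characters that is neither Chinese, an ASCII
    letter, nor an ASCII digit into a single space, then trim the ends."""
    return re.sub(r'[^\u4e00-\u9fa5A-Za-z0-9]+', ' ', content).strip()


sample = "Python3 教程 2024"
print(repr(chinese_only(sample)))              # ' 教程 '
print(repr(clean_content_keep_alnum(sample)))  # 'Python3 教程 2024'
```

The wider pattern keeps the digits and English words, but it still cannot tell page text from script source, so it does not address the second problem.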
Stallions solves both problems well.
Adapting to pages with different encodings:
import requests

def get_html(self, url):
    """Fetch url and return its text, decoded with the best available encoding."""
    html = ""
    try:
        req = requests.get(url, headers=self.headers, timeout=self.http_timeout)
        # Encodings declared inside the page itself, e.g. <meta charset=...>
        declared = requests.utils.get_encodings_from_content(req.text)
        if req.encoding == 'ISO-8859-1':
            # requests reports ISO-8859-1 when a text/* Content-Type header
            # carries no charset, so prefer the declared encoding and fall
            # back to the one detected from the bytes
            if declared:
                req.encoding = declared[0]
            elif req.apparent_encoding is not None:
                req.encoding = req.apparent_encoding
        elif declared and declared[0].upper() == "GBK":
            # Trust an in-page GBK declaration over the HTTP header
            req.encoding = declared[0]
        html = req.text
    except Exception as e:
        print(e)
    return html
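The priority order behind this kind of encoding fix can be separated from the HTTP call, which makes it easier to reason about and to test. The following is a minimal sketch of one possible order, not the library's actual logic: `pick_encoding` is a hypothetical helper, and the final fallback to UTF-8 is an assumption made here.

```python
def pick_encoding(header_encoding, declared, apparent):
    """Choose a text encoding from three hints (hypothetical helper).

    header_encoding: charset from the HTTP Content-Type header, or None
    declared: list of charsets found in the HTML itself, e.g. <meta charset>
    apparent: charset guessed from the raw bytes, or None
    """
    if header_encoding is None or header_encoding.upper() == "ISO-8859-1":
        # The header is missing or is only a default value:
        # trust the page's own declaration first, then the byte-level guess.
        if declared:
            return declared[0]
        # Assumption: fall back to UTF-8 when nothing else is known.
        return apparent or "utf-8"
    if declared and declared[0].upper() == "GBK":
        # An in-page GBK declaration overrides the header.
        return declared[0]
    return header_encoding


print(pick_encoding("ISO-8859-1", ["gb2312"], "GB2312"))
print(pick_encoding("utf-8", ["GBK"], None))
```

With `requests`, the three arguments would roughly correspond to `response.encoding`, `requests.utils.get_encodings_from_content(response.text)`, and `response.apparent_encoding`.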
By default, text contained in <script> tags and in comments is stripped from the page.
class EliminateScript:
    """Strip HTML comments and <style>/<script> blocks from raw HTML."""

    @staticmethod
    def delete_all_tag(html_raw):
        # Remove <!-- ... --> comments first, then <style> and <script> blocks
        html_raw = EliminateScript.delete_notes(html_raw)
        html_raw = EliminateScript.delete_tags(html_raw, "style")
        return EliminateScript.delete_tags(html_raw, "script")

    @staticmethod
    def delete_tags(html_raw, tags):
        open_tag = "<{0}".format(tags)
        close_tag = "</{0}>".format(tags)
        html_list = html_raw.split(open_tag)
        # Text before the first opening tag is always kept; it is an empty
        # string when the document starts with the tag
        fresh_html = html_list[0]
        for htm in html_list[1:]:
            # A chunk without a closing tag is an unterminated block: drop it
            if close_tag not in htm:
                continue
            # Keep only the text that follows the closing tag
            fresh_html += htm.split(close_tag)[-1]
        return fresh_html

    @staticmethod
    def delete_notes(html_raw):
        # Same approach as delete_tags, applied to <!-- ... --> comments
        html_list = html_raw.split("<!--")
        fresh_html = html_list[0]
        for htm in html_list[1:]:
            # An unterminated comment is dropped together with what follows it
            if "-->" not in htm:
                continue
            # Keep only the text that follows the end of the comment
            fresh_html += htm.split("-->")[-1]
        return fresh_html
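For comparison, the same kind of cleanup can be approximated with regular expressions instead of string splitting. This is a different technique from the class above, shown only as a sketch: `strip_noise` is a hypothetical name, and regex-based stripping is not robust against malformed or deliberately tricky HTML.

```python
import re

# Non-greedy patterns; DOTALL lets "." span line breaks inside a block.
_COMMENT = re.compile(r"<!--.*?-->", re.DOTALL)
_SCRIPT = re.compile(r"<script\b[^>]*>.*?</script\s*>", re.DOTALL | re.IGNORECASE)
_STYLE = re.compile(r"<style\b[^>]*>.*?</style\s*>", re.DOTALL | re.IGNORECASE)


def strip_noise(html_raw):
    """Remove comments, then <script> and <style> blocks (hypothetical helper)."""
    for pattern in (_COMMENT, _SCRIPT, _STYLE):
        html_raw = pattern.sub("", html_raw)
    return html_raw


page = (
    '<html><!-- 注释 --><script>var a="中文";</script>'
    "<style>p{}</style><p>正文 text</p></html>"
)
print(strip_noise(page))
```

In this sample the Chinese strings inside the comment and the script are removed, while the paragraph text, including its English word, is left intact.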