Scraping Text from Any Chinese Web Page

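For web page classification or NLP research, researchers often have to build their own text datasets to train models. Stallions is deeply optimized for scraping Chinese web pages and handles the problems that tend to come up while collecting data, which makes it a handy tool.
Installation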
pip install stallions

Note: Python 3 only.

Usage

from stallions import extract

url = "https://www.163.com/"
article = extract(url=url)
# Extract the title
print("title", article.title)
# Extract the h1
print("h1", article.h1)
# Extract meta_keywords
print("meta_keywords", article.meta_keywords)
# Extract meta_description
print("meta_description", article.meta_description)
# Extract the content of the whole page
print(article.content)

title, h1, keywords, and description are the tag contents that have the greatest influence on web page classification.
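
As a rough sketch of how these four fields might feed a classifier, the snippet below joins them into a single text sample. `PageFields` and `to_sample` are hypothetical names used only for illustration; they stand in for the values read from the object returned by `extract` and are not part of Stallions.

```python
from dataclasses import dataclass


@dataclass
class PageFields:
    """Hypothetical container for the four fields read from an extracted page."""
    title: str = ""
    h1: str = ""
    meta_keywords: str = ""
    meta_description: str = ""


def to_sample(page: PageFields, sep: str = " ") -> str:
    """Join the non-empty fields into one string for a text classifier."""
    parts = [page.title, page.h1, page.meta_keywords, page.meta_description]
    return sep.join(p.strip() for p in parts if p and p.strip())


# Made-up values; a real page would supply these through article.title and so on
page = PageFields(title="网易", h1="", meta_keywords="新闻,体育", meta_description="门户网站")
print(to_sample(page))
```

Empty fields are skipped, so a page without an h1 still produces a clean sample.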

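In practice, scraping Chinese pages usually runs into two headaches:
1. Chinese web pages do not use a consistent encoding, so the text often comes back garbled.
2. <script> tags and comments in a page can contain Chinese text, which is hard to strip out during extraction.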
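
Both problems can be reproduced with the standard library alone. The sketch below uses made-up sample strings: the first part decodes GBK bytes with the wrong codec, and the second shows Chinese text inside a <script> tag surviving a naive "keep only Chinese" filter.

```python
import re

# Problem 1: bytes from a GBK page decoded with the wrong codec
text = "中文网页"
raw = text.encode("gbk")            # bytes as a GBK-encoded page would send them
wrong = raw.decode("iso-8859-1")    # what a client sees if it assumes Latin-1
right = raw.decode("gbk")           # decoding with the page's real encoding
print(repr(wrong))
print(right)

# Problem 2: Chinese text inside <script> survives a Chinese-only filter
html = '<script>var t="广告";</script><p>正文</p>'
leaked = re.sub(r'[^\u4e00-\u9fa5]+', ' ', html)
print(repr(leaked))
```

The garbled string in the first part is still recoverable because ISO-8859-1 maps every byte to a character, which is why re-decoding with the right codec can work.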

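import re

def clean_content(content):
    """Collapse each run of non-Chinese characters (digits, letters, whitespace) into one space."""
    # Keep only characters in the range U+4E00 to U+9FA5; everything else is replaced
    return re.sub(r'[^\u4e00-\u9fa5]+', ' ', content)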

At first, I brute-forced the extraction of Chinese text with the regular expression above, but this drops any digits and English words embedded in the text.
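
To make the loss concrete, the sketch below runs the same pattern on a made-up sentence and compares it with a widened character class that also keeps ASCII letters and digits. `chinese_only` and `keep_cjk_and_alnum` are illustrative names, and the widened pattern is only one possible workaround, not how Stallions does it.

```python
import re


def chinese_only(content):
    """Same pattern as the brute-force approach: keep only U+4E00 to U+9FA5."""
    return re.sub(r'[^\u4e00-\u9fa5]+', ' ', content)


def keep_cjk_and_alnum(content):
    """Also keep ASCII letters and digits; other runs collapse to one space."""
    return re.sub(r'[^\u4e00-\u9fa5A-Za-z0-9]+', ' ', content).strip()


sample = "2023年iPhone 15发布，售价5999元"
print(repr(chinese_only(sample)))
print(repr(keep_cjk_and_alnum(sample)))
```

The first call loses the year, the product name, and the price; the second keeps them but still cannot tell visible text from script or comment text.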

Stallions handles both problems well.
Adapting to pages with different encodings:

import requests

def get_html(self, url):
    """Fetch a page and return its text, or None if the request fails."""
    html = None
    try:
        req = requests.get(url, headers=self.headers, timeout=self.http_timeout)
        # Encodings declared inside the page itself, e.g. <meta charset="gbk">
        declared = requests.utils.get_encodings_from_content(req.text)
        if req.encoding == 'ISO-8859-1':
            # requests falls back to ISO-8859-1 when the header names no charset
            if declared:
                req.encoding = declared[0]
            elif req.apparent_encoding is not None:
                req.encoding = req.apparent_encoding
        elif declared and declared[0].upper() == "GBK":
            # Trust an in-page GBK declaration over the header (case-insensitive)
            req.encoding = declared[0]
        req.keep_alive = False
        html = req.text
    except Exception as e:
        print(e)
    return html
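
The same idea can be tried offline on raw bytes. The sketch below is a simplified variant under stated assumptions, not the Stallions implementation: `pick_encoding`, `decode_html`, and `META_CHARSET` are hypothetical names, the priority order (header charset unless it is the ISO-8859-1 fallback, then the in-page <meta> charset, then UTF-8) is an assumption, and statistical detection such as `apparent_encoding` is left out.

```python
import re

# Matches charset=... inside a <meta> tag, with or without quotes
META_CHARSET = re.compile(rb'<meta[^>]+charset=["\']?([A-Za-z0-9_\-]+)', re.IGNORECASE)


def pick_encoding(raw, header_encoding=None, default="utf-8"):
    """Pick an encoding for raw HTML bytes using the assumed priority order."""
    # A header charset wins unless it is the ISO-8859-1 fallback value
    if header_encoding and header_encoding.upper() != "ISO-8859-1":
        return header_encoding
    # Otherwise look for a declaration near the top of the document
    match = META_CHARSET.search(raw[:4096])
    if match:
        return match.group(1).decode("ascii")
    return default


def decode_html(raw, header_encoding=None):
    """Decode bytes with the chosen encoding, replacing undecodable bytes."""
    return raw.decode(pick_encoding(raw, header_encoding), errors="replace")


page = '<html><head><meta charset="gbk"><title>新闻</title></head></html>'.encode("gbk")
print(decode_html(page, header_encoding="ISO-8859-1"))
```

Only the first 4096 bytes are searched because charset declarations normally sit in the document head.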

By default, Stallions strips <script> tags and comments, along with the text inside them.

class EliminateScript:
    @staticmethod
    def delete_all_tag(html_raw):
        # Remove <!-- comments --> first, then <style> blocks, then <script> blocks
        html_raw = EliminateScript.delete_notes(html_raw)
        html_raw = EliminateScript.delete_tags(html_raw, "style")
        return EliminateScript.delete_tags(html_raw, "script")

    @staticmethod
    def delete_tags(html_raw, tags):
        # Split on every opening tag; each later piece starts inside a <tag ...> block
        html_list = html_raw.split("<{0}".format(tags))
        fresh_html = ""
        if html_raw.startswith("<{0}".format(tags)):
            for i, htm in enumerate(html_list):
                if "</{0}>".format(tags) not in htm:
                    continue
                fresh_html += htm.split("</{0}>".format(tags))[-1]
        else:
            for i, htm in enumerate(html_list):
                if i == 0:
                    fresh_html += htm
                    continue
                if "</{0}>".format(tags) not in htm:
                    continue
                fresh_html += htm.split("</{0}>".format(tags))[-1]
        return fresh_html

    @staticmethod
    def delete_notes(html_raw):
        # Remove <!-- ... --> comments; a piece with no closing --> is dropped entirely
        html_list = html_raw.split("<!--")
        fresh_html = ""
        if html_raw.startswith("<!--"):
            for i, htm in enumerate(html_list):
                if "-->" not in htm:
                    continue
                fresh_html += htm.split("-->")[-1]
        else:
            for i, htm in enumerate(html_list):
                if i == 0:
                    fresh_html += htm
                    continue
                if "-->" not in htm:
                    continue
                fresh_html += htm.split("-->")[-1]
        return fresh_html
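
For comparison, the same cleanup can be sketched with the standard library's `html.parser` instead of string splitting. This is a different technique from the code above, not part of Stallions; `VisibleTextParser` and `visible_text` are hypothetical names.

```python
from html.parser import HTMLParser


class VisibleTextParser(HTMLParser):
    """Collect text outside <script> and <style>; comments are ignored."""

    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._skip_depth = 0   # greater than 0 while inside a skipped tag
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep non-blank text that is not inside a skipped tag
        if self._skip_depth == 0 and data.strip():
            self._chunks.append(data.strip())

    # handle_comment is not overridden, so comment text is discarded

    def text(self):
        return " ".join(self._chunks)


def visible_text(html):
    """Return the page text with script, style, and comment content removed."""
    parser = VisibleTextParser()
    parser.feed(html)
    parser.close()
    return parser.text()


html = '<!-- 注释 --><p>正文</p><script>var a = "广告";</script><style>p{}</style><p>结尾</p>'
print(visible_text(html))
```

Unlike the splitting approach, this returns plain text rather than cleaned HTML, so it suits cases where only the visible text is needed.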
