
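For web content classification or NLP research, researchers often have to build their own text datasets to train models. Stallions is deeply optimized for scraping Chinese web pages and handles the problems that tend to come up while collecting data, which makes it a handy tool.
Installation: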
pip install stallions

Note: Python 3 only.

Usage

from stallions import extract

url = "https://www.163.com/"
article = extract(url=url)
# Extract the title
print("title", article.title)
# Extract the h1
print("h1", article.h1)
# Extract meta_keywords
print("meta_keywords", article.meta_keywords)
# Extract meta_description
print("meta_description", article.meta_description)
# Extract the full page content
print(article.content)

title, h1, keywords, and description are the tag contents that most influence web page classification.
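As a rough sketch of how these fields might feed a classifier, the snippet below joins them into one labelled text sample. The helper `to_sample` and the stand-in object are hypothetical and not part of Stallions; the only assumption about the real `article` object is that it exposes the four attributes used above.

```python
from types import SimpleNamespace

FIELDS = ("title", "h1", "meta_keywords", "meta_description")

def to_sample(article, label):
    """Join the classification-relevant fields into one labelled text sample."""
    # Missing or empty fields are skipped rather than joined as "None".
    parts = [str(getattr(article, name, "") or "").strip() for name in FIELDS]
    text = " ".join(p for p in parts if p)
    return {"text": text, "label": label}

# Stand-in for the object returned by extract(); the values are made up.
fake_article = SimpleNamespace(
    title="网易", h1=None, meta_keywords="新闻,体育", meta_description="门户网站"
)
print(to_sample(fake_article, "portal"))
```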

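In practice, scraping Chinese pages usually runs into two headaches:
1. Chinese web pages do not share a single encoding, so the extracted text often comes out garbled.
2. <script> tags and comments in the page can contain Chinese text, which is hard to strip out during extraction.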
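The first problem is easy to reproduce without any network access. The sketch below encodes a short string as GBK, as a Chinese page might, and decodes it with the wrong charset; the sample string is made up for illustration.

```python
text = "中文网页"
raw = text.encode("gbk")           # bytes as a GBK-encoded page would send them

# Decoding with the wrong charset does not raise an error, it just garbles the text.
wrong = raw.decode("iso-8859-1")
right = raw.decode("gbk")

print(repr(wrong))
print(right)
```

Because ISO-8859-1 maps every byte to some character, the failure is silent, which is why the garbling tends to be noticed only after the data has been collected.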

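import re

def clean_content(content):
    r"""Collapse everything except Chinese characters (including \r, \n and spaces) into single spaces."""
    # Replace each run of non-Chinese characters with one space
    return re.sub(r'[^\u4e00-\u9fa5]+', ' ', content)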

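At first, I brute-forced the extraction of Chinese text with the regular expression above, but this throws away the digits and English words embedded in the text.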
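One way to soften that loss while staying with a regular expression is to widen the character class. This is only an illustrative variant, not what Stallions does, and the function name `clean_content_keep_alnum` is hypothetical.

```python
import re

def clean_content_keep_alnum(content):
    """Keep Chinese characters, ASCII letters and digits; collapse the rest to one space."""
    return re.sub(r'[^\u4e00-\u9fa5A-Za-z0-9]+', ' ', content).strip()

print(clean_content_keep_alnum("Python3\r\n 发布于 2008年！"))
```

Punctuation is still dropped, and any text inside scripts or comments still leaks through, so this does not address the second problem.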

Stallions handles both problems well.
Adapting to pages with different encodings:

import requests

def get_html(self, url):
    # Fetch the page and return its text, or None if the request fails
    html = None
    try:
        req = requests.get(url, headers=self.headers, timeout=self.http_timeout)
        # Encodings declared inside the HTML itself (e.g. a <meta> charset)
        declared = requests.utils.get_encodings_from_content(req.text)
        if req.encoding == 'ISO-8859-1':
            # requests falls back to ISO-8859-1 when a text Content-Type names no charset
            if declared:
                req.encoding = declared[0]
            else:
                req.encoding = req.apparent_encoding
        elif declared and declared[0].upper() == "GBK":
            req.encoding = declared[0]
        html = req.text
    except Exception as e:
        print(e)
    return html
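The same decision can be tried offline on raw bytes. The sketch below is a simplified variant rather than the Stallions implementation: `choose_encoding` is a hypothetical helper that trusts the header charset unless it is the ISO-8859-1 default, then falls back to the charset declared in a `<meta>` tag, then to an assumed default of UTF-8.

```python
import re

def choose_encoding(header_encoding, body, fallback="utf-8"):
    """Pick a charset from the header, else from a <meta> tag, else the fallback."""
    # Look for charset=... inside a <meta> tag of the undecoded page.
    match = re.search(rb'<meta[^>]+charset=["\']?([\w-]+)', body, flags=re.I)
    declared = match.group(1).decode("ascii") if match else None
    if header_encoding and header_encoding.upper() != "ISO-8859-1":
        return header_encoding
    return declared or fallback

# A made-up GBK page whose header charset was missing.
page = '<html><head><meta charset="gbk"></head><body>中文</body></html>'.encode("gbk")
encoding = choose_encoding("ISO-8859-1", page)
print(encoding)
```

Working on bytes avoids decoding the page once with a guessed charset just to look for the declaration.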

By default, it strips the text contained in <script> tags and comments.

class EliminateScript:
    @staticmethod
    def delete_all_tag(html_raw):
        # Strip HTML comments first, then <style> and <script> blocks
        html_raw = EliminateScript.delete_notes(html_raw)
        html_raw = EliminateScript.delete_tags(html_raw, "style")
        return EliminateScript.delete_tags(html_raw, "script")

    @staticmethod
    def delete_tags(html_raw, tags):
        html_list = html_raw.split("<{0}".format(tags))
        fresh_html = ""
        if html_raw.startswith("<{0}".format(tags)):
            for i, htm in enumerate(html_list):
                if "</{0}>".format(tags) not in htm:
                    continue
                fresh_html += htm.split("</{0}>".format(tags))[-1]
        else:
            for i, htm in enumerate(html_list):
                if i == 0:
                    fresh_html += htm
                    continue
                if "</{0}>".format(tags) not in htm:
                    continue
                fresh_html += htm.split("</{0}>".format(tags))[-1]
        return fresh_html

    @staticmethod
    def delete_notes(html_raw):
        # Remove HTML comments (<!-- ... -->)
        html_list = html_raw.split("<!--")
        fresh_html = ""
        if html_raw.startswith("<!--"):
            for i, htm in enumerate(html_list):
                if "-->" not in htm:
                    continue
                fresh_html += htm.split("-->")[-1]
        else:
            for i, htm in enumerate(html_list):
                if i == 0:
                    fresh_html += htm
                    continue
                if "-->" not in htm:
                    continue
                fresh_html += htm.split("-->")[-1]
        return fresh_html
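For comparison, the same idea can be sketched with regular expressions instead of string splitting. This is an alternative technique, not the Stallions code, and `strip_hidden_text` is a hypothetical name; the sample HTML is made up.

```python
import re

COMMENT = re.compile(r'<!--.*?-->', re.S)
BLOCK = re.compile(r'<(script|style)\b[^>]*>.*?</\1\s*>', re.S | re.I)

def strip_hidden_text(html):
    """Remove comments, then <script> and <style> blocks, from an HTML string."""
    return BLOCK.sub('', COMMENT.sub('', html))

html = '<p>正文</p><!-- 注释 --><script>var s = "脚本";</script><style>p{}</style><p>结尾</p>'
print(strip_hidden_text(html))
```

Both approaches treat HTML as plain text, so unusual markup (for example a comment terminator inside a script string) can still confuse them; a real HTML parser is the more robust choice when that matters.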
