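1. Requirements:

This program grew out of a real need of mine: every day I have to follow dozens of foreign websites, find useful information on them, and compile it into articles.

So I wondered whether I could write a program that visits those sites every morning, checks for new articles, crawls them, machine-translates them, and sends the result to my mailbox.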

2. Design approach:

[Figure: design approach diagram]

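3. Module-by-module code walkthrough

3.1 Initializing the target sites and crawl rules

Because the target sites and rules will need to be revised and refined later, they are kept in a separate siteconfig.json file that the program loads on every run.

Source code that generates siteconfig.json: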

import json

# This script generates the siteconfig.json file.

config = [{
    'sitename':'The White House',
    'starturl':'https://www.whitehouse.gov/issues/economy-jobs/',
    'listXpath':r'//article//h2/a/@href',
    'OriginalUrl':"",
    'titleXpath':r'//*[@id="main-content"]/div/div/h1/text()',
    'timeXpath':r'//*[@id="main-content"]/div[1]/div/div/p/time/text()',
    'contentXpath':r'//*[@id="main-content"]/div[2]/div/div/p/text()',
    'authorXpath':r'//*[@id="main-content"]/div[1]/div/p/text()'
},
    {
        'sitename': 'U.S. Department of the Treasury',
        'starturl': 'https://home.treasury.gov/news/press-releases/',
        'listXpath': r'//*[@id="block-hamilton-content"]//h2/a/@href',
        'OriginalUrl':"https://home.treasury.gov",
        'titleXpath': r'//*[@id="block-hamilton-page-title"]/h1/span/text()',
        'timeXpath': r'//*[@id="block-hamilton-content"]/article/div//time/text()',
        'contentXpath': r'//*[@id="block-hamilton-content"]/article/div//p/text()',
        'authorXpath': r''
    },
    {
        'sitename': 'Congressional Budget Office',
        'starturl': 'https://www.cbo.gov/most-recent',
        'listXpath': r'//*[@id="content"]/div//span/a/@href',
        'OriginalUrl': "https://www.cbo.gov",
        'titleXpath': r'//*[@id="page-title"]/text()',
        'timeXpath': r'//*[@class="date-display-single"]/text()',
        'contentXpath': r'//*[@id="content-panel"]//p/text()',
        'authorXpath': r''
    }
]
with open("siteconfig.json", "w") as text:
    json.dump(config, text)

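The keys of each dictionary in config are explained below:

    {
        'sitename':      # site name
        'starturl':      # URL of the page to start from
        'listXpath':     # XPath expression for the article list
        'OriginalUrl':   # base URL, prepended to relative article links
        'titleXpath':    # XPath expression for the article title
        'timeXpath':     # XPath expression for the article time
        'contentXpath':  # XPath expression for the article body
        'authorXpath':   # XPath expression for the article author
    },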
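Since every later step reads these keys, it can help to check the file when it is loaded. The following is only a sketch under that assumption: `load_sites` and `REQUIRED_KEYS` are hypothetical names that do not appear in the original program.

```python
import json

# Keys every site entry is expected to carry, per the explanation above.
REQUIRED_KEYS = {
    "sitename", "starturl", "listXpath", "OriginalUrl",
    "titleXpath", "timeXpath", "contentXpath", "authorXpath",
}


def load_sites(path="siteconfig.json"):
    """Load the site list and fail early if an entry lacks a key."""
    with open(path, encoding="utf-8") as f:
        sites = json.load(f)
    for i, site in enumerate(sites):
        missing = REQUIRED_KEYS - site.keys()
        if missing:
            raise ValueError(f"entry {i} is missing keys: {sorted(missing)}")
    return sites
```

Failing at load time points at the broken entry directly, instead of surfacing later as a KeyError in the middle of a crawl.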

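3.2 Crawling page content

A seen.json file stores the URLs of pages that have already been crawled, so they are not fetched again.
The argument passed in is the list of dictionaries loaded from siteconfig.json.
The IOarticle function generates the Markdown document.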

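import json
import datetime

import requests
from lxml import etree

import GoogleTransla  # translation module, see section 3.3
import SendEmail      # mail module, see section 3.4

def getHTMLText(url):
    """Return the page source, or an empty string if the request fails."""
    try:
        r = requests.get(url, timeout = 30)
        r.raise_for_status()
        #r.encoding = 'utf-8'
        return r.text
    except requests.RequestException:
        return ""

def getLinkList(dic):
    """Return the article links found on the site's start page."""
    url = dic['starturl']
    html = getHTMLText(url)
    if not html:
        # The download failed, so there is nothing to parse.
        return []
    selector = etree.HTML(html)
    links = selector.xpath(dic['listXpath'])
    return links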

def getContent(dic,articles):
    """Crawl every unseen article of one site and append it to articles."""
    links = getLinkList(dic)
    # Load the URLs crawled so far; start from an empty list on the first run.
    try:
        with open("seen.json") as seen:
            seenLink = json.load(seen)
    except FileNotFoundError:
        seenLink = []
    for link in links:
        link = dic['OriginalUrl'] + link
        if link in seenLink:
            print("This page has already been crawled")
            continue
        print(link)
        html = getHTMLText(link)
        if not html:
            # The download failed; leave the link unseen so the next run retries it.
            continue
        seenLink.append(link)
        selector = etree.HTML(html)
        title = selector.xpath(dic['titleXpath'])
        time = selector.xpath(dic['timeXpath'])
        author = dic['sitename']
        paras = selector.xpath(dic['contentXpath'])
        print(title, time, author, paras)
        # Store the crawled article as a dictionary.
        # str() of a list renders newlines as a literal backslash-n, removed below.
        article = {
            'Title' : str(title).replace("'"," ").replace('"'," "),
            'Link' : str(link),
            'Time' : str(time).replace("'"," ").replace('"'," "),
            'Paragraph' : str(paras).replace("'"," ").replace('"'," ").replace("\\n"," "),
            'Author' : str(author).replace("'"," ").replace('"'," ")
        }
        articles.append(article)

    with open("seen.json",'w') as seen:
        json.dump(seenLink,seen)
    return articles

def IOarticle(articles):
    """Write all articles and their translations to a dated Markdown file."""
    nowTime = datetime.datetime.now().strftime('%Y%m%d')
    filename = nowTime + ".md"
    with open(filename, "w", encoding="utf-8") as fo:
        fo.write("[TOC]\n")
        for article in articles:
            fo.write("# " + article['Title'] + "\n")
            fo.write("**Reference translation:** " + GoogleTransla.translateGoogle(article['Title']) + "\n")
            fo.write("**Source:** " + article['Link'] + "\n")
            fo.write(article['Time'].strip() + "\n")
            fo.write(GoogleTransla.translateGoogle(article['Time'].strip()) + "\n")
            fo.write("**Body:** " + article['Paragraph'] + "\n")
            try:
                fo.write("**Reference translation:** " + GoogleTransla.translateGoogle(article['Paragraph']) + "\n")
            except Exception:
                # Over-long requests fail, so fall back to the first 10000 characters.
                fo.write("The text is too long; only the first 10000 characters are translated for now\n")
                fo.write("**Reference translation:** " + GoogleTransla.translateGoogle(article['Paragraph'][:10000]) + "\n")
            fo.write("\n **Source site:** " + article['Author'] + "\n")
            fo.write("\n\n")
    # Keep the raw articles as JSON as well.
    with open("text.json","w") as text:
        json.dump(articles,text)
    return filename

def main():
    articles = []
    with open("siteconfig.json") as site:
        webs = json.load(site)
    for web in webs:
        articles = getContent(web,articles)
    filename = IOarticle(articles)
    # The report file name doubles as the mail subject.
    SendEmail.send_mail(filename,filename)
    # getWHList(keyWord="china")

if __name__ == "__main__":
    main()
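The crawler above builds each article URL by prepending OriginalUrl, which is why the entry for sites that already return absolute links leaves it empty. As an alternative sketch (not the original author's method), `urllib.parse.urljoin` resolves both relative and absolute links against the start URL, and a set makes the "already seen" check cheap. `new_links` is a hypothetical helper and the `/publication/...` paths are made-up examples.

```python
from urllib.parse import urljoin


def new_links(start_url, hrefs, seen):
    """Resolve hrefs against start_url and return those not yet in seen."""
    fresh = []
    for href in hrefs:
        absolute = urljoin(start_url, href)  # absolute hrefs pass through unchanged
        if absolute not in seen:
            seen.add(absolute)
            fresh.append(absolute)
    return fresh


seen = set()
links = new_links(
    "https://www.cbo.gov/most-recent",
    ["/publication/1", "https://www.cbo.gov/publication/2", "/publication/1"],
    seen,
)
print(links)
```

With this approach the OriginalUrl key would no longer be needed, at the cost of changing the config format.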

3.3 Calling Google Translate

import re
import urllib.parse, urllib.request

url_google = 'http://translate.google.cn'
# The translated text is embedded in the returned page as TRANSLATED_TEXT='...';
reg_text = re.compile(r'(?<=TRANSLATED_TEXT=).*?;')
user_agent = r'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) ' \
                  r'Chrome/44.0.2403.157 Safari/537.36'

def translateGoogle(text, f='', t='zh-cn'):
    """Translate text from language f to language t via the Google Translate page."""
    # Drop the list brackets and quotes left over from str(list).
    text = text.strip('[]').replace("'"," ").replace('"'," ")
    values = {'hl': 'zh-cn', 'ie': 'utf-8', 'text': text, 'langpair': '%s|%s' % (f, t)}
    value = urllib.parse.urlencode(values)
    req = urllib.request.Request(url_google + '?' + value)
    req.add_header('User-Agent', user_agent)
    response = urllib.request.urlopen(req)
    content = response.read().decode('utf-8')
    data = reg_text.search(content)
    if data is None:
        raise ValueError("no TRANSLATED_TEXT found in the response")
    result = data.group(0).strip(';').strip('\'')
    print(result)
    return result

3.4 Mail-sending module

from email.mime.text import MIMEText
from email.mime.image import MIMEImage
from email.mime.base import MIMEBase
from email.mime.multipart import MIMEMultipart
from email import encoders
import smtplib
import time


def send_mail(subject,filename):
    email_host = 'smtp.163.com'  # SMTP server address
    sender = '     '  # sender address
    password = '   '  # password; if the provider issues an authorization code, enter that instead
    receiver = '   '  # recipient address

    msg = MIMEMultipart()
    msg['Subject'] = subject  # subject line
    msg['From'] = ''  # sender display name
    msg['To'] = ''  # recipient display name

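    # Body image: an image can only be embedded through an HTML body, so lines 25 and 26 have to be commented out
    mail_msg = '''
<p>\n\t This email was sent automatically by a computer!</p>
<p>\n\t There is no need to reply.</p>
<p><a href="http://www.itdecent.cn/u/ee5f3fe1b932">Jianshu</a></p>
<p>To add recipients or sites to crawl, contact QQ:39489421</p>
<p><img src="cid:image1"></p>
'''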
    msg.attach(MIMEText(mail_msg, 'html', 'utf-8'))
    # Use an image from the current directory
    # fp = open(r'111.png', 'rb')
    # msgImage = MIMEImage(fp.read())
    # fp.close()
    # # Define the image ID that the HTML body refers to
    # msgImage.add_header('Content-ID', '<image1>')
    # msg.attach(msgImage)

    ctype = 'application/octet-stream'
    maintype, subtype = ctype.split('/', 1)
    # Attachment: image
    # image = MIMEImage(open(r'111.jpg', 'rb').read(), _subtype=subtype)
    # image.add_header('Content-Disposition', 'attachment', filename='img.jpg')
    # msg.attach(image)
    # Attachment: file
    file = MIMEBase(maintype, subtype)
    with open(filename, 'rb') as f:
        file.set_payload(f.read())
    # filename already ends in .md, so it is used as the attachment name unchanged
    file.add_header('Content-Disposition', 'attachment', filename=filename)
    encoders.encode_base64(file)
    msg.attach(file)

    # Send
    smtp = smtplib.SMTP()
    smtp.connect(email_host, 25)
    smtp.login(sender, password)
    smtp.sendmail(sender, receiver, msg.as_string())
    smtp.quit()
    print('success')

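4. Retrospective: problems encountered while writing the program:

4.1 Choosing a data-exchange format between programs

JSON files are a convenient format for exchanging data between programs, and they map closely onto Python's lists of dictionaries.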
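A small round trip shows how directly a list of dictionaries maps to JSON and back. The record below is made-up sample data shaped like one crawled article, not output from the program.

```python
import json

# Sample data shaped like one crawled article; the values are made up.
articles = [{"Title": "Sample title", "Link": "https://example.com/a", "Author": "白宫"}]

# ensure_ascii=False keeps non-ASCII text readable instead of \uXXXX escapes.
text = json.dumps(articles, ensure_ascii=False, indent=2)
restored = json.loads(text)
```

When writing such text to a file, opening it with encoding="utf-8" keeps the non-ASCII characters intact across platforms.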

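4.2 Choosing a crawling approach

There are many ready-made frameworks and techniques for crawling, such as BeautifulSoup, Scrapy, and regular expressions. XPath is a good tool that balances efficiency against difficulty. Chrome has an XPath Helper extension that makes it easy to debug expressions, and the developer tools (press F12 in Chrome) offer a Copy XPath command, which is very convenient. Related documentation is easy to find online.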

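4.3 Using a translation engine

The better options at present are Google Translate, Baidu Translate, and Youdao Translate, and each can be called from a program. The three engines give different results on different articles and each has its strengths. I chose Google here, but it is not necessarily the best.
Google Translate reports an error if the parameter passed to it is too long, that is, if the article is too long, so I set a cap and translate only the first 10000 characters.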
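Instead of discarding everything past the cap, a long article could be split into pieces that each fit under it and translated piece by piece. This is only a sketch of that idea: `split_text` is a hypothetical helper, and the 10000-character limit is simply the cap mentioned above, not a documented limit of the service.

```python
def split_text(text, limit=10000):
    """Split text into chunks of at most `limit` characters,
    cutting after a sentence-ending period where possible."""
    chunks = []
    while len(text) > limit:
        cut = text.rfind(". ", 0, limit)
        # Keep the period in the chunk; fall back to a hard cut if none is found.
        cut = cut + 1 if cut != -1 else limit
        piece = text[:cut].strip()
        if piece:
            chunks.append(piece)
        text = text[cut:]
    if text.strip():
        chunks.append(text.strip())
    return chunks


parts = split_text("One. Two. Three.", limit=10)
print(parts)
```

Each chunk could then be passed to translateGoogle in turn and the results joined, at the cost of one request per chunk.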

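4.4 Pitfalls of sending mail

The automatic mail-sending part took quite a lot of time. The main catch is that mail providers now require an authorization code, not the account password itself, when logging in over SMTP. This is important to remember.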
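Because the authorization code is a secret, one option is to keep it out of the source file altogether. The sketch below reads it from an environment variable and logs in over an SSL connection. `MAIL_AUTH_CODE`, `build_message`, and `send` are hypothetical names, and port 465 is an assumption about the provider's SSL port that should be checked against its documentation.

```python
import os
import smtplib
from email.message import EmailMessage


def build_message(sender, receiver, subject, body):
    """Assemble a plain-text message."""
    msg = EmailMessage()
    msg["Subject"] = subject
    msg["From"] = sender
    msg["To"] = receiver
    msg.set_content(body)
    return msg


def send(msg, host="smtp.163.com", port=465):
    """Log in with the authorization code and send msg."""
    # MAIL_AUTH_CODE is a hypothetical variable name; set it in the shell first.
    auth_code = os.environ["MAIL_AUTH_CODE"]
    with smtplib.SMTP_SSL(host, port) as smtp:
        smtp.login(msg["From"], auth_code)
        smtp.send_message(msg)
```

Calling `send(build_message(sender, receiver, "Daily report", "See attachment"))` would then work without the code ever appearing in the script.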

4.5 Generating the final report with Markdown

Markdown is a format I am fond of, and a report generated in it can then be converted into various other formats.

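5. Directions for improvement and outlook

The quality of the Google translation is really quite mediocre, but it is useful as a reference.
A keyword filter could be added so that only content containing the keywords is sent.
Studying the target sites matters a great deal: the sites look broadly alike, but each information source is structured differently. In short, the real work lies outside the programming: for a program to be useful, a lot of groundwork has to be done.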
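The keyword idea could be sketched as a filter applied to the article list before the report is written. `filter_by_keywords` is a hypothetical function; it assumes the 'Title' and 'Paragraph' keys used by the article dictionaries in section 3.2.

```python
def filter_by_keywords(articles, keywords):
    """Keep articles whose title or body mentions any keyword (case-insensitive)."""
    lowered = [k.lower() for k in keywords]
    kept = []
    for article in articles:
        haystack = (article["Title"] + " " + article["Paragraph"]).lower()
        if any(k in haystack for k in lowered):
            kept.append(article)
    return kept


# Made-up sample articles for illustration.
sample = [
    {"Title": "Trade talks with China", "Paragraph": "Details of the meeting."},
    {"Title": "Budget outlook", "Paragraph": "Projections for the next decade."},
]
matched = filter_by_keywords(sample, ["china"])
```

Plain substring matching is crude (a keyword also matches inside longer words), so a real filter might need word-boundary matching.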
