Python -----> Downloading and Organizing PubMed Abstracts, PART 1

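Goal: download the abstracts of the papers returned by a search query. For now this targets only the NCBI PubMed database, and the download format is XML. The required fields are extracted from the XML and arranged into a standard layout. The impact factor of each publishing journal also needs to be added, and the frequency of gene names in the abstracts counted, in preparation for later work.
Scraping the NCBI website is not recommended: it easily gets you blocked, and NCBI offers an open API that allows free downloads. I have only worked out the download method for PubMed; for other databases you will need to read the documentation yourself. Documentation link: https://www.ncbi.nlm.nih.gov/books/NBK25497/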

Approach:

(1) Get the PubMed ids of the papers
The interface PubMed provides: https://eutils.ncbi.nlm.nih.gov/entrez/eutils/
The requests library (import requests) makes the download easy. A note on how the URL is built: url = 'https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?db=' + db + '&term=' + query + '&usehistory=y', where db is the database and query is the search term.
For example:
query = '(((lung+cancer)+AND+gene[Title/Abstract])+AND+("2010"[Date - Publication] : "2020"[Date - Publication]))'
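
The query above is escaped by hand with '+'. As a minimal sketch, assuming you would rather start from a plain-text query, the standard library's urllib.parse.urlencode can do the escaping. The helper name build_esearch_url and the shortened query are illustrative only; db, term and usehistory are the parameters from the URL rule above.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

EUTILS_BASE = 'https://eutils.ncbi.nlm.nih.gov/entrez/eutils/'


def build_esearch_url(query, db='pubmed'):
    """Hypothetical helper: build an esearch URL from a plain-text query."""
    params = {'db': db, 'term': query, 'usehistory': 'y'}
    # urlencode escapes spaces, brackets and slashes in the query
    return EUTILS_BASE + 'esearch.fcgi?' + urlencode(params)


url = build_esearch_url('lung cancer AND gene[Title/Abstract]')
print(url)

# Round trip: decoding the URL gives back the original query text
decoded = parse_qs(urlsplit(url).query)['term'][0]
print(decoded)
```

Passing the same dictionary as requests.get(base, params=...) should have a similar effect, since requests encodes the parameters itself.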

import os
import random
import re
import time

import requests


def get_udi(url):  # get the uids of the PubMed records
    # url: the search URL built for esearch
    docsums = requests.get(url).text
    # print(docsums)
    ids = re.findall(r'<Id>(\d+)</Id>', docsums)
    if len(ids) > 0:
        print('search counts: ', len(ids))
    else:
        print('search result is None!!!')
    return ids  # an empty list when nothing matched
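
The regular expression works, but the same fields can also be read with an XML parser. A minimal sketch using the standard library's xml.etree.ElementTree follows. The SAMPLE text is a made-up response in the general shape esearch returns (its ids and WebEnv value are placeholders), and parse_esearch is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# Made-up response in the shape esearch returns; ids and WebEnv are placeholders.
SAMPLE = """<eSearchResult>
  <Count>3</Count><RetMax>3</RetMax><RetStart>0</RetStart>
  <QueryKey>1</QueryKey><WebEnv>EXAMPLE_WEBENV</WebEnv>
  <IdList><Id>111</Id><Id>222</Id><Id>333</Id></IdList>
</eSearchResult>"""


def parse_esearch(xml_text):
    """Hypothetical helper: pull the count, history keys and ids out of an esearch reply."""
    root = ET.fromstring(xml_text)
    return {
        'count': int(root.findtext('Count')),      # top-level Count only
        'query_key': root.findtext('QueryKey'),
        'webenv': root.findtext('WebEnv'),
        'ids': [e.text for e in root.findall('./IdList/Id')],
    }


result = parse_esearch(SAMPLE)
print(result['count'], result['ids'])
```

Reading only the direct children of the root avoids picking up other elements with the same tag name deeper in the response.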

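(2) Download the XML abstracts
Because a single request can fetch roughly 250 ids (the batch size is set to 200 here), a for loop is needed. After each iteration, call time.sleep(t). If the pause is the same every time, the traffic can be recognized as a crawler and the server will drop the connection, so it is best to randomize the argument to time.sleep(t) to imitate a person downloading by hand.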

def get_xml(uidlist, base_url, basepath):   # download the XML for the given uids
    batch = 200  # number of records per request
    for n, start in enumerate(range(0, len(uidlist), batch)):    # loop over batches of 200 records
        # slice from index 0 so that no id is skipped at a batch boundary
        cur_ids = ','.join(uidlist[start:start + batch])

        cur_url = base_url + '&id=' + cur_ids + '&rettype=abstract&retmode=xml'
        cur_data = requests.get(cur_url).text

        # the with block closes the file once the batch is written
        with open(os.path.join(basepath, str(n) + '_pubItem.xml'), 'w', encoding='utf-8') as out_file:
            out_file.write(cur_data)
        t = random.uniform(10, 20)   # random sleep time
        time.sleep(t)
        print(n)
    print('complete')
    print(time.ctime())
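
The batching arithmetic is easy to get wrong by one, so it helps to check it offline before making any requests. The sketch below is only an illustration: batches and polite_delay are hypothetical helper names, and the uids are placeholders rather than real PubMed ids.

```python
import random


def batches(items, size=200):
    """Yield consecutive slices of at most `size` items, starting at index 0."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def polite_delay(low=10.0, high=20.0):
    """Pick a random pause length in seconds (same range as the loop above)."""
    return random.uniform(low, high)


uids = [str(i) for i in range(450)]  # placeholder ids, not real PMIDs
sizes = [len(b) for b in batches(uids)]
print(sizes)  # [200, 200, 50]

# Every id appears exactly once, in the original order
rejoined = [uid for b in batches(uids) for uid in b]
print(rejoined == uids)  # True
```

An empty input yields no batches at all, so no request is sent when the search returns nothing.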

First check whether the save path exists; if it does not, create it.

def makedirfile(root):  # create the folder
    if not os.path.exists(root):
        os.makedirs(root)
        print('created new folder:', root)

def downloadXML(basepath, query):
    # basepath: directory to save into; query: the assembled search string
    makedirfile(basepath)

    db = 'pubmed'
    base = 'https://eutils.ncbi.nlm.nih.gov/entrez/eutils/'
    url = base + 'esearch.fcgi?db=' + db + '&term=' + query + '&usehistory=y'
    print(url)

    output = requests.get(url).text

    web = re.findall(r'<WebEnv>(.*?)</WebEnv>', output)[0]
    key = re.findall(r'<QueryKey>(\d+)</QueryKey>', output)[0]
    count = re.findall(r'<Count>(\d+?)</Count>', output)[0]
    print(count)

    if int(count) > 0:
        cur_url = base + 'esearch.fcgi?db=' + db + '&query_key=' + key + '&WebEnv=' + web + '&retmax=' + str(count)
        uidlist = get_udi(cur_url)            # 1. get the uids

        # query_key must be the key returned by esearch, not the search string
        base_url = base + 'efetch.fcgi?' + 'db=' + db + '&WebEnv=' + web + '&query_key=' + key
        get_xml(uidlist, base_url, basepath)         # 2. download the XML
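
Because usehistory=y stores the result set on the NCBI history server, the E-utilities documentation linked earlier also describes paging through it with retstart and retmax instead of sending explicit id lists. This is not what the code above does; the following is a minimal sketch of that alternative that only builds the URLs. history_fetch_urls is a hypothetical helper name and the WebEnv value is a placeholder.

```python
from urllib.parse import urlencode

EFETCH = 'https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi'


def history_fetch_urls(web, key, count, batch=200, db='pubmed'):
    """Hypothetical helper: one efetch URL per page of the stored result set."""
    urls = []
    for start in range(0, count, batch):
        params = {
            'db': db, 'query_key': key, 'WebEnv': web,
            'retstart': start, 'retmax': batch,   # page offset and page size
            'rettype': 'abstract', 'retmode': 'xml',
        }
        urls.append(EFETCH + '?' + urlencode(params))
    return urls


urls = history_fetch_urls('EXAMPLE_WEBENV', '1', 450)
print(len(urls))
print(urls[0])
```

Each URL could then be fetched with requests.get and written to its own file, with the same random pause between requests as in get_xml.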

(3) Parse the XML
(4) Link the impact factors
(5) Count gene mentions

2019-07-14
