
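This is my first time joining a team-learning program organized by DateWhale, and I signed up for the data analysis track. This month's challenge is "Academic Frontier Trend Analysis".
As an aside, the form feature in Shimo Docs turns out to be quite good.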

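Task1: Paper data statistics.

Learning topic: paper counts. Count the number of papers in each area of computer science for the whole of 2019.
Topics covered: installing libraries in jupyter notebook; reading json files; list comprehensions; web scraping; regular expressions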

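1. Installing third-party libraries directly in jupyter notebook

When working in jupyter notebook, you sometimes need a third-party library that is not installed yet. You can run pip install directly in a notebook cell by putting an exclamation mark in front of the command.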

! pip install bs4

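2. Reading a json file

First import the json library, along with pandas, which the code below uses as pd.

import json
import pandas as pd

Open the json file with a with statement, then read the data line by line in a for loop.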

data=[]

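# Opening the file with a with statement has two advantages: 1. the file handle is closed automatically; 2. it is closed even if an exception is raised while reading
with open('arxiv-metadata-oai-2019.json','r') as f:
    for idx,line in enumerate(f):
        # Read only the first 100 lines; loading all the data takes about 8GB of memory
        if idx >=100:
            break
        data.append(json.loads(line))

data = pd.DataFrame(data) # convert the list to a DataFrame so it is easy to analyze with pandas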

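The json file stores one object per line, in this format:
{"id":value, "name":value, "score":value}
{"id":value, "name":value, "score":value}
{"id":value, "name":value, "score":value}
Once converted to a DataFrame, the data is much easier to analyze with Pandas.

You can define a dedicated function for reading the json file. It lets you specify which columns to keep and how many lines to read.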

def readArxivFile(path,columns=['id', 'submitter', 'authors', 'title', 'comments', 'journal-ref', 'doi',
       'report-no', 'categories', 'license', 'abstract', 'versions',
       'update_date', 'authors_parsed'],count=None):
    '''
    Read the file into a DataFrame.
    path: path to the file
    columns: columns to keep
    count: number of lines to read (None reads the whole file)
    '''
    data = []
    with open(path,'r') as f:
        for idx, line in enumerate(f):
            if idx==count:
                break
            d = json.loads(line)
            d = {col:d[col] for col in columns}
            data.append(d)
            
    data = pd.DataFrame(data)
    return data

data = readArxivFile('arxiv-metadata-oai-2019.json',['id','categories','update_date'])

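Code walkthrough:

json.loads(str) converts a json string into a dict. The result is shown below: the text looks the same before and after the conversion, but the two are fundamentally different. One is a string, the other is a dict.

[Figure: json.load()]
To keep only the specified keys of a dict, you can use a dict comprehension with a for clause and do it in a single line: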

d = {col:d[col] for col in columns}

The basic flow for reading the json file is:
1. Create an empty list data;
2. In the loop, call data.append(dict) to build [dict, dict, dict, ...];
3. Call pd.DataFrame(data) to convert it to a DataFrame.
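
The first two steps can be sketched with the standard library alone. This is a minimal illustration, not the tutorial's own code: the three sample records and the helper name read_json_lines are made up, and an in-memory io.StringIO stands in for the real arXiv file.

```python
import io
import json

# Hypothetical sample in the same one-object-per-line layout as the arXiv file.
sample = io.StringIO(
    '{"id": "0704.0001", "categories": "hep-ph", "update_date": "2019-02-20"}\n'
    '{"id": "0704.0002", "categories": "math.CO cs.CG", "update_date": "2019-03-01"}\n'
    '{"id": "0704.0003", "categories": "physics.gen-ph", "update_date": "2019-05-12"}\n'
)

def read_json_lines(f, columns, count=None):
    """Parse up to `count` lines from file object `f`, keeping only `columns`."""
    rows = []
    for idx, line in enumerate(f):
        if idx == count:          # never true when count is None, so all lines are read
            break
        record = json.loads(line)  # str -> dict
        rows.append({col: record[col] for col in columns})
    return rows

rows = read_json_lines(sample, ["id", "categories"], count=2)
print(rows)
```

Passing rows to pd.DataFrame would then give the table form described in step 3.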

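3. List comprehensions

(Dicts have comprehensions too, as described above.)
A list comprehension usually runs faster than the equivalent for loop that calls append, because the looping and appending are handled by dedicated bytecode rather than repeated method calls.

For a more complex two-level loop, the execution logic is shown in the figure:

[Figure: list comprehension]

A typical use case for this statement:
I have apples and oranges, he has oranges, pears and bananas, and so on. Count how many kinds of fruit the whole class has.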
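
The fruit scenario can be written out as a small sketch. The baskets list is made-up data; the point is the order of the two for clauses, which matches the order of the equivalent nested loops.

```python
# Hypothetical data: each inner list is the fruit one classmate has.
baskets = [
    ["apple", "orange"],
    ["orange", "pear", "banana"],
]

# Two-level comprehension: the outer loop is written first, the inner loop second.
all_fruit = [fruit for basket in baskets for fruit in basket]

# The equivalent nested for loop, for comparison.
all_fruit_loop = []
for basket in baskets:
    for fruit in basket:
        all_fruit_loop.append(fruit)

# A set removes the duplicates, leaving the distinct kinds of fruit.
unique_fruit = set(all_fruit)
print(len(unique_fruit))
```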

4. Extracting the year

pd.to_datetime(): converts strings directly to datetime
.dt.year: extracts the year

data["year"] = pd.to_datetime(data["update_date"]).dt.year
 # convert update_date from a str such as 2019-02-20 to datetime and extract the year

Similarly:

    df['purchase_date'] = pd.to_datetime(df['purchase_date'])
    df['year'] = df['purchase_date'].dt.year
    df['weekofyear'] = df['purchase_date'].dt.isocalendar().week  # .dt.weekofyear was removed in pandas 2.0
    df['month'] = df['purchase_date'].dt.month
    df['dayofweek'] = df['purchase_date'].dt.dayofweek
    df['weekend'] = (df.purchase_date.dt.weekday >=5).astype(int)
    df['hour'] = df['purchase_date'].dt.hour
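
To see what these accessors return, here is a small sketch on three made-up dates (the column name purchase_date follows the snippet above; the values are hypothetical). It assumes a pandas version that provides .dt.isocalendar(), i.e. 1.1 or later.

```python
import pandas as pd

# Hypothetical dates: a Wednesday, a Saturday, and the last day of 2019.
df = pd.DataFrame({"purchase_date": ["2019-02-20", "2019-02-23", "2019-12-31"]})
df["purchase_date"] = pd.to_datetime(df["purchase_date"])

df["year"] = df["purchase_date"].dt.year
df["month"] = df["purchase_date"].dt.month
df["dayofweek"] = df["purchase_date"].dt.dayofweek        # Monday=0 ... Sunday=6
df["weekend"] = (df["purchase_date"].dt.dayofweek >= 5).astype(int)
df["week"] = df["purchase_date"].dt.isocalendar().week    # ISO week number
print(df)
```

Note that 2019-12-31 falls in ISO week 1 of the following ISO year, so the week number and the calendar year can disagree around New Year.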

5. Deleting a column

del data["update_date"]

6. Resetting the index

data.reset_index(drop=True, inplace=True) # renumber the rows from 0

7. Web scraping

from bs4 import BeautifulSoup
import requests

# Scrape all the categories
website_url = requests.get('https://arxiv.org/category_taxonomy').text # fetch the page's HTML text
soup = BeautifulSoup(website_url,'lxml') # parse the HTML; the lxml parser is used for speed (requires the lxml package)
root = soup.find('div',{'id':'category_taxonomy_list'}) # find the entry-point tag
tags = root.find_all(["h2","h3","h4","p"], recursive=True) # collect the tags

website_url = requests.get('URL').text
Returns the HTML text of the page.
soup = BeautifulSoup(website_url,'lxml')
Parses the HTML text into a soup object.
root = soup.find('div',{'id':'category_taxonomy_list'})
Finds the entry-point tag. All the data we want is inside the div whose id is 'category_taxonomy_list'.
tags = root.find_all(["h2","h3","h4","p"], recursive=True)
Collects the tags: finds every h2, h3, h4 and p tag under root.
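
The find and find_all calls can be tried without any network access. The sketch below parses a made-up HTML fragment that only imitates the structure described above (the real arXiv page's markup may differ), and it uses Python's built-in html.parser so that lxml is not required.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML fragment; the real page's markup may differ.
html = """
<div id="category_taxonomy_list">
  <h2>Computer Science</h2>
  <h4>cs.AI <span>(Artificial Intelligence)</span></h4>
  <p>Covers all areas of AI except Vision and Robotics.</p>
</div>
<div id="other"><p>not wanted</p></div>
"""

soup = BeautifulSoup(html, "html.parser")  # built-in parser, no lxml needed
root = soup.find("div", {"id": "category_taxonomy_list"})
tags = root.find_all(["h2", "h3", "h4", "p"], recursive=True)

# Only tags inside the chosen div are returned, in document order.
for tag in tags:
    print(tag.name, "->", tag.get_text(strip=True))
```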

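8. Regular expressions

re.sub(pattern,repl,string,count=0,flags=0)
Used to replace or extract parts of a string.
pattern: the regular expression pattern.
repl: the replacement string; it can also be a function.
string: the original string to search and replace in.
count: the maximum number of replacements after matching; the default 0 replaces every match.
flags: matching-mode flags such as re.IGNORECASE, given as integer constants.
pattern, repl and string are required.

Example (replacement):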

import re

a = re.sub(r'hello', 'i love the', 'hello world')
print(a)

Output: i love the world
The hello in hello world is replaced with i love the.

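Example (extraction):

a = re.sub(r'(\d{4})-(\d{2})-(\d{2})', r'\2-\3-\1', '2018-06-07')
>>> a
'06-07-2018'

Here, \2 refers to the second group in the pattern.
In this way, re.sub can be used to pull the part you want out of a string.
For example, given the raw string raw="張三(警察)", the following code extracts "警察":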

raw="張三(警察)"
job = re.sub(r"(.*)\((.*)\)",r"\2",raw)
>>> job
'警察'

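Here, ( ) denotes a group, and (.*) inside it means the group accepts any content.
"\(" and "\)" are escaped parentheses: they match literal parenthesis characters and have no special meaning.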

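One thing to keep in mind when extracting with re.sub is that it returns the input unchanged when the pattern does not match. The sketch below reuses the raw string from the example above and compares it with re.search, which is a swapped-in alternative rather than the approach used in this post.

```python
import re

raw = "張三(警察)"

# re.sub rewrites the whole match as group 2, leaving only the part in parentheses.
job_sub = re.sub(r"(.*)\((.*)\)", r"\2", raw)

# re.search returns a match object (or None), so a failed match can be detected.
m = re.search(r"\((.*)\)", raw)
job_search = m.group(1) if m else None

# When nothing matches, re.sub returns the input unchanged.
unchanged = re.sub(r"(.*)\((.*)\)", r"\2", "no parentheses here")
print(job_sub, job_search, unchanged)
```
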
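9. Grouped aggregation: .agg

.groupby( ).agg( )
With agg, you can apply more than one aggregation to the same column.

Example 1: one column, one aggregation

df.groupby(['key1'])['data1'].min()
df.groupby(['key1'])['data1'].agg(['min'])
df.groupby(['key1']).agg({'data1':'min'})
All three take the per-group minimum of data1. In the last form the result column is still named data1 (recommended). The function names are passed as a list rather than a set, because a set has no defined order.

Example 2: one column, several aggregations

df.groupby(['key1'])['data1'].agg(['min','max'])
df.groupby(['key1']).agg({'data1':['min','max']})
Both compute the same values; the second form is recommended.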

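Example 3: several columns, several aggregations

[Figure: DataFrame]

grouped = df1.groupby(['sex','smoker'])
# sex takes the two values F and M, smoker takes Y and N, so there are four groups.
grouped.agg({'age':['sum','mean'],'weight':['min','max']})

[Figure: different aggregation functions applied to different columns]

Example 4: custom aggregation function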

def Max_cut_Min(group):
    return group.max()-group.min()

grouped.agg(Max_cut_Min)

[Figure: custom aggregation function]

In addition, you can call describe() directly on the result of a groupby.

grouped = df1.groupby(['sex','smoker'])
grouped.describe()

[Figure: describe after groupby]
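
Since the figures are not reproduced here, the following sketch runs examples 3 and 4 on a small made-up df1. The column names follow the examples above; the values and the function name max_minus_min are hypothetical.

```python
import pandas as pd

# Hypothetical data with the same column names as the examples above.
df1 = pd.DataFrame({
    "sex":    ["F", "F", "M", "M", "M"],
    "smoker": ["N", "N", "Y", "N", "N"],
    "age":    [20, 30, 40, 50, 60],
    "weight": [50, 54, 80, 70, 76],
})

grouped = df1.groupby(["sex", "smoker"])

# Different aggregations for different columns; the result has two-level columns.
summary = grouped.agg({"age": ["sum", "mean"], "weight": ["min", "max"]})

# A custom aggregation: the range (max minus min) of each column in each group.
def max_minus_min(group):
    return group.max() - group.min()

ranges = grouped.agg(max_minus_min)
print(summary)
print(ranges)
```

Only three groups appear in this sample, because groupby lists just the combinations of sex and smoker that occur in the data.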

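10. Exploding pie-chart slices

Giving slices different explode offsets makes the smaller shares easier to see.

import matplotlib.pyplot as plt

# Assumes _df has one row per group, with the count in "id" and the label in "group_name"
fig = plt.figure(figsize=(15,12))
explode = (0, 0, 0, 0.2, 0.3, 0.3, 0.2, 0.1)  # one offset per slice (8 slices here)
plt.pie(_df["id"],  labels=_df["group_name"], autopct='%1.2f%%', startangle=160, explode=explode)
plt.tight_layout()
plt.show()

[Figure: exploded pie chart]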
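
The explode tuple must contain exactly one offset per slice, so writing it by hand breaks as soon as the number of groups changes. One possible way to derive it from the data is sketched below; the helper name explode_small_slices, the 5% threshold and the 0.2 offset are all assumptions chosen for illustration, not part of the original code.

```python
def explode_small_slices(values, threshold=0.05, offset=0.2):
    """Return one offset per slice: `offset` for slices whose share of the
    total is below `threshold`, and 0 for the rest."""
    total = sum(values)
    return tuple(offset if v / total < threshold else 0 for v in values)

# Hypothetical paper counts per group.
counts = [500, 300, 150, 30, 20]
explode = explode_small_slices(counts)
print(explode)
```

The resulting tuple can be passed as the explode argument of plt.pie in place of the hand-written one.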

11. Filtering a DataFrame: the query function is more concise

[Reference article] https://www.cnblogs.com/traditional/p/11111327.html
Take the following df as an example:
"""
      name  age  height    where
0  mashiro   17     161    櫻花莊
1   satori   17     155    地靈殿
2   koishi   16     154    地靈殿
3   kurisu   18     172    石頭門
4   nagsia   21     158  clannad
"""

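Select rows with age > 17:
print(df.query("age > 17")) # equivalent to df[df["age"] > 17]
Select rows with where == '地靈殿':
print(df.query("where == '地靈殿'")) # equivalent to df[df["where"] == "地靈殿"]
Select rows with where == '地靈殿' and age == 17; equivalent to df[(df["where"] == "地靈殿") & (df["age"] == 17)]:
print(df.query("where == '地靈殿' and age == 17"))
Select rows whose height is greater than age. Every row satisfies this, of course; it only demonstrates that query can compare two columns:
print(df.query("height > age")) # equivalent to df[df["height"] > df["age"]]
What if a column name contains a space? Rename name to na me:
df = df.rename(columns={"name": "na me"})
Now na me has to be wrapped in backticks (``) to tell pandas that the content is a single name. Otherwise it is not parsed correctly:
print(df.query("`na me` == 'satori'"))
Find the records whose age is in [17, 21] and whose where is in ["櫻花莊", "clannad"]:
print(df.query("age in [17, 21] and where in ['櫻花莊', 'clannad']"))
What if you want to use a variable defined outside the query string? Prefix it with @ to mark it as an external variable: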

place="地靈殿"
print(df.query("where == @place"))
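
To check that query and boolean-mask indexing really select the same rows, here is a self-contained sketch that rebuilds the sample df from the table above and compares the two forms. The variable names by_query, by_mask, satori and in_list are introduced only for this illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "name":   ["mashiro", "satori", "koishi", "kurisu", "nagsia"],
    "age":    [17, 17, 16, 18, 21],
    "height": [161, 155, 154, 172, 158],
    "where":  ["櫻花莊", "地靈殿", "地靈殿", "石頭門", "clannad"],
})

# The same filter written with query (using an external variable) and with a mask.
place = "地靈殿"
by_query = df.query("where == @place and age == 17")
by_mask = df[(df["where"] == place) & (df["age"] == 17)]

# A column name containing a space must be wrapped in backticks.
renamed = df.rename(columns={"name": "na me"})
satori = renamed.query("`na me` == 'satori'")

# Membership tests with `in`.
in_list = df.query("age in [17, 21] and where in ['櫻花莊', 'clannad']")
print(by_query, satori, in_list, sep="\n\n")
```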

12. Pivot tables in pandas: the pivot function

df.groupby(["year","category_name"]).count().reset_index().pivot(index="category_name", columns="year",values="id")

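Group by year and category, aggregate with count(), call reset_index() to turn the group keys back into ordinary columns, and finally use pivot to reshape the result into a pivot table.
.pivot( ) takes three arguments: index gives the rows, columns gives the columns, and values gives the cell values.
.pivot( ) is typically used after a groupby aggregation, with index and columns being the features used in the groupby.
After the aggregation you also need reset_index(), so that the result has a regular integer index again.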
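
The chain can be followed step by step on a tiny made-up table. The column names match the statement above; the five rows are hypothetical.

```python
import pandas as pd

# Hypothetical rows: one paper per row.
df = pd.DataFrame({
    "id":            ["a", "b", "c", "d", "e"],
    "year":          [2019, 2019, 2019, 2020, 2020],
    "category_name": ["AI", "AI", "CV", "AI", "CV"],
})

# Long form: one row per (year, category_name) pair with the paper count in "id".
counts = df.groupby(["year", "category_name"]).count().reset_index()

# Wide form: categories as rows, years as columns, counts as cell values.
table = counts.pivot(index="category_name", columns="year", values="id")
print(table)
```

Without reset_index(), year and category_name would remain in the index and pivot could not refer to them as columns.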

DateWhale--2021.1--Task1
