
Before analyzing a dataset, you should explore it first. This is what exploratory data analysis (EDA) means: getting to know the dataset's variable types, rough distributions, outliers, missing values, and so on.

Report generated with pandas-profiling

Commands that come up constantly when exploring a pandas table include:

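df.info() shows the data type of each variable;
df.describe() shows summary statistics for numeric (or categorical, or datetime) variables. For numeric variables these are the non-NA count, the maximum max, the minimum min, the mean, the standard deviation std, and three quantiles ([0.25, 0.5, 0.75]);
df.head() and df.tail() show the first (last) 5 rows of the dataset;
Series.value_counts() shows how often each value occurs in a Series;
Other commands you may need include skew() and kurt() for the sample skewness and kurtosis. Histograms of single variables, correlation coefficients between pairs of variables, and counting and dropping (or replacing) missing values are also hard to avoid.
The methods of exploratory data analysis are well defined and the steps are clear, but with this many commands it is easy to miss something. Each command also comes with a pile of parameters that take time to get comfortable with. For example, df.describe() by default only summarizes the numeric variables in the dataset; you have to pass the include parameter to see statistics for categorical variables.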
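
The commands listed above can be tried on a small made-up table. This is only a minimal sketch: the DataFrame and its column names (`floor_count`, `primary_use`) are invented for illustration and are not taken from the dataset used later in this article.

```python
import pandas as pd

# Hypothetical toy data, only for illustration
df = pd.DataFrame({
    "floor_count": [1, 2, 2, 3, 10, None],
    "primary_use": ["Office", "Education", "Office", "Office", "Retail", "Education"],
})

df.info()                                      # column dtypes and non-null counts
numeric_summary = df.describe()                # numeric columns only by default
full_summary = df.describe(include="all")      # also summarizes non-numeric columns
first_rows = df.head()                         # first 5 rows
use_counts = df["primary_use"].value_counts()  # frequency of each value
skewness = df["floor_count"].skew()            # sample skewness
kurtosis = df["floor_count"].kurt()            # sample excess kurtosis
missing_per_column = df.isna().sum()           # number of missing values per column

print(numeric_summary)
print(full_summary)
print(use_counts)
print(skewness, kurtosis)
print(missing_per_column)
```

Note how `numeric_summary` contains only `floor_count`, while `full_summary` also has a `primary_use` column.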

df.describe() example

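I came across a recommendation for pandas_profiling in Udacity's (優達學城) column, and it looks like just the tool for getting EDA done quickly. As the name suggests, df.profile_report() produces an all-round snapshot of a DataFrame, so the user can quickly learn about every aspect of the dataset.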

Installing pandas_profiling

The official pandas-profiling documentation gives the following installation methods:

pip install pandas-profiling
# install directly from GitHub
pip install https://github.com/pandas-profiling/pandas-profiling/archive/master.zip
# install with conda
conda install -c conda-forge pandas-profiling

On my machine, though, there was a small hiccup. After the installation succeeded, importing the pandas_profiling package failed with an error.

cannot import name 'to_html'

After reinstalling, the import worked, but the profile_report() command would not run.

cannot import name 'GridspecLayout'

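I suspected a version conflict. Searching online did not turn up a direct answer, but I did find a similar question, "ImportError: cannot import name 'AppLayout' from 'ipywidgets'", where the suggested fix was to pin ipywidgets to version 7.5. Copying that approach turned out to fix the GridspecLayout ImportError as well (quietly pleased with myself).
I am noting it down here in case anyone else runs into the same problem.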
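
When an import error like this points to a version conflict, a reasonable first step is to check which versions are actually installed. The following is a minimal sketch using only the standard library (Python 3.8+); the helper name `installed_version` is made up here, and the list of packages to check is just an assumption about what is relevant.

```python
from importlib import metadata


def installed_version(package_name):
    """Return the installed version string, or None if the package is absent."""
    try:
        return metadata.version(package_name)
    except metadata.PackageNotFoundError:
        return None


# Print the versions of the packages involved in the errors above
for name in ("pandas-profiling", "ipywidgets", "pandas"):
    print(name, installed_version(name))
```

The pinning itself is done with pip, for example `pip install ipywidgets==7.5`.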

Stack Overflow answer about the ipywidgets error

Profiling a pandas table

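Once installed, usage is simple: just call df.profile_report(). The example here uses the dataset from the ASHRAE building energy prediction competition on Kaggle; the image at the top of this article is the snapshot of the data in building_metadata.csv.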

import pandas as pd
import pandas_profiling  # importing it registers df.profile_report()

weather_train = pd.read_csv('weather_train.csv', encoding="utf-8",
                            parse_dates=['timestamp'], index_col='timestamp')
weather_train.profile_report()

Sometimes you will hit the error "Error rendering Jupyter widget: missing widget manager". In that case the report can be rendered in an iframe instead:

profile = weather_train.profile_report(title="Pandas Profiling Report")
profile.to_notebook_iframe()
# save the snapshot as a standalone HTML file
profile.to_file(output_file="your_report.html")

Example of a pandas_profiling exploration report (pandas-profiling.gif)

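The first impression is that the generated report is very comprehensive. It has five parts: Overview, Variables, Correlations, Missing values, and Sample.
The Overview part is mainly a tally of variable types: how many numeric, datetime, and categorical variables there are, and so on. Since a dedicated Variables part follows, it does not list the data type of every column the way df.info() does.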
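
A comparable tally of variable types can be produced with plain pandas. This is only a sketch on an invented table (the column names are hypothetical), and pandas_profiling uses its own, richer type categories than the raw dtypes counted here.

```python
import pandas as pd

# Hypothetical table with one datetime, two numeric, and one text column
df = pd.DataFrame({
    "timestamp": pd.to_datetime(["2016-01-01", "2016-01-02", "2016-01-03"]),
    "air_temperature": [25.0, 24.4, 22.8],
    "cloud_coverage": [6.0, None, 2.0],
    "site": ["A", "A", "B"],
})

# Count how many columns share each dtype
type_counts = df.dtypes.astype(str).value_counts()
print(type_counts)
```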
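Worth mentioning is the Warnings section inside the Overview. It lists all kinds of things that deserve attention, as in the NZA (open data from the Dutch Healthcare Authority) report from the official documentation, shown below.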

Snapshot of the NZA (open data from the Dutch Healthcare Authority) dataset

Overly high correlation between variables, too large a share of NA (or zero) values, excessive skewness, and so on all trigger warnings.
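
To make the idea behind such warnings concrete, here is a rough hand-written version in plain pandas. It is a sketch only: the function name `simple_warnings` and the three thresholds are invented for illustration and are not the rules or defaults that pandas_profiling itself applies.

```python
import pandas as pd

# Hypothetical thresholds, chosen only for illustration
MISSING_RATIO_LIMIT = 0.3
SKEW_LIMIT = 2.0
CORR_LIMIT = 0.9


def simple_warnings(df):
    """Collect a few rough data-quality warnings for the numeric columns."""
    messages = []
    numeric = df.select_dtypes(include="number")
    for col in numeric.columns:
        missing_ratio = numeric[col].isna().mean()
        if missing_ratio > MISSING_RATIO_LIMIT:
            messages.append(f"{col}: {missing_ratio:.0%} missing")
        skew = numeric[col].skew()
        if pd.notna(skew) and abs(skew) > SKEW_LIMIT:
            messages.append(f"{col}: skewness {skew:.2f}")
    # Check every pair of numeric columns once
    corr = numeric.corr()
    cols = list(corr.columns)
    for i, a in enumerate(cols):
        for b in cols[i + 1:]:
            if abs(corr.loc[a, b]) > CORR_LIMIT:
                messages.append(f"{a} and {b}: correlation {corr.loc[a, b]:.2f}")
    return messages


df = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    "b": [2.0, 4.0, 6.0, 8.0, 10.0, 12.0],   # perfectly correlated with a
    "c": [3.0, None, None, None, 1.0, 2.0],  # half the values missing
})
for message in simple_warnings(df):
    print(message)
```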
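The Variables part gives each variable's data type and statistics, such as the counts and percentages of unique, NA, and zero values. The real value is behind Toggle details, which provides essentially everything needed for a univariate analysis.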

Variables shows statistics, histograms, and more for each column

The Correlations part computes the (Spearman, Pearson and Kendall) correlation coefficients between variables:

Correlations results
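
The difference between these coefficients is easy to see with pandas' own `DataFrame.corr`. The sketch below uses invented data in which `y` grows monotonically but not linearly with `x`, so the rank-based Spearman coefficient is 1 while Pearson stays below 1.

```python
import pandas as pd

# Hypothetical data: y is a monotonic, non-linear function of x
df = pd.DataFrame({"x": [1, 2, 3, 4, 5]})
df["y"] = df["x"] ** 3

pearson = df.corr(method="pearson").loc["x", "y"]    # linear association
spearman = df.corr(method="spearman").loc["x", "y"]  # rank (monotonic) association
# method="kendall" is also accepted; as far as I know it needs SciPy installed
print(pearson, spearman)
```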

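The Missing values part gives information about the missing values in each column. Count shows the number of non-NA values; Matrix shows where the NA values occur in each variable; Heatmap shows how correlated the occurrence of NA values is between columns, which the missingno documentation calls nullity correlation. If one column being NA means another column is always NA, the nullity correlation is 1; if one column being NA means the other is never NA, it is -1; if the NA occurrences are unrelated, it is 0. The Dendrogram is a tree diagram of the columns built from their NA values. Overall, the results in the Missing values part look very much like the output of another package, missingno. I wonder whether the pandas_profiling author simply calls missingno to produce them?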
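
The nullity correlation described above can be reproduced by correlating the missing-value indicators of the columns. This is a minimal sketch with invented data and plain pandas; it illustrates the definition and is not necessarily how pandas_profiling or missingno computes its heatmap.

```python
import pandas as pd

# Hypothetical data with deliberately aligned missing values
df = pd.DataFrame({
    "a": [1.0, None, 3.0, None],
    "b": [1.0, None, 3.0, None],   # missing exactly where a is missing
    "c": [None, 2.0, None, 4.0],   # missing exactly where a is present
})

# Turn each column into a 0/1 "is missing" indicator, then correlate
nullity_corr = df.isna().astype(int).corr()
print(nullity_corr)
```

Here `a` and `b` have a nullity correlation of 1, while `a` and `c` have -1.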

Missing values results

The Sample part is the simplest: it is equivalent to df.head(10) plus df.tail(10).

The Sample part shows the first (last) 10 rows
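
The same view can be assembled by hand with pandas. A short sketch on a hypothetical 100-row table:

```python
import pandas as pd

df = pd.DataFrame({"value": range(100)})  # hypothetical 100-row table

# First 10 and last 10 rows stacked into one table
sample = pd.concat([df.head(10), df.tail(10)])
print(sample)
```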

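Other options include saving the result as a JSON file, passing a dictionary to set the number of histogram bins, and specifying minimal=True for large datasets so that time-consuming computations such as correlation coefficients are skipped. For more details, see the official pandas-profiling documentation.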

# As a string
json_data = profile.to_json()
# As a file
profile.to_file(output_file="your_report.json")

from pandas_profiling import ProfileReport

profile = ProfileReport(df, title='Pandas Profiling Report', style={'full_width':True})
profile

profile = df.profile_report(title='Pandas Profiling Report', plot={'histogram': {'bins': 8}})
profile.to_file(output_file="output.html")

profile = ProfileReport(large_dataset, minimal=True)

Conclusion

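pandas_profiling produces a snapshot of a DataFrame that covers the vast majority of the statistics you are likely to want, and it is quicker than running the individual pandas commands one by one;
With default parameters the computation can take a long time, so for large datasets you can specify ProfileReport(df, minimal=True) to skip correlation coefficients and similar calculations.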

References

pandas.DataFrame.describe
Udacity (優達學城): Python data analysis: what are some little-known tricks?
ImportError: cannot import name 'AppLayout' from 'ipywidgets'
Correlation analysis metrics: Pearson, Spearman, Kendall, Mutual information
https://github.com/ResidentMario/missingno
Official pandas-profiling documentation
