Pandas File-Reading Performance: CSV vs. Pickle

Reading the CSV file

import pandas as pd
csv_path = 'gun_deaths_in_america.csv'
data_csv = pd.read_csv(csv_path,header=0)
data_csv.head()
data_csv.shape
(100798, 10)
%timeit pd.read_csv(csv_path,header=0)
114 ms ± 5.38 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
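
%timeit is an IPython magic, so it only works in a notebook or IPython shell. Outside of those, the standard-library timeit module gives a rough equivalent. The sketch below is only an illustration: it parses a small, made-up CSV held in memory (csv_text and read_once are hypothetical names), not the real gun_deaths_in_america.csv.

```python
import io
import timeit

import pandas as pd

# Hypothetical stand-in data: a small CSV kept in memory instead of the real file.
csv_text = "year,intent\n" + "2012,Suicide\n" * 1000


def read_once():
    # Parse the same CSV text each time, mirroring pd.read_csv(csv_path, header=0).
    return pd.read_csv(io.StringIO(csv_text), header=0)


# 3 runs of 5 loops each; timeit.repeat returns the total seconds for each run.
runs = timeit.repeat(read_once, repeat=3, number=5)
best_ms = min(runs) / 5 * 1000
print(f"best of 3 runs: {best_ms:.2f} ms per loop")
```

The absolute numbers will differ from the ones in this post, since the data and the machine are different; only the measuring approach carries over.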

Checking file size

Checking the size on disk

import os
os.stat('gun_deaths_in_america.csv').st_size  # size in bytes
4824404

Checking memory usage

data_csv.memory_usage(deep=True).sum()
30368107
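
Raw byte counts are hard to compare at a glance. As a small sketch, the helper below (human_size is a hypothetical name, not part of pandas or os) formats the two figures reported above in binary units and shows how much larger the DataFrame is in memory than the CSV is on disk.

```python
def human_size(num_bytes):
    """Format a byte count using binary units (1 KiB = 1024 bytes)."""
    size = float(num_bytes)
    for unit in ("B", "KiB", "MiB", "GiB"):
        if size < 1024 or unit == "GiB":
            return f"{size:.2f} {unit}"
        size /= 1024


csv_on_disk = 4824404     # os.stat(...).st_size reported above
csv_in_memory = 30368107  # memory_usage(deep=True).sum() reported above

print(human_size(csv_on_disk))
print(human_size(csv_in_memory))
print(f"in-memory / on-disk: {csv_in_memory / csv_on_disk:.1f}x")
```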

Checking memory usage per column

  • object columns take up a lot of memory
  • int/float columns take up little memory
data_csv.memory_usage(deep=True)
Index             80
year          806384
month         806384
intent       6495168
police        806384
sex          6249476
age           806384
race         6322009
hispanic      806384
place        6463070
education     806384
dtype: int64
data_csv.dtypes
year           int64
month          int64
intent        object
police         int64
sex           object
age          float64
race          object
hispanic       int64
place         object
education    float64
dtype: object
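
To see why the object columns dominate, here is a minimal sketch on made-up data (not the gun-deaths dataset). It assumes the text column is stored with the object dtype, as in the dtypes output above. With deep=False pandas only counts the 8-byte pointers, while deep=True also counts the Python string objects they point to.

```python
import pandas as pd

# Hypothetical stand-in data with one numeric column and one text column.
df = pd.DataFrame({
    "year": pd.Series([2012, 2013, 2014] * 1000, dtype="int64"),
    "intent": pd.Series(["Suicide", "Homicide", "Accidental"] * 1000, dtype=object),
})

shallow = df.memory_usage(deep=False)  # counts only the array buffers
deep = df.memory_usage(deep=True)      # also counts the string objects

print(shallow)
print(deep)
```

The numeric column reports the same size either way (8 bytes per value), while the object column grows considerably once the strings themselves are included.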

Saving as a Pickle file

Saving directly as a Pickle file

Saved to disk, the pickle file is larger than the original CSV file.

pkl_path_before = 'gun_deaths_in_america_before_transform.pkl'
data_csv.to_pickle(pkl_path_before)
os.stat(pkl_path_before).st_size
5656925

Comparing read speed

Reading the pickle file is about 3 times faster than reading the CSV file (32.4 ms versus 102 ms).

%timeit pd.read_csv(csv_path,header=0)
102 ms ± 7.34 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit pd.read_pickle(pkl_path_before)
32.4 ms ± 5.07 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

Saving as a Pickle file after type conversion

As seen above, object columns take up a lot of memory; they can be converted to the category type.

data_csv.intent.astype('category').head()
0    Suicide
1    Suicide
2    Suicide
3    Suicide
4    Suicide
Name: intent, dtype: category
Categories (4, object): [Accidental, Homicide, Suicide, Undetermined]

Convert the intent column first. Compared with 6,495,168 bytes as object, the category version takes about 1/64 of the memory.

data_csv.intent.astype('category').memory_usage(deep=True)
101303
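
The saving comes from how a categorical column is stored: each unique label is kept once, and every row holds only a small integer code pointing to it. A minimal sketch on made-up sample values (mirroring the four intent categories shown above):

```python
import pandas as pd

# Hypothetical sample values, not rows from the real dataset.
intent = pd.Series(
    ["Suicide", "Homicide", "Suicide", "Accidental", "Undetermined", "Suicide"],
    dtype=object,
)
as_cat = intent.astype("category")

print(list(as_cat.cat.categories))  # the unique labels, stored once
print(list(as_cat.cat.codes))       # one small integer per row
print(as_cat.cat.codes.dtype)       # a compact integer type when categories are few
```

With only four distinct values across roughly 100,000 rows, one byte per row plus four strings is far cheaper than 100,000 separate Python string objects.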

Converting all columns to the category type

for col in data_csv.columns:
    data_csv[col] = data_csv[col].astype('category')
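
The loop above converts every column, numeric ones included. One possible variation, shown here only as a sketch on made-up data and not what this post measures, is to convert just the object columns. Numeric columns then keep their dtype, so reductions such as mean(), which are generally not available on categorical columns, still work. It assumes text columns use the object dtype, as in the dtypes output earlier.

```python
import pandas as pd

# Hypothetical stand-in data with mixed column types.
df = pd.DataFrame({
    "year": pd.Series([2012, 2013, 2014, 2012], dtype="int64"),
    "sex": pd.Series(["M", "F", "M", "M"], dtype=object),
    "age": pd.Series([34.0, 21.0, 60.0, 45.0], dtype="float64"),
})

converted = df.copy()
# Convert only the object (text) columns; leave int/float columns untouched.
for col in converted.select_dtypes(include="object").columns:
    converted[col] = converted[col].astype("category")

print(converted.dtypes)
print(converted["age"].mean())  # numeric columns still support arithmetic
```

The memory and file-size figures reported in this post come from converting all columns, so this variation would give different numbers.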

Check memory usage after the conversion: compared with 30,368,107 bytes before, it is now about 30 times smaller.

data_csv.memory_usage(deep=True).sum()
1018587

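Save the converted data as a pickle file and check its size on disk. Compared with the original CSV (4,824,404 bytes), the converted pickle is about 4.8 times smaller; compared with the unconverted pickle (5,656,925 bytes), it is about 5.6 times smaller.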

pkl_path_after = 'gun_deaths_in_america_after_transform.pkl'
data_csv.to_pickle(pkl_path_after)
os.stat(pkl_path_after).st_size
1012643

Compare read speed: reading the converted pickle is about 41 times faster than reading the CSV (2.57 ms versus 106 ms).

%timeit pd.read_pickle(pkl_path_after)
2.57 ms ± 262 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit pd.read_csv(csv_path,header=0)
106 ms ± 3.76 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

Overall comparison

files = [csv_path,pkl_path_before,pkl_path_after]

Comparing size on disk

The converted pickle takes the least disk space, roughly 4.8 times less than the original CSV, which is very useful when storing large amounts of data.

for file in files:
    print('File size of the {0} is {1}:  '.format(file,os.stat(file).st_size))
File size of the gun_deaths_in_america.csv is 4824404:  
File size of the gun_deaths_in_america_before_transform.pkl is 5656925:  
File size of the gun_deaths_in_america_after_transform.pkl is 1012643:  

Comparing read speed

In this run, reading the converted pickle is about 45 times faster than reading the plain CSV file (2.18 ms versus 97.5 ms).

%timeit pd.read_csv(csv_path,header=0)
97.5 ms ± 3 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit pd.read_pickle(pkl_path_before)
28.5 ms ± 1.33 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit pd.read_pickle(pkl_path_after)
2.18 ms ± 141 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)

Comparing memory usage

Memory usage after conversion is about 30 times smaller than before.

for file in files:
    if os.path.splitext(file)[1]=='.csv':
        print('memory_usage of the {0} is : {1}'. \
            format(file,pd.read_csv(file,header=0).memory_usage(deep=True).sum()))
    else:
        print('memory_usage of the {0} is : {1}'. \
            format(file,pd.read_pickle(file).memory_usage(deep=True).sum()))
memory_usage of the gun_deaths_in_america.csv is : 30368107
memory_usage of the gun_deaths_in_america_before_transform.pkl is : 30368107
memory_usage of the gun_deaths_in_america_after_transform.pkl is : 1010827
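
The ratios quoted in this comparison can be re-derived from the printed figures. The sketch below simply copies the numbers from the outputs above and divides each one by the value for the converted pickle; ratio_to_best and the dictionary keys are hypothetical names used for illustration.

```python
# Figures copied from the outputs above.
size_bytes = {"csv": 4824404, "pkl_before": 5656925, "pkl_after": 1012643}
read_ms = {"csv": 97.5, "pkl_before": 28.5, "pkl_after": 2.18}
memory_bytes = {"csv": 30368107, "pkl_before": 30368107, "pkl_after": 1010827}


def ratio_to_best(table, best="pkl_after"):
    """Return how many times larger each entry is than the `best` entry."""
    return {name: value / table[best] for name, value in table.items()}


for label, table in [("disk", size_bytes), ("read", read_ms), ("memory", memory_bytes)]:
    ratios = ratio_to_best(table)
    print(label, {name: round(r, 1) for name, r in ratios.items()})
```

Timings vary from run to run, so the read-speed ratio in particular should be taken as a rough figure.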

The data read back is the same in every case; only the data types differ.
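
As a small illustration of that point, the sketch below round-trips a made-up categorical column through both formats. It assumes a temporary directory and a hypothetical file name sample.pkl. The pickle keeps the category dtype, while the CSV only stores text, so the dtype has to be inferred again on reading.

```python
import io
import os
import tempfile

import pandas as pd

# Hypothetical sample data, not the real dataset.
df = pd.DataFrame({"intent": pd.Series(["Suicide", "Homicide", "Suicide"], dtype=object)})
df_cat = df.astype("category")

# Round trip through pickle: dtypes are stored with the data.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sample.pkl")  # hypothetical file name
    df_cat.to_pickle(path)
    restored = pd.read_pickle(path)

# Round trip through CSV text: only the values are stored.
buf = io.StringIO()
df_cat.to_csv(buf, index=False)
buf.seek(0)
from_csv = pd.read_csv(buf)

print(restored["intent"].dtype)   # category dtype survives the pickle round trip
print(from_csv["intent"].dtype)   # re-inferred from text, no longer category
print(list(restored["intent"]) == list(df["intent"]))  # values are unchanged
```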

