title: "Read/Write Race"
author: "ky"
date: "January 28, 2020"
output: word_document
If we already have a very large data set in R, which tool writes it to disk fastest?
If a large data file sits in a folder, how can we read it quickly and load it into the workspace?
Below we build a fairly large data frame and run some practical tests.
library(pacman)
p_load(fst,feather,data.table,tidyverse)
nr_of_rows <- 1e7
df <- data.frame(
  logical = sample(c(TRUE, FALSE, NA), prob = c(0.85, 0.1, 0.05), nr_of_rows, replace = TRUE),
  Integer = sample(1L:100L, nr_of_rows, replace = TRUE),
  Real = sample(sample(1:10000, 20) / 100, nr_of_rows, replace = TRUE),
  Factor = as.factor(sample(labels(UScitiesD), nr_of_rows, replace = TRUE))
)
Check how much memory the data frame occupies
object.size(df) %>%
  print(units = 'auto')
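To see which columns account for that size, `object.size` can also be applied column by column. This is a minimal sketch on a much smaller data frame; `small_df` and `col_bytes` are names introduced here for illustration and are not part of the benchmark.

```r
# Smaller stand-in for the benchmark data frame (hypothetical name: small_df)
set.seed(1)
n <- 1e4
small_df <- data.frame(
  Logical = sample(c(TRUE, FALSE, NA), n, replace = TRUE),
  Integer = sample(1L:100L, n, replace = TRUE),
  Real    = sample(1:10000, n, replace = TRUE) / 100,
  Factor  = as.factor(sample(labels(UScitiesD), n, replace = TRUE))
)

# Size in bytes of each column: doubles use 8 bytes per value,
# while logicals, integers and factor codes use 4
col_bytes <- sapply(small_df, function(col) as.numeric(object.size(col)))
print(col_bytes)
```

The double column should come out roughly twice as large as the integer column, which is worth knowing when estimating how big a file on disk might be.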
1. CSV group
write.csv: the standard function shipped with base R (utils package)
write_csv: function from the readr package (part of the tidyverse)
fwrite: function from the data.table package
setwd('e:/r-lhtz')
p_load(microbenchmark)
microbenchmark(write.csv(df, 'df_base.csv'),
               write_csv(df, 'df_readr.csv'),
               fwrite(df, 'df_dt.csv'),
               times = 1, unit = 's')
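When microbenchmark is not installed, a single-run timing can be approximated with base R's `system.time`. The sketch below uses a small data frame and a temporary file, both stand-ins chosen for illustration rather than the benchmark setup above.

```r
# Small stand-in data frame and a temporary output path (illustrative only)
timing_df <- data.frame(x = runif(1e5), y = sample(letters, 1e5, replace = TRUE))
out_file <- tempfile(fileext = ".csv")

# system.time reports user, system and elapsed times in seconds
elapsed <- system.time(write.csv(timing_df, out_file, row.names = FALSE))[["elapsed"]]
cat("write.csv took", elapsed, "seconds\n")

file.remove(out_file)  # clean up the temporary file
```

A single run is noisy, so this is only a rough check. microbenchmark with a larger `times` value gives a more reliable picture.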
Read test (every function reads the same file, df_dt.csv)
microbenchmark(read.csv('df_dt.csv') -> df_base,
               read_csv('df_dt.csv') -> df_readr,
               fread('df_dt.csv') -> df_dt,
               times = 1, unit = 's')
df_dt %>% as_tibble() -> df_readr1 # read with data.table, then convert to a tibble for tidyverse processing
df_readr1
gdata::keep(df, sure = TRUE) # keep only the df variable in the workspace
file.remove(c('df_dt.csv','df_base.csv','df_readr.csv')) # delete the files written above
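One caveat of the CSV route: the file stores plain text, so column types such as factors are not recorded and have to be declared again on import. A minimal base-R sketch follows; the data frame and file names are made up for this example.

```r
# Write a small data frame with a factor column to a temporary CSV file
orig <- data.frame(id = 1:3, city = factor(c("Athens", "Rome", "Athens")))
csv_file <- tempfile(fileext = ".csv")
write.csv(orig, csv_file, row.names = FALSE)

# Read it back as plain strings: the factor comes back as character
plain <- read.csv(csv_file, stringsAsFactors = FALSE)
print(class(plain$city))

# Declaring the column classes on import restores the factor
typed <- read.csv(csv_file, colClasses = c(id = "integer", city = "factor"))
print(class(typed$city))

file.remove(csv_file)  # clean up the temporary file
```

Declaring column classes explicitly also saves the reader from guessing types, which can help on large files.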
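2. Binary group
Binary formats are usually faster to read and write than CSV. In base R, a data table can be stored with the saveRDS function
(file extension ".rds") and read back with readRDS. In the tidyverse ecosystem, the readr package provides read_rds and write_rds.
Two further options are the feather and fst packages, which are standalone packages rather than part of data.table.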
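Unlike CSV, the .rds format keeps the R object as it is. The following base-R sketch writes a tiny data frame to a temporary file and reads it back; the object and file names are illustrative.

```r
# Small data frame with a factor column and a missing value
orig <- data.frame(flag = c(TRUE, NA, FALSE),
                   city = factor(c("Rome", "Athens", "Rome")))
rds_file <- tempfile(fileext = ".rds")

saveRDS(orig, rds_file)        # serialize the object to disk
restored <- readRDS(rds_file)  # read it back into a new object

# Column types, factor levels and NA values survive the round trip
same <- identical(orig, restored)
print(same)

file.remove(rds_file)  # clean up the temporary file
```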
microbenchmark(write_rds(df, 'df.rds'),
               write_feather(df, 'df.feather'),
               write_fst(df, 'df.fst'),
               times = 10, unit = 's')
microbenchmark(read_rds('df.rds') -> df_rds,
               read_feather('df.feather') -> df_feather,
               read_fst('df.fst') -> df_fst,
               times = 10, unit = 's')
setequal(df,df_rds)
setequal(df,df_feather)
setequal(df,df_fst)
file.remove(c('df.rds','df.feather','df.fst'))
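The Swiss Army knife of data import, export and conversion (rio)
The rio package can import and export many file formats. It must be installed before first use (p_load installs it if it is missing).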
library(pacman)
p_load(rio)
export(iris,'iris.xlsx')
export(list(mtcars = mtcars, iris = iris), file = 'mtcars_iris.xlsx') # two data sets written as two sheets of one file
import_list('mtcars_iris.xlsx') -> mtcars_iris
mtcars_iris[[1]]
mtcars_iris[[2]]
import_list('mtcars_iris.xlsx', which = 2) -> iris2 # extract only the 2nd worksheet
iris2
The convert function in rio converts a file from one format to another
convert('iris.xlsx','iris.fst')
unlink(c('iris.xlsx','mtcars_iris.xlsx','iris.fst','iris.csv','iris1.csv')) # unlink deletes files much like file.remove
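Conceptually, a format conversion like the `convert` call above is an import followed by an export. The base-R sketch below illustrates the idea for one pair of formats. `csv_to_rds` is a hypothetical helper written for this example, not rio's implementation.

```r
# Hypothetical helper: read a CSV file and save its contents as an .rds file
csv_to_rds <- function(in_file, out_file) {
  data <- read.csv(in_file, stringsAsFactors = TRUE)
  saveRDS(data, out_file)
  invisible(out_file)
}

# Usage with temporary files so nothing is left in the working directory
csv_file <- tempfile(fileext = ".csv")
rds_file <- tempfile(fileext = ".rds")
write.csv(iris, csv_file, row.names = FALSE)
csv_to_rds(csv_file, rds_file)

converted <- readRDS(rds_file)
print(dim(converted))

unlink(c(csv_file, rds_file))  # clean up the temporary files
```

A general-purpose converter additionally has to pick the reader and writer from the file extensions, which is the part a package such as rio takes care of.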