1、concat連接
1.1、DataFrame連接
可以使用concat將下面的三個(gè)DataFrame連接。
ignore_index:是否忽略前面的索引。
axis:columns或者1表示按照行添加。
import pandas as pd
df1 = pd.read_csv('data/concat_1.csv')
df2 = pd.read_csv('data/concat_2.csv')
df3 = pd.read_csv('data/concat_3.csv')
row_concat = pd.concat([df1,df2,df3], ignore_index=False)
# A B C D
#0 a0 b0 c0 d0
#1 a1 b1 c1 d1
#2 a2 b2 c2 d2
#3 a3 b3 c3 d3
# A B C D
#0 a4 b4 c4 d4
#1 a5 b5 c5 d5
#2 a6 b6 c6 d6
#3 a7 b7 c7 d7
# A B C D
#0 a8 b8 c8 d8
#1 a9 b9 c9 d9
#2 a10 b10 c10 d10
#3 a11 b11 c11 d11
row_concat = pd.concat([df1,df2,df3], axis='columns')
# A B C D A B C D A B C D
#0 a0 b0 c0 d0 a4 b4 c4 d4 a8 b8 c8 d8
#1 a1 b1 c1 d1 a5 b5 c5 d5 a9 b9 c9 d9
#2 a2 b2 c2 d2 a6 b6 c6 d6 a10 b10 c10 d10
#3 a3 b3 c3 d3 a7 b7 c7 d7 a11 b11 c11 d11
1.2、連接DataFrame和Series
1、使用concat連接DataFrame和Series
new_series = pd.Series(['n1', 'n2', 'n3', 'n4'])
print(new_series)
#0 n1
#1 n2
#2 n3
#3 n4
#dtype: object
pd.concat([df1, new_series])
# A B C D 0
#0 a0 b0 c0 d0 NaN
#1 a1 b1 c1 d1 NaN
#2 a2 b2 c2 d2 NaN
#3 a3 b3 c3 d3 NaN
#0 NaN NaN NaN NaN n1
#1 NaN NaN NaN NaN n2
#2 NaN NaN NaN NaN n3
#3 NaN NaN NaN NaN n4
# 由于Series是列數(shù)據(jù),concat方法默認(rèn)是添加行,但是Series數(shù)據(jù)沒有行索引
# 所以添加了一個(gè)新列,缺失的數(shù)據(jù)用NaN填充
2、將['n1','n2','n3','n4']作為行連接到df1后,可以創(chuàng)建DataFrame并指定列名
new_row_df = pd.DataFrame([['n1','n2', 'n3', 'n4']], columns=['A','B','C','D'])
# A B C D
#0 n1 n2 n3 n4
2、append函數(shù)
1、concat可以連接多個(gè)對象,如果只需要向現(xiàn)有DataFrame追加一個(gè)對象,可以通過append函數(shù)來實(shí)現(xiàn)
注:append函數(shù)未來可能會被替換
df1.append(df2)
2、添加列dataframe['列名']=['值']
axis=1表示按列添加
col_concat = pd.concat([df1,df2,df3], axis=1)
col_concat['new_col']=['n1','n2','n3','n4']
# A B C D A B C D A B C D new_col
#0 a0 b0 c0 d0 a4 b4 c4 d4 a8 b8 c8 d8 n1
#1 a1 b1 c1 d1 a5 b5 c5 d5 a9 b9 c9 d9 n2
#2 a2 b2 c2 d2 a6 b6 c6 d6 a10 b10 c10 d10 n3
#3 a3 b3 c3 d3 a7 b7 c7 d7 a11 b11 c11 d11 n4
3、dataframe['列名'] = Series對象
col_concat['new_series']=pd.Series(['n1','n2','n3','n4'])
# A B C D A B C D A B C D new_col new_series
#0 a0 b0 c0 d0 a4 b4 c4 d4 a8 b8 c8 d8 n1 n1
#1 a1 b1 c1 d1 a5 b5 c5 d5 a9 b9 c9 d9 n2 n2
#2 a2 b2 c2 d2 a6 b6 c6 d6 a10 b10 c10 d10 n3 n3
#3 a3 b3 c3 d3 a7 b7 c7 d7 a11 b11 c11 d11 n4 n4
3、從數(shù)據(jù)庫加載數(shù)據(jù)
3.1、從數(shù)據(jù)庫中讀取表
第一個(gè)參數(shù)表名,第二個(gè)參數(shù)數(shù)據(jù)庫連接對象
import _sqlite3
con = _sqlite3.connect('data/chinook.db')
tracks = pd.read_sql_query('SELECT * from tracks', con)
tracks.head(1)
3.2、一對一合并
1、通過DataFrame把一列與另一列數(shù)據(jù)連接(不重復(fù))
tracks_subset = tracks.loc[[0,62,76,98,110,193,204,281,322,359]]
# TrackId Name AlbumId MediaTypeId \
#0 1 For Those About To Rock (We Salute You) 1 1
#62 63 Desafinado 8 1
#76 77 Enter Sandman 9 1
#98 99 Your Time Has Come 11 1
#110 111 Money 12 1
#193 194 First Time I Met The Blues 20 1
#204 205 Jorge Da Capadócia 21 1
#281 282 Girassol 26 1
#322 323 Dig-Dig, Lambe-Lambe (Ao Vivo) 29 1
#359 360 Vai-Vai 2001 32 1
2、通過GenreId列合并數(shù)據(jù),how參數(shù)指定連接方式
how = ’left‘ 對應(yīng)SQL中的 left outer 保留左側(cè)表中的所有key
how = ’right‘ 對應(yīng)SQL中的 right outer 保留右側(cè)表中的所有key
how = 'outer' 對應(yīng)SQL中的 full outer 保留左右兩側(cè)側(cè)表中的所有key
how = 'inner' 對應(yīng)SQL中的 inner 只保留左右兩側(cè)都有的key
3、left
genres = pd.read_sql_query('SELECT * from genres', con)
genres_track = genres.merge(tracks_subset[['TrackId', 'GenreId', 'Milliseconds']], on='GenreId', how='left')
4、right
genres_track = genres.merge(tracks_subset[['TrackId', 'GenreId', 'Milliseconds']], on='GenreId', how='right')
3.2、多對一合并
1、將cust中的CustomerId與invoice中CustomerId合并
將cust中的InvoiceId與ii中InvoiceId合并
cust = pd.read_sql_query("SELECT CustomerId,FirstName,LastName from customers",con)
invoice = pd.read_sql_query('SELECT InvoiceId,CustomerId from invoices',con)
ii = pd.read_sql_query('SELECT InvoiceId,UnitPrice,Quantity from invoice_items',con)
cust_inv = cust.merge(invoice, on='CustomerId').merge(ii, on='InvoiceId')
2、DataFrame的assign方法 創(chuàng)建新列
total = cust_inv['Quantity'] * cust_inv['UnitPrice']
cust_inv = cust_inv.assign(Total = total)
# CustomerId FirstName LastName InvoiceId UnitPrice Quantity Total
#0 1 Luís Gon?alves 98 1.99 1 1.99
3.3、join合并
1、使用join合并,可以是依據(jù)兩個(gè)DataFrame的行索引,或者一個(gè)DataFrame的行索引另一個(gè)DataFrame的列索引進(jìn)行數(shù)據(jù)合并
stocks_2016 = pd.read_csv('data/stocks_2016.csv')
stocks_2017 = pd.read_csv('data/stocks_2017.csv')
stocks_2018 = pd.read_csv('data/stocks_2018.csv')
2、依據(jù)兩個(gè)DataFrame的行索引
如果合并的兩個(gè)數(shù)據(jù)有相同的列名,需要通過lsuffix,和rsuffix,指定合并后的列名的后綴
stocks_2016.join(stocks_2017, lsuffix='_2016', rsuffix='_2017', how='outer’)
3、將兩個(gè)DataFrame的Symbol設(shè)置為行索引,再次join數(shù)據(jù)
stocks_2016.set_index('Symbol').join(stocks_2018.set_index('Symbol'),lsuffix='_2016', rsuffix='_2018’)
4、將一個(gè)DataFrame的Symbol列設(shè)置為行索引,與另一個(gè)DataFrame的Symbol列進(jìn)行join
stocks_2016.join(stocks_2018.set_index('Symbol'),lsuffix='_2016', rsuffix='_2018',on='Symbol')