前言
使用Pandas
import pandas as pd
import numpy as np
import datetime
剛剛開(kāi)始學(xué)python?;趥€(gè)人科研過(guò)程中遇到的問(wèn)題做一下筆記。
將日期設(shè)置為索引:pd.to_datetime
氣象中經(jīng)常要計(jì)算年/季/月平均,若是處理站點(diǎn)數(shù)據(jù)還涉及hourly的數(shù)據(jù),因此將日期作為索引可以方便后續(xù)的處理。
常用使用方法
- 由DataFrame的多列組成. 列名可以是 [‘year’, ‘month’, ‘day’, ‘minute’, ‘second’, ‘ms’, ‘us’, ‘ns’]) 或者類(lèi)似的詞。
df = pd.DataFrame({'year': [2015, 2016],
'month': [2, 3],
'day': [4, 5]})
pd.to_datetime(df)
Out[9]:
0 2015-02-04
1 2016-03-05
dtype: datetime64[ns]
- 由字符串轉(zhuǎn)換而成。可以用
format='%Y%m%d'之類(lèi)的來(lái)指定格式。
pd.to_datetime('197901010600')
Out[2]: Timestamp('1979-01-01 06:00:00')
pd.to_datetime('1979-01-01 00') # 日期和時(shí)間之間需要有個(gè)空格
Out[7]: Timestamp('1979-01-01 00:00:00')
需注意,使用to_datetime生成的Timestamp有范圍限制:
In [92]: pd.Timestamp.min
Out[92]: Timestamp('1677-09-21 00:12:43.145225')
In [93]: pd.Timestamp.max
Out[93]: Timestamp('2262-04-11 23:47:16.854775807')
因此如果超出限制,不會(huì)生成Timestamp,只會(huì)顯示datetime.datetime格式。
In [5]: pd.to_datetime('13000101',format='%Y%m%d', errors='ignore')
Out[5]: datetime.datetime(1300, 1, 1, 0, 0)
可以用errors='ignore'將其置為NaT
In [6]: pd.to_datetime('13000101',format='%Y%m%d', errors='coerce')
Out[6]: NaT
實(shí)踐示例
示例一
數(shù)據(jù)為ISD站點(diǎn)數(shù)據(jù),原文件為以逗號(hào)分隔的csv格式。部分?jǐn)?shù)據(jù)用excel預(yù)覽如下

可見(jiàn)原數(shù)據(jù)已經(jīng)有列名,且日期單獨(dú)為一列,可以直接用
pd.read_csv讀取
df = pd.read_csv(file, dtype={'STATION':str})
df['DATE'] = pd.to_datetime(df['DATE'])
df = df.set_index('DATE') # set date as index
示例二
如果日期是如下圖更常見(jiàn)的以空格分隔的形式,同樣可以用用pd.to_datetime轉(zhuǎn)換。
各列依次為:station ID, station type, year, month, day, standard report hour, actual report time...

colNames = ['id','fmtflag', 'year', 'month', 'day', 'hour', 'time', 'lat', 'lon', 'elev',
'ww', 'pw', 'slp', 'stp', 't', 'td'] #present weather, past weather, sea level pressure, station pressure
data = pd.read_table(file, header=None, names=colNames,
delim_whitespace=True, dtype={'id':str}) # read_csv也行
df['date'] = pd.to_datetime(df.loc[:, ['year', 'month', 'day', 'hour'] ])
df = df.set_index('DATE')
生成日期序列
pd.date_range
months = pd.date_range('1973-01', '2020-01', freq='M')
如果要顯示為‘‘yyyymm’’的格式(在讀取數(shù)據(jù)時(shí)會(huì)用到,例如有些數(shù)據(jù)路徑為/200701/...)
months = pd.date_range('1973-01', '2020-01', freq='M').strftime('%Y%M')
strftime是把時(shí)間轉(zhuǎn)換成string格式
通過(guò)日期索引選取數(shù)據(jù)

1. 直接通過(guò)日期字符串選取
選取某一年:
In [27]: df['1973']
Out[27]:
STATION TMP_VALUE TMP_FLAG ... AY_FLAG AZ_VALUE AZ_FLAG
1973-01-01 00:00:00 01001099999 -3.0 1 ... 1 NaN NaN
1973-01-01 06:00:00 01001099999 -2.0 1 ... 1 NaN NaN
1973-01-01 12:00:00 01001099999 -6.0 1 ... 1 NaN NaN
1973-01-01 18:00:00 01001099999 -6.0 1 ... 1 NaN NaN
1973-01-02 00:00:00 01001099999 -9.0 1 ... 1 NaN NaN
... ... ... ... ... ... ...
1973-12-31 09:00:00 01001099999 -5.0 1 ... 1 NaN NaN
1973-12-31 12:00:00 01001099999 -5.0 1 ... 1 NaN NaN
1973-12-31 15:00:00 01001099999 -5.0 1 ... 1 NaN NaN
1973-12-31 18:00:00 01001099999 -3.0 1 ... 1 NaN NaN
1973-12-31 21:00:00 01001099999 -3.0 1 ... 1 NaN NaN
[2801 rows x 19 columns]
或者選取某個(gè)區(qū)間內(nèi)的年份:
In [39]: df['1973':'1974'] # 包括1994年
Out[39]:
STATION TMP_VALUE TMP_FLAG ... AY_FLAG AZ_VALUE AZ_FLAG
1973-01-01 00:00:00 01001099999 -3.0 1 ... 1 NaN NaN
1973-01-01 06:00:00 01001099999 -2.0 1 ... 1 NaN NaN
1973-01-01 12:00:00 01001099999 -6.0 1 ... 1 NaN NaN
1973-01-01 18:00:00 01001099999 -6.0 1 ... 1 NaN NaN
1973-01-02 00:00:00 01001099999 -9.0 1 ... 1 NaN NaN
... ... ... ... ... ... ...
1974-12-31 09:00:00 01001099999 0.0 1 ... 1 NaN NaN
1974-12-31 12:00:00 01001099999 0.0 1 ... 1 NaN NaN
1974-12-31 15:00:00 01001099999 -4.0 1 ... 1 NaN NaN
1974-12-31 18:00:00 01001099999 -3.0 1 ... 1 NaN NaN
1974-12-31 21:00:00 01001099999 -5.0 1 ... 1 NaN NaN
[5517 rows x 19 columns]
按年份選取應(yīng)該是最常用的。
如果只需要年平均之類(lèi)的話,groupby(updating) ......
較復(fù)雜的逐年的數(shù)據(jù)處理,需要通過(guò)循環(huán)遍歷所有年份,對(duì)每一年數(shù)據(jù)單獨(dú)處理:
df_years = df.index.to_period('A').unique().year #將index按年份顯示,取唯一值,再取年份
Out[29]:
Int64Index([1973, 1974, 1975, 1976, 1977, 1978, 1979, 1980, 1981, 1982, 1983,
1984, 1985, 1986, 1987, 1988, 1989, 1990, 1991, 1992, 1993, 1994,
1995, 1996, 1997, 1998, 1999, 2000, 2001, 2002, 2003, 2004, 2005,
2006, 2007, 2008, 2009, 2010, 2011, 2012, 2013, 2014, 2015, 2016,
2017, 2018, 2019],
dtype='int64')
for yr in df_years:
df_yr = df[str(yr)] # 索引得是字符
balabala……
了解一下選月、日、小時(shí):
# 選取某月
df['1973-01'] # df['197301']會(huì)報(bào)錯(cuò)
# 選取某天
df['1973-01-01'] # 我的數(shù)據(jù)包含小時(shí)、分鐘數(shù)據(jù)所以這么寫(xiě)不報(bào)錯(cuò)
df['1973-01-01':'1973-01-01'] # 用區(qū)間來(lái)選取
# 選取某時(shí)刻
df['1973-01-01 06'] # (目前只知道最多選到小時(shí)為止)
用.loc也可以,更保險(xiǎn)
df.loc['1973':'1974'] # 得到的結(jié)果和上面的df['1973':'1974']是一樣的
2. 通過(guò)條件篩選index
我一般是通過(guò)index來(lái)進(jìn)行篩選。datetime格式的index可以提取year, month, day, hour的信息:
In [30]: df.index.month
Out[30]:
Int64Index([1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
...
6, 6, 6, 6, 6, 6, 6, 6, 6, 6],
dtype='int64', length=98404)
然后用boolean篩選就行了:
In [31]: df[df.index.month==1]
Out[31]:
STATION TMP_VALUE TMP_FLAG ... AY_FLAG AZ_VALUE AZ_FLAG
1973-01-01 00:00:00 01001099999 -3.0 1 ... 1 NaN NaN
1973-01-01 06:00:00 01001099999 -2.0 1 ... 1 NaN NaN
1973-01-01 12:00:00 01001099999 -6.0 1 ... 1 NaN NaN
1973-01-01 18:00:00 01001099999 -6.0 1 ... 1 NaN NaN
1973-01-02 00:00:00 01001099999 -9.0 1 ... 1 NaN NaN
... ... ... ... ... ... ...
2020-01-30 18:00:00 01001099999 1.2 1 ... 1 NaN NaN
2020-01-31 00:00:00 01001099999 1.2 1 ... 1 NaN NaN
2020-01-31 09:00:00 01001099999 0.6 1 ... 1 NaN NaN
2020-01-31 15:00:00 01001099999 -1.0 1 ... 1 NaN NaN
2020-01-31 18:00:00 01001099999 -1.4 1 ... 1 NaN NaN
[8786 rows x 19 columns]
3. truncate
可以去掉某日期之前或之后的數(shù)據(jù)。
dates = pd.date_range('2016-01-01', '2016-02-01', freq='s')
df = pd.DataFrame(index=dates, data={'A': 1})
# 去掉1-5之前和1-10之后的行
df.truncate(before=pd.Timestamp('2016-01-05'),
after=pd.Timestamp('2016-01-10'))
# 也可以簡(jiǎn)單地寫(xiě)成:
df.truncate('2016-01-05', '2016-01-10')
# truncate方法默認(rèn)只匹配到0時(shí),最后一個(gè)值是2016-01-10 00:00:00
# 對(duì)比df.loc['2016-01-05':'2016-01-10', :], 最后一個(gè)值是2016-01-10 23:59:59
改變?nèi)掌陲@示方式
1. to_period (updating
df.index.to_period('A').unique().year
# process day by day
dates = df_ww.index.to_period('d').strftime('%Y-%m-%d').unique()
2. asfreq (updating)
說(shuō)實(shí)在的我還沒(méi)搞清楚這個(gè)函數(shù)怎么用QAQ
3. astype('datetime64[M]')
轉(zhuǎn)換成np.datetime64格式,返回的是array
df.index.values
Out[43]:
array(['1978-01-01T00:00:00.000000000', '1978-01-01T06:00:00.000000000',
'1978-01-01T12:00:00.000000000', ...,
'2007-02-28T17:00:00.000000000', '2007-02-28T19:00:00.000000000',
'2007-02-28T20:00:00.000000000'], dtype='datetime64[ns]')
df.index.values.astype('datetime64[M]')
Out[44]:
array(['1978-01', '1978-01', '1978-01', ..., '2007-02', '2007-02',
'2007-02'], dtype='datetime64[M]')
按年份/月份批量處理(計(jì)數(shù)、求和、平均): resample
df.resample('M')返回的是一個(gè)<pandas.core.resample.DatetimeIndexResampler object>, 后面要再加上具體的method才可以顯示結(jié)果,例如.count(), .sum(), .mean()...
# monthly number of ww observations
In [36]: num = df['WW_VALUE'].resample('M').count()
Out[36]:
1973-01-31 217
1973-02-28 210
1973-03-31 242
1973-04-30 237
1973-05-31 243
2020-02-29 159
2020-03-31 180
2020-04-30 179
2020-05-31 186
2020-06-30 89
Freq: M, Name: WW_VALUE, Length: 570, dtype: int64
返回的index自動(dòng)設(shè)置成了每月的最后一天。如果想只保留“年-月”格式的話,可以用to_period('m')
In [37]: num = df['WW_VALUE'].resample('M').count().to_period('m') # "m"大小寫(xiě)沒(méi)有影響。但“Y”和“y”格式有差別。
Out[37]:
1973-01 217
1973-02 210
1973-03 242
1973-04 237
1973-05 243
2020-02 159
2020-03 180
2020-04 179
2020-05 186
2020-06 89
Freq: M, Name: WW_VALUE, Length: 570, dtype: int64
選取某個(gè)時(shí)間區(qū)間內(nèi)的數(shù)據(jù)
Pandas.DataFrame.between_time
一個(gè)簡(jiǎn)單粗暴的選取時(shí)間段內(nèi)數(shù)據(jù)的方法:
In [38]: df.between_time('5:00', '7:00')
Out[38]:
STATION TMP_VALUE TMP_FLAG ... AY_FLAG AZ_VALUE AZ_FLAG
1973-01-01 06:00:00 01001099999 -2.0 1 ... 1 NaN NaN
1973-01-02 06:00:00 01001099999 -9.0 1 ... 1 NaN NaN
1973-01-03 06:00:00 01001099999 -5.0 1 ... 1 NaN NaN
1973-01-05 06:00:00 01001099999 -10.0 1 ... 1 NaN NaN
1973-01-06 06:00:00 01001099999 -11.0 1 ... 1 NaN NaN
日期/時(shí)間增減Timedelta
選時(shí)間段,也可以通過(guò)時(shí)間的增減來(lái)完成。pandas的好處是會(huì)自動(dòng)計(jì)算加減時(shí)間之后的日期。
In [40]: pd.Timedelta('6h')
Out[40]: Timedelta('0 days 06:00:00')
In [41]: pd.Timedelta(6,unit='h')
Out[41]: Timedelta('0 days 06:00:00')
In[42]: pd.Timedelta('2 days 2 hours 15 minutes 30 seconds')
Out[42]: Timedelta('2 days 02:15:30')
Timedelta中的unit:
- ‘Y’, ‘M’, ‘W’, ‘D’, 'H', ‘T’, ‘S’, ‘L’, ‘U’, or ‘N’ (我補(bǔ)充了一個(gè)'H'。對(duì)應(yīng)年、月、周、日、時(shí)、分、秒、毫秒、微秒、納秒)
- ‘days’ or ‘day’
- ‘hours’, ‘hour’, ‘hr’, or ‘h’
- ‘minutes’, ‘minute’, ‘min’, or ‘m’
- ‘seconds’, ‘second’, or ‘sec’
- ...【其余參見(jiàn)官方文檔】
0點(diǎn)前后一小時(shí)數(shù)據(jù)的選取方法:
repo00 = pd.to_datetime(date+' 00')
df_date00 = df[
(df.index >= repo00 - pd.Timedelta('1h'))
& (df.index <= repo00 + pd.Timedelta('1h'))
]
日期相加減
這個(gè)項(xiàng)目中,經(jīng)常遇到時(shí)間不是整點(diǎn)的情況(例如5:45, 6:15),每天的數(shù)據(jù)量也有所不同。為了將數(shù)據(jù)限定在一日四次,我需要選出00, 06, 12, 18點(diǎn)前后1小時(shí)內(nèi)的數(shù)據(jù),并取其中與其最接近的觀測(cè)值。這一點(diǎn)涉及時(shí)間的差,不能通過(guò)resample來(lái)實(shí)現(xiàn)。
# the closest observation within 1h of 06Z, 12Z, 18Z
df_date = df_ww[date]
obs06 = df_date.between_time('5:00', '7:00').index # 06點(diǎn)的數(shù)據(jù)的index
delta06 = abs((obs06 - pd.to_datetime(date+' 06')).values)
if delta06.size > 0:
df_resampled = pd.concat([df_resampled,
df_date.loc[[obs06[np.argmin(delta06)]]]
])
這里注意,兩個(gè)datetime相減之后得到的結(jié)果以ns為單位。如果需要變換單位,可以用.dt.seconds或者.values取出數(shù)值,再進(jìn)行變換。參考http://blog.gqylpy.com/gqy/22545/#pandas_74
df['diff_time'] = (df['tm_1'] - df['tm_2']).dt.seconds/60 # to minutes
df['diff_time'] = (df['tm_1'] - df['tm_2']).values/np.timedelta64(1, 'h') # to hours
與numpy datetime64格式的比較
1.
起因是要做兩個(gè)dataset的collocation,需要合并兩個(gè)dataset相同時(shí)間下的氣象變量。
其中一個(gè)tablet data讀取為pd.DataFrame,由pd.to_datetime轉(zhuǎn)換日期格式并設(shè)置為index:
In [31] df.index
Out[31]:
DatetimeIndex(['1979-01-01 00:00:00', '1979-01-01 06:00:00',
'1979-01-01 12:00:00', '1979-01-01 18:00:00',
'1979-01-02 00:00:00', '1979-01-02 06:00:00',
'1979-01-02 12:00:00', '1979-01-02 18:00:00',
'1979-01-03 00:00:00', '1979-01-03 06:00:00',
...
'1979-01-29 12:00:00', '1979-01-29 18:00:00',
'1979-01-30 00:00:00', '1979-01-30 06:00:00',
'1979-01-30 12:00:00', '1979-01-30 18:00:00',
'1979-01-31 00:00:00', '1979-01-31 06:00:00',
'1979-01-31 12:00:00', '1979-01-31 18:00:00'],
dtype='datetime64[ns]', name='date', length=124, freq=None)
另一個(gè)是nc數(shù)據(jù),用xarray讀取變量,其time坐標(biāo)默認(rèn)為np.datetime64格式
In [33]: tk.time
Out[33]:
<xarray.DataArray 'time' (time: 744)>
array(['1979-01-01T00:00:00.000000000', '1979-01-01T01:00:00.000000000',
'1979-01-01T02:00:00.000000000', ..., '1979-01-31T21:00:00.000000000',
'1979-01-31T22:00:00.000000000', '1979-01-31T23:00:00.000000000'],
dtype='datetime64[ns]')
Coordinates:
* time (time) datetime64[ns] 1979-01-01 ... 1979-01-31T23:00:00
Attributes:
long_name: time
【我居然忘了當(dāng)時(shí)糾結(jié)了很久的bug是什么】
如果寫(xiě)np.argwhere(df.index[0] == tk.time.values)得到的是空集
因此都用values提取數(shù)值進(jìn)行對(duì)比即可
In [77]: np.argwhere(df.index.values[0] == tk.time.values)
Out[77]: array([[0]])
# 用int()提取數(shù)值:
In [78]: int(np.argwhere(df.index.values[0] == tk.time.values))
Out[78]: 0
2. Timestamp, datetime.datetime, np.datetime64之間的轉(zhuǎn)換
本人暫時(shí)用不到。先參考這個(gè)鏈接https://stackoverflow.com/questions/13703720/converting-between-datetime-timestamp-and-datetime64
pd.to_datetime再轉(zhuǎn)換成np.datetime64的格式, 只需要加上np.array:
date_n = np.array(pd.to_datetime(df[['year', 'month', 'day', 'hour']]) )
Out[83]:
array(['1979-01-01T00:00:00.000000000', '1979-01-01T06:00:00.000000000',
'1979-01-01T12:00:00.000000000', '1979-01-01T18:00:00.000000000',
...
'1979-01-31T00:00:00.000000000', '1979-01-31T06:00:00.000000000',
'1979-01-31T12:00:00.000000000', '1979-01-31T18:00:00.000000000'],
dtype='datetime64[ns]')
(1) python datetime => datetime64 / Timestamp
dt = datetime.datetime(year=2017, month=10, day=24, hour=4,
minute=3, second=10, microsecond=7199)
>>> np.datetime64(dt)
numpy.datetime64('2017-10-24T04:03:10.007199')
>>> pd.Timestamp(dt) # or pd.to_datetime(dt)
Timestamp('2017-10-24 04:03:10.007199')
(2) numpy datetime64 => Timestamp
In [86]: pd.Timestamp(np.datetime64('2012-05-01T01:00:00.000000'))
Out[86]: Timestamp('2012-05-01 01:00:00')
In [87]: pd.to_datetime('2012-05-01T01:00:00.000000+0100')
Out[87]: Timestamp('2012-05-01 01:00:00+0100', tz='pytz.FixedOffset(60)')
In [88]: pd.to_datetime('2012-05-01T01:00:00.000000+0100').replace(tzinfo=None)
Out[88]: Timestamp('2012-05-01 01:00:00')
np.datetime64 => datetime似乎有點(diǎn)復(fù)雜。以后需要的話再查一查。
(3) Timestamp => datetime / datetime64
>>> ts = pd.Timestamp('2017-10-24 04:24:33.654321')
>>> ts.to_pydatetime() # Python's datetime
datetime.datetime(2017, 10, 24, 4, 24, 33, 654321)
>>> ts.to_datetime64()
numpy.datetime64('2017-10-24T04:24:33.654321000')
模擬matlab的tic, toc計(jì)時(shí)功能
https://blog.csdn.net/u010199776/article/details/69941965
import datetime
tic = datetime.datetime.now()
...
toc = datetime.datetime.now()
print('%s, Elapsed time: %f seconds' % (toc, (toc-tic).total_seconds() ))
2020-07-19 16:23:13.321484, Elapsed time: 5.371069 seconds