氣象datetime處理合集【Updating】

前言

使用Pandas

import pandas as pd
import numpy as np
import datetime

剛剛開(kāi)始學(xué)python?;趥€(gè)人科研過(guò)程中遇到的問(wèn)題做一下筆記。

將日期設(shè)置為索引:pd.to_datetime

氣象中經(jīng)常要計(jì)算年/季/月平均,若是處理站點(diǎn)數(shù)據(jù)還涉及hourly的數(shù)據(jù),因此將日期作為索引可以方便后續(xù)的處理。

常用使用方法

  1. 由DataFrame的多列組成. 列名可以是 [‘year’, ‘month’, ‘day’, ‘minute’, ‘second’, ‘ms’, ‘us’, ‘ns’]) 或者類(lèi)似的詞。
df = pd.DataFrame({'year': [2015, 2016],
                   'month': [2, 3],
                   'day': [4, 5]})
pd.to_datetime(df)
Out[9]: 
0   2015-02-04
1   2016-03-05
dtype: datetime64[ns]
  1. 由字符串轉(zhuǎn)換而成。可以用format='%Y%m%d'之類(lèi)的來(lái)指定格式。
pd.to_datetime('197901010600')
Out[2]: Timestamp('1979-01-01 06:00:00')
pd.to_datetime('1979-01-01 00') # 日期和時(shí)間之間需要有個(gè)空格
Out[7]: Timestamp('1979-01-01 00:00:00')

需注意,使用to_datetime生成的Timestamp有范圍限制:

In [92]: pd.Timestamp.min
Out[92]: Timestamp('1677-09-21 00:12:43.145225')

In [93]: pd.Timestamp.max
Out[93]: Timestamp('2262-04-11 23:47:16.854775807')

因此如果超出限制,不會(huì)生成Timestamp,只會(huì)顯示datetime.datetime格式。

In [5]: pd.to_datetime('13000101',format='%Y%m%d', errors='ignore')
Out[5]: datetime.datetime(1300, 1, 1, 0, 0)

可以用errors='ignore'將其置為NaT

In [6]: pd.to_datetime('13000101',format='%Y%m%d', errors='coerce')
Out[6]: NaT

實(shí)踐示例

示例一

數(shù)據(jù)為ISD站點(diǎn)數(shù)據(jù),原文件為以逗號(hào)分隔的csv格式。部分?jǐn)?shù)據(jù)用excel預(yù)覽如下

ISD station data

可見(jiàn)原數(shù)據(jù)已經(jīng)有列名,且日期單獨(dú)為一列,可以直接用pd.read_csv讀取

df = pd.read_csv(file,  dtype={'STATION':str})        
df['DATE'] = pd.to_datetime(df['DATE'])
df = df.set_index('DATE')       # set date as index

示例二

如果日期是如下圖更常見(jiàn)的以空格分隔的形式,同樣可以用用pd.to_datetime轉(zhuǎn)換。
各列依次為:station ID, station type, year, month, day, standard report hour, actual report time...

NCEP ADP decoded data

colNames =  ['id','fmtflag', 'year', 'month', 'day', 'hour', 'time', 'lat', 'lon', 'elev', 
                      'ww', 'pw', 'slp', 'stp', 't', 'td'] #present weather, past weather, sea level pressure, station pressure
data = pd.read_table(file,  header=None, names=colNames, 
                     delim_whitespace=True, dtype={'id':str}) # read_csv也行
df['date'] = pd.to_datetime(df.loc[:, ['year', 'month', 'day', 'hour'] ])
df = df.set_index('DATE')

生成日期序列

pd.date_range

months = pd.date_range('1973-01', '2020-01', freq='M')

如果要顯示為‘‘yyyymm’’的格式(在讀取數(shù)據(jù)時(shí)會(huì)用到,例如有些數(shù)據(jù)路徑為/200701/...)

months = pd.date_range('1973-01', '2020-01', freq='M').strftime('%Y%M')

strftime是把時(shí)間轉(zhuǎn)換成string格式

通過(guò)日期索引選取數(shù)據(jù)

df

1. 直接通過(guò)日期字符串選取

選取某一年:

In [27]: df['1973']

Out[27]: 
                         STATION  TMP_VALUE TMP_FLAG  ...  AY_FLAG AZ_VALUE  AZ_FLAG
1973-01-01 00:00:00  01001099999       -3.0        1  ...        1      NaN      NaN
1973-01-01 06:00:00  01001099999       -2.0        1  ...        1      NaN      NaN
1973-01-01 12:00:00  01001099999       -6.0        1  ...        1      NaN      NaN
1973-01-01 18:00:00  01001099999       -6.0        1  ...        1      NaN      NaN
1973-01-02 00:00:00  01001099999       -9.0        1  ...        1      NaN      NaN
                         ...        ...      ...  ...      ...      ...      ...
1973-12-31 09:00:00  01001099999       -5.0        1  ...        1      NaN      NaN
1973-12-31 12:00:00  01001099999       -5.0        1  ...        1      NaN      NaN
1973-12-31 15:00:00  01001099999       -5.0        1  ...        1      NaN      NaN
1973-12-31 18:00:00  01001099999       -3.0        1  ...        1      NaN      NaN
1973-12-31 21:00:00  01001099999       -3.0        1  ...        1      NaN      NaN

[2801 rows x 19 columns]

或者選取某個(gè)區(qū)間內(nèi)的年份:

In [39]: df['1973':'1974']   # 包括1994年
Out[39]: 
                         STATION  TMP_VALUE TMP_FLAG  ...  AY_FLAG AZ_VALUE  AZ_FLAG
1973-01-01 00:00:00  01001099999       -3.0        1  ...        1      NaN      NaN
1973-01-01 06:00:00  01001099999       -2.0        1  ...        1      NaN      NaN
1973-01-01 12:00:00  01001099999       -6.0        1  ...        1      NaN      NaN
1973-01-01 18:00:00  01001099999       -6.0        1  ...        1      NaN      NaN
1973-01-02 00:00:00  01001099999       -9.0        1  ...        1      NaN      NaN
                         ...        ...      ...  ...      ...      ...      ...
1974-12-31 09:00:00  01001099999        0.0        1  ...        1      NaN      NaN
1974-12-31 12:00:00  01001099999        0.0        1  ...        1      NaN      NaN
1974-12-31 15:00:00  01001099999       -4.0        1  ...        1      NaN      NaN
1974-12-31 18:00:00  01001099999       -3.0        1  ...        1      NaN      NaN
1974-12-31 21:00:00  01001099999       -5.0        1  ...        1      NaN      NaN

[5517 rows x 19 columns]

按年份選取應(yīng)該是最常用的。
如果只需要年平均之類(lèi)的話,groupby(updating) ......
較復(fù)雜的逐年的數(shù)據(jù)處理,需要通過(guò)循環(huán)遍歷所有年份,對(duì)每一年數(shù)據(jù)單獨(dú)處理:

df_years = df.index.to_period('A').unique().year #將index按年份顯示,取唯一值,再取年份
Out[29]: 
Int64Index([1973, 1974, 1975, 1976, 1977, 1978, 1979, 1980, 1981, 1982, 1983,
            1984, 1985, 1986, 1987, 1988, 1989, 1990, 1991, 1992, 1993, 1994,
            1995, 1996, 1997, 1998, 1999, 2000, 2001, 2002, 2003, 2004, 2005,
            2006, 2007, 2008, 2009, 2010, 2011, 2012, 2013, 2014, 2015, 2016,
            2017, 2018, 2019],
           dtype='int64')

for yr in df_years:
    df_yr = df[str(yr)]   # 索引得是字符
    balabala……

了解一下選月、日、小時(shí):

 # 選取某月
df['1973-01']    #  df['197301']會(huì)報(bào)錯(cuò)
# 選取某天
df['1973-01-01'] # 我的數(shù)據(jù)包含小時(shí)、分鐘數(shù)據(jù)所以這么寫(xiě)不報(bào)錯(cuò)
df['1973-01-01':'1973-01-01'] # 用區(qū)間來(lái)選取
# 選取某時(shí)刻
df['1973-01-01 06'] # (目前只知道最多選到小時(shí)為止)

.loc也可以,更保險(xiǎn)

df.loc['1973':'1974']   # 得到的結(jié)果和上面的df['1973':'1974']是一樣的

2. 通過(guò)條件篩選index

我一般是通過(guò)index來(lái)進(jìn)行篩選。datetime格式的index可以提取year, month, day, hour的信息:

In [30]: df.index.month

Out[30]: 
Int64Index([1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
            ...
            6, 6, 6, 6, 6, 6, 6, 6, 6, 6],
           dtype='int64', length=98404)

然后用boolean篩選就行了:

In [31]: df[df.index.month==1]
Out[31]: 
                         STATION  TMP_VALUE TMP_FLAG  ...  AY_FLAG AZ_VALUE  AZ_FLAG
1973-01-01 00:00:00  01001099999       -3.0        1  ...        1      NaN      NaN
1973-01-01 06:00:00  01001099999       -2.0        1  ...        1      NaN      NaN
1973-01-01 12:00:00  01001099999       -6.0        1  ...        1      NaN      NaN
1973-01-01 18:00:00  01001099999       -6.0        1  ...        1      NaN      NaN
1973-01-02 00:00:00  01001099999       -9.0        1  ...        1      NaN      NaN
                         ...        ...      ...  ...      ...      ...      ...
2020-01-30 18:00:00  01001099999        1.2        1  ...        1      NaN      NaN
2020-01-31 00:00:00  01001099999        1.2        1  ...        1      NaN      NaN
2020-01-31 09:00:00  01001099999        0.6        1  ...        1      NaN      NaN
2020-01-31 15:00:00  01001099999       -1.0        1  ...        1      NaN      NaN
2020-01-31 18:00:00  01001099999       -1.4        1  ...        1      NaN      NaN

[8786 rows x 19 columns]

3. truncate

可以去掉某日期之前或之后的數(shù)據(jù)。

dates = pd.date_range('2016-01-01', '2016-02-01', freq='s')
df = pd.DataFrame(index=dates, data={'A': 1})
 # 去掉1-5之前和1-10之后的行
df.truncate(before=pd.Timestamp('2016-01-05'),
            after=pd.Timestamp('2016-01-10')) 

# 也可以簡(jiǎn)單地寫(xiě)成:
df.truncate('2016-01-05', '2016-01-10')

# truncate方法默認(rèn)只匹配到0時(shí),最后一個(gè)值是2016-01-10 00:00:00  
# 對(duì)比df.loc['2016-01-05':'2016-01-10', :], 最后一個(gè)值是2016-01-10 23:59:59  

改變?nèi)掌陲@示方式

1. to_period (updating

df.index.to_period('A').unique().year
# process day by day
dates = df_ww.index.to_period('d').strftime('%Y-%m-%d').unique()

2. asfreq (updating)

說(shuō)實(shí)在的我還沒(méi)搞清楚這個(gè)函數(shù)怎么用QAQ

3. astype('datetime64[M]')

轉(zhuǎn)換成np.datetime64格式,返回的是array

df.index.values
Out[43]: 
array(['1978-01-01T00:00:00.000000000', '1978-01-01T06:00:00.000000000',
       '1978-01-01T12:00:00.000000000', ...,
       '2007-02-28T17:00:00.000000000', '2007-02-28T19:00:00.000000000',
       '2007-02-28T20:00:00.000000000'], dtype='datetime64[ns]')

df.index.values.astype('datetime64[M]')
Out[44]: 
array(['1978-01', '1978-01', '1978-01', ..., '2007-02', '2007-02',
       '2007-02'], dtype='datetime64[M]')

按年份/月份批量處理(計(jì)數(shù)、求和、平均): resample

df.resample('M')返回的是一個(gè)<pandas.core.resample.DatetimeIndexResampler object>, 后面要再加上具體的method才可以顯示結(jié)果,例如.count(), .sum(), .mean()...

# monthly number of ww observations
In [36]: num = df['WW_VALUE'].resample('M').count()

Out[36]: 
1973-01-31    217
1973-02-28    210
1973-03-31    242
1973-04-30    237
1973-05-31    243

2020-02-29    159
2020-03-31    180
2020-04-30    179
2020-05-31    186
2020-06-30     89
Freq: M, Name: WW_VALUE, Length: 570, dtype: int64

返回的index自動(dòng)設(shè)置成了每月的最后一天。如果想只保留“年-月”格式的話,可以用to_period('m')

In [37]: num = df['WW_VALUE'].resample('M').count().to_period('m')   # "m"大小寫(xiě)沒(méi)有影響。但“Y”和“y”格式有差別。
Out[37]: 
1973-01    217
1973-02    210
1973-03    242
1973-04    237
1973-05    243

2020-02    159
2020-03    180
2020-04    179
2020-05    186
2020-06     89
Freq: M, Name: WW_VALUE, Length: 570, dtype: int64

選取某個(gè)時(shí)間區(qū)間內(nèi)的數(shù)據(jù)

Pandas.DataFrame.between_time

一個(gè)簡(jiǎn)單粗暴的選取時(shí)間段內(nèi)數(shù)據(jù)的方法:

In [38]: df.between_time('5:00', '7:00')
Out[38]: 
                         STATION  TMP_VALUE TMP_FLAG  ...  AY_FLAG AZ_VALUE  AZ_FLAG
1973-01-01 06:00:00  01001099999       -2.0        1  ...        1      NaN      NaN
1973-01-02 06:00:00  01001099999       -9.0        1  ...        1      NaN      NaN
1973-01-03 06:00:00  01001099999       -5.0        1  ...        1      NaN      NaN
1973-01-05 06:00:00  01001099999      -10.0        1  ...        1      NaN      NaN
1973-01-06 06:00:00  01001099999      -11.0        1  ...        1      NaN      NaN

日期/時(shí)間增減Timedelta

選時(shí)間段,也可以通過(guò)時(shí)間的增減來(lái)完成。pandas的好處是會(huì)自動(dòng)計(jì)算加減時(shí)間之后的日期。

In [40]: pd.Timedelta('6h')
Out[40]: Timedelta('0 days 06:00:00')

In [41]: pd.Timedelta(6,unit='h')
Out[41]: Timedelta('0 days 06:00:00')

In[42]: pd.Timedelta('2 days 2 hours 15 minutes 30 seconds')
Out[42]: Timedelta('2 days 02:15:30')

Timedelta中的unit:

  • ‘Y’, ‘M’, ‘W’, ‘D’, 'H', ‘T’, ‘S’, ‘L’, ‘U’, or ‘N’ (我補(bǔ)充了一個(gè)'H'。對(duì)應(yīng)年、月、周、日、時(shí)、分、秒、毫秒、微秒、納秒)
  • ‘days’ or ‘day’
  • ‘hours’, ‘hour’, ‘hr’, or ‘h’
  • ‘minutes’, ‘minute’, ‘min’, or ‘m’
  • ‘seconds’, ‘second’, or ‘sec’
  • ...【其余參見(jiàn)官方文檔】

0點(diǎn)前后一小時(shí)數(shù)據(jù)的選取方法:

repo00 = pd.to_datetime(date+' 00')
df_date00 = df[
                  (df.index >= repo00 - pd.Timedelta('1h'))
                & (df.index <= repo00 + pd.Timedelta('1h'))
                  ]

日期相加減

這個(gè)項(xiàng)目中,經(jīng)常遇到時(shí)間不是整點(diǎn)的情況(例如5:45, 6:15),每天的數(shù)據(jù)量也有所不同。為了將數(shù)據(jù)限定在一日四次,我需要選出00, 06, 12, 18點(diǎn)前后1小時(shí)內(nèi)的數(shù)據(jù),并取其中與其最接近的觀測(cè)值。這一點(diǎn)涉及時(shí)間的差,不能通過(guò)resample來(lái)實(shí)現(xiàn)。

# the closest observation within 1h of 06Z, 12Z, 18Z
df_date = df_ww[date]
obs06 = df_date.between_time('5:00', '7:00').index     # 06點(diǎn)的數(shù)據(jù)的index
delta06 = abs((obs06 - pd.to_datetime(date+' 06')).values)
if delta06.size > 0:
    df_resampled = pd.concat([df_resampled,
                              df_date.loc[[obs06[np.argmin(delta06)]]] 
                             ])

這里注意,兩個(gè)datetime相減之后得到的結(jié)果以ns為單位。如果需要變換單位,可以用.dt.seconds或者.values取出數(shù)值,再進(jìn)行變換。參考http://blog.gqylpy.com/gqy/22545/#pandas_74

df['diff_time'] = (df['tm_1'] - df['tm_2']).dt.seconds/60        # to minutes
df['diff_time'] = (df['tm_1'] - df['tm_2']).values/np.timedelta64(1, 'h')       # to hours

與numpy datetime64格式的比較

1.

起因是要做兩個(gè)dataset的collocation,需要合并兩個(gè)dataset相同時(shí)間下的氣象變量。
其中一個(gè)tablet data讀取為pd.DataFrame,由pd.to_datetime轉(zhuǎn)換日期格式并設(shè)置為index:

In [31] df.index
Out[31]: 
DatetimeIndex(['1979-01-01 00:00:00', '1979-01-01 06:00:00',
               '1979-01-01 12:00:00', '1979-01-01 18:00:00',
               '1979-01-02 00:00:00', '1979-01-02 06:00:00',
               '1979-01-02 12:00:00', '1979-01-02 18:00:00',
               '1979-01-03 00:00:00', '1979-01-03 06:00:00',
               ...
               '1979-01-29 12:00:00', '1979-01-29 18:00:00',
               '1979-01-30 00:00:00', '1979-01-30 06:00:00',
               '1979-01-30 12:00:00', '1979-01-30 18:00:00',
               '1979-01-31 00:00:00', '1979-01-31 06:00:00',
               '1979-01-31 12:00:00', '1979-01-31 18:00:00'],
              dtype='datetime64[ns]', name='date', length=124, freq=None)

另一個(gè)是nc數(shù)據(jù),用xarray讀取變量,其time坐標(biāo)默認(rèn)為np.datetime64格式

In [33]: tk.time
Out[33]: 
<xarray.DataArray 'time' (time: 744)>
array(['1979-01-01T00:00:00.000000000', '1979-01-01T01:00:00.000000000',
       '1979-01-01T02:00:00.000000000', ..., '1979-01-31T21:00:00.000000000',
       '1979-01-31T22:00:00.000000000', '1979-01-31T23:00:00.000000000'],
      dtype='datetime64[ns]')
Coordinates:
  * time     (time) datetime64[ns] 1979-01-01 ... 1979-01-31T23:00:00
Attributes:
    long_name:  time

【我居然忘了當(dāng)時(shí)糾結(jié)了很久的bug是什么】
如果寫(xiě)np.argwhere(df.index[0] == tk.time.values)得到的是空集
因此都用values提取數(shù)值進(jìn)行對(duì)比即可

In [77]: np.argwhere(df.index.values[0] == tk.time.values)
Out[77]: array([[0]])

# 用int()提取數(shù)值:
In [78]: int(np.argwhere(df.index.values[0] == tk.time.values))
Out[78]: 0

2. Timestamp, datetime.datetime, np.datetime64之間的轉(zhuǎn)換

本人暫時(shí)用不到。先參考這個(gè)鏈接https://stackoverflow.com/questions/13703720/converting-between-datetime-timestamp-and-datetime64

pd.to_datetime再轉(zhuǎn)換成np.datetime64的格式, 只需要加上np.array:

date_n = np.array(pd.to_datetime(df[['year', 'month', 'day', 'hour']]) )
Out[83]: 
array(['1979-01-01T00:00:00.000000000', '1979-01-01T06:00:00.000000000',
       '1979-01-01T12:00:00.000000000', '1979-01-01T18:00:00.000000000',
        ...
       '1979-01-31T00:00:00.000000000', '1979-01-31T06:00:00.000000000',
       '1979-01-31T12:00:00.000000000', '1979-01-31T18:00:00.000000000'],
      dtype='datetime64[ns]')

(1) python datetime => datetime64 / Timestamp

dt = datetime.datetime(year=2017, month=10, day=24, hour=4, 
                   minute=3, second=10, microsecond=7199)
>>> np.datetime64(dt)
numpy.datetime64('2017-10-24T04:03:10.007199')

>>> pd.Timestamp(dt) # or pd.to_datetime(dt)
Timestamp('2017-10-24 04:03:10.007199')

(2) numpy datetime64 => Timestamp

In [86]: pd.Timestamp(np.datetime64('2012-05-01T01:00:00.000000'))
Out[86]: Timestamp('2012-05-01 01:00:00')

In [87]: pd.to_datetime('2012-05-01T01:00:00.000000+0100')
Out[87]: Timestamp('2012-05-01 01:00:00+0100', tz='pytz.FixedOffset(60)')

In [88]: pd.to_datetime('2012-05-01T01:00:00.000000+0100').replace(tzinfo=None)
Out[88]: Timestamp('2012-05-01 01:00:00')

np.datetime64 => datetime似乎有點(diǎn)復(fù)雜。以后需要的話再查一查。

(3) Timestamp => datetime / datetime64

>>> ts = pd.Timestamp('2017-10-24 04:24:33.654321')

>>> ts.to_pydatetime()   # Python's datetime
datetime.datetime(2017, 10, 24, 4, 24, 33, 654321)

>>> ts.to_datetime64()
numpy.datetime64('2017-10-24T04:24:33.654321000')

模擬matlab的tic, toc計(jì)時(shí)功能

https://blog.csdn.net/u010199776/article/details/69941965

import datetime

tic = datetime.datetime.now()
...
toc = datetime.datetime.now()
print('%s, Elapsed time: %f seconds' % (toc, (toc-tic).total_seconds() ))

2020-07-19 16:23:13.321484, Elapsed time: 5.371069 seconds
最后編輯于
?著作權(quán)歸作者所有,轉(zhuǎn)載或內(nèi)容合作請(qǐng)聯(lián)系作者
【社區(qū)內(nèi)容提示】社區(qū)部分內(nèi)容疑似由AI輔助生成,瀏覽時(shí)請(qǐng)結(jié)合常識(shí)與多方信息審慎甄別。
平臺(tái)聲明:文章內(nèi)容(如有圖片或視頻亦包括在內(nèi))由作者上傳并發(fā)布,文章內(nèi)容僅代表作者本人觀點(diǎn),簡(jiǎn)書(shū)系信息發(fā)布平臺(tái),僅提供信息存儲(chǔ)服務(wù)。

友情鏈接更多精彩內(nèi)容