前言

使用Pandas

import pandas as pd
import numpy as np
import datetime

剛剛開(kāi)始學(xué)python?；趥€(gè)人科研過(guò)程中遇到的問(wèn)題做一下筆記。

將日期設(shè)置為索引：pd.to_datetime

氣象中經(jīng)常要計(jì)算年/季/月平均，若是處理站點(diǎn)數(shù)據(jù)還涉及hourly的數(shù)據(jù)，因此將日期作為索引可以方便后續(xù)的處理。

常用使用方法

由DataFrame的多列組成. 列名可以是 [‘year’, ‘month’, ‘day’, ‘minute’, ‘second’, ‘ms’, ‘us’, ‘ns’]) 或者類(lèi)似的詞。

df = pd.DataFrame({'year': [2015, 2016],
                   'month': [2, 3],
                   'day': [4, 5]})
pd.to_datetime(df)
Out[9]: 
0   2015-02-04
1   2016-03-05
dtype: datetime64[ns]

由字符串轉(zhuǎn)換而成。可以用format='%Y%m%d'之類(lèi)的來(lái)指定格式。

pd.to_datetime('197901010600')
Out[2]: Timestamp('1979-01-01 06:00:00')

pd.to_datetime('1979-01-01 00') # 日期和時(shí)間之間需要有個(gè)空格
Out[7]: Timestamp('1979-01-01 00:00:00')

需注意，使用to_datetime生成的Timestamp有范圍限制：

In [92]: pd.Timestamp.min
Out[92]: Timestamp('1677-09-21 00:12:43.145225')

In [93]: pd.Timestamp.max
Out[93]: Timestamp('2262-04-11 23:47:16.854775807')

因此如果超出限制，不會(huì)生成Timestamp，只會(huì)顯示datetime.datetime格式。

In [5]: pd.to_datetime('13000101',format='%Y%m%d', errors='ignore')
Out[5]: datetime.datetime(1300, 1, 1, 0, 0)

可以用errors='ignore'將其置為NaT

In [6]: pd.to_datetime('13000101',format='%Y%m%d', errors='coerce')
Out[6]: NaT

實(shí)踐示例

示例一

數(shù)據(jù)為ISD站點(diǎn)數(shù)據(jù)，原文件為以逗號(hào)分隔的csv格式。部分?jǐn)?shù)據(jù)用excel預(yù)覽如下

ISD station data

可見(jiàn)原數(shù)據(jù)已經(jīng)有列名，且日期單獨(dú)為一列，可以直接用pd.read_csv讀取

df = pd.read_csv(file,  dtype={'STATION':str})        
df['DATE'] = pd.to_datetime(df['DATE'])
df = df.set_index('DATE')       # set date as index

示例二

如果日期是如下圖更常見(jiàn)的以空格分隔的形式，同樣可以用用pd.to_datetime轉(zhuǎn)換。
各列依次為：station ID, station type, year, month, day, standard report hour, actual report time...

NCEP ADP decoded data

colNames =  ['id','fmtflag', 'year', 'month', 'day', 'hour', 'time', 'lat', 'lon', 'elev', 
                      'ww', 'pw', 'slp', 'stp', 't', 'td'] #present weather, past weather, sea level pressure, station pressure
data = pd.read_table(file,  header=None, names=colNames, 
                     delim_whitespace=True, dtype={'id':str}) # read_csv也行
df['date'] = pd.to_datetime(df.loc[:, ['year', 'month', 'day', 'hour'] ])
df = df.set_index('DATE')

生成日期序列

pd.date_range

months = pd.date_range('1973-01', '2020-01', freq='M')

如果要顯示為‘‘yyyymm’’的格式（在讀取數(shù)據(jù)時(shí)會(huì)用到，例如有些數(shù)據(jù)路徑為/200701/...）

months = pd.date_range('1973-01', '2020-01', freq='M').strftime('%Y%M')

strftime是把時(shí)間轉(zhuǎn)換成string格式

通過(guò)日期索引選取數(shù)據(jù)

1. 直接通過(guò)日期字符串選取

選取某一年：

In [27]: df['1973']

Out[27]: 
                         STATION  TMP_VALUE TMP_FLAG  ...  AY_FLAG AZ_VALUE  AZ_FLAG
1973-01-01 00:00:00  01001099999       -3.0        1  ...        1      NaN      NaN
1973-01-01 06:00:00  01001099999       -2.0        1  ...        1      NaN      NaN
1973-01-01 12:00:00  01001099999       -6.0        1  ...        1      NaN      NaN
1973-01-01 18:00:00  01001099999       -6.0        1  ...        1      NaN      NaN
1973-01-02 00:00:00  01001099999       -9.0        1  ...        1      NaN      NaN
                         ...        ...      ...  ...      ...      ...      ...
1973-12-31 09:00:00  01001099999       -5.0        1  ...        1      NaN      NaN
1973-12-31 12:00:00  01001099999       -5.0        1  ...        1      NaN      NaN
1973-12-31 15:00:00  01001099999       -5.0        1  ...        1      NaN      NaN
1973-12-31 18:00:00  01001099999       -3.0        1  ...        1      NaN      NaN
1973-12-31 21:00:00  01001099999       -3.0        1  ...        1      NaN      NaN

[2801 rows x 19 columns]

或者選取某個(gè)區(qū)間內(nèi)的年份：

In [39]: df['1973':'1974']   # 包括1994年
Out[39]: 
                         STATION  TMP_VALUE TMP_FLAG  ...  AY_FLAG AZ_VALUE  AZ_FLAG
1973-01-01 00:00:00  01001099999       -3.0        1  ...        1      NaN      NaN
1973-01-01 06:00:00  01001099999       -2.0        1  ...        1      NaN      NaN
1973-01-01 12:00:00  01001099999       -6.0        1  ...        1      NaN      NaN
1973-01-01 18:00:00  01001099999       -6.0        1  ...        1      NaN      NaN
1973-01-02 00:00:00  01001099999       -9.0        1  ...        1      NaN      NaN
                         ...        ...      ...  ...      ...      ...      ...
1974-12-31 09:00:00  01001099999        0.0        1  ...        1      NaN      NaN
1974-12-31 12:00:00  01001099999        0.0        1  ...        1      NaN      NaN
1974-12-31 15:00:00  01001099999       -4.0        1  ...        1      NaN      NaN
1974-12-31 18:00:00  01001099999       -3.0        1  ...        1      NaN      NaN
1974-12-31 21:00:00  01001099999       -5.0        1  ...        1      NaN      NaN

[5517 rows x 19 columns]

按年份選取應(yīng)該是最常用的。
如果只需要年平均之類(lèi)的話，groupby(updating) ......
較復(fù)雜的逐年的數(shù)據(jù)處理，需要通過(guò)循環(huán)遍歷所有年份，對(duì)每一年數(shù)據(jù)單獨(dú)處理：

df_years = df.index.to_period('A').unique().year #將index按年份顯示，取唯一值，再取年份
Out[29]: 
Int64Index([1973, 1974, 1975, 1976, 1977, 1978, 1979, 1980, 1981, 1982, 1983,
            1984, 1985, 1986, 1987, 1988, 1989, 1990, 1991, 1992, 1993, 1994,
            1995, 1996, 1997, 1998, 1999, 2000, 2001, 2002, 2003, 2004, 2005,
            2006, 2007, 2008, 2009, 2010, 2011, 2012, 2013, 2014, 2015, 2016,
            2017, 2018, 2019],
           dtype='int64')

for yr in df_years:
    df_yr = df[str(yr)]   # 索引得是字符
    balabala……

了解一下選月、日、小時(shí)：

 # 選取某月
df['1973-01']    #  df['197301']會(huì)報(bào)錯(cuò)
# 選取某天
df['1973-01-01'] # 我的數(shù)據(jù)包含小時(shí)、分鐘數(shù)據(jù)所以這么寫(xiě)不報(bào)錯(cuò)
df['1973-01-01':'1973-01-01'] # 用區(qū)間來(lái)選取
# 選取某時(shí)刻
df['1973-01-01 06'] # （目前只知道最多選到小時(shí)為止）

用.loc也可以，更保險(xiǎn)

df.loc['1973':'1974']   # 得到的結(jié)果和上面的df['1973':'1974']是一樣的

2. 通過(guò)條件篩選index

我一般是通過(guò)index來(lái)進(jìn)行篩選。datetime格式的index可以提取year, month, day, hour的信息：

In [30]: df.index.month

Out[30]: 
Int64Index([1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
            ...
            6, 6, 6, 6, 6, 6, 6, 6, 6, 6],
           dtype='int64', length=98404)

然后用boolean篩選就行了：

In [31]: df[df.index.month==1]
Out[31]: 
                         STATION  TMP_VALUE TMP_FLAG  ...  AY_FLAG AZ_VALUE  AZ_FLAG
1973-01-01 00:00:00  01001099999       -3.0        1  ...        1      NaN      NaN
1973-01-01 06:00:00  01001099999       -2.0        1  ...        1      NaN      NaN
1973-01-01 12:00:00  01001099999       -6.0        1  ...        1      NaN      NaN
1973-01-01 18:00:00  01001099999       -6.0        1  ...        1      NaN      NaN
1973-01-02 00:00:00  01001099999       -9.0        1  ...        1      NaN      NaN
                         ...        ...      ...  ...      ...      ...      ...
2020-01-30 18:00:00  01001099999        1.2        1  ...        1      NaN      NaN
2020-01-31 00:00:00  01001099999        1.2        1  ...        1      NaN      NaN
2020-01-31 09:00:00  01001099999        0.6        1  ...        1      NaN      NaN
2020-01-31 15:00:00  01001099999       -1.0        1  ...        1      NaN      NaN
2020-01-31 18:00:00  01001099999       -1.4        1  ...        1      NaN      NaN

[8786 rows x 19 columns]

3. truncate

可以去掉某日期之前或之后的數(shù)據(jù)。

dates = pd.date_range('2016-01-01', '2016-02-01', freq='s')
df = pd.DataFrame(index=dates, data={'A': 1})
 # 去掉1-5之前和1-10之后的行
df.truncate(before=pd.Timestamp('2016-01-05'),
            after=pd.Timestamp('2016-01-10')) 

# 也可以簡(jiǎn)單地寫(xiě)成：
df.truncate('2016-01-05', '2016-01-10')

# truncate方法默認(rèn)只匹配到0時(shí)，最后一個(gè)值是2016-01-10 00:00:00  
# 對(duì)比df.loc['2016-01-05':'2016-01-10', :]， 最后一個(gè)值是2016-01-10 23:59:59

改變?nèi)掌陲@示方式

1. to_period (updating

df.index.to_period('A').unique().year

# process day by day
dates = df_ww.index.to_period('d').strftime('%Y-%m-%d').unique()

2. asfreq （updating）

說(shuō)實(shí)在的我還沒(méi)搞清楚這個(gè)函數(shù)怎么用QAQ

3. astype('datetime64[M]')

轉(zhuǎn)換成np.datetime64格式，返回的是array

df.index.values
Out[43]: 
array(['1978-01-01T00:00:00.000000000', '1978-01-01T06:00:00.000000000',
       '1978-01-01T12:00:00.000000000', ...,
       '2007-02-28T17:00:00.000000000', '2007-02-28T19:00:00.000000000',
       '2007-02-28T20:00:00.000000000'], dtype='datetime64[ns]')

df.index.values.astype('datetime64[M]')
Out[44]: 
array(['1978-01', '1978-01', '1978-01', ..., '2007-02', '2007-02',
       '2007-02'], dtype='datetime64[M]')

按年份/月份批量處理（計(jì)數(shù)、求和、平均）： resample

df.resample('M')返回的是一個(gè)<pandas.core.resample.DatetimeIndexResampler object>，后面要再加上具體的method才可以顯示結(jié)果，例如.count(), .sum(), .mean()...

# monthly number of ww observations
In [36]: num = df['WW_VALUE'].resample('M').count()

Out[36]: 
1973-01-31    217
1973-02-28    210
1973-03-31    242
1973-04-30    237
1973-05-31    243

2020-02-29    159
2020-03-31    180
2020-04-30    179
2020-05-31    186
2020-06-30     89
Freq: M, Name: WW_VALUE, Length: 570, dtype: int64

返回的index自動(dòng)設(shè)置成了每月的最后一天。如果想只保留“年-月”格式的話，可以用to_period('m')

In [37]: num = df['WW_VALUE'].resample('M').count().to_period('m')   # "m"大小寫(xiě)沒(méi)有影響。但“Y”和“y”格式有差別。
Out[37]: 
1973-01    217
1973-02    210
1973-03    242
1973-04    237
1973-05    243

2020-02    159
2020-03    180
2020-04    179
2020-05    186
2020-06     89
Freq: M, Name: WW_VALUE, Length: 570, dtype: int64

選取某個(gè)時(shí)間區(qū)間內(nèi)的數(shù)據(jù)

Pandas.DataFrame.between_time

一個(gè)簡(jiǎn)單粗暴的選取時(shí)間段內(nèi)數(shù)據(jù)的方法:

In [38]: df.between_time('5:00', '7:00')
Out[38]: 
                         STATION  TMP_VALUE TMP_FLAG  ...  AY_FLAG AZ_VALUE  AZ_FLAG
1973-01-01 06:00:00  01001099999       -2.0        1  ...        1      NaN      NaN
1973-01-02 06:00:00  01001099999       -9.0        1  ...        1      NaN      NaN
1973-01-03 06:00:00  01001099999       -5.0        1  ...        1      NaN      NaN
1973-01-05 06:00:00  01001099999      -10.0        1  ...        1      NaN      NaN
1973-01-06 06:00:00  01001099999      -11.0        1  ...        1      NaN      NaN

日期/時(shí)間增減Timedelta

選時(shí)間段，也可以通過(guò)時(shí)間的增減來(lái)完成。pandas的好處是會(huì)自動(dòng)計(jì)算加減時(shí)間之后的日期。

In [40]: pd.Timedelta('6h')
Out[40]: Timedelta('0 days 06:00:00')

In [41]: pd.Timedelta(6,unit='h')
Out[41]: Timedelta('0 days 06:00:00')

In[42]: pd.Timedelta('2 days 2 hours 15 minutes 30 seconds')
Out[42]: Timedelta('2 days 02:15:30')

Timedelta中的unit:

‘Y’, ‘M’, ‘W’, ‘D’, 'H', ‘T’, ‘S’, ‘L’, ‘U’, or ‘N’ (我補(bǔ)充了一個(gè)'H'。對(duì)應(yīng)年、月、周、日、時(shí)、分、秒、毫秒、微秒、納秒)
‘days’ or ‘day’
‘hours’, ‘hour’, ‘hr’, or ‘h’
‘minutes’, ‘minute’, ‘min’, or ‘m’
‘seconds’, ‘second’, or ‘sec’
...【其余參見(jiàn)官方文檔】

0點(diǎn)前后一小時(shí)數(shù)據(jù)的選取方法：

repo00 = pd.to_datetime(date+' 00')
df_date00 = df[
                  (df.index >= repo00 - pd.Timedelta('1h'))
                & (df.index <= repo00 + pd.Timedelta('1h'))
                  ]

日期相加減

這個(gè)項(xiàng)目中，經(jīng)常遇到時(shí)間不是整點(diǎn)的情況（例如5:45, 6:15），每天的數(shù)據(jù)量也有所不同。為了將數(shù)據(jù)限定在一日四次，我需要選出00, 06, 12, 18點(diǎn)前后1小時(shí)內(nèi)的數(shù)據(jù)，并取其中與其最接近的觀測(cè)值。這一點(diǎn)涉及時(shí)間的差，不能通過(guò)resample來(lái)實(shí)現(xiàn)。

# the closest observation within 1h of 06Z, 12Z, 18Z
df_date = df_ww[date]
obs06 = df_date.between_time('5:00', '7:00').index     # 06點(diǎn)的數(shù)據(jù)的index
delta06 = abs((obs06 - pd.to_datetime(date+' 06')).values)
if delta06.size > 0:
    df_resampled = pd.concat([df_resampled,
                              df_date.loc[[obs06[np.argmin(delta06)]]] 
                             ])

這里注意，兩個(gè)datetime相減之后得到的結(jié)果以ns為單位。如果需要變換單位，可以用.dt.seconds或者.values取出數(shù)值，再進(jìn)行變換。參考http://blog.gqylpy.com/gqy/22545/#pandas_74

df['diff_time'] = (df['tm_1'] - df['tm_2']).dt.seconds/60        # to minutes
df['diff_time'] = (df['tm_1'] - df['tm_2']).values/np.timedelta64(1, 'h')       # to hours

與numpy datetime64格式的比較

1.

起因是要做兩個(gè)dataset的collocation，需要合并兩個(gè)dataset相同時(shí)間下的氣象變量。
其中一個(gè)tablet data讀取為pd.DataFrame，由pd.to_datetime轉(zhuǎn)換日期格式并設(shè)置為index：

In [31] df.index
Out[31]: 
DatetimeIndex(['1979-01-01 00:00:00', '1979-01-01 06:00:00',
               '1979-01-01 12:00:00', '1979-01-01 18:00:00',
               '1979-01-02 00:00:00', '1979-01-02 06:00:00',
               '1979-01-02 12:00:00', '1979-01-02 18:00:00',
               '1979-01-03 00:00:00', '1979-01-03 06:00:00',
               ...
               '1979-01-29 12:00:00', '1979-01-29 18:00:00',
               '1979-01-30 00:00:00', '1979-01-30 06:00:00',
               '1979-01-30 12:00:00', '1979-01-30 18:00:00',
               '1979-01-31 00:00:00', '1979-01-31 06:00:00',
               '1979-01-31 12:00:00', '1979-01-31 18:00:00'],
              dtype='datetime64[ns]', name='date', length=124, freq=None)

另一個(gè)是nc數(shù)據(jù)，用xarray讀取變量，其time坐標(biāo)默認(rèn)為np.datetime64格式

In [33]: tk.time
Out[33]: 
<xarray.DataArray 'time' (time: 744)>
array(['1979-01-01T00:00:00.000000000', '1979-01-01T01:00:00.000000000',
       '1979-01-01T02:00:00.000000000', ..., '1979-01-31T21:00:00.000000000',
       '1979-01-31T22:00:00.000000000', '1979-01-31T23:00:00.000000000'],
      dtype='datetime64[ns]')
Coordinates:
  * time     (time) datetime64[ns] 1979-01-01 ... 1979-01-31T23:00:00
Attributes:
    long_name:  time

【我居然忘了當(dāng)時(shí)糾結(jié)了很久的bug是什么】
如果寫(xiě)np.argwhere(df.index[0] == tk.time.values)得到的是空集
因此都用values提取數(shù)值進(jìn)行對(duì)比即可

In [77]: np.argwhere(df.index.values[0] == tk.time.values)
Out[77]: array([[0]])

# 用int()提取數(shù)值：
In [78]: int(np.argwhere(df.index.values[0] == tk.time.values))
Out[78]: 0

2. Timestamp, datetime.datetime, np.datetime64之間的轉(zhuǎn)換

本人暫時(shí)用不到。先參考這個(gè)鏈接https://stackoverflow.com/questions/13703720/converting-between-datetime-timestamp-and-datetime64

pd.to_datetime再轉(zhuǎn)換成np.datetime64的格式，只需要加上np.array：

date_n = np.array(pd.to_datetime(df[['year', 'month', 'day', 'hour']]) )
Out[83]: 
array(['1979-01-01T00:00:00.000000000', '1979-01-01T06:00:00.000000000',
       '1979-01-01T12:00:00.000000000', '1979-01-01T18:00:00.000000000',
        ...
       '1979-01-31T00:00:00.000000000', '1979-01-31T06:00:00.000000000',
       '1979-01-31T12:00:00.000000000', '1979-01-31T18:00:00.000000000'],
      dtype='datetime64[ns]')

(1) python datetime => datetime64 / Timestamp

dt = datetime.datetime(year=2017, month=10, day=24, hour=4, 
                   minute=3, second=10, microsecond=7199)
>>> np.datetime64(dt)
numpy.datetime64('2017-10-24T04:03:10.007199')

>>> pd.Timestamp(dt) # or pd.to_datetime(dt)
Timestamp('2017-10-24 04:03:10.007199')

(2) numpy datetime64 => Timestamp

In [86]: pd.Timestamp(np.datetime64('2012-05-01T01:00:00.000000'))
Out[86]: Timestamp('2012-05-01 01:00:00')

In [87]: pd.to_datetime('2012-05-01T01:00:00.000000+0100')
Out[87]: Timestamp('2012-05-01 01:00:00+0100', tz='pytz.FixedOffset(60)')

In [88]: pd.to_datetime('2012-05-01T01:00:00.000000+0100').replace(tzinfo=None)
Out[88]: Timestamp('2012-05-01 01:00:00')

np.datetime64 => datetime似乎有點(diǎn)復(fù)雜。以后需要的話再查一查。

(3) Timestamp => datetime / datetime64

>>> ts = pd.Timestamp('2017-10-24 04:24:33.654321')

>>> ts.to_pydatetime()   # Python's datetime
datetime.datetime(2017, 10, 24, 4, 24, 33, 654321)

>>> ts.to_datetime64()
numpy.datetime64('2017-10-24T04:24:33.654321000')

模擬matlab的tic, toc計(jì)時(shí)功能

https://blog.csdn.net/u010199776/article/details/69941965

import datetime

tic = datetime.datetime.now()
...
toc = datetime.datetime.now()
print('%s, Elapsed time: %f seconds' % (toc, (toc-tic).total_seconds() ))

2020-07-19 16:23:13.321484, Elapsed time: 5.371069 seconds

色偷偷精品伊人,欧洲久久精品,欧美综合婷婷骚逼,国产AV主播,国产最新探花在线,九色在线视频一区,伊人大交九欧美,1769亚洲,黄色成人av

氣象datetime處理合集【Updating】

氣象datetime處理合集【Updating】

前言

將日期設(shè)置為索引：pd.to_datetime

常用使用方法

實(shí)踐示例

示例一

示例二

生成日期序列

通過(guò)日期索引選取數(shù)據(jù)

1. 直接通過(guò)日期字符串選取

2. 通過(guò)條件篩選index

3. truncate

改變?nèi)掌陲@示方式

1. to_period (updating

2. asfreq （updating）

3. astype('datetime64[M]')

按年份/月份批量處理（計(jì)數(shù)、求和、平均）： resample

選取某個(gè)時(shí)間區(qū)間內(nèi)的數(shù)據(jù)

Pandas.DataFrame.between_time

日期/時(shí)間增減Timedelta

日期相加減

與numpy datetime64格式的比較

1.

2. Timestamp, datetime.datetime, np.datetime64之間的轉(zhuǎn)換

(1) python datetime => datetime64 / Timestamp

(2) numpy datetime64 => Timestamp

(3) Timestamp => datetime / datetime64

模擬matlab的tic, toc計(jì)時(shí)功能

友情鏈接更多精彩內(nèi)容

色偷偷精品伊人,欧洲久久精品,欧美综合婷婷骚逼,国产AV主播,国产最新探花在线,九色在线视频一区,伊人大交九 欧美,1769亚洲,黄色成人av

氣象datetime處理合集【Updating】

前言

將日期設(shè)置為索引：pd.to_datetime

常用使用方法

實(shí)踐示例

示例一

示例二

生成日期序列

通過(guò)日期索引選取數(shù)據(jù)

1. 直接通過(guò)日期字符串選取

2. 通過(guò)條件篩選index

3. truncate

改變?nèi)掌陲@示方式

1. to_period (updating

2. asfreq （updating）

3. astype('datetime64[M]')

按年份/月份批量處理（計(jì)數(shù)、求和、平均）： resample

選取某個(gè)時(shí)間區(qū)間內(nèi)的數(shù)據(jù)

Pandas.DataFrame.between_time

日期/時(shí)間增減Timedelta

日期相加減

與numpy datetime64格式的比較

1.

2. Timestamp, datetime.datetime, np.datetime64之間的轉(zhuǎn)換

(1) python datetime => datetime64 / Timestamp

(2) numpy datetime64 => Timestamp

(3) Timestamp => datetime / datetime64

模擬matlab的tic, toc計(jì)時(shí)功能

友情鏈接更多精彩內(nèi)容

色偷偷精品伊人,欧洲久久精品,欧美综合婷婷骚逼,国产AV主播,国产最新探花在线,九色在线视频一区,伊人大交九欧美,1769亚洲,黄色成人av

按年份/月份批量處理（計(jì)數(shù)、求和、平均）： resample