Introduction
Problem: we have a one-dimensional data series (for example, the sales volume of a product, or a stock price) and want to build a deep-learning model that predicts it, e.g. using the previous 50 values to predict the next one.

Below, we process the data, then build and train a model, ending up with the prediction model we want.
Reading and preprocessing the data:
Reading the data: load_data(filename, time_step)
The csv file is read with pandas. Note the path format: filename should use '/' rather than '\'. The time_step argument sets how many history values serve as the basis for predicting the next value; per the problem statement we use the previous 50 values, so time_step is 50.
import time
import keras
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.preprocessing import MinMaxScaler
from keras.models import Sequential
from keras.layers import LSTM
from keras.layers import Dense, Activation, Dropout
def load_data(filename, time_step):
    '''
    filename: str
        instruction: file path; note the '/' separators
    time_step: int
        instruction: how many previous samples are used to predict the next
        sample; it is the same as the LSTM's time_steps
    '''
    df = pd.read_csv(filename, header=None)
    data = df.values
    data = data.astype('float32')  # confirm the type as 'float32'
    data = data.reshape(data.shape[0], )
    # plt.title('original data')
    # plt.plot(data)
    # plt.savefig('original data.png')
    # plt.show()
    # use a list to rebuild the dataset: each row stores the previous
    # time_step samples plus the one sample to be predicted
    result = []
    for index in range(len(data) - time_step):
        result.append(data[index:index + time_step + 1])
    # 'result' has shape (len(data) - time_step, time_step + 1); the last column is the predicted sample
    return np.array(result)
Here the list variable result collects, in each row, 50 history values followed by one target value, so the final result has shape ((len(data) - time_step), 51); it is then converted to a numpy array for easier manipulation.
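As a quick sanity check, the sliding-window construction can be reproduced on a tiny toy series (a hypothetical example, not the sp500 data; a window of 3 stands in for 50):

```python
import numpy as np

def make_windows(data, time_step):
    # same idea as load_data: each row = time_step history values + 1 target
    return np.array([data[i:i + time_step + 1]
                     for i in range(len(data) - time_step)])

series = np.arange(10, dtype='float32')   # 0, 1, ..., 9
windows = make_windows(series, time_step=3)
print(windows.shape)   # (7, 4): len(data) - time_step rows, time_step + 1 columns
print(windows[0])      # [0. 1. 2. 3.] -> history 0,1,2 and target 3
```

Each row's last entry is exactly the value that follows its history, which is why the last column can later be split off as y.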
Normalization and the train/test split
First the data is normalized with MinMaxScaler from sklearn.preprocessing, then split into training and test sets in a 7:3 ratio.
data = load_data('sp500.csv', 50)
# normalize the data and split it into train and test set
scaler = MinMaxScaler(feature_range=(0, 1))
dataset = scaler.fit_transform(data)
# define a variable to represent the ratio of train/total and split the dataset
train_count = int(0.7 * len(dataset))
x_train_set, x_test_set = dataset[:train_count, :-1], dataset[train_count:, :-1]
y_train_set, y_test_set = dataset[:train_count, -1], dataset[train_count:, -1]
# reshape the data to satisfy the input acquirement of LSTM
x_train_set = x_train_set.reshape(x_train_set.shape[0], 1, x_train_set.shape[1])
x_test_set = x_test_set.reshape(x_test_set.shape[0], 1, x_test_set.shape[1])
y_train_set = y_train_set.reshape(y_train_set.shape[0], 1)
y_test_set = y_test_set.reshape(y_test_set.shape[0], 1)
Note that without the reshape, y_train_set would have shape (M, ), a rank-1 vector rather than an explicit one-column array. Arrays of shape (M, ) are one of the most common sources of bugs in machine-learning code.
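The difference is easy to see with numpy (a minimal illustration):

```python
import numpy as np

y = np.arange(5, dtype='float32')
print(y.shape)        # (5,)  -- a rank-1 vector, neither row nor column

y_col = y.reshape(y.shape[0], 1)
print(y_col.shape)    # (5, 1) -- an explicit column

# later steps stack y against a 2-D block; hstack needs matching ranks,
# so it works with the (5, 1) column but would fail with the (5,) vector
block = np.zeros((5, 50), dtype='float32')
stacked = np.hstack((block, y_col))
print(stacked.shape)  # (5, 51)
```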
Building the model
Here we build a four-layer network whose layer widths come from the layer argument. When building an LSTM, the parameters, especially the various sizes, confuse many newcomers. As I understand it, an LSTM has two basic parameters to set: units and input_shape.

units: the number of output neurons of the first hidden layer, which is also the number of input neurons of the second hidden layer.
input_shape: the official docs give the form (samples, time_steps, features). features is simply the dimensionality of each sample. If time_steps = t, it is as if the cell were unfolded into x0 through xt-1; samples can be omitted.
In this example, after preprocessing, X has shape (m, 50), where m is the number of samples and 50 is the number of features (arguably it should really be the time_steps). So here time_steps = 1 and features = 50.
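The same (m, 50) matrix could be fed to the LSTM in either of two layouts; this post uses the first. A small numpy sketch of both (m = 8 is illustrative):

```python
import numpy as np

m = 8
x = np.random.rand(m, 50).astype('float32')

# layout used in this post: 1 time step, 50 features per step
x_a = x.reshape(m, 1, 50)
print(x_a.shape)   # (8, 1, 50)

# alternative layout: 50 time steps, 1 feature per step, which matches
# the usual "unfold x0 ... x49" picture of an LSTM
x_b = x.reshape(m, 50, 1)
print(x_b.shape)   # (8, 50, 1)
```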
def build_model(layer):
    '''
    layer: list
        instruction: the number of neurons in each layer
    '''
    model = Sequential()
    # set the first hidden layer and set the input dimension
    model.add(LSTM(
        input_shape=(1, layer[0]), units=layer[1], return_sequences=True
    ))
    model.add(Dropout(0.2))
    # add the second layer
    model.add(LSTM(
        units=layer[2], return_sequences=False
    ))
    model.add(Dropout(0.2))
    # add the output layer with a Dense
    model.add(Dense(units=layer[3], activation='linear'))
    model.compile(loss='mse', optimizer='adam')
    return model
I am still new to this and not entirely clear on many of these concepts; corrections from more experienced readers would be much appreciated.
Training and prediction
With the model built, we train it on the training set and then test it on the test set.
# build the model; the hidden widths of 100 are illustrative choices,
# only the input width 50 and the output width 1 are fixed by the data
model = build_model([50, 100, 100, 1])
# train the model and use the validation part to validate
model.fit(x_train_set, y_train_set, batch_size=128, epochs=20, validation_split=0.2)
# do the prediction
y_predicted = model.predict(x_test_set)
Here validation_split carves a validation set out of the training set, giving an early warning of overfitting. For more on validation_split, see http://www.itdecent.cn/p/0c7af5fbcf72
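As I understand it, Keras takes the validation samples from the end of the training arrays (before any shuffling); the bookkeeping for our 7:3 split can be sketched by hand (700 training samples is illustrative):

```python
n_train = 700
validation_split = 0.2

# samples actually trained on vs. samples held out for validation
split_at = int(n_train * (1 - validation_split))
print(split_at)             # 560
print(n_train - split_at)   # 140
```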
Plotting
The final step is to plot the predicted data. To compare it against the original data, the predictions are first transformed back into the original units, using the inverse_transform of the scaler defined earlier.
One snag: inverse_transform complains that y_test_set does not match the size of the data before the transform, which makes sense on reflection. scaler.fit_transform was applied to data with 51 columns, so y_test_set has to be padded back out: hstack stacks it against an array of zeros.
# plot the predicted curve and the original curve
# fill some zeros to get a (len, 51) array
temp = np.zeros((len(y_test_set), 50))
origin_temp = np.hstack((temp, y_test_set))
predict_temp = np.hstack((temp, y_predicted))
# transform the data back to the original scale
origin_test = scaler.inverse_transform(origin_temp)
predict_test = scaler.inverse_transform(predict_temp)
plot_curve(origin_test[:, -1], predict_test[:, -1])
If y_test_set had not been reshaped into a one-column array earlier, a bug would surface here with a dimension-mismatch error, because before the reshape it was a (M, ) vector.
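The zero-padding trick can be checked end to end with a hand-written min-max scaler (a self-contained stand-in for sklearn's MinMaxScaler, applying x' = (x - min) / (max - min) per column; the (20, 51) random array stands in for the real dataset):

```python
import numpy as np

rng = np.random.default_rng(0)
raw = rng.random((20, 51)).astype('float32')

col_min, col_max = raw.min(axis=0), raw.max(axis=0)
scaled = (raw - col_min) / (col_max - col_min)          # like fit_transform
inverse = lambda a: a * (col_max - col_min) + col_min   # like inverse_transform

y_scaled = scaled[:, -1].reshape(-1, 1)                 # the (M, 1) target column
temp = np.zeros((len(y_scaled), 50), dtype='float32')
padded = np.hstack((temp, y_scaled))                    # pad back to 51 columns
recovered = inverse(padded)[:, -1]                      # only the last column is meaningful

print(np.allclose(recovered, raw[:, -1], atol=1e-4))    # True
```

The zeros in the first 50 columns come back as garbage values after the inverse transform, but that does not matter: only the last column is read.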
The plot_curve function is as follows:
def plot_curve(true_data, predicted_data):
    '''
    true_data: float32
        instruction: the true test data
    predicted_data: float32
        instruction: the predicted data from the model
    '''
    plt.plot(true_data, label='True data')
    plt.plot(predicted_data, label='Predicted data')
    plt.legend()
    plt.savefig('result.png')
    plt.show()
The result is as follows:
