"Spatio-Temporal Data Processing and Organization Course Practicum" Lab Report
Title: Lab 5, Decision Tree Classification
Date: June 13
Environment: Python 3.6, Windows, WSL2 (Ubuntu 20.04)
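Preface
This report reads best on Jianshu, where the formatting is better. (Jianshu link to this article)
The code files for this lab are day5.py and day5_pre_process.py; the corresponding readme is day3.md
All the code for this practicum is stored in a GitHub repository, and I suggest browsing my code online
You can also run git clone https://github.com/uiharuayako/geoDataWork.git to get my latest progress.
All code is either my own work or comes from the materials the teacher provided. Please use it to learn and exchange ideas rather than to copy and paste. Thanks!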
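Lab content and completion status:
The program implements every item in the lab assignment and provides all the required functionality.
The comments in the code are fairly thorough, so I suggest reading the code directly. Here I mainly describe my approach and analyze a few code fragments.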
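Code analysis
1. Import the data from file and convert it to a DataFrame.
2. Train a decision tree model that predicts whether a resident's income exceeds 50K.
3. Validate the model on the Test dataset and print its accuracy.
Let's go straight to the code. It is short this time, but there is a lot packed into it.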
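While I was working on this, someone in the class group chat said the data had problems, several of them, so I took a look. Before adult.test was corrected, the 50K in every row was written as 50K. (with a trailing period). That problem is real. Other classmates said the data itself also contained errors and omissions, so I wrote a preprocessing function to check for them and patch them.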
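The trailing-period problem can be patched with a few lines of Python. This is a minimal sketch, assuming that fields are separated by ', ' and that the label is the last field; normalize_label and normalize_line are hypothetical helper names, not functions from day5_pre_process.py, and the sample line is only an illustration of the record format.

```python
def normalize_label(raw: str) -> str:
    """Strip surrounding whitespace and any trailing period, so '>50K.' becomes '>50K'."""
    return raw.strip().rstrip(".")


def normalize_line(line: str) -> str:
    """Normalize only the last ', '-separated field of one record line."""
    fields = line.rstrip("\n").split(", ")
    fields[-1] = normalize_label(fields[-1])
    return ", ".join(fields)


# Illustrative record in the adult.test format (label ends with a period)
sample = ("25, Private, 226802, 11th, 7, Never-married, Machine-op-inspct, "
          "Own-child, Black, Male, 0, 0, 40, United-States, <=50K.")
print(normalize_line(sample))
```

Applied line by line to adult.test, this would leave every other field untouched and only rewrite the label, so labels in the test file match those in adult.data.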
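The preprocessing uses the pandas module. I have to say that a pandas DataFrame is much more convenient to use from Python than Spark is. pandas runs inside the Python process itself, whereas PySpark forwards its calls to the JVM, and on a dataset this small the difference in responsiveness is obvious.
Honestly, the nested ifs make even me wince, but this way about 40,000 records are processed within a few seconds, which is efficient enough. PySpark, by contrast, could not even read this data.
The preprocessing code is as follows: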
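import unicodedata

import pandas as pd

# Fields are separated by ', ', which requires the python parsing engine
data = pd.read_csv('adult/adult.data', header=None, sep=', ', engine='python')
print(data.shape)
# Step 1: find the rows that contain null values
null_lines = data.isnull().T.any()
for index, value in null_lines.items():
    if value:
        print("row {} has a null value".format(index + 1))
# Drop rows with null values (dropna returns a new DataFrame, so assign it back)
data = data.dropna(axis=0, how='any')


# Step 2: find values that look wrong
def is_number(s):
    """Return True if s can be interpreted as a number."""
    try:
        float(s)
        return True
    except ValueError:
        pass
    try:
        unicodedata.numeric(s)
        return True
    except (TypeError, ValueError):
        pass
    return False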
work_type = {'Private': 1,
'Self-emp-not-inc': 2,
'Self-emp-inc': 3,
'Federal-gov': 4,
'Local-gov': 5,
'State-gov': 6,
'Without-pay': 7,
'Never-worked': 8,
'?': -1}
education = {'Bachelors': 1,
'Some-college': 2,
'11th': 3,
'HS-grad': 4,
'Prof-school': 5,
'Assoc-acdm': 6,
'Assoc-voc': 7,
'9th': 8,
'7th-8th': 9,
'12th': 10,
'Masters': 11,
'1st-4th': 12,
'10th': 13,
'Doctorate': 14,
'5th-6th': 15,
'Preschool': 16,
'?': -1}
marital_status = {'Married-civ-spouse': 1,
'Divorced': 2,
'Never-married': 3,
'Separated': 4,
'Widowed': 5,
'Married-spouse-absent': 6,
'Married-AF-spouse': 7,
'?': -1}
occupation = {'Tech-support': 1,
'Craft-repair': 2,
'Other-service': 3,
'Sales': 4,
'Exec-managerial': 5,
'Prof-specialty': 6,
'Handlers-cleaners': 7,
'Machine-op-inspct': 8,
'Adm-clerical': 9,
'Farming-fishing': 10,
'Transport-moving': 11,
'Priv-house-serv': 12,
'Protective-serv': 13,
'Armed-Forces': 14,
'?': -1}
relationship = {'Wife': 1,
'Own-child': 2,
'Husband': 3,
'Not-in-family': 4,
'Other-relative': 5,
'Unmarried': 6,
'?': -1}
race = {'White': 1,
'Asian-Pac-Islander': 2,
'Amer-Indian-Eskimo': 3,
'Other': 4,
'Black': 5,
'?': -1}
sex = {'Female': 1,
'Male': 2,
'?': -1}
native_country = {'United-States': 1,
'Cambodia': 2,
'England': 3,
'Puerto-Rico': 4,
'Canada': 5,
'Germany': 6,
'Outlying-US(Guam-USVI-etc)': 7,
'India': 8,
'Japan': 9,
'Greece': 10,
'South': 11,
'China': 12,
'Cuba': 13,
'Iran': 14,
'Honduras': 15,
'Philippines': 16,
'Italy': 17,
'Poland': 18,
'Jamaica': 19,
'Vietnam': 20,
'Mexico': 21,
'Portugal': 22,
'Ireland': 23,
'France': 24,
'Dominican-Republic': 25,
'Laos': 26,
'Ecuador': 27,
'Taiwan': 28,
'Haiti': 29,
'Columbia': 30,
'Hungary': 31,
'Guatemala': 32,
'Nicaragua': 33,
'Scotland': 34,
'Thailand': 35,
'Yugoslavia': 36,
'El-Salvador': 37,
'Trinadad&Tobago': 38,
'Peru': 39,
'Hong': 40,
'Holand-Netherlands': 41,
'?': -1}
# Check every column of every row: numeric columns must parse as numbers,
# categorical columns must be a key of the matching dictionary
for index, row in data.iterrows():
    if is_number(row[0]):
        if row[1] in work_type:
            if is_number(row[2]):
                if row[3] in education:
                    if is_number(row[4]):
                        if row[5] in marital_status:
                            if row[6] in occupation:
                                if row[7] in relationship:
                                    if row[8] in race:
                                        if row[9] in sex:
                                            if is_number(row[10]):
                                                if is_number(row[11]):
                                                    if is_number(row[12]):
                                                        if row[13] in native_country:
                                                            continue
    print("row {} has an error".format(index + 1))
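Two checks are performed here. The first checks whether any value is null; a value of ? is treated as non-null. Rows with nulls are then dropped and the second check runs, which is that very long chain of ifs. It validates every column: numeric columns go through a number check (that is, whether the string is a valid number), and string columns are looked up as keys in the dictionaries. Neither adult.data nor adult.test turned out to contain any errors (after the 50K. in the .test file had been corrected, of course).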
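For comparison, the same per-column checks can be written without nesting. This is only a sketch of the idea, not the code in day5_pre_process.py: the schema list, the convention that None means "numeric", and the invalid_columns helper are assumptions made for illustration, and the demo uses three columns rather than all fourteen.

```python
from typing import List, Optional, Sequence, Set

# Hypothetical rule format: None means "must be numeric",
# a set means "must be one of these strings".
Rule = Optional[Set[str]]


def is_number(text: str) -> bool:
    """Return True if text parses as a float."""
    try:
        float(text)
        return True
    except ValueError:
        return False


def invalid_columns(fields: Sequence[str], schema: Sequence[Rule]) -> List[int]:
    """Return indices of columns that break their rule (all indices on a length mismatch)."""
    if len(fields) != len(schema):
        return list(range(len(schema)))
    bad = []
    for i, (value, rule) in enumerate(zip(fields, schema)):
        valid = is_number(value) if rule is None else value in rule
        if not valid:
            bad.append(i)
    return bad


# Three-column demo schema: age, sex, hours-per-week
sex = {"Female": 1, "Male": 2, "?": -1}
schema = [None, set(sex), None]

print(invalid_columns(["39", "Male", "40"], schema))
print(invalid_columns(["abc", "Mle", "40"], schema))
```

A side benefit of this shape is that it reports which columns failed, not just which row, at the cost of having to describe the schema once up front.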
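After that I went straight to training, using the Spark machine learning pipeline concept the teacher covered. With a training set and a validation set in hand, you configure the parameters and it is ready to use. Spark ML is fairly pleasant to work with.
The key question here is how to translate the string-valued attributes into numbers. I chose dictionaries: with predefined dictionaries keyed by the attribute value, each string translates directly into a number. This is efficient and works well. Editing the dictionaries up front is a bit tedious, but overall it causes no real problems.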
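The dictionaries could also be built from the data instead of being typed by hand. A minimal sketch, assuming codes are handed out in first-seen order and that ? keeps the code -1 as in the hand-written dictionaries; build_codebook is a hypothetical helper, not part of day5.py, and the sample column is made up.

```python
from typing import Dict, Iterable


def build_codebook(values: Iterable[str], missing: str = "?") -> Dict[str, int]:
    """Assign 1, 2, 3, ... to distinct values in first-seen order; the missing marker maps to -1."""
    codebook: Dict[str, int] = {missing: -1}
    next_code = 1
    for value in values:
        if value not in codebook:
            codebook[value] = next_code
            next_code += 1
    return codebook


# Made-up sample of the "sex" column
sex_column = ["Male", "Female", "Male", "?", "Female"]
sex_codes = build_codebook(sex_column)
encoded = [float(sex_codes[v]) for v in sex_column]
print(sex_codes)
print(encoded)
```

Because the codes depend on the order in which values first appear, the codebook built from the training data would have to be reused for the test data. A value that only appears in the test set would raise a KeyError, just as it would with the hand-written dictionaries.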
Here is the training code:
import findspark
findspark.init()
from pyspark.ml.classification import DecisionTreeClassificationModel
from pyspark.ml.classification import DecisionTreeClassifier
from pyspark.ml import Pipeline, PipelineModel
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
from pyspark.ml.linalg import Vector, Vectors
from pyspark.sql import Row, SQLContext
from pyspark.ml.feature import IndexToString, StringIndexer, VectorIndexer
from pyspark import SparkContext, SparkConf
# Dictionaries that map each categorical value to a numeric code
work_type = {'Private': 1,
'Self-emp-not-inc': 2,
'Self-emp-inc': 3,
'Federal-gov': 4,
'Local-gov': 5,
'State-gov': 6,
'Without-pay': 7,
'Never-worked': 8,
'?': -1}
education = {'Bachelors': 1,
'Some-college': 2,
'11th': 3,
'HS-grad': 4,
'Prof-school': 5,
'Assoc-acdm': 6,
'Assoc-voc': 7,
'9th': 8,
'7th-8th': 9,
'12th': 10,
'Masters': 11,
'1st-4th': 12,
'10th': 13,
'Doctorate': 14,
'5th-6th': 15,
'Preschool': 16,
'?': -1}
marital_status = {'Married-civ-spouse': 1,
'Divorced': 2,
'Never-married': 3,
'Separated': 4,
'Widowed': 5,
'Married-spouse-absent': 6,
'Married-AF-spouse': 7,
'?': -1}
occupation = {'Tech-support': 1,
'Craft-repair': 2,
'Other-service': 3,
'Sales': 4,
'Exec-managerial': 5,
'Prof-specialty': 6,
'Handlers-cleaners': 7,
'Machine-op-inspct': 8,
'Adm-clerical': 9,
'Farming-fishing': 10,
'Transport-moving': 11,
'Priv-house-serv': 12,
'Protective-serv': 13,
'Armed-Forces': 14,
'?': -1}
relationship = {'Wife': 1,
'Own-child': 2,
'Husband': 3,
'Not-in-family': 4,
'Other-relative': 5,
'Unmarried': 6,
'?': -1}
race = {'White': 1,
'Asian-Pac-Islander': 2,
'Amer-Indian-Eskimo': 3,
'Other': 4,
'Black': 5,
'?': -1}
sex = {'Female': 1,
'Male': 2,
'?': -1}
native_country = {'United-States': 1,
'Cambodia': 2,
'England': 3,
'Puerto-Rico': 4,
'Canada': 5,
'Germany': 6,
'Outlying-US(Guam-USVI-etc)': 7,
'India': 8,
'Japan': 9,
'Greece': 10,
'South': 11,
'China': 12,
'Cuba': 13,
'Iran': 14,
'Honduras': 15,
'Philippines': 16,
'Italy': 17,
'Poland': 18,
'Jamaica': 19,
'Vietnam': 20,
'Mexico': 21,
'Portugal': 22,
'Ireland': 23,
'France': 24,
'Dominican-Republic': 25,
'Laos': 26,
'Ecuador': 27,
'Taiwan': 28,
'Haiti': 29,
'Columbia': 30,
'Hungary': 31,
'Guatemala': 32,
'Nicaragua': 33,
'Scotland': 34,
'Thailand': 35,
'Yugoslavia': 36,
'El-Salvador': 37,
'Trinadad&Tobago': 38,
'Peru': 39,
'Hong': 40,
'Holand-Netherlands': 41,
'?': -1}
def f(x):
    """Turn one split record (a list of 15 string fields) into a features/label dict."""
    rel = {
        'features': Vectors.dense(float(x[0]),
                                  float(work_type[x[1]]),
                                  float(x[2]),
                                  float(education[x[3]]),
                                  float(x[4]),
                                  float(marital_status[x[5]]),
                                  float(occupation[x[6]]),
                                  float(relationship[x[7]]),
                                  float(race[x[8]]),
                                  float(sex[x[9]]),
                                  float(x[10]),
                                  float(x[11]),
                                  float(x[12]),
                                  float(native_country[x[13]])
                                  ),
        'label': str(x[14])}
    return rel
# Initialize Spark
conf = SparkConf().setMaster("local").setAppName("ml")
sc = SparkContext(conf=conf)  # create the SparkContext
# Creating a SQLContext avoids: AttributeError: 'PipelinedRDD' object has no attribute 'toDF'
sqlContext = SQLContext(sc)
data = sc.textFile("adult/adult.data").map(lambda line: line.split(', ')).map(
lambda p: Row(**f(p))).toDF()
labelIndexer = StringIndexer().setInputCol("label").setOutputCol("indexedLabel").fit(data)
featureIndexer = VectorIndexer().setInputCol("features").setOutputCol("indexedFeatures").setMaxCategories(4).fit(data)
labelConverter = IndexToString().setInputCol("prediction").setOutputCol("predictedLabel").setLabels(labelIndexer.labels)
trainingData = data
testData = sc.textFile("adult/adult.test").map(lambda line: line.split(', ')).map(
lambda p: Row(**f(p))).toDF()
dtClassifier = DecisionTreeClassifier().setLabelCol("indexedLabel").setFeaturesCol("indexedFeatures")
dtPipeline = Pipeline().setStages([labelIndexer, featureIndexer, dtClassifier, labelConverter])
dtPipelineModel = dtPipeline.fit(trainingData)
dtPredictions = dtPipelineModel.transform(testData)
dtPredictions.select("predictedLabel", "label", "features").show(20)
# Note: this evaluator's default metric is f1; call .setMetricName("accuracy") to report accuracy
evaluator = MulticlassClassificationEvaluator().setLabelCol("indexedLabel").setPredictionCol("prediction")
dtAccuracy = evaluator.evaluate(dtPredictions)
print(dtAccuracy)
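One detail worth checking is which metric the evaluator reports. As far as I know, MulticlassClassificationEvaluator defaults to metricName f1 (a weighted F1) rather than accuracy, so the printed number may not be plain accuracy. The sketch below, in plain Python with made-up prediction pairs, shows how the two metrics can differ; accuracy and weighted_f1 are illustrative helpers, not Spark APIs.

```python
from collections import Counter
from typing import Sequence, Tuple

Pair = Tuple[str, str]  # (predicted label, actual label)


def accuracy(pairs: Sequence[Pair]) -> float:
    """Fraction of pairs whose predicted label equals the actual label."""
    return sum(1 for pred, actual in pairs if pred == actual) / len(pairs)


def weighted_f1(pairs: Sequence[Pair]) -> float:
    """Per-class F1, averaged with weights equal to each class's share of the actual labels."""
    actual_counts = Counter(actual for _, actual in pairs)
    predicted_counts = Counter(pred for pred, _ in pairs)
    correct_counts = Counter(pred for pred, actual in pairs if pred == actual)
    total = 0.0
    for label, n_actual in actual_counts.items():
        tp = correct_counts[label]
        precision = tp / predicted_counts[label] if predicted_counts[label] else 0.0
        recall = tp / n_actual
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        total += f1 * n_actual / len(pairs)
    return total


# Made-up predictions: 10 records, 7 of them classified correctly
pairs = ([("<=50K", "<=50K")] * 6 + [(">50K", "<=50K")] * 1
         + [("<=50K", ">50K")] * 2 + [(">50K", ">50K")] * 1)
print(accuracy(pairs))
print(weighted_f1(pairs))
```

On this toy data the accuracy is 0.7 while the weighted F1 comes out close to 0.68, so the two numbers should not be used interchangeably when reporting results.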
Then... it failed. Python itself crashed without any error message; all I could tell was that in
data = sc.textFile("adult/adult.data").map(lambda line: line.split(', ')).map(
    lambda p: Row(**f(p))).toDF()
it was the toDF call that failed. That was baffling, since the dataset should be fine. I then suspected a memory overflow, or that the number of rows read exceeded some limit, so I came up with a workaround. On Linux I ran
sudo head -20000 adult.data >test.txt
but this test.txt still could not be read. So I ran
sudo head -10000 adult.data >test2.txt
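and this test2.txt, miraculously, could be read??? The rest of the code ran correctly too. After several attempts I found 16000 rows to be a reasonable size and chose the first 16000 rows as the training set.
At that point I still wondered whether my validation function was at fault, so I also took the last 16000 rows and tested them. That worked without errors as well.
So could it really be such a coincidence that, among these 32651 rows, every single error happens to sit in the middle 651 rows and also happens to go undetected?
On Linux, I used sed -n '16001,24000p' adult.data>fin_data_mid.txt to extract the 8000 rows starting at row 16001, and there was still no error.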
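The same slices can also be cut in Python, which avoids switching to a Linux shell. A small sketch using itertools.islice; take_lines is a hypothetical helper and the sample lines are stand-ins for the real file.

```python
from itertools import islice
from typing import Iterable, List


def take_lines(lines: Iterable[str], start: int, stop: int) -> List[str]:
    """Return lines start..stop, 1-based and inclusive, like sed -n 'start,stopp'."""
    return list(islice(lines, start - 1, stop))


# Stand-in for an open file: ten numbered lines
sample = [f"row {i}\n" for i in range(1, 11)]
print(take_lines(sample, 1, 3))   # comparable to head -3
print(take_lines(sample, 4, 6))   # comparable to sed -n '4,6p'
```

With the real data, an open file object can be passed in place of sample, for example inside a with open("adult/adult.data") as fh: block, because a file object iterates line by line.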
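From this I established two facts:
- adult.data passes the validation function. Moreover, whenever I made a random change to the file and saved it, the validation script day5_pre_process always pointed to the exact place I had changed. I conclude that my validation function is sound.
- The first 16000 rows, the last 16000 rows, and the middle 8000 rows of adult.data can each be used for training, which shows that none of these three slices contains a problem.
My conclusion: adult.data itself is fine, and the problem lies with pyspark.
In the end, after discussing it with the teacher, I trained on part of the data and obtained the final results. This does not count as solving the problem, but it did involve some real thinking.
After some searching I found that people in the Spark community have reported this problem, but it has not been resolved so far... Whether it appears varies from person to person, and sometimes the data can be read and sometimes it cannot, which I find very strange.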
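One further check that might help separate data problems from PySpark problems is to run the row-mapping logic over the raw lines in plain Python, outside Spark. This is a sketch under an assumption the logs do not confirm, namely that the crash is triggered by an exception inside the mapping function. It is worth trying because pandas skips blank lines by default when reading a CSV, so a blank line would pass the pandas-based validation yet still reach the split-and-convert step. find_bad_lines and parse_stub are hypothetical names; parse_stub stands in for f with only three fields.

```python
from typing import Callable, Iterable, List, Tuple


def find_bad_lines(lines: Iterable[str],
                   parse: Callable[[List[str]], object]) -> List[Tuple[int, str]]:
    """Split each line on ', ', run parse on it, and collect (line number, error) for failures."""
    failures = []
    for number, line in enumerate(lines, start=1):
        fields = line.rstrip("\n").split(", ")
        try:
            parse(fields)
        except (ValueError, KeyError, IndexError) as exc:
            failures.append((number, f"{type(exc).__name__}: {exc}"))
    return failures


def parse_stub(fields: List[str]) -> Tuple[float, float, str]:
    """Hypothetical three-field stand-in for f: number, dictionary lookup, label."""
    sex = {"Female": 1, "Male": 2, "?": -1}
    return float(fields[0]), float(sex[fields[1]]), str(fields[2])


# Made-up lines: one valid, one blank, one with a misspelled category, one too short
sample = ["39, Male, <=50K\n", "\n", "41, Mle, >50K\n", "52, Female\n"]
for number, error in find_bad_lines(sample, parse_stub):
    print(number, error)
```

If a check like this reported no failures on the full adult.data, that would support the conclusion that the data is not the cause; if it did report some, those line numbers would be the first place to look.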
Command-line results
Using the first 16000 rows of adult.data as the training set and adult.test as the test set, the run produces the following output
D:\ProgramData\Anaconda3\envs\py36\python.exe D:/code/geoDataWork/day5.py
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
+--------------+-----+--------------------+
|predictedLabel|label| features|
+--------------+-----+--------------------+
| <=50K|<=50K|[25.0,1.0,226802....|
| <=50K|<=50K|[38.0,1.0,89814.0...|
| <=50K| >50K|[28.0,5.0,336951....|
| >50K| >50K|[44.0,1.0,160323....|
| <=50K|<=50K|[18.0,-1.0,103497...|
| <=50K|<=50K|[34.0,1.0,198693....|
| <=50K|<=50K|[29.0,-1.0,227026...|
| >50K| >50K|[63.0,2.0,104626....|
| <=50K|<=50K|[24.0,1.0,369667....|
| <=50K|<=50K|[55.0,1.0,104996....|
| <=50K| >50K|[65.0,1.0,184454....|
| >50K|<=50K|[36.0,4.0,212465....|
| <=50K|<=50K|[26.0,1.0,82091.0...|
| <=50K|<=50K|[58.0,-1.0,299831...|
| <=50K| >50K|[48.0,1.0,279724....|
| >50K| >50K|[43.0,1.0,346189....|
| <=50K|<=50K|[20.0,6.0,444554....|
| <=50K|<=50K|[43.0,1.0,128354....|
| <=50K|<=50K|[37.0,1.0,60548.0...|
| >50K| >50K|[40.0,1.0,85019.0...|
+--------------+-----+--------------------+
only showing top 20 rows
0.8234580616672087
Process finished with exit code 0
Using the last 16000 rows as the training set gives the following output
D:\ProgramData\Anaconda3\envs\py36\python.exe D:/code/geoDataWork/day5.py
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
+--------------+-----+--------------------+
|predictedLabel|label| features|
+--------------+-----+--------------------+
| <=50K|<=50K|[25.0,1.0,226802....|
| <=50K|<=50K|[38.0,1.0,89814.0...|
| <=50K| >50K|[28.0,5.0,336951....|
| >50K| >50K|[44.0,1.0,160323....|
| <=50K|<=50K|[18.0,-1.0,103497...|
| <=50K|<=50K|[34.0,1.0,198693....|
| <=50K|<=50K|[29.0,-1.0,227026...|
| >50K| >50K|[63.0,2.0,104626....|
| <=50K|<=50K|[24.0,1.0,369667....|
| <=50K|<=50K|[55.0,1.0,104996....|
| >50K| >50K|[65.0,1.0,184454....|
| >50K|<=50K|[36.0,4.0,212465....|
| <=50K|<=50K|[26.0,1.0,82091.0...|
| <=50K|<=50K|[58.0,-1.0,299831...|
| <=50K| >50K|[48.0,1.0,279724....|
| >50K| >50K|[43.0,1.0,346189....|
| <=50K|<=50K|[20.0,6.0,444554....|
| <=50K|<=50K|[43.0,1.0,128354....|
| <=50K|<=50K|[37.0,1.0,60548.0...|
| >50K| >50K|[40.0,1.0,85019.0...|
+--------------+-----+--------------------+
only showing top 20 rows
0.8309072931788737
Process finished with exit code 0
It runs, and the result is even slightly better. Now the result for the middle 8000 rows
D:\ProgramData\Anaconda3\envs\py36\python.exe D:/code/geoDataWork/day5.py
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
+--------------+-----+--------------------+
|predictedLabel|label| features|
+--------------+-----+--------------------+
| <=50K|<=50K|[25.0,1.0,226802....|
| <=50K|<=50K|[38.0,1.0,89814.0...|
| <=50K| >50K|[28.0,5.0,336951....|
| >50K| >50K|[44.0,1.0,160323....|
| <=50K|<=50K|[18.0,-1.0,103497...|
| <=50K|<=50K|[34.0,1.0,198693....|
| <=50K|<=50K|[29.0,-1.0,227026...|
| >50K| >50K|[63.0,2.0,104626....|
| <=50K|<=50K|[24.0,1.0,369667....|
| <=50K|<=50K|[55.0,1.0,104996....|
| <=50K| >50K|[65.0,1.0,184454....|
| >50K|<=50K|[36.0,4.0,212465....|
| <=50K|<=50K|[26.0,1.0,82091.0...|
| <=50K|<=50K|[58.0,-1.0,299831...|
| <=50K| >50K|[48.0,1.0,279724....|
| >50K| >50K|[43.0,1.0,346189....|
| <=50K|<=50K|[20.0,6.0,444554....|
| <=50K|<=50K|[43.0,1.0,128354....|
| <=50K|<=50K|[37.0,1.0,60548.0...|
| >50K| >50K|[40.0,1.0,85019.0...|
+--------------+-----+--------------------+
only showing top 20 rows
0.8308268997562765
Process finished with exit code 0
Still no problem
Some screenshots of the runs


