A tutorial on using the rminer R package for data mining tasks

Data mining

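This is the usual workflow of a data mining project:

  1. Business understanding: what is the background, and what is the goal of the problem
  2. Data understanding: which data are available, which of them are relevant, whether there is enough data, and whether the data are correct
  3. Data preprocessing: data cleaning and data transformation, including feature selection
  4. Modeling: build classification or regression models
  5. Model evaluation: how well the model performs, e.g. KS, AUC
  6. Model deployment: put the fitted model to use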



Data processing

Printing the number of rows and columns of the data

# simple show rows x columns function
nelems=function(d) paste(nrow(d),"x",ncol(d))
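
A quick check of the helper, as a minimal sketch. The toy data frame `d` is made up for illustration and is not part of the original post.

```r
# helper from above: returns "rows x columns" as a string
nelems <- function(d) paste(nrow(d), "x", ncol(d))

# hypothetical toy data (3 rows, 2 columns)
d <- data.frame(age = c(25, 40, 31), job = c("admin", "tech", "admin"))

print(nelems(d))  # "3 x 2"
```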

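Handling missing values

# 1. drop every row that contains a missing value
bank4=na.omit(bank3)

# 2. fill missing ages with the mean of the observed ages
meanage=mean(bank3$age,na.rm=TRUE)
bank5=imputation("value",bank3,"age",Value=meanage)

# 3. substitute NA values by the values found in the most similar case (1-nearest neighbor):
bank6=imputation("hotdeck",bank3,"age")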
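
To make the first two options concrete, here is a base R sketch of what they amount to on a tiny data frame. The data are hypothetical (not the bank data used above), and the mean fill is written by hand rather than through `imputation`, so it only illustrates the idea. The hot-deck variant is not reproduced here.

```r
# hypothetical toy data with two missing ages
bank3 <- data.frame(age = c(30, NA, 50, NA, 40),
                    balance = c(100, 200, 300, 400, 500))

# 1. drop every row that contains an NA
bank4 <- na.omit(bank3)

# 2. fill missing ages with the mean of the observed ages
meanage <- mean(bank3$age, na.rm = TRUE)
bank5 <- bank3
bank5$age[is.na(bank5$age)] <- meanage

print(nrow(bank4))  # 3 rows remain
print(bank5$age)    # 30 40 50 40 40
```

Deleting rows shrinks the data set, while mean filling keeps every row but reduces the variance of the filled column, so the choice depends on how many values are missing.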

Modeling

fit: trains a model and tunes its parameters
predict: makes predictions with a fitted model
mining: runs several fit and predict executions, according to the chosen validation method and number of runs
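
The three functions can be tried end to end on a built-in data set. This is a minimal sketch that assumes the rminer package is installed; the `iris` data, the seed, and the hold-out ratio are illustrative choices, not taken from the original post.

```r
library(rminer)

data(iris)
set.seed(123)  # holdout() samples rows at random

# keep 2/3 of the rows for training and 1/3 for testing
H <- holdout(iris$Species, ratio = 2/3)

# fit: train a decision tree on the training rows
M <- fit(Species ~ ., iris[H$tr, ], model = "rpart")

# predict: predictions for the held-out rows
P <- predict(M, iris[H$ts, ])

# accuracy (in %) on the held-out rows
print(mmetric(iris$Species[H$ts], P, metric = "ACC"))

# mining: repeat fit + predict, here 3 runs of 3-fold cross-validation
R <- mining(Species ~ ., iris, model = "rpart",
            Runs = 3, method = c("kfold", 3))
print(mmetric(R, metric = "ACC"))
```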

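library(rminer)
# math is the data frame; inputs and bout select the input and output
# columns (their definitions are not shown here)

# ctree: conditional inference tree
B2=fit(schoolsup~.,math[,c(inputs,bout)],model="ctree")
# rpart: decision tree
B1=fit(schoolsup~.,math[,c(inputs,bout)],model="rpart")
# mlpe: multilayer perceptron ensemble
B3=fit(schoolsup~.,math[,c(inputs,bout)],model="mlpe")
# ksvm: support vector machine
B4=fit(schoolsup~.,math[,c(inputs,bout)],model="ksvm")
# randomForest: random forest, here with Mjob as the output on the cmath data
C3=fit(Mjob~.,cmath,model="randomForest")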

To try a different algorithm, you only need to change the model argument.
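
Because only the `model` string changes, several algorithms can be compared in a loop. The sketch below assumes rminer is installed and uses the built-in `iris` data with a hold-out split; the data set, the seed, and the three model names picked here are illustrative choices.

```r
library(rminer)

data(iris)
set.seed(1)  # holdout() samples rows at random
H <- holdout(iris$Species, ratio = 2/3)

# model names to compare; any classification model from the list works
models <- c("naive", "rpart", "ksvm")

# fit each model on the training rows and score it on the test rows
acc <- sapply(models, function(m) {
  M <- fit(Species ~ ., iris[H$tr, ], model = m)
  P <- predict(M, iris[H$ts, ])
  unname(mmetric(iris$Species[H$ts], P, metric = "ACC"))
})

print(acc)  # accuracy in %, one value per model
```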

Evaluation

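B1=fit(schoolsup~.,math[,c(inputs,bout)],model="rpart")
# note: test is the same data the model was trained on,
# so the metrics below are optimistic
test <- math[,c(inputs,bout)]
# observed values of the response used in the formula
y <- test$schoolsup
P1=predict(B1,test)

# compute every metric that applies to the task
m=mmetric(y,P1,metric="ALL")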

This returns all the available metrics.
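
The workflow at the top names KS and AUC as evaluation measures. As a rough illustration of what the two numbers mean for a binary target, here is a base R sketch that does not depend on rminer. The labels and scores are hypothetical, and `auc_ks` is a helper written for this example only.

```r
# hypothetical labels (1 = positive) and scores (higher = more likely positive)
y     <- c(1, 1, 1, 0, 1, 0, 0, 0)
score <- c(0.9, 0.8, 0.7, 0.6, 0.55, 0.4, 0.3, 0.2)

auc_ks <- function(y, score) {
  pos <- score[y == 1]
  neg <- score[y == 0]
  # AUC via the rank-sum (Mann-Whitney) identity; ties get average ranks
  r <- rank(score)
  auc <- (sum(r[y == 1]) - length(pos) * (length(pos) + 1) / 2) /
    (length(pos) * length(neg))
  # KS: largest gap between the score distributions of the two classes
  thr <- sort(unique(score))
  ks <- max(abs(ecdf(pos)(thr) - ecdf(neg)(thr)))
  c(AUC = auc, KS = ks)
}

print(auc_ks(y, score))  # AUC = 0.9375, KS = 0.75
```

Here 15 of the 16 positive/negative pairs are ranked correctly, which gives the AUC of 0.9375, and the largest gap between the two cumulative distributions is 0.75.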

Models available through the model argument (see the help page of fit):

  • naive – most common class (classification) or mean output value (regression)

  • ctree – conditional inference tree (classification and regression, uses ctree from party package)

  • cv.glmnet – generalized linear model with lasso or elasticnet regularization (classification and regression, uses cv.glmnet from glmnet package; note: cross-validation is used to automatically set the lambda parameter that is needed to compute the predictions)

  • rpart or dt – decision tree (classification and regression, uses rpart from rpart package)

  • kknn or knn – k-nearest neighbor (classification and regression, uses kknn from kknn package)

  • ksvm or svm – support vector machine (classification and regression, uses ksvm from kernlab package)

  • mlp – multilayer perceptron with one hidden layer (classification and regression, uses nnet from nnet package)

  • mlpe – multilayer perceptron ensemble (classification and regression, uses nnet from nnet package)

  • randomForest or randomforest – random forest algorithm (classification and regression, uses randomForest from randomForest package)

  • xgboost – eXtreme Gradient Boosting (Tree) (classification and regression, uses xgboost from xgboost package; note: nrounds parameter is set by default to 2)

  • bagging – bagging (classification, uses bagging from adabag package)

  • boosting – boosting (classification, uses boosting from adabag package)

  • lda – linear discriminant analysis (classification, uses lda from MASS package)

  • multinom or lr – logistic regression (classification, uses multinom from nnet package)

  • naiveBayes or naivebayes – naive bayes (classification, uses naiveBayes from e1071 package)

  • qda – quadratic discriminant analysis (classification, uses qda from MASS package)

  • cubist – M5 rule-based model (regression, uses cubist from Cubist package)

  • lm – standard multiple/linear regression (uses lm)

  • mr – multiple regression (regression, equivalent to lm but uses nnet from nnet package with zero hidden nodes and linear output function)

  • mars – multivariate adaptive regression splines (regression, uses mars from mda package)

  • pcr – principal component regression (regression, uses pcr from pls package)

  • plsr – partial least squares regression (regression, uses plsr from pls package)

  • cppls – canonical powered partial least squares (regression, uses cppls from pls package)

  • rvm – relevance vector machine (regression, uses rvm from kernlab package)

Further reading:

https://repositorium.sdum.uminho.pt/bitstream/1822/36210/1/rminer-tutorial.pdf
