一、 特征選擇的幾個(gè)常見(jiàn)問(wèn)題
為什么?
(1)降低維度,選擇重要的特征,避免維度災(zāi)難,降低計(jì)算成本
(2)去除不相關(guān)的冗余特征(噪聲)來(lái)降低學(xué)習(xí)的難度,去除噪聲的干擾,留下關(guān)鍵因素,提高預(yù)測(cè)精度
(3)獲得更多有物理意義的,有價(jià)值的特征不同模型有不同的特征適用類(lèi)型?
(1)lr模型適用于擬合離散特征(見(jiàn)附錄)
(2)gbdt模型適用于擬合連續(xù)數(shù)值特征
(3)一般說(shuō)來(lái),特征具有較大的方差說(shuō)明蘊(yùn)含較多信息,也是比較有價(jià)值的特征特征子集的搜索:
(1)子集搜索問(wèn)題。
比如逐漸添加相關(guān)特征(前向forward搜索)或逐漸去掉無(wú)關(guān)特征(后向backward搜索),還有雙向搜索。
缺點(diǎn)是,該策略為貪心算法,本輪最優(yōu)并不一定是全局最優(yōu),若不能窮舉搜索,則無(wú)法避免該問(wèn)題。
該子集搜索策略屬于最大相關(guān)(maximum-relevance)的選擇策略。
(2)特征子集評(píng)價(jià)與度量。
信息增益,交叉熵,相關(guān)性,余弦相似度等評(píng)級(jí)準(zhǔn)則。-
典型的特征選擇方法
二、從決策樹(shù)模型的特征重要性說(shuō)起
決策樹(shù)可以看成是前向搜索與信息熵相結(jié)合的算法,樹(shù)節(jié)點(diǎn)的劃分屬性所組成的集合就是選擇出來(lái)的特征子集。
(1)決策樹(shù)劃分屬性的依據(jù)

(2)通過(guò)gini不純度計(jì)算特征重要性
不管是scikit-learn還是mllib,其中的隨機(jī)森林和gbdt算法都是基于決策樹(shù)算法,一般的,都是使用了cart樹(shù)算法,通過(guò)gini指數(shù)來(lái)計(jì)算特征的重要性的。
比如scikit-learn的sklearn.feature_selection.SelectFromModel可以實(shí)現(xiàn)根據(jù)特征重要性分支進(jìn)行特征的轉(zhuǎn)換。
>>> from sklearn.ensemble import ExtraTreesClassifier
>>> from sklearn.datasets import load_iris
>>> from sklearn.feature_selection import SelectFromModel
>>> iris = load_iris()
>>> X, y = iris.data, iris.target
>>> X.shape
(150, 4)
>>> clf = ExtraTreesClassifier()
>>> clf = clf.fit(X, y)
>>> clf.feature_importances_
array([ 0.04..., 0.05..., 0.4..., 0.4...])
>>> model = SelectFromModel(clf, prefit=True)
>>> X_new = model.transform(X)
>>> X_new.shape
(150, 2)
(3)mllib中集成學(xué)習(xí)算法計(jì)算特征重要性的源碼
在spark 2.0之后,mllib的決策樹(shù)算法都引入了計(jì)算特征重要性的方法featureImportances,而隨機(jī)森林算法(RandomForestRegressionModel和RandomForestClassificationModel類(lèi))和gbdt算法(GBTClassificationModel和GBTRegressionModel類(lèi))均利用決策樹(shù)算法中計(jì)算特征不純度和特征重要性的方法來(lái)得到所使用模型的特征重要性。
而這些集成方法的實(shí)現(xiàn)類(lèi)都集成了TreeEnsembleModel[M <: DecisionTreeModel]這個(gè)特質(zhì)(trait),即featureImportances是在該特質(zhì)中實(shí)現(xiàn)的。
featureImportances方法的基本計(jì)算思路是:
- 針對(duì)每一棵決策樹(shù)而言,特征j的重要性指標(biāo)為所有通過(guò)特征j進(jìn)行劃分的樹(shù)結(jié)點(diǎn)的增益的和
- 將一棵樹(shù)的特征重要性歸一化到1
- 將集成模型的特征重要性向量歸一化到1
以下是源碼分析:
def featureImportances[M <: DecisionTreeModel](trees: Array[M], numFeatures: Int): Vector = {
val totalImportances = new OpenHashMap[Int, Double]()
// 針對(duì)每一棵決策樹(shù)模型進(jìn)行遍歷
trees.foreach { tree =>
// Aggregate feature importance vector for this tree
val importances = new OpenHashMap[Int, Double]()
// 從根節(jié)點(diǎn)開(kāi)始,遍歷整棵樹(shù)的中間節(jié)點(diǎn),將同一特征的特征重要性累加起來(lái)
computeFeatureImportance(tree.rootNode, importances)
// Normalize importance vector for this tree, and add it to total.
// TODO: In the future, also support normalizing by tree.rootNode.impurityStats.count?
// 將一棵樹(shù)的特征重要性進(jìn)行歸一化
val treeNorm = importances.map(_._2).sum
if (treeNorm != 0) {
importances.foreach { case (idx, impt) =>
val normImpt = impt / treeNorm
totalImportances.changeValue(idx, normImpt, _ + normImpt)
}
}
}
// Normalize importances
// 歸一化總體的特征重要性
normalizeMapValues(totalImportances)
// Construct vector
// 構(gòu)建最終輸出的特征重要性向量
val d = if (numFeatures != -1) {
numFeatures
} else {
// Find max feature index used in trees
val maxFeatureIndex = trees.map(_.maxSplitFeatureIndex()).max
maxFeatureIndex + 1
}
if (d == 0) {
assert(totalImportances.size == 0, s"Unknown error in computing feature" +
s" importance: No splits found, but some non-zero importances.")
}
val (indices, values) = totalImportances.iterator.toSeq.sortBy(_._1).unzip
Vectors.sparse(d, indices.toArray, values.toArray)
}
其中computeFeatureImportance方法為:
// 這是計(jì)算一棵決策樹(shù)特征重要性的遞歸方法
def computeFeatureImportance(
node: Node,
importances: OpenHashMap[Int, Double]): Unit = {
node match {
// 如果是中間節(jié)點(diǎn),即進(jìn)行特征劃分的節(jié)點(diǎn)
case n: InternalNode =>
// 得到特征標(biāo)記
val feature = n.split.featureIndex
// 計(jì)算得到比例化的特征增益值,信息增益乘上該節(jié)點(diǎn)使用的訓(xùn)練數(shù)據(jù)數(shù)量
val scaledGain = n.gain * n.impurityStats.count
importances.changeValue(feature, scaledGain, _ + scaledGain)
// 前序遍歷二叉決策樹(shù)
computeFeatureImportance(n.leftChild, importances)
computeFeatureImportance(n.rightChild, importances)
case n: LeafNode =>
// do nothing
}
}
(4)mllib中決策樹(shù)算法計(jì)算特征不純度的源碼
InternalNode類(lèi)使用ImpurityCalculator類(lèi)的私有實(shí)例impurityStats來(lái)記錄不純度的信息和狀態(tài),具體使用哪一種劃分方式通過(guò)getCalculator方法來(lái)進(jìn)行選擇:
def getCalculator(impurity: String, stats: Array[Double]): ImpurityCalculator = {
impurity match {
case "gini" => new GiniCalculator(stats)
case "entropy" => new EntropyCalculator(stats)
case "variance" => new VarianceCalculator(stats)
case _ =>
throw new IllegalArgumentException(
s"ImpurityCalculator builder did not recognize impurity type: $impurity")
}
}
以gini指數(shù)為例,其信息計(jì)算的代碼如下:
@Since("1.1.0")
@DeveloperApi
override def calculate(counts: Array[Double], totalCount: Double): Double = {
if (totalCount == 0) {
return 0
}
val numClasses = counts.length
var impurity = 1.0
var classIndex = 0
while (classIndex < numClasses) {
val freq = counts(classIndex) / totalCount
impurity -= freq * freq
classIndex += 1
}
impurity
}
以上源碼解讀即是從集成方法來(lái)計(jì)算特征重要性到?jīng)Q策樹(shù)算法具體計(jì)算節(jié)點(diǎn)特征不純度方法的過(guò)程。
三、最大相關(guān)最小冗余(mRMR)算法
(1)互信息
互信息可以看成是一個(gè)隨機(jī)變量中包含的關(guān)于另一個(gè)隨機(jī)變量的信息量,或者說(shuō)是一個(gè)隨機(jī)變量由于已知另一個(gè)隨機(jī)變量而減少的不確定性?;バ畔⒈緛?lái)是信息論中的一個(gè)概念,用于表示信息之間的關(guān)系, 是兩個(gè)隨機(jī)變量統(tǒng)計(jì)相關(guān)性的測(cè)度。


(2)mRMR
之所以出現(xiàn)mRMR算法來(lái)進(jìn)行特征選擇,主要是為了解決通過(guò)最大化特征與目標(biāo)變量的相關(guān)關(guān)系度量得到的最好的m個(gè)特征,并不一定會(huì)得到最好的預(yù)測(cè)精度的問(wèn)題。
前面介紹的評(píng)價(jià)特征的方法基本都是基于是否與目標(biāo)變量具有強(qiáng)相關(guān)性的特征,但是這些特征里還可能包含一些冗余特征(比如目標(biāo)變量是立方體體積,特征為底面的長(zhǎng)度、底面的寬度、底面的面積,其實(shí)底面的面積可以由長(zhǎng)與寬得到,所以可被認(rèn)為是一種冗余信息),mRMR算法就是用來(lái)在保證最大相關(guān)性的同時(shí),又去除了冗余特征的方法,相當(dāng)于得到了一組“最純凈”的特征子集(特征之間差異很大,而同目標(biāo)變量的相關(guān)性也很大)。
作為一個(gè)特例,變量之間的相關(guān)性(correlation)可以用統(tǒng)計(jì)學(xué)的依賴(lài)關(guān)系(dependency)來(lái)替代,而互信息(mutual information)是一種評(píng)價(jià)該依賴(lài)關(guān)系的度量方法。
mRMR可認(rèn)為是最大化特征子集的聯(lián)合分布與目標(biāo)變量之間依賴(lài)關(guān)系的一種近似。
mRMR本身還是屬于filter型特征選擇方法。

可以通過(guò)max(V-W)或max(V/W)來(lái)統(tǒng)籌考慮相關(guān)性和冗余性,作為特征評(píng)價(jià)的標(biāo)準(zhǔn)。

(3)mRMR的spark實(shí)現(xiàn)源碼
mRMR算法包含幾個(gè)步驟:
- 將數(shù)據(jù)進(jìn)行處理轉(zhuǎn)換的過(guò)程(注:為了計(jì)算兩個(gè)特征的聯(lián)合分布和邊緣分布,需要將數(shù)據(jù)歸一化到[0,255]之間,并且將每一維特征使用合理的數(shù)據(jù)結(jié)構(gòu)進(jìn)行存儲(chǔ))
- 計(jì)算特征之間、特征與響應(yīng)變量之間的分布及互信息
- 對(duì)特征進(jìn)行mrmr得分,并進(jìn)行排序

private[feature] def run(
data: RDD[LabeledPoint],
nToSelect: Int,
numPartitions: Int) = {
val nPart = if(numPartitions == 0) data.context.getConf.getInt(
"spark.default.parallelism", 500) else numPartitions
val requireByteValues = (l: Double, v: Vector) => {
val values = v match {
case SparseVector(size, indices, values) =>
values
case DenseVector(values) =>
values
}
val condition = (value: Double) => value <= 255 &&
value >= 0
if (!values.forall(condition(_)) || !condition(l)) {
throw new SparkException(s"Info-Theoretic Framework requires positive values in range [0, 255]")
}
}
val nAllFeatures = data.first.features.size + 1
// 將數(shù)據(jù)排列成欄狀,其實(shí)是為每個(gè)數(shù)據(jù)都編上號(hào)
val columnarData: RDD[(Long, Short)] = data.zipWithIndex().flatMap ({
case (LabeledPoint(label, values: SparseVector), r) =>
requireByteValues(label, values)
// Not implemented yet!
throw new NotImplementedError()
case (LabeledPoint(label, values: DenseVector), r) =>
requireByteValues(label, values)
val rindex = r * nAllFeatures
val inputs = for(i <- 0 until values.size) yield (rindex + i, values(i).toShort)
val output = Array((rindex + values.size, label.toShort))
inputs ++ output
}).sortByKey(numPartitions = nPart) // put numPartitions parameter
columnarData.persist(StorageLevel.MEMORY_AND_DISK_SER)
require(nToSelect < nAllFeatures)
// 計(jì)算mrmr過(guò)程及對(duì)特征進(jìn)行排序
val selected = selectFeatures(columnarData, nToSelect, nAllFeatures)
columnarData.unpersist()
// Print best features according to the mRMR measure
val out = selected.map{case F(feat, rel) => (feat + 1) + "\t" + "%.4f".format(rel)}.mkString("\n")
println("\n*** mRMR features ***\nFeature\tScore\n" + out)
// Features must be sorted
new SelectorModel(selected.map{case F(feat, rel) => feat}.sorted.toArray)
}
下面是基于互信息及mrmr的特征選擇過(guò)程:
/**
* Perform a info-theory selection process.
*
* @param data Columnar data (last element is the class attribute).
* @param nToSelect Number of features to select.
* @param nFeatures Number of total features in the dataset.
* @return A list with the most relevant features and its scores.
*
*/
private[feature] def selectFeatures(
data: RDD[(Long, Short)],
nToSelect: Int,
nFeatures: Int) = {
// 特征的下標(biāo)
val label = nFeatures - 1
// 因?yàn)閐ata是(編號(hào),每個(gè)特征),所以這是數(shù)據(jù)數(shù)量
val nInstances = data.count() / nFeatures
// 將同一類(lèi)特征放在一起,根據(jù)同一key進(jìn)行分組,然后取出最大值加1(用于后續(xù)構(gòu)建分布直方圖的參數(shù))
val counterByKey = data.map({ case (k, v) => (k % nFeatures).toInt -> v})
.distinct().groupByKey().mapValues(_.max + 1).collectAsMap().toMap
// calculate relevance
val MiAndCmi = IT.computeMI(
data, 0 until label, label, nInstances, nFeatures, counterByKey)
// 互信息池,用于mrmr判定,pool是(feat, Mrmr)
var pool = MiAndCmi.map{case (x, mi) => (x, new MrmrCriterion(mi))}
.collectAsMap()
// Print most relevant features
// Print most relevant features
val strRels = MiAndCmi.collect().sortBy(-_._2)
.take(nToSelect)
.map({case (f, mi) => (f + 1) + "\t" + "%.4f" format mi})
.mkString("\n")
println("\n*** MaxRel features ***\nFeature\tScore\n" + strRels)
// get maximum and select it
// 得到了分?jǐn)?shù)最高的那個(gè)特征及其mrmr
val firstMax = pool.maxBy(_._2.score)
var selected = Seq(F(firstMax._1, firstMax._2.score))
// 將firstMax對(duì)應(yīng)的key從pool這個(gè)map中去掉
pool = pool - firstMax._1
while (selected.size < nToSelect) {
// update pool
val newMiAndCmi = IT.computeMI(data, pool.keys.toSeq,
selected.head.feat, nInstances, nFeatures, counterByKey)
.map({ case (x, crit) => (x, crit) })
.collectAsMap()
pool.foreach({ case (k, crit) =>
// 從pool里拿出第k個(gè)特征,然后從newMiAndCmi中得到對(duì)應(yīng)的mi
newMiAndCmi.get(k) match {
case Some(_) => crit.update(_)
case None =>
}
})
// get maximum and save it
val max = pool.maxBy(_._2.score)
// select the best feature and remove from the whole set of features
selected = F(max._1, max._2.score) +: selected
pool = pool - max._1
}
selected.reverse
}
具體計(jì)算互信息的代碼如下:
/**
* Method that calculates mutual information (MI) and conditional mutual information (CMI)
* simultaneously for several variables. Indexes must be disjoint.
*
* @param rawData RDD of data (first element is the class attribute)
* @param varX Indexes of primary variables (must be disjoint with Y and Z)
* @param varY Indexes of secondary variable (must be disjoint with X and Z)
* @param nInstances Number of instances
* @param nFeatures Number of features (including output ones)
* @return RDD of (primary var, (MI, CMI))
*
*/
def computeMI(
rawData: RDD[(Long, Short)],
varX: Seq[Int],
varY: Int,
nInstances: Long,
nFeatures: Int,
counter: Map[Int, Int]) = {
// Pre-requisites
require(varX.size > 0)
// Broadcast variables
val sc = rawData.context
val label = nFeatures - 1
// A boolean vector that indicates the variables involved on this computation
// 對(duì)應(yīng)每個(gè)數(shù)據(jù)不同維度的特征的一個(gè)boolean數(shù)組
val fselected = Array.ofDim[Boolean](nFeatures)
fselected(varY) = true // output feature
varX.map(fselected(_) = true) // 將fselected置為true
val bFeatSelected = sc.broadcast(fselected)
val getFeat = (k: Long) => (k % nFeatures).toInt
// Filter data by these variables
// 根據(jù)bFeatSelected來(lái)過(guò)濾rawData
val data = rawData.filter({ case (k, _) => bFeatSelected.value(getFeat(k))})
// Broadcast Y vector
val yCol: Array[Short] = if(varY == label){
// classCol corresponds with output attribute, which is re-used in the iteration
classCol = data.filter({ case (k, _) => getFeat(k) == varY}).values.collect()
classCol
} else {
data.filter({ case (k, _) => getFeat(k) == varY}).values.collect()
}
// data是所有選擇維度的特征,(varY, yCol)是y所在的列和y值數(shù)組
// 生成特征與y的對(duì)應(yīng)關(guān)系的直方圖
val histograms = computeHistograms(data, (varY, yCol), nFeatures, counter)
// 這里只是對(duì)數(shù)據(jù)規(guī)約成占比的特征和目標(biāo)變量的聯(lián)合分布
val jointTable = histograms.mapValues(_.map(_.toFloat / nInstances))
// sum(h(*, ::))計(jì)算每一行數(shù)據(jù)之和
val marginalTable = jointTable.mapValues(h => sum(h(*, ::)).toDenseVector)
// If y corresponds with output feature, we save for CMI computation
if(varY == label) {
marginalProb = marginalTable.cache()
jointProb = jointTable.cache()
}
val yProb = marginalTable.lookup(varY)(0)
// Remove output feature from the computations
val fdata = histograms.filter{case (k, _) => k != label}
// fdata是特征與y的聯(lián)合分布,yProb是一個(gè)值
computeMutualInfo(fdata, yProb, nInstances)
}
計(jì)算數(shù)據(jù)分布直方圖的方法:
private def computeHistograms(
data: RDD[(Long, Short)],
yCol: (Int, Array[Short]),
nFeatures: Long,
counter: Map[Int, Int]) = {
val maxSize = 256
val byCol = data.context.broadcast(yCol._2)
val bCounter = data.context.broadcast(counter)
// 得到y(tǒng)的最大值
val ys = counter.getOrElse(yCol._1, maxSize).toInt
// mapPartitions是對(duì)rdd每個(gè)分區(qū)進(jìn)行操作,it為分區(qū)迭代器
// map得到的是(feature, matrix)的Map
data.mapPartitions({ it =>
var result = Map.empty[Int, BDM[Long]]
for((k, x) <- it) {
val feat = (k % nFeatures).toInt; val inst = (k / nFeatures).toInt
// 取得具體特征的最大值
val xs = bCounter.value.getOrElse(feat, maxSize).toInt
val m = result.getOrElse(feat, BDM.zeros[Long](xs, ys)) // 創(chuàng)建(xMax,yMax)的矩陣
m(x, byCol.value(inst)) += 1
result += feat -> m
}
result.toIterator
}).reduceByKey(_ + _)
}
計(jì)算互信息的公式:
private def computeMutualInfo(
data: RDD[(Int, BDM[Long])],
yProb: BDV[Float],
n: Long) = {
val byProb = data.context.broadcast(yProb)
data.mapValues({ m =>
var mi = 0.0d
// Aggregate by row (x)
val xProb = sum(m(*, ::)).map(_.toFloat / n)
for(i <- 0 until m.rows){
for(j <- 0 until m.cols){
val pxy = m(i, j).toFloat / n
val py = byProb.value(j); val px = xProb(i)
if(pxy != 0 && px != 0 && py != 0) // To avoid NaNs
// I(x,y) = sum[p(x,y)log(p(x,y)/(p(x)p(y)))]
mi += pxy * (math.log(pxy / (px * py)) / math.log(2))
}
}
mi
})
}
附錄
邏輯回歸模型與離散特征
將連續(xù)特征離散化(像是獨(dú)熱編碼之類(lèi)的技巧)交給邏輯回歸模型,這樣做的優(yōu)勢(shì)有以下幾點(diǎn):
- 稀疏向量?jī)?nèi)積乘法運(yùn)算速度快,計(jì)算結(jié)果方便存儲(chǔ),容易擴(kuò)展。
- 離散化后的特征對(duì)異常數(shù)據(jù)有很強(qiáng)的魯棒性:比如一個(gè)特征是年齡>30是1,否則0。如果特征沒(méi)有離散化,一個(gè)異常數(shù)據(jù)“年齡300歲”會(huì)給模型造成很大的干擾。
- 邏輯回歸屬于廣義線(xiàn)性模型,表達(dá)能力受限;單變量離散化為N個(gè)后,每個(gè)變量有單獨(dú)的權(quán)重,相當(dāng)于為模型引入了非線(xiàn)性,能夠提升模型表達(dá)能力,加大擬合。
- 離散化后可以進(jìn)行特征交叉,由M+N個(gè)變量變?yōu)镸*N個(gè)變量,進(jìn)一步引入非線(xiàn)性,提升表達(dá)能力。
- 特征離散化后,模型會(huì)更穩(wěn)定,比如如果對(duì)用戶(hù)年齡離散化,20-30作為一個(gè)區(qū)間,不會(huì)因?yàn)橐粋€(gè)用戶(hù)年齡長(zhǎng)了一歲就變成一個(gè)完全不同的人。當(dāng)然處于區(qū)間相鄰處的樣本會(huì)剛好相反,所以怎么劃分區(qū)間是門(mén)學(xué)問(wèn)。
李沐少帥指出,模型是使用離散特征還是連續(xù)特征,其實(shí)是一個(gè)“海量離散特征+簡(jiǎn)單模型” 同 “少量連續(xù)特征+復(fù)雜模型”的權(quán)衡。既可以離散化用線(xiàn)性模型,也可以用連續(xù)特征加深度學(xué)習(xí)。
參考資料
- 機(jī)器學(xué)習(xí),周志華
- 互信息的理解 http://www.fuzihao.org/blog/2015/01/17/%E4%BA%92%E4%BF%A1%E6%81%AF%E7%9A%84%E7%90%86%E8%A7%A3/
- Feature Selection Based on Mutual Information: Criteria of Max-Dependency, Max-Relevance, and Min-Redundancy
- An improved implementation of the classical feature selection method: minimum Redundancy and Maximum Relevance (mRMR). https://github.com/sramirez/fast-mRMR
轉(zhuǎn)載請(qǐng)注明作者Jason Ding及其出處
jasonding.top
Github博客主頁(yè)(http://blog.jasonding.top/)
CSDN博客(http://blog.csdn.net/jasonding1354)
簡(jiǎn)書(shū)主頁(yè)(http://www.itdecent.cn/users/2bd9b48f6ea8/latest_articles)
Google搜索jasonding1354進(jìn)入我的博客主頁(yè)

