Python Development: Feature Engineering Code Templates (Part 1)

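As an algorithm engineer, I get no fewer business requests than data analysis and mining engineers do. I'm also lazy, and I can't stand mechanically repeating the same preprocessing work over and over. So over the past few days I've been polishing some routine, general-purpose preprocessing code that can simply be imported before each analysis, which saves us from redoing the same things every time.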

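If there is a preprocessing step you'd like to have but can't be bothered to implement, feel free to message me. I have a fair amount of spare time (low pay, light workload, shrug), and once it's done I'll post it here for everyone to use.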

We need to load the following packages first. All subsequent operations are carried out on DataFrames, so the data must be converted to a DataFrame before anything else.

from __future__ import division
import numpy as np
import pandas as pd
from sklearn import preprocessing
from sklearn.model_selection import train_test_split  # sklearn.cross_validation was removed in scikit-learn 0.20
from sklearn.neighbors import NearestNeighbors

__author__ = 'slade_sal'
__time__ = '20171128'


def change_data_format(data):
    # All preprocessing below operates on DataFrames
    data_new = pd.DataFrame(data)
    return data_new

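Now to the main topic. First, we need to find the columns with too many missing values. When missing values make up more than 40% of a column's rows (a rule of thumb), the column carries little information, so we drop every column whose missing rate exceeds a threshold (rate_base), as follows: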


# Drop features with too many missing values
def nan_remove(data, rate_base=0.4):
    all_cnt = data.shape[0]
    available_index = []
    # For each feature column, compute the share of NaNs; a column whose NaN rate
    # exceeds rate_base is treated as unusable and dropped
    for i in range(data.shape[1]):
        # isnull() also works on non-numeric columns, where np.isnan raises a TypeError
        rate = data.iloc[:, i].isnull().sum() / all_cnt
        if rate <= rate_base:
            available_index.append(i)
    data_available = data.iloc[:, available_index]
    return data_available, available_index
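
As a cross-check on the loop above, the same filter can be sketched in vectorized form. This is only a minimal sketch, not part of the original template: the helper name `nan_remove_vectorized` and the toy frame are made up for illustration, and it assumes missing values are stored as NaN/None.

```python
import numpy as np
import pandas as pd


def nan_remove_vectorized(data, rate_base=0.4):
    # Hypothetical helper: share of missing values per column, computed in one pass
    nan_rate = data.isnull().mean()
    # Positions of the columns whose missing rate does not exceed rate_base
    keep = np.flatnonzero((nan_rate <= rate_base).values)
    return data.iloc[:, keep], keep.tolist()


toy = pd.DataFrame({
    'a': [1.0, 2.0, 3.0, 4.0, 5.0],           # 0% missing, kept
    'b': [np.nan, np.nan, np.nan, 1.0, 2.0],  # 60% missing, dropped
    'c': [1.0, np.nan, 3.0, 4.0, 5.0],        # 20% missing, kept
})
kept, kept_index = nan_remove_vectorized(toy)
print(list(kept.columns), kept_index)
```

Returning the kept positions alongside the frame mirrors the template's return value, so the same column selection can be replayed on a test set.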

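With the sparse columns gone, the next step is to deal with extreme outliers. Two caveats:

  • Never use this step in anomaly-detection scenarios such as credit card scoring or crawler detection. If you cap the outliers, how would you still separate out the anomalies?
  • It is not recommended for algorithms that already tolerate outliers well. An SVM, for example, relies on its support vectors: capping changes the original distribution, while the points that determine the hyperplane are largely unrelated to the outliers, so the downstream model may produce unexpected surprises.

This step uses box-plot capping and a fixed-percentile cutoff, and I suggest combining the two. With method='self_def', changed_feature_box lists the indices of the columns to cap with the box-plot rule. The code is as follows: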

# Cap outliers
def outlier_remove(data, limit_value=10, method='box', percentile_limit_set=90, changed_feature_box=None):
    # limit_value is the minimum number of distinct values a column needs to be processed;
    # columns with fewer distinct values are treated as one-hot candidates and skipped
    data = data.copy()  # work on a copy to avoid chained-assignment problems
    changed_feature_box = changed_feature_box or []
    feature_cnt = data.shape[1]
    feature_change = []

    def box_top(col):
        # q3 + 1.5 * (q3 - q1) is the upper fence of a box plot.
        # nanpercentile ignores NaNs, which have not been filled yet at this stage.
        q1 = np.nanpercentile(col, 25)
        q3 = np.nanpercentile(col, 75)
        return q3 + 1.5 * (q3 - q1)

    for i in range(feature_cnt):
        col = data.iloc[:, i]
        if len(col.drop_duplicates()) < limit_value:
            continue
        if method == 'box' or (method == 'self_def' and i in changed_feature_box):
            # 'box' caps every column with the box-plot rule; under 'self_def' only
            # the columns listed in changed_feature_box use it
            top = box_top(col)
        elif method == 'self_def':
            # Quick truncation: all remaining columns are cut at the
            # percentile_limit_set percentile
            top = np.nanpercentile(col, percentile_limit_set)
        else:
            continue
        data.iloc[(col > top).values, i] = top
        feature_change.append(i)
    # Returned for every branch (the original returned None when
    # method='self_def' was used with an empty changed_feature_box)
    return data, feature_change
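
To make the two cutoffs concrete, here is a small worked example on a single toy series. It is a sketch under my own assumptions (the data and the variable names are invented), and it uses `Series.clip` for the capping rather than the masking used in the template.

```python
import numpy as np
import pandas as pd

values = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 100.0])

# Box-plot rule: upper fence = q3 + 1.5 * IQR
q1, q3 = np.percentile(values, [25, 75])
box_top = q3 + 1.5 * (q3 - q1)          # 14.5 for this toy series

# Fixed-percentile rule: cut everything above the 90th percentile
pct_top = np.percentile(values, 90)

capped_box = values.clip(upper=box_top)
capped_pct = values.clip(upper=pct_top)
print(box_top, pct_top)
print(capped_box.max(), capped_pct.max())
```

On a skewed column like this one the two rules give different ceilings, which is one reason to choose the rule per column instead of applying a single rule everywhere.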

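After that, we fill in the missing values. There are many ways to do this; what I implement here is the basics, split by feature type: continuous features are filled with the mean, min, or max, and categorical features are one-hot encoded. Beyond that you could do stratified filling, difference-based filling, and so on, but those are fairly bespoke. I can put a set together if there is demand, though I personally don't think it adds much.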

# Fill missing values
def nan_fill(data, limit_value=10, countinuous_dealed_method='mean'):
    if countinuous_dealed_method not in ('mean', 'max', 'min'):
        # The original silently dropped the column for any other value
        raise ValueError("countinuous_dealed_method must be 'mean', 'max' or 'min'")
    feature_cnt = data.shape[1]
    normal_index = []
    continuous_feature_index = []
    class_feature_index = []
    continuous_parts = []
    class_parts = []
    # A column with NaNs and fewer than limit_value distinct values is treated as a
    # class feature and one-hot encoded (rows with NaN get all-zero dummies);
    # a column with NaNs and at least limit_value distinct values is treated as a
    # continuous feature and filled with its mean, max or min
    for i in range(feature_cnt):
        col = data.iloc[:, i]
        if col.isnull().sum() > 0:
            if len(col.drop_duplicates()) >= limit_value:
                fill_value = getattr(col, countinuous_dealed_method)()
                continuous_parts.append(col.fillna(fill_value))
                continuous_feature_index.append(i)
            else:
                class_parts.append(pd.get_dummies(col, prefix=data.columns[i]))
                class_feature_index.append(i)
        else:
            normal_index.append(i)
    # Note: the column order changes to untouched, then filled, then encoded columns
    data_update = pd.concat([data.iloc[:, normal_index]] + continuous_parts + class_parts, axis=1)
    return data_update
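
The two branches of the fill step behave quite differently, which a toy example makes easier to see. This is a sketch on invented data (the column names `income` and `city` are hypothetical); the `dummy_na=True` variant is an option pandas offers, not something the template uses.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    'income': [10.0, np.nan, 30.0, 20.0],
    'city': ['a', 'b', np.nan, 'a'],
})

# Continuous branch: replace NaN with the column mean (20.0 here)
income_filled = toy['income'].fillna(toy['income'].mean())

# Categorical branch: by default the row with NaN simply gets all-zero dummies
city_dummies = pd.get_dummies(toy['city'], prefix='city')

# Optional variant: keep "missing" as its own indicator column
city_dummies_na = pd.get_dummies(toy['city'], prefix='city', dummy_na=True)

print(income_filled.tolist())
print(city_dummies.astype(int))
print(city_dummies_na.astype(int))
```

Whether an all-zero row or an explicit missing indicator is preferable depends on the model; the indicator keeps the information that the value was absent.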

Next comes one-hot encoding of the categorical features. This is standard practice, so there is not much to add.

# One-hot encoding
def ohe(data, limit_value=10):
    feature_cnt = data.shape[1]
    class_index = []
    class_parts = []
    normal_index = []
    # Columns with fewer than limit_value distinct values are treated as class
    # features and one-hot encoded. Note that this also applies to a
    # low-cardinality label column if it is still part of data.
    for i in range(feature_cnt):
        col = data.iloc[:, i]
        if len(col.drop_duplicates()) < limit_value:
            class_index.append(i)
            class_parts.append(pd.get_dummies(col, prefix=data.columns[i]))
        else:
            normal_index.append(i)
    data_update = pd.concat([data.iloc[:, normal_index]] + class_parts, axis=1)
    return data_update
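
For reference, the same selection rule can be sketched with a single `pd.get_dummies` call by passing the chosen columns explicitly. The toy frame and column names below are assumptions made for illustration only.

```python
import pandas as pd

toy = pd.DataFrame({
    'age': list(range(20, 32)),   # 12 distinct values, left untouched
    'gender': ['m', 'f'] * 6,     # 2 distinct values, encoded
})

limit_value = 10
# Columns with fewer than limit_value distinct values (NaN counted as a value)
class_cols = [c for c in toy.columns if toy[c].nunique(dropna=False) < limit_value]

encoded = pd.get_dummies(toy, columns=class_cols)
print(class_cols)
print(list(encoded.columns))
```

Selecting by column name rather than by position also makes it easy to exclude the label column before encoding.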

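To handle imbalanced positive and negative samples, I wrote a SMOTE implementation; for the theory, see my article "Python: The SMOTE Algorithm". In practice, simple undersampling or oversampling often solves the problem; see the article "Python: Rewriting Data Sampling and Balancing Methods". These are well-worn topics, so without further ado, here is the code: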

# SMOTE for imbalanced datasets
def smote(data, tag_label='tag_1', amount_personal=0, std_rate=5, k=5, method='mean'):
    cnt = data[tag_label].groupby(data[tag_label]).count()
    rate = max(cnt) / min(cnt)
    if rate < 5:
        print('SMOTE is not needed')
        return data
    # Split the data into the minority and the majority class
    less_data = np.array(data[data[tag_label] == cnt.idxmin()])
    more_data = np.array(data[data[tag_label] == cnt.idxmax()])
    # For every minority sample, find its k nearest minority neighbours
    # (position 0 is normally the sample itself)
    location = NearestNeighbors(n_neighbors=k).fit(less_data).kneighbors(less_data, return_distance=False)
    # Number of samples to generate: amount_personal if given, otherwise derived
    # from std_rate (the expected majority/minority ratio)
    amount = amount_personal if amount_personal > 0 else int(max(cnt) / std_rate)
    # Columns with more than 10 distinct values are treated as continuous, the
    # rest as categorical; the two groups are generated differently
    n_cols = less_data.shape[1]
    continue_index = [i for i in range(n_cols) if len(pd.unique(less_data[:, i])) > 10]
    class_index = [i for i in range(n_cols) if i not in continue_index]
    new_cases = []
    for times in range(amount):
        pool = np.random.randint(len(location))
        neighbor_group = less_data[location[pool], :]
        # Values are written back by column position so they line up with data.columns
        new_case = np.empty(n_cols, dtype=np.result_type(less_data.dtype, float))
        if method == 'mean':
            # Centroid of the k neighbours: points near minority samples are
            # assumed to be minority samples too
            new_case[continue_index] = neighbor_group[:, continue_index].mean(axis=0)
        elif method == 'random':
            # Random point on the segment between the sample and one of its true
            # neighbours (requires k >= 2, since position 0 is the sample itself)
            neighbor = less_data[np.random.choice(location[pool][1:])]
            base = less_data[pool][continue_index]
            new_case[continue_index] = base + np.random.rand() * (neighbor[continue_index] - base)
        else:
            raise ValueError("method must be 'mean' or 'random'")
        # Categorical variables take the mode of the neighbours
        for i in class_index:
            new_case[i] = pd.Series(neighbor_group[:, i]).mode()[0]
        new_cases.append(new_case)
        print('Generated %s new samples, %.2f%% complete' % (times, times * 100 / amount))
    # As in the original, the result holds the majority rows plus the synthetic
    # minority rows; the original minority rows are not included
    data_res = pd.DataFrame(np.vstack([more_data] + new_cases), columns=data.columns)
    return data_res
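
For comparison, the interpolation step of textbook SMOTE can be sketched in isolation. This is a minimal sketch under my own assumptions, not the template's code: the four toy points, the helper name `synthesize`, and the seeded generator are all invented, and only continuous features are handled.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)

# Toy minority class: the four corners of the unit square
minority = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])

# Ask for k + 1 neighbours because the nearest "neighbour" of a point is itself
k = 2
nn = NearestNeighbors(n_neighbors=k + 1).fit(minority)
neighbor_idx = nn.kneighbors(minority, return_distance=False)


def synthesize(i):
    # Hypothetical helper: one synthetic point between sample i and a random neighbour
    x = minority[i]
    j = rng.choice(neighbor_idx[i][1:])  # skip position 0, the sample itself
    gap = rng.random()                   # uniform in [0, 1)
    return x + gap * (minority[j] - x)


new_points = np.array([synthesize(rng.integers(len(minority))) for _ in range(5)])
print(new_points)
```

Every synthetic point lies on a segment between two existing minority samples, so it never leaves the region the minority class already occupies.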

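That's it for part one. There isn't much more to say: these are all basic data analysis and mining operations that I've turned into templates for reuse. Below is the complete preprocessing script I use; nothing differs from the functions above, it just ties them together: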

from __future__ import division
import numpy as np
import pandas as pd
from sklearn import preprocessing
from sklearn.model_selection import train_test_split  # sklearn.cross_validation was removed in scikit-learn 0.20
from sklearn.neighbors import NearestNeighbors
import sys

__author__ = 'slade_sal'
__time__ = '20171128'


# Split into features and label
def reload(data):
    # Column positions are hard-coded: column 3 is the label, columns 0-1 and 4+ are features
    feature = pd.concat([data.iloc[:, :2], data.iloc[:, 4:]], axis=1)
    tag = data.iloc[:, 3]
    return feature, tag


# Train/test split
def split_data(feature, tag):
    X_train, X_test, y_train, y_test = train_test_split(feature, tag, test_size=0.33, random_state=42)
    return X_train, X_test, y_train, y_test


if __name__ == '__main__':
    # sys.argv[0] is the script's own path; the data file is the first argument
    path = sys.argv[1]
    data_all = pd.read_table(str(path))
    print('Data loaded!')
    # Convert to a DataFrame
    data_all = change_data_format(data_all)
    # Drop the phone number column
    data_all = data_all.iloc[:, 1:]
    data_all, data_avaiable_index = nan_remove(data_all)
    print('Sparse columns removed!')
    data_all, _ = outlier_remove(data_all)
    print('Outliers handled!')
    data_all = nan_fill(data_all)
    print('Missing values filled!')
    data_all = ohe(data_all)
    print('One-hot encoding done!')
    data_all = smote(data_all)
    print('SMOTE done!')
    feature, tag = reload(data_all)
    X_train, X_test, y_train, y_test = split_data(feature, tag)
    print('Preprocessing finished!')
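
One optional variation on the final split, which the script above does not do: because the labels here are imbalanced, passing `stratify` to `train_test_split` keeps the class ratio the same in both parts. The sketch below uses invented toy data and a test size of 0.3 so the counts come out even; treat it as an assumption about what may be useful, not as part of the original pipeline.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy data: 90 negatives and 10 positives
feature = pd.DataFrame({'x': np.arange(100, dtype=float)})
tag = pd.Series([0] * 90 + [1] * 10)

X_train, X_test, y_train, y_test = train_test_split(
    feature, tag, test_size=0.3, random_state=42, stratify=tag)

# Both parts keep the 9:1 ratio of the full data set
print(y_train.value_counts().to_dict())
print(y_test.value_counts().to_dict())
```

Without stratification, a small minority class can end up over- or under-represented in the test set purely by chance.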

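Help yourselves. There is nothing here worth reposting and no real substance; it is just a convenience for everyday work, so please don't repost it. Thanks, editors.

Finally, thank you all for reading.


You're welcome to follow my personal blog, and for more code, follow me on GitHub. If you have any questions about algorithms or code, feel free to message me through my official account.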
