Background
https://www.kaggle.com/c/talkingdata-adtracking-fraud-detection
This competition is an anti-fraud task: using log data to detect "click fraud" in mobile app ad clicks. You are given a pile of click records, mainly the clicker's ip, device model (device), operating system (os), the ad channel (channel), and when (click_time) which app was clicked. The goal is to predict whether the app was downloaded (is_attributed): 1 if downloaded, 0 if not.
The evaluation metric is AUC.
Data size: train.csv is 7.01 GB and test.csv is 823 MB.
Code
Extract day-of-week and day-of-year from click_time, then drop the raw timestamp columns.
def timeFeatures(df):
    # Make some new features from the click_time column
    df['datetime'] = pd.to_datetime(df['click_time'])
    df['dow'] = df['datetime'].dt.dayofweek
    df['doy'] = df['datetime'].dt.dayofyear
    df.drop(['click_time', 'datetime'], axis=1, inplace=True)
    return df
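A quick self-contained check of the transformation above (the timestamps are made up for illustration):

```python
import pandas as pd

def timeFeatures(df):
    # Derive day-of-week and day-of-year, then drop the raw timestamp columns
    df['datetime'] = pd.to_datetime(df['click_time'])
    df['dow'] = df['datetime'].dt.dayofweek
    df['doy'] = df['datetime'].dt.dayofyear
    df.drop(['click_time', 'datetime'], axis=1, inplace=True)
    return df

toy = pd.DataFrame({'click_time': ['2017-11-06 14:32:21', '2017-11-09 00:01:02']})
out = timeFeatures(toy)
print(out)  # 2017-11-06 is a Monday -> dow=0; 2017-11-09 -> dow=3
```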
train_columns = ['ip', 'app', 'device', 'os', 'channel', 'click_time', 'is_attributed']
test_columns = ['ip', 'app', 'device', 'os', 'channel', 'click_time', 'click_id']
dtypes = {
    'ip'            : 'uint32',
    'app'           : 'uint16',
    'device'        : 'uint16',
    'os'            : 'uint16',
    'channel'       : 'uint16',
    'is_attributed' : 'uint8',
    'click_id'      : 'uint32'
}
Load the datasets.
Remove the is_attributed (target) column from train.
Remove the click_id column from the test set.
train = pd.read_csv(path+"train.csv", skiprows=range(1,123903891), nrows=61000000, usecols=train_columns, dtype=dtypes)
test = pd.read_csv(path+"test_supplement.csv", usecols=test_columns, dtype=dtypes)
print('[{}] Finished to load data'.format(time.time() - start_time))
# Separate the target column from the training set
y = train['is_attributed']
train.drop(['is_attributed'], axis=1, inplace=True)
# Drop click_id from the test set (it is only needed later for the submission file)
sub = pd.DataFrame()
#sub['click_id'] = test['click_id'].astype('int')
test.drop(['click_id'], axis=1, inplace=True)
# Free memory
gc.collect()
Concatenate train and test.
nrow_train = train.shape[0]
merge = pd.concat([train, test])
Compute the number of clicks per ip (ip: ip address of click) as clicks_by_ip; the channel column is only used as a convenient non-null column to count.
Reference: pandas DataFrame merging and joining (merge, join, concat).
Split back into train and test.
# Count the number of clicks by ip
ip_count = merge.groupby(['ip'])['channel'].count().reset_index()
ip_count.columns = ['ip', 'clicks_by_ip']
merge = pd.merge(merge, ip_count, on='ip', how='left', sort=False)
merge['clicks_by_ip'] = merge['clicks_by_ip'].astype('uint16')
merge.drop('ip', axis=1, inplace=True)
train = merge[:nrow_train]
test = merge[nrow_train:]
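The count-then-merge pattern above can also be written with groupby().transform, which yields the same clicks_by_ip column without building and joining the intermediate ip_count frame; a minimal sketch on toy data:

```python
import pandas as pd

merge = pd.DataFrame({
    'ip':      [1, 1, 2, 1, 2],
    'channel': [10, 11, 12, 13, 14],
})

# transform('count') broadcasts the per-ip count back onto every row,
# so no separate merge step is needed
merge['clicks_by_ip'] = merge.groupby('ip')['channel'].transform('count').astype('uint16')
print(merge['clicks_by_ip'].tolist())  # [3, 3, 2, 3, 2]
```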
XGBoost parameter tuning
max_depth = 0 means no depth limit; with grow_policy = "lossguide", tree growth is bounded by max_leaves instead.
params = {'eta': 0.3,
          'tree_method': "hist",
          'grow_policy': "lossguide",
          'max_leaves': 1400,
          'max_depth': 0,
          'subsample': 0.9,
          'colsample_bytree': 0.7,
          'colsample_bylevel': 0.7,
          'min_child_weight': 0,
          'alpha': 4,
          'objective': 'binary:logistic',
          'scale_pos_weight': 9,
          'eval_metric': 'auc',
          'nthread': 8,
          'random_state': 99,
          'silent': True}
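The scale_pos_weight of 9 up-weights the rare positive class. A common heuristic (not necessarily how this exact value was chosen) is the negative-to-positive ratio of the training labels; a minimal sketch with synthetic labels at a 9:1 imbalance:

```python
import numpy as np

# Synthetic binary labels with a 9:1 negative:positive imbalance (illustrative only)
y = np.array([0] * 900 + [1] * 100)

neg, pos = (y == 0).sum(), (y == 1).sum()
scale_pos_weight = neg / pos  # heuristic: weight positives up by the class ratio
print(scale_pos_weight)  # 9.0
```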
Training
if is_valid:
    # Hold out 10% of the training data as a validation set
    x1, x2, y1, y2 = train_test_split(train, y, test_size=0.1, random_state=99)
    # Build the xgb DMatrix objects
    dtrain = xgb.DMatrix(x1, y1)
    dvalid = xgb.DMatrix(x2, y2)
    del x1, y1, x2, y2
    gc.collect()
    watchlist = [(dtrain, 'train'), (dvalid, 'valid')]
    model = xgb.train(params, dtrain, 200, watchlist, maximize=True, early_stopping_rounds=25, verbose_eval=5)
    del dvalid
else:
    dtrain = xgb.DMatrix(train, y)
    del train, y
    gc.collect()
    watchlist = [(dtrain, 'train')]
    model = xgb.train(params, dtrain, 30, watchlist, maximize=True, verbose_eval=1)
Tips (saving memory)
1. Delete unused variables promptly and run garbage collection:
del
gc.collect()
2. Predefine data types:
dtypes = {
    'ip'            : 'uint32',
    'app'           : 'uint16',
    'device'        : 'uint16',
    'os'            : 'uint16',
    'channel'       : 'uint16',
    'is_attributed' : 'uint8',
    'click_id'      : 'uint32'
}
pandas normally infers the dtypes when reading; predefining them here saves more than half the memory.
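The saving is easy to see by comparing memory_usage for an inferred integer column against the predefined narrow dtype (on most platforms pandas infers int64, i.e. 8 bytes per value versus 2 for uint16):

```python
import numpy as np
import pandas as pd

n = 100_000
vals = np.random.randint(0, 500, size=n)

default = pd.Series(vals)                  # pandas infers a wide integer dtype
compact = pd.Series(vals, dtype='uint16')  # predefined narrow dtype

print(default.memory_usage(deep=True))  # roughly 800 KB of data on int64 platforms
print(compact.memory_usage(deep=True))  # roughly 200 KB of data
```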
3. Read only selected rows of the csv file:
limit the number of rows read with nrows=xxx;
skip rows with skiprows=xxx.
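Note that skiprows=range(1, k) starts skipping at row 1, so the header row (row 0) is preserved; a self-contained sketch with an in-memory csv:

```python
import io
import pandas as pd

csv = io.StringIO("a,b\n1,10\n2,20\n3,30\n4,40\n5,50\n")

# Skip data rows 1-2 (the header is row 0 and is kept), then read at most 2 rows
df = pd.read_csv(csv, skiprows=range(1, 3), nrows=2)
print(df)  # rows (3,30) and (4,40)
```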
Sampling / subsampling: the statistical features are still computed over the full data; after the features are computed, the negative samples are downsampled to 5% to reduce memory consumption.
train.csv has 184,903,891 rows in total; 61,000,000 rows are selected:
train = pd.read_csv(path+"train.csv", skiprows=range(1,123903891), nrows=61000000, usecols=train_columns, dtype=dtypes)
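The 5% negative downsampling described above might look like the following sketch (the frac and random_state values here are illustrative, not the author's exact settings):

```python
import pandas as pd

df = pd.DataFrame({'is_attributed': [1] * 5 + [0] * 100})

pos = df[df['is_attributed'] == 1]                                     # keep all positives
neg = df[df['is_attributed'] == 0].sample(frac=0.05, random_state=99)  # 5% of negatives
sampled = pd.concat([pos, neg])
print(len(sampled))  # 5 positives + 5 of the 100 negatives = 10
```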
4. Load only the needed columns (usecols).
5. Use data from the same time windows as test.
The train data covers all hours of the day, while test covers only a few fixed time windows; selecting train rows from the same windows reduces the training-set size.
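One way to implement that filter is to keep only train rows whose click hour falls in the hours test covers; the hour set below is hypothetical, not the competition's actual windows:

```python
import pandas as pd

train = pd.DataFrame({
    'click_time': ['2017-11-07 04:10:00', '2017-11-07 12:00:00', '2017-11-07 14:30:00'],
})

# Hypothetical example: suppose test only covers these hour-of-day windows
test_hours = {4, 5, 9, 10, 13, 14}

hours = pd.to_datetime(train['click_time']).dt.hour
train_same_window = train[hours.isin(test_hours)]
print(len(train_same_window))  # the 04h and 14h rows survive, the 12h row is dropped
```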