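I. Code overview
1.1. Reference:
1.2. Code: Gitee repository (recommended for users in mainland China), GitHub repository
1.3. Purpose:
Implements a hierarchical QSAR (H-QSAR) modeling approach that integrates binary classification, multiclass classification, and regression models of acute oral systemic toxicity.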
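1.4. Dataset
The rat acute oral toxicity data used in this study were collected from many publicly available datasets and resources by the National Toxicology Program Interagency Center for the Evaluation of Alternative Toxicological Methods (NICEATM) and the U.S. EPA's National Center for Computational Toxicology (NCCT). A full description and the dataset itself can be found here. The full dataset contains 11,992 compounds. The project organizers split it semi-randomly into a training set (75%) and an external test set (25%) so that both sets cover the LD50 distribution equivalently.
The training and test data are in the train_test_sets folder.
1.5. Modeling strategy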

[Figure: overall workflow for building the hierarchical QSAR models]
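The figure above shows the overall workflow for building the hierarchical QSAR models. Base regression, binary, and multiclass models (60 models in total) are built from different combinations of machine learning algorithms and chemical descriptors/fingerprints. Out-of-fold predictions of the base models are generated by 10-fold cross-validation. These out-of-fold predictions are concatenated and used as input (meta-features) for building the hierarchical regression, binary, and multiclass models.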
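To make the stacking step concrete, here is a minimal sketch of how out-of-fold predictions can become meta-features. It is not the repository's code: the synthetic data, the plain least-squares base models, and the names `out_of_fold_predictions`, `X_desc_a`, and `X_desc_b` are all assumptions made for illustration.

```python
import numpy as np

def out_of_fold_predictions(X, y, n_splits=10, seed=0):
    """Return one out-of-fold prediction per sample from a least-squares model.

    Hypothetical helper: every sample is predicted by a model that was
    fitted without the fold containing that sample.
    """
    n = len(y)
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(n), n_splits)
    oof = np.full(n, np.nan)
    Xb = np.column_stack([X, np.ones(n)])  # add an intercept column
    for held_out in folds:
        train_mask = np.ones(n, dtype=bool)
        train_mask[held_out] = False
        coef, *_ = np.linalg.lstsq(Xb[train_mask], y[train_mask], rcond=None)
        oof[held_out] = Xb[held_out] @ coef
    return oof

# Synthetic stand-ins for two descriptor sets (assumed data, not real descriptors)
rng = np.random.default_rng(42)
X_desc_a = rng.normal(size=(200, 5))
X_desc_b = rng.normal(size=(200, 3))
y = X_desc_a[:, 0] - 0.5 * X_desc_b[:, 1] + rng.normal(scale=0.1, size=200)

# One out-of-fold prediction column per base model, concatenated into meta-features
meta_features = np.column_stack([
    out_of_fold_predictions(X_desc_a, y),
    out_of_fold_predictions(X_desc_b, y),
])
print(meta_features.shape)  # (200, 2)
```

A second-level model would then be trained on `meta_features` instead of the raw descriptors, so it never sees a prediction made by a base model that was fitted on the same sample.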
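1.6. Jupyter Notebook workflow
1.6.1 Prepare labels:
- (1) labels.ipynb
1.6.2 Base models:
- (2) Compute chemical descriptors/fingerprints for the base models: descriptors.ipynb
- (3) Descriptor selection: descriptors_selections.ipynb
- (4) Hyperparameter search for the base models: Base_models_selection.ipynb
- (5) Build the base models with the optimal hyperparameters: Base_models.ipynb
1.6.3 Hierarchical models:
- (6) Meta-features: Hierarchical_features.ipynb
- (7) Hyperparameter tuning for the hierarchical models: Hierarchical_models_selection.ipynb
- (8) Build the hierarchical models with the optimal hyperparameters: Hierarchical_models.ipynb
1.6.4 Model evaluation:
- (9) Evaluate cross-validation and test-set performance: Model_evaluation.ipynb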
II. Running and understanding the code
2.1. labels.ipynb
- (1) Import libraries and modules
# Import libraries/modules
import pandas as pd
import numpy as np
import math
import joblib
from rdkit import Chem
from rdkit.Chem import PandasTools
from rdkit.Chem import Descriptors
from sklearn.preprocessing import LabelEncoder
- (2) Load the training and test sets
train = pd.read_csv('../data/train_test_sets/train.csv', index_col='CASRN')  # training set
test = pd.read_csv('../data/train_test_sets/test.csv', index_col='CASRN')  # test set
# Explore the data
train  # or use any of the statements below
train.head(1)
train.head()
train.shape
- (3) Add the ROMol column
# Use AddMoleculeColumnToFrame from PandasTools to add a molecule-structure column
PandasTools.AddMoleculeColumnToFrame(train,smilesCol='SMILES')
PandasTools.AddMoleculeColumnToFrame(test,smilesCol='SMILES')
- (4) Add the MW (molecular weight) descriptor column
train['MW'] = train.apply(lambda x: Descriptors.MolWt(x['ROMol']), axis=1)
test['MW'] = test.apply(lambda x: Descriptors.MolWt(x['ROMol']), axis=1)
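- (5) Define the logLD50 conversion function (units converted from mg/kg to mmol/kg)
def logLD50(df):
    # LD50 in mg/kg divided by MW in g/mol gives mmol/kg; then take log10
    logLD50_mmolkg = []
    for i in range(df.shape[0]):
        # .iloc gives positional access; the index holds CASRN strings, not 0..n-1
        ld50 = df['LD50_mgkg'].iloc[i]
        if pd.isna(ld50):
            logLD50_mmolkg.append(ld50)  # keep missing values as NaN
        else:
            logLD50_mmolkg.append(math.log10(float(ld50) / float(df['MW'].iloc[i])))
    return logLD50_mmolkg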
- (6) Convert to logLD50 and add it as a new column
train['logLD50_mmolkg'] = logLD50(train)
test['logLD50_mmolkg'] = logLD50(test)
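The conversion relies on the fact that mg/kg divided by g/mol gives mmol/kg. As a quick check with made-up numbers (not values from the dataset), the sketch below applies the same formula in vectorized form. `np.log10` passes NaN through, so missing LD50 values need no separate branch. The toy frame `demo` and its index labels are assumptions.

```python
import numpy as np
import pandas as pd

# Made-up values for illustration only
demo = pd.DataFrame(
    {"LD50_mgkg": [300.0, np.nan, 1500.0], "MW": [150.0, 200.0, 150.0]},
    index=["cas-a", "cas-b", "cas-c"],  # placeholder labels standing in for CASRN
)

# mg/kg divided by g/mol gives mmol/kg; log10 is then applied element-wise
demo["logLD50_mmolkg"] = np.log10(demo["LD50_mgkg"] / demo["MW"])
print(demo["logLD50_mmolkg"].round(3).tolist())  # [0.301, nan, 1.0]
```

For the first row, 300 mg/kg at 150 g/mol is 2 mmol/kg, and log10(2) is about 0.301.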
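- (7) Define the binary "toxic" and binary "very toxic" functions
def binary_toxic(df):
    # toxic = 1 when 'nontoxic' is False, 0 when it is True; missing values stay NaN
    toxic = []
    for i in range(df.shape[0]):
        value = df['nontoxic'].iloc[i]  # positional access; the index is CASRN
        if pd.isna(value):
            toxic.append(value)
        else:
            # truthiness test instead of `is False`, which is never true for numpy.bool_
            toxic.append(0 if value else 1)
    return toxic
def binary_verytoxic(df):
    # verytoxic = 1 when 'very_toxic' is True, 0 when it is False; missing values stay NaN
    verytoxic = []
    for i in range(df.shape[0]):
        value = df['very_toxic'].iloc[i]  # positional access; the index is CASRN
        if pd.isna(value):
            verytoxic.append(value)
        else:
            verytoxic.append(1 if value else 0)
    return verytoxic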
- (8) Convert the toxic and very-toxic flags to 0/1 values and add them as two new columns
train['toxic'] = binary_toxic(train)
test['toxic'] = binary_toxic(test)
train['verytoxic'] = binary_verytoxic(train)
test['verytoxic'] = binary_verytoxic(test)
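As an alternative to the explicit loops, the same 0/1 conversion can be sketched with `Series.map`. This is an illustrative variant rather than the notebook's code, and the toy frame `demo` holds made-up flags. Values that are absent from the mapping dictionary, such as NaN, come out as NaN, which matches the intent of keeping missing labels missing.

```python
import numpy as np
import pandas as pd

# Made-up flags for illustration; NaN marks a missing label
demo = pd.DataFrame({
    "nontoxic": [True, False, np.nan],
    "very_toxic": [False, True, np.nan],
})

# Map booleans to 0/1; values not present in the mapping (NaN) stay NaN
demo["toxic"] = demo["nontoxic"].map({True: 0, False: 1})
demo["verytoxic"] = demo["very_toxic"].map({True: 1, False: 0})
print(demo[["toxic", "verytoxic"]])
```

Because of the NaN entries, the resulting columns are floating point (0.0 and 1.0) rather than integers.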
- (9) Drop the columns that are no longer needed
train = train.drop(['nontoxic', 'very_toxic', 'ROMol', 'MW', 'LD50_mgkg'], axis=1)
test = test.drop(['nontoxic', 'very_toxic', 'ROMol', 'MW', 'LD50_mgkg'], axis=1)
- (10) Reorder the dataset columns
columnsTitles = ['SMILES', 'logLD50_mmolkg', 'verytoxic', 'toxic', 'EPA_category', 'GHS_category']  # new column order
train = train.reindex(columns=columnsTitles)
test = test.reindex(columns=columnsTitles)
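One behavior of `reindex` is worth keeping in mind here: a column name that does not exist in the frame does not raise an error but produces a new column filled with NaN, so a misspelled name in the list can go unnoticed. The short sketch below shows this on a made-up frame (`demo` and its column names are assumptions).

```python
import pandas as pd

demo = pd.DataFrame({"b": [1, 2], "a": [3, 4]})

# reindex reorders existing columns and creates missing ones filled with NaN
reordered = demo.reindex(columns=["a", "b", "c"])
print(reordered.columns.tolist())   # ['a', 'b', 'c']
print(reordered["c"].isna().all())  # True

# Plain list selection raises KeyError for a missing column instead
try:
    demo[["a", "b", "c"]]
    raised = False
except KeyError:
    raised = True
print(raised)  # True
```

Checking `train.isna().all()` after reordering is one simple way to catch a column that was created by accident.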
- (11) Save the data (CSV format)
train.to_csv('../data/train_test_sets/train_labels.csv')
test.to_csv('../data/train_test_sets/test_labels.csv')
- (12) Define the encoder function
def get_labelencoder(df_labels):
    # Fit one LabelEncoder per label column, using only the non-missing values
    encoder_verytoxic = LabelEncoder().fit(df_labels[~df_labels['verytoxic'].isnull()]['verytoxic'].values)
    encoder_toxic = LabelEncoder().fit(df_labels[~df_labels['toxic'].isnull()]['toxic'].values)
    encoder_epa = LabelEncoder().fit(df_labels[~df_labels['EPA_category'].isnull()]['EPA_category'].values)
    encoder_ghs = LabelEncoder().fit(df_labels[~df_labels['GHS_category'].isnull()]['GHS_category'].values)
    return encoder_verytoxic, encoder_toxic, encoder_epa, encoder_ghs
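For readers unfamiliar with `LabelEncoder`, the sketch below shows what a fitted encoder does on a small made-up label column. The values are illustrative and are not taken from the dataset. The encoder maps the sorted unique labels to consecutive integers starting at 0, and `inverse_transform` maps them back.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Made-up category labels for illustration; NaN marks a missing label
labels = pd.Series([3.0, 1.0, np.nan, 4.0, 1.0], name="EPA_category")

# Fit only on the non-missing values, as get_labelencoder does
mask = labels.notna()
encoder = LabelEncoder().fit(labels[mask].values)
print(encoder.classes_.tolist())  # [1.0, 3.0, 4.0]

# Each label is replaced by its position in classes_
encoded = encoder.transform(labels[mask].values)
print(encoded.tolist())  # [1, 0, 2, 0]

# inverse_transform recovers the original labels
print(encoder.inverse_transform(encoded).tolist())  # [3.0, 1.0, 4.0, 1.0]
```

Fitting on the non-missing rows only keeps NaN out of `classes_`, so missing labels have to be masked out again before calling `transform`.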
- (13) Encoding