2022 iFLYTEK AI Developer Competition: Diabetes Genetic Risk Detection Challenge

The annual iFLYTEK AI Developer Competition has kicked off again. The good news is that there are more tracks this year; the bad news is that the prize money per track has dropped considerably (which takes some of the edge off the competition :) ). In past years our group mostly worked on the vision tracks, so this year I wanted to try a data-mining track and practice solving a real problem with traditional machine learning. Conveniently, one of the problems is medical: the Diabetes Genetic Risk Detection Challenge. Its dataset is small, which makes it a good practice target; as for the ranking, we'll see 😅.

Challenge Overview

Skipping the background, the task in one sentence: given a patient's structured clinical test data, train a model to predict whether the patient has diabetes.

The organizers provide data with the following fields (9 features + 1 label):

  • 编号 (ID): a number identifying the individual;

  • 性别 (gender): 1 for male, 0 for female;

  • 出生年份 (birth year): the year of birth;

  • 体重指数 (BMI): weight divided by height squared, in kg/m2;

  • 糖尿病家族史 (diabetes family history): encodes the hereditary aspect of diabetes by recording which relatives have it, with three categories: one parent has diabetes, an uncle or aunt has diabetes, or no record;

  • 舒张压 (diastolic blood pressure): the pressure produced by the elastic recoil of the arteries while the heart relaxes, in mmHg;

  • 口服耐糖量测试 (oral glucose tolerance test, OGTT): a laboratory method for diagnosing diabetes. The competition data uses the blood glucose value 120 minutes after the glucose load, in mmol/L;

  • 胰岛素释放实验 (insulin release test): a fasting oral glucose load that stimulates the pancreatic β cells to release insulin. The competition data uses the plasma insulin level 120 minutes after the glucose load, in pmol/L;

  • 肱三头肌皮褶厚度 (triceps skinfold thickness): measured longitudinally on a skinfold parallel to the long axis of the arm, taken at the midpoint of the line connecting the acromion and the olecranon on the back of the right upper arm, in cm;

  • 患有糖尿病标识 (diabetes label): the data label, 1 for diabetic, 0 for non-diabetic.

The training set contains 5,070 samples and the test set 1,000; the evaluation metric is F1-Score.
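For reference, the F1 score can be computed with scikit-learn's f1_score; a minimal sketch with made-up labels (the training code below uses the macro average):

```python
from sklearn.metrics import f1_score

# toy labels and predictions, purely illustrative
y_true = [1, 0, 1, 1, 0, 0]
y_pred = [1, 0, 0, 1, 0, 1]

# macro F1: per-class F1 averaged over both classes
score = f1_score(y_true, y_pred, average='macro')
print(round(score, 4))  # 0.6667
```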

Feature Engineering

Each sample contains 9 base features. The 编号 (ID) field is clearly irrelevant, and the gender distribution is fairly balanced; 出生年份 (birth year) can be converted into a new age feature. A literature check shows that the remaining features are directly related to clinical diabetes diagnosis, but they have many missing values, which the raw data marks with 0 or -1 placeholders, so an imputation strategy is needed.
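Before choosing an imputation strategy, it helps to quantify how many placeholder values each column carries. A minimal sketch on a toy frame (the column names match the dataset; the values are made up):

```python
import pandas as pd

# toy frame mimicking the 0/-1 placeholders used for missing values
# (column names match the dataset; values are made up)
df = pd.DataFrame({
    '体重指数': [22.5, 0, 31.0, -1],
    '胰岛素释放实验': [0, 85.0, 0, 120.0],
})

# count placeholder entries per column
missing_counts = df.isin([0, -1]).sum()
print(missing_counts)
```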

Step 1: Feature Processing

One-hot encode the 糖尿病家族史 feature and derive an age feature from 出生年份:

import pandas as pd

train_path = './dataset/diabetes/train.csv'
train_df = pd.read_csv(train_path, encoding='gbk')

test_path = './dataset/diabetes/test.csv'
test_df = pd.read_csv(test_path, encoding='gbk')

data = pd.concat([train_df, test_df], axis=0, ignore_index=True)

data['age'] = data['出生年份'].apply(lambda x: int(2022 - x))
data['糖尿病家族史'] = data['糖尿病家族史'].apply(
    lambda x: '叔叔或姑姑有一方患有糖尿病' if x == '叔叔或者姑姑有一方患有糖尿病' else x)
df = pd.get_dummies(data['糖尿病家族史']).astype('int')
data = pd.concat([data, df], axis=1)
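For intuition, here is what the rename-then-one-hot step produces on a toy column (values mimic the dataset's three categories, including the second spelling of the uncle/aunt category that the rename unifies):

```python
import pandas as pd

# toy column with the dataset's three family-history categories
s = pd.Series(['父母有一方患有糖尿病', '叔叔或者姑姑有一方患有糖尿病', '无记录'])

# unify the two spellings of the uncle/aunt category before encoding
s = s.apply(lambda x: '叔叔或姑姑有一方患有糖尿病' if x == '叔叔或者姑姑有一方患有糖尿病' else x)

# one column per category, 0/1 indicators; each row has exactly one 1
onehot = pd.get_dummies(s).astype('int')
print(onehot.values.tolist())
```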

Fill missing values for 口服耐糖量测试, 胰岛素释放实验, 肱三头肌皮褶厚度, and 体重指数. Median filling is used here (this may not be ideal, but it will do for now).

import numpy as np

for i in ['口服耐糖量测试', '胰岛素释放实验', '肱三头肌皮褶厚度', '体重指数']:
    data[i] = data[i].apply(lambda x: np.nan if x == 0 or x == -1 else x)
data.fillna(data.median(axis=0), inplace=True)

Step 2: Feature Construction

According to the Guidelines for the Prevention and Control of Type 2 Diabetes in China (2017 Edition) (《中国2型糖尿病防治指南(2017年版)》), diabetes is diagnosed when typical symptoms are present (excessive thirst, polyuria, polyphagia, unexplained weight loss) together with random venous plasma glucose ≥ 11.1 mmol/L, or fasting venous plasma glucose ≥ 7.0 mmol/L, or 2-hour post-load plasma glucose ≥ 11.1 mmol/L in an oral glucose tolerance test (OGTT). Additional references (e.g. Baidu Baike) confirm that the OGTT is an important clinical indicator for diagnosing diabetes, so the following features are constructed:

data['OGTT>7.1'] = data['口服耐糖量测试'].apply(lambda x: int(x > 7.1))
data['OGTT>8.5'] = data['口服耐糖量测试'].apply(lambda x: int(x > 8.5))
data['OGTT>11.1'] = data['口服耐糖量测试'].apply(lambda x: int(x > 11.1))

In addition, according to clinical statistics from 2013, prevalence differs markedly across BMI groups: $7.8\%$ for BMI $< 25\,kg/m^2$, $15.4\%$ for $25 \le$ BMI $< 30\,kg/m^2$, and $21.2\%$ for BMI $\ge 30\,kg/m^2$. The following features are therefore constructed:

data['BMI>25'] = data['体重指数'].apply(lambda x: int(x > 25))
data['30>BMI>=25'] = data['体重指数'].apply(lambda x: int(x >= 25 and x < 30))
data['BMI>=30'] = data['体重指数'].apply(lambda x: int(x >= 30))
data['BMI>50'] = data['体重指数'].apply(lambda x: int(x > 50))

Model Training

As a baseline I chose random forest, trained with 5-fold cross-validation; the hyperparameter grid is:

'random forest': {
    'n_estimators': [50, 100, 200],
    'criterion': ['entropy']
},

To simplify hyperparameter tuning, I used scikit-learn's grid search, which enumerates all parameter combinations over a specified search space. The experiment code is as follows:

Script file diabetes_cls.py:

import os
import copy
import pandas as pd
import pickle
from cls_trainer import ML_Classifier
import numpy as np

from sklearn.metrics import make_scorer
from sklearn.metrics import accuracy_score, precision_score, f1_score, recall_score

_AVAIL_CLF = ['random forest']

params_dict = {
    'random forest': {
        'n_estimators': [50, 100, 200],
        'criterion': ['entropy']
    }
}

METRICS_CLS = {
    'Accuracy': make_scorer(accuracy_score),
    'Recall': make_scorer(recall_score, average='macro', zero_division=0),
    'Precision': make_scorer(precision_score, average='macro', zero_division=0),
    'F1': make_scorer(f1_score, average='macro', zero_division=0),
}

SETUP_TRAINER = {
    'target_key': '患有糖尿病标识',
    'random_state': 21,
    'metric': METRICS_CLS,
    'k_fold': 5,
    'scale_flag': False,
    'sub_col': ['uuid', 'label'],
    'id_name': ['编号']
}


def fea_encoder(train_df, test_df, label=None):

    target = train_df[label]
    del train_df[label]
    data = pd.concat([train_df, test_df], axis=0, ignore_index=True)

    data['糖尿病家族史'] = data['糖尿病家族史'].apply(
        lambda x: '叔叔或姑姑有一方患有糖尿病' if x == '叔叔或者姑姑有一方患有糖尿病' else x)
    df = pd.get_dummies(data['糖尿病家族史']).astype('int')
    data = pd.concat([data, df], axis=1)

    # replace 0/-1 placeholders with NaN, then median-fill
    for i in ['口服耐糖量测试', '胰岛素释放实验', '肱三头肌皮褶厚度', '体重指数']:
        data[i] = data[i].apply(lambda x: np.nan if x == 0 or x == -1 else x)

    data.fillna(data.median(axis=0), inplace=True)

    data['age'] = data['出生年份'].apply(lambda x: int(2022 - x))
    data['BMI>25'] = data['体重指数'].apply(lambda x: int(x > 25))
    data['30>BMI>=25'] = data['体重指数'].apply(lambda x: int(x >= 25 and x < 30))
    data['BMI>=30'] = data['体重指数'].apply(lambda x: int(x >= 30))
    data['BMI>50'] = data['体重指数'].apply(lambda x: int(x > 50))

    data['OGTT<7.8'] = data['口服耐糖量测试'].apply(lambda x: int(x < 7.8))
    data['OGTT>8.5'] = data['口服耐糖量测试'].apply(lambda x: int(x > 8.5))
    data['OGTT>11.1'] = data['口服耐糖量测试'].apply(lambda x: int(x > 11.1))

    train_df = data[:train_df.shape[0]]
    train_df[label] = target
    test_df = data[train_df.shape[0]:]

    return train_df, test_df


if __name__ == "__main__":

    train_path = './dataset/diabetes/train.csv'
    train_df = pd.read_csv(train_path, encoding='gbk')

    test_path = './dataset/diabetes/test.csv'
    test_df = pd.read_csv(test_path, encoding='gbk')

    train_df, test_df = fea_encoder(train_df, test_df, '患有糖尿病标识')

    fea_list = [f for f in train_df.columns if f not in ['编号', '患有糖尿病标识', '糖尿病家族史']]
    test_df = test_df[fea_list]
    train_df = train_df[fea_list + ['患有糖尿病标识']]

    save_path = './result/diabetes'
    if not os.path.exists(save_path):
        os.makedirs(save_path)
    for clf_name in _AVAIL_CLF:
        tmp_train_df = copy.copy(train_df)
        tmp_test_df = copy.copy(test_df)
        print('********** %s **********' % clf_name)
        classifier = ML_Classifier(clf_name=clf_name, params=params_dict[clf_name])
        model = classifier.trainer(train_df=tmp_train_df, **SETUP_TRAINER,
                                   pred_flag=True, test_df=tmp_test_df,
                                   test_csv=test_path, save_path=save_path)

For easier extensibility, a dedicated class ML_Classifier encapsulates the training procedure; the file is named cls_trainer.py:

import numpy as np
import pandas as pd

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import KFold
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import f1_score
import warnings
warnings.filterwarnings("ignore")


def f1(true, pred):
    return f1_score(true, pred, average='macro', zero_division=0)


METRICS_CLS = {
    'Accuracy': 'accuracy',
    'Recall': 'recall',
    'Precision': 'precision',
    'F1': 'f1',
}


class ML_Classifier(object):
    '''
    Machine Learning Classifier for the classification
    Args:
    - clf_name, string, __all__ = ['random forest']
    - params_dict, dict, parameter settings of the specified classifier
    '''
    def __init__(self, clf_name=None, params=None):
        super(ML_Classifier, self).__init__()
        self.clf_name = clf_name
        self.params = params
        self.clf = self._get_clf()

    def trainer(self, train_df, target_key, random_state=21, metric=None, k_fold=5,
                pred_flag=False, test_df=None, test_csv=None, save_path=None,
                scale_flag=False, sub_col=None, id_name=None):
        params = self.params
        fea_list = [f for f in train_df.columns if f != target_key]
        print(fea_list)

        Y = np.asarray(train_df[target_key])
        num_classes = len(set(Y))
        X = np.asarray(train_df[fea_list])
        test = np.asarray(test_df[fea_list])

        if scale_flag:
            # scale train and test together so they share the same statistics
            X_len = X.shape[0]
            data_scaler = StandardScaler()
            cat_data = np.concatenate([X, test], axis=0)
            cat_data = data_scaler.fit_transform(cat_data)

            X = cat_data[:X_len]
            test = cat_data[X_len:]

        kfold = KFold(n_splits=k_fold, shuffle=True, random_state=random_state)

        print(X.shape, test.shape)
        test_score_list = []
        predictions = []
        predictions_prob = []
        for fold_num, (train_index, val_index) in enumerate(kfold.split(X)):
            print(f'***********fold {fold_num+1} start!!***********')
            x_train, x_val = X[train_index], X[val_index]
            y_train, y_val = Y[train_index], Y[val_index]
            model = GridSearchCV(estimator=self.clf,
                                 param_grid=params,
                                 cv=kfold,
                                 scoring=metric,
                                 refit='F1',
                                 verbose=True,
                                 return_train_score=True)
            model = model.fit(x_train, y_train)

            best_score = model.best_score_
            best_model = model.best_estimator_
            train_pred = model.predict(x_train)
            train_score = f1(y_train, train_pred)
            test_pred = model.predict(x_val)
            test_score = f1(y_val, test_pred)
            test_score_list.append(test_score)
            print("F1 Evaluation:")
            print("fold {} Best score:{}".format(fold_num + 1, best_score))
            print("fold {} Train score:{}".format(fold_num + 1, train_score))
            print("fold {} Test score:{}".format(fold_num + 1, test_score))
            print("fold {} Best parameter:{}".format(fold_num + 1, model.best_params_))

            if pred_flag and test is not None:
                pred = model.predict(test)
                predictions.append(pred)
                # prob
                pred_prob = model.predict_proba(test)
                predictions_prob.append(pred_prob)

        # majority vote over the per-fold hard predictions
        final_result = []
        vote_array = np.asarray(predictions).astype(int)
        final_result.extend([max(list(vote_array[:, i]), key=list(vote_array[:, i]).count)
                             for i in range(vote_array.shape[1])])

        # average the per-fold probabilities, then take the argmax
        prob_array = np.mean(predictions_prob, axis=0)  # N*C
        prob_final_result = np.argmax(prob_array, axis=1).tolist()  # N

        test_df = pd.read_csv(test_csv, encoding='gbk')

        pred_df = {
            sub_col[0]: [case for case in test_df[id_name].values.tolist()]
        }
        pred_df = pd.DataFrame(data=pred_df)

        pred_df[sub_col[0]] = test_df[id_name]
        # vote
        pred_df[sub_col[1]] = final_result
        pred_df.to_csv(f'{save_path}/{self.clf_name}_vote_result.csv', index=False)

        # prob
        for i in range(num_classes):
            pred_df[f'prob_{str(i)}'] = prob_array[:, i].tolist()
        pred_df[sub_col[1]] = prob_final_result
        pred_df.to_csv(f'{save_path}/{self.clf_name}_prob_result.csv', index=False)
        return best_model

    def _get_clf(self):
        if self.clf_name == 'random forest':
            classifier = RandomForestClassifier(random_state=21, bootstrap=True)

        return classifier

Results

Testing showed that using 0.4 as the positive-class decision threshold on the 5-fold averaged probabilities works slightly better than the default 0.5. The final score is 0.9672. Onward!
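The threshold trick can be sketched as follows (toy probabilities; in the real pipeline they come from averaging predict_proba over the 5 folds):

```python
import numpy as np

# toy 5-fold averaged positive-class probabilities, purely illustrative
avg_prob = np.array([0.45, 0.62, 0.38, 0.41, 0.97])

labels_default = (avg_prob >= 0.5).astype(int)  # default 0.5 threshold
labels_tuned = (avg_prob >= 0.4).astype(int)    # tuned 0.4 threshold

print(labels_default.tolist())  # [0, 1, 0, 0, 1]
print(labels_tuned.tolist())    # [1, 1, 0, 1, 1]
```

Lowering the threshold trades some precision for recall on the positive class, which can raise F1 when positives are under-predicted.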

Future Directions

  • Multi-model ensembling
  • Cross features (no concrete ideas yet)
  • Better missing-value imputation
  • Data binning, e.g. by gender or BMI
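As a starting point for the binning idea, pandas' cut can discretize BMI into the guideline intervals (a sketch with made-up values; the bin edges and labels are my own choice, following the BMI groups discussed above):

```python
import pandas as pd

# toy BMI values, purely illustrative
bmi = pd.Series([22.5, 27.0, 31.0, 24.9])

# bin edges follow the prevalence groups above: [0, 25), [25, 30), [30, 100)
groups = pd.cut(bmi, bins=[0, 25, 30, 100], right=False, labels=['low', 'mid', 'high'])
print(groups.tolist())  # ['low', 'mid', 'high', 'low']
```

The resulting categorical column could then be one-hot encoded like the family-history feature, or crossed with gender.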
  • Copyright notice: unless otherwise stated, all articles on this blog are copyrighted by the author. Please credit the source when reposting!
  • Copyright © 2022 Shi Jun