Diaob / Loan-Default-Prediction

Beginner Financial Risk Control: Loan Default Prediction (零基础入门金融风控-贷款违约预测)

Loan Default Prediction

Tianchi competition: Beginner Financial Risk Control - Loan Default Prediction. Study material: FinancialRiskControl

Features

Models

lightgbm

  • Baseline features. Offline: lgb_0.8072649999999999.csv; online 0.7342

  • Added 5-fold CV conversion-rate features: offline 0.8074600000000001, online 0.7368

  • Added multiple groups of features: offline 0.8074874999999999, online 0.7385

[0.80766875, 0.80690625, 0.80741875, 0.80745, 0.80799375]
train_model_classification cost time:478.3310031890869
0.8074874999999999
  • Increased the number of categorical conversion-rate features: offline 0.8075150000000001, online:
[0.80770625, 0.80708125, 0.8080625, 0.80725, 0.807475]
CV mean score: 0.8075, std: 0.0003.
train_model_classification cost time:436.97244095802307
0.8075150000000001
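The per-fold score lists, the "CV mean score" lines, and the "train_model_classification cost time" lines throughout this log come from a cross-validation training helper whose code is not shown here. A minimal sketch of such a wrapper, assuming StratifiedKFold plus LightGBM (>= 3.3) with early stopping; the fold count, parameters, and body below are assumptions, and the author's version evidently also logs a per-fold accuracy list:

    import time
    import numpy as np
    import lightgbm as lgb
    from sklearn.metrics import roc_auc_score
    from sklearn.model_selection import StratifiedKFold

    def train_model_classification(X, y, X_test, n_splits=5, seed=2020):
        """Sketch of a 5-fold CV trainer; prints per-fold AUC, the CV mean/std and wall time."""
        start = time.time()
        folds = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
        scores, test_pred = [], np.zeros(len(X_test))
        for trn_idx, val_idx in folds.split(X, y):
            model = lgb.LGBMClassifier(n_estimators=10000, learning_rate=0.05)
            model.fit(X.iloc[trn_idx], y.iloc[trn_idx],
                      eval_set=[(X.iloc[val_idx], y.iloc[val_idx])],
                      eval_metric='auc',
                      callbacks=[lgb.early_stopping(200), lgb.log_evaluation(200)])
            val_pred = model.predict_proba(X.iloc[val_idx])[:, 1]
            scores.append(roc_auc_score(y.iloc[val_idx], val_pred))
            test_pred += model.predict_proba(X_test)[:, 1] / n_splits
        print(scores)
        print('CV mean score: {:.4f}, std: {:.4f}.'.format(np.mean(scores), np.std(scores)))
        print('train_model_classification cost time:{}'.format(time.time() - start))
        return test_pred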
  • Removed duplicate columns (del train['n2.1']; del test['n2.1'], test['n2.2'], test['n2.3']): online 0.7384 (a duplicate-column check sketch follows the CV log below)
[0.7391476718117582, 0.7369324532976902, 0.7398767611322199, 0.7398380589746957, 0.7384306421534355]
CV mean score: 0.7388, std: 0.0011.
CV mean score: 0.8076, std: 0.0002.
train_model_classification cost time:426.4979546070099
0.7388451174739599
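The duplicate columns above can also be found programmatically; a small sketch (an assumption, not the author's code) that compares columns of the test set pairwise:

    # Sketch: detect columns whose values exactly duplicate another column
    dup_cols = []
    cols = test.columns.tolist()
    for i, c1 in enumerate(cols):
        for c2 in cols[i + 1:]:
            if test[c1].equals(test[c2]):
                dup_cols.append(c2)
    print(dup_cols)  # per the note above, these turn out to be n2.1 / n2.2 / n2.3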
  • Added amount_feas: basic aggregation features and basic cross features. Offline: lgb_acc0.80758375auc0.7401055866296563.csv; online 0.7398 (aggregation/cross-feature sketch below the CV log)
[0.8077125, 0.80695, 0.80773125, 0.807575, 0.80795]
[0.7407006310419633, 0.7382068413277062, 0.7420021529449097, 0.7404786259822548, 0.7391396818514473]
CV mean score: 0.7401, std: 0.0013.
CV mean score: 0.8076, std: 0.0003.
train_model_classification cost time:950.9784574508667
0.7401055866296563
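The exact contents of amount_feas are not shown; a minimal sketch of basic aggregation and cross features over amount-like columns, assuming loanAmnt, annualIncome, and installment from this dataset and grade / subGrade as grouping keys:

    # Sketch (assumed column choices): aggregate amount columns within grade / subGrade,
    # then build simple cross (difference / ratio) features against the group mean
    amount_feas = ['loanAmnt', 'annualIncome', 'installment']
    for key in ['grade', 'subGrade']:
        for fea in amount_feas:
            agg = data.groupby(key)[fea].agg(['mean', 'std']).add_prefix(f'{key}_{fea}_').reset_index()
            data = data.merge(agg, on=key, how='left')
            data[f'{fea}_minus_{key}_mean'] = data[fea] - data[f'{key}_{fea}_mean']
            data[f'{fea}_div_{key}_mean'] = data[fea] / (data[f'{key}_{fea}_mean'] + 1e-5)
    # basic cross features between raw amounts
    data['loanAmnt_div_annualIncome'] = data['loanAmnt'] / (data['annualIncome'] + 1e-5)
    data['installment_div_loanAmnt'] = data['installment'] / (data['loanAmnt'] + 1e-5)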
  • Added rank features and features from the anonymous fields. Offline: lgb_acc0.807745auc0.7402264822844884.csv; online 0.7397; number of features: 900 (rank-feature sketch below the CV log)
[0.80796875, 0.80708125, 0.807675, 0.8078375, 0.8081625]
[0.7410432486212892, 0.7384594844744258, 0.7415899647190336, 0.7405553938789996, 0.7394843197286938]
CV mean score: 0.7402, std: 0.0011.
CV mean score: 0.8077, std: 0.0004.
train_model_classification cost time:1233.052993774414
0.7402264822844884
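A rank feature simply replaces a numeric value with its rank over the full dataset; a small sketch with an assumed column list:

    # Sketch: rank-transform a few numeric columns (the list is an assumption)
    rank_feas = ['loanAmnt', 'interestRate', 'installment', 'annualIncome', 'dti']
    for fea in rank_feas:
        data[fea + '_rank'] = data[fea].rank(method='average') / len(data)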
  • Offline: lgb_acc0.80779auc0.7403928902618715.csv; online 0.7399
(1) Removed: data['issueDate_hour'] = data['issueDate'].dt.hour
(2) Converted earliesCreditLine to a datetime format:
    # month_maps (defined elsewhere) maps month abbreviations such as 'Aug' to month numbers;
    # earliesCreditLine values look like 'Aug-2001'
    def ym(x):
        month, year = x.split('-')
        month = month_maps[month]
        return year + '-' + str(month)

    # progress_apply requires tqdm.pandas() to have been registered
    data['earliesCreditLine'] = data['earliesCreditLine'].progress_apply(lambda x: ym(x))
    data['earliesCreditLine'] = pd.to_datetime(data['earliesCreditLine'], format='%Y-%m')
(3) Added lag features (see the sketch below the CV log).
(4) Adjusted the basic aggregation features: kept only mean and std.
(5) Expanded the cat_list features and added conversion-rate features:
    # cat_list = [i for i in train.columns if i not in ['id', 'isDefault', 'policyCode']]
    cat_list = [i for i in data.columns if i not in ['id', 'isDefault', 'policyCode']]
[0.80796875, 0.80694375, 0.80818125, 0.80779375, 0.8080625]
[0.7407775246570047, 0.7391459325455945, 0.7418385574863872, 0.7403826784630242, 0.7398197581573464]
CV mean score: 0.7404, std: 0.0009.
CV mean score: 0.8078, std: 0.0004.
train_model_classification cost time:1030.9157075881958
0.7403928902618715
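What the lag features are is not spelled out above; one common construction for this kind of tabular data is to take the previous row's value within a group, ordered by issueDate. A sketch under that assumption, using grade and loanAmnt as example group/value columns:

    # Sketch only: "lag" assumed to mean the previous value within a group,
    # ordered by issue date
    data = data.sort_values(['grade', 'issueDate'])
    data['loanAmnt_lag1'] = data.groupby('grade')['loanAmnt'].shift(1)
    data = data.sort_index()  # restore the original row order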
  • Offline: lgb_acc0.807715auc0.7404322941427292.csv; online 0.7399; 1016 features
[0.80795, 0.807075, 0.807875, 0.8076625, 0.8080125]
[0.7410214905507795, 0.738570364832511, 0.7415226626273519, 0.7407998724294863, 0.7402470802735167]
CV mean score: 0.7404, std: 0.0010.
CV mean score: 0.8077, std: 0.0003.
train_model_classification cost time:988.2512986660004
0.7404322941427292
# Missing-value count features
    # For each feature that has missing values (encoded as -1), count the missing rows
    # per grade / subGrade / issueDate and turn the counts into ratio features.
    # grade_count, subGrade_count and issueDateCount-style columns are assumed to be
    # precomputed group sizes.
    for i in tqdm(n_feas, desc="missing-value stats"):
        a = data.loc[data[i] == -1]
        e = a.groupby(['grade'])['id'].count().reset_index(name=i + '_grade_count')
        data = data.merge(e, on='grade', how='left')

        d = a.groupby(['subGrade'])['id'].count().reset_index(name=i + '_subGrade_count')
        data = data.merge(d, on='subGrade', how='left')

        m = a.groupby(['issueDate'])['id'].count().reset_index(name=i + '_issueDate_count')
        data = data.merge(m, on='issueDate', how='left')

        data['gradeloss_' + i] = data[i + '_grade_count'] / data['grade_count']
        data['subGradeloss_' + i] = data[i + '_subGrade_count'] / data['subGrade_count']
        data['issueDateloss_' + i] = data[i + '_issueDate_count'] / data['issueDate_count']
    # ===================== 5-fold conversion-rate features ====================
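The code under the "5-fold conversion-rate features" header is not included above. A minimal sketch of how such out-of-fold target encoding ("conversion rate" = mean of isDefault per category value) is commonly built; the function name, fold scheme, and parameters are assumptions, with cat_cols standing in for the cat_list defined earlier:

    from sklearn.model_selection import StratifiedKFold

    def add_oof_conversion_rate(train, test, cat_cols, target='isDefault', n_splits=5, seed=2020):
        # Sketch, not the author's exact code: out-of-fold target encoding
        skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
        for col in cat_cols:
            new_col = col + '_cvr'
            train[new_col] = 0.0
            for trn_idx, val_idx in skf.split(train, train[target]):
                # rate learned on the training folds only, applied to the held-out fold
                rate = train.iloc[trn_idx].groupby(col)[target].mean()
                train.loc[train.index[val_idx], new_col] = train.iloc[val_idx][col].map(rate)
            # the test set uses the rate computed on the full training data
            test[new_col] = test[col].map(train.groupby(col)[target].mean())
            train[new_col] = train[new_col].fillna(train[target].mean())
            test[new_col] = test[new_col].fillna(train[target].mean())
        return train, test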
  • Removed useless features. Offline: lgb_acc0.8079175auc0.7404045216502018.csv; online 0.7400
Early stopping, best iteration is:
[487]	training's binary_logloss: 0.414764	training's auc: 0.782244	valid_1's binary_logloss: 0.438547	valid_1's auc: 0.739725
[0.80831875, 0.8075, 0.80803125, 0.80783125, 0.80790625]
[0.7409709557088971, 0.7385389215613893, 0.742264346768772, 0.7405231586315258, 0.7397252255804245]
CV mean score: 0.7404, std: 0.0012.
CV mean score: 0.8079, std: 0.0003.
train_model_classification cost time:1985.6873161792755
0.7404045216502018
  • Added word2vec features: online 0.7402 (a sketch of a possible w2v_feat follows the code below)
    data['rid'] = data.apply(lambda x: [i + 'n' + str(x[i]) for i in n_feas], axis=1)
    data = w2v_feat(data)
    del data['rid']
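w2v_feat itself is not shown; a minimal sketch of one possible implementation, training gensim Word2Vec (>= 4.0) on the per-row token lists in rid and averaging the token vectors into a fixed-length embedding (the dimension and hyperparameters are assumptions):

    import numpy as np
    from gensim.models import Word2Vec  # gensim >= 4.0

    def w2v_feat(data, size=16, seed=2020):
        # each row of 'rid' is a list of tokens like 'n0n-1.0', 'n1n3.0', ...
        sentences = data['rid'].tolist()
        model = Word2Vec(sentences, vector_size=size, window=5,
                         min_count=1, workers=4, seed=seed)
        # average the token vectors of each row into one fixed-length vector
        vecs = np.vstack([
            np.mean([model.wv[t] for t in row], axis=0) for row in sentences
        ])
        for k in range(size):
            data['w2v_' + str(k)] = vecs[:, k]
        return data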

xgboost

[0.80745625, 0.8065875, 0.80711875, 0.8072125, 0.8070375]
CV mean score: 0.8071, std: 0.0003.
train_model_classification cost time:2882.082005262375
0.8070824999999999
  • Combined features
Stopping. Best iteration:
[683]	validation_0-logloss:0.41535	validation_1-logloss:0.43947

[0.80765625, 0.806775, 0.8080125, 0.80755625, 0.8076625]
CV mean score: 0.8075, std: 0.0004.
train_model_classification cost time:9510.13922739029
0.8075325000000001
  • Removed some features
[0.8077875, 0.8069625, 0.80764375, 0.8078875, 0.80765625]
CV mean score: 0.8076, std: 0.0003.
train_model_classification cost time:25125.02781534195
0.8075875

Online 0.7402

catboost

[0.80724375, 0.8059, 0.80655625, 0.80650625, 0.8068625]
CV mean score: 0.8066, std: 0.0004.
train_model_classification cost time:972.1086599826813
0.80661375
  • Offline: catboost 0.80752375.csv; online 0.7389

  • Offline: catboost0.807885.csv; online 0.7407

[0.808, 0.80755, 0.80818125, 0.80801875, 0.807675]
CV mean score: 0.8079, std: 0.0002.
train_model_classification cost time:2310.5629098415375
0.807885

Model blending

import pandas as pd

lgb = pd.read_csv('result/lgb_auc0.7388451174739599.csv')
xgb = pd.read_csv('result/xgb_0.8070824999999999.csv')
ctb = pd.read_csv('result/catboost0.80661375.csv')
sub = xgb.copy()
# weighted geometric mean of the prediction ranks; the exponents sum to 1, so dividing
# by 200000 (the maximum rank, i.e. the number of test rows) rescales the blend into (0, 1]
sub['isDefault'] = (lgb['isDefault'].rank()**(0.7) * xgb['isDefault'].rank()**(0.15) * ctb['isDefault'].rank()**(0.15)) / 200000
sub['isDefault'] = sub['isDefault'].round(2)
sub.to_csv("result/submission.csv", index=False)

Online score: 0.7384

lgb = pd.read_csv('result/lgb_acc0.80779auc0.7403928902618715.csv')
# xgb = pd.read_csv('result/xgb_0.8070824999999999.csv')
ctb = pd.read_csv('result/catboost0.8077625000000002.csv')
sub = lgb.copy()
sub['isDefault'] = (lgb['isDefault'].rank()**(0.68) * ctb['isDefault'].rank()**(0.32))/200000
sub['isDefault'] = sub['isDefault'].round(2)
sub.to_csv("result/submission.csv",index=False)

Online score: 0.7405


lgb = pd.read_csv('result/lgb_acc0.8079175auc0.7404045216502018.csv')
xgb = pd.read_csv('result/xgb_0.8075875.csv')
ctb = pd.read_csv('result/catboost0.807885.csv')
sub = lgb.copy()
sub['isDefault'] = (lgb['isDefault'].rank() ** (0.4) * xgb['isDefault'].rank() ** (0.3) * ctb['isDefault'].rank() ** (0.3)) / 200000

sub['isDefault'] = sub['isDefault'].round(2)
sub.to_csv("result/submission.csv", index=False)

Online score: 0.7408

lgb = pd.read_csv('result/lgb_acc0.8079224999999999auc0.7402744005225992.csv')
xgb = pd.read_csv('result/xgb_0.8075875.csv')
ctb = pd.read_csv('result/catboost0.807885.csv')
sub = lgb.copy()
sub['isDefault'] = (lgb['isDefault'].rank() ** (0.15) * xgb['isDefault'].rank() ** (0.15) * ctb['isDefault'].rank() ** (0.7)) / 200000

sub['isDefault'] = sub['isDefault'].round(4)
sub.to_csv("result/submission.csv", index=False)

Online score: 0.7410
