
Machine Learning / Machine Learning Complete Guide for Python

ch4.08 Classification Practice _ Santander Customer Satisfaction Prediction

XGBoost
LightGBM

Rather than trying to improve performance through hyperparameter tuning with GridSearchCV,
★feature engineering is often more important for performance gains: outlier removal, standardization, and the like.

 

0.0.1  Data Preprocessing

 

import numpy as np 
import pandas as pd 
import matplotlib.pyplot as plt
import matplotlib

 

1  1. Checking the Data

cust_df = pd.read_csv("./train_santander.csv", encoding='latin-1')  # utf-8-sig would keep Korean text from being garbled
 
print('dataset shape:', cust_df.shape)
cust_df.head(5)

>>> dataset shape: (76020, 371)

 

cust_df.info()

>>>
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 76020 entries, 0 to 76019
Columns: 371 entries, ID to TARGET
dtypes: float64(111), int64(260)
memory usage: 215.2 MB

 

2  2. Data Preprocessing

# satisfied: 0, unsatisfied: 1
print(cust_df['TARGET'].value_counts())

>>> 0    73012
    1     3008
    Name: TARGET, dtype: int64

 

# proportion of unsatisfied customers among all responses
total_cnt = cust_df['TARGET'].count()

unsatisfied_cnt = cust_df[cust_df['TARGET'] == 1]['TARGET'].count()

print('unsatisfied ratio: {0:.2f}'.format((unsatisfied_cnt / total_cnt)))

>>> unsatisfied ratio: 0.04
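Only about 4% of the labels are positive (unsatisfied), so the dataset is heavily imbalanced. One option worth knowing about (not used in this post) is XGBoost's scale_pos_weight parameter, which reweights the positive class; a minimal sketch, using the common negative/positive count ratio heuristic:

# hypothetical: reweight the minority class by the negative/positive ratio (~24 here)
pos_weight = (total_cnt - unsatisfied_cnt) / unsatisfied_cnt

from xgboost import XGBClassifier
weighted_clf = XGBClassifier(n_estimators=100, scale_pos_weight=pos_weight, random_state=156)

Reweighting mainly shifts the model toward higher recall on the minority class; since this post evaluates with ROC AUC, it may or may not help, so it should be compared against the unweighted model.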

 

cust_df.describe( )

-> The distribution of var3 values looks odd.

print(cust_df['var3'].value_counts( )[:10])

>>>
     2         74165
     8           138
    -999999      116
     9           110
     3           108
     1           105
     13           98
     7            97
     4            86
     12           85
    Name: var3, dtype: int64

 

-> Replace the -999999 values with 2, the most frequent value.

# replace the anomalous var3 feature values
cust_df['var3'].replace(-999999, 2, inplace=True)

# drop ID since it is not needed
cust_df.drop('ID', axis=1, inplace=True)

 

# separate features and target
X_features = cust_df.iloc[:, :-1]
y_labels = cust_df.iloc[:, -1]
print('feature data shape:{0}'.format(X_features.shape))

>>> feature data shape:(76020, 369)

 

from sklearn.model_selection import train_test_split

# split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X_features, y_labels,
                                                    test_size=0.2, random_state=0, stratify=y_labels)

 

# check the label distributions to confirm the split is balanced
train_cnt = y_train.count()
test_cnt = y_test.count()
print('train set shape:{0}, test set shape:{1}'.format(X_train.shape, X_test.shape), '\n')

print(' train set label distribution')
print(y_train.value_counts()/train_cnt)

print('\n test set label distribution')
print(y_test.value_counts()/test_cnt)

>>>
    train set shape:(60816, 369), test set shape:(15204, 369) 

     train set label distribution
    0    0.960438
    1    0.039562
    Name: TARGET, dtype: float64

     test set label distribution
    0    0.960405
    1    0.039595
    Name: TARGET, dtype: float64

 

-> The train and test sets ended up with fairly similar label distributions.

 

3  3. Training XGBoost

from xgboost import XGBClassifier
from sklearn.metrics import roc_auc_score

# create an XGBClassifier object
xgb_clf = XGBClassifier(n_estimators=100, random_state=156)

# train: set the evaluation metric to auc and fit.
xgb_clf.fit(X_train, y_train, early_stopping_rounds=30,
            # strictly, the validation set should be separate from the test set, but here the test set is used for simplicity
            eval_metric="auc", eval_set=[(X_train, y_train), (X_test, y_test)])

 

[0]	validation_0-auc:0.82570	validation_1-auc:0.79283
[1]	validation_0-auc:0.84010	validation_1-auc:0.80737
[2]	validation_0-auc:0.84361	validation_1-auc:0.81021
[3]	validation_0-auc:0.84783	validation_1-auc:0.81287
[4]	validation_0-auc:0.85123	validation_1-auc:0.81469
[5]	validation_0-auc:0.85518	validation_1-auc:0.81860
[6]	validation_0-auc:0.85922	validation_1-auc:0.81977
[7]	validation_0-auc:0.86238	validation_1-auc:0.82034
[8]	validation_0-auc:0.86570	validation_1-auc:0.82147
[9]	validation_0-auc:0.86798	validation_1-auc:0.82301
[10]	validation_0-auc:0.87104	validation_1-auc:0.82379
[11]	validation_0-auc:0.87448	validation_1-auc:0.82456
[12]	validation_0-auc:0.87687	validation_1-auc:0.82401
[13]	validation_0-auc:0.87918	validation_1-auc:0.82467
[14]	validation_0-auc:0.88081	validation_1-auc:0.82508
[15]	validation_0-auc:0.88331	validation_1-auc:0.82379
[16]	validation_0-auc:0.88569	validation_1-auc:0.82457
[17]	validation_0-auc:0.88674	validation_1-auc:0.82453
[18]	validation_0-auc:0.88885	validation_1-auc:0.82354
[19]	validation_0-auc:0.89038	validation_1-auc:0.82328
[20]	validation_0-auc:0.89090	validation_1-auc:0.82305
[21]	validation_0-auc:0.89327	validation_1-auc:0.82305
[22]	validation_0-auc:0.89397	validation_1-auc:0.82287
[23]	validation_0-auc:0.89712	validation_1-auc:0.82256
[24]	validation_0-auc:0.89745	validation_1-auc:0.82237
[25]	validation_0-auc:0.89779	validation_1-auc:0.82237
[26]	validation_0-auc:0.89844	validation_1-auc:0.82248
[27]	validation_0-auc:0.89912	validation_1-auc:0.82199
[28]	validation_0-auc:0.89987	validation_1-auc:0.82157
[29]	validation_0-auc:0.90071	validation_1-auc:0.82106
[30]	validation_0-auc:0.90210	validation_1-auc:0.82075
[31]	validation_0-auc:0.90433	validation_1-auc:0.82011
[32]	validation_0-auc:0.90495	validation_1-auc:0.82001
[33]	validation_0-auc:0.90548	validation_1-auc:0.82032
[34]	validation_0-auc:0.90610	validation_1-auc:0.81969
[35]	validation_0-auc:0.90661	validation_1-auc:0.81968
[36]	validation_0-auc:0.90682	validation_1-auc:0.81939
[37]	validation_0-auc:0.90842	validation_1-auc:0.81969
[38]	validation_0-auc:0.90869	validation_1-auc:0.81962
[39]	validation_0-auc:0.90888	validation_1-auc:0.81957
[40]	validation_0-auc:0.91045	validation_1-auc:0.81893
[41]	validation_0-auc:0.91171	validation_1-auc:0.81869
[42]	validation_0-auc:0.91239	validation_1-auc:0.81802
[43]	validation_0-auc:0.91329	validation_1-auc:0.81773
[44]	validation_0-auc:0.91353	validation_1-auc:0.81771
XGBClassifier(base_score=0.5, booster='gbtree', callbacks=None,
              colsample_bylevel=1, colsample_bynode=1, colsample_bytree=1,
              early_stopping_rounds=None, enable_categorical=False,
              eval_metric=None, gamma=0, gpu_id=-1, grow_policy='depthwise',
              importance_type=None, interaction_constraints='',
              learning_rate=0.300000012, max_bin=256, max_cat_to_onehot=4,
              max_delta_step=0, max_depth=6, max_leaves=0, min_child_weight=1,
              missing=nan, monotone_constraints='()', n_estimators=100,
              n_jobs=0, num_parallel_tree=1, predictor='auto', random_state=156,
              reg_alpha=0, reg_lambda=1, ...)
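One caveat about the fit() call above: in recent xgboost releases (1.6 and later), passing early_stopping_rounds and eval_metric to fit() is deprecated, and they belong on the estimator itself. A minimal sketch of the newer style, assuming xgboost >= 1.6:

# newer xgboost API: early stopping and metric are configured on the constructor
xgb_clf = XGBClassifier(n_estimators=100, random_state=156,
                        early_stopping_rounds=30, eval_metric='auc')
xgb_clf.fit(X_train, y_train, eval_set=[(X_train, y_train), (X_test, y_test)])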

 

 

# predict_proba(...)[:,1] gives the positive-class scores; average='macro' has no effect for a binary target
xgb_roc_score = roc_auc_score(y_test, xgb_clf.predict_proba(X_test)[:,1], average='macro')
print('ROC AUC: {0:.4f}'.format(xgb_roc_score))

>>> ROC AUC: 0.8251

 

 

4  4. Training XGBoost with GridSearchCV

 

%%time

from sklearn.model_selection import GridSearchCV

# reduce n_estimators to 100 to speed up the hyperparameter search
xgb_clf = XGBClassifier(n_estimators=100)

# colsample_bytree: with this many columns, sample only a fraction per tree to curb overfitting
# min_child_weight: minimum sum of instance weight (hessian) needed in a child to allow a further split (> 0)
params = {'max_depth':[5, 7] , 'min_child_weight':[1, 3] , 'colsample_bytree':[0.5, 0.75] }

# set cv=2 to speed up the hyperparameter search
gridcv = GridSearchCV(xgb_clf, param_grid=params, cv=2)
gridcv.fit(X_train, y_train, early_stopping_rounds=30, eval_metric="auc",
           eval_set=[(X_train, y_train), (X_test, y_test)])

print('GridSearchCV best parameters:', gridcv.best_params_)

xgb_roc_score = roc_auc_score(y_test, gridcv.predict_proba(X_test)[:,1], average='macro')
print('ROC AUC: {0:.4f}'.format(xgb_roc_score))


>>>
GridSearchCV best parameters: {'colsample_bytree': 0.5, 'max_depth': 5, 'min_child_weight': 3}
ROC AUC: 0.8237
CPU times: total: 3min 24s
Wall time: 1min 40s

 

-> Even with cross-validated tuning, the result is not much different from before.

 

4.0.1  Rather than improving performance through hyperparameter tuning,

4.0.2  ★feature engineering is often more important for performance gains: outlier removal, standardization

 

5  Attempt: increase n_estimators to 1000, decrease learning_rate to 0.02, add reg_alpha=0.03

 

%%time
# takes about 1 minute to run
# create an XGBClassifier with the new hyperparameters applied
xgb_clf = XGBClassifier(n_estimators=1000, random_state=156, learning_rate=0.02, max_depth=5,
                        min_child_weight=1, colsample_bytree=0.75, reg_alpha=0.03)
# reg_alpha: L1 regularization strength
# parameter reference: http://okminseok.blogspot.com/2017/09/ml-xgboost.html

# train
xgb_clf.fit(X_train, y_train, early_stopping_rounds=200, 
            eval_metric="auc", eval_set=[(X_train, y_train), (X_test, y_test)])

# evaluate (ROC AUC)
xgb_roc_score = roc_auc_score(y_test, xgb_clf.predict_proba(X_test)[:,1])
print('ROC AUC: {0:.4f}'.format(xgb_roc_score))
ROC AUC: 0.8271
CPU times: total: 3min 47s
Wall time: 1min 15s

 

# same as the previous model, with only max_depth changed to 7
xgb_clf = XGBClassifier(n_estimators=1000, random_state=156, learning_rate=0.02, max_depth=7,\
                        min_child_weight=1, colsample_bytree=0.75, reg_alpha=0.03)

# set the evaluation metric to auc and early stopping to 200, then train.
xgb_clf.fit(X_train, y_train, early_stopping_rounds=200, 
            eval_metric="auc", eval_set=[(X_train, y_train), (X_test, y_test)])

xgb_roc_score = roc_auc_score(y_test, xgb_clf.predict_proba(X_test)[:,1], average='macro')
print('ROC AUC: {0:.4f}'.format(xgb_roc_score))

>>> ROC AUC: 0.8269

 

from xgboost import plot_importance
import matplotlib.pyplot as plt
%matplotlib inline

fig, ax = plt.subplots(1,1,figsize=(10,8))
plot_importance(xgb_clf, ax=ax , max_num_features=20,height=0.4)

F score: how many times a feature is used to split the data across all trees

-> Feature names show up because a pandas DataFrame was passed to xgb_clf.fit(X_train, ...), so xgboost keeps the column names.
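To read the same importances as numbers rather than from the plot, the underlying booster exposes them directly; a small sketch (importance_type='weight' corresponds to the F score in the plot):

# F score per feature: {feature_name: number of times used in a split}
f_scores = xgb_clf.get_booster().get_score(importance_type='weight')
top20 = sorted(f_scores.items(), key=lambda kv: kv[1], reverse=True)[:20]
print(top20)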

 

6  LightGBM Training and Hyperparameter Tuning

 

!pip install lightgbm
import lightgbm

 


from lightgbm import LGBMClassifier

# create an LGBMClassifier object
lgbm_clf = LGBMClassifier(n_estimators=500)

# validation data
evals = [(X_test, y_test)]

# train
lgbm_clf.fit(X_train, y_train, early_stopping_rounds=100, eval_metric="auc", eval_set=evals,
                verbose=True)
# evaluate (ROC AUC)
lgbm_roc_score = roc_auc_score(y_test, lgbm_clf.predict_proba(X_test)[:,1], average='macro')
print('ROC AUC: {0:.4f}'.format(lgbm_roc_score))
[1]	valid_0's auc: 0.795963	valid_0's binary_logloss: 0.159288
[2]	valid_0's auc: 0.801789	valid_0's binary_logloss: 0.155038
[3]	valid_0's auc: 0.803367	valid_0's binary_logloss: 0.15185
[4]	valid_0's auc: 0.805168	valid_0's binary_logloss: 0.14961
[5]	valid_0's auc: 0.809401	valid_0's binary_logloss: 0.147695
[6]	valid_0's auc: 0.810671	valid_0's binary_logloss: 0.146234
[7]	valid_0's auc: 0.815356	valid_0's binary_logloss: 0.144877
[8]	valid_0's auc: 0.816777	valid_0's binary_logloss: 0.143783
[9]	valid_0's auc: 0.817821	valid_0's binary_logloss: 0.143084
[10]	valid_0's auc: 0.818637	valid_0's binary_logloss: 0.142272
[11]	valid_0's auc: 0.81939	valid_0's binary_logloss: 0.141623
[12]	valid_0's auc: 0.821106	valid_0's binary_logloss: 0.141043
[13]	valid_0's auc: 0.822008	valid_0's binary_logloss: 0.140628
[14]	valid_0's auc: 0.822584	valid_0's binary_logloss: 0.140198
[15]	valid_0's auc: 0.822291	valid_0's binary_logloss: 0.139909
[16]	valid_0's auc: 0.821862	valid_0's binary_logloss: 0.13967
[17]	valid_0's auc: 0.822242	valid_0's binary_logloss: 0.139418
[18]	valid_0's auc: 0.821997	valid_0's binary_logloss: 0.139196
[19]	valid_0's auc: 0.821995	valid_0's binary_logloss: 0.139074
[20]	valid_0's auc: 0.821819	valid_0's binary_logloss: 0.138961
[21]	valid_0's auc: 0.82203	valid_0's binary_logloss: 0.138881
[22]	valid_0's auc: 0.82178	valid_0's binary_logloss: 0.138831
[23]	valid_0's auc: 0.82121	valid_0's binary_logloss: 0.138769
[24]	valid_0's auc: 0.821765	valid_0's binary_logloss: 0.138673
[25]	valid_0's auc: 0.822462	valid_0's binary_logloss: 0.138592
[26]	valid_0's auc: 0.822608	valid_0's binary_logloss: 0.138429
[27]	valid_0's auc: 0.823151	valid_0's binary_logloss: 0.138356
[28]	valid_0's auc: 0.822983	valid_0's binary_logloss: 0.138295
[29]	valid_0's auc: 0.822815	valid_0's binary_logloss: 0.138256
[30]	valid_0's auc: 0.8224	valid_0's binary_logloss: 0.138291
[31]	valid_0's auc: 0.822417	valid_0's binary_logloss: 0.138282
[32]	valid_0's auc: 0.822743	valid_0's binary_logloss: 0.138242
[33]	valid_0's auc: 0.822874	valid_0's binary_logloss: 0.138243
[34]	valid_0's auc: 0.82294	valid_0's binary_logloss: 0.138245
[35]	valid_0's auc: 0.822962	valid_0's binary_logloss: 0.138245
[36]	valid_0's auc: 0.82295	valid_0's binary_logloss: 0.138258
[37]	valid_0's auc: 0.823193	valid_0's binary_logloss: 0.138293
[38]	valid_0's auc: 0.82352	valid_0's binary_logloss: 0.138206
[39]	valid_0's auc: 0.823274	valid_0's binary_logloss: 0.138248
[40]	valid_0's auc: 0.823622	valid_0's binary_logloss: 0.138203
[41]	valid_0's auc: 0.823716	valid_0's binary_logloss: 0.1382
[42]	valid_0's auc: 0.823582	valid_0's binary_logloss: 0.138227
[43]	valid_0's auc: 0.823891	valid_0's binary_logloss: 0.138182
[44]	valid_0's auc: 0.823821	valid_0's binary_logloss: 0.138205
[45]	valid_0's auc: 0.823827	valid_0's binary_logloss: 0.138218
[46]	valid_0's auc: 0.82363	valid_0's binary_logloss: 0.138281
[47]	valid_0's auc: 0.823169	valid_0's binary_logloss: 0.138356
[48]	valid_0's auc: 0.822579	valid_0's binary_logloss: 0.138486
[49]	valid_0's auc: 0.822526	valid_0's binary_logloss: 0.138442
[50]	valid_0's auc: 0.822413	valid_0's binary_logloss: 0.138447
[51]	valid_0's auc: 0.822634	valid_0's binary_logloss: 0.138443
[52]	valid_0's auc: 0.822593	valid_0's binary_logloss: 0.138458
[53]	valid_0's auc: 0.822446	valid_0's binary_logloss: 0.138513
[54]	valid_0's auc: 0.82236	valid_0's binary_logloss: 0.138534
[55]	valid_0's auc: 0.822185	valid_0's binary_logloss: 0.138537
[56]	valid_0's auc: 0.821939	valid_0's binary_logloss: 0.138539
[57]	valid_0's auc: 0.821903	valid_0's binary_logloss: 0.138537
[58]	valid_0's auc: 0.821741	valid_0's binary_logloss: 0.13857
[59]	valid_0's auc: 0.821511	valid_0's binary_logloss: 0.138621
[60]	valid_0's auc: 0.821386	valid_0's binary_logloss: 0.138661
[61]	valid_0's auc: 0.821112	valid_0's binary_logloss: 0.138732
[62]	valid_0's auc: 0.820872	valid_0's binary_logloss: 0.13879
[63]	valid_0's auc: 0.820854	valid_0's binary_logloss: 0.138797
[64]	valid_0's auc: 0.820634	valid_0's binary_logloss: 0.138854
[65]	valid_0's auc: 0.820345	valid_0's binary_logloss: 0.138907
[66]	valid_0's auc: 0.820225	valid_0's binary_logloss: 0.138951
[67]	valid_0's auc: 0.819931	valid_0's binary_logloss: 0.139008
[68]	valid_0's auc: 0.819816	valid_0's binary_logloss: 0.139055
[69]	valid_0's auc: 0.819712	valid_0's binary_logloss: 0.139084
[70]	valid_0's auc: 0.819595	valid_0's binary_logloss: 0.139154
[71]	valid_0's auc: 0.819525	valid_0's binary_logloss: 0.139178
[72]	valid_0's auc: 0.819243	valid_0's binary_logloss: 0.139267
[73]	valid_0's auc: 0.819003	valid_0's binary_logloss: 0.139362
[74]	valid_0's auc: 0.818834	valid_0's binary_logloss: 0.139418
[75]	valid_0's auc: 0.818683	valid_0's binary_logloss: 0.139461
[76]	valid_0's auc: 0.818202	valid_0's binary_logloss: 0.139515
[77]	valid_0's auc: 0.817946	valid_0's binary_logloss: 0.139603
[78]	valid_0's auc: 0.817773	valid_0's binary_logloss: 0.139637
[79]	valid_0's auc: 0.817531	valid_0's binary_logloss: 0.139694
[80]	valid_0's auc: 0.81722	valid_0's binary_logloss: 0.139759
[81]	valid_0's auc: 0.817155	valid_0's binary_logloss: 0.139801
[82]	valid_0's auc: 0.817079	valid_0's binary_logloss: 0.139797
[83]	valid_0's auc: 0.816882	valid_0's binary_logloss: 0.13985
[84]	valid_0's auc: 0.816712	valid_0's binary_logloss: 0.139922
[85]	valid_0's auc: 0.816481	valid_0's binary_logloss: 0.139974
[86]	valid_0's auc: 0.816117	valid_0's binary_logloss: 0.140075
[87]	valid_0's auc: 0.816099	valid_0's binary_logloss: 0.140059
[88]	valid_0's auc: 0.816001	valid_0's binary_logloss: 0.140106
[89]	valid_0's auc: 0.815796	valid_0's binary_logloss: 0.140168
[90]	valid_0's auc: 0.815783	valid_0's binary_logloss: 0.140193
[91]	valid_0's auc: 0.815826	valid_0's binary_logloss: 0.14019
[92]	valid_0's auc: 0.815679	valid_0's binary_logloss: 0.140253
[93]	valid_0's auc: 0.815516	valid_0's binary_logloss: 0.1403
[94]	valid_0's auc: 0.815257	valid_0's binary_logloss: 0.140362
[95]	valid_0's auc: 0.815118	valid_0's binary_logloss: 0.140395
[96]	valid_0's auc: 0.814988	valid_0's binary_logloss: 0.140461
[97]	valid_0's auc: 0.814916	valid_0's binary_logloss: 0.140491
[98]	valid_0's auc: 0.814719	valid_0's binary_logloss: 0.140519
[99]	valid_0's auc: 0.814513	valid_0's binary_logloss: 0.140596
[100]	valid_0's auc: 0.814615	valid_0's binary_logloss: 0.140596
[101]	valid_0's auc: 0.81455	valid_0's binary_logloss: 0.140632
[102]	valid_0's auc: 0.814524	valid_0's binary_logloss: 0.140652
[103]	valid_0's auc: 0.814332	valid_0's binary_logloss: 0.140729
[104]	valid_0's auc: 0.814406	valid_0's binary_logloss: 0.140729
[105]	valid_0's auc: 0.814494	valid_0's binary_logloss: 0.140736
[106]	valid_0's auc: 0.814446	valid_0's binary_logloss: 0.140757
[107]	valid_0's auc: 0.814092	valid_0's binary_logloss: 0.140874
[108]	valid_0's auc: 0.814111	valid_0's binary_logloss: 0.140904
[109]	valid_0's auc: 0.813941	valid_0's binary_logloss: 0.140935
[110]	valid_0's auc: 0.813914	valid_0's binary_logloss: 0.140942
[111]	valid_0's auc: 0.813801	valid_0's binary_logloss: 0.140986
[112]	valid_0's auc: 0.813293	valid_0's binary_logloss: 0.141086
[113]	valid_0's auc: 0.813284	valid_0's binary_logloss: 0.141107
[114]	valid_0's auc: 0.812815	valid_0's binary_logloss: 0.141209
[115]	valid_0's auc: 0.812663	valid_0's binary_logloss: 0.141244
[116]	valid_0's auc: 0.812822	valid_0's binary_logloss: 0.14121
[117]	valid_0's auc: 0.812623	valid_0's binary_logloss: 0.141262
[118]	valid_0's auc: 0.812538	valid_0's binary_logloss: 0.141281
[119]	valid_0's auc: 0.812578	valid_0's binary_logloss: 0.141293
[120]	valid_0's auc: 0.812801	valid_0's binary_logloss: 0.141211
[121]	valid_0's auc: 0.812663	valid_0's binary_logloss: 0.141287
[122]	valid_0's auc: 0.812485	valid_0's binary_logloss: 0.141327
[123]	valid_0's auc: 0.812555	valid_0's binary_logloss: 0.141351
[124]	valid_0's auc: 0.813007	valid_0's binary_logloss: 0.141304
[125]	valid_0's auc: 0.81289	valid_0's binary_logloss: 0.141299
[126]	valid_0's auc: 0.812579	valid_0's binary_logloss: 0.141388
[127]	valid_0's auc: 0.812699	valid_0's binary_logloss: 0.141409
[128]	valid_0's auc: 0.81256	valid_0's binary_logloss: 0.141436
[129]	valid_0's auc: 0.812456	valid_0's binary_logloss: 0.141466
[130]	valid_0's auc: 0.812543	valid_0's binary_logloss: 0.141488
[131]	valid_0's auc: 0.812427	valid_0's binary_logloss: 0.141521
[132]	valid_0's auc: 0.812236	valid_0's binary_logloss: 0.141618
[133]	valid_0's auc: 0.812159	valid_0's binary_logloss: 0.141665
[134]	valid_0's auc: 0.811863	valid_0's binary_logloss: 0.141757
[135]	valid_0's auc: 0.811658	valid_0's binary_logloss: 0.141818
[136]	valid_0's auc: 0.811612	valid_0's binary_logloss: 0.141817
[137]	valid_0's auc: 0.811529	valid_0's binary_logloss: 0.14184
[138]	valid_0's auc: 0.811442	valid_0's binary_logloss: 0.141896
[139]	valid_0's auc: 0.811524	valid_0's binary_logloss: 0.141931
[140]	valid_0's auc: 0.811309	valid_0's binary_logloss: 0.142004
[141]	valid_0's auc: 0.811321	valid_0's binary_logloss: 0.142035
[142]	valid_0's auc: 0.811275	valid_0's binary_logloss: 0.142077
[143]	valid_0's auc: 0.811453	valid_0's binary_logloss: 0.142082
ROC AUC: 0.8239

 

-> Compared with XGBoost, the training time is much faster!
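A compatibility note: LightGBM 4.0 removed the early_stopping_rounds and verbose arguments from fit(). A sketch of the equivalent call with the callbacks API, assuming lightgbm >= 4.0:

# LightGBM 4.x style: early stopping and logging via callbacks
from lightgbm import LGBMClassifier, early_stopping, log_evaluation

lgbm_clf = LGBMClassifier(n_estimators=500)
lgbm_clf.fit(X_train, y_train, eval_metric="auc", eval_set=[(X_test, y_test)],
             callbacks=[early_stopping(stopping_rounds=100), log_evaluation(period=1)])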

 

6.1  Running lgbm_clf with GridSearchCV

%%time
# run lgbm_clf through GridSearchCV
from sklearn.model_selection import GridSearchCV

# reduce n_estimators to 200 to speed up the hyperparameter search
lgbm_clf = LGBMClassifier(n_estimators=200)

params = {'num_leaves': [32, 64],
          'max_depth':[128, 160],
          'min_child_samples':[60, 100],
          'subsample':[0.8, 1]}

# cv is left unspecified (sklearn's default folds are used) to keep the search fast
gridcv = GridSearchCV(lgbm_clf, param_grid=params)
gridcv.fit(X_train, y_train, early_stopping_rounds=30, eval_metric="auc",
           eval_set=[(X_train, y_train), (X_test, y_test)])

print('GridSearchCV best parameters:', gridcv.best_params_)
lgbm_roc_score = roc_auc_score(y_test, gridcv.predict_proba(X_test)[:,1], average='macro')
print('ROC AUC: {0:.4f}'.format(lgbm_roc_score))
GridSearchCV best parameters: {'max_depth': 128, 'min_child_samples': 100, 'num_leaves': 32, 'subsample': 0.8}
ROC AUC: 0.8263
CPU times: total: 2min 31s
Wall time: 1min 43s

 

>>> Grid search result
lgbm GridSearchCV best parameters: {'max_depth': 128, 'min_child_samples': 100, 'num_leaves': 32, 'subsample': 0.8}

 

# apply the hyperparameters above, increase n_estimators to 1000 and early_stopping_rounds to 100
lgbm_clf = LGBMClassifier(n_estimators=1000, num_leaves=32, subsample=0.8, min_child_samples=100,
                          max_depth=128)

evals = [(X_test, y_test)]
lgbm_clf.fit(X_train, y_train, early_stopping_rounds=100, eval_metric="auc", eval_set=evals,
                verbose=True)

lgbm_roc_score = roc_auc_score(y_test, lgbm_clf.predict_proba(X_test)[:,1])
print('ROC AUC: {0:.4f}'.format(lgbm_roc_score))

>>>
ROC AUC: 0.8263

-> It is hard to maximize performance by tuning hyperparameters with grid search; this is especially true for an already strong algorithm like XGBoost.
Tree-based models in particular have so many hyperparameters that the results are often not worth the time spent.

The importance of feature engineering: shaping the data well through outlier handling, standardization, and similar steps is what really matters!
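To make that last point concrete, here is a minimal feature-engineering sketch for this dataset. The clipping thresholds and the scaler choice are assumptions for illustration, not something the experiments above used:

from sklearn.preprocessing import StandardScaler

X_fe = X_features.copy()

# outlier handling: clip each feature to its 1st/99th percentiles (thresholds are an assumption)
low, high = X_fe.quantile(0.01), X_fe.quantile(0.99)
X_fe = X_fe.clip(lower=low, upper=high, axis=1)

# standardization: zero mean, unit variance per feature
X_fe = pd.DataFrame(StandardScaler().fit_transform(X_fe), columns=X_fe.columns)

# X_fe would then go through the same train_test_split and model fitting as above

Whether this particular recipe improves ROC AUC here has to be checked empirically, but this kind of data shaping is where the gains usually come from.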