Using the Pima Indian Diabetes data set, we will build a machine learning model that predicts whether a patient has diabetes and apply the evaluation metrics covered so far.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, precision_score, recall_score, roc_auc_score
from sklearn.metrics import f1_score, confusion_matrix, precision_recall_curve, roc_curve
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

# Suppress warnings
import warnings
warnings.filterwarnings(action='ignore')

1. Loading and inspecting the data

# Load the Pima Indian Diabetes data
diabetes_data = pd.read_csv("C:/Users/JIN SEONG EUN/OneDrive/바탕 화면/빅데이터 분석가 과정/머신러닝/실강/CH03/diabetes.csv")

# Inspect the data
print(diabetes_data.shape)
diabetes_data.head()

Pregnancies: number of pregnancies
Glucose: glucose tolerance test result
BloodPressure: blood pressure (mm Hg)
SkinThickness: triceps skinfold thickness (mm)
Insulin: serum insulin (mu U/ml)
BMI: body mass index (weight (kg) / (height (m))^2)
DiabetesPedigreeFunction: diabetes pedigree weight
Age: age
Outcome: class label (0 or 1)

# Check the class distribution
print(diabetes_data['Outcome'].value_counts())

>>>

0    500
1    268
Name: Outcome, dtype: int64
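
The class counts above already set a floor for accuracy: a model that always predicts the majority class (0) is right 500 times out of 768. A minimal sketch of that arithmetic, using only the counts printed above (the variable names are illustrative):

```python
# Class counts copied from the value_counts() output above.
class_counts = {0: 500, 1: 268}

total = sum(class_counts.values())

# Share of positive (diabetic) records in the data set.
positive_ratio = class_counts[1] / total

# Accuracy of a trivial model that always predicts the majority class.
baseline_accuracy = max(class_counts.values()) / total

print(f"positive ratio: {positive_ratio:.4f}")
print(f"majority-class baseline accuracy: {baseline_accuracy:.4f}")
```

Any accuracy figure reported later is best read against this baseline, which is one reason precision, recall, F1, and AUC are examined alongside accuracy.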

diabetes_data.info( )

>>>
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 768 entries, 0 to 767
Data columns (total 9 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   Pregnancies               768 non-null    int64  
 1   Glucose                   768 non-null    int64  
 2   BloodPressure             768 non-null    int64  
 3   SkinThickness             768 non-null    int64  
 4   Insulin                   768 non-null    int64  
 5   BMI                       768 non-null    float64
 6   DiabetesPedigreeFunction  768 non-null    float64
 7   Age                       768 non-null    int64  
 8   Outcome                   768 non-null    int64  
dtypes: float64(2), int64(7)
memory usage: 54.1 KB

-> No column contains null values.

2. Training and prediction: Logistic Regression

# get_clf_eval with ROC AUC added
# : prints the evaluation metrics (confusion matrix, accuracy, precision, recall, F1 score, ROC AUC).
def get_clf_eval(y_test, pred=None, pred_proba=None):
    confusion = confusion_matrix(y_test, pred)
    accuracy = accuracy_score(y_test, pred)
    precision = precision_score(y_test, pred)
    recall = recall_score(y_test, pred)
    f1 = f1_score(y_test, pred)

    # ROC AUC is computed from the positive-class probabilities, not the predicted labels
    roc_auc = roc_auc_score(y_test, pred_proba)
    print('Confusion matrix')
    print(confusion)
    # Adjacent string literals are concatenated, which avoids the stray spaces
    # that a backslash continuation inside the string would print
    print('Accuracy: {0:.4f}, Precision: {1:.4f}, Recall: {2:.4f}, '
          'F1: {3:.4f}, AUC:{4:.4f}'.format(accuracy, precision, recall, f1, roc_auc), '\n')

# Extract the feature data set X and the label data set y.
# The last column, Outcome, is the label, so it is selected with column position -1.
X = diabetes_data.iloc[:, :-1]
y = diabetes_data.iloc[:, -1]

# Split into train and test sets, stratified on the label
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 156, stratify=y)

# Train with logistic regression
lr_clf = LogisticRegression()
lr_clf.fit(X_train , y_train)

# Predict
pred = lr_clf.predict(X_test)

# Predicted probability of the positive class, needed for roc_auc_score
pred_proba = lr_clf.predict_proba(X_test)[:, 1]

# Compute the evaluation metrics
get_clf_eval(y_test , pred, pred_proba)

>>>
Confusion matrix
[[88 12]
 [23 31]]
Accuracy: 0.7727, Precision: 0.7209, Recall: 0.5741, F1: 0.6392, AUC:0.7919

-> Recall is much lower than precision, so the model should be tuned toward higher recall.

Domain knowledge matters here: how to balance precision against recall depends on the domain.
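
To make the gap between precision and recall concrete, the metrics can be recomputed by hand from the confusion matrix printed above. The sketch below assumes scikit-learn's layout for a binary confusion matrix, [[TN FP], [FN TP]]; the helper name `metrics_from_confusion` is illustrative and not part of the original code.

```python
def metrics_from_confusion(tn, fp, fn, tp):
    """Return accuracy, precision, recall, and F1 from binary confusion-matrix counts."""
    accuracy = (tp + tn) / (tn + fp + fn + tp)
    precision = tp / (tp + fp)   # of the predicted positives, how many are correct
    recall = tp / (tp + fn)      # of the actual positives, how many are found
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f1

# Counts read from the confusion matrix above: [[TN FP], [FN TP]] = [[88 12], [23 31]]
accuracy, precision, recall, f1 = metrics_from_confusion(tn=88, fp=12, fn=23, tp=31)
print(f"accuracy={accuracy:.4f}, precision={precision:.4f}, recall={recall:.4f}, f1={f1:.4f}")
```

Recall is low because 23 of the 54 actual positives were missed (false negatives), while only 12 of the 43 positive predictions were wrong.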

3. Model evaluation

Reload get_clf_eval( ) and precision_recall_curve_plot( ) from the earlier examples.

# Revised get_clf_eval() function
def get_clf_eval(y_test, pred=None, pred_proba=None):  # computes and prints the evaluation metrics
    confusion = confusion_matrix(y_test, pred)          # confusion matrix
    accuracy = accuracy_score(y_test, pred)             # accuracy
    precision = precision_score(y_test, pred)           # precision
    recall = recall_score(y_test, pred)                 # recall
    f1 = f1_score(y_test, pred)                         # F1 score
    roc_auc = roc_auc_score(y_test, pred_proba)         # ROC AUC (from probabilities)
    print('Confusion matrix')
    print(confusion)
    # Adjacent string literals are concatenated, so no stray spaces are printed
    print('Accuracy: {0:.4f}, Precision: {1:.4f}, Recall: {2:.4f}, '
          'F1: {3:.4f}, AUC:{4:.4f}'.format(accuracy, precision, recall, f1, roc_auc), '\n')

# Plot of the precision-recall trade-off
def precision_recall_curve_plot(y_test=None, pred_proba_c1=None):
    # Extract the threshold ndarray and the precision and recall ndarrays for each threshold.
    precisions, recalls, thresholds = precision_recall_curve(y_test, pred_proba_c1)

    # Plot thresholds on the X axis and precision/recall on the Y axis; precision is dashed.
    # precisions and recalls have one more element than thresholds, so slice them to match.
    plt.figure(figsize=(8, 6))
    threshold_boundary = thresholds.shape[0]
    plt.plot(thresholds, precisions[0:threshold_boundary], linestyle='--', label='precision')
    plt.plot(thresholds, recalls[0:threshold_boundary], label='recall')

    # Set the X-axis tick spacing to 0.1
    start, end = plt.xlim()
    plt.xticks(np.round(np.arange(start, end, 0.1), 2))

    # Set the axis labels, legend, and grid
    plt.xlabel('Threshold value'); plt.ylabel('Precision and Recall value')
    plt.legend(); plt.grid()
    plt.show()

3.0.1 Precision-recall curve

pred_proba_c1 = lr_clf.predict_proba(X_test)[:, 1]

precision_recall_curve_plot(y_test, pred_proba_c1)
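
The curve plotted by the call above shows how precision and recall move as the decision threshold changes. A minimal plain-Python sketch of the same idea follows; the labels and probabilities are hypothetical toy values, not taken from the diabetes data, and `precision_recall_at` is an illustrative helper.

```python
# Hypothetical toy labels and positive-class probabilities (not from the diabetes data).
y_true = [0, 0, 1, 0, 1, 1, 0, 1]
probs = [0.10, 0.35, 0.40, 0.45, 0.55, 0.62, 0.70, 0.90]

def precision_recall_at(threshold):
    """Predict 1 when the probability exceeds the threshold, then score the result."""
    preds = [1 if p > threshold else 0 for p in probs]
    tp = sum(1 for y, yhat in zip(y_true, preds) if y == 1 and yhat == 1)
    fp = sum(1 for y, yhat in zip(y_true, preds) if y == 0 and yhat == 1)
    fn = sum(1 for y, yhat in zip(y_true, preds) if y == 1 and yhat == 0)
    # Convention used here: precision is 0.0 when nothing is predicted positive.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn)
    return precision, recall

results = {t: precision_recall_at(t) for t in (0.3, 0.5, 0.65)}
for t, (p, r) in results.items():
    print(f"threshold={t:.2f}  precision={p:.4f}  recall={r:.4f}")
```

Raising the threshold can only remove positive predictions, so recall never increases; precision usually rises but is not guaranteed to, as the toy data shows.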

Check the quartile distribution of each feature's values

diabetes_data.describe()

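-> Quite a few features have a minimum value of 0, which is suspicious: a glucose level or BMI of 0, for example, is not physiologically plausible.
Let's first look at the distribution (histogram) of the glucose values.

Distribution of the 'Glucose' feature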

plt.hist(diabetes_data['Glucose'], bins=10)

-> A small number of 0 values are present.

3.0.2 Counting the 0 values, and their percentage, in features that contain them

# List of features to check for 0 values
zero_features = ['Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI']

# Total number of records
total_count = diabetes_data['Glucose'].count()

# For each feature, count the records whose value is 0 and compute the percentage.
for feature in zero_features:
    zero_count = diabetes_data[diabetes_data[feature] == 0][feature].count()
    print('{0:13s}    zero count: {1:3d}, percentage: {2:.2f} %'.format(feature, zero_count, 100*zero_count/total_count))

Glucose          zero count:   5, percentage: 0.65 %
BloodPressure    zero count:  35, percentage: 4.56 %
SkinThickness    zero count: 227, percentage: 29.56 %
Insulin          zero count: 374, percentage: 48.70 %
BMI              zero count:  11, percentage: 1.43 %

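-> Every one of these features contains 0 values; SkinThickness (29.56 %) and Insulin (48.70 %) are affected the most.

3.0.3 Replacing 0 values with the mean

# Replace the 0 values of each feature in zero_features with that feature's mean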
diabetes_data[zero_features] = diabetes_data[zero_features].replace(0, diabetes_data[zero_features].mean())
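
Note that the means used in the line above are computed over all rows, including the 0 values being replaced, so each fill value is pulled toward 0. The sketch below contrasts that with a mean taken over the non-zero values only. It uses a hypothetical toy column in plain Python; the second variant is an alternative for comparison, not what the original code does.

```python
from statistics import mean

# Hypothetical column in which 0 stands for "not measured".
values = [0, 100, 120, 0, 140]

# Mean over every row: the zeros pull it down (this mirrors the approach above).
mean_with_zeros = mean(values)

# Mean over the measured (non-zero) values only: an alternative fill value.
mean_without_zeros = mean(v for v in values if v != 0)

filled_with_zeros_mean = [mean_with_zeros if v == 0 else v for v in values]
filled_without_zeros_mean = [mean_without_zeros if v == 0 else v for v in values]

print(mean_with_zeros, filled_with_zeros_mean)
print(mean_without_zeros, filled_without_zeros_mean)
```

The difference grows with the share of zeros, so it would matter most for features such as Insulin, where nearly half of the values are 0.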

4. Retraining and evaluating the model after preprocessing

4.0.1 Preprocessing: applying feature scaling

# Separate the features and the target
X = diabetes_data.iloc[:, :-1]
y = diabetes_data.iloc[:, -1]

# Scale all features at once with StandardScaler
scaler = StandardScaler( )
X_scaled = scaler.fit_transform(X)

# Split into train and test sets, stratified on the label
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size = 0.2, random_state = 156, stratify=y)

# Train, predict, and evaluate with logistic regression.
lr_clf = LogisticRegression()
lr_clf.fit(X_train , y_train)
pred = lr_clf.predict(X_test)

# Positive-class probabilities, needed for roc_auc_score
pred_proba = lr_clf.predict_proba(X_test)[:, 1]
get_clf_eval(y_test , pred, pred_proba)


>>>

Confusion matrix
[[90 10]
 [21 33]]
Accuracy: 0.7987, Precision: 0.7674, Recall: 0.6111, F1: 0.6804, AUC:0.8433

-> Every metric, recall included, improved over the values obtained before preprocessing
(accuracy: 0.7727, precision: 0.7209, recall: 0.5741, F1: 0.6392, AUC: 0.7919).
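
One caveat about the preprocessing above: StandardScaler was fit on the whole data set before the split, so the test rows contributed to the scaling statistics. A common alternative, not used in this walkthrough, is to learn the mean and standard deviation from the training split only and reuse them for the test split. The sketch below shows that idea in plain Python on hypothetical single-feature values; the metrics reported above were not produced this way.

```python
from statistics import mean, pstdev

# Hypothetical single-feature values, already split into train and test.
train = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
test = [5.0, 9.0]

# Learn the scaling statistics from the training split only.
# pstdev is the population standard deviation (ddof=0), which StandardScaler also uses.
mu = mean(train)
sigma = pstdev(train)

def standardize(values):
    """Apply z = (x - mu) / sigma using the training-split statistics."""
    return [(v - mu) / sigma for v in values]

train_scaled = standardize(train)
test_scaled = standardize(test)
print(mu, sigma, test_scaled)
```

With scikit-learn the same pattern would be `fit` (or `fit_transform`) on the training features followed by `transform` on the test features.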

4.0.2 Measuring performance while varying the classification decision threshold

from sklearn.preprocessing import Binarizer

def get_eval_by_threshold(y_test, pred_proba_c1, thresholds):
    # Iterate over the values in the thresholds list and evaluate at each one.
    for custom_threshold in thresholds:
        # Binarizer maps values greater than the threshold to 1 and all others to 0.
        binarizer = Binarizer(threshold=custom_threshold).fit(pred_proba_c1)
        custom_predict = binarizer.transform(pred_proba_c1)
        print('Threshold:', custom_threshold)
        # Pass the probabilities as well so that ROC AUC can be computed
        get_clf_eval(y_test, custom_predict, pred_proba_c1)

thresholds = [0.3, 0.33, 0.36, 0.39, 0.42, 0.45, 0.48, 0.50]
pred_proba = lr_clf.predict_proba(X_test)
# Binarizer expects a 2D array, so the positive-class column is reshaped to (n_samples, 1)
get_eval_by_threshold(y_test, pred_proba[:, 1].reshape(-1, 1), thresholds)

>>>

Threshold: 0.3
Confusion matrix
[[67 33]
 [11 43]]
Accuracy: 0.7143, Precision: 0.5658, Recall: 0.7963, F1: 0.6615, AUC:0.8433

Threshold: 0.33
Confusion matrix
[[72 28]
 [12 42]]
Accuracy: 0.7403, Precision: 0.6000, Recall: 0.7778, F1: 0.6774, AUC:0.8433

Threshold: 0.36
Confusion matrix
[[76 24]
 [15 39]]
Accuracy: 0.7468, Precision: 0.6190, Recall: 0.7222, F1: 0.6667, AUC:0.8433

Threshold: 0.39
Confusion matrix
[[78 22]
 [16 38]]
Accuracy: 0.7532, Precision: 0.6333, Recall: 0.7037, F1: 0.6667, AUC:0.8433

Threshold: 0.42
Confusion matrix
[[84 16]
 [18 36]]
Accuracy: 0.7792, Precision: 0.6923, Recall: 0.6667, F1: 0.6792, AUC:0.8433

Threshold: 0.45
Confusion matrix
[[85 15]
 [18 36]]
Accuracy: 0.7857, Precision: 0.7059, Recall: 0.6667, F1: 0.6857, AUC:0.8433

Threshold: 0.48
Confusion matrix
[[88 12]
 [19 35]]
Accuracy: 0.7987, Precision: 0.7447, Recall: 0.6481, F1: 0.6931, AUC:0.8433

Threshold: 0.5
Confusion matrix
[[90 10]
 [21 33]]
Accuracy: 0.7987, Precision: 0.7674, Recall: 0.6111, F1: 0.6804, AUC:0.8433

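-> If the goal is to keep both precision and recall reasonably high, a threshold of 0.48 looks like a sensible choice: precision is 0.7447 and recall is 0.6481, and F1 (0.6931) is the highest among the thresholds tried.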
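
The comparison can also be made programmatically. The sketch below copies the precision, recall, and F1 values from the printed results above into a dictionary (the dictionary and variable names are illustrative) and picks the threshold with the highest F1. Using F1 as the selection criterion is an assumption; a domain that penalizes missed positives more heavily might prefer a lower threshold with higher recall.

```python
# (precision, recall, F1) per threshold, copied from the printed results above.
results = {
    0.30: (0.5658, 0.7963, 0.6615),
    0.33: (0.6000, 0.7778, 0.6774),
    0.36: (0.6190, 0.7222, 0.6667),
    0.39: (0.6333, 0.7037, 0.6667),
    0.42: (0.6923, 0.6667, 0.6792),
    0.45: (0.7059, 0.6667, 0.6857),
    0.48: (0.7447, 0.6481, 0.6931),
    0.50: (0.7674, 0.6111, 0.6804),
}

# Threshold with the highest F1 among those tried.
best_threshold = max(results, key=lambda t: results[t][2])

# Threshold with the highest recall, for comparison.
highest_recall_threshold = max(results, key=lambda t: results[t][1])

print(best_threshold, results[best_threshold])
print(highest_recall_threshold, results[highest_recall_threshold])
```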

# Create a Binarizer with the threshold set to 0.48
binarizer = Binarizer(threshold=0.48)

# Binarize the positive-class (1) column of the predict_proba() array obtained from lr_clf above.
pred_th_048 = binarizer.fit_transform(pred_proba[:, 1].reshape(-1,1)) 

# Pass the probabilities as well so that ROC AUC can be computed
get_clf_eval(y_test , pred_th_048, pred_proba[:, 1])

Confusion matrix
[[88 12]
 [19 35]]
Accuracy: 0.7987, Precision: 0.7447, Recall: 0.6481, F1: 0.6931, AUC:0.8433
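
AUC stays at 0.8433 for every threshold above because get_clf_eval passes the predicted probabilities to roc_auc_score, and ROC AUC depends only on how those probabilities rank the samples, not on where the decision threshold is placed. A minimal sketch of that ranking view follows: AUC as the fraction of (positive, negative) pairs in which the positive receives the higher score. The helper `pairwise_auc` and the toy data are hypothetical and are not taken from the diabetes data.

```python
def pairwise_auc(y_true, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

# Hypothetical toy labels and positive-class probabilities.
y_true = [0, 0, 1, 0, 1, 1, 0, 1]
probs = [0.10, 0.35, 0.40, 0.45, 0.55, 0.62, 0.70, 0.90]

auc = pairwise_auc(y_true, probs)
print(f"AUC = {auc:.4f}")
```

The pairwise loop is quadratic in the number of samples, so it is meant only to illustrate the definition, not to replace roc_auc_score.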

