1 Iris Species Classification (Multiclass)

This time we will use Logistic Regression to predict a target that has more than two classes.

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt


np.random.seed(2021)

1.1 1. Data

1.1.1 1.1 Data Load

The data can be loaded with the load_iris function from sklearn.datasets.

from sklearn.datasets import load_iris

iris = load_iris()
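Before slicing into the returned object, it can help to look at what load_iris actually gives back. The following is a minimal sketch that only assumes scikit-learn's bundled copy of the dataset; the choice of which keys to print is ours.

```python
from sklearn.datasets import load_iris

iris = load_iris()

# The returned object behaves like a dict; these are the keys used in this post.
for key in ("data", "target", "feature_names", "target_names"):
    print(key, type(iris[key]).__name__)

print(iris["data"].shape)    # 150 samples, 4 features
print(iris["target"].shape)  # one integer label per sample
```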

The features are the length and width of the sepals and petals.

sepal length (cm)
sepal width (cm)
petal length (cm)
petal width (cm)

iris["feature_names"]

>>>
['sepal length (cm)',
 'sepal width (cm)',
 'petal length (cm)',
 'petal width (cm)']

The target is the species of the iris flower.

iris["target_names"]

>>> array(['setosa', 'versicolor', 'virginica'], dtype='<U10')

data, target = iris["data"], iris["target"]

target

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2])
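The integers in target are indices into target_names, so the two arrays can be combined to get a readable label for every sample. A small sketch (the variable name species is our own):

```python
from sklearn.datasets import load_iris

iris = load_iris()
target = iris["target"]

# Fancy indexing: each integer label picks the matching species name.
species = iris["target_names"][target]

# One sample from each block of 50 in the array printed above.
print(species[0], species[50], species[100])
```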

1.1.2 1.2 Data EDA

pd.DataFrame(data, columns=iris["feature_names"]).describe()

The number of samples per class is as follows.

pd.Series(target).value_counts()

>>>
2    50
1    50
0    50
dtype: int64
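Since describe() summarizes all 150 rows together, a per-class summary can show how well each feature separates the species. The sketch below is one possible way to do that; the species column and the class_means name are introduced here for illustration only.

```python
import pandas as pd
from sklearn.datasets import load_iris

iris = load_iris()
df = pd.DataFrame(iris["data"], columns=iris["feature_names"])
df["species"] = iris["target_names"][iris["target"]]

# Mean of every feature within each species.
class_means = df.groupby("species").mean()
print(class_means)
```

Features whose class means lie far apart (the petal measurements in particular) tend to be the easiest ones for a classifier to use.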

1.1.3 1.3 Data Split

We will split the data into train and test sets.

from sklearn.model_selection import train_test_split


train_data, test_data, train_target, test_target = train_test_split(
    data, target, train_size=0.7, random_state=2021
)

print("train data count:", len(train_data))
print("test data count:", len(test_data))

>>>
train data count: 105
test data count: 45

The class counts in the train data are as follows.

pd.Series(train_target).value_counts()

>>>
2    38
1    34
0    33
dtype: int64

The class counts in the test data are as follows.

pd.Series(test_target).value_counts()
>>>
0    17
1    16
2    12
dtype: int64

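However, a plain random split like this does not preserve the target distribution of the original data.
The stratify option is used to address this.
When the labels are passed to this option, the data is split so that each part reflects the class distribution of the original data.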

train_data, test_data, train_target, test_target = train_test_split(
    data, target, train_size=0.7, random_state=2021, stratify=target)
    
pd.Series(train_target).value_counts()
>>>
0    35
2    35
1    35
dtype: int64

pd.Series(test_target).value_counts()
>>>
0    15
2    15
1    15
dtype: int64
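The two splits can also be compared side by side. This is a minimal sketch that reuses the same train_size and random_state as above; the helper name class_counts_in_test_split is hypothetical.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

data, target = load_iris(return_X_y=True)

def class_counts_in_test_split(stratify):
    # Split exactly as in this post and count the labels in the test part.
    _, _, _, test_target = train_test_split(
        data, target, train_size=0.7, random_state=2021, stratify=stratify
    )
    return np.bincount(test_target, minlength=3)

print("plain     :", class_counts_in_test_split(None))
print("stratified:", class_counts_in_test_split(target))
```

With 50 samples per class and a 70/30 split, the stratified version places 15 samples of each class in the test set.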

1.2 2. Multiclass

from sklearn.linear_model import LogisticRegression

For visualization, we will use only sepal length and sepal width.

X = train_data[:, :2]

X[0]
>>> array([5.1, 3.3])

Plotting the data gives the following figure.

plt.figure(1, figsize=(10, 10))
plt.scatter(X[:, 0], X[:, 1], c=train_target, edgecolors='k', cmap=plt.cm.Paired)
plt.xlabel('Sepal length')
plt.ylabel('Sepal width')
plt.xlim(X[:,0].min()-0.5, X[:,0].max()+0.5)
plt.ylim(X[:,1].min()-0.5, X[:,1].max()+0.5)

1.2.1 2.1 One vs Rest

First, we will train a Logistic Regression model with the One vs Rest method.

ovr_logit = LogisticRegression(multi_class="ovr")
ovr_logit.fit(X, train_target)
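To make the One vs Rest idea concrete, the sketch below builds it by hand: one binary Logistic Regression per class, then the class with the highest score wins. It is an illustration under our own assumptions, not the library's internal code: it fits on the full dataset rather than the train split, raises max_iter to 1000, and avoids the multi_class argument, which newer scikit-learn releases deprecate.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

data, target = load_iris(return_X_y=True)
X = data[:, :2]  # sepal length and sepal width, as in this section

classes = np.unique(target)
scores = np.empty((len(X), len(classes)))
for j, k in enumerate(classes):
    # Binary problem: class k (label 1) against every other class (label 0).
    is_k = (target == k).astype(int)
    binary_model = LogisticRegression(max_iter=1000).fit(X, is_k)
    scores[:, j] = binary_model.decision_function(X)

# The class whose binary model gives the largest score is the prediction.
manual_pred = classes[scores.argmax(axis=1)]
print("accuracy on the fitted data:", (manual_pred == target).mean())
```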

x_min, x_max = X[:,0].min() - 0.5, X[:,0].max() + 0.5
y_min, y_max = X[:,1].min() - 0.5, X[:,1].max() + 0.5

plt.figure(1, figsize=(10, 10))

plt.scatter(X[:, 0], X[:, 1], c=ovr_logit.predict(X), edgecolors='k', cmap=plt.cm.Paired)
plt.xlabel('Sepal length')
plt.ylabel('Sepal width')
plt.xlim(x_min, x_max)
plt.ylim(y_min, y_max)

coef = ovr_logit.coef_
intercept = ovr_logit.intercept_

def plot_hyperplane(c, color):
    def line(x0):
        return (-(x0 * coef[c, 0]) - intercept[c]) / coef[c, 1]
    plt.plot([x_min, x_max], [line(x_min), line(x_max)],
             ls="--", color=color)

for i, color in zip(ovr_logit.classes_, "bry"):
    plot_hyperplane(i, color)
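Each dashed line is the set of points where one binary model is undecided, that is, where w0 * x0 + w1 * x1 + b = 0. Solving for x1 gives x1 = -(w0 * x0 + b) / w1, which is what line() computes. The following sketch checks this numerically for a single setosa-vs-rest model; fitting on the full dataset and the variable names are our own choices.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

data, target = load_iris(return_X_y=True)
X = data[:, :2]

# One binary model (setosa vs. the rest) stands in for a single OvR component.
model = LogisticRegression(max_iter=1000).fit(X, (target == 0).astype(int))
w0, w1 = model.coef_[0]
b = model.intercept_[0]

x0 = np.linspace(X[:, 0].min(), X[:, 0].max(), 5)
x1 = -(w0 * x0 + b) / w1  # same formula as line() above
on_line = np.column_stack([x0, x1])

# Every point on the line has a decision score of (numerically) zero.
print(np.abs(model.decision_function(on_line)).max())
```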

1.2.2 2.2 Multinomial

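This is multiclass Logistic Regression fitted under the assumption that the target follows a multinomial distribution.
The default value of multi_class in LogisticRegression is "auto", which resolves to "multinomial" when the target has three or more classes and the default lbfgs solver is used.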
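In the multinomial formulation, the model produces one linear score per class and turns the scores into probabilities with the softmax function, p_k = exp(z_k) / sum_j exp(z_j). The sketch below checks that relationship against predict_proba. It assumes a scikit-learn version in which a default LogisticRegression on three classes is fitted as a multinomial model; the softmax helper is written here for illustration.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

data, target = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000).fit(data, target)

def softmax(z):
    # Subtracting the row-wise max avoids overflow and leaves the result unchanged.
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

scores = model.decision_function(data)  # shape (150, 3): one score per class
manual_proba = softmax(scores)

print(np.allclose(manual_proba, model.predict_proba(data)))
```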

multi_logit = LogisticRegression(multi_class="multinomial")
multi_logit.fit(X, train_target)

x_min, x_max = X[:,0].min() - 0.5, X[:,0].max() + 0.5
y_min, y_max = X[:,1].min() - 0.5, X[:,1].max() + 0.5

plt.figure(1, figsize=(10, 10))

plt.scatter(X[:, 0], X[:, 1], c=multi_logit.predict(X), edgecolors='k', cmap=plt.cm.Paired)
plt.xlabel('Sepal length')
plt.ylabel('Sepal width')
plt.xlim(x_min, x_max)
plt.ylim(y_min, y_max)

coef = multi_logit.coef_
intercept = multi_logit.intercept_

def plot_hyperplane(c, color):
    def line(x0):
        return (-(x0 * coef[c, 0]) - intercept[c]) / coef[c, 1]
    plt.plot([x_min, x_max], [line(x_min), line(x_max)],
             ls="--", color=color)

for i, color in zip(multi_logit.classes_, "bry"):
    plot_hyperplane(i, color)

1.3 3. Logistic Regression (Multinomial)

multi_logit = LogisticRegression()

1.3.1 3.1 Training

multi_logit.fit(train_data, train_target)

1.3.2 3.2 Prediction

train_pred_proba = multi_logit.predict_proba(train_data)

sample_pred = train_pred_proba[0]
sample_pred

>>> array([9.49694515e-01, 5.03040934e-02, 1.39122335e-06])

print(f"Probability of not belonging to class 0: {1 - sample_pred[0]:.4f}")
print(f"Probability of belonging to class 1 or 2: {sample_pred[1:].sum():.4f}")

>>> Probability of not belonging to class 0: 0.0503
    Probability of belonging to class 1 or 2: 0.0503
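The two numbers agree because each row of predict_proba is a probability distribution over the three classes. The same idea links predict_proba to predict: the predicted label is the class with the largest probability. A short sketch, fitted on the full dataset with max_iter raised to 1000 (both are our assumptions, not the setup above):

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

data, target = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000).fit(data, target)

proba = model.predict_proba(data)

# Each row sums to 1, so P(not class 0) equals P(class 1) + P(class 2).
row_sums = proba.sum(axis=1)

# predict() corresponds to taking the most probable class in each row.
pred_from_proba = model.classes_[proba.argmax(axis=1)]

print(np.allclose(row_sums, 1.0))
print(np.array_equal(pred_from_proba, model.predict(data)))
```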

train_pred = multi_logit.predict(train_data)
test_pred = multi_logit.predict(test_data)

1.3.3 3.3 Evaluation

from sklearn.metrics import accuracy_score

train_acc = accuracy_score(train_target, train_pred)
test_acc = accuracy_score(test_target, test_pred)

print(f"Train accuracy is : {train_acc:.2f}")
print(f"Test accuracy is : {test_acc:.2f}")
>>>
Train accuracy is : 0.98
Test accuracy is : 0.91
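Accuracy is a single number, so it does not show which classes get confused with each other. A confusion matrix gives that breakdown, and accuracy is simply its diagonal divided by the total. The sketch below repeats the stratified split from this post; max_iter=1000 is our own setting, so the exact counts may differ slightly from the results above.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split

data, target = load_iris(return_X_y=True)
train_data, test_data, train_target, test_target = train_test_split(
    data, target, train_size=0.7, random_state=2021, stratify=target
)

model = LogisticRegression(max_iter=1000).fit(train_data, train_target)
test_pred = model.predict(test_data)

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(test_target, test_pred)
print(cm)

# Accuracy is the number of correct predictions (the diagonal) over the total.
print(np.trace(cm) / cm.sum(), accuracy_score(test_target, test_pred))
```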

1.4 4. Logistic Regression (OVR)

ovr_logit = LogisticRegression(multi_class="ovr")

1.4.1 4.1 Training

ovr_logit.fit(train_data, train_target)
>>> LogisticRegression(multi_class='ovr')

1.4.2 4.2 Prediction

ovr_train_pred = ovr_logit.predict(train_data)
ovr_test_pred = ovr_logit.predict(test_data)

1.4.3 4.3 Evaluation

from sklearn.metrics import accuracy_score

ovr_train_acc = accuracy_score(train_target, ovr_train_pred)
ovr_test_acc = accuracy_score(test_target, ovr_test_pred)

print(f"One vs Rest Train accuracy is : {ovr_train_acc:.2f}")
print(f"One vs Rest Test accuracy is : {ovr_test_acc:.2f}")

>>>
One vs Rest Train accuracy is : 0.95
One vs Rest Test accuracy is : 0.93
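When comparing the two test accuracies, it helps to convert them back into counts, because the test set has only 45 samples. The arithmetic below uses the rounded accuracies printed in this post (0.91 for the multinomial model, 0.93 for One vs Rest); the dictionary is just a convenient container for them.

```python
n_test = 45  # size of the test split in this post

# Test accuracies as printed above, rounded to two decimals.
reported = {"multinomial": 0.91, "one-vs-rest": 0.93}

for name, acc in reported.items():
    n_correct = round(acc * n_test)
    print(f"{name}: {n_correct} correct, {n_test - n_correct} wrong")
```

The gap between the two models on this split comes down to a single test sample (41 versus 42 correct), so it should not be read as strong evidence that one method is better.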

License: Attribution, NonCommercial, NoDerivatives

