
CH05_02. Spam Message Classification (Python)

1  Classifying Spam Messages with Naive Bayes

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt


np.random.seed(2022)

 

1.1  1. Data

1.1.1  1.1 Data Load

The sms_spam.csv dataset labels each text message as spam or not spam.

 

spam = pd.read_csv("sms_spam.csv")

text = spam["text"]
label = spam["type"]

 

1.1.2  1.2 Data EDA

text[0]
>>>

'Go until jurong point, crazy.. Available only in bugis n great world la e buffet...
Cine there got amore wat...'

 

label[0]
>>> 'ham'


label.value_counts()
>>> 
ham     4827
spam     747
Name: type, dtype: int64

 

1.1.3  1.3 Data Cleaning

Convert the string labels to numbers.
We map ham to 0 and spam to 1.

label = label.map({"ham": 0, "spam": 1})

label.value_counts()
>>>
0    4827
1     747
Name: type, dtype: int64
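
Series.map returns NaN for any value that is missing from the dictionary, so a stray label (different casing, a trailing space) would silently become NaN. A small check, sketched here on a made-up Series rather than on the real file, can catch that:

```python
import pandas as pd

# Hypothetical raw labels; "Ham" is deliberately miscapitalized.
raw = pd.Series(["ham", "spam", "Ham", "ham"])

mapped = raw.map({"ham": 0, "spam": 1})

# Count labels that did not match any key in the mapping.
n_unmapped = int(mapped.isna().sum())
print(n_unmapped)
```

If the count is not zero, the labels need normalizing (for example with str.strip and str.lower) before mapping.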

 

Next, clean text so that only plain characters remain.
Using a regex, we remove every character except English letters, digits, and spaces.

 

re_pattern = r"[^a-zA-Z0-9 ]"  # raw string; avoids the invalid "\ " escape


text[0]
>>>
'Go until jurong point, crazy.. Available only in bugis n great world la e buffet...
Cine there got amore wat...'


text.iloc[:1].str.replace(re_pattern, "", regex=True)[0]
>>> 
'Go until jurong point crazy Available only in bugis n great world 
la e buffet Cine there got amore wat'



text = text.str.replace(re_pattern, "", regex=True)
text
>>>
0       Go until jurong point crazy Available only in ...
1                                 Ok lar Joking wif u oni
2       Free entry in 2 a wkly comp to win FA Cup fina...
3             U dun say so early hor U c already then say
4       Nah I dont think he goes to usf he lives aroun...
                              ...                        
5569    This is the 2nd time we have tried 2 contact u...
5570                   Will  b going to esplanade fr home
5571    Pity  was in mood for that Soany other suggest...
5572    The guy did some bitching but I acted like id ...
5573                            Rofl Its true to its name
Name: text, Length: 5574, dtype: object
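
To see what this pattern does to individual messages, here is a minimal sketch using only the standard re module. The sample strings are made up for illustration and are not taken from sms_spam.csv:

```python
import re

# Same character class as re_pattern: keep ASCII letters, digits, and spaces.
pattern = re.compile(r"[^a-zA-Z0-9 ]")

def clean(message: str) -> str:
    """Remove every character that is not a letter, a digit, or a space."""
    return pattern.sub("", message)

# Hypothetical messages.
samples = ["WINNER!! Call 0800-123 now", "I don't know...", "50% off :)"]
cleaned = [clean(s) for s in samples]
print(cleaned)
# ['WINNER Call 0800123 now', 'I dont know', '50 off ']
```

Because the characters are deleted rather than replaced with a space, tokens on either side of a removed character are glued together ("don't" becomes "dont", "0800-123" becomes "0800123").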

 

After that, convert all uppercase letters to lowercase.

text[0]
>>> 'Go until jurong point crazy Available only in bugis n great world la e buffet Cine there got amore wat'

text.iloc[:1].str.lower()[0]
>>> 'go until jurong point crazy available only in bugis n great world la e buffet cine there got amore wat'

text = text.str.lower()
text[0]
>>> 'go until jurong point crazy available only in bugis n great world la e buffet cine there got amore wat'

 

1.1.4  1.4 Data Split

 

from sklearn.model_selection import train_test_split
train_text, test_text, train_label, test_label = train_test_split(
    text, label, train_size=0.7, random_state=2022)


print(f"train_data size: {len(train_label)}, {len(train_label)/len(text):.2f}")
print(f"test_data size: {len(test_label)}, {len(test_label)/len(text):.2f}")

>>>
train_data size: 3901, 0.70
test_data size: 1673, 0.30
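
The classes are imbalanced (4827 ham vs 747 spam), and the split above does not pass stratify. As an optional variation that is not part of the original notebook, train_test_split accepts stratify= to keep the spam ratio close to equal in both splits. A sketch on synthetic labels with a similar imbalance:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic labels (assumption): about 13% spam, similar to the real data.
labels = np.array([0] * 870 + [1] * 130)
features = np.arange(len(labels)).reshape(-1, 1)  # placeholder features

X_tr, X_te, y_tr, y_te = train_test_split(
    features, labels, train_size=0.7, random_state=2022, stratify=labels)

print(f"overall spam ratio: {labels.mean():.3f}")
print(f"train spam ratio:   {y_tr.mean():.3f}")
print(f"test spam ratio:    {y_te.mean():.3f}")
```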

 

 

1.2  2. Count Vectorize

To train Naive Bayes, each sentence now has to be converted into counts of how many times each word appears in it.
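
As a rough picture of what "counting words per sentence" means, here is a hand-rolled sketch using only the standard library. It splits on whitespace instead of using nltk, and the two sentences are shortened, made-up examples:

```python
from collections import Counter

# Hypothetical, already cleaned and lowercased sentences.
docs = ["free entry to win free cash",
        "im good i have been thinking about you"]

tokenized = [d.split() for d in docs]

# Vocabulary: every distinct word, sorted, mapped to a column index.
vocab = sorted({w for tokens in tokenized for w in tokens})
index = {w: i for i, w in enumerate(vocab)}

def to_count_vector(tokens):
    """Return one count per vocabulary word, in vocabulary order."""
    counts = Counter(tokens)
    return [counts.get(w, 0) for w in vocab]

vectors = [to_count_vector(t) for t in tokenized]
print(vocab)
print(vectors[0])
```

Each sentence becomes a row whose length equals the vocabulary size; "free" appears twice in the first sentence, so its column holds 2.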

 

1.2.1  2.1 word tokenize

To split a sentence into words, we use word_tokenize from the nltk package.

import nltk
from nltk import word_tokenize

nltk.download('punkt')  # recent NLTK releases may ask for nltk.download('punkt_tab') instead

 

train_text.iloc[0]
>>> 'free entry to the gr8prizes wkly comp 4 a chance to win the latest nokia 8800 psp or 250 cash every wktxt great to 80878 httpwwwgr8prizescom 08715705022'

 

word_tokenize(train_text.iloc[0])
>>>
['free',
 'entry',
 'to',
 'the',
 'gr8prizes',
 'wkly',
 'comp',
 '4',
 'a',
 'chance',
 'to',
 'win',
 'the',
 'latest',
 'nokia',
 '8800',
 'psp',
 'or',
 '250',
 'cash',
 'every',
 'wktxt',
 'great',
 'to',
 '80878',
 'httpwwwgr8prizescom',
 '08715705022']

 

1.2.2  2.2 count vectorize

Next, we use CountVectorizer from sklearn.feature_extraction.text to turn the words into count vectors.

 

from sklearn.feature_extraction.text import CountVectorizer

As an example, we first fit CountVectorizer on just two sentences.

 

train_text.iloc[:2].values
>>>
array(['free entry to the gr8prizes wkly comp 4 a chance to win the latest nokia 8800 psp or 250 cash every wktxt great to 80878 httpwwwgr8prizescom 08715705022',
       'im good i have been thinking about you'], dtype=object)

 

cnt_vectorizer = CountVectorizer(tokenizer=word_tokenize)
cnt_vectorizer.fit(train_text.iloc[:2])
>>> CountVectorizer(tokenizer=<function word_tokenize at 0x00000248C8CC6700>)

 

The words found in the sentences are as follows.

 

cnt_vectorizer.vocabulary_

>>>
    {'free': 13,
     'entry': 11,
     'to': 27,
     'the': 25,
     'gr8prizes': 15,
     'wkly': 29,
     'comp': 10,
     '4': 2,
     'a': 5,
     'chance': 9,
     'win': 28,
     'latest': 21,
     'nokia': 22,
     '8800': 4,
     'psp': 24,
     'or': 23,
     '250': 1,
     'cash': 8,
     'every': 12,
     'wktxt': 30,
     'great': 16,
     '80878': 3,
     'httpwwwgr8prizescom': 18,
     '08715705022': 0,
     'im': 20,
     'good': 14,
     'i': 19,
     'have': 17,
     'been': 7,
     'thinking': 26,
     'about': 6,
     'you': 31}

 

vocab = sorted(cnt_vectorizer.vocabulary_.items(), key=lambda x: x[1])
vocab = list(map(lambda x: x[0], vocab))
vocab

>>>	
    ['08715705022',
     '250',
     '4',
     '80878',
     '8800',
     'a',
     'about',
     'been',
     'cash',
     'chance',
     'comp',
     'entry',
     'every',
     'free',
     'good',
     'gr8prizes',
     'great',
     'have',
     'httpwwwgr8prizescom',
     'i',
     'im',
     'latest',
     'nokia',
     'or',
     'psp',
     'the',
     'thinking',
     'to',
     'win',
     'wkly',
     'wktxt',
     'you']

 

sample_cnt_vector = cnt_vectorizer.transform(train_text.iloc[:2]).toarray()
sample_cnt_vector
>>>
array([[1, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 1, 0, 1, 1, 0, 1, 0, 0, 1,
        1, 1, 1, 2, 0, 3, 1, 1, 1, 0],
       [0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 1, 0,
        0, 0, 0, 0, 1, 0, 0, 0, 0, 1]], dtype=int64)

 

train_text.iloc[:2].values
>>>
array(['free entry to the gr8prizes wkly comp 4 a chance to win the latest nokia 8800 psp or 250 cash every wktxt great to 80878 httpwwwgr8prizescom 08715705022',
       'im good i have been thinking about you'], dtype=object)

 

pd.DataFrame(sample_cnt_vector, columns=vocab)

 

1.2.2.1  2.2.1 Training

Now we fit on the entire training set.

cnt_vectorizer = CountVectorizer(tokenizer=word_tokenize)
cnt_vectorizer.fit(train_text)
>>> CountVectorizer(tokenizer=<function word_tokenize at 0x0000011B376821F0>)

 

The vocabulary contains 7846 distinct words in total.

len(cnt_vectorizer.vocabulary_)
>>>
7846

 

1.2.2.2  2.2.2 Prediction (transform)

train_matrix = cnt_vectorizer.transform(train_text)
test_matrix = cnt_vectorizer.transform(test_text)

 

What happens if a word that is not in the vocabulary comes in?
CountVectorizer ignores any word that does not exist in the vocabulary it learned.

cnt_vectorizer.transform(["notavailblewordforcnt"]).toarray().sum()
>>> 0

 

1.3  3. Naive Bayes

For classification, we can use BernoulliNB from sklearn.naive_bayes as the Naive Bayes model.

from sklearn.naive_bayes import BernoulliNB

naive_bayes = BernoulliNB()
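
One detail worth knowing: BernoulliNB models each word as present or absent, so with its default binarize=0.0 the counts produced by CountVectorizer are reduced to 0/1 before fitting. The sketch below uses a made-up count matrix (not sms_spam.csv) to illustrate that fitting on counts and on their 0/1 version gives the same model:

```python
import numpy as np
from sklearn.naive_bayes import BernoulliNB

# Hypothetical count matrix: 4 messages x 3 vocabulary words.
X_counts = np.array([[3, 0, 1],
                     [2, 1, 0],
                     [0, 2, 0],
                     [0, 1, 4]])
y = np.array([1, 1, 0, 0])  # 1 = spam, 0 = ham

# Presence/absence version of the same matrix.
X_binary = (X_counts > 0).astype(int)

nb_counts = BernoulliNB().fit(X_counts, y)
nb_binary = BernoulliNB().fit(X_binary, y)

same_proba = np.allclose(nb_counts.predict_proba(X_counts),
                         nb_binary.predict_proba(X_binary))
print(same_proba)
```

If how often a word occurs should matter, MultinomialNB from the same module works on the raw counts instead.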

 

1.3.1  3.1 Training

naive_bayes.fit(train_matrix, train_label)

 

1.3.2  3.2 Prediction

train_pred = naive_bayes.predict(train_matrix)
test_pred = naive_bayes.predict(test_matrix)

 

1.3.3  3.3 Evaluation

from sklearn.metrics import accuracy_score

train_acc = accuracy_score(train_label, train_pred)
test_acc = accuracy_score(test_label, test_pred)

print(f"Train Accuracy is {train_acc:.4f}")
print(f"Test Accuracy is {test_acc:.4f}")
>>>
Train Accuracy is 0.9839
Test Accuracy is 0.9701
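
Because ham dominates the data, accuracy alone can look better than it is. A sketch of two sanity checks in plain Python: the accuracy of always predicting ham, computed from the class counts in section 1.2 (over the whole dataset, not the test split), and precision/recall for the spam class on made-up predictions:

```python
# Class counts reported by label.value_counts() earlier in this post.
ham, spam = 4827, 747
baseline_acc = ham / (ham + spam)  # accuracy of always predicting ham
print(f"{baseline_acc:.4f}")  # prints 0.8660

def precision_recall(y_true, y_pred, positive=1):
    """Precision and recall for the positive (spam) class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Hypothetical labels and predictions, for illustration only.
y_true = [0, 0, 0, 0, 1, 1, 1, 0]
y_pred = [0, 0, 0, 1, 1, 1, 0, 0]
precision, recall = precision_recall(y_true, y_pred)
print(precision, recall)
```

On the real data, sklearn.metrics.precision_score and recall_score applied to test_label and test_pred would report the same quantities.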

 
