1 Classifying Spam Messages with Naive Bayes

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt


np.random.seed(2022)

1.1 1. Data

1.1.1 1.1 Data Load

The sms_spam.csv dataset contains SMS messages, each labeled as spam or not spam (ham).

spam = pd.read_csv("sms_spam.csv")

text = spam["text"]
label = spam["type"]

1.1.2 1.2 Data EDA

text[0]
>>>

'Go until jurong point, crazy.. Available only in bugis n great world la e buffet...
Cine there got amore wat...'

label[0]
>>> 'ham'


label.value_counts()
>>> 
ham     4827
spam     747
Name: type, dtype: int64

1.1.3 1.3 Data Cleaning

Convert the string labels to numbers.
ham becomes 0 and spam becomes 1.

label = label.map({"ham": 0, "spam": 1})

label.value_counts()
>>>
0    4827
1     747
Name: type, dtype: int64
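One thing worth checking after a dictionary-based map: pandas' Series.map returns NaN for any value that is not a key in the dictionary. The sketch below uses a small made-up Series (not the real sms_spam.csv) to show how an unexpected label such as "Ham" would surface.

```python
import pandas as pd

# Hypothetical labels for illustration; "Ham" is deliberately not in the mapping.
raw = pd.Series(["ham", "spam", "Ham", "ham"], name="type")

mapped = raw.map({"ham": 0, "spam": 1})

# Values missing from the dictionary become NaN, so they are easy to detect.
unmapped = raw[mapped.isna()].unique().tolist()
print(unmapped)
```

In the data above, value_counts() shows only 0 and 1 with the same totals as before, so no label was lost in the conversion.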

Clean the text so that only plain characters remain.
A regex removes every character that is not an English letter, a digit, or a space.

# Raw string; a plain space inside the character class needs no backslash.
re_pattern = r"[^a-zA-Z0-9 ]"


text[0]
>>>
'Go until jurong point, crazy.. Available only in bugis n great world la e buffet...
Cine there got amore wat...'


text.iloc[:1].str.replace(re_pattern, "", regex=True)[0]
>>> 
'Go until jurong point crazy Available only in bugis n great world 
la e buffet Cine there got amore wat'
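To see exactly which characters this kind of pattern drops, here is a minimal sketch using only the standard re module. The sample message is made up for illustration; note that apostrophes and currency symbols are removed along with punctuation, so "Don't" becomes "Dont".

```python
import re

# Negated character class: anything that is NOT a letter, digit, or space.
pattern = re.compile(r"[^a-zA-Z0-9 ]")

sample = "Don't miss it: WIN £250 now!!"  # hypothetical message
cleaned = pattern.sub("", sample)
print(cleaned)
```

Because removed characters are deleted rather than replaced with a space, words joined only by punctuation (for example a URL) collapse into a single token.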



text = text.str.replace(re_pattern, "", regex=True)
text
>>>
0       Go until jurong point crazy Available only in ...
1                                 Ok lar Joking wif u oni
2       Free entry in 2 a wkly comp to win FA Cup fina...
3             U dun say so early hor U c already then say
4       Nah I dont think he goes to usf he lives aroun...
                              ...                        
5569    This is the 2nd time we have tried 2 contact u...
5570                   Will  b going to esplanade fr home
5571    Pity  was in mood for that Soany other suggest...
5572    The guy did some bitching but I acted like id ...
5573                            Rofl Its true to its name
Name: text, Length: 5574, dtype: object

Then convert all uppercase letters to lowercase.

text[0]
>>> 'Go until jurong point crazy Available only in bugis n great world la e buffet Cine there got amore wat'

text.iloc[:1].str.lower()[0]
>>> 'go until jurong point crazy available only in bugis n great world la e buffet cine there got amore wat'

text = text.str.lower()
text[0]
>>> 'go until jurong point crazy available only in bugis n great world la e buffet cine there got amore wat'

1.1.4 1.4 Data Split

from sklearn.model_selection import train_test_split
train_text, test_text, train_label, test_label = train_test_split(
    text, label, train_size=0.7, random_state=2022)


print(f"train_data size: {len(train_label)}, {len(train_label)/len(text):.2f}")
print(f"test_data size: {len(test_label)}, {len(test_label)/len(text):.2f}")

>>>
train_data size: 3901, 0.70
test_data size: 1673, 0.30
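Since spam makes up a small share of the messages, it may be worth keeping the class ratio identical in both splits. The sketch below is an optional variation, not what the tutorial does: it passes stratify to train_test_split on a toy label array (100 samples, 20% positive) that stands in for the real data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy stand-in for the real labels: 20 positives, 80 negatives.
y = np.array([1] * 20 + [0] * 80)
X = np.arange(len(y)).reshape(-1, 1)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.7, random_state=2022, stratify=y
)

# With stratify=y both splits keep roughly the original positive rate.
train_rate = y_train.mean()
test_rate = y_test.mean()
print(f"train positive rate: {train_rate:.2f}, test positive rate: {test_rate:.2f}")
```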

1.2 2. Count Vectorize

To train Naive Bayes, each message must now be converted into counts of how many times each word appears in it.

1.2.1 2.1 word tokenize

To split sentences into words, use word_tokenize from the nltk package.

import nltk
from nltk import word_tokenize

nltk.download('punkt')

train_text.iloc[0]
>>> 'free entry to the gr8prizes wkly comp 4 a chance to win the latest nokia 8800 psp or 250 cash every wktxt great to 80878 httpwwwgr8prizescom 08715705022'

word_tokenize(train_text.iloc[0])
>>>
['free',
 'entry',
 'to',
 'the',
 'gr8prizes',
 'wkly',
 'comp',
 '4',
 'a',
 'chance',
 'to',
 'win',
 'the',
 'latest',
 'nokia',
 '8800',
 'psp',
 'or',
 '250',
 'cash',
 'every',
 'wktxt',
 'great',
 'to',
 '80878',
 'httpwwwgr8prizescom',
 '08715705022']

1.2.2 2.2 count vectorize

Next, use CountVectorizer from sklearn.feature_extraction.text to turn the words into count vectors.

from sklearn.feature_extraction.text import CountVectorizer

As an example, first fit CountVectorizer on just two sentences.

train_text.iloc[:2].values
>>>
array(['free entry to the gr8prizes wkly comp 4 a chance to win the latest nokia 8800 psp or 250 cash every wktxt great to 80878 httpwwwgr8prizescom 08715705022',
       'im good i have been thinking about you'], dtype=object)

cnt_vectorizer = CountVectorizer(tokenizer=word_tokenize)
cnt_vectorizer.fit(train_text.iloc[:2])
>>> CountVectorizer(tokenizer=<function word_tokenize at 0x00000248C8CC6700>)

The words found in these sentences are as follows.

cnt_vectorizer.vocabulary_

>>>
    {'free': 13,
     'entry': 11,
     'to': 27,
     'the': 25,
     'gr8prizes': 15,
     'wkly': 29,
     'comp': 10,
     '4': 2,
     'a': 5,
     'chance': 9,
     'win': 28,
     'latest': 21,
     'nokia': 22,
     '8800': 4,
     'psp': 24,
     'or': 23,
     '250': 1,
     'cash': 8,
     'every': 12,
     'wktxt': 30,
     'great': 16,
     '80878': 3,
     'httpwwwgr8prizescom': 18,
     '08715705022': 0,
     'im': 20,
     'good': 14,
     'i': 19,
     'have': 17,
     'been': 7,
     'thinking': 26,
     'about': 6,
     'you': 31}

vocab = sorted(cnt_vectorizer.vocabulary_.items(), key=lambda x: x[1])
vocab = list(map(lambda x: x[0], vocab))
vocab

>>>	
    ['08715705022',
     '250',
     '4',
     '80878',
     '8800',
     'a',
     'about',
     'been',
     'cash',
     'chance',
     'comp',
     'entry',
     'every',
     'free',
     'good',
     'gr8prizes',
     'great',
     'have',
     'httpwwwgr8prizescom',
     'i',
     'im',
     'latest',
     'nokia',
     'or',
     'psp',
     'the',
     'thinking',
     'to',
     'win',
     'wkly',
     'wktxt',
     'you']

sample_cnt_vector = cnt_vectorizer.transform(train_text.iloc[:2]).toarray()
sample_cnt_vector
>>>
array([[1, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 1, 0, 1, 1, 0, 1, 0, 0, 1,
        1, 1, 1, 2, 0, 3, 1, 1, 1, 0],
       [0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 1, 0,
        0, 0, 0, 0, 1, 0, 0, 0, 0, 1]], dtype=int64)

train_text.iloc[:2].values
>>>
array(['free entry to the gr8prizes wkly comp 4 a chance to win the latest nokia 8800 psp or 250 cash every wktxt great to 80878 httpwwwgr8prizescom 08715705022',
       'im good i have been thinking about you'], dtype=object)

pd.DataFrame(sample_cnt_vector, columns=vocab)
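To make the mechanics of a count vector concrete, the following sketch rebuilds the two-sentence example with only the standard library. It is an illustration, not CountVectorizer's actual implementation: it assumes a plain whitespace split is good enough for these two already-cleaned sentences, and sorts the vocabulary alphabetically to assign column indices.

```python
from collections import Counter

docs = [
    "free entry to the gr8prizes wkly comp 4 a chance to win the latest nokia "
    "8800 psp or 250 cash every wktxt great to 80878 httpwwwgr8prizescom 08715705022",
    "im good i have been thinking about you",
]

# Assumption: whitespace splitting stands in for word_tokenize on cleaned text.
tokenized = [doc.split() for doc in docs]

# Column order: sorted unique tokens, each mapped to its column index.
vocab = sorted({tok for toks in tokenized for tok in toks})
index = {tok: i for i, tok in enumerate(vocab)}

def count_vector(tokens):
    """Count each vocabulary word; tokens outside the vocabulary are ignored."""
    counts = Counter(tokens)
    return [counts.get(tok, 0) for tok in vocab]

matrix = [count_vector(toks) for toks in tokenized]

print(len(vocab))
print(matrix[0][index["to"]], matrix[0][index["the"]])
```

The first sentence contains "to" three times and "the" twice, which is why those two columns hold 3 and 2 while every other present word holds 1.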

1.2.2.1 2.2.1 Fit

Now repeat the process on the full training data.

cnt_vectorizer = CountVectorizer(tokenizer=word_tokenize)
cnt_vectorizer.fit(train_text)
>>> CountVectorizer(tokenizer=<function word_tokenize at 0x0000011B376821F0>)

The vocabulary contains 7846 words in total.

len(cnt_vectorizer.vocabulary_)
>>>
7846

1.2.2.2 2.2.2 Transform

train_matrix = cnt_vectorizer.transform(train_text)
test_matrix = cnt_vectorizer.transform(test_text)

What happens when a word that is not in the vocabulary comes in?
CountVectorizer ignores any word that is not in the vocabulary it learned during fitting.

cnt_vectorizer.transform(["notavailblewordforcnt"]).toarray().sum()
>>> 0

1.3 3. Naive Bayes

For classification, the Naive Bayes model used here is BernoulliNB from sklearn.naive_bayes.

from sklearn.naive_bayes import BernoulliNB

naive_bayes = BernoulliNB()
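BernoulliNB models each word as present or absent; with its default binarize=0.0 setting, any count above zero is treated as 1. The sketch below is a from-scratch illustration of that idea with Laplace smoothing, written under stated assumptions: the five training messages are made up, tokens come from a whitespace split, and ALPHA plays the role of the smoothing parameter. It is not scikit-learn's implementation.

```python
import math

# Hypothetical toy corpus: 1 = spam, 0 = ham.
train = [
    ("win cash now", 1),
    ("free cash prize", 1),
    ("see you at lunch", 0),
    ("are you free now", 0),
    ("lunch at noon", 0),
]
ALPHA = 1.0  # Laplace smoothing

vocab = sorted({w for text, _ in train for w in text.split()})
classes = sorted({y for _, y in train})

n_docs = {c: sum(1 for _, y in train if y == c) for c in classes}
log_prior = {c: math.log(n_docs[c] / len(train)) for c in classes}

# P(word present | class), smoothed so it is never exactly 0 or 1.
p_word = {
    c: {
        w: (sum(1 for t, y in train if y == c and w in t.split()) + ALPHA)
        / (n_docs[c] + 2 * ALPHA)
        for w in vocab
    }
    for c in classes
}

def predict(text):
    """Return the class with the highest log posterior for one message."""
    present = set(text.split())
    scores = {}
    for c in classes:
        score = log_prior[c]
        for w in vocab:
            p = p_word[c][w]
            # Absent words contribute too: that is the Bernoulli part.
            score += math.log(p) if w in present else math.log(1.0 - p)
        scores[c] = score
    return max(scores, key=scores.get)

predictions = [predict("free cash"), predict("see you at noon")]
print(predictions)
```

Unlike a multinomial model, every vocabulary word affects the score, including the ones that do not appear in the message.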

1.3.1 3.1 Training

naive_bayes.fit(train_matrix, train_label)

1.3.2 3.2 Prediction

train_pred = naive_bayes.predict(train_matrix)
test_pred = naive_bayes.predict(test_matrix)
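When predicting on a brand-new raw message, the same cleaning steps have to be applied before vectorizing. The sketch below shows one way to bundle this, under several assumptions: clean is a hypothetical helper mirroring the regex and lowercase steps, the training messages are made up, and CountVectorizer's default tokenizer is used instead of word_tokenize (the default drops single-character tokens).

```python
import re

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import BernoulliNB
from sklearn.pipeline import make_pipeline

def clean(message):
    """Hypothetical helper: keep letters, digits, spaces; then lowercase."""
    return re.sub(r"[^a-zA-Z0-9 ]", "", message).lower()

# Made-up training data for illustration: 1 = spam, 0 = ham.
train_messages = [
    "Win cash now!",
    "FREE cash prize!!",
    "See you at lunch.",
    "Are you free now?",
    "Lunch at noon",
]
train_labels = [1, 1, 0, 0, 0]

# The pipeline fits the vectorizer and the classifier together.
model = make_pipeline(CountVectorizer(), BernoulliNB())
model.fit([clean(m) for m in train_messages], train_labels)

new_messages = ["FREE cash!!!", "See you at noon?"]
predictions = model.predict([clean(m) for m in new_messages]).tolist()
print(predictions)
```

Keeping the vectorizer and classifier in one pipeline object reduces the chance of transforming new text with a differently fitted vocabulary.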

1.3.3 3.3 Evaluation

from sklearn.metrics import accuracy_score

train_acc = accuracy_score(train_label, train_pred)
test_acc = accuracy_score(test_label, test_pred)

print(f"Train Accuracy is {train_acc:.4f}")
print(f"Test Accuracy is {test_acc:.4f}")
>>>
Train Accuracy is 0.9839
Test Accuracy is 0.9701
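Because ham messages far outnumber spam, accuracy is easier to interpret next to a majority-class baseline. The short calculation below uses the label counts reported earlier for the whole dataset (4827 ham, 747 spam); the exact ratio in the test split may differ slightly, so treat this as a rough reference point.

```python
# Label counts for the full dataset, as shown by label.value_counts().
counts = {"ham": 4827, "spam": 747}
total = sum(counts.values())

# Accuracy of a classifier that always predicts the majority class (ham).
baseline_acc = max(counts.values()) / total
spam_share = counts["spam"] / total

print(f"Majority-class baseline accuracy: {baseline_acc:.4f}")
print(f"Share of spam messages: {spam_share:.4f}")
```

A model that always answers ham would already reach roughly 0.87, so the test accuracy of 0.9701 should be read against that figure rather than against 0.5.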
