[text-mining] Sentiment Analysis :: 마인드스케일

Setup

Opening the file

import pandas as pd
df = pd.read_excel('yelp.xlsx')
df.head()

	review	sentiment
0	Wow... Loved this place.	1
1	Crust is not good.	0
2	Not tasty and the texture was just nasty.	0
3	Stopped by during the late May bank holiday of...	1
4	The selection on the menu was great and so wer...	1

Building the document-term matrix

from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer(max_features=1000, stop_words='english')
dtm = cv.fit_transform(df.review)

Specifying x and y

x = dtm
y = df.sentiment

Splitting the data

from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(
    x,
    y,
    test_size=0.2,   # hold out 20% of the data for testing
    random_state=42) # fix the pseudo-random seed at 42 so the split is reproducible
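As a sanity check on what the split produces, the sketch below runs `train_test_split` on synthetic data. The arrays `X_demo` and `y_demo` are hypothetical stand-ins for `x` and `y`, and the `stratify` argument is an optional addition that the tutorial's own call does not use; it keeps the class proportions the same in both parts.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical stand-in data: 1000 rows, perfectly balanced labels
X_demo = np.arange(1000).reshape(-1, 1)
y_demo = np.array([0, 1] * 500)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo,
    y_demo,
    test_size=0.2,      # 20% held out for testing
    random_state=42,    # reproducible shuffle
    stratify=y_demo)    # assumption: not used in the tutorial's split

print(len(X_tr), len(X_te))       # sizes of the two parts
print(y_tr.mean(), y_te.mean())   # share of positive labels in each part
```

Without `stratify`, the class balance of the test part can drift slightly from that of the full data, which matters more when the data set is small or the classes are imbalanced.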

Naive Bayes classification

Import

from sklearn.naive_bayes import BernoulliNB

Creating the model

model = BernoulliNB()

Training

model.fit(x_train, y_train)

Evaluating accuracy on the training data

model.score(x_train, y_train)

0.92125

Evaluating accuracy on the test data

model.score(x_test, y_test)

0.765

Per-word log-probability ratio (positive vs. negative)

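prob_df = pd.DataFrame({
    # '단어' means "word": the vocabulary terms, in column order of the matrix
    '단어': cv.get_feature_names_out(),
    # '비율' means "ratio": log P(word | positive) - log P(word | negative)
    '비율': model.feature_log_prob_[1] - model.feature_log_prob_[0]
})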
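The difference of the two rows of `feature_log_prob_` is the log of a probability ratio, which is why positive values point to the positive class and negative values to the negative class. The sketch below checks this on a tiny hand-made matrix; the data and the names `X_toy`, `y_toy`, and `nb` are hypothetical. It relies on `BernoulliNB` using Laplace smoothing with its default `alpha=1.0`, so the smoothed probability of a word in a class is (documents in the class containing the word + 1) / (documents in the class + 2).

```python
import numpy as np
from sklearn.naive_bayes import BernoulliNB

# Hypothetical toy data: columns are the words "great" and "slow"
X_toy = np.array([
    [1, 0],   # positive review containing "great"
    [1, 0],   # positive review containing "great"
    [0, 1],   # negative review containing "slow"
    [1, 1],   # negative review containing both words
])
y_toy = np.array([1, 1, 0, 0])

nb = BernoulliNB().fit(X_toy, y_toy)   # default alpha=1.0 (Laplace smoothing)

# Same expression as in the tutorial: class 1 row minus class 0 row
log_ratio = nb.feature_log_prob_[1] - nb.feature_log_prob_[0]

# Manual computation with the smoothing formula described above
p_pos = np.array([(2 + 1) / (2 + 2), (0 + 1) / (2 + 2)])   # P(word | positive)
p_neg = np.array([(1 + 1) / (2 + 2), (2 + 1) / (2 + 2)])   # P(word | negative)
manual = np.log(p_pos / p_neg)

print(log_ratio)
print(manual)
```

Both printed arrays should agree, with a positive value for "great" and a negative value for "slow". Note that `BernoulliNB` only looks at whether a word is present, since it binarizes the counts by default.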

Words that appear relatively more often in positive sentences

prob_df.sort_values('비율').tail(10)

	단어	비율
26	attentive	1.965811
744	spot	1.965811
221	fantastic	2.099343
365	loved	2.099343
12	amazing	2.159967
472	perfect	2.217126
208	excellent	2.322486
32	awesome	2.504808
153	delicious	3.155395
268	great	4.062952

Words that appear relatively more often in negative sentences

prob_df.sort_values('비율').head(10)

	단어	비율
37	bad	-2.688149
392	minutes	-2.619156
939	wasn	-2.465005
65	bland	-2.377994
703	slow	-2.177323
847	took	-2.177323
546	probably	-2.177323
179	don	-1.972529
816	terrible	-1.926009
607	rude	-1.926009
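The entries `wasn` and `don` in the table above are most likely fragments of the contractions "wasn't" and "don't". The sketch below shows how the default `CountVectorizer` tokenizer handles such words; the example sentence is made up for illustration. The default token pattern keeps only runs of two or more word characters, so the apostrophe splits the word and the trailing "t" is dropped.

```python
from sklearn.feature_extraction.text import CountVectorizer

# build_analyzer() returns the function that turns raw text into tokens
analyzer = CountVectorizer().build_analyzer()

# Hypothetical sentence, not taken from yelp.xlsx
tokens = analyzer("The crust wasn't good, so don't bother.")
print(tokens)
```

Read this way, `wasn` and `don` act as negation markers, which fits their place among the negative words.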

Logistic regression

Import

from sklearn.linear_model import LogisticRegressionCV

Using an elastic-net penalty, try three values of C (0.1, 1, 10) and three L1 ratios (0, 0.5, 1). Cross-validation evaluates all nine combinations and selects the one that performs best.

model = LogisticRegressionCV(
    penalty='elasticnet', solver='saga', random_state=42,
    Cs=[0.1, 1, 10], l1_ratios=[0, 0.5, 1], max_iter=4000)
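The nine candidates are simply the Cartesian product of the two lists passed to `LogisticRegressionCV`. The sketch below enumerates them without fitting anything; the variable names are illustrative only. In scikit-learn, `C` is the inverse of the regularization strength (smaller means stronger regularization), `l1_ratio=0` corresponds to a pure L2 penalty, and `l1_ratio=1` to a pure L1 penalty.

```python
from itertools import product

Cs = [0.1, 1, 10]        # inverse regularization strengths, as in the tutorial
l1_ratios = [0, 0.5, 1]  # share of the L1 term in the elastic-net penalty

grid = list(product(Cs, l1_ratios))

for C, l1_ratio in grid:
    if l1_ratio == 0:
        kind = "pure L2"
    elif l1_ratio == 1:
        kind = "pure L1"
    else:
        kind = "mixed L1/L2"
    print(f"C={C:<4} l1_ratio={l1_ratio:<4} {kind}")

print(len(grid), "combinations")
```

Each combination is fitted once per cross-validation fold, so the search cost grows with both the grid size and the number of folds.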

Training

model.fit(x_train, y_train)

Best C

model.C_

array([1.])

Best L1 ratio (0 means a pure L2 penalty)

model.l1_ratio_

array([0])

Accuracy on the training data

model.score(x_train, y_train)

0.94125

Accuracy on the test data

model.score(x_test, y_test)

0.76

Per-word weights

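word_coef = pd.DataFrame({
    # '단어' means "word": the vocabulary terms, in column order of the matrix
    '단어': cv.get_feature_names_out(),
    # '가중치' means "weight": coef_ has shape (1, n_features), so flatten it to 1-D
    '가중치': model.coef_.ravel()
})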

Positive words

word_coef.sort_values('가중치').tail(10)

	단어	가중치
208	excellent	1.340621
221	fantastic	1.439516
250	friendly	1.524526
364	love	1.525609
265	good	1.532660
409	nice	1.555230
32	awesome	1.714438
12	amazing	1.779166
153	delicious	2.201475
268	great	2.867876
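Logistic regression weights are on the log-odds scale, so exponentiating a weight gives an odds ratio that is easier to read. The sketch below does this for two weights copied from this page's output tables ("great" from the positive words and "bad" from the negative words); the variable names are illustrative.

```python
import math

coef_great = 2.867876    # weight of "great", copied from the positive-word table
coef_bad = -1.448292     # weight of "bad", copied from the negative-word table

# exp(weight) = factor by which the odds of the positive class are multiplied
# for each additional occurrence of the word, other words held fixed
odds_ratio_great = math.exp(coef_great)
odds_ratio_bad = math.exp(coef_bad)

print(f"great: odds multiplied by {odds_ratio_great:.2f}")
print(f"bad:   odds multiplied by {odds_ratio_bad:.2f}")
```

Under this model, one occurrence of "great" multiplies the odds of a positive review by about 17.6, while one occurrence of "bad" cuts them to roughly a quarter. These factors describe the fitted model, not a causal effect of the word.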

Negative words

word_coef.sort_values('가중치').head(10)

	단어	가중치
37	bad	-1.448292
179	don	-1.397359
65	bland	-1.274375
939	wasn	-1.225014
392	minutes	-1.177965
977	worst	-1.146635
607	rude	-1.040134
889	unfortunately	-1.039067
703	slow	-1.036536
546	probably	-1.015125

Prediction

y_pred = model.predict(x_test)

Predicting probabilities

probs = model.predict_proba(x_test)

Positive-class probability only

prob = probs[:, 1]

Predicting differently depending on the threshold

import numpy as np
threshold = 0.5 # decision threshold: predict positive when prob exceeds this value
y_pred = np.where(prob > threshold, 1, 0)
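Moving the threshold trades precision against recall. The sketch below illustrates this on a handful of made-up probabilities and labels; `prob_demo`, `y_true_demo`, and the helper `precision_recall` are hypothetical names and do not come from the tutorial's model.

```python
import numpy as np

# Hypothetical predicted probabilities and true labels
prob_demo = np.array([0.95, 0.80, 0.62, 0.55, 0.45, 0.30, 0.10])
y_true_demo = np.array([1, 1, 0, 1, 1, 0, 0])

def precision_recall(prob, y_true, threshold):
    """Return (precision, recall) for predictions made at the given threshold."""
    y_hat = np.where(prob > threshold, 1, 0)
    tp = int(np.sum((y_hat == 1) & (y_true == 1)))   # true positives
    fp = int(np.sum((y_hat == 1) & (y_true == 0)))   # false positives
    fn = int(np.sum((y_hat == 0) & (y_true == 1)))   # false negatives
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

results = {t: precision_recall(prob_demo, y_true_demo, t) for t in (0.3, 0.5, 0.7)}

for t, (p, r) in results.items():
    print(f"threshold={t}: precision={p:.2f}, recall={r:.2f}")
```

On this toy data, raising the threshold from 0.3 to 0.7 increases precision from 0.80 to 1.00 while recall falls from 1.00 to 0.50. A lower threshold suits cases where missing a positive is costly, a higher one where false alarms are costly.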