Text Mining with Python 6. Text Analysis: English Text Mining 1

HEAD1TON 2017. 9. 11. 17:56

http://www.imdb.com/title/tt0110912/?ref_=nv_sr_1

Let's crawl reviews from an English-language movie rating site and run some simple text mining on them.

import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

stopWords = set(stopwords.words('english'))
lemmatizer = WordNetLemmatizer()

# The with block closes the file automatically, so no explicit close() is needed
with open('dark_knight_review.txt', 'r', encoding='utf-8') as f:
    lines = f.readlines()

reviewedList = []

for line in lines:
    tokens = nltk.word_tokenize(line)   # tokenize

    kept = []
    for token in tokens:
        if token.lower() not in stopWords:             # remove stopwords
            kept.append(lemmatizer.lemmatize(token))   # lemmatize
    reviewedList.append(' '.join(kept))

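First, I tokenized the text, removed stopwords, and reduced each remaining word to its base form (lemmatization).

I loaded a text file built by crawling 100 reviews of the movie The Dark Knight and coded the steps in the order shown above. Note that the NLTK stopword list contains only lowercase forms, so each token is lowercased with .lower() before it is compared against the list.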
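To see why the lowercase comparison matters, here is a minimal sketch. The three-word `stop_words` set is a hypothetical stand-in for NLTK's English stopword list, and the sentence is split on whitespace instead of with `nltk.word_tokenize`, so the snippet runs without any downloads.

```python
# Hypothetical stand-in for set(stopwords.words('english')); the real list is
# also all lowercase, which is the property this sketch illustrates.
stop_words = {"the", "is", "a"}

tokens = "The Joker is a great villain".split()

# Case-sensitive check: "The" is not in the set, so it slips through.
case_sensitive = [t for t in tokens if t not in stop_words]

# Lowercasing only for the comparison removes "The" while keeping the
# original casing of the tokens that survive.
case_insensitive = [t for t in tokens if t.lower() not in stop_words]

print(case_sensitive)    # ['The', 'Joker', 'great', 'villain']
print(case_insensitive)  # ['Joker', 'great', 'villain']
```

Because .lower() is applied only inside the membership test, proper nouns such as "Joker" keep their capitalization in the output.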

with open('dark_lemmatized.txt', 'w', encoding='utf-8') as f:
    for reviewed in reviewedList:
        f.write(reviewed + '\n')

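The result is saved back to a text file.

Next, let's extract the nouns that appear most often in these reviews.

If the code fails partway through with an NLTK error about a missing resource, run nltk.download() and fetch whatever is needed.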
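As a rough sketch, the downloads can also be scripted instead of picked from the interactive downloader. The package names below are my assumption about what the calls in this post need. Recent NLTK releases may ask for differently named packages, in which case the LookupError message states the exact name to pass to `nltk.download`. The helper name `download_required_packages` is made up for this example.

```python
# Assumed mapping from NLTK data package to the call in this post that uses it.
REQUIRED_PACKAGES = {
    "punkt": "nltk.word_tokenize",
    "stopwords": "nltk.corpus.stopwords",
    "wordnet": "WordNetLemmatizer",
    "averaged_perceptron_tagger": "nltk.pos_tag",
}


def download_required_packages():
    """Download each package and return {package_name: success flag}."""
    import nltk  # imported here so the mapping above is usable without NLTK

    # nltk.download returns True on success and False on failure.
    return {name: nltk.download(name, quiet=True) for name in REQUIRED_PACKAGES}
```

Calling `download_required_packages()` once before running the rest of the code should be enough, since packages that are already up to date are not fetched again.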

import nltk
from collections import Counter

nounlist = []

with open('dark_lemmatized.txt', 'r', encoding='utf-8') as f:
    lines = f.readlines()

for line in lines:
    tokens = nltk.word_tokenize(line)     # split the line into tokens
    tags = nltk.pos_tag(tokens)           # tag each token with its part of speech

    for word, tag in tags:
        if tag in ['NN', 'NNS', 'NNP', 'NNPS']:     # keep nouns and add them to nounlist
            nounlist.append(word.lower())

counts = Counter(nounlist)
print(counts.most_common(10))

[('movie', 340), ('batman', 249), ('film', 247), ('joker', 177), ('dark', 113), ('ledger', 100), ('knight', 95), ('heath', 88), ('time', 83), ('action', 67)]

As the output shows, the 10 nouns that appear most often in the reviews are printed in order of frequency.

Verbs and adjectives can be extracted in the same way.
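For example, in the Penn Treebank tagset that `nltk.pos_tag` uses by default for English, verb tags start with VB and adjective tags start with JJ, so only the tag filter has to change. The sketch below uses a few hand-tagged pairs as a stand-in for real `nltk.pos_tag` output, and `most_common_by_tag` is a helper name invented for this example.

```python
from collections import Counter

# Hand-tagged (word, tag) pairs standing in for the output of nltk.pos_tag.
tagged = [
    ("Ledger", "NNP"), ("plays", "VBZ"), ("a", "DT"), ("terrifying", "JJ"),
    ("Joker", "NNP"), ("and", "CC"), ("plays", "VBZ"), ("him", "PRP"),
    ("well", "RB"), ("in", "IN"), ("a", "DT"), ("dark", "JJ"), ("film", "NN"),
]


def most_common_by_tag(tagged_tokens, tag_prefix, n=10):
    """Count lowercased words whose tag starts with tag_prefix."""
    words = [w.lower() for w, tag in tagged_tokens if tag.startswith(tag_prefix)]
    return Counter(words).most_common(n)


verbs = most_common_by_tag(tagged, "VB")       # VB, VBD, VBG, VBN, VBP, VBZ
adjectives = most_common_by_tag(tagged, "JJ")  # JJ, JJR, JJS

print(verbs)       # [('plays', 2)]
print(adjectives)  # [('terrifying', 1), ('dark', 1)]
```

Passing "NN" as the prefix covers the same four noun tags as the explicit list in the code above.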

We can also count the total number of tokens across all reviews.

import nltk
from nltk.corpus import stopwords

stopWords = set(stopwords.words('english'))

with open('dark_knight_review.txt', 'r', encoding='utf-8') as f:
    lines = f.readlines()

token_numbers = []

for line in lines:
    line = nltk.word_tokenize(line.lower())   # lowercase, then tokenize
    for word in line:
        if word not in stopWords:             # keep only non-stopword tokens
            token_numbers.append(word)

corpus = nltk.Text(token_numbers)

print(len(corpus.tokens))           # total number of tokens
print(len(set(corpus.tokens)))      # number of distinct tokens

20575 6069

The token frequencies can also be visualized with a plot.

corpus.plot(50)
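The plot shows the most frequent tokens with their counts. If no plotting backend is available, roughly the same information can be printed as text. This is only a sketch: the short `tokens` list is a hand-made stand-in for `corpus.tokens`, and `text_bars` is a helper name invented here.

```python
from collections import Counter

# Small hand-made token list standing in for corpus.tokens.
tokens = ["movie", "batman", "movie", "joker", "'s",
          "movie", "batman", "'s", "'s", "'s"]

counts = Counter(tokens)


def text_bars(counter, n):
    """Return one 'token count bar' line for each of the n most frequent tokens."""
    lines = []
    for token, count in counter.most_common(n):
        lines.append(f"{token:<8}{count:>3} {'#' * count}")
    return lines


for line in text_bars(counts, 3):
    print(line)
```

With the real corpus, `Counter(corpus.tokens)` and `n=50` would list the same 50 tokens that the plot draws.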

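Tokens such as ' and 's, which are unnecessary in most situations, show up frequently. One way to deal with them is to add extra entries to the stopwords so that they are filtered out as well.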
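A minimal sketch of that idea follows. The small `stop_words` set is a hypothetical stand-in for NLTK's English stopword list, and the entries in `extra` are my guess at what is worth filtering, so adjust them to whatever appears in your own frequency plot.

```python
import string

# Hypothetical stand-in for set(stopwords.words('english')).
stop_words = {"the", "is", "a"}

# Extra entries to filter: every single punctuation character, plus clitics
# and quote tokens that a tokenizer commonly splits off (assumed list).
extra = set(string.punctuation) | {"'s", "n't", "``", "''", "\u2018", "\u2019"}
extended_stop_words = stop_words | extra

tokens = ["the", "joker", "'s", "performance", "is", "chilling", ".", "'"]
filtered = [t for t in tokens if t.lower() not in extended_stop_words]

print(filtered)  # ['joker', 'performance', 'chilling']
```

With the real list from the earlier code, the same effect comes from `stopWords.update(extra)` before the filtering loop.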
