Text Mining with Python 7. Text Analysis: English Text Mining 2

HEAD1TON 2017. 9. 11. 17:57

Continuing from the first post, this time we look at the relationships between words.

We compute similarity to find words in the reviews that are used in similar contexts.

corpus.similar('Batman')
print('-'*60)
corpus.similar('Joker')

film movies superhero seen action movie actors character heath performance modern villain second goodbetter aniconic amore batpod iconic watched
------------------------------------------------------------
jack time sequel dramas

These are the words judged to be similar to 'Batman' and 'Joker', respectively, based on the contexts in which they appear in the review text.
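To make the idea of contextual similarity concrete, here is a minimal sketch in plain Python. It is not NLTK's actual implementation, only the rough idea: two words count as similar when they occur between the same left and right neighbors. The function name `similar_words` and the toy token list are assumptions made for illustration.

```python
from collections import Counter, defaultdict

def similar_words(tokens, target, top_n=5):
    """Rank words by how many (left, right) contexts they share with target."""
    tokens = [t.lower() for t in tokens]
    target = target.lower()

    # Record the (previous word, next word) pair around every interior token.
    contexts = defaultdict(set)
    for prev, word, nxt in zip(tokens, tokens[1:], tokens[2:]):
        contexts[word].add((prev, nxt))

    # Look the target up once, without creating a new key while iterating.
    target_contexts = contexts.get(target, set())

    scores = Counter()
    for word, ctxs in contexts.items():
        shared = len(ctxs & target_contexts)
        if word != target and shared > 0:
            scores[word] = shared
    return [word for word, _ in scores.most_common(top_n)]

# Hypothetical toy corpus, not taken from the review file.
toy = ("the batman film was great . the joker film was dark . "
       "the batman movie was long . the joker movie was loud").split()
print(similar_words(toy, "Batman"))
```

In the toy corpus, 'batman' and 'joker' both appear between 'the' and 'film' and between 'the' and 'movie', so 'joker' is the only word reported as similar.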

Printing the collocations of the text gives the following.

corpus.collocations()

dark knight; heath ledger; christian bale; harvey dent; comic book; christopher nolan; bruce wayne; aaron eckhart; gary oldman; two face; maggie gyllenhaal; morgan freeman; rachel dawes; special effects; batman begins; gotham city; 've seen; district attorney; michael caine; superhero

The output lists the word pairs that frequently appear together.
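As a rough illustration of what "frequently appear together" means, the sketch below simply counts adjacent word pairs. This is a simplification and an assumption on my part: NLTK's own `collocations` ranks pairs with a statistical association score and filters out stopwords, which this version does not do. The helper name `frequent_bigrams` and the toy sentence are hypothetical.

```python
from collections import Counter

def frequent_bigrams(tokens, min_count=2, top_n=5):
    """Return adjacent word pairs that occur at least min_count times."""
    lowered = [t.lower() for t in tokens]
    # Pair each token with its right neighbor; skip pairs that include punctuation.
    pair_counts = Counter(
        (a, b) for a, b in zip(lowered, lowered[1:])
        if a.isalpha() and b.isalpha()
    )
    return [
        " ".join(pair)
        for pair, count in pair_counts.most_common(top_n)
        if count >= min_count
    ]

# Hypothetical toy text, not taken from the review file.
toy = "heath ledger was great . the dark knight is dark . heath ledger and the dark knight".split()
result = frequent_bigrams(toy)
print(result)
```

Note that 'the dark' shows up alongside 'heath ledger' and 'dark knight', which is why a real collocation finder also filters out stopwords such as 'the'.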

Finally, let's build a graph of associated words.

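import nltk
import numpy as np
import networkx as nx
import matplotlib.pyplot as plt

uniqueNouns = set()   # distinct nouns seen across all reviews
sentences = []        # one list of (word, tag) pairs per line of the file

# The with block closes the file automatically, so no explicit close() is needed.
with open('dark_knight_review.txt', 'r', encoding='utf-8') as f:
    lines = f.readlines()

for line in lines:
    tokens = nltk.word_tokenize(line)
    tags = nltk.pos_tag(tokens)      # part-of-speech tag for each token
    sentences.append(tags)

    for word, tag in tags:
        # Penn Treebank noun tags: singular, plural, proper singular, proper plural
        if tag in ['NN', 'NNS', 'NNP', 'NNPS']:
            uniqueNouns.add(word)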

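Based on the review text, we want to build a graph that connects nouns appearing in the same sentence (here, each line of the file is treated as one sentence). The first step is to extract the unique nouns and add them to the uniqueNouns set.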

uniqueNouns = list(uniqueNouns)                                # convert set -> list so each noun has a fixed position
nounIndex = {noun: i for i, noun in enumerate(uniqueNouns)}    # assign a column index to each noun
matrix = np.zeros([len(sentences), len(uniqueNouns)])          # rows (sentences) x columns (unique nouns), initialized to 0

After creating the matrix as above,

for i, sentence in enumerate(sentences):
    for word, tag in sentence:
        if tag in ['NN', 'NNS', 'NNP', 'NNPS']:
            index = nounIndex[word]
            matrix[i][index] = 1      # mark that this noun occurs in sentence i

cooccurMat = matrix.T.dot(matrix)     # (nouns x nouns) co-occurrence counts

graph = nx.Graph()

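The nested for loop sets an entry to 1 when the corresponding noun appears in that sentence.

Multiplying the transpose of this matrix by the matrix itself then gives the co-occurrence matrix, in which entry (i, j) counts the sentences that contain both noun i and noun j.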
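A small worked example may help show why the transpose product yields co-occurrence counts. The three nouns and three sentences below are made-up toy data, not taken from the review file.

```python
import numpy as np

# Rows are sentences, columns are nouns (hypothetical toy data).
nouns = ["batman", "joker", "gotham"]
matrix = np.array([
    [1, 1, 0],   # sentence 0 mentions batman and joker
    [1, 0, 1],   # sentence 1 mentions batman and gotham
    [1, 1, 1],   # sentence 2 mentions all three
])

# Entry (i, j) sums matrix[s][i] * matrix[s][j] over all sentences s,
# which is 1 only when both nouns occur in sentence s.
cooccur = matrix.T.dot(matrix)
print(cooccur)
# [[3 2 2]
#  [2 2 1]
#  [2 1 2]]
```

The diagonal holds the number of sentences in which each noun appears on its own account (3 for 'batman'), and the matrix is symmetric, so only the entries above the diagonal are needed when adding edges.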

for i in range(len(uniqueNouns)):
    for j in range(i + 1, len(uniqueNouns)):
        if cooccurMat[i][j] > 30:
            graph.add_edge(uniqueNouns[i], uniqueNouns[j])

Noun pairs whose co-occurrence count is greater than 30 are connected by an edge, which also adds the two nouns to the graph as nodes.
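The selection rule can be checked on a small matrix without drawing anything. This sketch uses a hypothetical helper, `edges_above`, and made-up counts to list the noun pairs that would become edges for a given threshold. It visits only the upper triangle (`j > i`), so each pair appears once and the diagonal is skipped.

```python
def edges_above(cooccur, labels, threshold):
    """List (noun_a, noun_b, count) for pairs whose count exceeds threshold."""
    edges = []
    for i in range(len(labels)):
        for j in range(i + 1, len(labels)):
            if cooccur[i][j] > threshold:      # strictly greater, as in the loop above
                edges.append((labels[i], labels[j], cooccur[i][j]))
    return edges

# Hypothetical counts, chosen only to illustrate the cutoff.
labels = ["batman", "joker", "gotham"]
counts = [
    [50, 42, 31],
    [42, 45, 12],
    [31, 12, 33],
]
print(edges_above(counts, labels, 30))
# [('batman', 'joker', 42), ('batman', 'gotham', 31)]
```

Raising the threshold thins out the graph, while lowering it adds more edges and makes the drawing harder to read, so the value 30 is best treated as a tunable choice for this particular data set.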

plt.figure(figsize=(15,15))
layout = nx.random_layout(graph)
nx.draw(graph,pos=layout,with_labels=True,font_size=20,alpha=0.3,node_size=3000)
plt.show()

Drawing the graph produces a figure showing how the nouns are related to one another.
