Text Data Analysis Basics Review 1 (Preprocessing Text Data)


by 못난명서 2023. 1. 24. 23:12

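Hello!

Today I'm going to review the text data analysis exercises with nltk that I learned in the goorm (구름) AI course.

(The nltk library we'll be using is the Natural Language Toolkit, a Python library used mainly for analyzing English text data.)

The overall text data analysis process looks like this.

Today's review focuses on Preprocessing Text Data.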

The process of data analysis for text data

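Prepare the text data as a str
Preprocessing Text Data
- Tokenizing
- POS tagging (identifying parts of speech)
- Removing stopwords
- Lemmatizing (finding a word's base form)
Text Data Exploration
- Extracting tokens by part of speech
- Visualizing token frequencies
- Finding words similar to a given word
- Finding pairs of words that appear consecutively
Text Similarity Analysis
- TF-IDF
- Cosine Similarity
- Computing similarity between movie reviews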

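First, let's import the nltk library we'll use for the text data analysis.

# NLTK ships with Anaconda, so no separate installation is needed there.
# !pip install nltk==3.6.1

import nltk

# Download the nltk data packages you need with the commands below (or through the nltk.download() dialog).
# Note: recent NLTK releases may ask for 'punkt_tab' and 'averaged_perceptron_tagger_eng' instead;
# if a LookupError appears, download the resource it names.
# nltk.download('punkt')
# nltk.download('stopwords')
# nltk.download('wordnet')
# nltk.download('averaged_perceptron_tagger')
# nltk.download('maxent_treebank_pos_tagger')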

Next, let's store the text we want to preprocess in a string variable, tokenize it, and print the result.

# Store the text to preprocess in a string variable
sentence = 'NLTK is a leading platform for building Python programs to work with human language data. It provides easy-to-use interfaces to over 50 corpora and lexical resources such as WordNet, along with a suite of text processing libraries for classification, tokenization, stemming, tagging, parsing, and semantic reasoning, wrappers for industrial-strength NLP libraries, and an active discussion forum.'

# Print the result of tokenizing the text
nltk.word_tokenize(sentence)  # tokenize at the word level -> list

The output shows the sentences split into word-level tokens and collected in a list.
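To get a feel for what word-level tokenization does, here is a deliberately simplified stand-in written with only the standard library. It is not NLTK's algorithm: `simple_word_tokenize` is a hypothetical helper that just separates runs of word characters from individual punctuation marks, so its output will differ from `nltk.word_tokenize` on contractions, abbreviations, and similar cases.

```python
import re

# Hypothetical helper for illustration only; not part of NLTK.
# \w+(?:-\w+)* keeps hyphenated words such as "easy-to-use" together,
# and [^\w\s] emits every other non-space character as its own token.
TOKEN_PATTERN = re.compile(r"\w+(?:-\w+)*|[^\w\s]")

def simple_word_tokenize(text):
    """Split text into word tokens and single punctuation tokens."""
    return TOKEN_PATTERN.findall(text)

sample = "It provides easy-to-use interfaces, and an active forum."
sample_tokens = simple_word_tokenize(sample)
print(sample_tokens)
```

As with `nltk.word_tokenize`, the comma and the period come back as separate tokens, which is why they have to be dealt with later in the pipeline.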

Next, let's run POS tagging on this English text.

tokens = nltk.word_tokenize(sentence)  # tokenize the text

nltk.pos_tag(tokens)  # tag each token with its part of speech -> list of (token, tag) tuples
# Read the tag's first letter (Penn Treebank tags): Noun -> N, Verb -> V, Adjective -> J, Adverb -> R

So far we have tokenized the text with word_tokenize and looked up each token's part of speech with pos_tag.

(Note that word_tokenize returns a plain list, while pos_tag returns a list of tuples.)
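Because pos_tag returns (token, tag) tuples, picking out one part of speech comes down to unpacking each tuple and checking the tag's prefix. The sketch below assumes a small hand-written tagged list (not actual pos_tag output) and a hypothetical helper named `tokens_with_prefix`.

```python
# Hand-written sample in the same (token, tag) shape that nltk.pos_tag returns.
# The tags are illustrative assumptions, not output copied from NLTK.
tagged_sample = [
    ("NLTK", "NNP"),
    ("is", "VBZ"),
    ("a", "DT"),
    ("leading", "VBG"),
    ("platform", "NN"),
    (".", "."),
]

def tokens_with_prefix(tagged_tokens, prefix):
    """Return the tokens whose tag starts with the given prefix."""
    return [token for token, tag in tagged_tokens if tag.startswith(prefix)]

nouns = tokens_with_prefix(tagged_sample, "N")
verbs = tokens_with_prefix(tagged_sample, "V")
print(nouns)  # ['NLTK', 'platform']
print(verbs)  # ['is', 'leading']
```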

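Looking at the output, though, you can see that commas, periods, and many other words that carry little meaning were tokenized and tagged as well.

Tokens like these, which contribute nothing to the analysis, are called stopwords. (The code in this post adds the comma and period to NLTK's English stopword list by hand.)

There's no reason to carry them through the rest of the analysis, so let's add a step that removes them.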

# Import the stopwords corpus from nltk.
from nltk.corpus import stopwords

# Load the English stopwords (a list of words) into a variable.
# Supported languages can be listed with stopwords.fileids().
stop_words = stopwords.words('english')
# print(len(stop_words)); print(stop_words)

# Add the comma and period to the stopwords
stop_words.append(',')
stop_words.append('.')

result = []  # list to hold the tokens that survive stopword removal

# Remove stopwords with a loop
for token in tokens:
    if token.lower() not in stop_words:
        result.append(token)
print(result)
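The same filter can be written more compactly. The sketch below is one possible variant, assuming a tiny hand-written stopword set in place of NLTK's full English list so that it runs without any downloads; `remove_stopwords` is a hypothetical helper name. It uses a set for fast membership tests and `string.punctuation` to cover every ASCII punctuation mark rather than only the comma and period.

```python
import string

# Tiny illustrative subset; in the post this would be stopwords.words('english').
sample_stop_words = {"is", "a", "for", "to", "with", "and", "it"}

# string.punctuation is a string of ASCII punctuation characters;
# turning it into a set adds each character as a one-character token to drop.
ignored = sample_stop_words | set(string.punctuation)

def remove_stopwords(tokens, ignored_tokens):
    """Keep tokens whose lowercase form is not in ignored_tokens."""
    return [token for token in tokens if token.lower() not in ignored_tokens]

sample_tokens = ["NLTK", "is", "a", "leading", "platform", ",", "It", "works", "."]
filtered = remove_stopwords(sample_tokens, ignored)
print(filtered)  # ['NLTK', 'leading', 'platform', 'works']
```

Lowercasing before the lookup matters here: "It" is dropped only because its lowercase form "it" is in the set.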

Next, let's look at the WordNetLemmatizer.

Lemmatization means using morphological and dictionary-based analysis to strip away a word's inflected variation and find its lemma, the base dictionary form.

For example, it turns 'cats' and 'geese' into 'cat' and 'goose', and 'ran' and 'better' into 'run' and 'good'.

Let's look at some example code.

# Create a WordNetLemmatizer object
from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()

print(lemmatizer.lemmatize("cats"))
print(lemmatizer.lemmatize("geese"))

print(lemmatizer.lemmatize("better"))
print(lemmatizer.lemmatize("better", pos="a"))

print(lemmatizer.lemmatize("ran"))
print(lemmatizer.lemmatize("ran", 'v'))

# WordNetLemmatizer accepts a PoS argument for more accurate results (n: noun, v: verb, a: adjective, r: adverb)
# The default is n (noun), so 'cats' and 'geese' come out as the base noun forms 'cat' and 'goose'
# 'ran' needs the verb PoS 'v' to be reduced to 'run'
# Likewise, 'better' needs the adjective PoS 'a' to be reduced to its base form 'good'
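To see why the PoS argument changes the result, here is a toy illustration in which the lemma is looked up by the pair (word, pos). It is a hand-written table with a hypothetical `toy_lemmatize` function, not how WordNetLemmatizer works internally; it only mimics the behavior described in the comments above, where a word with no match under the default noun PoS comes back unchanged.

```python
# Toy lookup keyed by (word, pos); hand-written for illustration only.
# NLTK's WordNetLemmatizer consults WordNet instead of a table like this.
TOY_LEMMAS = {
    ("cats", "n"): "cat",
    ("geese", "n"): "goose",
    ("ran", "v"): "run",
    ("better", "a"): "good",
}

def toy_lemmatize(word, pos="n"):
    """Return the lemma for (word, pos), or the word unchanged if unknown."""
    return TOY_LEMMAS.get((word, pos), word)

print(toy_lemmatize("geese"))        # goose
print(toy_lemmatize("ran"))          # ran  (no noun entry, so unchanged)
print(toy_lemmatize("ran", "v"))     # run
print(toy_lemmatize("better", "a"))  # good
```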

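Now that we've covered how to break up a str and clean it for analysis, let's do a quick combined review.

I'll post two versions of the code. The first removes stopwords and lemmatizes every remaining token with the default (noun) setting. The second also removes stopwords, but tags each token first and passes the matching part of speech to the lemmatizer.

The two outputs match wherever the part of speech makes no difference, but they can differ for verbs and adjectives, so compare them yourself.

(The text file you need is attached.)

moviereview.txt (0.04 MB)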

# Version 1: stopwords only
stop_words = stopwords.words("english")
stop_words.append(',')
stop_words.append('.')

# Open in read mode ('r') with utf-8 encoding; the with block closes the file automatically
with open('moviereview.txt', 'r', encoding='utf-8') as file:
    lines = file.readlines()  # read the file's contents into a list of lines

sentence = lines[1]  # index 1 is the second line of the file
tokens = nltk.word_tokenize(sentence)

# Remove stopwords and lemmatize in a single loop
lemmas = []  # list to hold the lemmatized tokens

for token in tokens:
    if token.lower() not in stop_words:  # keep the token only if its lowercase form is not a stopword
        lemmas.append(lemmatizer.lemmatize(token))  # append the lemma to the list

print('Lemmas of : ' + sentence)  # print the original line, then its lemmas
print(lemmas)

# Version 2: using POS tags

# Stopwords
stop_words = stopwords.words("english")
stop_words.append(',')
stop_words.append('.')

# Open in read mode ('r') with utf-8 encoding; the with block closes the file automatically
with open('moviereview.txt', 'r', encoding='utf-8') as file:
    lines = file.readlines()  # read the file's contents into a list of lines

sentence = lines[1]  # index 1 is the second line of the file
tokens = nltk.word_tokenize(sentence)

tagged_tokens = nltk.pos_tag(tokens)  # list of (token, tag) tuples

# Remove stopwords and lemmatize in a single loop
lemmas = []  # list to hold the lemmatized tokens

for token, pos in tagged_tokens:
    if token.lower() not in stop_words:  # keep the token only if its lowercase form is not a stopword
        # Pass the matching WordNet PoS based on the first letter of the tag
        if pos.startswith('N'):
            lemmas.append(lemmatizer.lemmatize(token, pos='n'))
        elif pos.startswith('J'):
            lemmas.append(lemmatizer.lemmatize(token, pos='a'))
        elif pos.startswith('V'):
            lemmas.append(lemmatizer.lemmatize(token, pos='v'))
        else:
            lemmas.append(lemmatizer.lemmatize(token))  # every other tag falls back to the default (noun)

print('Lemmas of : ' + sentence)  # print the original line, then its lemmas
print(lemmas)
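The if/elif chain that maps tags to WordNet PoS codes could also be pulled out into a small lookup helper. The sketch below is one way to do that; `penn_to_wordnet` and `PREFIX_TO_WORDNET` are hypothetical names, and unlike the code above it also maps adverb tags (starting with R) to 'r', which is an addition of mine and not something the original code does.

```python
# Hypothetical helper: map a Penn Treebank tag to the single-letter
# PoS codes that WordNetLemmatizer.lemmatize accepts ('n', 'v', 'a', 'r').
PREFIX_TO_WORDNET = {"N": "n", "V": "v", "J": "a", "R": "r"}

def penn_to_wordnet(tag, default="n"):
    """Return the WordNet PoS for a Penn tag, falling back to the noun default."""
    # tag[:1] is the first character, or '' for an empty tag.
    return PREFIX_TO_WORDNET.get(tag[:1], default)

for tag in ["NNS", "VBD", "JJR", "RB", "DT"]:
    print(tag, "->", penn_to_wordnet(tag))
```

With this helper, the body of the loop could shrink to a single call such as `lemmatizer.lemmatize(token, pos=penn_to_wordnet(pos))`.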

Thank you.

Next time I'll review the Text Data Exploration stage.

Attribution, NonCommercial, NoDerivatives
