Political Twitter Part 2
Data Cleaning
This post takes the data created in Part 1 and performs standard text preprocessing.
Click Here for full code notebook
Begin by removing items from the text that add no value to the analysis:
- URLs
- The # symbol from hashtags (the hashtag text itself is kept)
- @user mentions
- Emojis
- Punctuation
Next, we perform some common NLP preprocessing tasks:
- Tokenization
- Removal of Stopwords
- Lemmatization
import os
import pickle
import re
import string
import warnings

import pandas as pd
import numpy as np
import boto3
import nltk
from nltk import FreqDist
import spacy

# S3 bucket that holds the project data
s3 = boto3.resource('s3')
bucket_name = "msds-practicum-carey"

# load spaCy's small English model; NER and the dependency parser are not
# needed for lemmatization, so they are disabled for speed
nlp = spacy.load('en_core_web_sm', disable=['ner', 'parser'])

warnings.filterwarnings('ignore')
Download the Data from AWS S3.
To keep the size of the code repository on GitLab small, I have stored all of the data in an S3 Object Store. tweet_df.pkl is a serialized Pandas DataFrame. After cleaning, the updated DataFrame will be sent back to S3 at the end of this notebook.
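The download below writes into a local outdata/ directory. If you are re-running the notebook from a fresh clone, a minimal safeguard (an assumption on my part, not part of the original notebook) is to create that directory first:

# create the local working directory if it does not exist yet (no-op if it does)
os.makedirs('outdata', exist_ok=True)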
with open('outdata/tweet_df.pkl', 'wb') as data:
    s3.Bucket(bucket_name).download_fileobj('tweet_df.pkl', data)
tweet_df = pd.read_pickle('outdata/tweet_df.pkl')
os.remove('outdata/tweet_df.pkl')
tweet_df.sample(10)
| | tweet | class |
|---|---|---|
| 1056834 | Homeland Security Committee hearing on #TSA &a... | L |
| 613094 | The last major US nail manufacturer--and their... | L |
| 405118 | In case you missed it, @SpeakerRyan, @RepKevin... | C |
| 1224362 | VP Biden is on his last stop of the day in Ohi... | L |
| 643432 | I fundamentally refuse to let Americans pay mo... | L |
| 842083 | RT @JasonKander: When I was Secretary of State... | L |
| 1213864 | RT @cppj: Stay up-to-date on state road and hi... | C |
| 500238 | RT @NJTVNews: .@SenatorMenendez calls for $2.5... | L |
| 599936 | RT @DonDaileyAPT: Tonight @ 8 on @CapitolJourn... | C |
| 284711 | RT @HouseJudiciary: Statement from @GOPLeader,... | C |
Import Stopwords from NLTK and define text cleaning functions.
Stopwords are words that typically appear most often in a text but add very little substance to an analysis, for example "the", "an", and "a".
The Natural Language Toolkit (NLTK) is a popular Python library that provides conventional tooling for Natural Language Processing (NLP), including a list of stopwords.
Adding to the list of stopwords is done on a project-by-project basis, depending on the subject. In our case the corpus comes from Twitter, so we know a good portion of the tweets will start with "RT", which stands for "retweet". It adds nothing to the analysis, so we add it to the list of stopwords to be removed.
# import stopwords
stopwords = nltk.corpus.stopwords.words('english')
stopwords.extend(['RT'])
# breaks text up into a list of individual words
def tokenize(text):
    tokens = nltk.word_tokenize(text)
    return tokens

# removes stopwords
def remove_stopwords(words):
    filtered = filter(lambda word: word not in stopwords, words)
    return list(filtered)

# lemmatizes text based on the part-of-speech tags
def lemmatize(text, nlp=nlp):
    doc = nlp(text)
    lemmatized = [token.lemma_ for token in doc]
    return " ".join(lemmatized)

# applies the lemmatize function to a dataframe partition;
# this allows us to use Dask to run the function in parallel
def clean_text(df):
    df["clean_tweets"] = [lemmatize(x) for x in df['clean_tweets'].tolist()]
    print('done')
    return df

# gets rid of emojis and other oddly formatted strings by encoding to ASCII
# and dropping any character that cannot be represented
def remove_emoji(inputString):
    return inputString.encode('ascii', 'ignore').decode('ascii')
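As a quick sanity check, the helpers can be chained on a single made-up tweet in the same order used in the steps below (the example string is illustrative and not from the dataset; exact lemmas depend on the spaCy model version):

# illustrative example: run one fake tweet through the same cleaning steps applied below
example = "RT @someuser: Lawmakers are debating the new bills today! #Congress"
example = re.sub(r'@\S+', '', example)           # strip @mentions
example = remove_emoji(example)                  # strip emojis / non-ASCII characters
example = re.sub(r'[^\w\s]', '', example)        # strip punctuation (including #)
tokens = remove_stopwords(tokenize(example))     # tokenize and drop stopwords (including RT)
print(lemmatize(" ".join(tokens)))               # lemmatize the remaining words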
Use regular expressions and the functions defined above to perform the preprocessing.
1. Remove URLs
tweet_df['clean_tweets'] =\
tweet_df['tweet'].apply(lambda x: re.sub('http://\S+','', x))
tweet_df['clean_tweets'] =\
tweet_df['clean_tweets'].apply(lambda x: re.sub('https://\S+', '', x))
2. Remove @name mentions and Emojis
tweet_df['clean_tweets'] =\
tweet_df['clean_tweets'].apply(lambda x: re.sub('@\S+', '', x))
tweet_df['clean_tweets'] =\
tweet_df['clean_tweets'].apply(lambda x: remove_emoji(x))
3. Remove newline characters and punctuation
tweet_df['clean_tweets'] =\
tweet_df['clean_tweets'].apply(lambda x: re.sub('\n', '', x))
tweet_df['clean_tweets'] =\
tweet_df['clean_tweets'].apply(lambda x: re.sub(r'[^\w\s]', '', x))
4. Remove ampersands (the HTML-escaped &amp; and the literal &)
tweet_df['clean_tweets'] =\
tweet_df['clean_tweets'].apply(lambda x: re.sub('&amp;', '', x))
tweet_df['clean_tweets'] =\
tweet_df['clean_tweets'].apply(lambda x: re.sub('&', '', x))
5. Tokenize, remove stopwords, and join the tokens back into a string
tweet_df['clean_tweets'] = tweet_df['clean_tweets'].apply(lambda x: tokenize(x))
tweet_df['clean_tweets'] = tweet_df['clean_tweets'].apply(lambda x :\
remove_stopwords(x))
tweet_df['clean_tweets'] = tweet_df['clean_tweets'].apply(lambda x: " ".join(x))
Use Dask to parallelize the lemmatization of the words.
The goal of lemmatization is to remove inflection from each word, returning only its base form (for example, "running" and "ran" both reduce to "run").
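For instance, passing a short sentence through the lemmatize helper defined above (the sentence is made up for illustration; exact lemmas vary with the spaCy model version, and older 2.x models also emit the -PRON- placeholder seen in the table further down):

# illustrative only: lemmas shown in the comment are approximate
print(lemmatize("The senators were voting on several amended bills"))
# expected output is close to: "the senator be vote on several amend bill"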
Processing each of the 1.3 million tweets one at a time would take a long time because lemmatizing a sentence is computationally expensive. To speed this up, we use the Dask package.
Using Dask, we can break the dataframe into separate partitions and have each one processed by an independent core of the processor.
We begin by getting the number of cores in the computer's processor.
parts = os.cpu_count()
parts
12
We use Dask to break the Pandas DataFrame up into the same number of partitions as we have cores, then map the clean_text function to each partition and process them in parallel.
On my machine this reduced a roughly 60-minute operation to around 15 minutes.
import dask.dataframe as ddf
from dask.diagnostics import ProgressBar
dask_df = ddf.from_pandas(tweet_df, npartitions = parts)
result = dask_df.map_partitions(clean_text, meta = tweet_df)
with ProgressBar():
df = result.compute(scheduler='processes')
[########################################] | 100% Completed | 17min 44.1s
The result is a new dataframe that contains all of the original data plus a new column, clean_tweets, with the lemmatized text.
Lemmatizing the text makes it easier to compute accurate word counts, since inflected forms of the same word are grouped together.
df.sample(20)
| | tweet | class | clean_tweets |
|---|---|---|---|
| 1099733 | Great to chat with some of my #TX22 bosses, th... | C | great chat tx22 boss robison family town sprin... |
| 31218 | Staff participated in National Service Day pro... | L | staff participate national service day program... |
| 368405 | When I was at #ParamountHighSchool’s Senior Aw... | L | when -PRON- paramounthighschool senior awards ... |
| 304583 | Wisconsin has lagged in business start-up acti... | L | wisconsin lag business startup activity -PRON-... |
| 484127 | Enjoyed learning about the @wvuLibraries archi... | C | enjoy learn archive process even get chance ch... |
| 954665 | RT @neilwymt: The #SOARSummit is wrapping up, ... | C | the soarsummit wrapping coverage continue spec... |
| 1316896 | RT @CCSTorg: This morning @RepGaramendi met wi... | L | this morning meet alumnus share pride help cre... |
| 952198 | RT @DarrellIssa: RT @GOPoversight: Contempt re... | C | contempt resolution vote tally 255 yea 67 nay ... |
| 1185124 | Interesting and personal story about our SOS n... | C | interesting personal story sos nominee everyon... |
| 506776 | Adam Jobbers-Miller was a patriot, dedicated t... | L | adam jobbersmiller patriot dedicated community... |
| 141123 | .@realDonaldTrump needs to realize: No one is ... | L | need realize no one right call question legiti... |
| 325166 | American workers don’t need NAFTA with a new n... | L | american worker do not need nafta new name the... |
| 1224738 | I’m calling on @realDonaldTrump to sign this i... | C | -PRON- be call sign important legislation prom... |
| 1148018 | Trump Org to Congress: the Constitution degrad... | L | trump org congress constitution degrade custom... |
| 907392 | Want to join Team Moulton? Now accepting appli... | L | want join team moulton now accept application ... |
| 30959 | 5 years ago I watched Pres. Obama sign the Aff... | L | 5 year ago -PRON- watch pre obama sign afforda... |
| 968455 | Too many students today attend school in crumb... | L | too many student today attend school crumble b... |
| 274732 | RT @mike_pence: Thanks to today's vote in Cong... | C | thank todays vote congress one step close repe... |
| 1221963 | For over 230 years, the U.S. Constitution has ... | C | for 230 year us constitution promote value ind... |
| 1283486 | RT @Jim_Jordan: This isn’t impeachment. This i... | C | this be not impeachment this political campaig... |
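Because the text is now lemmatized, inflected forms of a word collapse into a single token, so simple frequency counts line up with the actual vocabulary. A minimal sketch using NLTK's FreqDist (imported earlier); the column name matches this notebook, but the counts it prints are not reproduced here:

# count the most common tokens across all cleaned, lemmatized tweets
all_words = " ".join(df['clean_tweets']).split()
freq = FreqDist(all_words)
print(freq.most_common(10))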
Send the updated Dataframe to Amazon S3.
with open('outdata/tweets_clean_df.pkl', 'wb') as f:
    pickle.dump(df, f)
s3.meta.client.upload_file('outdata/tweets_clean_df.pkl',
                           bucket_name,
                           'tweets_clean_df.pkl')
os.remove('outdata/tweets_clean_df.pkl')
Next Steps
In the next post we will perform Exploratory Data Analysis on the cleaned text.