Political Twitter Part 2

6 minute read

Data Cleaning


This post takes the data created in Part 1 and performs standard text preprocessing on it.

Click Here for full code notebook

Begin by removing items from the text that add no value to the analysis:

  • URLs
  • The # from hashtags
  • The @ from user mentions
  • Emojis
  • Punctuation

Next, we perform some common NLP preprocessing tasks (a rough before-and-after example follows the list):

  • Tokenization
  • Removal of Stopwords
  • Lemmatization
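
Taken together, these steps turn a raw tweet into a compact string of base words. A rough before-and-after on a made-up tweet (the exact lemmas depend on the spaCy model loaded below):

raw:     RT @SenExample: Proud to support these new #infrastructure bills! https://t.co/xyz
cleaned: proud support new infrastructure bill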

import os
import pickle
import re
import string
import warnings

import pandas as pd
import numpy as np
import boto3
import nltk
from nltk import FreqDist
import spacy

# S3 connection details
s3 = boto3.resource('s3')
bucket_name = "msds-practicum-carey"

# load the small English model; NER and the dependency parser are not
# needed for lemmatization, so disable them for speed
nlp = spacy.load('en_core_web_sm', disable=['ner', 'parser'])

warnings.filterwarnings('ignore')
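
If the NLTK data is not already on the machine, the tokenizer and stopword list used below need a one-time download:

nltk.download('punkt')
nltk.download('stopwords')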

Download the Data from AWS S3.

To keep the size of the code repository on GitLab small, I have stored all of the data in an S3 Object Store.

tweet_df.pkl is a serialized Pandas DataFrame created in Part 1. Download it from the bucket, load it, and then remove the local copy.
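
The download below writes into a local outdata/ directory; if it does not exist yet, create it first:

os.makedirs('outdata', exist_ok=True)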

with open('outdata/tweet_df.pkl', 'wb') as data:
    s3.Bucket(bucket_name).download_fileobj('tweet_df.pkl', data)

tweet_df = pd.read_pickle('outdata/tweet_df.pkl')

os.remove('outdata/tweet_df.pkl')
tweet_df.sample(10)
                                                     tweet class
1056834  Homeland Security Committee hearing on #TSA &a...     L
613094   The last major US nail manufacturer--and their...     L
405118   In case you missed it, @SpeakerRyan, @RepKevin...     C
1224362  VP Biden is on his last stop of the day in Ohi...     L
643432   I fundamentally refuse to let Americans pay mo...     L
842083   RT @JasonKander: When I was Secretary of State...     L
1213864  RT @cppj: Stay up-to-date on state road and hi...     C
500238   RT @NJTVNews: .@SenatorMenendez calls for $2.5...     L
599936   RT @DonDaileyAPT: Tonight @ 8 on @CapitolJourn...     C
284711   RT @HouseJudiciary: Statement from @GOPLeader,...     C

Import Stopwords from NLTK and define text cleaning functions.

Stopwords are words that typically appear most often in a text but add very little substance to an analysis; examples are "the", "an", "a", and so on.

The Natural Language Toolkit (NLTK) is a popular Python library that provides conventional tooling for Natural Language Processing (NLP), including a list of stopwords.

Additions to the stopword list are made on a project-by-project basis, depending on the subject. In our case the corpus came from Twitter, so we know a good portion of the tweets will start with "RT", which stands for "retweet". It adds nothing to the analysis, so we add it to the list of stopwords to be removed.

# import stopwords and add Twitter-specific ones
stopwords = nltk.corpus.stopwords.words('english')
stopwords.extend(['RT'])

# breaks text up into a list of individual words
def tokenize(text):
    return nltk.word_tokenize(text)

# removes stopwords from a list of tokens
def remove_stopwords(words):
    return [word for word in words if word not in stopwords]

# lemmatizes text based on the part-of-speech tags assigned by spaCy
def lemmatize(text, nlp=nlp):
    doc = nlp(text)
    return " ".join(token.lemma_ for token in doc)

# applies the lemmatize function to a dataframe partition;
# written this way so Dask can run it on partitions in parallel
def clean_text(df):
    df["clean_tweets"] = [lemmatize(x) for x in df['clean_tweets'].tolist()]
    print('done')
    return df

# gets rid of emojis and other oddly formatted (non-ASCII) strings
def remove_emoji(input_string):
    return input_string.encode('ascii', 'ignore').decode('ascii')
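
A quick sanity check of two of these helpers on a made-up string (illustrative only; the exact lemmas depend on the spaCy model version):

s = "The senators were debating several new bills \U0001F1FA\U0001F1F8"

s = remove_emoji(s)   # the non-ASCII flag emoji is dropped
print(lemmatize(s))   # something like: "the senator be debate several new bill"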

Use regular expressions and the functions defined above to perform the preprocessing.

1. Remove URLs

tweet_df['clean_tweets'] =\
tweet_df['tweet'].apply(lambda x: re.sub(r'https?://\S+', '', x))

2. Remove @name mentions and Emojis

tweet_df['clean_tweets'] =\
tweet_df['clean_tweets'].apply(lambda x: re.sub(r'@\S+', '', x))

tweet_df['clean_tweets'] =\
tweet_df['clean_tweets'].apply(remove_emoji)

3. Remove newline characters

tweet_df['clean_tweets'] =\
tweet_df['clean_tweets'].apply(lambda x: re.sub(r'\n', '', x))

4. Remove the HTML ampersand entity (&amp;) and remaining punctuation

# strip &amp; before the general punctuation pass; otherwise only the
# & and ; are removed and a stray "amp" is left in the text
tweet_df['clean_tweets'] =\
tweet_df['clean_tweets'].apply(lambda x: re.sub(r'&amp;?', '', x))

tweet_df['clean_tweets'] =\
tweet_df['clean_tweets'].apply(lambda x: re.sub(r'[^\w\s]', '', x))

5. Tokenize, remove stopwords, and join back into a string

tweet_df['clean_tweets'] = tweet_df['clean_tweets'].apply(tokenize)

tweet_df['clean_tweets'] = tweet_df['clean_tweets'].apply(remove_stopwords)

tweet_df['clean_tweets'] = tweet_df['clean_tweets'].apply(lambda x: " ".join(x))

Use Dask to parallelize the lemmatization of the words.

The goal of lemmatization is to remove the inflection from words, returning only the base form; for example, "votes", "voted", and "voting" all reduce to "vote".

Processing each of the 1.3 million tweets one at a time will take a long time because lemmatizing a sentence is computationally expensive. To speed up this process, we will use the “Dask” package.

Using Dask, we can break the dataframe up into separate partitions and have each of them processed by an independent core of the processor.

We begin by getting the number of cores in the computer's processor.


parts = os.cpu_count()
parts
12

We use Dask to break the Pandas DataFrame up into the same number of partitions as we have cores, then map the clean_text function to each partition and process them in parallel.

On my machine a 60 minute operation was reduced to around 15 minutes.

import dask.dataframe as ddf
from dask.diagnostics import ProgressBar

dask_df = ddf.from_pandas(tweet_df, npartitions = parts)
result = dask_df.map_partitions(clean_text, meta = tweet_df)
with ProgressBar():
    df = result.compute(scheduler='processes')
[########################################] | 100% Completed | 17min 44.1s

The result is a new dataframe that contains all of the original data plus a new column that contains the lemmatized text.

Lemmatizing the text will make it easier to get accurate word counts in the next stage of the analysis.

df.sample(20)

                                                     tweet class                                       clean_tweets
1099733  Great to chat with some of my #TX22 bosses, th...     C  great chat tx22 boss robison family town sprin...
31218    Staff participated in National Service Day pro...     L  staff participate national service day program...
368405   When I was at #ParamountHighSchool’s Senior Aw...     L  when -PRON- paramounthighschool senior awards ...
304583   Wisconsin has lagged in business start-up acti...     L  wisconsin lag business startup activity -PRON-...
484127   Enjoyed learning about the @wvuLibraries archi...     C  enjoy learn archive process even get chance ch...
954665   RT @neilwymt: The #SOARSummit is wrapping up, ...     C  the soarsummit wrapping coverage continue spec...
1316896  RT @CCSTorg: This morning @RepGaramendi met wi...     L  this morning meet alumnus share pride help cre...
952198   RT @DarrellIssa: RT @GOPoversight: Contempt re...     C  contempt resolution vote tally 255 yea 67 nay ...
1185124  Interesting and personal story about our SOS n...     C  interesting personal story sos nominee everyon...
506776   Adam Jobbers-Miller was a patriot, dedicated t...     L  adam jobbersmiller patriot dedicated community...
141123   .@realDonaldTrump needs to realize: No one is ...     L  need realize no one right call question legiti...
325166   American workers don’t need NAFTA with a new n...     L  american worker do not need nafta new name the...
1224738  I’m calling on @realDonaldTrump to sign this i...     C  -PRON- be call sign important legislation prom...
1148018  Trump Org to Congress: the Constitution degrad...     L  trump org congress constitution degrade custom...
907392   Want to join Team Moulton? Now accepting appli...     L  want join team moulton now accept application ...
30959    5 years ago I watched Pres. Obama sign the Aff...     L  5 year ago -PRON- watch pre obama sign afforda...
968455   Too many students today attend school in crumb...     L  too many student today attend school crumble b...
274732   RT @mike_pence: Thanks to today's vote in Cong...     C  thank todays vote congress one step close repe...
1221963  For over 230 years, the U.S. Constitution has ...     C  for 230 year us constitution promote value ind...
1283486  RT @Jim_Jordan: This isn’t impeachment. This i...     C  this be not impeachment this political campaig...
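
The cleaned column also makes the word counts mentioned above straightforward; a minimal sketch using NLTK's FreqDist (imported earlier):

# count the most common tokens across all cleaned tweets
freq = FreqDist(token for tweet in df['clean_tweets'] for token in tweet.split())
freq.most_common(10)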

Finally, pickle the cleaned DataFrame and send it back to Amazon S3.

with open('outdata/tweets_clean_df.pkl', 'wb') as f:
    pickle.dump(df, f)

s3.meta.client.upload_file('outdata/tweets_clean_df.pkl',
                           bucket_name,
                           'tweets_clean_df.pkl')

os.remove('outdata/tweets_clean_df.pkl')

Next Steps

In the next post we will perform Exploratory Data Analysis on the cleaned text.