Political Twitter Part 2

6 minute read

Data Cleaning


This post takes the data created in Part 1 and performs standard text preprocessing on it.

Click Here for full code notebook

Begin by removing items from the text that add no value to the analysis:

  • URLs
  • The # from hashtags
  • The @ from user mentions
  • Emojis
  • Punctuation

Next, we perform some common NLP preprocessing tasks (a rough before-and-after example follows the list):

  • Tokenization
  • Removal of Stopwords
  • Lemmatization
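
Taken together, these steps turn a raw tweet into a compact string of base words. A rough before-and-after on a made-up tweet (the exact lemmas depend on the spaCy model loaded below):

raw:     RT @SenExample: Proud to support these new #infrastructure bills! https://t.co/xyz
cleaned: proud support new infrastructure bill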

import os
import pickle
import re
import string
import warnings

import pandas as pd
import numpy as np
import boto3
import nltk
from nltk import FreqDist
import spacy

# S3 connection details
s3 = boto3.resource('s3')
bucket_name = "msds-practicum-carey"

# load the small English model; NER and the dependency parser are not
# needed for lemmatization, so disable them for speed
nlp = spacy.load('en_core_web_sm', disable=['ner', 'parser'])

warnings.filterwarnings('ignore')
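
If the NLTK data is not already on the machine, the tokenizer and stopword list used below need a one-time download:

nltk.download('punkt')
nltk.download('stopwords')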

Download the Data from AWS S3.

To keep the size of the code repository on GitLab small, I have stored all of the data in an S3 Object Store.

tweet_df.pkl is a serialized Pandas DataFrame created in Part 1. Download it from the bucket, load it, and then remove the local copy.
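
The download below writes into a local outdata/ directory; if it does not exist yet, create it first:

os.makedirs('outdata', exist_ok=True)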

with open('outdata/tweet_df.pkl', 'wb') as data:
    s3.Bucket(bucket_name).download_fileobj('tweet_df.pkl', data)

tweet_df = pd.read_pickle('outdata/tweet_df.pkl')

os.remove('outdata/tweet_df.pkl')
tweet_df.sample(10)
                                                     tweet class
1056834  Homeland Security Committee hearing on #TSA &a...     L
613094   The last major US nail manufacturer--and their...     L
405118   In case you missed it, @SpeakerRyan, @RepKevin...     C
1224362  VP Biden is on his last stop of the day in Ohi...     L
643432   I fundamentally refuse to let Americans pay mo...     L
842083   RT @JasonKander: When I was Secretary of State...     L
1213864  RT @cppj: Stay up-to-date on state road and hi...     C
500238   RT @NJTVNews: .@SenatorMenendez calls for $2.5...     L
599936   RT @DonDaileyAPT: Tonight @ 8 on @CapitolJourn...     C
284711   RT @HouseJudiciary: Statement from @GOPLeader,...     C

Import Stopwords from NLTK and define text cleaning functions.

Stopwords are words that typically appear most often in a text but add very little substance to an analysis; examples are "the", "an", "a", and so on.

The Natural Language Toolkit (NLTK) is a popular Python library that provides conventional tooling for Natural Language Processing (NLP), including a list of stopwords.

Additions to the stopword list are made on a project-by-project basis, depending on the subject. In our case the corpus came from Twitter, so we know a good portion of the tweets will start with "RT", which stands for "retweet". It adds nothing to the analysis, so we add it to the list of stopwords to be removed.

# import stopwords and add Twitter-specific ones
stopwords = nltk.corpus.stopwords.words('english')
stopwords.extend(['RT'])

# breaks text up into a list of individual words
def tokenize(text):
    return nltk.word_tokenize(text)

# removes stopwords from a list of tokens
def remove_stopwords(words):
    return [word for word in words if word not in stopwords]

# lemmatizes text based on the part-of-speech tags assigned by spaCy
def lemmatize(text, nlp=nlp):
    doc = nlp(text)
    return " ".join(token.lemma_ for token in doc)

# applies the lemmatize function to a dataframe partition;
# written this way so Dask can run it on partitions in parallel
def clean_text(df):
    df["clean_tweets"] = [lemmatize(x) for x in df['clean_tweets'].tolist()]
    print('done')
    return df

# gets rid of emojis and other oddly formatted (non-ASCII) strings
def remove_emoji(input_string):
    return input_string.encode('ascii', 'ignore').decode('ascii')
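
A quick sanity check of two of these helpers on a made-up string (illustrative only; the exact lemmas depend on the spaCy model version):

s = "The senators were debating several new bills \U0001F1FA\U0001F1F8"

s = remove_emoji(s)   # the non-ASCII flag emoji is dropped
print(lemmatize(s))   # something like: "the senator be debate several new bill"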

Use regular expressions and the functions defined above to perform the preprocessing.

1. Remove URLs

tweet_df['clean_tweets'] =\
tweet_df['tweet'].apply(lambda x: re.sub(r'https?://\S+', '', x))

2. Remove @name mentions and Emojis

tweet_df['clean_tweets'] =\
tweet_df['clean_tweets'].apply(lambda x: re.sub(r'@\S+', '', x))

tweet_df['clean_tweets'] =\
tweet_df['clean_tweets'].apply(remove_emoji)

3. Remove newline characters

tweet_df['clean_tweets'] =\
tweet_df['clean_tweets'].apply(lambda x: re.sub(r'\n', '', x))

4. Remove the HTML ampersand entity (&amp;) and remaining punctuation

# strip &amp; before the general punctuation pass; otherwise only the
# & and ; are removed and a stray "amp" is left in the text
tweet_df['clean_tweets'] =\
tweet_df['clean_tweets'].apply(lambda x: re.sub(r'&amp;?', '', x))

tweet_df['clean_tweets'] =\
tweet_df['clean_tweets'].apply(lambda x: re.sub(r'[^\w\s]', '', x))

5. Tokenize, remove stopwords, and join back into a string

tweet_df['clean_tweets'] = tweet_df['clean_tweets'].apply(tokenize)

tweet_df['clean_tweets'] = tweet_df['clean_tweets'].apply(remove_stopwords)

tweet_df['clean_tweets'] = tweet_df['clean_tweets'].apply(lambda x: " ".join(x))

Use Dask to parallelize the lemmatization of the words.

The goal of lemmatization is to remove the inflection from words, returning only the base form; for example, "votes", "voted", and "voting" all reduce to "vote".

Processing each of the 1.3 million tweets one at a time will take a long time because lemmatizing a sentence is computationally expensive. To speed up this process, we will use the “Dask” package.

Using Dask, we can break the dataframe up into separate partitions and have each of them processed by an independent core of the processor.

We begin by getting the number of cores in the computer's processor.


parts = os.cpu_count()
parts
12

We use Dask to break the Pandas DataFrame up into the same number of partitions as we have cores, then map the clean_text function to each partition and process them in parallel.

On my machine a 60 minute operation was reduced to around 15 minutes.

import dask.dataframe as ddf
from dask.diagnostics import ProgressBar

dask_df = ddf.from_pandas(tweet_df, npartitions = parts)
result = dask_df.map_partitions(clean_text, meta = tweet_df)
with ProgressBar():
    df = result.compute(scheduler='processes')
[########################################] | 100% Completed | 17min 44.1s

The result is a new dataframe that contains all of the original data plus a new column that contains the lemmatized text.

Lemmatizing the text will make it easier to get accurate word counts in the next stage of the analysis.

df.sample(20)

                                                     tweet class                                       clean_tweets
1099733  Great to chat with some of my #TX22 bosses, th...     C  great chat tx22 boss robison family town sprin...
31218    Staff participated in National Service Day pro...     L  staff participate national service day program...
368405   When I was at #ParamountHighSchool’s Senior Aw...     L  when -PRON- paramounthighschool senior awards ...
304583   Wisconsin has lagged in business start-up acti...     L  wisconsin lag business startup activity -PRON-...
484127   Enjoyed learning about the @wvuLibraries archi...     C  enjoy learn archive process even get chance ch...
954665   RT @neilwymt: The #SOARSummit is wrapping up, ...     C  the soarsummit wrapping coverage continue spec...
1316896  RT @CCSTorg: This morning @RepGaramendi met wi...     L  this morning meet alumnus share pride help cre...
952198   RT @DarrellIssa: RT @GOPoversight: Contempt re...     C  contempt resolution vote tally 255 yea 67 nay ...
1185124  Interesting and personal story about our SOS n...     C  interesting personal story sos nominee everyon...
506776   Adam Jobbers-Miller was a patriot, dedicated t...     L  adam jobbersmiller patriot dedicated community...
141123   .@realDonaldTrump needs to realize: No one is ...     L  need realize no one right call question legiti...
325166   American workers don’t need NAFTA with a new n...     L  american worker do not need nafta new name the...
1224738  I’m calling on @realDonaldTrump to sign this i...     C  -PRON- be call sign important legislation prom...
1148018  Trump Org to Congress: the Constitution degrad...     L  trump org congress constitution degrade custom...
907392   Want to join Team Moulton? Now accepting appli...     L  want join team moulton now accept application ...
30959    5 years ago I watched Pres. Obama sign the Aff...     L  5 year ago -PRON- watch pre obama sign afforda...
968455   Too many students today attend school in crumb...     L  too many student today attend school crumble b...
274732   RT @mike_pence: Thanks to today's vote in Cong...     C  thank todays vote congress one step close repe...
1221963  For over 230 years, the U.S. Constitution has ...     C  for 230 year us constitution promote value ind...
1283486  RT @Jim_Jordan: This isn’t impeachment. This i...     C  this be not impeachment this political campaig...
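
The cleaned column also makes the word counts mentioned above straightforward; a minimal sketch using NLTK's FreqDist (imported earlier):

# count the most common tokens across all cleaned tweets
freq = FreqDist(token for tweet in df['clean_tweets'] for token in tweet.split())
freq.most_common(10)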

Finally, pickle the cleaned DataFrame and send it back to Amazon S3.

with open('outdata/tweets_clean_df.pkl', 'wb') as f:
    pickle.dump(df, f)

s3.meta.client.upload_file('outdata/tweets_clean_df.pkl',
                           bucket_name,
                           'tweets_clean_df.pkl')

os.remove('outdata/tweets_clean_df.pkl')

Next Steps

In the next post we will perform Exploratory Data Analysis on the cleaned text.