Political Twitter Part 1


Social media has been growing in popularity as a platform in American politics since the 2008 election cycle. Twitter has become the platform of choice for most members of both the Legislative and Executive Branches. This four-part blog series demonstrates the workflow of a data science project focused on analyzing and modeling Twitter content generated by American politicians.

Part I: Collection of the Twitter Data
Part II: Cleaning and Preprocessing
Part III: Exploratory Analysis
Part IV: Text Classification Model

Data Collection

At the end of this post we will have a labeled dataset of unprocessed tweets from American politicians.

Click Here for the full code notebook.

import pandas as pd
import numpy as np
import tweepy
import tabula
import os
import pickle
import boto3
s3 = boto3.resource('s3')
bucket_name = "msds-practicum-carey"

c_key = os.environ.get('tw_c_key')
c_sec = os.environ.get('tw_c_sec')
atk = os.environ.get('tw_ac_tok')
ats = os.environ.get('tw_ac_sec')
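
Because `os.environ.get` returns `None` for a variable that is not set, a missing credential would only surface later as an authentication error. A minimal sketch of an up-front check, assuming the same four variable names used above; `missing_env_vars` is a hypothetical helper, not part of tweepy, and the example runs against a fake environment:

```python
import os

# Environment variable names used in this post for the Twitter credentials.
REQUIRED_VARS = ['tw_c_key', 'tw_c_sec', 'tw_ac_tok', 'tw_ac_sec']


def missing_env_vars(names, environ=None):
    """Return the names that are unset or empty in the given environment."""
    env = os.environ if environ is None else environ
    return [name for name in names if not env.get(name)]


# Fake environment for illustration: one value is empty, one is absent.
fake_env = {'tw_c_key': 'abc', 'tw_c_sec': 'def', 'tw_ac_tok': ''}
print(missing_env_vars(REQUIRED_VARS, fake_env))
```

Calling `missing_env_vars(REQUIRED_VARS)` with no second argument checks the real environment.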

auth = tweepy.OAuthHandler(c_key, c_sec)
auth.set_access_token(atk, ats)

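# Note: wait_on_rate_limit_notify (and tweepy.TweepError, used below) belong to
# Tweepy 3.x; both were removed in Tweepy 4.0.
api = tweepy.API(auth, wait_on_rate_limit_notify=True,
                 wait_on_rate_limit=True)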

pd.set_option('display.max_rows', 200)

# read in data
congress_url = "https://theunitedstates.io/congress-legislators/legislators-current.csv"
congress_df = pd.read_csv(congress_url,
                          usecols=['last_name',
                                   'first_name',
                                   'full_name',
                                   'party',
                                   'type',
                                   'state' ,
                                   'twitter'])

Check for missing Twitter Handles

# filter for records with missing twitter handle
congress_df[congress_df['twitter'].isna()]

	last_name	first_name	full_name	type	state	party	twitter
33	Amash	Justin	Justin Amash	rep	MI	Independent	NaN
62	Clay	Wm.	Wm. Lacy Clay	rep	MO	Democrat	NaN
182	Peterson	Collin	Collin C. Peterson	rep	MN	Democrat	NaN
314	Kaine	Timothy	Tim Kaine	sen	VA	Democrat	NaN
372	Comer	James	James Comer	rep	KY	Republican	NaN
425	Gianforte	Greg	Greg Gianforte	rep	MT	Republican	NaN
534	Bishop	Dan	Dan Bishop	rep	NC	Republican	NaN
535	Murphy	Gregory	Gregory F. Murphy	rep	NC	Republican	NaN
536	Loeffler	Kelly	Kelly Loeffler	sen	GA	Republican	NaN

A few Twitter handles are missing. Since the number is small, I will fill these in manually.

Georgia swore in a new Senator to replace retired Sen. Johnny Isakson. I’ve elected to keep Johnny Isakson’s name in the list in order to capture all of his tweets.

# fill in missing twitter handles
congress_df.at[33, 'twitter'] = 'justinamash'

congress_df.at[62, 'twitter'] = 'LacyClayMO1'

congress_df.at[182, 'twitter'] = "collinpeterson"

congress_df.at[314, 'twitter'] = "timkaine"

congress_df.at[372, 'twitter'] = "KYComer"

congress_df.at[425, 'twitter'] = "GregForMontana"

congress_df.at[534, 'twitter'] = "jdanbishop"

congress_df.at[535, 'twitter'] = "DrGregMurphy1"

congress_df.at[536, 'twitter'] = "SenatorLoeffler"

print(f"Missing Twitter Handles: {len(congress_df[congress_df['twitter'].isna()])}")
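
The row labels used above (33, 62, and so on) depend on the order of the source CSV, which can change as members join or leave Congress. A hedged alternative is to key the manual fixes on `full_name` instead. The small DataFrame below is a stand-in for `congress_df`, and `manual_handles` is a hypothetical name:

```python
import pandas as pd

# Stand-in for congress_df with two missing handles.
df = pd.DataFrame({
    'full_name': ['Sherrod Brown', 'Justin Amash', 'Tim Kaine'],
    'twitter': ['SenSherrodBrown', None, None],
})

# Handles looked up by hand, keyed on name instead of row position.
manual_handles = {
    'Justin Amash': 'justinamash',
    'Tim Kaine': 'timkaine',
}

# Fill only the missing values; existing handles are left untouched.
df['twitter'] = df['twitter'].fillna(df['full_name'].map(manual_handles))

print(f"Missing Twitter Handles: {df['twitter'].isna().sum()}")
```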

# group by political party
congress_df.groupby(by='party').count()

	last_name	first_name	full_name	type	state	twitter
party
Democrat	281	281	281	281	281	281
Independent	3	3	3	3	3	3
Republican	253	253	253	253	253	253

Independent members of Congress

Three members of this Congress are Independents, which means that they do not belong to a political party. They do, however, have conservative or liberal political leanings, so we can relabel each of them with the party with which they most closely align. This lets us label every member as either Liberal or Conservative.

# filter by Independents
congress_df[congress_df['party']=='Independent']

	last_name	first_name	full_name	type	state	party	twitter
8	Sanders	Bernard	Bernard Sanders	sen	VT	Independent	SenSanders
33	Amash	Justin	Justin Amash	rep	MI	Independent	justinamash
287	King	Angus	Angus S. King, Jr.	sen	ME	Independent	SenAngusKing

# relabel the Independents
congress_df.at[8, 'party'] = 'Democrat'
congress_df.at[33, 'party'] = 'Republican'
congress_df.at[287, 'party'] = 'Democrat'

# create new column of Liberal or Conservative labels.
congress_df['lean'] = np.where(congress_df['party']=='Democrat','L', 'C')
congress_df

	last_name	first_name	full_name	type	state	party	twitter	lean
0	Brown	Sherrod	Sherrod Brown	sen	OH	Democrat	SenSherrodBrown	L
1	Cantwell	Maria	Maria Cantwell	sen	WA	Democrat	SenatorCantwell	L
2	Cardin	Benjamin	Benjamin L. Cardin	sen	MD	Democrat	SenatorCardin	L
3	Carper	Thomas	Thomas R. Carper	sen	DE	Democrat	SenatorCarper	L
4	Casey	Robert	Robert P. Casey, Jr.	sen	PA	Democrat	SenBobCasey	L
...	...	...	...	...	...	...	...	...
532	Golden	Jared	Jared F. Golden	rep	ME	Democrat	repgolden	L
533	Keller	Fred	Fred Keller	rep	PA	Republican	RepFredKeller	C
534	Bishop	Dan	Dan Bishop	rep	NC	Republican	jdanbishop	C
535	Murphy	Gregory	Gregory F. Murphy	rep	NC	Republican	DrGregMurphy1	C
536	Loeffler	Kelly	Kelly Loeffler	sen	GA	Republican	SenatorLoeffler	C

537 rows × 8 columns
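
Note that `np.where(congress_df['party']=='Democrat', 'L', 'C')` labels every non-Democrat as `C`, including any Independent that was not relabeled first. A minimal sketch of a stricter variant, using a toy DataFrame in place of `congress_df`; `independent_alignment` is a hypothetical name, and its values mirror the relabeling done above:

```python
import pandas as pd

# Toy stand-in for congress_df.
df = pd.DataFrame({
    'full_name': ['Bernard Sanders', 'Justin Amash', 'Sherrod Brown'],
    'party': ['Independent', 'Independent', 'Democrat'],
})

# Party assigned to each Independent (same choices as the relabeling step).
independent_alignment = {
    'Bernard Sanders': 'Democrat',
    'Justin Amash': 'Republican',
}

# Use the aligned party for Independents; keep the original party otherwise.
aligned = df['full_name'].map(independent_alignment)
df['party'] = aligned.where(df['party'] == 'Independent', df['party'])

# Explicit mapping: an unexpected party value becomes NaN instead of 'C'.
df['lean'] = df['party'].map({'Democrat': 'L', 'Republican': 'C'})
print(df)
```

With this mapping, a quick `df['lean'].isna().sum()` reveals any member whose party was not handled.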

Use the Twitter API to collect Tweets

Twitter provides an excellent API for programmatically accessing tweets from anybody with a Twitter account. Since we already have the Twitter handle for each member of Congress, we can use the API to pull their tweets.

One drawback of pulling tweets from an individual user is that Twitter limits us to the most recent 3,200 tweets per user. The function below requests them 200 at a time, paging backwards with `max_id` until no more tweets are returned.

def get_tweets(handle):
    """Return the text of the most recent tweets available for a handle."""
    print(f'...Getting Tweets for {handle}')
    # blank list to accumulate tweet objects
    tweets = []
    try:
        # get the most recent 200 tweets
        new_tweets = api.user_timeline(screen_name=handle,
                                       count=200)

        while len(new_tweets) > 0:
            # add the current batch to the end of the tweets list
            tweets.extend(new_tweets)

            # one less than the id of the oldest tweet collected so far
            old = tweets[-1].id - 1

            # get the next 200 tweets that are older than that id
            new_tweets = api.user_timeline(screen_name=handle,
                                           count=200,
                                           max_id=old)

    except tweepy.TweepError:
        # keep whatever was collected before the error
        print(f'error with {handle} in function')

    # return only the text of each tweet (empty list if nothing was collected)
    return [tweet.text for tweet in tweets]

liberal_handle_list = congress_df[congress_df['lean']=='L'].twitter
conservative_handle_list = congress_df[congress_df['lean']=='C'].twitter
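
The `max_id` paging used in `get_tweets` can be tried offline without API credentials. The sketch below is an illustration under stated assumptions, not the Twitter API itself: `collect_all` and `fake_timeline_fetch` are hypothetical names, and the fake timeline simply returns items newest first, the way a user timeline does.

```python
def collect_all(fetch_page, page_size=200):
    """Page backwards through a timeline using the max_id pattern.

    fetch_page(count, max_id) is a hypothetical callable that returns up to
    `count` items with id <= max_id (or the newest items when max_id is None),
    ordered newest first.
    """
    items = []
    max_id = None
    while True:
        page = fetch_page(count=page_size, max_id=max_id)
        if not page:
            break
        items.extend(page)
        # Next request: only items strictly older than the oldest one seen.
        max_id = page[-1]['id'] - 1
    return items


# Fake timeline of 450 "tweets" with ids 450 down to 1, newest first.
timeline = [{'id': i, 'text': f'tweet {i}'} for i in range(450, 0, -1)]


def fake_timeline_fetch(count, max_id=None):
    """Stand-in for an API call: filter by max_id, then truncate to count."""
    eligible = [t for t in timeline if max_id is None or t['id'] <= max_id]
    return eligible[:count]


collected = collect_all(fake_timeline_fetch)
print(len(collected))
```

Subtracting one from the oldest id is what prevents the last tweet of each page from being fetched twice.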

Get Tweets from Democratic members of Congress

The output of this is a list of 728,175 tweets from Democratic Senators and Representatives.

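# make sure the output folder exists before writing to it
os.makedirs('outdata', exist_ok=True)

lib_tweets = []
for name in liberal_handle_list:
    try:
        tweets_temp = get_tweets(name)
        lib_tweets.extend(tweets_temp)
        # rewrite the pickle after each handle so progress is saved as we go
        with open('outdata/lib_list.pkl', 'wb') as f:
            pickle.dump(lib_tweets, f)
    except Exception:
        # Exception (not a bare except) so that Ctrl-C can still stop the run
        print(f"problem with {name} in loop")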
Get Tweets from Republican Members of Congress

The output of this is a list of 589,235 tweets from Republican Senators and Representatives.

conservative_tweets = []
for name in conservative_handle_list:
    try:
        tweets_temp = get_tweets(name)
        conservative_tweets.extend(tweets_temp)
        # rewrite the pickle after each handle so progress is saved as we go
        with open('outdata/con_list.pkl', 'wb') as f:
            pickle.dump(conservative_tweets, f)
        print(f'con_tweets len: {len(conservative_tweets)}')
    except Exception:
        # Exception (not a bare except) so that Ctrl-C can still stop the run
        print(f"problem with {name} in loop")

Manually fix the problems that arose for a few Conservative Handles

Some of the Twitter handles did not work with the API. It turns out that the source list had incorrect handles for a few of these members. We can look up the correct handles manually and collect the data.

# Rep. Roger Marshall's Twitter handle needed to be updated
marsh_tweets = get_tweets('RogerMarshallMD')
conservative_tweets.extend(marsh_tweets)
with open('outdata/con_list.pkl', 'wb') as f:
    pickle.dump(conservative_tweets, f)

...Getting Tweets for RogerMarshallMD

# Rep. Lance Gooden's Twitter handle needed to be updated.
gooden_tweets = get_tweets('Lancegooden')
conservative_tweets.extend(gooden_tweets)
with open('outdata/con_list.pkl', 'wb') as f:
    pickle.dump(conservative_tweets, f)

...Getting Tweets for Lancegooden
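
The failing handles were found by reading the printed messages. A hedged alternative is to record them in a list as the loop runs, so they can be reviewed and corrected afterwards. In this sketch `collect_for_handles` and `fake_get_tweets` are hypothetical names; the fake function stands in for `get_tweets` so the example runs without API access:

```python
def collect_for_handles(handles, fetch):
    """Collect tweets for each handle and record the handles that fail."""
    collected, failed = [], []
    for handle in handles:
        try:
            collected.extend(fetch(handle))
        except Exception as err:
            # Remember which handle failed and why, then keep going.
            failed.append((handle, str(err)))
    return collected, failed


def fake_get_tweets(handle):
    """Stand-in for get_tweets that fails for one made-up handle."""
    if handle == 'bad_handle':
        raise ValueError('user not found')
    return [f'{handle} tweet 1', f'{handle} tweet 2']


tweets, failed = collect_for_handles(['a', 'bad_handle', 'b'], fake_get_tweets)
print(failed)
```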

Add Current Executive Branch

The current Executive Branch tweets a lot, so it should be included in this analysis.

# Add President Trump to the Conservative Tweet List
trump_tweets = get_tweets('realDonaldTrump')
conservative_tweets.extend(trump_tweets)
with open('outdata/con_list.pkl', 'wb') as f:
    pickle.dump(conservative_tweets, f)
print(f'{len(conservative_tweets)}')

...Getting Tweets for realDonaldTrump
594399

# Define a function for adding to the conservative list
def add_tweets_to_con_list(handle):
    temp_tweets = get_tweets(handle)
    conservative_tweets.extend(temp_tweets)
    with open('outdata/con_list.pkl', 'wb') as f:
        pickle.dump(conservative_tweets, f)
    print(f'Conservative Tweets: {len(conservative_tweets)}')

# add VP Pence
add_tweets_to_con_list('VP')

...Getting Tweets for VP
597644

Add the Previous Executive Branch and Current Democratic Presidential Candidates (Those Not in Congress)

The previous Executive Branch pioneered the use of social media in politics. This is also an election year, and many Democrats are vying for the Democratic nomination. A few of them are current members of Congress, so we already have their tweets. It makes sense to add those who are not members of Congress to this dataset as well.

def add_tweets_to_lib_list(handle):
    temp_tweets = get_tweets(handle)
    lib_tweets.extend(temp_tweets)
    with open('outdata/lib_list.pkl', 'wb') as f:
        pickle.dump(lib_tweets, f)
    print(f'Liberal Tweets: {len(lib_tweets)}')

add_tweets_to_lib_list('BarackObama')
add_tweets_to_lib_list('JoeBiden')
add_tweets_to_lib_list('MikeBloomberg')
add_tweets_to_lib_list('DevalPatrick')
add_tweets_to_lib_list('TomSteyer')
add_tweets_to_lib_list('PeteButtigieg')
add_tweets_to_lib_list('JohnDelaney')
add_tweets_to_lib_list('AndrewYang')

...Getting Tweets for BarackObama
Liberal Tweets: 731403
...Getting Tweets for JoeBiden
Liberal Tweets: 734630
...Getting Tweets for MikeBloomberg
Liberal Tweets: 737863
...Getting Tweets for DevalPatrick
Liberal Tweets: 739782
...Getting Tweets for TomSteyer
Liberal Tweets: 742994
...Getting Tweets for PeteButtigieg
Liberal Tweets: 746216
...Getting Tweets for JohnDelaney
Liberal Tweets: 749432
...Getting Tweets for AndrewYang
Liberal Tweets: 752662

s3.meta.client.upload_file('outdata/lib_list.pkl', bucket_name, 'lib_list.pkl')
s3.meta.client.upload_file('outdata/con_list.pkl', bucket_name, 'con_list.pkl')

Create a Combined Dataframe of tweets with labels

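# read in lists of tweets
# (assumption: loaded back from the pickles written during collection)
with open('outdata/lib_list.pkl', 'rb') as f:
    lib_list = pickle.load(f)
with open('outdata/con_list.pkl', 'rb') as f:
    con_list = pickle.load(f)

# create liberal and conservative dataframes
lib_df = pd.DataFrame(columns=['tweet', 'class'])
con_df = pd.DataFrame(columns=['tweet', 'class'])

# fill liberal dataframe
lib_df['tweet'] = lib_list
lib_df['class'] = "L"

# fill conservative dataframe
con_df['tweet'] = con_list
con_df['class'] = "C"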



#combine the liberal and conservative dataframes
tweet_df = pd.concat([lib_df, con_df])

# Randomly shuffle the dataframe
tweet_df = tweet_df.sample(frac=1)

# reset the index of the complete dataframe
tweet_df.reset_index(drop=True, inplace = True)

# view dataframe
tweet_df

	tweet	class
0	RT @aafb: Congrats to ⁦@RepOHalleran⁩ & ⁦@...	L
1	Great to meet the new Lake County Farm Bureau ...	L
2	Congratulations to @waynestcollege women's rug...	C
3	Great to meet with the Erickson Air Crane team...	C
4	Always wonderful to be part of the Back to Sch...	L
...	...	...
1350301	We should be upholding the National Environmen...	L
1350302	If anything is to be investigated, I think we ...	C
1350303	TODAY: Federal judge rules in favor of House R...	C
1350304	In the words of an old proverb, "A hit dog wil...	L
1350305	The new EPA regs are pure fantasy. http://t.co...	C

1350306 rows × 2 columns
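
Before modeling, it can help to check how the two labels are balanced and to make the shuffle reproducible. A minimal sketch with toy lists in place of the real tweet lists; the `random_state` value is an arbitrary choice for illustration and is not used in the original notebook:

```python
import pandas as pd

# Toy stand-ins for the real lists of tweets.
lib_list = ['tweet a', 'tweet b', 'tweet c']
con_list = ['tweet d', 'tweet e']

# Build each labeled frame directly, then stack them.
tweet_df = pd.concat(
    [pd.DataFrame({'tweet': lib_list, 'class': 'L'}),
     pd.DataFrame({'tweet': con_list, 'class': 'C'})],
    ignore_index=True,
)

# A fixed random_state makes the shuffle repeatable between runs.
tweet_df = tweet_df.sample(frac=1, random_state=42).reset_index(drop=True)

# Share of each label in the combined dataset.
balance = tweet_df['class'].value_counts(normalize=True)
print(balance)
```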

# save the dataframe to a local pickle
with open('outdata/tweet_df.pkl', 'wb') as f:
    pickle.dump(tweet_df, f)

# upload the pickled dataframe to AWS S3
s3.meta.client.upload_file('outdata/tweet_df.pkl',
                           bucket_name,
                           'tweet_df.pkl')

os.remove('outdata/con_list.pkl')
os.remove('outdata/lib_list.pkl')
os.remove('outdata/tweet_df.pkl')
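
Because the local pickles are removed once the uploads finish, it may be worth confirming that a file reads back correctly before deleting anything. A minimal sketch using a temporary directory and toy data in place of the real lists; it does not touch S3:

```python
import os
import pickle
import tempfile

# Toy data standing in for a list of tweets.
data = ['tweet one', 'tweet two']

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'lib_list.pkl')

    # Write the pickle the same way the post does.
    with open(path, 'wb') as f:
        pickle.dump(data, f)

    # Read it back and compare before any upload or cleanup step.
    with open(path, 'rb') as f:
        restored = pickle.load(f)

    round_trip_ok = restored == data

print(round_trip_ok)
```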


Sean Carey