Political Twitter Part 1

7 minute read

Social media has been growing as a platform in American politics since the 2008 election cycle, and Twitter has become the platform of choice for most members of both the Legislative and Executive Branches. This four-part blog series walks through the workflow of a data science project focused on analyzing and modeling Twitter content generated by American politicians.

  • Part I: Collection of the Twitter Data
  • Part II: Cleaning and Preprocessing
  • Part III: Exploratory Analysis
  • Part IV: Text Classification Model

Data Collection


By the end of this post we will have a labeled dataset of unprocessed tweets from American politicians.

Click Here for the full code notebook.


import pandas as pd
import numpy as np
import tweepy
import tabula
import os
import pickle
import boto3

# S3 bucket used to store the collected tweet data
s3 = boto3.resource('s3')
bucket_name = "msds-practicum-carey"

# Twitter API credentials pulled from environment variables
c_key = os.environ.get('tw_c_key')
c_sec = os.environ.get('tw_c_sec')
atk = os.environ.get('tw_ac_tok')
ats = os.environ.get('tw_ac_sec')

# authenticate and let tweepy pause automatically when rate limited
auth = tweepy.OAuthHandler(c_key, c_sec)
auth.set_access_token(atk, ats)

api = tweepy.API(auth, wait_on_rate_limit=True,
                 wait_on_rate_limit_notify=True)

pd.set_option('display.max_rows', 200)

# read in data
congress_url = "https://theunitedstates.io/congress-legislators/legislators-current.csv"
congress_df = pd.read_csv(congress_url,
                          usecols=['last_name',
                                   'first_name',
                                   'full_name',
                                   'party',
                                   'type',
                                   'state' ,
                                   'twitter'])

Check for missing Twitter Handles

# filter for records with missing twitter handle
congress_df[congress_df['twitter'].isna()]

last_name first_name full_name type state party twitter
33 Amash Justin Justin Amash rep MI Independent NaN
62 Clay Wm. Wm. Lacy Clay rep MO Democrat NaN
182 Peterson Collin Collin C. Peterson rep MN Democrat NaN
314 Kaine Timothy Tim Kaine sen VA Democrat NaN
372 Comer James James Comer rep KY Republican NaN
425 Gianforte Greg Greg Gianforte rep MT Republican NaN
534 Bishop Dan Dan Bishop rep NC Republican NaN
535 Murphy Gregory Gregory F. Murphy rep NC Republican NaN
536 Loeffler Kelly Kelly Loeffler sen GA Republican NaN

A few Twitter handles are missing. Since it is only a small number, I will fill them in manually.

Georgia swore in a new Senator to replace retired Sen. Johnny Isakson. I’ve elected to keep Johnny Isakson’s name in the list in order to capture all of his tweets.

# fill in missing twitter handles
congress_df.at[33, 'twitter'] = 'justinamash'
congress_df.at[62, 'twitter'] = 'LacyClayMO1'
congress_df.at[182, 'twitter'] = "collinpeterson"
congress_df.at[314, 'twitter'] = "timkaine"
congress_df.at[372, 'twitter'] = "KYComer"
congress_df.at[425, 'twitter'] = "GregForMontana"
congress_df.at[534, 'twitter'] = "jdanbishop"
congress_df.at[535, 'twitter'] = "DrGregMurphy1"
congress_df.at[536, 'twitter'] = "SenatorLoeffler"

print(f"Missing Twitter Handles: {len(congress_df[congress_df['twitter'].isna()])}")


# group by political party
congress_df.groupby(by='party').count()
last_name first_name full_name type state twitter
party
Democrat 281 281 281 281 281 281
Independent 3 3 3 3 3 3
Republican 253 253 253 253 253 253

Independent members of Congress

Three members of this Congress are Independents, meaning they do not belong to either major political party. They do, however, have conservative or liberal political leanings, so we can relabel each of them with the party with which they most closely align. This lets us assign every member a Liberal or Conservative label.

# filter by Independents
congress_df[congress_df['party']=='Independent']
last_name first_name full_name type state party twitter
8 Sanders Bernard Bernard Sanders sen VT Independent SenSanders
33 Amash Justin Justin Amash rep MI Independent justinamash
287 King Angus Angus S. King, Jr. sen ME Independent SenAngusKing
# relabel the Independents
congress_df.at[8, 'party'] = 'Democrat'
congress_df.at[33, 'party'] = 'Republican'
congress_df.at[287, 'party'] = 'Democrat'

# create new column of Liberal or Conservative labels.
congress_df['lean'] = np.where(congress_df['party']=='Democrat','L', 'C')
congress_df
last_name first_name full_name type state party twitter lean
0 Brown Sherrod Sherrod Brown sen OH Democrat SenSherrodBrown L
1 Cantwell Maria Maria Cantwell sen WA Democrat SenatorCantwell L
2 Cardin Benjamin Benjamin L. Cardin sen MD Democrat SenatorCardin L
3 Carper Thomas Thomas R. Carper sen DE Democrat SenatorCarper L
4 Casey Robert Robert P. Casey, Jr. sen PA Democrat SenBobCasey L
... ... ... ... ... ... ... ... ...
532 Golden Jared Jared F. Golden rep ME Democrat repgolden L
533 Keller Fred Fred Keller rep PA Republican RepFredKeller C
534 Bishop Dan Dan Bishop rep NC Republican jdanbishop C
535 Murphy Gregory Gregory F. Murphy rep NC Republican DrGregMurphy1 C
536 Loeffler Kelly Kelly Loeffler sen GA Republican SenatorLoeffler C

537 rows × 8 columns

Use the Twitter API to collect Tweets

Twitter provides an excellent API for programmatically accessing the tweets of any public Twitter account. Since we already have the Twitter handle for each member of Congress, we can use the API to pull their tweets.

One drawback of pulling tweets from an individual user's timeline is that Twitter limits us to that user's most recent 3,200 tweets.

def get_tweets(handle):
    print(f'...Getting Tweets for {handle}')
    tweets = []
    try:
        # get the most recent 200 tweets
        new_tweets = api.user_timeline(screen_name=handle, count=200)
        tweets.extend(new_tweets)

        # page backwards through the timeline until no more tweets are returned
        while len(new_tweets) > 0:
            # request only tweets older than the oldest one collected so far
            old = tweets[-1].id - 1
            new_tweets = api.user_timeline(screen_name=handle,
                                           count=200,
                                           max_id=old)
            tweets.extend(new_tweets)

    except tweepy.TweepError:
        print(f'error with {handle} in function')

    # return just the text of each tweet
    return [tweet.text for tweet in tweets]
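For reference, tweepy's Cursor helper can handle this max_id pagination for us. Below is a minimal sketch of the same pull using Cursor; the get_tweets_cursor name is just for illustration, and the hand-rolled loop above is what the rest of this post actually uses.

# sketch: the same timeline pull, letting tweepy.Cursor handle pagination
def get_tweets_cursor(handle):
    cursor = tweepy.Cursor(api.user_timeline, screen_name=handle, count=200)
    return [tweet.text for tweet in cursor.items()]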

liberal_handle_list = congress_df[congress_df['lean']=='L'].twitter
conservative_handle_list = congress_df[congress_df['lean']=='C'].twitter

Get Tweets from Democratic Members of Congress

The output of this step is a list of 728,175 tweets from Democratic Senators and Representatives.


lib_tweets = []
for name in liberal_handle_list:
    try:
        tweets_temp = get_tweets(name)
        lib_tweets.extend(tweets_temp)
        # checkpoint progress after each member in case the run is interrupted
        with open('outdata/lib_list.pkl', 'wb') as f:
            pickle.dump(lib_tweets, f)
    except Exception:
        print(f"problem with {name} in loop")

Get Tweets from Republican Members of Congress

The output of this step is a list of 589,235 tweets from Republican Senators and Representatives.

conservative_tweets = []
for name in conservative_handle_list:
    try:
        tweets_temp = get_tweets(name)
        conservative_tweets.extend(tweets_temp)
        # checkpoint progress after each member in case the run is interrupted
        with open('outdata/con_list.pkl', 'wb') as f:
            pickle.dump(conservative_tweets, f)
        print(f'con_tweets len: {len(conservative_tweets)}')
    except Exception:
        print(f"problem with {name} in loop")


Manually fix problems with a few Conservative handles

A few of the Twitter handles did not work with the API; it turns out the source list had incorrect handles for these members. We can look up the correct handles manually and collect their data.


# Rep. Roger Marshall's twitter handle needed to be updated
marsh_tweets = get_tweets('RogerMarshallMD')
conservative_tweets.extend(marsh_tweets)
with open('outdata/con_list.pkl', 'wb') as f:
    pickle.dump(conservative_tweets, f)
...Getting Tweets for RogerMarshallMD
# Rep. Lance Gooden's twitter handle needed to be updated
gooden_tweets = get_tweets('Lancegooden')
conservative_tweets.extend(gooden_tweets)
with open('outdata/con_list.pkl', 'wb') as f:
    pickle.dump(conservative_tweets, f)
...Getting Tweets for Lancegooden

Add Current Executive Branch

The current Executive Branch tweets a lot, so it should be included in this analysis.

# Add President Trump to the Conservative Tweet List
trump_tweets = get_tweets('realDonaldTrump')
conservative_tweets.extend(trump_tweets)
with open('outdata/con_list.pkl', 'wb') as f:
    pickle.dump(conservative_tweets, f)
print(f'{len(conservative_tweets)}')
...Getting Tweets for realDonaldTrump
594399
# Define a function for adding to the conservative list
def add_tweets_to_con_list(handle):
    temp_tweets = get_tweets(handle)
    conservative_tweets.extend(temp_tweets)
    with open('outdata/con_list.pkl', 'wb') as f:
        pickle.dump(conservative_tweets, f)
    print(f'Conservative Tweets: {len(conservative_tweets)}')

# add VP Pence
add_tweets_to_con_list('VP')
...Getting Tweets for VP
597644

Add the Previous Executive Branch and Current Democratic Presidential Candidates (who aren’t in Congress)

The previous Executive Branch pioneered the use of social media in politics. This is also an election year, and many Democrats are vying for the party’s presidential nomination. A few of them are current members of Congress, so we already have their tweets; it makes sense to add those who are not in Congress to this dataset.

def add_tweets_to_lib_list(handle):
    temp_tweets = get_tweets(handle)
    lib_tweets.extend(temp_tweets)
    with open('outdata/lib_list.pkl', 'wb') as f:
        pickle.dump(lib_tweets, f)
    print(f'Liberal Tweets: {len(lib_tweets)}')

add_tweets_to_lib_list('BarackObama')
add_tweets_to_lib_list('JoeBiden')
add_tweets_to_lib_list('MikeBloomberg')
add_tweets_to_lib_list('DevalPatrick')
add_tweets_to_lib_list('TomSteyer')
add_tweets_to_lib_list('PeteButtigieg')
add_tweets_to_lib_list('JohnDelaney')
add_tweets_to_lib_list('AndrewYang')


...Getting Tweets for BarackObama
Liberal Tweets: 731403
...Getting Tweets for JoeBiden
Liberal Tweets: 734630
...Getting Tweets for MikeBloomberg
Liberal Tweets: 737863
...Getting Tweets for DevalPatrick
Liberal Tweets: 739782
...Getting Tweets for TomSteyer
Liberal Tweets: 742994
...Getting Tweets for PeteButtigieg
Liberal Tweets: 746216
...Getting Tweets for JohnDelaney
Liberal Tweets: 749432
...Getting Tweets for AndrewYang
Liberal Tweets: 752662

# upload the pickled tweet lists to AWS S3
s3.meta.client.upload_file('outdata/lib_list.pkl', bucket_name, 'lib_list.pkl')
s3.meta.client.upload_file('outdata/con_list.pkl', bucket_name, 'con_list.pkl')
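If this step needs to be re-run later or on another machine, the pickled lists can be pulled back down with the matching boto3 download call; a minimal sketch, assuming the same bucket and an existing local outdata/ directory:

# sketch: pull the pickled tweet lists back down from S3
s3.meta.client.download_file(bucket_name, 'lib_list.pkl', 'outdata/lib_list.pkl')
s3.meta.client.download_file(bucket_name, 'con_list.pkl', 'outdata/con_list.pkl')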

Create a Combined Dataframe of tweets with labels

# read in the lists of tweets saved during collection
with open('outdata/lib_list.pkl', 'rb') as f:
    lib_list = pickle.load(f)

with open('outdata/con_list.pkl', 'rb') as f:
    con_list = pickle.load(f)

# create liberal and conservative dataframes
lib_df = pd.DataFrame(columns=['tweet', 'class'])
con_df = pd.DataFrame(columns=['tweet', 'class'])

# fill the liberal dataframe and label it "L"
lib_df['tweet'] = lib_list
lib_df['class'] = "L"

# fill the conservative dataframe and label it "C"
con_df['tweet'] = con_list
con_df['class'] = "C"



#combine the liberal and conservative dataframes
tweet_df = pd.concat([lib_df, con_df])

# Randomly shuffle the dataframe
tweet_df = tweet_df.sample(frac=1)

# reset the index of the complete dataframe
tweet_df.reset_index(drop=True, inplace = True)

# view dataframe
tweet_df
tweet class
0 RT @aafb: Congrats to ⁦@RepOHalleran⁩ & ⁦@... L
1 Great to meet the new Lake County Farm Bureau ... L
2 Congratulations to @waynestcollege women's rug... C
3 Great to meet with the Erickson Air Crane team... C
4 Always wonderful to be part of the Back to Sch... L
... ... ...
1350301 We should be upholding the National Environmen... L
1350302 If anything is to be investigated, I think we ... C
1350303 TODAY: Federal judge rules in favor of House R... C
1350304 In the words of an old proverb, "A hit dog wil... L
1350305 The new EPA regs are pure fantasy. http://t.co... C

1350306 rows × 2 columns
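As a quick sanity check (not part of the original workflow), we can confirm the label counts line up with the totals collected above: the liberal and conservative lists (752,662 and 597,644 tweets) sum to the 1,350,306 rows shown.

# sketch: verify the label distribution of the combined dataframe
print(tweet_df['class'].value_counts())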

# save the combined dataframe locally as a pickle
with open('outdata/tweet_df.pkl', 'wb') as f:
    pickle.dump(tweet_df, f)

# upload the pickled dataframe to AWS S3
s3.meta.client.upload_file('outdata/tweet_df.pkl',
                           bucket_name,
                           'tweet_df.pkl')

# remove the local copies now that everything is stored in S3
os.remove('outdata/con_list.pkl')
os.remove('outdata/lib_list.pkl')
os.remove('outdata/tweet_df.pkl')