Political Twitter Part 1
Social media has been growing in popularity as a platform in American politics since the 2008 election cycle. Twitter has become the platform of choice for most members of both the Legislative and Executive Branches. This four-part blog series demonstrates the workflow of a data science project focused on analyzing and modeling Twitter content generated by American politicians.
- Part I: Collection of the Twitter Data
- Part II: Cleaning and Preprocessing
- Part III: Exploratory Analysis
- Part IV: Text Classification Model
Data Collection
At the end of this post, we will have a labeled dataset of unprocessed tweets from American politicians.
Click Here for the full code notebook.
import pandas as pd
import numpy as np
import tweepy
import tabula
import os
import pickle
import boto3
s3 = boto3.resource('s3')
bucket_name = "msds-practicum-carey"
c_key = os.environ.get('tw_c_key')
c_sec = os.environ.get('tw_c_sec')
atk = os.environ.get('tw_ac_tok')
ats = os.environ.get('tw_ac_sec')
auth = tweepy.OAuthHandler(c_key, c_sec)
auth.set_access_token(atk, ats)
api = tweepy.API(auth, wait_on_rate_limit_notify=True,
                 wait_on_rate_limit=True)
pd.set_option('display.max_rows', 200)
# read in data
congress_url = "https://theunitedstates.io/congress-legislators/legislators-current.csv"
congress_df = pd.read_csv(congress_url,
                          usecols=['last_name',
                                   'first_name',
                                   'full_name',
                                   'party',
                                   'type',
                                   'state',
                                   'twitter'])
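The credentials above are read with `os.environ.get`, which returns `None` without complaint when a variable is unset, so a missing key only shows up later as an authentication error. A minimal sketch of an up-front check follows; the helper name `load_twitter_credentials` and the `REQUIRED_VARS` tuple are hypothetical and not part of the original notebook.

```python
import os

# Names of the environment variables used in this post.
REQUIRED_VARS = ("tw_c_key", "tw_c_sec", "tw_ac_tok", "tw_ac_sec")


def load_twitter_credentials(env=None):
    """Return the four credentials as a dict, or raise if any is unset.

    `env` defaults to os.environ; passing a plain dict makes this easy to test.
    (Hypothetical helper, not part of the original notebook.)
    """
    env = os.environ if env is None else env
    missing = [name for name in REQUIRED_VARS if not env.get(name)]
    if missing:
        raise RuntimeError(f"Missing environment variables: {', '.join(missing)}")
    return {name: env[name] for name in REQUIRED_VARS}
```

Calling `load_twitter_credentials()` before building the `tweepy` handler would fail fast with the names of the unset variables instead of a less specific API error.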
Check for missing Twitter Handles
# filter for records with missing twitter handle
congress_df[congress_df['twitter'].isna()]
| | last_name | first_name | full_name | type | state | party | twitter |
|---|---|---|---|---|---|---|---|
| 33 | Amash | Justin | Justin Amash | rep | MI | Independent | NaN |
| 62 | Clay | Wm. | Wm. Lacy Clay | rep | MO | Democrat | NaN |
| 182 | Peterson | Collin | Collin C. Peterson | rep | MN | Democrat | NaN |
| 314 | Kaine | Timothy | Tim Kaine | sen | VA | Democrat | NaN |
| 372 | Comer | James | James Comer | rep | KY | Republican | NaN |
| 425 | Gianforte | Greg | Greg Gianforte | rep | MT | Republican | NaN |
| 534 | Bishop | Dan | Dan Bishop | rep | NC | Republican | NaN |
| 535 | Murphy | Gregory | Gregory F. Murphy | rep | NC | Republican | NaN |
| 536 | Loeffler | Kelly | Kelly Loeffler | sen | GA | Republican | NaN |
A few Twitter handles are missing. Since the number is small, I will fill them in manually.
Georgia swore in a new senator to replace retired Sen. Johnny Isakson. I've elected to keep Johnny Isakson's name in the list in order to capture all of his tweets.
# fill in missing twitter handles
congress_df.at[33, 'twitter'] = 'justinamash'
congress_df.at[62, 'twitter'] = 'LacyClayMO1'
congress_df.at[182, 'twitter'] = "collinpeterson"
congress_df.at[314, 'twitter'] = "timkaine"
congress_df.at[372, 'twitter'] = "KYComer"
congress_df.at[425, 'twitter'] = "GregForMontana"
congress_df.at[534, 'twitter'] = "jdanbishop"
congress_df.at[535, 'twitter'] = "DrGregMurphy1"
congress_df.at[536, 'twitter'] = "SenatorLoeffler"
print(f"Missing Twitter Handles: {len(congress_df[congress_df['twitter'].isna()])}")
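The assignments above address rows by position (33, 62, ...), which only holds as long as the upstream CSV keeps the same row order. A minimal sketch of the same fill keyed on `full_name` instead is shown here with plain Python dicts standing in for DataFrame rows; the `records` list and the `fill_missing_handles` helper are hypothetical illustrations, while the handles are the ones used above.

```python
# Hypothetical, trimmed-down records standing in for rows of congress_df.
records = [
    {"full_name": "Justin Amash", "twitter": None},
    {"full_name": "Tim Kaine", "twitter": None},
    {"full_name": "Sherrod Brown", "twitter": "SenSherrodBrown"},
]

# Handles taken from the manual fixes in this post, keyed by full_name.
manual_handles = {
    "Justin Amash": "justinamash",
    "Tim Kaine": "timkaine",
}


def fill_missing_handles(rows, handles):
    """Fill empty 'twitter' fields from `handles`; return names still missing."""
    still_missing = []
    for row in rows:
        if row["twitter"]:
            continue  # keep handles that are already present
        row["twitter"] = handles.get(row["full_name"])
        if row["twitter"] is None:
            still_missing.append(row["full_name"])
    return still_missing


unresolved = fill_missing_handles(records, manual_handles)
print(f"Missing Twitter Handles: {len(unresolved)}")
```

Keying on a name means the fix keeps working if members are added to or removed from the source list, at the cost of breaking if a name's spelling changes upstream.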
# group by political party
congress_df.groupby(by='party').count()
| party | last_name | first_name | full_name | type | state | twitter |
|---|---|---|---|---|---|---|
| Democrat | 281 | 281 | 281 | 281 | 281 | 281 |
| Independent | 3 | 3 | 3 | 3 | 3 | 3 |
| Republican | 253 | 253 | 253 | 253 | 253 | 253 |
Independent members of Congress
Three members of this Congress are Independents, which means that they do not belong to either major political party. Even so, they have liberal or conservative political leanings, so we can relabel each of them with the party with which they most closely align. This lets every member be labeled as either Liberal or Conservative.
# filter by Independents
congress_df[congress_df['party']=='Independent']
| | last_name | first_name | full_name | type | state | party | twitter |
|---|---|---|---|---|---|---|---|
| 8 | Sanders | Bernard | Bernard Sanders | sen | VT | Independent | SenSanders |
| 33 | Amash | Justin | Justin Amash | rep | MI | Independent | justinamash |
| 287 | King | Angus | Angus S. King, Jr. | sen | ME | Independent | SenAngusKing |
# relabel the Independents
congress_df.at[8, 'party'] = 'Democrat'
congress_df.at[33, 'party'] = 'Republican'
congress_df.at[287, 'party'] = 'Democrat'
# create new column of Liberal or Conservative labels.
congress_df['lean'] = np.where(congress_df['party']=='Democrat','L', 'C')
congress_df
| | last_name | first_name | full_name | type | state | party | twitter | lean |
|---|---|---|---|---|---|---|---|---|
| 0 | Brown | Sherrod | Sherrod Brown | sen | OH | Democrat | SenSherrodBrown | L |
| 1 | Cantwell | Maria | Maria Cantwell | sen | WA | Democrat | SenatorCantwell | L |
| 2 | Cardin | Benjamin | Benjamin L. Cardin | sen | MD | Democrat | SenatorCardin | L |
| 3 | Carper | Thomas | Thomas R. Carper | sen | DE | Democrat | SenatorCarper | L |
| 4 | Casey | Robert | Robert P. Casey, Jr. | sen | PA | Democrat | SenBobCasey | L |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 532 | Golden | Jared | Jared F. Golden | rep | ME | Democrat | repgolden | L |
| 533 | Keller | Fred | Fred Keller | rep | PA | Republican | RepFredKeller | C |
| 534 | Bishop | Dan | Dan Bishop | rep | NC | Republican | jdanbishop | C |
| 535 | Murphy | Gregory | Gregory F. Murphy | rep | NC | Republican | DrGregMurphy1 | C |
| 536 | Loeffler | Kelly | Kelly Loeffler | sen | GA | Republican | SenatorLoeffler | C |
537 rows × 8 columns
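Note that the `np.where` call labels every member who is not a Democrat as `'C'`, so the labels are only correct because the Independents were relabeled first. A minimal sketch of a stricter variant in plain Python follows: it applies the same three overrides, but raises on any party it does not recognise. The names `PARTY_TO_LEAN`, `INDEPENDENT_OVERRIDES` and `lean_for`, and the choice to key the overrides by Twitter handle, are assumptions made for illustration.

```python
# Party -> lean lookup. "Independent" is deliberately absent so that an
# unmapped party raises instead of silently becoming "C".
PARTY_TO_LEAN = {"Democrat": "L", "Republican": "C"}

# Relabeling from this post, keyed by handle rather than row number
# (the choice of key is an assumption made for this sketch).
INDEPENDENT_OVERRIDES = {
    "SenSanders": "Democrat",
    "justinamash": "Republican",
    "SenAngusKing": "Democrat",
}


def lean_for(handle, party):
    """Return 'L' or 'C' for a member, applying the Independent overrides."""
    party = INDEPENDENT_OVERRIDES.get(handle, party)
    try:
        return PARTY_TO_LEAN[party]
    except KeyError:
        raise ValueError(f"No lean defined for {handle!r} with party {party!r}") from None


print(lean_for("SenSanders", "Independent"))
print(lean_for("justinamash", "Independent"))
```

With this shape, a newly elected Independent who has not been reviewed produces an error rather than a quiet Conservative label.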
Use the Twitter API to collect Tweets
Twitter provides an excellent API for programmatically accessing tweets from anybody with a Twitter account. Since we already have the Twitter handle for each member of Congress, we can use the API to pull their tweets.
One drawback of pulling tweets from an individual user is that Twitter limits us to the most recent 3,200 tweets per user.
def get_tweets(handle):
    """Return the text of the most recent tweets available for handle."""
    print(f'...Getting Tweets for {handle}')
    # blank list to accumulate tweet objects
    tweets = []
    try:
        # get the most recent 200 tweets
        new_tweets = api.user_timeline(screen_name=handle,
                                       count=200)
        while len(new_tweets) > 0:
            # add the current batch to the end of the tweets list
            tweets.extend(new_tweets)
            # id just below the oldest tweet collected so far
            old = tweets[-1].id - 1
            # get the next 200 tweets older than that id
            new_tweets = api.user_timeline(screen_name=handle,
                                           count=200,
                                           max_id=old)
    except tweepy.TweepError:
        print(f'error with {handle} in function')
    # built outside the try block so it is always defined, even after an error
    return [tweet.text for tweet in tweets]
liberal_handle_list = congress_df[congress_df['lean']=='L'].twitter
conservative_handle_list = congress_df[congress_df['lean']=='C'].twitter
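The paging logic in `get_tweets` can be checked offline without touching the API. The sketch below replaces the timeline endpoint with a hypothetical `fake_user_timeline` function that mimics the two behaviours the loop relies on: results come back newest first, and `max_id` restricts them to ids less than or equal to that value. The `cap` parameter is an assumption standing in for the 3,200-tweet limit, which Twitter enforces on its side.

```python
def fake_user_timeline(all_ids, count=200, max_id=None):
    """Stand-in for a timeline endpoint: newest-first ids, at most `count`,
    restricted to ids <= max_id when max_id is given."""
    eligible = [i for i in all_ids if max_id is None or i <= max_id]
    return eligible[:count]


def fetch_all_ids(all_ids, page_size=200, cap=3200):
    """Page backwards through a timeline until it is exhausted or `cap` is hit."""
    collected = []
    max_id = None
    while len(collected) < cap:
        page = fake_user_timeline(all_ids, count=page_size, max_id=max_id)
        if not page:
            break  # no older tweets left
        collected.extend(page)
        # Ask only for tweets strictly older than the oldest one seen so far.
        max_id = page[-1] - 1
    return collected[:cap]


# A hypothetical account with 450 tweets, ids 450 down to 1 (newest first).
timeline = list(range(450, 0, -1))
ids = fetch_all_ids(timeline)
print(len(ids), ids[0], ids[-1])  # 450 450 1
```

Subtracting 1 from the oldest id matters because `max_id` is inclusive; without it, each page would start with the last tweet of the previous page.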
Get Tweets from Democratic members of Congress
The output of this is a list of 728,175 tweets from Democratic senators and representatives.
lib_tweets = []
for name in liberal_handle_list:
    try:
        tweets_temp = get_tweets(name)
        lib_tweets.extend(tweets_temp)
        # checkpoint after every handle so progress survives a crash
        with open('outdata/lib_list.pkl', 'wb') as f:
            pickle.dump(lib_tweets, f)
    except Exception:
        print(f"problem with {name} in loop")
Get Tweets from Republican Members of Congress
The output of this is a list of 589,235 tweets from Republican Senators and Representatives.
conservative_tweets = []
for name in conservative_handle_list:
    try:
        tweets_temp = get_tweets(name)
        conservative_tweets.extend(tweets_temp)
        # checkpoint after every handle so progress survives a crash
        with open('outdata/con_list.pkl', 'wb') as f:
            pickle.dump(conservative_tweets, f)
        print(f'con_tweets len: {len(conservative_tweets)}')
    except Exception:
        print(f"problem with {name} in loop")
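Both collection loops share the same shape: fetch, extend, checkpoint, and print a message on failure. A minimal sketch of that pattern as one reusable function follows. It also returns the handles that failed, so they can be revisited afterwards instead of being read off the printed output. The names `collect_with_checkpoint` and `demo_fetch` are hypothetical, and the sketch assumes the fetch function raises an exception when a handle cannot be read.

```python
import os
import pickle
import tempfile


def collect_with_checkpoint(handles, fetch, checkpoint_path):
    """Fetch tweets per handle, pickling progress after each success.

    Returns (tweets, failed_handles). `fetch` is any callable that takes a
    handle and returns a list of tweet texts, raising on failure.
    """
    tweets, failed = [], []
    for handle in handles:
        try:
            batch = fetch(handle)
        except Exception as exc:  # keep going, but remember who failed
            print(f"problem with {handle}: {exc}")
            failed.append(handle)
            continue
        tweets.extend(batch)
        with open(checkpoint_path, "wb") as f:
            pickle.dump(tweets, f)
    return tweets, failed


def demo_fetch(handle):
    """Hypothetical fetcher used only to exercise the loop."""
    if handle == "bad_handle":
        raise ValueError("user not found")
    return [f"{handle} tweet {n}" for n in range(2)]


path = os.path.join(tempfile.mkdtemp(), "demo_list.pkl")
tweets, failed = collect_with_checkpoint(["a", "bad_handle", "b"], demo_fetch, path)
print(len(tweets), failed)  # 4 ['bad_handle']
```

The returned `failed` list is exactly the set of handles that would need a manual lookup.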
Manually fix the problems that arose for a few Conservative Handles
Some of the Twitter handles did not work with the API. It turns out that the source list had incorrect handles for some of these members. We can look up the correct handles manually and collect the data.
# Rep. Roger Marshall's twitter handle needed to be updated
marsh_tweets = get_tweets('RogerMarshallMD')
conservative_tweets.extend(marsh_tweets)
with open('outdata/con_list.pkl', 'wb') as f:
    pickle.dump(conservative_tweets, f)
...Getting Tweets for RogerMarshallMD
# Rep. Lance Gooden's twitter handle needed to be updated.
gooden_tweets = get_tweets('Lancegooden')
conservative_tweets.extend(gooden_tweets)
with open('outdata/con_list.pkl', 'wb') as f:
    pickle.dump(conservative_tweets, f)
...Getting Tweets for Lancegooden
Add Current Executive Branch
The current Executive Branch tweets a lot, so it should be included in this analysis.
# Add President Trump to the Conservative Tweet List
trump_tweets = get_tweets('realDonaldTrump')
conservative_tweets.extend(trump_tweets)
with open('outdata/con_list.pkl', 'wb') as f:
    pickle.dump(conservative_tweets, f)
print(f'{len(conservative_tweets)}')
...Getting Tweets for realDonaldTrump
594399
# Define a function for adding to the conservative list
def add_tweets_to_con_list(handle):
    temp_tweets = get_tweets(handle)
    conservative_tweets.extend(temp_tweets)
    with open('outdata/con_list.pkl', 'wb') as f:
        pickle.dump(conservative_tweets, f)
    print(f'Conservative Tweets: {len(conservative_tweets)}')
# add VP Pence
add_tweets_to_con_list('VP')
...Getting Tweets for VP
597644
Add the previous Executive Branch and Current Democratic Presidential Candidates (that aren’t in congress)
The previous Executive Branch pioneered the use of social media for politics. Also, this is an election year, and the Democrats have a lot of candidates vying for the Democratic nomination. A few of them are current members of Congress, so we already have their tweets. However, it makes sense to add those who are not members of Congress to this dataset.
def add_tweets_to_lib_list(handle):
    temp_tweets = get_tweets(handle)
    lib_tweets.extend(temp_tweets)
    with open('outdata/lib_list.pkl', 'wb') as f:
        pickle.dump(lib_tweets, f)
    print(f'Liberal Tweets: {len(lib_tweets)}')
add_tweets_to_lib_list('BarackObama')
add_tweets_to_lib_list('JoeBiden')
add_tweets_to_lib_list('MikeBloomberg')
add_tweets_to_lib_list('DevalPatrick')
add_tweets_to_lib_list('TomSteyer')
add_tweets_to_lib_list('PeteButtigieg')
add_tweets_to_lib_list('JohnDelaney')
add_tweets_to_lib_list('AndrewYang')
...Getting Tweets for BarackObama
Liberal Tweets: 731403
...Getting Tweets for JoeBiden
Liberal Tweets: 734630
...Getting Tweets for MikeBloomberg
Liberal Tweets: 737863
...Getting Tweets for DevalPatrick
Liberal Tweets: 739782
...Getting Tweets for TomSteyer
Liberal Tweets: 742994
...Getting Tweets for PeteButtigieg
Liberal Tweets: 746216
...Getting Tweets for JohnDelaney
Liberal Tweets: 749432
...Getting Tweets for AndrewYang
Liberal Tweets: 752662
s3.meta.client.upload_file('outdata/lib_list.pkl', bucket_name, 'lib_list.pkl')
s3.meta.client.upload_file('outdata/con_list.pkl', bucket_name, 'con_list.pkl')
Create a Combined Dataframe of tweets with labels
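# read in lists of tweets
with open('outdata/lib_list.pkl', 'rb') as f:
    lib_list = pickle.load(f)
with open('outdata/con_list.pkl', 'rb') as f:
    con_list = pickle.load(f)
# create liberal and Conservative Dataframes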
lib_df = pd.DataFrame(columns=['tweet', 'class'])
con_df = pd.DataFrame(columns=['tweet', 'class'])
# Fill liberal dataframe
lib_df['tweet'] = lib_list
lib_df['class'] = "L"
# fill conservative dataframe
con_df['tweet'] = con_list
con_df['class'] = "C"
#combine the liberal and conservative dataframes
tweet_df = pd.concat([lib_df, con_df])
# Randomly shuffle the dataframe
tweet_df = tweet_df.sample(frac=1)
# reset the index of the complete dataframe
tweet_df.reset_index(drop=True, inplace = True)
# view dataframe
tweet_df
| | tweet | class |
|---|---|---|
| 0 | RT @aafb: Congrats to @RepOHalleran & @... | L |
| 1 | Great to meet the new Lake County Farm Bureau ... | L |
| 2 | Congratulations to @waynestcollege women's rug... | C |
| 3 | Great to meet with the Erickson Air Crane team... | C |
| 4 | Always wonderful to be part of the Back to Sch... | L |
| ... | ... | ... |
| 1350301 | We should be upholding the National Environmen... | L |
| 1350302 | If anything is to be investigated, I think we ... | C |
| 1350303 | TODAY: Federal judge rules in favor of House R... | C |
| 1350304 | In the words of an old proverb, "A hit dog wil... | L |
| 1350305 | The new EPA regs are pure fantasy. http://t.co... | C |
1350306 rows × 2 columns
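The combine-and-shuffle step uses `sample(frac=1)` without a seed, so each run produces a different row order. Where a repeatable order is useful, `DataFrame.sample` accepts a `random_state` argument. The same idea is sketched below in plain Python on hypothetical miniature inputs: label each tweet, concatenate, and shuffle with a fixed seed. The function name `build_labeled_rows` and the default seed of 42 are assumptions made for illustration.

```python
import random


def build_labeled_rows(lib_tweets, con_tweets, seed=42):
    """Label each tweet 'L' or 'C', combine, and shuffle reproducibly."""
    rows = [(t, "L") for t in lib_tweets] + [(t, "C") for t in con_tweets]
    # A local Random instance keeps the global random state untouched.
    random.Random(seed).shuffle(rows)
    return rows


# Hypothetical miniature inputs standing in for the two tweet lists.
rows = build_labeled_rows(["lib a", "lib b"], ["con a"])
for tweet, label in rows:
    print(label, tweet)
```

A fixed seed makes later train/test splits comparable across runs, which matters once the modeling posts build on this dataset.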
# save the dataframe to a local pickle file
with open('outdata/tweet_df.pkl', 'wb') as f:
    pickle.dump(tweet_df, f)
# upload the pickle file to AWS S3
s3.meta.client.upload_file('outdata/tweet_df.pkl',
                           bucket_name,
                           'tweet_df.pkl')
os.remove('outdata/con_list.pkl')
os.remove('outdata/lib_list.pkl')
os.remove('outdata/tweet_df.pkl')