Political Twitter Part 3


Exploratory Data Analysis


Exploratory Data Analysis (EDA) works differently on text than on numeric data: instead of analyzing the raw values directly, we analyze representations of the text, such as counts, lengths, and n-grams. Here we will perform EDA using the cleaned text from Part 2.

Click Here for full code notebook

Check for Data Imbalance

I have close to equal proportions of liberal and conservative tweets. 2020 is an election year, and the Democrats have a large field of hopefuls, so it is no surprise that they have slightly more tweets.


# imports (the same stack used throughout this series)
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# read in df from pickle file
tweets_clean_df = pd.read_pickle('outdata/tweets_clean_df.pkl')

# get counts by political group
counts = tweets_clean_df.groupby("class").agg(
    count=('class', "count")).reset_index()

# generate full-text labels for use in the graph
counts["Lean"] = np.where(counts['class'] == 'L', "Liberal", "Conservative")

# prepare data for plotting
counts = counts[["Lean", "count"]]

# begin plot
sns.set(style="ticks")

# initialize the matplotlib figure
f, ax1 = plt.subplots(figsize=(8, 10))

# barplot of parties
sns.set_color_codes("pastel")
sns.barplot(x="Lean",
            y="count",
            data=counts,
            label="Total",
            color="b",
            ax=ax1)
ax1.set_title("Count of Tweets by Political Leanings")

sns.despine()




[Figure: Count of Tweets by Political Leanings]
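As a quick numeric check (not in the original notebook), normalized value counts on the class column confirm the near-even split:

# exact proportion of each class
tweets_clean_df['class'].value_counts(normalize=True)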

# separate corpus into Liberal and Conservative lists for analysis
lib_tweets = tweets_clean_df[tweets_clean_df['class']=='L']['clean_tweets'].tolist()
con_tweets = tweets_clean_df[tweets_clean_df['class']=='C']['clean_tweets'].tolist()

Use N-Gram Counts to Discover Top Themes

A common Natural Language Processing technique for identifying themes in a text is to count the occurrences of n-grams. An n-gram is a contiguous sequence of n words: counting individual words gives 1-grams (unigrams), pairs give 2-grams, and so on. The most useful n-grams are usually 2-grams and 3-grams; for n larger than three, the counts become too sparse to be very useful.
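To make this concrete, here is a toy example (not from the notebook) of the 2-grams that scikit-learn's CountVectorizer extracts from a single sentence:

from sklearn.feature_extraction.text import CountVectorizer

# toy corpus: a single sentence
toy = ["medicare for all means health care for all"]

# fit a bigram-only vectorizer
vec = CountVectorizer(ngram_range=(2, 2)).fit(toy)

# vocabulary_ maps each n-gram to a column index; sorting the keys lists the bigrams
print(sorted(vec.vocabulary_))
# ['all means', 'care for', 'for all', 'health care', 'means health', 'medicare for']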

Some of the top themes from liberal 2-grams and 3-grams are: “health care”, “president trump”, “climate change”, “affordable care act”, “martin luther king”, “gun violence prevention”, and “voting rights act”.

Conservatives show a similar pattern in their 2-grams and 3-grams: “small business”, “man woman”, “health care”, “tax reform”, “national security”, “tax cuts job”, “man woman uniform”, “law enforcement officer”, and “small business owner”.

from sklearn.feature_extraction.text import CountVectorizer

# count the top n-grams in a corpus
def get_top_ngrams(corpus, ngram_range, n=None):
    vectors = CountVectorizer(ngram_range=ngram_range, stop_words='english').fit(corpus)
    words = vectors.transform(corpus)
    # total occurrences of each n-gram across the corpus
    sum_words = words.sum(axis=0)
    words_freq = [(word, sum_words[0, idx]) for word, idx in vectors.vocabulary_.items()]
    words_freq = sorted(words_freq, key=lambda x: x[1], reverse=True)
    # drop n-grams containing 'pron', the pronoun placeholder left by the Part 2 lemmatization
    words_freq = [i for i in words_freq if 'pron' not in i[0]]
    return words_freq[:n]

# conservative n-gram counts
con_bigrams_top = get_top_ngrams(con_tweets, (2, 2), 20)
con_trigrams_top = get_top_ngrams(con_tweets, (3, 3), 20)

# liberal n-gram counts
lib_bigrams_top = get_top_ngrams(lib_tweets, (2, 2), 20)
lib_trigrams_top = get_top_ngrams(lib_tweets, (3, 3), 20)

# convert to dataframes for easy plotting
con_bi_df = pd.DataFrame(con_bigrams_top, columns=['Bigram', 'Counts'])
con_tri_df = pd.DataFrame(con_trigrams_top, columns=['Trigram', 'Counts'])

lib_bi_df = pd.DataFrame(lib_bigrams_top, columns=['Bigram', 'Counts'])
lib_tri_df = pd.DataFrame(lib_trigrams_top, columns=['Trigram', 'Counts'])
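The bar charts below were rendered in the notebook; as a minimal sketch using the dataframes built above (the exact styling in the figures may differ), one panel could be drawn like this:

# sketch: horizontal bar chart of the top conservative bigrams
f, ax = plt.subplots(figsize=(8, 10))
sns.barplot(x="Counts", y="Bigram", data=con_bi_df, color="b", ax=ax)
ax.set_title("Top 2-grams: Conservative")
sns.despine()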

Top 2-grams by Political Group

[Figure: top 2-grams for each political group]

Top 3-grams by Political Group

[Figure: top 3-grams for each political group]

Get Character and Word Counts per Tweet

Who has longer tweets, Liberals or Conservatives? To judge, we will plot histograms of “Words per Tweet” and “Characters per Tweet.”

# word and character counts for each tweet
tweets_clean_df['words_per_tweet'] = [len(x.split(" ")) for x in tweets_clean_df['tweet'].tolist()]
tweets_clean_df['chars_per_tweet'] = [len(x) for x in tweets_clean_df['tweet'].tolist()]

# spot-check a random sample
tweets_clean_df[['words_per_tweet', 'chars_per_tweet']].sample(10)
         words_per_tweet  chars_per_tweet
748989                 4               42
761081                13               98
581458                19              127
763005                20              146
674677                17              143
1172019               16              129
136311                16              137
432322                26              140
491182                20              140
550561                19              140

Combined Distribution of Tweet Lengths

This distribution includes both Conservative and Liberal tweets.

Tweets generally run about 20 words, or roughly 140 characters, which lines up with Twitter's original 140-character limit.
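A quick way to check this claim numerically (not in the original post) is to look at the medians:

# median tweet length across the combined corpus
tweets_clean_df[['words_per_tweet', 'chars_per_tweet']].median()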

char_count = tweets_clean_df.chars_per_tweet.tolist()
word_count = tweets_clean_df.words_per_tweet.tolist()

f, (ax1, ax2) = plt.subplots(2, 1, figsize=(12, 12))

# set color theme
sns.set_color_codes("pastel")

# -------- top plot: words --------
sns.distplot(word_count, kde=False, bins=30, ax=ax1)
ax1.set_title("Words per Tweet", fontsize=20)
ax1.tick_params(labelsize=20)

# -------- bottom plot: characters --------
sns.distplot(char_count, kde=False, bins=30, ax=ax2)
ax2.set_title("Characters per Tweet", fontsize=20)
ax2.tick_params(labelsize=20)

[Figure: histograms of words per tweet and characters per tweet]

Distribution of Tweet Lengths by Class

The pattern of roughly 20 words, or 140 characters, per tweet holds for both groups; no notable difference between the Liberal and Conservative length distributions is apparent.
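The plotting code for this figure lives in the full notebook; a minimal sketch of one way to overlay the two groups, assuming the columns built above, is:

# sketch: overlay word-count distributions by class
f, ax = plt.subplots(figsize=(12, 6))
for label, name in [('L', 'Liberal'), ('C', 'Conservative')]:
    subset = tweets_clean_df[tweets_clean_df['class'] == label]
    sns.distplot(subset['words_per_tweet'], kde=False, bins=30, label=name, ax=ax)
ax.set_title("Words per Tweet by Political Leaning", fontsize=20)
ax.legend()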

[Figure: tweet length distributions by class]

Next steps

Now that we understand the data, the next steps are to prepare it for modeling and to build the model!
