Political Twitter Part 3
Exploratory Data Analysis
Exploratory Data Analysis (EDA) looks different when performed on text. Rather than working with the original observations directly, we analyze representations of the text, such as token counts and document lengths. Here we will perform EDA using the cleaned tweet text from Part 2.
Click Here for full code notebook
Check for Data Imbalance
I have close to equal proportions of liberal and conservative tweets. 2020 is an election year, and the Democrats have a large field of hopefuls, so it is no surprise that they account for slightly more tweets.
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

# read in df from pickle file
tweets_clean_df = pd.read_pickle('outdata/tweets_clean_df.pkl')

# get counts by political group
counts = tweets_clean_df.groupby("class").agg(
    count=('class', "count")).reset_index()

# generate full-text labels for use in the graph
counts["Lean"] = np.where(counts['class'] == 'L', "Liberal", "Conservative")

# prepare data for plotting
counts = counts[["Lean", "count"]]

# begin plot
sns.set(style="ticks")

# initialize the matplotlib figure
f, ax1 = plt.subplots(figsize=(8, 10))

# barplot of parties
sns.set_color_codes("pastel")
g = sns.barplot(x="Lean",
                y="count",
                data=counts,
                label="Total",
                color="b",
                ax=ax1).set_title("Count of Tweets by Political Leanings")
sns.despine()
# separate corpus into Liberal and Conservative lists for analysis
lib_tweets = tweets_clean_df[tweets_clean_df['class']=='L']['clean_tweets'].tolist()
con_tweets = tweets_clean_df[tweets_clean_df['class']=='C']['clean_tweets'].tolist()
Use N-Gram Counts to Discover Top Themes
A common Natural Language Processing technique for identifying themes in a text is to count the occurrences of n-grams: contiguous groupings of n words. Counting the occurrences of each individual word in a document is counting 1-grams, or unigrams. The most useful n-grams are usually 2-grams and 3-grams; for n larger than three, the counts become too low to be very useful.
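To make this concrete, here is a minimal sketch, using scikit-learn's CountVectorizer (the same tool used in the functions below), of the 2-grams extracted from a single made-up sentence:

from sklearn.feature_extraction.text import CountVectorizer

# toy example: extract 2-grams from one short document
toy_doc = ["gun violence prevention saves lives"]
vec = CountVectorizer(ngram_range=(2, 2)).fit(toy_doc)
print(sorted(vec.vocabulary_.keys()))
# ['gun violence', 'prevention saves', 'saves lives', 'violence prevention']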
Some of the top themes from liberal 2-grams and 3-grams are: “health care”, “president trump”, “climate change”, “affordable care act”, “martin luther king”, “gun violence prevention”, “voting rights act”.
Conservatives had a similar trend in 2-grams and 3-grams: “small business”, “man woman”, “health care”, “tax reform”, “national security”, “tax cuts job”, “man woman uniform”, “law enforcement officer”, “small business owner”.
from sklearn.feature_extraction.text import CountVectorizer

# get top bigrams
def get_top_n_bigram(corpus, n=None):
    vectors = CountVectorizer(ngram_range=(2, 2), stop_words='english').fit(corpus)
    words = vectors.transform(corpus)
    # sum counts for each n-gram across all documents
    sum_words = words.sum(axis=0)
    words_freq = [(word, sum_words[0, idx]) for word, idx in vectors.vocabulary_.items()]
    words_freq = sorted(words_freq, key=lambda x: x[1], reverse=True)
    # filter out n-grams containing the placeholder token 'pron'
    words_freq = [i for i in words_freq if 'pron' not in i[0]]
    return words_freq[:n]

# get top trigrams
def get_top_n_trigram(corpus, n=None):
    vectors = CountVectorizer(ngram_range=(3, 3), stop_words='english').fit(corpus)
    words = vectors.transform(corpus)
    sum_words = words.sum(axis=0)
    words_freq = [(word, sum_words[0, idx]) for word, idx in vectors.vocabulary_.items()]
    words_freq = sorted(words_freq, key=lambda x: x[1], reverse=True)
    words_freq = [i for i in words_freq if 'pron' not in i[0]]
    return words_freq[:n]
# conservative n-gram counts
con_bigrams_top = get_top_n_bigram(con_tweets, 20)
con_trigrams_top = get_top_n_trigram(con_tweets, 20)

# liberal n-gram counts
lib_bigrams_top = get_top_n_bigram(lib_tweets, 20)
lib_trigrams_top = get_top_n_trigram(lib_tweets, 20)

# convert to dataframes for easy plotting
con_tri_df = pd.DataFrame(con_trigrams_top, columns=['Trigram', 'Counts'])
con_bi_df = pd.DataFrame(con_bigrams_top, columns=['Bigram', 'Counts'])
lib_tri_df = pd.DataFrame(lib_trigrams_top, columns=['Trigram', 'Counts'])
lib_bi_df = pd.DataFrame(lib_bigrams_top, columns=['Bigram', 'Counts'])
Top 2-grams by Political Group
Top 3-grams by Political Group
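The charts above were generated from these dataframes. As a minimal sketch, the conservative bigram chart could be drawn like this (the liberal and trigram charts follow the same pattern):

# sketch: horizontal barplot of the top conservative bigrams
f, ax = plt.subplots(figsize=(8, 10))
sns.barplot(x='Counts', y='Bigram', data=con_bi_df,
            color='b', ax=ax).set_title("Top 2-grams: Conservative")
sns.despine()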
Get Character and Word Counts per Tweet
Who has longer tweets, Liberals or Conservatives? To judge, we will plot histograms of tweet length measured in words and in characters.
# word and character counts per tweet, computed on the original tweet text
tweets_clean_df['words_per_tweet'] = [len(x.split(" ")) for x in tweets_clean_df['tweet'].tolist()]
tweets_clean_df['chars_per_tweet'] = [len(x) for x in tweets_clean_df['tweet'].tolist()]
tweets_clean_df[['words_per_tweet', 'chars_per_tweet']].sample(10)
|         | words_per_tweet | chars_per_tweet |
|---------|-----------------|-----------------|
| 748989  | 4  | 42  |
| 761081  | 13 | 98  |
| 581458  | 19 | 127 |
| 763005  | 20 | 146 |
| 674677  | 17 | 143 |
| 1172019 | 16 | 129 |
| 136311  | 16 | 137 |
| 432322  | 26 | 140 |
| 491182  | 20 | 140 |
| 550561  | 19 | 140 |
Combined Distribution of Tweet Lengths
This distribution includes both Conservative and Liberal tweets. It looks like tweets generally run around 20 words, or 140 characters.
char_count = tweets_clean_df.chars_per_tweet.tolist()
word_count = tweets_clean_df.words_per_tweet.tolist()

f, (ax1, ax2) = plt.subplots(2, 1, figsize=(12, 12))

# set color theme
sns.set_color_codes("pastel")

# -------- top plot: words per tweet --------
# note: distplot is deprecated in newer seaborn; sns.histplot is the modern equivalent
sns.distplot(word_count, kde=False,
             bins=30,
             ax=ax1).set_title("Words per Tweet", fontsize=20)
ax1.tick_params(labelsize=20)

# -------- lower plot: characters per tweet --------
sns.distplot(char_count, kde=False,
             bins=30,
             ax=ax2).set_title("Characters per Tweet", fontsize=20)
ax2.tick_params(labelsize=20)
Distribution of Tweet Lengths by Class
Both Liberals and Conservatives share the pattern of tweets running about 20 words or 140 characters; no meaningful difference in the tweet-length distributions of the two groups is apparent.
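As a minimal sketch, a per-class comparison like the one above could be drawn from the words_per_tweet column (the character version is analogous):

# sketch: overlaid word-count histograms, one per political leaning
f, ax = plt.subplots(figsize=(12, 6))
for label, name in [('L', 'Liberal'), ('C', 'Conservative')]:
    subset = tweets_clean_df[tweets_clean_df['class'] == label]
    ax.hist(subset['words_per_tweet'], bins=30, alpha=0.5, label=name)
ax.set_title("Words per Tweet by Political Leaning", fontsize=20)
ax.legend()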
Next Steps
Now that we understand the data, the next steps are to prepare it for modeling and to build the model!