Two Dialects of Trumpese

In early August, visual effects artist Todd Vaziri noted a difference between Donald Trump's tweets that came from Twitter for iPhone and those that came from Twitter for Android. Here's how Vaziri originally put it:

Shortly thereafter, data scientist David Robinson took a more in-depth look at the difference between "iPhone Trump" and "Android Trump," concluding that these tweets were written by different individuals. More specifically, Robinson found that the Android tweets tended to be posted in the morning, whereas the iPhone tweets tended to be posted in the early evening. Tweets from Android also often used a form of "manual retweeting," with the original tweet surrounded by quotation marks, while iPhone tweets were much more likely to include pictures or links. Perhaps most interesting was Robinson's use of sentiment analysis to show that the Android tweets were much more negative in sentiment than the iPhone tweets. Since then, it has become conventional wisdom that the more negative, hyperbolic Android tweets are written by Trump himself, while the more positive, professional iPhone tweets are written by Trump's campaign staff.

Vaziri's and Robinson's observations already provide strong support for this conclusion, but I started thinking about other types of evidence that we might use to validate the distinction between Android Trump and iPhone Trump. To start off, we can look at the distribution of the sources of Trump's tweets from the beginning of 2015 until the end of September 2016:

Tweets from Android dominate in early 2015, before Trump officially announced his candidacy. We start to see a bit of iPhone activity during the summer of 2015, just around the time that Trump launched his campaign, but the proportion of tweets from iPhone really takes off a few months later.

iPhone tweets certainly became more common after the start of Trump's campaign, but this falls short of showing that the iPhone tweets differ in content from the Android tweets, as Vaziri and Robinson have argued. Robinson's blog gives one way to establish such a difference. Here, I look at another. The idea is to build a classifier, essentially a computer program that takes in the text of a tweet, but does not know its source. Based on the text alone, the classifier will try to determine whether the tweet came from an Android phone or from an iPhone. If we can build a classifier that categorizes the tweets with reasonably good accuracy, then the text of the tweet provides a good guide as to where the tweet came from. This, in turn, shows that the iPhone tweets are qualitatively different from the Android tweets.

One classifier I'll use is known as a Naive Bayes classifier. This is a fairly simple type of classifier, which means that for some challenging tasks it won't perform well. But in many cases Naive Bayes is sufficient for the problem at hand, and its simplicity makes it relatively easy to implement and understand. I'll also build several classifiers that aren't Naive Bayes classifiers, but which are conceptually quite similar.

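To start out, suppose we have the text of some tweet. We then want to find the probability that, given the text, this is an Android tweet, $P(\text{Android} \mid \text{text})$, and the probability that, given the text, this is an iPhone tweet, $P(\text{iPhone} \mid \text{text})$. Finally, we'll simply classify the tweet depending on which probability is greater. That's all straightforward, but how do we get these probabilities? The "Bayes" part of Naive Bayes comes from the fact that we reformulate the problem of finding $P(\text{Android} \mid \text{text})$ and $P(\text{iPhone} \mid \text{text})$ using Bayes' theorem:

$$P(\text{Android} \mid \text{text}) = \frac{P(\text{text} \mid \text{Android})\,P(\text{Android})}{P(\text{text})}$$

$$P(\text{iPhone} \mid \text{text}) = \frac{P(\text{text} \mid \text{iPhone})\,P(\text{iPhone})}{P(\text{text})}$$

Since we're simply going to compare $P(\text{Android} \mid \text{text})$ and $P(\text{iPhone} \mid \text{text})$ to see which is larger, we can ignore the denominator, which is the same in both cases. This means that we can solve for the following:

$$P(\text{Android} \mid \text{text}) \propto P(\text{text} \mid \text{Android})\,P(\text{Android})$$

$$P(\text{iPhone} \mid \text{text}) \propto P(\text{text} \mid \text{iPhone})\,P(\text{iPhone})$$

In other words, rather than directly calculating the probability that a tweet came from a particular source given the tweet's text ($P(\text{Android} \mid \text{text})$ or $P(\text{iPhone} \mid \text{text})$), we calculate the probability that we'd see the text of the tweet given a particular source ($P(\text{text} \mid \text{Android})$ or $P(\text{text} \mid \text{iPhone})$). We then multiply this by the so-called "base rate" of the tweet coming from a particular source ($P(\text{Android})$, $P(\text{iPhone})$).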

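At first glance, it may seem that we've made our problem harder. After all, originally we just had to calculate two values: $P(\text{Android} \mid \text{text})$ and $P(\text{iPhone} \mid \text{text})$. Now we have to calculate four: $P(\text{Android})$, $P(\text{iPhone})$, $P(\text{text} \mid \text{Android})$, and $P(\text{text} \mid \text{iPhone})$. But the reason to reformulate the problem in this way is that these four values are often much easier to calculate than the original two we started with.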
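The comparison itself can be sketched in a few lines of Python. This is a minimal illustration, not the code used for this post: the function name `classify` and the numbers are made up. It works with log probabilities, since the probability of a whole tweet is usually too small to multiply comfortably.

```python
import math

def classify(log_likelihoods, priors):
    """Return the source with the largest P(text | source) * P(source).

    log_likelihoods: dict mapping source -> log10 P(text | source)
    priors:          dict mapping source -> P(source)
    Both arguments are hypothetical inputs used only for illustration.
    """
    scores = {
        source: log_likelihoods[source] + math.log10(priors[source])
        for source in priors
    }
    # The shared denominator P(text) is dropped: it cannot change the winner.
    return max(scores, key=scores.get)

# Made-up numbers, not taken from the real data.
example = classify(
    log_likelihoods={"Android": -42.0, "iPhone": -45.5},
    priors={"Android": 0.5, "iPhone": 0.5},
)
print(example)
```

Adding logs here is the same as multiplying the likelihood by the base rate, so the tweet is assigned to whichever source gives the larger product.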

Next, we get some data. For this project, I collected all tweets from Donald Trump's Twitter account posted from either Twitter for Android or Twitter for iPhone from June 16, 2015 (the official start of Trump's campaign) until October 1, 2016. I excluded retweets and any tweets including a quotation mark ("), since many of these are the "manual retweets" that Robinson discusses. Next, I split the data into two sets: a training set consisting of 80% of all tweets and a test set consisting of the remaining 20%. The training set is what we use to calculate the probabilities that the classifier will use. The test set is used to evaluate the classifier at the end.
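As a rough sketch of this preparation step, assuming each tweet is a dictionary with the hypothetical keys `text`, `source`, and `is_retweet`, the filtering and the 80/20 split might look like the following. The random shuffle and the fixed seed are my assumptions; the post does not say how the split was drawn, and the date filtering is omitted.

```python
import random

SOURCES = {"Twitter for Android", "Twitter for iPhone"}

def filter_and_split(tweets, train_fraction=0.8, seed=0):
    """Drop retweets and quoted tweets, then split into training and test sets."""
    kept = [
        t for t in tweets
        if t["source"] in SOURCES
        and not t["is_retweet"]
        and '"' not in t["text"]  # likely a "manual retweet"
    ]
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    rng.shuffle(kept)
    cut = int(len(kept) * train_fraction)
    return kept[:cut], kept[cut:]

# Invented placeholder tweets, not real data.
toy_tweets = [
    {"text": f"toy tweet number {i}", "source": "Twitter for Android", "is_retweet": False}
    for i in range(10)
] + [
    {"text": 'a "manual retweet" style tweet', "source": "Twitter for Android", "is_retweet": False},
    {"text": "a native retweet", "source": "Twitter for iPhone", "is_retweet": True},
]
train_set, test_set = filter_and_split(toy_tweets)
print(len(train_set), len(test_set))
```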

Recall that we need to calculate four probabilities in order to implement the classifier: $P(\text{Android})$, $P(\text{iPhone})$, $P(\text{text} \mid \text{Android})$, and $P(\text{text} \mid \text{iPhone})$. The first two terms are very straightforward to calculate: we simply take the percentage of tweets in the training set that came from each source. In the end, these terms won't be too important; in my training set, about 49% of tweets were from Android, while about 51% were from iPhone.
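Estimating the base rates amounts to counting labels. Here is a small sketch; the name `estimate_priors` is hypothetical, and the toy label list is constructed only to mirror the rough 49%/51% proportions mentioned above.

```python
from collections import Counter

def estimate_priors(train_sources):
    """Return the fraction of training tweets that came from each source."""
    counts = Counter(train_sources)
    total = sum(counts.values())
    return {source: n / total for source, n in counts.items()}

# Constructed labels, not the real training set.
toy_sources = ["Android"] * 49 + ["iPhone"] * 51
priors = estimate_priors(toy_sources)
print(priors)
```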

To calculate the other terms, $P(\text{text} \mid \text{Android})$ and $P(\text{text} \mid \text{iPhone})$, we'll use what's called an n-gram language model. A language model is just a probability distribution over sequences of words. Another way to think of it is as a computer program that takes in a sequence of words and returns a probability. We'll need two language models, one for Android tweets that will give us $P(\text{text} \mid \text{Android})$ and one for iPhone tweets that will give us $P(\text{text} \mid \text{iPhone})$. In particular, I used KenLM, software made by Kenneth Heafield for building such models.

I won't go into all the details about how language modelling works, but it's useful to introduce some basic ideas. One of the simplest types of language models is a unigram model. A unigram model assigns a probability to a sequence of words based on how frequently each word in the sequence appeared in the training set. So, for example, if we put the sentence "Make America great again" into a unigram language model, the model would give us a probability that depends on how frequently "make" appeared in the training set, how frequently "America" appeared in the training set, etc. A bigram model extends this idea to two-word sequences. So, to assign a probability to "Make America great again", we'd look at how frequently "make America" appeared in the training set, how frequently "America great" appeared in the training set, etc. Trigram models extend this idea to three-word sequences, and so on. KenLM also incorporates a technique known as Kneser-Ney smoothing, which we can think of as a way of intelligently mixing probabilities we would get from higher-order models (e.g. a 4-gram model) with those that we'd get from lower-order models (e.g. a trigram model) in order to improve the performance of the language model.
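To make the unigram and bigram ideas concrete, here is a toy sketch. It is not KenLM and does not implement Kneser-Ney smoothing; it uses plain add-one smoothing as a simple stand-in, and all function names are hypothetical.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Return all n-word sequences in a list of tokens, in order."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def train_unigram(texts):
    """Count lowercased, whitespace-separated words across the training texts."""
    counts = Counter()
    for text in texts:
        counts.update(text.lower().split())
    return counts

def unigram_log10_prob(text, counts):
    """log10 P(text) under a unigram model with add-one smoothing.

    Add-one smoothing is a stand-in here; it is NOT the Kneser-Ney
    smoothing that KenLM uses.
    """
    total = sum(counts.values())
    vocab = len(counts) + 1  # one extra slot shared by all unseen words
    return sum(
        math.log10((counts[w] + 1) / (total + vocab))
        for w in text.lower().split()
    )

tokens = "make america great again".split()
bigrams = ngrams(tokens, 2)
print(bigrams)

counts = train_unigram(["Make America great again", "Make America safe again"])
seen = unigram_log10_prob("make america great", counts)
unseen = unigram_log10_prob("make america wonderful", counts)
print(seen > unseen)
```

A sequence containing a word the model has never seen gets a lower probability than one built entirely from familiar words, which is the behavior the classifier relies on.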

Using KenLM, I built language models of orders 1-5 for both the Android and iPhone training data. Each language model corresponded to a particular classifier, each of which classified all tweets in the test set as coming from either an Android phone or an iPhone. Only the model using a unigram language model is, strictly speaking, a Naive Bayes classifier. But, as outlined above, the other models are quite similar in their basic logic. To evaluate these classifications, I used a metric called the F1 score, which depends upon two other important metrics, precision and recall. Precision tells us what percentage of tweets that were classified as coming from an Android phone/iPhone were actually from an Android phone/iPhone. Recall tells us what percentage of actual Android/iPhone tweets were correctly classified as Android/iPhone tweets. The F1 score is a single metric that balances both precision and recall.
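For reference, the F1 score is the harmonic mean of precision and recall, 2PR / (P + R). The sketch below computes all three for one class on made-up labels; the post does not say how the scores for the two classes were combined, so that step is left out.

```python
def precision_recall_f1(actual, predicted, positive):
    """Compute precision, recall, and F1 for the class named by `positive`."""
    pairs = list(zip(actual, predicted))
    tp = sum(a == positive and p == positive for a, p in pairs)
    fp = sum(a != positive and p == positive for a, p in pairs)
    fn = sum(a == positive and p != positive for a, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Made-up labels for illustration only.
actual = ["Android", "Android", "Android", "iPhone", "iPhone"]
predicted = ["Android", "Android", "iPhone", "Android", "iPhone"]
precision, recall, f1 = precision_recall_f1(actual, predicted, positive="Android")
print(precision, recall, f1)
```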

At the outset, we said that if we were able to build a successful classifier for Android and iPhone tweets, this would tell us that the text of a tweet alone is sufficient to tell us the tweet's source. But what counts as a successful classifier? We can consider the success of our Android-iPhone classifier by comparing it to a control case in which we build a classifier for two arbitrary classes of tweets. Since there is no principled distinction between these two groups of tweets, the success of any classifier built for these groups must be due to chance. To this end, I combined all Android and iPhone tweets and randomly categorized them into either "Group A" or "Group B." Next, I built classifiers for Group A tweets and Group B tweets, again using an 80%-20% training-test split and again using language models of orders 1-5. If the Android-iPhone classifier performs a good deal better than the Group A-Group B classifier, we know that the former is picking up on a real distinction in the tweets and is not simply getting things right by pure luck.
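The control labels can be generated along these lines. This is a sketch under the assumption that each tweet is assigned independently with equal probability; the post only says the assignment was random, and `random_groups` is a hypothetical name.

```python
import random

def random_groups(texts, seed=0):
    """Pair each tweet text with an arbitrary label, ignoring its true source."""
    rng = random.Random(seed)  # fixed seed so the assignment is reproducible
    return [(text, rng.choice(["Group A", "Group B"])) for text in texts]

# Placeholder strings standing in for the pooled Android and iPhone tweets.
pooled = [f"toy tweet number {i}" for i in range(20)]
control = random_groups(pooled)
print(control[:3])
```

Because the labels carry no information about the text, any classifier trained on them should do no better than guessing, which gives a baseline for the real Android-iPhone scores.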

So, how did the classifiers do? Here are the F1 scores for all the classifiers tested:

The Android-iPhone classifiers perform significantly better than the Group A-Group B classifiers, confirming the hypothesis that there is a difference between the text in Android tweets and that in iPhone tweets. We also see that although there is a slight improvement when using higher-order models, the gains are not that great. The biggest improvement comes from moving from a unigram model to a bigram model, and even the unigram model does a fairly decent job at categorizing the tweets.

Since this classifier gives us the probabilities that each tweet came from either an Android phone or an iPhone, we can use it to answer questions like "What is Trump's Androidiest tweet?" or "What is his iPhoniest tweet?" In other words, which tweets are most likely to have come from an Android phone or an iPhone, just looking at the text of the tweets? To answer this, I built a new 5-gram language model for each tweet in either the original training set or the original test set. The corresponding model for each tweet was trained on every other tweet from either the training or test set. Each model was then used to estimate the probabilities that the held-out tweet came from either an Android phone or an iPhone. In other words, this method predicts the probability of a tweet's source by learning from every tweet except the one whose source is to be predicted.
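The leave-one-out procedure can be sketched as follows. To keep the example self-contained, it swaps the 5-gram KenLM models for a unigram model with add-one smoothing, and the tweet texts are invented strings rather than real tweets. It also assumes every source still has at least one tweet after one is held out.

```python
import math
from collections import Counter

def log10_prob(text, training_texts):
    """Unigram, add-one-smoothed stand-in for a 5-gram KenLM model."""
    counts = Counter(w for t in training_texts for w in t.lower().split())
    total = sum(counts.values())
    vocab = len(counts) + 1  # one extra slot shared by all unseen words
    return sum(
        math.log10((counts[w] + 1) / (total + vocab))
        for w in text.lower().split()
    )

def leave_one_out_scores(tweets):
    """Score each (text, source) pair using models trained on all other tweets.

    Returns (text, source, log_odds) triples sorted from most Android-like
    to most iPhone-like; positive log-odds favor Android.
    """
    results = []
    for i, (text, source) in enumerate(tweets):
        rest = tweets[:i] + tweets[i + 1:]  # hold out the current tweet
        score = {}
        for s in ("Android", "iPhone"):
            texts = [t for t, src in rest if src == s]
            prior = len(texts) / len(rest)
            score[s] = log10_prob(text, texts) + math.log10(prior)
        results.append((text, source, score["Android"] - score["iPhone"]))
    return sorted(results, key=lambda r: r[2], reverse=True)

# Invented toy strings, not real tweets.
toy = [
    ("the system is rigged", "Android"),
    ("the polls are rigged", "Android"),
    ("the system is totally rigged", "Android"),
    ("join us at the rally tonight", "iPhone"),
    ("thank you join us tomorrow", "iPhone"),
    ("join the rally tomorrow", "iPhone"),
]
ranked = leave_one_out_scores(toy)
for text, source, log_odds in ranked:
    print(f"{log_odds:+.2f}  {source:8s} {text}")
```

The top of the sorted list corresponds to the "Androidiest" tweets and the bottom to the "iPhoniest" ones.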

Here are Trump's five Androidiest tweets, all of which did come from an Android phone:

If we assume that Trump's Android tweets are written by Trump himself, we might also think of these as the Trumpiest of all Trump tweets, or at least the most purely Trump.

And here are the five iPhoniest tweets. Again, all of these did, in fact, come from an iPhone:

These results match up pretty well with our intuitions about what distinguishes Android Trump from iPhone Trump. Android Trump touts his poll numbers and says that his opponents are bought and paid for by lobbyists, that the system is rigged, and that Mexico is taking our jobs. iPhone Trump asks people to volunteer for the campaign, shares messages from Ivanka, and advertises rallies. Particularly telling is the contrast between the two Trumps' economic messages. Android Trump tells us that "Mexico is killing us on jobs and trade," while iPhone Trump says, "Instead of driving jobs and wealth away, AMERICA will become the WORLD'S great magnet for innovation & job creation!"

So, what have we learned? The longitudinal data backs up the suspicion that Trump's iPhone tweets actually come from his campaign staff, while the classifier gives us additional evidence that the Android tweets and the iPhone tweets differ significantly in style and content. In fact, even the simplest classifier we looked at performed much better than chance in distinguishing the iPhone tweets from the Android tweets. It's not too hard for a classifier to pick up on the difference between Android Trump and iPhone Trump, which suggests that the difference between the two is quite stark.