Tweet Classification and Comparison of Russian Propagandists and U.S. Politicians
Jenna Brandt and Erin Puckett
Abstract
Research has applied natural language processing (NLP) and neural networks to find and classify troll tweets, both on Twitter broadly and on Russian-language Twitter specifically. However, no research has focused on differentiating Russian trolls from the U.S. politicians they sought to mimic with their disinformation during the 2016, 2018, and 2020 elections in the United States. To close that gap, we have created a model that differentiates between Democrat tweets, Republican tweets, Russian LeftTroll tweets, and Russian RightTroll tweets in order to determine whether neural networks can effectively distinguish between those groups. We used a pre-trained AWD-LSTM text classification model from fastai and also created our own two-layer model with PyTorch. We focused more on the PyTorch model and tuned its hyperparameters to find the best possible accuracy, particularly per-class accuracy. That yielded an overall accuracy of 79.8%, with class-specific accuracies of 70.8% for Democrat tweets, 76.3% for Republican tweets, 80.0% for LeftTroll tweets, and 80.6% for RightTroll tweets. This accuracy demonstrates the effectiveness of neural networks in NLP models and shows how social media companies could deploy a model like ours to prevent election-influencing disinformation from propagating on their sites.
Introduction
During the last two presidential elections in the U.S., disinformation from a variety of sources spread on Twitter, and Russian disinformation was particularly prevalent. This has created a need for an effective way to find the trolls used to propagate that disinformation in order to remove them. As such, we investigate Twitter usage among Democratic and Republican politicians as well as tweets created by Russian trolls, since these trolls often try to imitate politicians’ rhetoric. We have created a neural network model to categorize the author of a tweet as a Democrat, a Republican, a right-wing-imitating Russian troll, or a left-wing-imitating Russian troll. We claim that an NLP model utilizing neural networks can effectively distinguish between these types of tweets and help address the problem of Russian disinformation on Twitter and in American politics.
Finding an effective solution to this issue is as difficult as it is important. Many features of a tweet could be used to determine whether it was written by a troll or an American politician of either party, including message length, frequency, content, time tweeted, number of followers, and retweets. One approach could be to look at account-level details rather than tweet-level details. Often, those behind the trolls (especially in state-sponsored disinformation campaigns like this one) closely study their targets in order to appear as similar as possible. There is an active intent to fool, and that makes the tweets hard to tell apart, particularly as some U.S. politicians have themselves embraced disinformation. With all of that in mind, we decided to focus our algorithm on the content of the tweets, rather than on other features.
We found it difficult, as humans, to assess why our model categorizes certain tweets as it does. The input (the content of the tweet), with its varied length, structure, and presence or absence of hashtags, offers many features the model could rely on to distinguish between the classes; it may be focusing on only one or two of them. For example, a glance through the dataset suggests that troll tweets tend to be shorter and less structured than politicians’ tweets. Perhaps, instead of examining the actual language used in the tweets, our model first categorizes tweets by length, and then uses vocabulary to distinguish political affiliation within the broader troll and politician categories. Regardless of how the model works internally, our results show significant overall success. Thus, we argue that a neural network model can be effectively used to root out trolls on social media, particularly Twitter.
Our model correctly categorizes tweets as Democrat, Republican, LeftTroll, or RightTroll 79.8% of the time. Interestingly, the model had more trouble distinguishing politicians of different parties from each other, and trolls of different pretended affiliations from each other, than it did distinguishing politicians from trolls of the same partisan slant. This points to similarities between the tweets of politicians of opposite parties, and between the tweets of differently affiliated trolls, which is surprising given the current state of political polarization. We had expected the model to have the most difficulty distinguishing Republican from RightTroll tweets because of how some GOP politicians have embraced disinformation. Still, this level of success means that social media companies could try approaches like ours to flag tweets that appear to come from sources intending to disrupt American democracy. The public (general users of social media) would then not be subject to the potentially very harmful messaging of Russian trolls, or of trolls more generally.
When conducting this research, we kept in mind the implications that categorizing tweets as the work of “trolls” rather than real people has for the free speech of social media users. We also considered how a machine learning model like ours might make it easier for those spreading disinformation to evade detection by creating tailored content designed to slip past such classifiers. While the need to identify and remove disinformation from social media made this research worthwhile, we remained aware of the potential negative ethical implications of our classifications.
Related Works
Other researchers have applied machine learning to Twitter, including work focused on politicians. We describe several closely related works below, and then note how our research differs. A University of Michigan and Georgia Tech paper by Im et al. (2020) focuses on classifying Russian troll accounts versus “normal”/control accounts on Twitter. The authors pay particular attention to Russian attempts to interfere with the 2016 U.S. election, and create a machine learning model that predicts Russian troll accounts with high accuracy. This paper, unlike ours, does not address American politicians specifically or make any distinction between Republicans and Democrats.
Another paper, by Badawy, Ferrara, and Lerman (2018) at the University of Southern California (USC), addresses misinformation with machine learning. The researchers used machine learning techniques to study misinformation on Twitter, using Russian trolls as a significant example of such practices. They looked at both liberal and conservative media outlets, and focused particularly on “users who re-shared the posts produced on Twitter by the Russian troll accounts publicly disclosed by U.S. Congress investigation” of 2016 election interference. This paper does not examine actual politicians’ tweets but instead focuses on retweets by private citizens; we focus on the tweets of politicians themselves.
An additional paper, by Stukal et al. (2019) from New York University, uses neural networks to classify tweets from Russian accounts as pro-regime, anti-regime, or neutral. Specifically, the researchers used a deep feedforward neural network that employs words, word pairs, links, mentions, and hashtags to classify trolls as pro-Kremlin, neutral/other, pro-opposition, or pro-Kiev, and obtained “high-confidence predictions for most observations”. That paper focuses only on Russian accounts and does not investigate American politicians, whereas our work concentrates on Russian-produced tweets relating to United States politics as well as on American politicians.
Finally, a paper by Kudugunta and Ferrara (2018), from the Indian Institute of Technology and USC respectively, uses a neural network with a contextual long short-term memory (LSTM) architecture that examines both content and metadata from Twitter accounts to identify bots among real human users. The authors also used synthetic minority oversampling to create a large training dataset. Our paper addresses a similar classification problem of troll tweets versus real human tweets, but with a focus on political speech involving Russian trolls and American Democrat and Republican politicians. Thus, a variety of papers have investigated political tweets, troll tweets, Russian-produced tweets, and differences between troll and human tweets, but none has specifically sought to distinguish Russian trolls, American Republican politicians, and American Democrat politicians. We hope this paper brings some insight into the area of political tweets and Russian interference in American political discourse on Twitter.
Methods
The software we used for this project consisted of Python libraries, namely fastai and PyTorch. We used fastai to create a preliminary text classification model with an averaged stochastic gradient descent (ASGD) weight-dropped long short-term memory (AWD-LSTM) architecture, utilizing transfer learning. We then focused on PyTorch for the remainder of the project. PyTorch allowed us to customize a model more fully and gain an in-depth understanding of how to create neural networks for a natural language processing application involving political tweets. When we refer to “our model” in this paper, we mean this PyTorch model rather than the fastai one, as it is the model we put more effort into.
We used two datasets from Kaggle: the “Russian Troll Tweets” dataset from FiveThirtyEight and the “Democrat Vs. Republican Tweets” dataset from Kyle Pastor. The politicians in the latter dataset are U.S. Representatives in Congress. We eliminated irrelevant data, specifically Russian troll tweets that were not classified as RightTroll or LeftTroll, and cleaned the data to remove non-ASCII characters. The final combined dataset contains around 1 million tweets: roughly 622k RightTroll entries, 340k LeftTroll entries, 44k Republican entries, and 42k Democrat entries.
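As a concrete illustration, the sketch below shows roughly how such a cleaning pass could look in pandas. The file names are placeholders, and the column names (“content” and “account_category” for the troll dataset, “Tweet” and “Party” for the politician dataset) reflect the Kaggle datasets as we recall them, so they should be verified before use.

```python
import pandas as pd

# Hypothetical file names standing in for the two Kaggle CSV downloads
trolls = pd.read_csv("russian_troll_tweets.csv")
pols = pd.read_csv("dem_vs_rep_tweets.csv")

# Keep only the two troll categories we study
trolls = trolls[trolls["account_category"].isin(["LeftTroll", "RightTroll"])]

def to_ascii(text) -> str:
    """Drop non-ASCII characters (emoji, Cyrillic letters, etc.)."""
    return str(text).encode("ascii", errors="ignore").decode()

trolls["text"] = trolls["content"].map(to_ascii)
pols["text"] = pols["Tweet"].map(to_ascii)

# Merge both sources into one labeled dataset
combined = pd.concat(
    [
        trolls[["text", "account_category"]].rename(columns={"account_category": "label"}),
        pols[["text", "Party"]].rename(columns={"Party": "label"}),
    ],
    ignore_index=True,
)
```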
Once our data was cleaned, we built a transfer learning model using a pre-trained AWD-LSTM text classification model from fastai. That model was pre-trained on Wikipedia to guess the next word given all of the previous words, following the Universal Language Model Fine-tuning (ULMFiT) approach. We reorganized our dataset into an ImageNet-style layout: each individual tweet was placed in its own .txt file inside a folder for its class, within top-level folders for training and testing data. We first tested the model on a small dataset of 4,000 entries; once convinced we had fixed the bugs, we moved on to the full dataset. With the full dataset of around 1 million tweets, we fine-tuned the fastai model for 20 epochs, with a base learning rate of 0.001 and a dropout rate of 25%.
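A minimal sketch of this fine-tuning step with the fastai v2 API follows. The folder path is a placeholder, and we express the 25% dropout as fastai’s `drop_mult=0.25` multiplier, which is an assumption on our part rather than a confirmed detail of our setup.

```python
from fastai.text.all import *

# ImageNet-style layout: tweets_by_class/train/<class>/*.txt and .../test/<class>/*.txt
dls = TextDataLoaders.from_folder("tweets_by_class", train="train", valid="test")

# AWD-LSTM classifier pre-trained on Wikipedia, per the ULMFiT approach
learn = text_classifier_learner(dls, AWD_LSTM, drop_mult=0.25, metrics=accuracy)

# Fine-tune for 20 epochs at a base learning rate of 0.001
learn.fine_tune(20, base_lr=1e-3)
```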
After using transfer learning, we built a text classification model from scratch using PyTorch, specifically torchtext. We wanted to see whether we could achieve similar or better results when we controlled the hyperparameters and trained the model on our dataset from the beginning. We first split the data into training and test sets, keeping it in .csv format rather than the ImageNet style. After creating a vocabulary with built-in torchtext functions and splitting the tweets into batches, we created and trained a text classification model that takes the vocabulary size, embedding dimension, and number of classes as parameters. The model’s architecture comprises two layers: an EmbeddingBag layer followed by a final linear layer. In our loss function, we adjusted the class weights to account for the fact that there were many more troll tweets (especially RightTroll tweets) than politician tweets; the weights were 1.0, 1.0, 0.125, and 0.0714 for the Democrat, Republican, LeftTroll, and RightTroll classes, in that order. In the run with our final hyperparameters, we trained the model for 30 epochs, with an initial learning rate of 1, a scheduler gamma of 0.99, and a batch size of 64.
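A sketch of this architecture and training setup, in the style of the standard torchtext text classification tutorial, is shown below. The vocabulary size and embedding dimension here are placeholder values, not the ones we used, and the once-per-epoch scheduler step is our assumption.

```python
import torch
from torch import nn

class TextClassificationModel(nn.Module):
    """Two layers: an EmbeddingBag (mean-pools token embeddings) and a linear classifier."""
    def __init__(self, vocab_size: int, embed_dim: int, num_class: int):
        super().__init__()
        self.embedding = nn.EmbeddingBag(vocab_size, embed_dim)
        self.fc = nn.Linear(embed_dim, num_class)

    def forward(self, text, offsets):
        # text: concatenated token ids for the whole batch
        # offsets: starting index of each tweet within `text`
        return self.fc(self.embedding(text, offsets))

# Placeholder sizes; vocab_size comes from the torchtext vocabulary in practice
model = TextClassificationModel(vocab_size=50_000, embed_dim=64, num_class=4)

# Class weights for Democrat, Republican, LeftTroll, RightTroll, in that order
weights = torch.tensor([1.0, 1.0, 0.125, 0.0714])
criterion = nn.CrossEntropyLoss(weight=weights)

# Initial learning rate of 1, decayed by gamma = 0.99 (assumed: one step per epoch)
optimizer = torch.optim.SGD(model.parameters(), lr=1.0)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=1, gamma=0.99)
```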
Creating the PyTorch model was difficult because updates to certain PyTorch modules, particularly torchtext, have rendered some features of the library unusable. Some of the tutorials we initially attempted to leverage were therefore not helpful, and we were unable to build a model following their instructions. We also weighed different approaches to determining the best architecture for our model, as there are countless possible ways to arrange layers. We wanted something efficient but simple, which is why we settled on our two-layer architecture.
Using the HPC enabled us to train our models in a reasonable amount of time without using our own laptops’ CPUs. On the HPC, we first trained and tested both models on a small dataset of 4,000 tweets, with 1,000 tweets from each class. The fastai and PyTorch models were implemented in separate Jupyter Notebooks. Once secure in the correctness of the code, we ran both models on the full dataset. Because fine-tuning the fastai model for 20 epochs took over four hours, we used tmux to let the notebook run on the server without staying connected the whole time. The PyTorch model took far less time to train, on the order of 10 minutes for 10 epochs, so it was not necessary to leave it running and come back later.
For data analysis we used the matplotlib, seaborn, and mlxtend libraries. They allowed us to visualize the results of both the fastai and PyTorch text classification models as confusion matrices, so we could see where each model succeeded and where it struggled. The visualizations also allowed us to compare the two models.
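For instance, a per-model confusion matrix can be produced along these lines; the label arrays here are tiny stand-ins for the real validation-set labels and model predictions.

```python
import numpy as np
import matplotlib.pyplot as plt
from mlxtend.evaluate import confusion_matrix
from mlxtend.plotting import plot_confusion_matrix

classes = ["Democrat", "Republican", "LeftTroll", "RightTroll"]

# Stand-in labels; in practice these come from the validation set and model outputs
y_true = np.array([0, 1, 2, 3, 3, 1, 0, 2])
y_pred = np.array([0, 1, 2, 3, 2, 0, 0, 3])

cm = confusion_matrix(y_target=y_true, y_predicted=y_pred)
fig, ax = plot_confusion_matrix(conf_mat=cm, class_names=classes, show_normed=True)
plt.show()
```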
Lastly, to implement model inference, where the trained PyTorch model outputs the predicted class for a single tweet, we applied the same pre-processing steps as for the training and validation data. Essentially, for each tweet we wanted a prediction for, we created a dataset of one tweet and assigned it an arbitrary label. While roundabout, this method let us make predictions using the saved model and vocabulary in a Jupyter Notebook separate from the one in which the model was built.
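A condensed version of that inference path might look like the following; the `model` and `vocab` objects are assumed to be restored from files saved by the training notebook (the names and paths are ours), and the example tweet is invented.

```python
import torch
from torchtext.data.utils import get_tokenizer

# Assumed artifacts saved by the training notebook (hypothetical names/paths):
# model = torch.load("tweet_classifier.pt")
# vocab = torch.load("vocab.pt")

labels = ["Democrat", "Republican", "LeftTroll", "RightTroll"]
tokenizer = get_tokenizer("basic_english")

def predict(tweet: str) -> str:
    model.eval()
    with torch.no_grad():
        token_ids = torch.tensor(vocab(tokenizer(tweet)), dtype=torch.int64)
        offsets = torch.tensor([0])  # a batch containing a single tweet
        logits = model(token_ids, offsets)
    return labels[logits.argmax(1).item()]

print(predict("Proud to cast my vote for this bill today!"))  # invented example
```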
Discussion
Both our fastai model and our PyTorch model yielded significant results. The fastai model achieved an overall accuracy of 87.8%, with per-class accuracies of 73.2% for Democrat tweets, 73.4% for Republican tweets, 84.7% for LeftTroll tweets, and 91.7% for RightTroll tweets. The PyTorch model achieved an overall accuracy of 79.8%, with class-specific accuracies of 70.8% for Democrat tweets, 76.3% for Republican tweets, 80.0% for LeftTroll tweets, and 80.6% for RightTroll tweets.
For the PyTorch model, we tuned the hyperparameters to obtain the best possible outcome. We eventually settled on 30 epochs, an initial learning rate of 1, a scheduler gamma of 0.99, and a batch size of 64, because that iteration of the weighted model had the lowest validation cost at the end of the final epoch. We reached this configuration after trying different hyperparameter values and after adding weights to the loss function to reflect each class’s share of the dataset. The full details of the hyperparameters tested and their effects on the PyTorch model appear in Table 1.
Table 1: Hyperparameter configurations tested and their resulting accuracies and validation costs.

| Model | Framework | Weighted? | Weights (Dem, Rep, LTroll, RTroll) | Initial LR | Batch Size | Scheduler Gamma | Epochs | End Accuracy | Dem. Accuracy | Rep. Accuracy | LTroll Accuracy | RTroll Accuracy | End Valid. Cost |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | PyTorch | No | 1.0, 1.0, 1.0, 1.0 | 5 | 64 | 0.99 | 10 | 83.4% | 61.35% | 71.58% | 76.28% | 89.64% | 339.41 |
| 2 | PyTorch | Yes | 1.0, 1.0, 0.125, 0.0714 | 5 | 64 | 0.99 | 10 | 81.3% | 70.00% | 76.70% | 75.2% | 85.8% | 545.426 |
| 3 | PyTorch | Yes | 1.0, 1.0, 0.125, 0.0714 | 3 | 64 | 0.99 | 10 | 80.6% | 67.1% | 81.3% | 75.4% | 84.3% | 503.369 |
| 4 | PyTorch | Yes | 1.0, 1.0, 0.125, 0.0714 | 1 | 64 | 0.99 | 10 | 77.2% | 68.8% | 75.7% | 74.7% | 79.2% | 498.563 |
| 5 | PyTorch | Yes | 1.0, 1.0, 0.125, 0.0714 | 1 | 64 | 0.99 | 20 | 78.3% | 70.4% | 74.2% | 80.4% | 77.0% | 495.936 |
| 6 | PyTorch | Yes | 1.0, 1.0, 0.125, 0.0714 | 1 | 64 | 0.99 | 30 | 79.8% | 70.8% | 76.3% | 80.0% | 80.6% | 490.866 |
| 7 | PyTorch | Yes | 1.0, 1.0, 0.125, 0.0714 | 1 | 64 | 0.99 | 40 | 79.8% | 71.3% | 75.1% | 81.4% | 79.8% | 510.69 |
| 8 | fastai | No | 1.0, 1.0, 1.0, 1.0 | 0.001 | N/A | N/A | 20 (fine-tuning) | 87.8% | 73.2% | 73.4% | 84.7% | 91.7% | N/A |
The results from the fastai model and the variations of the PyTorch model show promise. However, they do not reveal a clear winner between the fastai and PyTorch models, or even between different iterations of the PyTorch model. While the fastai model had better overall accuracy than all of the PyTorch models, some versions of the PyTorch model, particularly the weighted ones, had more consistent per-class accuracies than the fastai model. This is consistent with the fact that the fastai model used unweighted classes despite the unbalanced class sizes. Among the weighted models, a higher initial learning rate produced slightly better overall accuracy, but an initial learning rate of 1 produced more consistent per-class accuracies. Furthermore, the unweighted version of the PyTorch model had a significantly lower end validation cost than the weighted versions, which surprised us, as we expected the weighted version to perform better and thus have a lower validation cost. Despite the lower cost, we kept the weights because we wanted the model to perform more equally across the classes.
The confusion matrices show that both models have greater difficulty distinguishing Democrat from Republican tweets, and LeftTroll from RightTroll tweets, than distinguishing groups of the same partisan slant from each other. In both the confusion matrix for the fastai model (Image 1) and the confusion matrix for the PyTorch model with a gamma of 0.99 (Image 2), the largest share of misclassified Democrat tweets were labeled Republican, and vice versa; similarly, the largest share of misclassified LeftTroll tweets were labeled RightTroll, and vice versa. This was surprising and heartening, as it suggests that the Russian propagandists were mostly unsuccessful in emulating U.S. politicians, and that the divide between the two parties is perhaps smaller than expected, at least in how they structure and word their tweets.
Image 1: Confusion matrix for the fastai model.
Image 2: Confusion matrix for the PyTorch model.
Given these results, we have found that an NLP model using neural networks can be effectively used to differentiate between Democrat, Republican, LeftTroll, and RightTroll tweets. This builds on the success of previous work and gives hope to the cause of rooting out disinformation from Twitter and social media more broadly, where it can cause political polarization and influence election results.
Compared to previous work, our work investigates a specific inquiry related to politicians: how well an NLP model using neural networks can distinguish between American politician tweets and Russian troll tweets, divided by Democrat/Republican and RightTroll/LeftTroll partisan affiliations. Other researchers have used machine learning to uncover Russian troll tweets, but not in the context of comparing them to American politicians. Furthermore, because our NLP model can quite accurately distinguish American politician tweets from Russian troll tweets, our work demonstrates that social media companies may be able to use a model such as ours to flag and/or delete tweets likely to come from troll accounts, limiting the influence of foreign governments on American politics.
Ethical Considerations
When conducting this research, we kept in mind the ethical implications of categorizing tweets as the work of “trolls” rather than real people. Freedom of speech is an important right protected by the U.S. Constitution, and people value having it on social media; since social media companies purport to be simply platforms for sharing one’s thoughts, flagging or removing tweets would represent a departure from that role. By classifying tweets as troll tweets, and thus potentially misclassifying real human tweets, we risk dismissing real people’s thoughts as part of a malicious campaign to harm American democracy. However, given the moderately high accuracy of the model and the importance of identifying disinformation, we think the potential benefits of a better-informed public online outweigh concerns about limits on free speech. Freedom of speech does have limitations, and flagging a tweet as possibly authored by a foreign entity would not actually censor any speech. If social media companies were to delete tweets likely written by trolls, especially those promoting falsehoods or violence, we submit that there is a compelling interest in requiring them to do so, given the need to maintain American democracy.
Our second ethical consideration is that, if a system such as ours were implemented by Twitter, Russian trolls could determine which of their tweets were flagged. They could then use that feedback, for example in a generative adversarial network, to improve the quality of troll tweets, which is itself an ethical quandary: would we be enabling the increased proliferation of harder-to-identify disinformation? Developers pursuing a model such as ours for actual use on a social media platform would need to consider this question.
Reflection
A further step would be to compare the troll tweets to political tweets not written by politicians. As this has been done before to some extent, it would be interesting to see how our model, trained with politician tweets as the ‘non-troll’ class, would fare compared to models trained on a broader set of political troll and human tweets. As we have noted, we suspect the model relies on the greater formality of politician tweets’ structure to distinguish them from troll tweets, and then uses differences in vocabulary to distinguish political or party affiliation. Since it struggles more with the latter, and since a layperson who is not a troll likely writes less formally structured tweets than a politician, our model would probably perform worse on such data than on the dataset we trained it on.
Other forms of disinformation have also proliferated on Twitter, and on other social media sites like Facebook and Reddit. QAnon is an example of another strain of political disinformation with the potential to wreak havoc. Recently, research from The Soufan Center has suggested that up to 20% of QAnon posts on Facebook in 2020 were created by nation-states, including Russia, China, Iran, and Saudi Arabia. Models like ours could be used to weed out that disinformation. There is so much potential in this field to do good, if only the ethical implications for free speech are duly considered.
Our results show the geopolitical importance of technological tools like our neural network models. With such technology, democratic states have a way to fight back against undemocratic states bent on shaping elections to their liking. While diplomatic tools will of course be required, this presents an option the private sector can use unilaterally; it requires no negotiations or agreements. However, because the burden falls on the private sector, states will need to incentivize social media companies to act, which presents its own challenges. Alternatively or additionally, the U.S. government, whether state or federal, may be able to pass legislation requiring some regulation of tweets or posts on social media, perhaps including but not limited to laws designed to limit the spread of false information presented as fact. Free speech concerns and constitutional questions may arise, but we believe that narrowly tailored laws (perhaps similar to regulations on television content) that advance the compelling interest of preserving American democratic institutions would survive constitutional scrutiny. Regardless of whether such models are mandated by the public sector or implemented independently by private companies, we believe neural networks like ours that identify foreign manipulation in American politics can be a critical tool in combating disinformation.
Bibliography
Badawy, A., Ferrara, E., & Lerman, K. (2018). Analyzing the Digital Traces of Political Manipulation: The 2016 Russian Interference Twitter Campaign. 2018 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining (ASONAM). https://doi.org/10.1109/ASONAM.2018.8508646
Im, J., Chandrasekharan, E., Sargent, J., Lighthammer, P., Denby, T., Bhargava, A., Hemphill, L., Jurgens, D., & Gilbert, E. (2020). Still out there: Modeling and Identifying Russian Troll Accounts on Twitter. 12th ACM Conference on Web Science. https://doi.org/10.1145/3394231.3397889
Kudugunta, S., & Ferrara, E. (2018). Deep Neural Networks for Bot Detection. Information Sciences. https://doi.org/10.1016/j.ins.2018.08.019
The Soufan Center. (2021). Quantifying the Q Conspiracy: A Data-Driven Approach to Understanding the Threat Posed by QAnon. https://thesoufancenter.org/research/quantifying-the-q-conspiracy-a-data-driven-approach-to-understanding-the-threat-posed-by-qanon/
Stukal, D., Sanovich, S., Tucker, J. A., & Bonneau, R. (2019). For Whom the Bot Tolls: A Neural Networks Approach to Measuring Political Orientation of Twitter Bots in Russia. SAGE Open. https://doi.org/10.1177/2158244019827715