Text Analysis of 2020 Congressional Election Candidates

Last updated on May 18, 2021

This is a project that I worked on during my time as a Graduate Research Assistant at the Massive Data Institute. While the project is still a work in progress, a brief description of the methodology and preliminary results can be found below.

Project Overview:

In this project, we use text from the campaign websites and Twitter feeds of 2020 congressional election candidates to develop several measures of political ideology. We then use these measures to test the effect of ideology on electoral performance.

We begin by scraping text from the campaign websites of over 1,400 candidates, including both incumbent members of Congress and non-incumbent candidates. We complement the website data with the content of candidates’ publicly available Twitter feeds, obtained through the Twitter API. After collecting the website and Twitter text, we apply a variety of preprocessing methods to prepare the data for analysis. Next, we use techniques such as cosine similarity, custom dictionary methods, and a Wordfish model to score each candidate on an ideological scale. Finally, we use these ideological scores to assess the relationship between ideology and electoral performance.

Preliminary results suggest that being more moderate, as opposed to more liberal for Democrats or more conservative for Republicans, is associated with stronger electoral performance (as measured by vote share). Although the project is still in progress, its potential contributions include methodological insights into novel techniques for assessing political ideology, substantive insights into the effect of ideology on electoral performance, and data assets that other researchers can use in future analyses.
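To make the cosine similarity step more concrete, here is a minimal sketch of how a candidate's text might be scored against two reference texts. It is an illustration under stated assumptions, not the project's actual pipeline: the function names, the use of raw term frequencies, and the idea of subtracting one similarity from the other are all hypothetical choices made for this example, and the reference texts are toy strings.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def cosine_similarity(a, b):
    """Cosine similarity between two term-frequency Counters."""
    dot = sum(count * b[term] for term, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty document is treated as similar to nothing
    return dot / (norm_a * norm_b)


def ideology_score(candidate_text, liberal_ref, conservative_ref):
    """Hypothetical score in [-1, 1].

    Negative values mean the candidate's text is closer to the liberal
    reference text; positive values mean it is closer to the
    conservative reference text. This scoring rule is an assumption
    made for illustration only.
    """
    cand = Counter(tokenize(candidate_text))
    lib = Counter(tokenize(liberal_ref))
    con = Counter(tokenize(conservative_ref))
    return cosine_similarity(cand, con) - cosine_similarity(cand, lib)


# Toy reference texts and candidate text (made up for this example).
liberal_ref = "expand healthcare protect climate raise wages"
conservative_ref = "cut taxes secure border protect gun rights"
candidate = "We will cut taxes and secure the border"

score = ideology_score(candidate, liberal_ref, conservative_ref)
print(score)
```

In a real analysis, the reference texts would need to be chosen carefully (for example, text from candidates with well-established ideological positions), and weighting schemes such as TF-IDF or stop-word removal during preprocessing could change the scores substantially.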

Andy Green

Data Scientist & Researcher