Alexander Klapheke

I’m a data scientist and self-taught coder (Python, R, Haskell) with a background in linguistics and math and over 10 years’ experience explaining technical concepts to laypeople.

COVID-19 sentiment analysis


This project is joint work with Luken Weaver, Jon Godin, and Reza Farrokhi.

The Master said, “When the multitude hate a man, it is necessary to examine into the case. When the multitude like a man, it is necessary to examine into the case.”

1 Background & problem statement

The COVID-19 crisis officially began in the US on January 20, when an infected traveler from Wuhan flew home to Washington State.1 In the months since, in the absence of a unified national policy, states have engaged in a complicated fandango of business closures, mask orders, and quarantines for visitors. While it would take a much more sophisticated data analysis to gauge the effectiveness of these policies, we can look at how they were popularly received, using social media sentiment as a proxy.

We chose to compare sentiment between two cities in which the saga took quite different turns: New York, which suffered both an alarming outbreak and an austere lockdown (ongoing at time of writing), and Houston, which had a much easier time, with fewer infections than New York City had deaths, and a lockdown lasting barely a month (fig. 1).

Figure 1: Case and death counts in New York City (all five boroughs) and in Harris County, Texas (which contains Houston), as of June 22, with the statewide lockdown periods marked. Data courtesy of the New York Times.

A naïve hypothesis would be that Houstonians, faring better overall, would speak with more positive affect. Complicating this, of course, is the profusion of topics on which people might converse, including: China, Donald Trump, Anthony Fauci, the CDC and WHO, governors Andrew Cuomo2 and Greg Abbott3, cruise ships, quarantines, face masks, and the cancellations of events, not to mention the pathogen itself and its physiological effects. Our goal, therefore, was to isolate topics of conversation, and examine the sentiment surrounding particular public figures.

2 Data collection

Our preferred data source was Twitter, which, with a third of a billion users, captures an enormous fraction of the public discourse. In addition, there is an up-to-date, curated dataset of COVID-19-related tweets.4 However, time and budget allowed few tweets to be captured: either on the order of 10³ in total, or stretching back only seven days. In addition, the paucity of geotags (by some estimates, on the order of 1% of tweets) would make filtering by city nearly impossible.

Reddit turned out to be somewhat more hospitable: Pushshift provides free API access to Reddit data, as well as a full dataset of Reddit comments stretching back to the site’s founding,5 although the latter had not been updated recently enough for our purpose. Reddit users subscribe to interest communities called “subreddits”, prefixed with “/r/”, some of which are devoted to cities and other locales. While it can’t be verified that a particular subscriber is a resident, we assume that a large proportion of them are. One downside of this structure is that subreddits can become echo chambers which do not represent the larger population.

Another difficulty is that although Reddit ranks in the top 20 websites by traffic, its volume of posts pales next to Twitter’s. Each metropolitan subreddit records hundreds of thousands of subscribers, but by the 90–9–1 rule of thumb we guessed that in a subreddit with on the order of 10⁵ subscribers, only about 10³ would be active commenters, as shown in tbl. 1.

Table 1: City and subreddit populations and commenter estimates

Subreddit     City pop.    Subscribers    Casual commenters    Active commenters
/r/nyc        8,300,000    225,000        ≈20,000              ≈2,000
/r/houston    2,300,000    150,000        ≈13,500              ≈1,500

We focused on comments, since most of the substantive (read: sentiment-laden) discussion happens there; however, comments are not guaranteed to contain topic-relevant keywords, so we instead searched for posts whose titles contained any of the keywords “COVID”, “coronavirus”, “quarantine”, or “pandemic”. Reddit generates a base-36 ID for each post and comment on the site, so we collated post IDs with comment IDs, as the simplified code snippet below illustrates.
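Something like the following sketch conveys the idea, querying the Pushshift REST endpoints directly (endpoint and parameter names follow the Pushshift documentation of the time; pagination, rate limiting, and error handling are omitted):

    import requests

    BASE = "https://api.pushshift.io/reddit/search"
    KEYWORDS = ["COVID", "coronavirus", "quarantine", "pandemic"]

    def topic_post_ids(subreddit):
        """Collect base-36 IDs of posts whose titles contain a topic keyword."""
        ids = set()
        for kw in KEYWORDS:
            r = requests.get(f"{BASE}/submission/",
                             params={"subreddit": subreddit,
                                     "title": kw,   # matches against post titles
                                     "size": 100})
            ids.update(post["id"] for post in r.json()["data"])
        return ids

    def comments_for(post_id):
        """Fetch the comments attached to one post, by its base-36 ID."""
        r = requests.get(f"{BASE}/comment/",
                         params={"link_id": post_id, "size": 100})
        return r.json()["data"]

    comments = [c for pid in topic_post_ids("nyc") for c in comments_for(pid)]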

In toto, we collected 150,000 comments from /r/nyc and 85,000 from /r/houston, then randomly sampled 85,000 of the New York comments for the analysis, so that the two samples were equal in size and comparable in variance.

3 Sentiment analysis

The sentiment analysis itself was done with the VADER sentiment analyzer,6 a sophisticated rule-based model that incorporates contextual information such as negations (“not”), intensifiers (“very”), hedges (“somewhat”), and even emoticons (“:)”).7 For these reasons, it works best on unprocessed text.
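A minimal example, assuming the standalone vaderSentiment Python package (the compound field is the signed intensity score described below):

    from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

    analyzer = SentimentIntensityAnalyzer()

    # Negations, intensifiers, punctuation, and emoticons all carry
    # signal for VADER, so the raw comment text is scored as-is.
    scores = analyzer.polarity_scores("The lockdown is not bad :)")
    print(scores["compound"])  # signed intensity in [-1, +1]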

It provides a “sentiment intensity” score which is signed for polarity (positive good, negative bad). We took the 5-day rolling mean of the sentiments of all comments, filtering for various keywords.
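Concretely, the filtering and smoothing take only a few lines of pandas (a sketch; the column names here are illustrative):

    import pandas as pd

    def keyword_sentiment(df, keyword):
        """5-day rolling mean sentiment of comments mentioning `keyword`.

        Assumes a DataFrame with a datetime column 'created', the raw
        comment text in 'body', and the VADER compound score in 'sentiment'.
        """
        hits = df[df["body"].str.contains(keyword, case=False, na=False)]
        series = hits.set_index("created").sort_index()["sentiment"]
        return series.rolling("5D").mean()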

Two important caveats apply. The first is that knowing what the sentiment is doesn’t tell you why. For example, in late April, /r/nyc responded negatively to the keyword “Fauci”, but this turned out to be displeasure not with Fauci himself, but with Trump, who had considered firing him. Thus, sentiment around a public figure should not be conflated with sentiment toward that figure. The second caveat is that positive and negative sentiments can cancel: a highly positive or negative overall score reflects unanimity of feeling, but a neutral score does not necessarily represent apathy; it may simply reflect a lack of consensus.

One way consensus is achieved on Reddit is by voting: users vote for (“upvote”) or against (“downvote”) comments, and a comment’s score is the number of votes for minus the number against. If voting signals agreement with the sentiment of a comment, then an upvote can be seen as tacitly expressing that same sentiment. We accounted for this by eliminating comments that scored less than one and weighting the sentiment of the remainder by their scores.
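In code, this amounts to a score-weighted average (a sketch, assuming per-comment vote scores and VADER sentiments as NumPy arrays):

    import numpy as np

    def weighted_sentiment(vote_scores, sentiments):
        """Score-weighted mean sentiment over a set of comments.

        Comments scoring below 1 are dropped; each remaining comment's
        sentiment is weighted by its (positive) score.
        """
        vote_scores = np.asarray(vote_scores)
        sentiments = np.asarray(sentiments)
        keep = vote_scores >= 1
        return np.average(sentiments[keep], weights=vote_scores[keep])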

One potential pitfall is that since the highest-scoring comments appear highest on the page, this weighting is subject to the bandwagon effect, by which users see, and then upvote, comments that are already popular.

We graphed the weighted sentiments of COVID-related comments in 2020 (fig. 2). We also needed a baseline sentiment for comparison (perhaps sentiment is more positive in haler times, or perhaps New Yorkers really are less positive in general), so we looked at comments from the same months of the previous year, without filtering by keyword (fig. 3).

Figure 2: COVID-related sentiment in New York and Houston, 2020 (5-day rolling mean)

The overall results were unilluminating; the only particularly positive or negative periods, in February 2020, are essentially artifacts due to insufficient data (the virus had not yet seized the nation’s attention).

Figure 3: General sentiment in New York and Houston, 2019 (5-day rolling mean)

A clearer picture arose around public figures. After filtering by the surname of each state’s governor, we see a positive spike in New York in the early days of the shutdowns, whereas Houston turns sharply negative when Abbott declares lockdown and swings positive after he orders its end (fig. 4).

Figure 4: Sentiment around governors, 2020 (5-day rolling mean; keyword “Cuomo” for New York, “Abbott” for Houston)

This seems to imply that New Yorkers, having been much harder hit, were more sanguine about curtailments of public life than the relatively freewheeling Texans.

A similarly revealing pattern appears in response to the keyword “Trump” (fig. 5). While comments about the president trended negative overall, they dipped steeply in late April, possibly in response to his defunding of the WHO.

Figure 5: Sentiment around President Trump in New York and Houston, 2020 (5-day rolling mean; keyword “Trump” in both cities)

4 Conclusions

We have shown that public attitudes as expressed on social media can be tied to world events, and analyzed attitudes toward events during the COVID-19 crisis. This analysis could be expanded with, e.g., a more sophisticated weighting scheme, more precise filtering, or a larger dataset.


  1. Michelle L. Holshue et al., “First Case of 2019 Novel Coronavirus in the United States,” New England Journal of Medicine 382 (2020): 929–36, doi:10.1056/nejmoa2001191.

  2. Dem., NY, in office since 2011

  3. Rep., TX, in office since 2015

  4. Juan M. Banda et al., “A Large-Scale COVID-19 Twitter Chatter Dataset for Open Scientific Research—an International Collaboration,” May 2020, doi:10.5281/zenodo.3819464.

  5. Jason Baumgartner et al., “The Pushshift Reddit Dataset,” in Proceedings of the International AAAI Conference on Web and Social Media, vol. 14, 1 (Palo Alto, CA: AAAI Press, 2020), 830–39, https://aaai.org/ojs/index.php/ICWSM/article/view/7347.

  6. Clayton J. Hutto and Eric E. Gilbert, “VADER: A Parsimonious Rule-Based Model for Sentiment Analysis of Social Media Text,” in Eighth International Conference on Weblogs and Social Media, ed. Eytan Adar et al. (Palo Alto, CA: AAAI Press, 2014), 216–25, http://www.aaai.org/ocs/index.php/ICWSM/ICWSM14/paper/view/8109.

  7. Some sample inputs and scores are shown below:

    Text           Score
    great!         +0.6588
    great          +0.6249
    very good      +0.4927
    good           +0.4404
    not bad        +0.4310
    somewhat good  +0.3832
    okay           +0.2263
    meh            −0.0772
    not good       −0.3412
    bad            −0.5423