
Reddit: The front page of the internet

Reddit is a popular social media website founded in 2005. It’s both a discussion and aggregator platform. The website mixes features of a traditional forum with subforums, a bulletin board and an image based social media platform. Each subforum (or subreddit) has its own community, with its own set of rules and norms. Yet this description does not do justice to the essence of Reddit and the communities which inhabit it.

All the above features come together to create a uniquely paradoxical platform: one which allows for both in-depth long form discussions, typical of older forums, and quick fleeting consumption of entertainment content, typical of other social media platforms like Instagram and Twitter.

Popularity and demographics

There are over 330 million monthly active users who are part of over 1.2 million communities, with over 150,000 of these being active.¹ These communities range in size from 18 million subscribers to as few as a handful.² And this only accounts for users who have clicked a ‘join’ button. It does not count users who do not join subreddits, or people who browse the website without a user account.

According to the Pew Research Center (2016), the average user is American, young, male and likely to be college educated. 45% of users are between the ages of 18 and 29, and a significant number of 30–49 year olds also use it. 46% of Reddit app users have a college degree or higher, while 40% have a high school degree (Agrawal, 2016). The website has grown significantly in recent years, driven by the popularity of the official Reddit app and third party apps. In 2018 it very briefly overtook Facebook to become the third most visited website in the United States, and overtook Twitter by monthly active users (Digital Consulting, 2018).

What is really being studied?

Reddit is broadly approached in two major ways by researchers. Either Reddit is viewed as a social phenomenon to be investigated in itself, or it is used as a proxy to investigate another social phenomenon.

When I say Reddit itself is being investigated I am simplifying. It can mean a number of things. It can mean the entire website, hundreds of subreddits, a handful of subreddits, or just a single one. It can mean users and their interactions through comments, posts and upvotes/downvotes. Or it can mean a social phenomenon mediated or created by Reddit. A great example of this is Chandrasekharan et al (2018), who investigated the hidden norms of the website. Another is Squirrell (2019), who investigated the relationship between volunteer moderators and the userbase on the site. A third is the much broader work of J. Nathan Matias and the non-profit Civilservant.io to help create a fairer, safer, more understanding internet.

The second approach is to use Reddit as a proxy to investigate another social phenomenon. This means that at the end of the day the researcher isn’t really studying Reddit. They’re studying something else and using Reddit as a source to study it. This can include collecting data on user comments, posts, upvotes, downvotes and interactions on a particular topic or event. In addition, subreddit communities are used as convenience sampling frames for traditionally hard to reach groups. Two illustrative examples are Nunes & Filho’s (2018) study on consumer behaviour towards Google Glass and Pilkington and Rominov’s (2017) study on the types of worry exhibited by fathers during pregnancy.

What type of academics and disciplines study Reddit?

Around half of all academics who research Reddit fall under computer science, followed by the social sciences. Traditionally when people talk about ‘the social sciences’ they commonly mean: anthropology, communication studies, economics, history, musicology, human geography, linguistics, political science, psychology, public health, education, social and media studies, and sociology.

[Figure: graphic of the disciplines that HCI draws from.]
Credit: Chackraborty et al (2017).

However, nearly all of these computer science researchers come from the subdiscipline of Human Computer Interaction (HCI). HCI studies the interaction between humans and technology. As seen from the graphic, HCI draws from a diversity of other disciplines and approaches, including many of the social sciences. So saying research output comes from computer science and the social sciences can be a bit misleading.

A review of the academic literature from 2008 to May 2019 identified 211 research outputs on Reddit. These included peer reviewed journal articles, book chapters and conference papers. The following disciplines and subdisciplines were represented:

  • Computer science (HCI)
  • Health
  • Communications
  • Culture and Media
  • Information systems
  • Gender studies
  • Political studies
  • Public policy
  • Sociology
  • Criminology
  • Psychology
  • Linguistics
  • Business studies
  • Journalism
  • Family studies
  • Anthropology
  • Musicology
  • Human geography

The above refers to my own systematized review, which was conducted over two months. This type of review is defined by Grant & Booth (2009) as one which includes one or more elements of the systematic review process, while stopping short of claiming that the resultant output is a systematic review. It is common for postgraduate researchers (that’s me!) who do not have the resources to conduct a full systematic review, such as time and funding for additional researcher(s).

Tools of the trade

The Reddit Corpus: Pushshift.io

You really can’t mention research and Reddit without coming across what academics call the ‘Reddit Corpus’ in the literature.

The Corpus refers to a database of all Reddit comments and posts since the site’s creation in 2005. The Corpus was created and hosted on Pushshift.io in 2015. The website was made by Jason Baumgartner, who is both a data scientist and a Reddit user. Barring technical issues, the Corpus is usually updated every month with the most recent comments and posts. It can be accessed and searched with Python.

Baumgartner and his website have been invaluable for research. At the time of writing this post, 129 articles appearing on Google Scholar have explicitly used and mentioned the database. And my own systematized review identified 200+ published research outputs involving Reddit.



In addition, for researchers who do not currently have the skills to take full advantage of the Reddit Corpus, Baumgartner has written a search engine wrapper which uses Pushshift data. The website Redditsearch.io is a brilliant application with a minimal interface and data rich search features. It allows you to search for terms in comments and posts, with further options to narrow down by subreddit, author, domain, and time period.
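For readers curious what this kind of narrowing looks like in Python, here is a minimal sketch. It is not Pushshift's or Redditsearch.io's own code. It assumes comments are stored as newline-delimited JSON with the fields `author`, `subreddit`, `body` and `created_utc`; the sample records and the function name `search_comments` are made up for illustration. Domain filtering is left out because it applies to posts rather than comments.

```python
import json
from datetime import datetime, timezone

# Hypothetical sample data, one JSON record per line.
# The field names are an assumption about how Corpus-style comment data is laid out.
SAMPLE_DUMP = """\
{"author": "alice", "subreddit": "technology", "body": "Google Glass was ahead of its time", "created_utc": 1546387200}
{"author": "bob", "subreddit": "technology", "body": "I never tried Google Glass", "created_utc": 1514764800}
{"author": "carol", "subreddit": "AskReddit", "body": "Glass half full, I suppose", "created_utc": 1559347200}
{"author": "dave", "subreddit": "technology", "body": "Smart watches won in the end", "created_utc": 1559347200}
"""


def search_comments(lines, term, subreddit=None, author=None, after=None, before=None):
    """Yield comment records containing `term`, narrowed by the optional filters.

    `after` and `before` are timezone-aware datetimes; `after` is inclusive
    and `before` is exclusive.
    """
    term = term.lower()
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        if term not in record["body"].lower():
            continue
        if subreddit and record["subreddit"].lower() != subreddit.lower():
            continue
        if author and record["author"] != author:
            continue
        if after and record["created_utc"] < after.timestamp():
            continue
        if before and record["created_utc"] >= before.timestamp():
            continue
        yield record


if __name__ == "__main__":
    start_of_2019 = datetime(2019, 1, 1, tzinfo=timezone.utc)
    for hit in search_comments(
        SAMPLE_DUMP.splitlines(), "glass", subreddit="technology", after=start_of_2019
    ):
        print(hit["author"], "-", hit["body"])
```

Reading one line at a time keeps memory use low, which matters when the same loop is pointed at a large file (for example via `open(path)`) rather than a small sample string.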

  1. An active community is defined as a subreddit with at least 5 posts or comments posted in a given day (Reddit Inc., 2015).
  2. Any user can create a subreddit and become a moderator for it. Individual subreddits do not appear ‘live’ via Reddit search until a certain number of users join the subreddit. The exact number is not public and is known only to Reddit Inc. This is to deter spam.

Note: This blogpost was originally posted on Medium in July 2019.