What brands are the most popular across Reddit? How does popularity change over time? Is there a gender divide? What brands are talked about positively and negatively?
To answer these questions I will conduct a short machine learning project that applies natural language processing (NLP) to Reddit data.
I will take data from a purposive sample of fashion-related subreddits, using Python and Pushshift’s API to download relevant comments and posts and construct a corpus.
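As a rough illustration of the download step, the sketch below only builds a Pushshift query URL and does not send it. The endpoint and the parameter names (`subreddit`, `after`, `before`, `size`) are assumptions based on Pushshift's historical public comment search API and may have changed. The function name and the example subreddit are hypothetical.

```python
from urllib.parse import urlencode

# Assumed endpoint, based on Pushshift's historical public API.
PUSHSHIFT_COMMENT_URL = "https://api.pushshift.io/reddit/search/comment/"


def build_comment_query(subreddit, after, before, size=100):
    """Return a query URL for comments in `subreddit` between two Unix timestamps."""
    params = {
        "subreddit": subreddit,  # subreddit name without the "r/" prefix
        "after": after,          # start of the window (Unix seconds)
        "before": before,        # end of the window (Unix seconds)
        "size": size,            # number of results requested per call
    }
    return PUSHSHIFT_COMMENT_URL + "?" + urlencode(params)


# Example: comments from January 2021 in one (hypothetical) sampled subreddit.
url = build_comment_query("malefashionadvice", 1609459200, 1612137600)
print(url)
```

Covering a longer period would probably mean repeating the request with `after` moved forward to the timestamp of the last comment received.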
Data cleaning and pre-processing
Having a clean and processed dataset is crucial for NLP analysis. In addition to typical pre-processing, you have to take into account Reddit’s custom form of markdown, which I will remove with redditcleaner.
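To give a sense of what removing Reddit markdown involves, here is a minimal standard-library sketch. It is a simplified stand-in and not redditcleaner itself: the function name is hypothetical and the regular expressions cover only a few common cases (escaped HTML entities, links, quote markers, emphasis and headings).

```python
import html
import re


def strip_reddit_markdown(text):
    """Remove a few common Reddit markdown constructs (simplified stand-in)."""
    text = html.unescape(text)                                  # &gt; becomes >, &amp; becomes &
    text = re.sub(r"\[([^\]]+)\]\([^)]+\)", r"\1", text)        # [label](url) becomes label
    text = re.sub(r"^\s*>+\s?", "", text, flags=re.MULTILINE)   # leading quote markers
    text = re.sub(r"^#+\s*", "", text, flags=re.MULTILINE)      # heading markers
    text = re.sub(r"(\*\*|\*|~~|__|`)", "", text)               # bold, italics, strikethrough, code ticks
    text = re.sub(r"\s+", " ", text)                            # collapse whitespace and newlines
    return text.strip()


raw = "&gt; I **love** [Uniqlo](https://www.uniqlo.com) jeans"
cleaned = strip_reddit_markdown(raw)
print(cleaned)
```

A dedicated package is likely to handle more edge cases (tables, spoilers, superscripts) than this sketch does.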
To analyse the dataset I will use the Natural Language Toolkit (NLTK) library for sentiment analysis of the identified brands, and topic modelling, specifically Latent Dirichlet Allocation (LDA), to identify the ways in which brands are discussed.
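Before sentiment can be attached to a brand, the brand has to be identified in each comment. The sketch below shows one simple way this could be done, counting brand mentions per month with the standard library. The brand list and function name are hypothetical, and the `body` and `created_utc` fields are assumed to follow the usual Reddit comment format.

```python
import re
from collections import Counter
from datetime import datetime, timezone

BRANDS = ["uniqlo", "nike", "zara"]  # hypothetical brand list

# Match any brand as a whole word, ignoring case.
BRAND_PATTERN = re.compile(
    r"\b(" + "|".join(re.escape(b) for b in BRANDS) + r")\b", re.IGNORECASE
)


def monthly_brand_counts(comments):
    """Count, per (month, brand), how many comments mention the brand."""
    counts = Counter()
    for comment in comments:
        created = datetime.fromtimestamp(comment["created_utc"], tz=timezone.utc)
        month = created.strftime("%Y-%m")
        # A set ensures each brand is counted once per comment.
        mentioned = {m.lower() for m in BRAND_PATTERN.findall(comment["body"])}
        for brand in mentioned:
            counts[(month, brand)] += 1
    return counts


sample = [
    {"created_utc": 1609459200, "body": "Nike and Uniqlo are both great. Nike especially."},
    {"created_utc": 1612137600, "body": "Just bought a Zara coat."},
]
counts = monthly_brand_counts(sample)
print(counts)
```

Counting comments rather than raw occurrences keeps a single repetitive comment from inflating a brand's popularity, though either choice could be defended.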
Once the popularity and sentiment of brands have been tracked, I’m keen to explore the data qualitatively for deeper insights. I expect strongly positive and strongly negative sentiment to cluster around certain events, such as articles on slave labour, store closures or product announcements.