Analyzing Opioid Users on Reddit
Background
The opioid crisis in America was declared a public health emergency on October 26, 2017. The number of opioid overdose deaths continues to climb, and with numbers on the rise there is a need to improve treatments and interventions. Reddit is a network of communities, called subreddits, where people with similar interests can gather and communicate anonymously. Exploring subreddits dedicated to opioids can provide insight into the social workings of opioid users. This project aims to gain a deeper understanding of opioid users by analyzing the words used in two subreddits: r/Opiates (Opiates) and r/OpiatesRecovery (Recovery). These two subreddits were chosen for comparison. The goal is to identify any measurable differences between posts made during opioid use and posts made after opioid use. If there are any differences, can they help improve treatments and interventions?

Data for overdose death rates involving opioids is from the CDC.

Data
To collect posts from January 1, 2019 through May 31, 2021, the pushshift.io API and the Python Reddit API Wrapper (PRAW) were used. Pushshift.io exposes certain fields of historical posts, including the post ID, the date the post was created, the author of the post, and the number of comments attached to the post, but not the body of the post. To extract the bodies, the post IDs collected via pushshift.io were passed to PRAW one at a time; PRAW used each ID to look up the post on Reddit and retrieve its body. After pre-processing, the Opiates data set consists of 31,736 posts by 13,024 unique users with an average of 60 words per post. The Recovery data set consists of 10,188 posts by 4,568 unique users with an average of 85 words per post.
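The two-step collection described above could be sketched roughly as follows. This is an illustration under assumptions, not the code used for the project: the helper names are hypothetical, the Pushshift parameter names (`subreddit`, `after`, `before`, `size`, `fields`) reflect the public API as it was commonly used, and `reddit` is assumed to be an authenticated `praw.Reddit` client created elsewhere.

```python
from datetime import datetime, timezone


def to_epoch(year, month, day):
    """Convert a calendar date (midnight UTC) to a Unix timestamp."""
    return int(datetime(year, month, day, tzinfo=timezone.utc).timestamp())


def build_pushshift_params(subreddit, start, end, size=100):
    """Query parameters for one page of a Pushshift submission search.

    The parameter names are assumptions about the Pushshift API;
    only metadata fields are requested, not the post body.
    """
    return {
        "subreddit": subreddit,
        "after": start,
        "before": end,
        "size": size,
        "fields": "id,created_utc,author,num_comments",
    }


def fetch_bodies(reddit, post_ids):
    """Look up each post ID through a PRAW-style client and return {id: body}."""
    return {pid: reddit.submission(id=pid).selftext for pid in post_ids}


# January 1, 2019 up to (but not including) June 1, 2021, so May 31 is covered.
params = build_pushshift_params("Opiates", to_epoch(2019, 1, 1), to_epoch(2021, 6, 1))
```

The parameters would be sent to the Pushshift submission search endpoint page by page, and the returned IDs handed to `fetch_bodies`.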

Before building models to analyze the data sets, we can look at the most common words. Below are visualizations of the 50 most common words for each subreddit. The larger the bubble, the more often the word appears in the subreddit. Many common words are shared between the two subreddits, but we are most interested in the words that differ. The words in blue appear only among the 50 most common words of Opiates; the words in orange appear only among the 50 most common words of Recovery. Words unique to Opiates include "pain", "oxy", "sh*t", and "drug"; these words give a sense of negativity, focus on using, and display some frustration through foul language. Words unique to Recovery include "clean", "life", and "withdrawal"; these words give a sense of positivity, portray an outlook that involves more than drugs, and focus on life after using, which includes withdrawal.
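As a rough sketch of this comparison, the most common words can be counted per subreddit and the two lists compared. The function names and the toy posts are hypothetical, and the tokenization here is a plain whitespace split; the project's actual pre-processing is not described in detail.

```python
from collections import Counter


def top_words(posts, n=50):
    """Return the n most frequent words across a list of post strings."""
    counts = Counter(word for post in posts for word in post.lower().split())
    return [word for word, _ in counts.most_common(n)]


def unique_top_words(top_a, top_b):
    """Words in top_a that do not appear in top_b, in ranked order."""
    other = set(top_b)
    return [word for word in top_a if word not in other]


# Toy posts for illustration only, not real data.
opiates_posts = ["pain pills today", "more pain and oxy", "feel pain"]
recovery_posts = ["clean today", "life feels clean", "clean and feel good"]

top_opiates = top_words(opiates_posts, n=3)
top_recovery = top_words(recovery_posts, n=3)
only_opiates = unique_top_words(top_opiates, top_recovery)
only_recovery = unique_top_words(top_recovery, top_opiates)
```

Words in `only_opiates` would be colored blue and words in `only_recovery` orange in a chart like the ones described.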

[Figure: bubble charts of the 50 most common words, one for Opiates and one for Recovery]

Sentiment Analysis
A sentiment classifier was built by training scikit-learn's LinearSVC on a pre-labeled data set that can be found on Kaggle. The trained model was used to predict a label for each post in the subreddit data sets: -1 for negative sentiment, 0 for neutral sentiment, and 1 for positive sentiment. These labels were used to calculate the overall sentiment for each subreddit.
The percentage of all posts that are positive and the percentage of all posts that are negative are displayed for each subreddit in the accompanying chart. Overall, Recovery has a higher percentage of positive posts while Opiates has a higher percentage of negative posts.
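Once every post has a label, the overall percentages follow from simple counting. The sketch below assumes the labels are available as a list of -1, 0, and 1 values; the function name is hypothetical.

```python
from collections import Counter


def sentiment_shares(labels):
    """Percentage of posts labeled negative (-1), neutral (0), and positive (1)."""
    if not labels:
        raise ValueError("labels must not be empty")
    counts = Counter(labels)
    total = len(labels)
    names = {"negative": -1, "neutral": 0, "positive": 1}
    return {name: 100 * counts[value] / total for name, value in names.items()}
```

Calling `sentiment_shares` once per subreddit gives the two sets of percentages being compared.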

With an overview of the sentiment for each subreddit, we can now examine a monthly breakdown and identify any changes over the time period of the data set. Plotting the monthly sentiment shows that Recovery almost always holds a higher percentage of positive posts, and that Opiates holds a higher percentage of negative posts up until July 2020. At that point, the percentages of negative posts in the two subreddits meet. Although the COVID pandemic is not a focus of this project, t-tests were performed on the sentiment pre-quarantine and post-quarantine, with March 2020 as the cutoff month. For Recovery, both positive and negative sentiment changed significantly (positive sentiment decreased and negative sentiment increased), but there was no significant change in the sentiment of Opiates.
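A pre/post comparison of this kind could look roughly like the sketch below. Several details are assumptions, since the write-up does not state them: monthly values are keyed by "YYYY-MM" strings, March 2020 itself is counted as post-quarantine, and the test shown is Welch's unequal-variance t-test. Only the t statistic and degrees of freedom are computed here; a p-value would come from the t distribution, for example via `scipy.stats.ttest_ind(pre, post, equal_var=False)`.

```python
from math import sqrt
from statistics import mean, variance


def split_at_cutoff(monthly, cutoff="2020-03"):
    """Split {"YYYY-MM": value} into values before the cutoff and from it onward."""
    ordered = sorted(monthly.items())
    pre = [value for month, value in ordered if month < cutoff]
    post = [value for month, value in ordered if month >= cutoff]
    return pre, post


def welch_t(sample_a, sample_b):
    """Welch's t statistic and Welch-Satterthwaite degrees of freedom."""
    n_a, n_b = len(sample_a), len(sample_b)
    se2_a = variance(sample_a) / n_a  # squared standard error of each mean
    se2_b = variance(sample_b) / n_b
    t = (mean(sample_a) - mean(sample_b)) / sqrt(se2_a + se2_b)
    df = (se2_a + se2_b) ** 2 / (se2_a ** 2 / (n_a - 1) + se2_b ** 2 / (n_b - 1))
    return t, df
```

Here `monthly` would hold one subreddit's monthly percentage of positive (or negative) posts, and the test would be run once per subreddit and sentiment.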

Word Networks
Word networks allow us to visualize relational data in the form of graphs. First, we create co-occurrence matrices that contain a column for each word in the data set and a row for each post, where each cell holds the number of times the word appears in the post. Next, we find the correlation between word pairs using Pearson's correlation coefficient. The results of the correlation determine the edges connecting the nodes.
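The construction above can be sketched in a few functions. The names are hypothetical, posts are assumed to be lists of already-cleaned tokens, and the rule for turning correlations into edges (keep pairs at or above a `threshold`) is an assumption, since the write-up does not give the exact criterion.

```python
from collections import Counter
from itertools import combinations
from math import sqrt


def count_matrix(posts):
    """Rows are posts, columns are words, cells are per-post word counts."""
    vocab = sorted({word for post in posts for word in post})
    rows = []
    for post in posts:
        counts = Counter(post)
        rows.append([counts[word] for word in vocab])
    return vocab, rows


def pearson(xs, ys):
    """Pearson's correlation coefficient; 0.0 if either column is constant."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sd_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    sd_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    if sd_x == 0 or sd_y == 0:
        return 0.0
    return cov / (sd_x * sd_y)


def correlation_edges(posts, threshold=0.5):
    """Return (word_a, word_b, r) for every word pair with r >= threshold."""
    vocab, rows = count_matrix(posts)
    columns = {word: [row[j] for row in rows] for j, word in enumerate(vocab)}
    edges = []
    for a, b in combinations(vocab, 2):
        r = pearson(columns[a], columns[b])
        if r >= threshold:
            edges.append((a, b, r))
    return edges
```

Each returned triple becomes an edge between two word nodes, with `r` available as an edge weight.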
The word networks for the entire data sets (not available for viewing on mobile) are complex but we can come to some meaningful conclusions by taking a look at the center of the networks and by creating smaller networks.
We can briefly dive into the center of the networks and take note of the nodes that have the most connections. In the Opiates network, we see that "h" (short for heroin) and "g" (short for gram) are the nodes that have the most connections and are the clear focal point for the network. In the Recovery network, we see "h" as the node with the most connections here as well, but we have a second focal point that includes a cluster of words such as "thing", "get", and "go". Additionally, the word "g" does not appear to be prominent in this network at all.
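Finding the nodes with the most connections amounts to counting how many edges touch each word. A minimal sketch, assuming the network is available as a list of word pairs; the function name and the edge list are made up for illustration.

```python
from collections import Counter


def node_degrees(edges):
    """Count the number of edges attached to each node."""
    degrees = Counter()
    for a, b in edges:
        degrees[a] += 1
        degrees[b] += 1
    return degrees


# Hypothetical edges, not taken from the real networks.
toy_edges = [("h", "g"), ("h", "fent"), ("h", "dose"), ("g", "dose")]
most_connected = node_degrees(toy_edges).most_common(1)
```

Sorting the full counter with `most_common()` gives a ranked list of focal-point candidates.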
We can use nodes that appear prominently in both subreddit networks to create smaller networks. Since "h" is the node with the most connections, or close to the most, in each subreddit network, the word "heroin" is an interesting starting point for a smaller network built around a single focal word in each subreddit. These single-focal-point networks allow us to see how the context of a particular word differs between the subreddits.
The following are the networks for the word "heroin" in Opiates and in Recovery. The difference in the words that correlate with "heroin" is immediately apparent. In Opiates, the unique words (blue nodes) refer to drugs and drug use, such as "fent", "morphine", "opiate", "methadone", "hit", and "dose". In Recovery, the unique words (orange nodes) include "house", "family", "money", "love", and "clean"; in general, Recovery discusses other concerns that do not seem to revolve around drug use.
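A single-focal-point comparison like this one can be sketched by collecting the words linked to the focal word in each network and taking set differences. The function name is hypothetical, and the edge lists below are toy examples that reuse a few words from the text; the shared word "quit" is invented for illustration.

```python
def neighbors(edges, focal):
    """Words directly connected to the focal word in a list of word pairs."""
    linked = set()
    for a, b in edges:
        if a == focal:
            linked.add(b)
        elif b == focal:
            linked.add(a)
    return linked


# Toy edge lists for illustration only.
opiates_edges = [("heroin", "fent"), ("heroin", "dose"), ("quit", "heroin")]
recovery_edges = [("heroin", "family"), ("clean", "heroin"), ("heroin", "quit")]

opiates_linked = neighbors(opiates_edges, "heroin")
recovery_linked = neighbors(recovery_edges, "heroin")
unique_to_opiates = opiates_linked - recovery_linked    # blue nodes
unique_to_recovery = recovery_linked - opiates_linked   # orange nodes
```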

[Figure: word networks for "heroin", one for Opiates and one for Recovery]

Topic Modeling
Topic modeling is an unsupervised machine learning method that identifies topics by observing n-grams and analyzing word frequencies. Latent Dirichlet Allocation (LDA) is a popular topic modeling technique that calculates the probability that a word belongs to a topic. The results can be used to filter posts and take a deeper dive into those that pertain to a particular topic or contain a particular word. We can also discover connections or themes that were not apparent in our initial analysis.
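To make the idea concrete, here is a minimal sketch of LDA fitted with collapsed Gibbs sampling on tokenized posts. The sampler, the function name `lda_gibbs`, and the hyperparameters `alpha` and `beta` are choices made for this illustration; the write-up does not say which LDA implementation or settings were used, and a library implementation would normally be preferred on real data.

```python
import random


def lda_gibbs(docs, num_topics, iterations=200, alpha=0.1, beta=0.01, seed=0):
    """Fit LDA by collapsed Gibbs sampling.

    docs is a list of token lists. Returns (vocab, phi), where
    phi[t][v] estimates the probability of word vocab[v] under topic t.
    """
    rng = random.Random(seed)
    vocab = sorted({word for doc in docs for word in doc})
    word_id = {word: i for i, word in enumerate(vocab)}
    size = len(vocab)

    doc_topic = [[0] * num_topics for _ in docs]         # tokens per topic in each doc
    topic_word = [[0] * size for _ in range(num_topics)]  # word counts per topic
    topic_total = [0] * num_topics
    assignments = []

    # Start from a random topic assignment for every token.
    for d, doc in enumerate(docs):
        doc_assign = []
        for word in doc:
            topic = rng.randrange(num_topics)
            doc_assign.append(topic)
            doc_topic[d][topic] += 1
            topic_word[topic][word_id[word]] += 1
            topic_total[topic] += 1
        assignments.append(doc_assign)

    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for i, word in enumerate(doc):
                v = word_id[word]
                old = assignments[d][i]
                # Remove the token, then resample its topic given all others.
                doc_topic[d][old] -= 1
                topic_word[old][v] -= 1
                topic_total[old] -= 1
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][v] + beta)
                    / (topic_total[t] + size * beta)
                    for t in range(num_topics)
                ]
                new = rng.choices(range(num_topics), weights=weights)[0]
                assignments[d][i] = new
                doc_topic[d][new] += 1
                topic_word[new][v] += 1
                topic_total[new] += 1

    phi = [
        [(topic_word[t][v] + beta) / (topic_total[t] + size * beta) for v in range(size)]
        for t in range(num_topics)
    ]
    return vocab, phi
```

Sorting each row of `phi` in descending order gives the most probable words for that topic.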
Below are the interactive visualizations (not available in mobile view) for topic modeling via LDA. On the left-hand side of these visualizations, the size of each bubble indicates the percentage of all tokens, or all words, that the topic represents. For example, the first topic in Opiates represents 29.8% of all tokens. Clicking on a bubble allows further examination of a topic. On the right-hand side are the 30 most relevant words for the selected topic. The bars next to the words show the count of the word in the topic (red) and in the entire data set (red + blue). The following interpretation of the topic modeling results was done at a relevance metric of 0.2.
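The relevance metric controls how the words for a topic are ranked. The sketch below assumes the visualizations use the relevance definition from LDAvis, which blends a word's probability within the topic with its lift over its overall probability, weighted by a value between 0 and 1 (0.2 here). The function names and the probabilities in the example are invented for illustration.

```python
from math import log


def relevance(p_word_given_topic, p_word, lam=0.2):
    """lam * log p(w|t) + (1 - lam) * log(p(w|t) / p(w))."""
    return lam * log(p_word_given_topic) + (1 - lam) * log(p_word_given_topic / p_word)


def rank_terms(topic_probs, overall_probs, lam=0.2, top_n=30):
    """Rank the words of one topic by relevance, highest first."""
    scored = {
        word: relevance(p, overall_probs[word], lam)
        for word, p in topic_probs.items()
        if p > 0
    }
    return sorted(scored, key=scored.get, reverse=True)[:top_n]


# Made-up probabilities: "get" is common everywhere, "taper" is specific to the topic.
topic_probs = {"taper": 0.05, "get": 0.10}
overall_probs = {"taper": 0.01, "get": 0.10}
ranked_low_lambda = rank_terms(topic_probs, overall_probs, lam=0.2)
ranked_high_lambda = rank_terms(topic_probs, overall_probs, lam=1.0)
```

A low value such as 0.2 favors words that are distinctive to a topic over words that are simply frequent, which is why it is useful for labeling topics.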
Opiates
Opiates Topic Analysis:
- The first topic represents 29.8% of all tokens and contains words related to everyday life. Words included in this topic are "car", "house", "mom", "friend", and "family".
- The second topic represents 22.4% of all tokens and contains words related to withdrawal. Words included in this topic are "taper", "quit", "craving", and "relapse".
- The third topic represents 19.6% of all tokens and contains words related to drugs. Words included in this topic are "oxycodone", "codeine", and "morphine".
- The fourth topic represents 14.8% of all tokens and contains words related to techniques. Words included in this topic are "vein", "shoot", "nose", "spoon", and "needle".
- The fifth topic represents 13.4% of all tokens and contains words related to research/studies. Words included in this topic are "test", "survey", "information", and "report".

Recovery
Recovery Topic Analysis:
- The first topic represents 32.5% of all tokens and contains words related to feelings or current life events. Words included in this topic are "shit", "good", "love", "cry", and "happy".
- The second topic represents 31.9% of all tokens and contains words related to withdrawal. Words included in this topic are "taper", "cold", "turkey", and "withdrawal".
- The third topic represents 26.3% of all tokens and contains words related to everyday life. Words included in this topic are "parent", "family", and "job".
- The fourth topic represents 8.3% of all tokens and contains words related to research/studies. Words included in this topic are "survey", "information", and "participate".
- The fifth topic represents 1.1% of all tokens and contains words related to music, but they do not seem to fit a coherent category. The words surrounding the ones listed in the topic were examined to provide some context. Overall, the words in this topic all seem to be connected to storytelling.

Conclusion
The goal of this project was to gain some insight into the social workings of opioid users and to identify any measurable differences between posts during opioid use and posts after opioid use. There were differences between the sentiment in Opiates and Recovery. Overall, Recovery has a higher percentage of positive posts while Opiates has a higher percentage of negative posts. Posts from Recovery show much more focus on connections to society, while posts from Opiates show much more focus on drug use: speaking about opioids, how to get them, and how to take them. This was evident from the differences in the most common words, word networks, and topic modeling.
For the most common words, words unique to Opiates include "pain", "oxy", "sh*t", and "drug"; these words give a sense of negativity, focus on using, and display some frustration through foul language. Words unique to Recovery include "clean", "life", and "withdrawal"; these words give a sense of positivity, portray an outlook that involves more than using, and focus on life after using, which includes withdrawal. In the word networks for "heroin", Recovery displayed a connection to "house", "family", "money", "love", and "clean"; in general, it appears there are concerns and discussions that do not revolve around drug use. In topic modeling, the life-related topics in Recovery (topics 1 and 3) represented a total of 58.8% of all tokens, while the everyday life topic represented only 29.8% of all tokens in Opiates. Drug-related topics in Opiates (topics 3 and 4) make up a large portion of the tokens, representing 34.4% of all tokens.
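The combined percentages quoted here are sums of the per-topic shares reported in the topic analysis, which a few lines can verify. The variable and function names are hypothetical; the numbers are the ones listed for each topic.

```python
# Share of all tokens per topic, as reported in the topic analysis.
opiates_shares = {1: 29.8, 2: 22.4, 3: 19.6, 4: 14.8, 5: 13.4}
recovery_shares = {1: 32.5, 2: 31.9, 3: 26.3, 4: 8.3, 5: 1.1}


def combined_share(shares, topics):
    """Sum the token shares of the given topics, rounded to one decimal."""
    return round(sum(shares[topic] for topic in topics), 1)


recovery_life = combined_share(recovery_shares, [1, 3])  # feelings/life events + everyday life
opiates_drugs = combined_share(opiates_shares, [3, 4])   # drugs + techniques
```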
Additionally, the topics that overlap in Recovery are withdrawal and everyday life, while the topics that overlap in Opiates are everyday life and techniques for taking drugs. Could the difference in overlapping topics be the answer to which users recover and which continue to use? Do users need social connections when they are trying to quit and experiencing withdrawal? This can be taken a step further and applied to the changes in sentiment that were identified in the monthly sentiment breakdown. Perhaps the rise in negativity in Recovery posts during the COVID pandemic can be attributed to connections being cut off. The importance of connections could be a focus of future treatments and interventions.