This is an open-access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work, first published in the JMIR Infodemiology, is properly cited. The complete bibliographic information, a link to the original publication on https://infodemiology.jmir.org/, as well as this copyright and license information must be included.
The COVID-19 pandemic has affected people’s daily lives and has caused economic loss worldwide. Anecdotal evidence suggests that the pandemic has increased depression levels among the population. However, systematic studies of depression detection and monitoring during the pandemic are lacking.
This study aims to develop a method to create a large-scale depression user data set in an automatic fashion so that the method is scalable and can be adapted to future events; verify the effectiveness of transformer-based deep learning language models in identifying depression users from their everyday language; examine psychological text features’ importance when used in depression classification; and, finally, use the model for monitoring the fluctuation of depression levels of different groups as the disease propagates.
To study this subject, we designed an effective regular expression-based search method and created the largest English Twitter depression data set containing 2575 distinct identified users with depression and their past tweets. To examine the effect of depression on people’s Twitter language, we trained three transformer-based depression classification models on the data set, evaluated their performance with progressively increased training sizes, and compared the model’s tweet chunk-level and user-level performances. Furthermore, inspired by psychological studies, we created a fusion classifier that combines deep learning model scores with psychological text features and users’ demographic information, and investigated these features’ relations to depression signals. Finally, we demonstrated our model’s capability of monitoring both group-level and population-level depression trends by presenting two of its applications during the COVID-19 pandemic.
Our fusion model demonstrated an accuracy of 78.9% on a test set containing 446 people, half of whom were identified as having depression. Conscientiousness, neuroticism, use of first-person pronouns, talking about biological processes such as eating and sleeping, talking about power, and exhibiting sadness were shown to be important features in depression classification. Further, when used for monitoring the depression trend, our model showed that, based on their tweets, users with depression generally responded to the pandemic later than the control group (n=500). Three US states (New York, California, and Florida) shared a depression trend similar to that of the whole US population (n=9050). Compared with New York and California, people in Florida demonstrated a substantially lower level of depression.
This study proposes an efficient method that can be used to analyze the depression level of different groups of people on Twitter. We hope this study can raise awareness among researchers and the public of COVID-19’s impact on people’s mental health. The noninvasive monitoring system can also be readily adapted to other big events besides COVID-19 and can be useful during future outbreaks.
COVID-19 is an infectious disease that has been spreading rapidly worldwide since early 2020. It was first identified on December 31, 2019, and was officially declared a pandemic by the World Health Organization on March 11, 2020 [
Mental disorders were affecting approximately 380 million people of all ages worldwide before COVID-19 [
Multiple studies have investigated the economic and social impacts of COVID-19 [
Given this pressing situation, we would like to quantify mental health conditions of the general population during the pandemic. Nevertheless, the data source selection is critical for overcoming the two challenges mentioned previously. In the past decade, people have been increasingly relying on social media platforms such as Facebook, Twitter, and Instagram to express their feelings. Social media can thus serve as a resourceful medium for mining information about the public’s mental health conditions [
As shown in
Density of Twitter coverage regarding “depression,” “ptsd,” “bipolar disorder,” and “autism.” ptsd: posttraumatic stress disorder.
The potential of machine learning models for identifying Twitter users who have been diagnosed with depression was pioneered by De Choudhury et al [
The CLPsych 2015 Shared Task data set containing 447 diagnosed depression users [
The CLPsych 2019 Shared Task [
In addition to these two challenge data sets, several studies attempted to gather their own data of various forms. Tsugawa et al [
Although the time series plots of keyword frequencies in
Therefore, the main objectives of this study are to develop a method to create a large-scale depression user data set in an automatic fashion so that the method is scalable and can be adapted to future events; to verify the effectiveness of transformer-based deep learning language models in identifying depression users from their everyday language; to further improve the depression classification model using explainable psychological text features and to examine their importance in classification; and, finally, to use the model for monitoring the fluctuation of depression levels of different groups as the disease propagates.
First, we identified users with depression from 41.3 million COVID-19–related tweets posted by about 36.6 million users from March 23 to April 18, 2020. We collected the COVID-19–related tweets using the keywords “corona,” “covid19,” “coronavirus,” “#Corona,” “#Covid_19,” and “#coronavirus.” From these tweets, we looked for signals in both the text and the user profile description that could tell whether the user had depression.
Empirically, we observed that many Twitter users with depression described themselves as “depression fighters” in their descriptions. Some of them may also post relevant tweets to declare that they have been diagnosed with depression. Inspired by Coppersmith et al [
In the end, 2575 distinct Twitter users were classified into the depression group. Of 200 randomly sampled users in the depression set, 86% were labeled positive by human annotators. As our control group, we randomly selected another 2575 distinct users whose past 200 tweets and descriptions contained no depression-related terms. Users in this group were not considered to have depression (nondepression group). Once we found the targeted Twitter users, we used the Tweepy application programming interface (API) to retrieve the public tweets posted by these users within the 3 months preceding the depression-related tweet, with a maximum of 200 tweets per user. We chose 200 tweets because, on average, it is roughly the number of tweets posted by an individual within a 3-month time span, which is the length commonly adopted by previous work [
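The regular expression search described above can be sketched as follows. The patterns here are illustrative stand-ins (the study's actual expressions are not reproduced), matching the self-described "depression fighter" profiles and first-person diagnosis statements mentioned earlier:

```python
import re

# Hypothetical patterns modeled on the description above; the study's
# real regular expressions were more extensive.
PROFILE_PATTERN = re.compile(r"\bdepression\s+fighter\b", re.IGNORECASE)
TWEET_PATTERN = re.compile(
    r"\bI\s+(?:was|am|have\s+been)\s+diagnosed\s+with\s+depression\b",
    re.IGNORECASE,
)

def is_candidate_depression_user(profile: str, tweets: list[str]) -> bool:
    """Flag a user if the profile or any tweet matches a depression signal."""
    if PROFILE_PATTERN.search(profile):
        return True
    return any(TWEET_PATTERN.search(t) for t in tweets)
```

Candidates flagged this way would still be sampled for human verification, as described above.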
Previous psychological research has shown that the big five personality traits (openness, conscientiousness, extraversion, agreeableness, and neuroticism) are related to depression [
Besides personality, we hypothesized that individuals’ sentiments and emotions could also reflect whether they were experiencing depression. Sentiment analysis is widely used in deciphering people’s health and well-being from text data [
Distributions of positive and negative emotion scores among the depression and nondepression groups. VADER: Valence Aware Dictionary for Sentiment Reasoning.
Previous psychological studies have shown differences in depression rates among people of different ages and of different genders [
We used LIWC—a well-validated psycholinguistic dictionary [
We chose 8 features that were analyzed in previous works [
We also found that the tweets of the depression group expressed more sadness and used words related to biological processes more frequently. Although there is no clear link between biological process–related words and depression, this finding suggests that people with depression may pay more attention to their biological statuses. The
Linguistic profiles for the depression and nondepression tweets. LIWC: Linguistic Inquiry and Word Count.
To measure each user’s social media engagement, we used the proportion of tweets with mentions and the numbers of responses, unique user mentions, total user mentions, and tweets, as did Coppersmith et al [
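A minimal sketch of how such engagement statistics could be computed; the helper below is hypothetical and works from raw tweet texts, whereas the study presumably derived these counts from Twitter metadata:

```python
import re

def engagement_features(tweets: list[str]) -> dict:
    """Compute simple engagement statistics from a user's raw tweet texts."""
    mention_re = re.compile(r"@\w+")
    # Mentions found in each tweet, then flattened into one list.
    mentions_per_tweet = [mention_re.findall(t) for t in tweets]
    all_mentions = [m for ms in mentions_per_tweet for m in ms]
    n = len(tweets)
    return {
        "num_tweets": n,
        "num_mentions": len(all_mentions),
        "num_unique_mentions": len(set(all_mentions)),
        "prop_tweets_with_mentions": (
            sum(1 for ms in mentions_per_tweet if ms) / n if n else 0.0
        ),
    }
```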
We formulated our task as a classification task, where the model was trained to predict whether a particular tweet or chunk of tweets comes from a user in the depression set. Note that not all tweets by people in the depression set explicitly referred to depression per se. By definition, though, they were all posted by users with depression and were thus labeled true. To help improve the model’s generalizability, during training and testing, we excluded all tweets matched by the regular expressions used to identify users with depression, since these tweets contained trivial patterns and keywords. We assumed there were subtle differences in the language used by the depression and nondepression groups. Our goal was to build a model capable of capturing these subtleties and classifying users correctly.
We performed stratified random sampling on our data set. We first sampled 500 users to form our testing set. On the rest of the users, we progressively added users to the training sets and recorded the performance of the models trained on sets of 1000, 2000, and 4650 users. All the training and testing sets have a 1:1 (depression:nondepression) ratio.
Jamil et al [
We preprocessed the text using the tweet preprocessing pipeline proposed by Baziotis et al [
After chunking and preprocessing, on average, each user had 6-7 text chunks, making the actual sizes of the 4650-user train-validation set and the 500-user testing set to be 29,315 and 3105, respectively. The preprocessed tweet chunk data sets were then passed to deep learning models for training.
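The chunking step can be sketched as follows. This is a simplified illustration: tokenization is reduced to whitespace splitting, and keeping the trailing partial chunk is an assumption rather than a documented detail of the pipeline:

```python
def chunk_tweets(tweets: list[str], chunk_size: int = 250) -> list[str]:
    """Concatenate consecutive tweets into chunks of roughly `chunk_size` words."""
    chunks, current, count = [], [], 0
    for tweet in tweets:
        words = tweet.split()
        current.extend(words)
        count += len(words)
        if count >= chunk_size:
            chunks.append(" ".join(current))
            current, count = [], 0
    if current:  # keep the trailing partial chunk (an assumption)
        chunks.append(" ".join(current))
    return chunks
```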
We used deep learning models to perform chunk-level classification. We set up two baseline models, multi-channel CNN and BiLSTM with context-aware attention (attention BiLSTM), as described in Orabi et al [
We ran the models on all the tweet chunks of the same user and took the average of the confidence scores to get the user-level confidence score. There were 4163 (89.5%) of the 4650 users in the training set and 446 (89.2%) of the 500 users in the testing set for whom all features were retrievable. We then passed different combinations of user-level scores (personality, VADER, demographics, engagement, LIWC, and average confidence) to machine learning classification algorithms including random forest, logistic regression, and SVM provided by the
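The per-user score aggregation described above can be sketched as follows, assuming chunk-level confidence scores have already been produced for each user; the function names are illustrative:

```python
from statistics import mean

def user_level_scores(chunk_scores: dict[str, list[float]]) -> dict[str, float]:
    """Average each user's chunk-level confidence scores into one user-level score."""
    return {user: mean(scores) for user, scores in chunk_scores.items()}

def classify(user_scores: dict[str, float], threshold: float = 0.5) -> dict[str, bool]:
    """Label a user as showing depression signals when the mean score crosses the threshold."""
    return {user: score >= threshold for user, score in user_scores.items()}
```

In the fusion setup, these user-level scores would be combined with the other feature groups before being passed to the final classifier.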
During training, we randomly split the train-validation set into training and validation sets with a ratio of 9:1. We used the Adam optimizer with a learning rate of 7e-3 and weight decay of 1e-4 for training attention BiLSTM, and the Adam optimizer with a learning rate of 5e-4 for training CNN. We used the AdamW optimizer with a learning rate of 2e-5 for training BERT and RoBERTa, and 8e-6 for training XLNet. We used the cross-entropy loss for all our models during training. We used the stochastic gradient descent optimizer with an adaptive learning rate, initialized at 0.1, for training the SVM and logistic regression classifiers. We recorded the models’ performances on the validation set after each epoch and kept the model with the highest accuracy and F1 scores while training until convergence. We manually selected the hyperparameters that gave the best accuracy and F1 scores on the deep learning models.
In
Another observation was the performance gain of transformer-based models over BiLSTM and CNN models. The CNN model slightly outperformed BiLSTM, which replicated the findings of Orabi et al [
Chunk-level performance (%) of all 5 models on the 500-user testing set using training-validation sets of different sizes.a

| Model and training-validation set | Accuracy | F1 | AUCb | Precision | Recall |
| --- | --- | --- | --- | --- | --- |
| BiLSTMc | | | | | |
| 1000 users | 70.7 | 69.0 | 76.5 | 70.9 | 67.3 |
| 2000 users | 70.3 | 68.3 | 77.4 | 70.7 | 66.1 |
| 4650 users | 72.7 | 71.6 | 79.3 | 72.1 | 71.1 |
| CNNd | | | | | |
| 1000 users | 71.8 | 72.6 | 77.4 | 72.7 | 72.6 |
| 2000 users | 72.8 | 74.5 | 80.3 | 72.2 | 76.9 |
| 4650 users | 74.0 | 70.9 | 81.0 | 77.4 | 68.9 |
| BERTe | | | | | |
| 1000 users | 72.7 | 74.4 | 79.8 | 72.0 | 76.9 |
| 2000 users | 75.7 | 76.3 | 82.9 | 76.1 | 75.7 |
| 4650 users | 76.5 | 77.5 | 83.9 | 76.3 | 78.8 |
| RoBERTaf | | | | | |
| 1000 users | 74.4 | 75.7 | 82.0 | 74.2 | 77.3 |
| 2000 users | 75.9 | 77.9 | 83.2 | 73.8 | — |
| 4650 users | 76.2 | — | 84.1 | 74.4 | 81.9 |
| XLNet | | | | | |
| 1000 users | 73.7 | 75.1 | 80.7 | 73.2 | 77.2 |
| 2000 users | 74.6 | 76.8 | 82.6 | 72.6 | 81.5 |
| 4650 users | — | 77.9 | — | — | 78.3 |

aWe used 0.5 as the threshold when calculating the scores.

bAUC: area under the receiver operating characteristic curve.

cBiLSTM: bidirectional long short-term memory.

dCNN: convolutional neural network.

eBERT: Bidirectional Encoder Representations from Transformers.

fRoBERTa: Robustly Optimized BERT Pretraining Approach.

gItalics indicate the best performing model in each column.
Next, we report our experiment results at the user level. Since XLNet trained on the 4650-user data set outperformed the other models, we used it for the user-level performance comparison. Our experimental results demonstrated a substantial increase in the user-level scores of XLNet shown in
The results are shown in
In an attempt to investigate what specific textual features besides those extracted by XLNet have the most impact on depression classification, we calculated the permutation feature importance [
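Permutation feature importance can be sketched as follows: shuffle one feature column at a time and measure the resulting drop in accuracy. This is a generic, model-agnostic illustration rather than the study's exact implementation:

```python
import random

def permutation_importance(model, X, y, n_repeats=10, seed=0):
    """Estimate each feature's importance as the mean accuracy drop
    after randomly permuting that feature's column.

    `model` is any object with a `predict(rows) -> labels` method;
    X is a list of feature rows (lists), y the true labels.
    """
    rng = random.Random(seed)

    def accuracy(rows):
        preds = model.predict(rows)
        return sum(p == t for p, t in zip(preds, y)) / len(y)

    baseline = accuracy(X)
    importances = []
    for j in range(len(X[0])):
        drops = []
        for _ in range(n_repeats):
            column = [row[j] for row in X]
            rng.shuffle(column)  # break the feature-label association
            permuted = [row[:j] + [v] + row[j + 1:] for row, v in zip(X, column)]
            drops.append(baseline - accuracy(permuted))
        importances.append(sum(drops) / n_repeats)
    return importances
```

A feature the model ignores yields zero importance, while a feature the model relies on produces a large accuracy drop when shuffled.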
User-level performance (%) using different features.

| Featuresa | Accuracy | F1 | AUCb |
| --- | --- | --- | --- |
| VADERc | 54.9 | 61.7 | 54.6 |
| Demographics | 58.7 | 56.0 | 61.4 |
| Engagement | 58.7 | 62.3 | 61.7 |
| Personality | 64.8 | 67.8 | 72.4 |
| LIWCd | 70.6 | 70.8 | 76.0 |
| V + D + E + P + Le | 71.5 | 72.0 | 78.3 |
| XLNet | 78.1 | 77.9 | 84.9 |
| All (random forest) | 78.4 | 78.1 | 84.9 |
| All (logistic regression) | 78.3 | 78.5 | — |
| All (SVMg) | 78.9 | — | 86.1 |

aWe used SVM for classifying individual features.

bAUC: area under the receiver operating characteristic curve.

cVADER: Valence Aware Dictionary and Sentiment Reasoner.

dLIWC: Linguistic Inquiry and Word Count.

eV + D + E + P + L: VADER + demographics + engagement + personality + LIWC.

fItalics indicate the best performing model in each column.

gSVM: support vector machine.
Permutation importance of different features. LIWC: Linguistic Inquiry and Word Count; VADER: Valence Aware Dictionary for Sentiment Reasoning.
In this section, we report two COVID-19–related applications of our XLNet-based depression classifier: (1) monitoring the evolution of depression levels among the depression group and the nondepression group, and (2) monitoring the depression level at the US country and state levels during the pandemic. We chose XLNet because of its simplicity as a stand-alone model, as it performed comparably to the fusion model.
We took the 500 users from the testing set (n=500), along with their tweets from January 1 to May 22, 2020. Starting from January 1, we concatenated each user’s tweets in chronological order until a chunk reached 250 words, and dated each chunk by the posting date of its middle tweet. We then grouped the dates into 3-day bins starting January 1 and assigned the chunks to the bins accordingly. We ran the XLNet model on the preprocessed tweet chunks and recorded the confidence scores. To reduce the skew in the score distribution, we trimmed the upper and lower 10% of the data. We then took the mean of the scores for each time bin and plotted the depression trend shown in
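The binning and trimming procedure can be sketched as follows. One simplifying assumption: the sketch trims each bin's scores independently, which may differ from the study's trimming of the overall score distribution:

```python
from datetime import date, timedelta
from statistics import mean

def binned_trimmed_means(scored_chunks, start, bin_days=3, trim=0.10):
    """Group (date, score) pairs into fixed-width date bins and take a
    trimmed mean per bin.

    `scored_chunks` is an iterable of (date, score) pairs; `trim` drops
    that fraction of scores from each end of the sorted list before averaging.
    """
    bins = {}
    for d, score in scored_chunks:
        idx = (d - start).days // bin_days
        bins.setdefault(idx, []).append(score)
    result = {}
    for idx, scores in sorted(bins.items()):
        scores.sort()
        k = int(len(scores) * trim)
        kept = scores[k:len(scores) - k] if len(scores) > 2 * k else scores
        # Key each bin by its starting date for plotting.
        result[start + timedelta(days=idx * bin_days)] = mean(kept)
    return result
```

Plotting the returned per-bin means over time yields a depression trend curve of the kind described above.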
Aggregated depression level trends of the depression and nondepression groups from January 1 to May 22, 2020. Since users with depression have a substantially higher depression level, we used different y-axes for the 2 groups' depression levels to compare them side by side.
Two immediate observations followed. First, the depression level among users in the depression group was substantially higher than that in the nondepression group. This held across the entire observation period from early January to late May 2020. Second, and more importantly, the two groups shared a strikingly similar depression trend.
Delving deeper into these curves, we marked three important time points on the plot: the first confirmed case of COVID-19 in the United States (January 21, 2020), the US National Emergency announcement (March 13), and the last stay-at-home order issued (South Carolina, April 7). In January, both groups experienced a drop in depression scores. This may be because people’s mood usually hits its lowest point in winter [
To better understand the trend, we applied the LDA model to retrieve the topics before and after the announcement of the US National Emergency. Each chunk of the tweets was assigned 5 weights for each of the 5 topics. We labeled the topic of the highest weight as the dominant topic of this chunk of the tweets and counted the frequency of each topic shown in
Topic distributions of depression and nondepression groups before and after the announcement of the US National Emergency.
To investigate country-level and state-level depression trends during COVID-19, we randomly sampled users who had US state locations stated in their profiles and crawled their tweets between March 3 and May 22, 2020, the period right before and after the US announced a National Emergency on March 13. Using the same logic as in the previous section, we plotted the change in depression scores of 9050 geolocated users (n=9050) sampled from the 36.6 million users mentioned earlier, excluding those used for training, as the country-level trend. For the state-level comparison, we plotted the aggregated scores of three representative states: New York, an economic center on the East Coast that was highly affected by the virus; California, a tech center on the West Coast that was also struck hard; and Florida, a less affected tourism center in the Southeast. Each selected state had at least 550 users in the data set, providing sufficient data to support our comparisons. Their depression levels are shown in
The first observation of the plot is that depression scores of all three states and the United States behaved similarly during the pandemic; they experienced a decrease right before the National Emergency; a steady increase after that; a slight decrease past April 23, 2020; and another sharp increase after May 10. We also noticed that the overall depression score of Florida was substantially lower than the US average and the other two states. Since Florida had a lower score both before and after the virus outbreak, we hypothesized that it has a lower depression level overall compared to the average US level irrespective of the pandemic.
We calculated the topics at the state level after the announcement of the US National Emergency. As shown in
Aggregated depression level trends of the United States, New York, California, and Florida after the announcement of the US National Emergency.
Distributions of the top 5 topics (state level) after the announcement of the US National Emergency.
In this study, we developed a practical pipeline that included first gathering and cleaning a large-scale Twitter depression classification data set quickly in response to an outbreak, then training an accurate depression signal detection model on this data set, and finally applying the model to monitoring public depression trends. We analyzed the depression level trends during the COVID-19 pandemic, which shed light on the psychological impacts of the pandemic. Our main results were fourfold and corresponded to the four objectives listed in the
First, using a stringent yet effective regular expression-based search method, we constructed by far the largest data set with 5150 Twitter users, including half identified as depression users and half as control users, along with their tweets within the past 3 months and their Twitter activity data.
Second, we developed a chunking and regrouping method to construct 32,420 tweet chunks, with 250 words each in the data set. We progressively added data to our training set and showed experimentally that the performance of deep learning models improves as the size of the training set grows, which validates the importance of our data set size. We compared the models’ performances at the chunk level with the user level and observed further performance gain, which added credibility to our chunking method.
Third, we built a more accurate classification model (with 78.9% accuracy on n=446) upon the deep learning models along with linguistic analysis of dimensions including personality, LIWC, sentiment features, and demographic information. A permutation importance test showed that conscientiousness, neuroticism, use of first-person pronouns, talking about biological processes such as eating and sleeping, talking about power, and exhibiting sadness are closely related to depression cues.
Finally, we showed the feasibility of the two proposed methods for monitoring the change of public depression levels as the disease propagates by aggregating individuals’ past tweets within a time frame. Our method can target different groups of people, and we showed the depression trends of identified depression and nondepression groups (n=500), and of groups at different geolocations (n=9050). The temporal trends showed that the nondepression group’s depression level rose earlier than that of the depression group, which we explained by psychological theories and LDA topics extracted from key time points. We also found that New York, California, Florida, and the United States in total all shared a similar depression trend, with Florida having a substantially lower depression level, which was also verified by LDA topic analysis.
Our study has practical implications. For example, upon detecting a rise in depression levels in a certain area, social media platforms could recommend internet-based intervention services to users. A commonly recommended intervention for depression is cognitive behavioral therapy (CBT), a type of therapy that targets one’s irrational thinking patterns and maladaptive behavioral patterns [
Although our data collection method is fast and fully automatic, we acknowledge that the same limitations exist as noted in detail by Coppersmith et al [
The data set used in this study, containing 2575 depression users, was much larger than those used previously, which contained at most 1402 depression users. De Choudhury et al [
COVID-19 has infected over 100 million people worldwide [
Supplemental data statistics and tables.
application programming interface
area under the receiver operating characteristic curve
Bidirectional Encoder Representations from Transformers
bidirectional long short-term memory
bag of words
cognitive behavioral therapy
convolutional neural network
latent Dirichlet allocation
Linguistic Inquiry and Word Count
posttraumatic stress disorder
Robustly Optimized BERT Pretraining Approach
support vector machine
Valence Aware Dictionary and Sentiment Reasoner
YZ and JL conceived and designed the study. YZ performed regular expression search and preprocessing, examined feature importance, and wrote the majority of the manuscript. HL performed data collection and applied the LDA models. HL and YZ analyzed the data and wrote part of the manuscript. YZ and YL trained the models and performed depression monitoring. XZ analyzed the findings using psychological theories. All authors helped design the study and edit the manuscript.
None declared.