Twitter Trends for Celiac Disease and the Gluten-Free Diet: Cross-sectional Descriptive Analysis

Background
Few studies have systematically analyzed information regarding chronic medical conditions and available treatments on social media. Celiac disease (CD) is an exemplar of the need to investigate web-based educational sources. CD is an autoimmune condition wherein the ingestion of gluten causes intestinal damage and, if left untreated by a strict gluten-free diet (GFD), can result in significant nutritional deficiencies leading to cancer, bone disease, and death. Adherence to the GFD can be difficult owing to cost and negative stigma, including misinformation about what gluten is and who should avoid it. Given the significant impact that negative stigma and common misunderstandings have on the treatment of CD, this condition was chosen to systematically investigate the scope and nature of sources and information distributed through social media.

Objective
To address concerns related to educational social media sources, this study explored trends on the social media platform Twitter about CD and the GFD to identify primary influencers and the type of information disseminated by these influencers.

Methods
This cross-sectional study used data mining to collect tweets and users who used the hashtags #celiac and #glutenfree from an 8-month time frame. Tweets were then analyzed to describe who is disseminating information via this platform and the content, source, and frequency of such information.

Results
More content was posted for #glutenfree (1501.8 tweets per day) than for #celiac (69 tweets per day). A substantial proportion of the content was produced by a small percentage of contributors (ie, "superusers"), who could be categorized as self-promoters (eg, bloggers, writers, authors; 13.9% of #glutenfree tweets and 22.7% of #celiac tweets), self-identified female family members (eg, mother; 4.3% of #glutenfree tweets and 8% of #celiac tweets), or commercial entities (eg, restaurants and bakeries). On the other hand, relatively few self-identified scientific, nonprofit, and medical provider users made substantial contributions on Twitter related to the GFD or CD (1% of #glutenfree tweets and 3.1% of #celiac tweets).

Conclusions
Most material on Twitter was provided by self-promoters, commercial entities, or self-identified female family members, and this material may not have been supported by current medical and scientific practices. Researchers and medical providers could potentially benefit from contributing more to this space to enhance the web-based resources for patients and families.


Introduction
Chronic disease diagnoses are often coupled with a significant period of adjustment as patients learn how to manage and live with the condition. Having access to relevant and reliable information is important for educating and aiding new patients in disease management [1-3]. Over the past 16 years, many individuals with a chronic disease have been turning to internet sources, such as social media, for education about their condition and treatment [4,5] despite a hesitancy among physicians and medical providers to use this resource for patient education [6]. Social media use among Americans has increased dramatically across adults of all genders, races, incomes, education levels, and communities since the early 2000s [4]. The social media platform Twitter provides a unique data source whereby important questions can be asked and analyzed regarding how various participants are searching for and sharing information, such as information related to patient education and disease management.
Having the technological ability to collect (ie, "mine") publicly available data on social media platforms such as Twitter provides an opportunity to systematically quantify and categorize information on such platforms into trends and useful information for interested parties (eg, patients with chronic diseases). One way of applying these emerging methodologies to social media information is through the concept of "affinity spaces." Affinity spaces represent either physical or web-based gathering places (rather than geographic or identity-based communities) where people come together in a "common endeavor" to develop and share various types of knowledge, including individual, internal, and in-depth information [7].
The systematic application of common data mining techniques on social media platforms facilitates the analysis of disease management-related trends and information available to patients [5]. This is of relevance to those with celiac disease (or in British English, "coeliac disease"; CD). CD is a condition that requires extensive education around a dietary treatment steeped in stigma and myth [8]. CD is a chronic autoimmune condition wherein the ingestion of gluten results in an immune-mediated injury to the small intestine [9]. Damage to the small intestine leads to malabsorption of nutrients and can result in short- and long-term complications ranging from gastroenterological distress to cancer and even death [9]. It is estimated that CD affects approximately 1% of individuals worldwide [10]. To date, the only treatment is adherence to a strict gluten-free diet (GFD) [11]. A CD diagnosis is followed by heavy biopsychosocial demands and challenges [12,13].
Prior work on broader internet-based sources for CD education is emerging and raises concern about the information, and misinformation, that is presented by these sources [14-18]. Overall, information disseminated by the top websites found in web-based searches conducted by researchers is not entirely accurate, transparent, or reliable for interested consumers such as patients or providers, including dietitians [15,17,19]. Moreover, despite their potential to reach millions of viewers, the top videos on YouTube related to CD in 2019 lacked adequate or helpful information [14].
Given the high prevalence of CD and the heavy burden associated with managing CD and the GFD, many resources are available; nonetheless, it is difficult to identify credible educational information about the treatment for CD (a GFD). New methodologies from the field of computer science have emerged that allow for further exploration of patient education through not only the internet but also, more specifically, the social media space. The purpose of this study was to combine the fields of computer science and behavioral science to explore trends on Twitter as an educational source for patients with CD. This study conducted a preliminary evaluation of the scope and nature of information available on Twitter by (1) determining which primary contributors lead the conversations about CD and GFD-related topics on Twitter and (2) identifying what type of information (ie, content, source, and frequency) is being disseminated by these contributors.

Selecting an Internet Information Source
The social media platform Twitter allows for broader access to data than other social media platforms. Additionally, "tweets" (posts from Twitter users) and user profile descriptors are text-based rather than image-based (such as content found on Instagram), which allows for more ready analysis of the data. Although it is not as widely used as YouTube (81%) or Facebook (69%), Twitter is used by approximately a quarter (23%) of American adults, with relatively equal use among self-identified men and women and across racial groups [5]. A 2021 survey of US adults demonstrated that young adults (18 to 29 years) are the predominant users of social media [5]. However, use by older adults (>65 years) has increased in recent years: in 2021, 45% of older adults reported using at least 1 social media site [5]. Given the ready availability of the data and its wide use, including among individuals with CD, Twitter was chosen as the social media source for this study.

Defining Affinity Spaces
An increasingly common research practice has been to examine affinity spaces found on the popular social media platform Twitter through the use of hashtags (an author's use of the hash symbol followed by the subject of a message as a way to categorize and group messages; eg, #celiac and #glutenfree) [20,21]. These hashtags are conceptualized as a type of affinity space to explicate how these organic web-based spaces are used by communities to communicate, share, and find information [20,21]. As an open platform with very few barriers to participation and 330 million monthly active users [22], Twitter encourages the organic development of affinity spaces around topics and events via hashtagged keywords (eg, #celiac and #glutenfree).
The 2 topics most central to this study are "celiac" and "gluten-free." Information available on Twitter regarding these topics might exhibit different norms in terms of who participates in these affinity spaces and how (eg, someone might want information on a gluten-free diet for non-celiac-related reasons). Hence, the original tweets that were tagged by Twitter users with either the #celiac or #glutenfree hashtag were treated as 2 different affinity spaces rather than 1 collective affinity space. These affinity spaces were then analyzed individually and compared to each other. Moreover, recognizing that many other hashtags might be used synonymously with #glutenfree or #celiac, hashtags akin to either of these terms were included in the respective affinity space (ie, #gluten-free, #glutenfreediet, and #gluten_free were included with #glutenfree, and #celiacdisease, #celiacs, #celiacsdisease, #coeliac, and #coeliacdisease were included with #celiac).

Data Collection
This study collected Twitter user and tweet data using the public Twitter application programming interface. Researchers used custom PHP scripts to collect query results and store them in a MySQL relational database for cleaning and analysis. The data set used in this study consisted of tweets that included any of the hashtags listed above. Given the large number of such tweets, we limited the time frame of our study to 8 months (October 27, 2019, through June 8, 2020), allowing us to have sufficient data for analysis without being influenced too heavily by a single event (eg, the US Thanksgiving holiday season or the onset of the COVID-19 pandemic). Furthermore, because our main emphasis was to understand who was posting to these hashtags, "retweets" (reposts of a message created by a different user) were excluded to focus only on original posts and the users who generated them. This resulted in 334,907 and 15,602 original tweets containing #glutenfree and #celiac, respectively, including those from synonymous hashtags, for analysis. Table 1 provides an overview of general user and tweet metadata over the 8-month collection period.
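The inclusion rules described above (hashtag match, collection window, no retweets) can be illustrated with a minimal Python sketch. This is not the authors' PHP and MySQL pipeline. The record fields `text`, `created`, and `is_retweet` are hypothetical names chosen for illustration, and allowing hyphens inside a hashtag is an assumption made only so that the listed #gluten-free variant can match.

```python
import re
from datetime import date

# Synonymous hashtags per affinity space, as listed in the text
# (lowercased, without the leading "#").
AFFINITY_SPACES = {
    "glutenfree": {"glutenfree", "gluten-free", "glutenfreediet", "gluten_free"},
    "celiac": {"celiac", "celiacdisease", "celiacs", "celiacsdisease",
               "coeliac", "coeliacdisease"},
}

# Collection window stated in the text (treated here as inclusive).
START, END = date(2019, 10, 27), date(2020, 6, 8)

# Assumption: hyphens are allowed inside a tag so "#gluten-free" is captured whole.
HASHTAG_RE = re.compile(r"#([\w-]+)")


def affinity_spaces(tweet):
    """Return the affinity spaces an original, in-window tweet belongs to.

    `tweet` is a dict with hypothetical keys "text", "created", "is_retweet".
    """
    if tweet["is_retweet"] or not (START <= tweet["created"] <= END):
        return set()
    tags = {tag.lower() for tag in HASHTAG_RE.findall(tweet["text"])}
    return {space for space, synonyms in AFFINITY_SPACES.items() if tags & synonyms}
```

A tweet that carries hashtags from both lists is assigned to both spaces, which is consistent with the cross-listing between the two data sets reported in the Results.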

Data Analysis
As is standard when analyzing affinity spaces with data gathered from Twitter [20,21], all tweets and their authors' publicly available profile data (eg, Twitter handles and locations) were saved to a database. Descriptive statistics of tweet and user objects were calculated to determine how to classify users into user types for further analysis. These statistics revealed a highly positive skew in users' posting activity, which was expected given previous studies carried out on Twitter data [23]. Based on the positive skew, van Mierlo's [24] 90-9-1 Principle was selected to classify users in each affinity space into relative activity groups. Users were classified as follows: superusers (top 1% of users posting content), contributors (next 9% of users contributing content), or lurkers (the remaining 90% of users; see Table 1) [24]. Following the standard for affinity space analysis [20,21], basic language processing techniques were then used to (1) extract keywords from user biographies (eg, "doctor" or "blogger"), (2) identify co-occurring hashtags (eg, "#vegan" or "#recipe"), and (3) identify common domains that users linked to in their tweets (eg, celiac.com). A detailed description of these categories is provided below in the Results section.
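As an illustration of the 90-9-1 classification, the following Python sketch ranks users by their number of original tweets and assigns the top 1% to superusers, the next 9% to contributors, and the rest to lurkers. The function name and the handling of ties at the group boundaries (whatever order the ranking produces) are assumptions; the text does not state how ties were resolved.

```python
from collections import Counter


def classify_users(tweet_authors):
    """Map each author to 'superuser', 'contributor', or 'lurker'.

    `tweet_authors` holds one entry per original tweet (the author's handle).
    Users are ranked by post count; ties keep first-seen order (an assumption).
    """
    counts = Counter(tweet_authors)
    ranked = [user for user, _ in counts.most_common()]  # most active first
    n_users = len(ranked)
    n_super = max(1, round(n_users * 0.01))  # top 1%, at least one user
    n_contrib = round(n_users * 0.09)        # next 9%

    groups = {}
    for rank, user in enumerate(ranked):
        if rank < n_super:
            groups[user] = "superuser"
        elif rank < n_super + n_contrib:
            groups[user] = "contributor"
        else:
            groups[user] = "lurker"
    return groups
```

Because the groups are defined by rank rather than by a fixed tweet threshold, the same procedure can be applied separately to each affinity space, as the study does.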

Ethical Considerations
Ethics approval was obtained or determined to not be necessary by all author institutions owing to the public nature of the data.

User Activity Group: Superusers, Contributors, and Lurkers
Participation in each affinity space (ie, #glutenfree and #celiac) was spread across all 3 groups, with superusers producing 25.5% and 28.7% of posts containing #glutenfree and #celiac, respectively; contributors producing 25.2% and 35.2% of posts; and lurkers producing 49.3% and 36.1% of posts. In other words, superusers (1% of users posting to the named affinity spaces) posted on average 10.0 times (#glutenfree) and 7.3 times (#celiac) more than contributors (the next 9% of users contributing), and contributors posted 4.5 times (#glutenfree) and 9.8 times (#celiac) more than lurkers (the other 90% of users posting to these spaces). Additionally, a comparison of raw tweet counts showed that lurker behaviors were similar between the 2 hashtag groups but that #celiac superusers and contributors posted at least twice as often as their #glutenfree counterparts. The #glutenfree space contained more than 20 times as many tweets as #celiac, but 40.3% of tweets in #celiac were also cross-listed in the #glutenfree data set (Table 1).
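The per-user ratios can be checked with a short calculation. The sketch below takes 28.7%, 35.2%, and 36.1% as the #celiac post shares for superusers, contributors, and lurkers, and assumes the nominal 1%, 9%, and 90% user shares of the 90-9-1 split. The actual group sizes are given in Table 1 and may differ slightly from these nominal shares, so this is a rough consistency check rather than a reproduction of the authors' calculation.

```python
def per_user_ratio(post_share_a, user_share_a, post_share_b, user_share_b):
    """How many times more the average user in group A posts than in group B."""
    return (post_share_a / user_share_a) / (post_share_b / user_share_b)


# #celiac post shares from the text (percent of posts).
celiac_posts = {"superuser": 28.7, "contributor": 35.2, "lurker": 36.1}
# Assumption: nominal 90-9-1 user shares (percent of users).
user_share = {"superuser": 1, "contributor": 9, "lurker": 90}

super_vs_contrib = per_user_ratio(
    celiac_posts["superuser"], user_share["superuser"],
    celiac_posts["contributor"], user_share["contributor"],
)  # about 7.3
contrib_vs_lurker = per_user_ratio(
    celiac_posts["contributor"], user_share["contributor"],
    celiac_posts["lurker"], user_share["lurker"],
)  # about 9.8
```

Under these assumptions the results round to the 7.3 and 9.8 reported for #celiac.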

Biographical Self-descriptors
To understand the professional backgrounds of Twitter users posting to these hashtags, each user's self-description was parsed out into a list of keywords [25] after removing stop words (eg, "a," "and," and "the"). Descriptions produced roughly 200,000 unique keywords (eg, "blogger" and "author"). The study team reviewed the most common 500 keywords for each hashtag and user activity group and then excluded those that did not suggest the author's expertise or were disassociated from the topic (eg, "director" and "vegan" were retained, while "music" and "www" were excluded). Descriptors related to family relationships were also retained (eg, "mother" was included), expecting that many family members of individuals with CD would participate in these affinity spaces to learn more about managing CD and the GFD. Specifically, tweets from users who self-identified with these keywords related to female family relationships (eg, mother or wife) represented 4.3% of tweets containing #glutenfree and 8% of those containing #celiac. Male family relationships (eg, father or husband) represented 1.5% of tweets containing #glutenfree and 1.2% of those containing #celiac.
Recognizing that some users might identify terminal degrees and medical expertise in their name fields instead of their descriptions, a keyword search for variants of "Doctor," "Physician," "PhD," "MD," and "dietitian" on names was conducted. This showed that 0.4% of #glutenfree users and 2.1% of #celiac users self-identified with one of these terms in this way, but this calculation also included various distractors, such as multiple references to the television series "Doctor Who."
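The keyword step described in this section can be sketched in Python as follows. This is not the tool cited in the text [25]. The stop-word list is a small illustrative set, and counting each keyword at most once per profile is an assumption.

```python
import re
from collections import Counter

# Illustrative stop words only; a real list would be much longer.
STOP_WORDS = {"a", "an", "and", "the", "of", "to", "in", "for", "i", "my", "is"}


def bio_keywords(bio):
    """Return the distinct lowercase non-stop-word tokens in a profile description."""
    tokens = re.findall(r"[a-z]+", bio.lower())
    return {token for token in tokens if token not in STOP_WORDS}


def keyword_counts(bios):
    """Count how many profiles mention each keyword (once per profile)."""
    counts = Counter()
    for bio in bios:
        counts.update(bio_keywords(bio))
    return counts
```

Calling `keyword_counts(bios).most_common(500)` would give the kind of ranked list that the study team then reviewed by hand. Simple keyword matching cannot tell a physician from a fan of "Doctor Who," which is why the manual review step matters.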

Affinity Spaces: #glutenfree Versus #celiac
Comparing the 2 affinity spaces, #glutenfree was much more active, averaging 1501.8 (SD 223.2) tweets per day, while #celiac averaged 69.0 (SD 16.7) tweets per day. Users posting to #glutenfree represented 163,911 accounts, averaging 2.0 (SD 10.0) tweets per account for the time period, while users posting to #celiac represented 4383 accounts, averaging 3.6 (SD 12.4) tweets per account. At the user participation level, a noticeable overlap was found between affinity spaces, with 64.0% of #celiac posters also posting to #glutenfree in the time period (with 1.7% of #glutenfree users also posting to #celiac).
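As a quick arithmetic check, the per-account averages and the user overlap percentages follow from the tweet and account totals reported in the text. The Python sketch below uses only those reported figures; the variable names are illustrative.

```python
# Totals reported in the text for the 8-month collection period.
glutenfree = {"tweets": 334_907, "accounts": 163_911}
celiac = {"tweets": 15_602, "accounts": 4_383}


def tweets_per_account(space):
    """Mean number of original tweets per posting account."""
    return space["tweets"] / space["accounts"]


# 64.0% of #celiac posters also posted to #glutenfree.
shared_accounts = 0.640 * celiac["accounts"]
# The same accounts expressed as a percentage of #glutenfree posters.
share_of_glutenfree = 100 * shared_accounts / glutenfree["accounts"]

glutenfree_mean = tweets_per_account(glutenfree)  # about 2.0
celiac_mean = tweets_per_account(celiac)          # about 3.6
```

The shared accounts come to about 1.7% of #glutenfree posters, matching the figure in the text. The overlap looks large from the #celiac side and small from the #glutenfree side only because the two spaces differ so much in size.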

Co-occurring Hashtags
To better understand the nature of the tweets that were being posted in each affinity space, the use of co-occurring hashtags was analyzed for easy grouping. In other words, hashtags appearing in these tweets that did not share a word stem with the targeted grouping hashtags (eg, #vegan was included in #glutenfree, while #gluten and #gf were ignored) were analyzed to identify groupings [26]. The percentage for each co-occurring hashtag was calculated as the likelihood that the hashtag was used, given that the tweet contained any co-occurring hashtag at all (see Tables 6 and 7).
Tweets containing #celiac were highly represented in the #glutenfree data set, ranking at a similar level to mentions of paleo and keto diet hashtags, but overall results indicate that tweets containing #glutenfree focused heavily on a variety of other diets, including #vegan, #dairyfree, #plantbased, #keto, #paleo, #vegetarian, and #organic, suggesting that interest in GFDs was most commonly associated with a variety of weight loss and health regimens unrelated to CD (Tables 6 and 7). In the #celiac data set, gluten-related hashtags were dominant (with #glutenfree co-occurring in 50.5%-69% of tweets; see Tables 6 and 7), but other hashtags were more varied, with some focusing on recipes (eg, #veganrecipes), others on symptoms (eg, #chronicpain), and others on other diseases (eg, #IBD and #IBS). These hashtags amounted to less than 1% of overall tweets.
Comparing the 2 affinity spaces, it appeared that #glutenfree was both more widely used and more lifestyle based (eg, associated with other diet trends such as paleo or keto) than the #celiac space (see Tables 6 and 7).
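The co-occurrence percentage described in this section can be sketched in Python as below. Prefix matching is used here as a crude stand-in for the word-stem comparison; it is an assumption, not the authors' method, and it would also drop unrelated tags that happen to begin with the same letters. The function and parameter names are hypothetical.

```python
import re
from collections import Counter

HASHTAG_RE = re.compile(r"#(\w+)")


def cooccurrence_rates(tweets, ignored_stems):
    """Percent of tweets using each co-occurring hashtag.

    The denominator is the number of tweets with at least one hashtag that
    does not start with any of `ignored_stems`, so each rate is conditional
    on the tweet having any co-occurring hashtag at all.
    """
    counts = Counter()
    tweets_with_any = 0
    for text in tweets:
        tags = {tag.lower() for tag in HASHTAG_RE.findall(text)}
        others = {tag for tag in tags
                  if not any(tag.startswith(stem) for stem in ignored_stems)}
        if others:
            tweets_with_any += 1
            counts.update(others)
    if tweets_with_any == 0:
        return {}
    return {tag: 100 * n / tweets_with_any for tag, n in counts.items()}
```

For the #glutenfree space, a call such as `cooccurrence_rates(texts, ("gluten", "gf"))` would ignore #gluten and #gf, as in the example given in the text, while still counting tags such as #vegan.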

Shared Link Domains
To understand what resources users were sharing, the domains of unshortened links in tweets were analyzed. URL shorteners that were used as aliases rather than as actual direct links (eg, bit.ly) and automated content providers were ignored [27]. Results for both affinity spaces revealed that links to social media and video sharing sites were common (eg, Instagram, Pinterest, and YouTube), and many blog, recipe, and other specialty sites were heavily linked to as well (see Table 8). Some of these domains were highly represented because many users were tweeting about them (eg, 1064 users tweeting YouTube videos in posts containing #glutenfree), but others were highly represented because a relatively small number of users were promoting a specific resource (eg, 1 user tweeting about foodgawker.com 136 times and promoting it to the #2 spot; Table 8).
Domains ending in ".com" (ie, commercial sites) were more prevalent than nonprofit (.org) or government (.gov) domains. In fact, keyword searches for .com, .org, and .gov domains on the overall data set revealed that posts containing #glutenfree and #celiac linked to .com websites 54.7 and 16.8 times more often, respectively, than to .org sites and 1173.0 and 44.7 times more often, respectively, than to .gov domains. This shows that the commercial influence seems to be much more apparent in, and disproportionate to other influences in, the #glutenfree space but that information in the #celiac space may also be heavily dominated by commercial interests.
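A minimal Python sketch of the domain analysis is shown below. It counts links per domain, tracks how many distinct users shared each domain (to separate broadly shared sites from sites promoted by a single account), and totals links by top-level domain. The shortener set contains only the example named in the text; the function names and the `(user, url)` input shape are assumptions.

```python
from collections import Counter, defaultdict
from urllib.parse import urlparse

# Only the example from the text; a real exclusion list would be longer.
SHORTENERS = {"bit.ly"}


def domain_of(url):
    """Return the lowercase host of a URL without a leading 'www.'."""
    host = (urlparse(url).hostname or "").lower()
    return host[4:] if host.startswith("www.") else host


def domain_stats(links):
    """Summarize (user, url) pairs as {domain: (link_count, distinct_users)}."""
    counts = Counter()
    users = defaultdict(set)
    for user, url in links:
        domain = domain_of(url)
        if not domain or domain in SHORTENERS:
            continue  # skip unparseable links and shortener aliases
        counts[domain] += 1
        users[domain].add(user)
    return {domain: (counts[domain], len(users[domain])) for domain in counts}


def tld_link_count(stats, tld):
    """Total links to domains ending in `tld` (eg, '.com')."""
    return sum(n for domain, (n, _) in stats.items() if domain.endswith(tld))
```

Dividing `tld_link_count(stats, ".com")` by `tld_link_count(stats, ".org")` gives a ratio of the kind reported above. A domain with a high link count but a distinct-user count of 1 corresponds to the single-promoter pattern described for foodgawker.com.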

Principal Findings
The purpose of this study was to combine methods from computer science and the behavioral sciences to begin to examine internet-based CD educational sources. As part of this initial investigation, this study describes information about CD and the GFD disseminated on the social media platform Twitter. With increasing use of social media as an educational resource and source of support for populations of individuals with chronic illness [28-30], it is crucial to understand the nature of information on platforms such as Twitter. Our findings emphasize the prominence of posts on both CD and the GFD that appear to come from users focused on promotion of themselves (eg, identifying as vegan) or a business (eg, endorsing a restaurant) rather than from more traditional sources of information such as medical professionals or nonprofit organizations [19]. This supports previous findings regarding the hesitancy of medical providers to engage in social media as a form of medical education [6]. It also raises concerns about the quality of information individuals are receiving about CD and the GFD, as individuals with CD require the GFD for medical purposes [19]. This is likely not unique to CD, as similar concerns have been raised in the field of food allergies [30]. We propose the need for a social media presence focused on providing high-quality, up-to-date, fact-checked information to users, particularly those with CD or other gluten-related diseases.

Clinical Implications
Based on our findings, there is an opportunity and arguably a demand for increased presence on social media and internet-based platforms among medical and nonprofit experts in CD to provide high-quality information to consumers. This has been executed among populations of individuals with other diseases, such as inflammatory bowel disease (IBD). For example, ImproveCareNow [31] is a community of clinicians, researchers, parents, and patients of children and youths with IBD. The main goal of this organization is to provide a platform to help this community learn about "more reliable, proactive IBD care" [31]. Their social media campaign involves accounts on various platforms, including a blog, Facebook, Twitter, and YouTube. The content posted on these platforms is monitored by the organization.
Guidelines have been developed by several organizations to help inform medical providers on social media best practices, including the Association for Healthcare Social Media [32]. The use of guidelines can best inform medical providers on the use of social media as a source of patient education. Other groups are working to develop competencies including advocacy and communication responsibilities that specialists in various areas of health might develop in helping to educate certain patient populations [33].

Limitations and Future Directions
There are several additional considerations for this study in analyzing publicly available Twitter data. First, we collected our sample of data during a relatively narrow (8-month) time period, which may not account for natural variations across seasons and events (eg, holidays and major scientific or medical conferences). The activity and nature of posts may have changed as the COVID-19 pandemic has continued. Second, our analysis did not examine co-occurring words within individual user accounts. For instance, it is possible that one account may note being a "vegan," "blogger," and "mom." Future research could collect more detailed information about active members of social media to better understand "influencers" in this area. Furthermore, this study should be understood in light of the typical Twitter user. Twitter is used by about a quarter of American adults, both men and women of various racial groups, but we recognize that social media users may be younger and not necessarily representative of all ages and demographics [5]. Future work might examine the role of social media use in educating different subgroups of the population.
Additionally, we used established but relatively new methods of automated extraction and categorization of data rather than human coding, though we used human observation and judgment during the process of cleaning and synthesizing the data. This relied on algorithms based on anticipated data and did not allow for inductive reasoning by the human eye. Such an approach allowed the study team to rely on objective data rather than potential biases or a priori assumptions of individual experts [34]. Future studies may strengthen knowledge on this topic through expansion of data collection across a longer time span and further evaluation of the nature of users as well as the sentiments and accuracy of content within tweets.

Conclusions
To our knowledge, this was the first study evaluating Twitter data using the topics #celiac and #glutenfree. Given the popularity and broad use of social media, this is an important starting point for this research that generates several new hypotheses and research questions. Our findings emphasize the large volume of information communicated on social media. We suggest that platforms such as Twitter pose risks of spreading biased or inaccurate information to the public, particularly when the information comes from entities that may be influenced by commercial conflicts of interest.
Social media represents an immense opportunity to achieve open and clear dialogue between health care professionals and the public, which could be a major facilitator of future research and patient education about CD and the GFD.