This is an open-access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work, first published in JMIR Infodemiology, is properly cited. The complete bibliographic information, a link to the original publication on https://infodemiology.jmir.org/, as well as this copyright and license information must be included.
As direct-to-consumer genetic testing services have grown in popularity, the public has increasingly relied upon online forums to discuss and share their test results. Initially, users did so anonymously, but more recently, they have included face images when discussing their results. Various studies have shown that sharing images on social media tends to elicit more replies. However, users who do this forgo their privacy. When these images truthfully represent a user, they have the potential to disclose that user’s identity.
This study investigates the face image sharing behavior of direct-to-consumer genetic testing users in an online environment to determine if there exists an association between face image sharing and the attention received from other users.
This study focused on r/23andme, a subreddit dedicated to discussing direct-to-consumer genetic testing results and their implications. We applied natural language processing to infer the themes associated with posts that included a face image. We applied a regression analysis to characterize the association between whether a post contained a face image and the attention that the post received, measured by the number of comments and the karma score (defined as the number of upvotes minus the number of downvotes).
We collected over 15,000 posts from the r/23andme subreddit, published between 2012 and 2020. Face image posting began in late 2019 and grew rapidly, with over 800 individuals revealing their faces by early 2020. The topics in posts including a face were primarily about sharing, discussing ancestry composition, or sharing family reunion photos with relatives discovered via direct-to-consumer genetic testing. On average, posts including a face image received 60% (5/8) more comments and had karma scores 2.4 times higher than other posts.
Users of direct-to-consumer genetic testing in the r/23andme subreddit are increasingly posting face images and testing reports on social platforms. The association between face image posting and a greater level of attention suggests that people are forgoing their privacy in exchange for attention from others. To mitigate this risk, platform organizers and moderators could inform users about the risk of posting face images in a direct, explicit manner to make it clear that their privacy may be compromised if personal images are shared.
The cost of genome sequencing has steadily decreased over time [
As DTC-GT services have grown in popularity, consumers have increasingly relied upon online social platforms to discuss and share their test results (though not always the raw genome sequences) [
When r/23andme users share their results for discussion, instead of simply typing text, some users attach a screenshot of their DTC-GT result page (eg, the ancestry composition). Since Reddit is a virtual online community where users generally rely upon pseudonyms for communication, such screenshots of results typically do not contain a user’s real name. Therefore, even when users share and discuss their DNA test results, this subreddit has historically been a community with a culture of anonymity.
However, in 2019, r/23andme users began attaching personal images to their posts.
An example of a face image posted on the r/23andme subreddit. The report is shown together with a face image and testing results. The actual face and name are obscured for this publication; however, the data exist in the public domain.
Though users may be aware that revealing their face likely compromises their privacy, it is unclear why they choose to do so. Various investigations into behavioral psychology and economics show that some people waive their privacy rights in exchange for a service that they value [
To answer these questions, we collected posts from the r/23andme subreddit and categorized them into three types: (1) posts with only text, (2) posts with face images, and (3) posts with images not containing a face. We then measured temporal posting trends for each type of post. Next, we applied topic modeling to compare the primary topics associated with each type of post. Finally, we performed a regression analysis to infer the association between whether a post contained a face image and the attention that the post received, in terms of votes and comments.
This study involved only online posts that were openly accessible on Reddit. We have published the analysis results only in this paper, and any referenced posts or figures have been anonymized to protect the privacy of users.
An overview of the research workflow for r/23andme post analysis. RQ: research question.
To collect data from the r/23andme subreddit, we first gathered the IDs of all posts (ie, submissions) and comments using pushshift.io. We then applied the Python Reddit application programming interface wrapper package (version 6.3.1) to extract data from Reddit for each post ID. Specifically, we collected all posts and comments published on r/23andme between December 31, 2012, and January 31, 2020. Each collected post contained the following information: (1) author identifier, (2) post title, (3) post text body, (4) image URL (if there was an image in the post), (5) comments on the post, (6) post date, and (7) karma scores of the post and affiliated comments.
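As an illustration of the record structure described above, the following minimal sketch flattens one post into the seven listed fields. It is not the authors' collection code. The function name `post_record`, the dictionary keys, and the file-suffix heuristic for recognizing image URLs are assumptions made for this example. The attribute names (`author`, `title`, `selftext`, `url`, `created_utc`, `score`, `body`) follow those exposed on submission and comment objects by the Python Reddit API wrapper.

```python
from datetime import datetime, timezone

# Assumed heuristic: treat a URL as an image if it ends in a common image suffix.
IMAGE_SUFFIXES = (".jpg", ".jpeg", ".png", ".gif")


def post_record(submission, comments):
    """Flatten a submission-like object and its comments into one record.

    `comments` is assumed to be an already expanded, flat iterable of
    comment-like objects with `author`, `body`, and `score` attributes.
    """
    url = getattr(submission, "url", "") or ""
    is_image = url.lower().endswith(IMAGE_SUFFIXES)
    posted = datetime.fromtimestamp(submission.created_utc, tz=timezone.utc)
    return {
        "author": str(submission.author) if submission.author else None,
        "title": submission.title,
        "body": submission.selftext,
        "image_url": url if is_image else None,
        "comments": [
            {
                "author": str(c.author) if c.author else None,
                "body": c.body,
                "karma": c.score,
            }
            for c in comments
        ],
        "date": posted.date().isoformat(),
        "karma": submission.score,
    }
```

With the wrapper package, one plausible usage would be to obtain `submission = reddit.submission(id=post_id)`, call `submission.comments.replace_more(limit=None)`, and pass `submission.comments.list()` as `comments`; the credentials and rate limits required for this are outside the scope of the sketch.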
We downloaded the images from posts containing an image URL and applied the face-recognition Python package (version 1.3.0) [
To describe face image posting behavior, we compared the face posts with the other two types of posts along three perspectives: (1) posting temporal trend, (2) post theme, and (3) the attention that a post received from other users, in terms of the number of comments and karma score.
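The comparison of attention across the three post types can be sketched as follows. This is an illustrative example rather than the analysis code used in the study. The field names (`image_url`, `face_count`, `num_comments`, `karma`) and the rule that any detected face makes a post a face post are assumptions, and the face count is taken as given from whatever face detector is used.

```python
from collections import defaultdict
from statistics import mean


def post_type(image_url, face_count):
    """Label a post as 'text', 'face', or 'faceless' (image without a face)."""
    if not image_url:
        return "text"
    return "face" if face_count > 0 else "faceless"


def mean_attention(posts):
    """Group posts by type and report the mean comment count and karma score."""
    groups = defaultdict(list)
    for post in posts:
        groups[post_type(post["image_url"], post["face_count"])].append(post)
    return {
        kind: {
            "n": len(items),
            "mean_comments": mean([p["num_comments"] for p in items]),
            "mean_karma": mean([p["karma"] for p in items]),
        }
        for kind, items in groups.items()
    }
```

Raw group means such as these describe the data but do not adjust for post age, topic, or user activity, which is why a regression with control variables is still needed to characterize the association.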
To examine the thematic differences between the three post types, we applied topic modeling [
We investigated two types of associations. First, we considered the association between an image post (with and without a face) and the attention it received. Second, we considered the association between a face post and the attention it received. Since the number of comments and the karma score are nonnegative count variables, we applied a negative binomial regression to infer the association [
Given that posts published earlier may be read by more readers and, thus, receive more comments and votes, we included the number of days a post had been published as a control variable. In addition, posts on different topics might receive different levels of attention. To reduce the effects of post topic, we incorporated the topic distribution of each post as an additional set of control variables. During model fitting, we dropped one topic (T4, see below) to address collinearity.
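To make the role of the control variables concrete, the sketch below builds one row of a design matrix for a single post. It is a hypothetical illustration: the function name, the column names, and the inclusion of user activity counts are choices made for this example, and the logarithm of the number of days is used on the assumption that post age enters the model in log-transformed form. Because the topic proportions of a post sum to 1, including all 10 of them together with an intercept would make the columns linearly dependent, which is why one topic (here T4) is omitted.

```python
import math

TOPICS = [f"T{i}" for i in range(1, 11)]
DROPPED_TOPIC = "T4"  # omitted so the topic columns are not collinear with the intercept


def design_row(has_face, days_published, topic_dist, user_posts, user_comments):
    """Return one design-matrix row as a dict of column name to value.

    `days_published` is assumed to be at least 1 so that the logarithm is defined.
    `topic_dist` maps each topic label in TOPICS to its proportion for this post.
    """
    row = {
        "const": 1.0,
        "face": 1.0 if has_face else 0.0,
        "log_days": math.log(days_published),
        "user_posts": float(user_posts),
        "user_comments": float(user_comments),
    }
    for topic in TOPICS:
        if topic != DROPPED_TOPIC:
            row[f"topic_{topic}"] = topic_dist[topic]
    return row
```

Rows of this form, stacked into a matrix, could then be passed together with the outcome (number of comments or karma score) to a count regression model.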
Moreover, the activity level of users might affect the popularity of their posts. For example, posts from active users may receive more attention. To reduce the impact of user activity, we incorporated the number of posts and the number of comments of each user as an additional set of control variables. We utilized the implementation of negative binomial regression in the statsmodels Python package (version 0.11.1) to fit models for the karma score and the number of comments separately. We reported the features that achieved statistical significance at the
We collected 15,596 posts and 188,843 comments, which were published by 20,883 users between December 31, 2012, and January 31, 2020. Among the collected posts, 24.8% (3818/15,596) contained faceless images, while 5.4% (849/15,596) contained face images.
In
Smoothed temporal trends of three types of post, including the number of posts published per month (A) and quarterly growth rate of posts (B).
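A quarterly growth rate of the kind plotted in panel B can be computed as in the following sketch. The definition used here, the relative change in the number of posts from one calendar quarter to the next, is an assumption made for illustration, and no smoothing is applied.

```python
from collections import Counter
from datetime import date


def quarter_key(d):
    """Map a date to a (year, quarter) pair, with quarter in 1..4."""
    return (d.year, (d.month - 1) // 3 + 1)


def quarterly_counts(post_dates):
    """Count posts per calendar quarter."""
    return Counter(quarter_key(d) for d in post_dates)


def quarterly_growth(counts):
    """Return {quarter: (n - prev) / prev} for quarters whose predecessor has posts."""
    growth = {}
    for (year, quarter), n in sorted(counts.items()):
        prev_key = (year, quarter - 1) if quarter > 1 else (year - 1, 4)
        prev = counts.get(prev_key, 0)
        if prev > 0:
            growth[(year, quarter)] = (n - prev) / prev
    return growth
```

Quarters with no posts in the preceding quarter are skipped because the relative change is undefined when the denominator is zero.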
Attention to three types of posts. The number of comments per post (A) and karma score per post (B). For presentation purposes, we removed posts with more than 80 comments or karma scores greater than 150 (3% of the data). The entire data set is provided in Figure S3 and Figure S4 in
We measured user activity in terms of the number of posts and comments. We found that 26.8% (2442/9114) of the users posted faceless images, while 8.5% (774/9114) posted face images.
Number of posts per user (A) and number of comments per user (B) for users who posted (1) text only, (2) faceless images, and (3) face images. For presentation purposes, we removed users who published more than 10 posts or 50 comments, accounting for 4.4% of the total number of users. The entire data set is provided in Figure S3 and Figure S4 in
Ancestry composition included 4 topics: T1, T2, T3, and T4. Posts in this category focused on the presentation and discussion of ancestry composition testing results. The 4 topics captured ancestry information, which communicates a user’s race, continental origin, and nationality.
The topics inferred from the r/23andme subreddit. The sample words are presented in descending order according to their relevance score within the topic.
Category | Topic | Top-20 most relevant terms | Topic distribution
Ancestry composition | Topic 1 | European, -PRON-, result, Italian, Irish, British, surprise, Jewish, white, Chinese, broadly, bit, eastern, Ashkenazi, surprised, Scandinavian, give, eye, lot, surprising | 11.6%
Ancestry composition | Topic 2 | -PRON-, ancestry, German, guess, French, make, post, heritage, year, ethnicity, grandmother, common, grandparent, explain, mega-thread, feel, polish, Canadian, confused, wrong | 7.9%
Ancestry composition | Topic 3 | result, -PRON-, expect, finally, back, ancestor, interesting, pretty, AncestryDNA, bear, confidence, recent, location, Filipino, cool, guy, live, thought, Finnish, big | 9.1%
Ancestry composition | Topic 4 | American, Asian, African, native, Mexican, people, south, percentage, region, Neanderthal, gene, high, part, Spanish, unassigned, east, north, variant, trace, add | 10.6%
Kinship and family member discovery | Topic 5 | -PRON-, family, today, close, tree, understand, worth, info, don, trait, history, link, happen, picture, excited, love, list, connection, inherit, risk | 6.5%
Kinship and family member discovery | Topic 6 | -PRON-, find, dad, half, mom, father, cousin, mother, side, sister, adopt, brother, great, sibling, grandfather, full, grandma, biological, aunt, figure | 9.2%
General questions about genetic testing | Topic 7 | kit, long, time, extraction, wait, timeline, genetic, day, receive, sample, analysis, week, testing, step, send, batch, fail, information, work, stick | 14.2%
General questions about genetic testing | Topic 8 | andme, ancestry, datum, health, raw, accurate, GEDmatch, MyHeritage, good, DNA, upload, compare, site, comparison, land, data, service, difference, WeGene, interpret | 11.0%
General questions about genetic testing | Topic 9 | DNA, test, relative, question, parent, report, share, -PRON-, phase, show, generation, relate, computation, person, unexpected, noise, mystery, relationship, account, number | 9.7%
General questions about genetic testing | Topic 10 | result, update, beta, haplogroup, match, maternal, change, paternal, chromosome, map, mixed, chip, Puerto Rican, Korean, lose, comment, late, original, Romanian | 10.2%
“So I’m a lot less British than I thought, and a lot more Swiss” (Topic 1).
“Any guesses on my friend’s ethnicity? He thinks he’s French/German, English, and maybe some Slavic” (Topic 2).
“Born and raised in Manila, grew up thinking I was 100% Filipino. A bit shocked at my results” (Topic 3).
“Found out I am East Asian and Native American but I have northern Asian and Native American so high” (Topic 4).
“Found out I have about a dozen cousins I didn’t know about” (Topic 6).
“My cousin did the DNA test and connected us to our great grandmother’s family!” (Topic 5).
“On my account apparently my mom and her twin sister are both my moms” (Topic 6).
“Is my kit moving slow? It took 2 weeks to be marked as “arrived” after tracking showed it was delivered” (Topic 7).
“23andMe vs WEGENE – uploaded 23andMe raw data to WEGENE and here are the differences” (Topic 8).
“What is a likely relationship if the shared DNA is 1610 centimorgans across 80 segments?” (Topic 9).
“Beta update v5.2 should now be available to all earlier chip (pre-V5) users, when opting into the Beta program” (Topic 10).
The prevalence of topics for each post type. The topics are arranged according to category. *
With respect to the
In addition, there were two notable findings with respect to the control variables. First, the log-transformed number of published days exhibited a negative association in the
Results of the regression analysis relating post type to comments and karma score. All associations were statistically significant (
Negative binomial regression
Dependent variable | Independent variable | Coefficient | Z | SD | P value
Number of comments | Posting image | .152 | 6.41 | 0.024 | <.001
Karma score | Posting image | .618 | 12.35 | 0.050 | <.001
Number of comments | Posting face image | .451 | 10.21 | 0.044 | <.001
Karma score | Posting face image | .760 | 9.64 | 0.079 | <.001
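One way to read the regression table above is to convert each coefficient into a multiplicative effect. The sketch below assumes that the first numeric column is the regression coefficient on the log scale (as in a negative binomial model with a log link) and that the SD column is its standard error; under these assumptions, exponentiating a coefficient gives the ratio of expected counts between posts with and without the feature, holding the controls fixed. The function name and the Wald-style 95% interval are additions for illustration, not results reported in the study.

```python
import math

# (coefficient, SD) pairs copied from the regression table above.
COEFFICIENTS = {
    ("Number of comments", "Posting image"): (0.152, 0.024),
    ("Karma score", "Posting image"): (0.618, 0.050),
    ("Number of comments", "Posting face image"): (0.451, 0.044),
    ("Karma score", "Posting face image"): (0.760, 0.079),
}


def rate_ratio(beta, sd, z_crit=1.96):
    """Return exp(beta) and an approximate Wald 95% interval on the ratio scale."""
    point = math.exp(beta)
    lower = math.exp(beta - z_crit * sd)
    upper = math.exp(beta + z_crit * sd)
    return point, lower, upper


if __name__ == "__main__":
    for (outcome, predictor), (beta, sd) in COEFFICIENTS.items():
        point, lower, upper = rate_ratio(beta, sd)
        print(f"{outcome} ~ {predictor}: ratio {point:.2f} ({lower:.2f} to {upper:.2f})")
```

These adjusted ratios need not coincide with comparisons of raw averages between post types, because the regression controls for post age, topic, and user activity.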
This investigation made several notable findings. First, consistent with previous studies on other social platforms [
Second, the 10 inferred topics from the titles of r/23andme posts appeared to fall into three categories. Posts in the first category, which covered 4 out of 10 topics, focused on discussing users’ ancestry composition. Notably, the topics in this category were associated with a higher rate of image and face image posting. It was further observed that users invoked their face images as proof (or counterexamples) of the genetic testing results. Posts about kinship and family member discovery exhibited a moderate rate of face image sharing. When inspecting posts in this category, posts such as “finally find my half-sister,” with a group photo of a reunion attached, were more prevalent than in other categories. Finally, posts asking general questions about genetic testing, which focused on comparisons between DTC-GT companies, the progress of testing result delivery, and upgrades to testing algorithms, exhibited the lowest rate of image sharing.
Third, counter to our expectation, we found that the number of days since a post was published was negatively associated with the attention that the post received. One possible explanation for this result is that Reddit archives posts older than 6 months and no longer allows commenting on them. Thus, the number of comments and votes was limited for earlier posts. We further noticed that the topic related to general questions was negatively correlated with attention to a post.
Natural language processing techniques have been applied to various health care applications [
This paper analyzes the association between face image sharing and attention paid to posts in an online setting; this setting may incentivize users to sacrifice their privacy in exchange for the benefit of a social response. This observation, however, does not imply that attention is undesirable in all cases, as several studies have shown that social engagement is beneficial to an individual’s physical and mental health. For instance, in a large online breast cancer forum, Yin et al [
This work has certain limitations, which we believe serve as opportunities for future research. First, the face recognition package had an estimated 2% false negative rate, which means that approximately 76 of the 3865 images labeled as faceless (2%) likely contained a face. These misclassified images might have influenced the accuracy of our findings, although not their overall direction. Second, most topics inferred from topic modeling were interpretable and intuitive, but topic T10 was difficult to interpret. As shown in
DTC-GT users are increasingly posting full-face images with their DTC-GT results on social platforms. In this study, we investigated the trend in this behavior in the r/23andme subreddit to obtain insight into potential underlying motivations. Our findings show that such behavior began in September 2019 and experienced rapid growth, with 849 face-revealing posts by early 2020. Furthermore, our study suggests that posts including a face received, on average, 60% (5/8) more comments and 2.4 times higher karma scores than other posts. Posts that included face images were primarily about sharing and discussing ancestry composition and sharing family reunion photos with relatives discovered via DTC-GT. These findings support our hypothesis that posting a personal image is associated with receiving more online attention, which is consistent with previous findings that people appear to be willing to give up their privacy (ie, their personal images) in exchange for a benefit (ie, attention from others). Based on this analysis, platform organizers and moderators might inform users about the risk of posting face images in a direct, explicit manner and make it clear that users’ privacy may be compromised if personal images are disclosed.
Supplementary materials.
DTC-GT: direct-to-consumer genetic testing
NLP: natural language processing
LDA: latent Dirichlet allocation
YL, ZY, ZW, and CY proposed the research idea, which was finalized by BAM. YL and CN collected the data. YL and ZY designed and conducted the experiments. BAM and EWC provided advice on the data analysis. YL drafted the manuscript. EWC, ZY, BAM, YV, MK, and WX edited the final manuscript. All authors reviewed the final manuscript. This research was sponsored in part by the National Institutes of Health (grant RM1-HG009034, grant R01-HG006844, and grant U2COD023196).
None declared.