This is an open-access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work, first published in JMIR Infodemiology, is properly cited. The complete bibliographic information, a link to the original publication on https://infodemiology.jmir.org/, as well as this copyright and license information must be included.
Internet search volume for medical information, as tracked by Google Trends, has been used to demonstrate unexpected seasonality in the symptom burden of a variety of medical conditions. However, when more technical medical language is used (eg, diagnoses), we believe that this technique is confounded by the cyclic, school year–driven internet search patterns of health care students.
This study aimed to (1) demonstrate that artificial “academic cycling” of Google Trends’ search volume is present in many health care terms, (2) demonstrate how signal processing techniques can be used to filter academic cycling out of Google Trends data, and (3) apply this filtering technique to some clinically relevant examples.
We obtained the Google Trends search volume data for a variety of academic terms demonstrating strong academic cycling and used a Fourier analysis technique to (1) identify the frequency domain fingerprint of this modulating pattern in one particularly strong example, and (2) filter that pattern out of the original data. After this illustrative example, we then applied the same filtering technique to internet searches for information on 3 medical conditions believed to have true seasonal modulation (myocardial infarction, hypertension, and depression), and all bacterial genus terms within a common medical microbiology textbook.
Academic cycling explains much of the seasonal variation in internet search volume for many technically oriented search terms, including the bacterial genus term [“Staphylococcus”], for which academic cycling explained 73.8% of the variability in search volume (using the squared Spearman rank correlation coefficient,
Although it is reasonable to search for seasonal modulation of medical conditions using Google Trends’ internet search volume and lay-appropriate search terms, the variation in more technical search terms may be driven by health care students whose search frequency varies with the academic school year. When this is the case, using Fourier analysis to filter out academic cycling is a potential means to establish whether additional seasonality is present.
Google Trends is an open access portal that allows researchers to explore how the public’s quest for information on specific topics varies with time. The data made available by Google Trends is the “volume” (number) of searches for a specific search term entered by the public into the Google search engine per unit time (eg, per week), provided as a percentage of the highest search volume for that term over the period of interest (eg, last 5 years). The data are anonymous and collated geographically, and, given the public use of Google to search for health information [
“Seasonality” in symptom burden refers to an annual periodicity, or modulation, in some measurable aspect of those symptoms. Much of this modulation may result from seasonal variation in environmental factors that convey the risk of disease. Respiratory viral illnesses are one of the best examples of this [
Google Trends has become a popular tool for investigation of disease seasonality. An early use in this area was rapid real-time surveillance of influenza-like illness [
In our use of Google Trends to explore disease seasonality, we have come across an important potential confounder, which has yet to be described. This confounder is the searches for health information carried out by students who are taking courses at the undergraduate level. Such searches can be expected to be low in volume during the summer and winter break (in most countries) and high in volume during the final examination season. We have repeatedly observed such a biphasic seasonal pattern, which we will refer to as “academic cycling,” in many academic-oriented search terms (ie, fairly technical terms that are less commonly used in lay conversation such as proper diagnoses). Such academic cycling spans all fields of study. Some examples from health care, mathematics, and physics are shown in
Google Trends search volume for terms with strong academic cycling in the 5 years prior to onset of the COVID-19 pandemic. Searches are limited to the United States, and each color represents a period of 1 year. A high-frequency filter has been applied to remove fluctuations with a period of less than 5 weeks (this smooths the curve and eliminates current event–driven search volume spikes that last less than 2.5 weeks).
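The high-frequency filter described in this caption can be sketched as follows. This is a minimal Python illustration rather than the authors' R script: the function names, the naive discrete Fourier transform (used so that the example needs only the standard library), and the synthetic 104-week series are all assumptions made for demonstration.

```python
import cmath
import math


def dft(x):
    """Naive discrete Fourier transform (O(n^2)); adequate for a few hundred weekly points."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * cmath.pi * f * t / n) for t in range(n))
            for f in range(n)]


def idft(coeffs):
    """Inverse of dft(); returns the real part of the reconstructed series."""
    n = len(coeffs)
    return [(sum(coeffs[f] * cmath.exp(2j * cmath.pi * f * t / n) for f in range(n)) / n).real
            for t in range(n)]


def low_pass(series, min_period_weeks=5.0):
    """Zero every frequency component whose period is shorter than min_period_weeks."""
    n = len(series)
    coeffs = dft(series)
    for f in range(1, n):
        cycles = min(f, n - f)      # bins f and n - f describe the same sinusoid
        period = n / cycles         # period of that sinusoid in weeks
        if period < min_period_weeks:
            coeffs[f] = 0
    return idft(coeffs)


# Synthetic series (assumed data): an annual cycle plus a fast 2-week oscillation.
weeks = range(104)
slow = [10 * math.sin(2 * math.pi * t / 52) for t in weeks]
fast = [3 * math.cos(math.pi * t) for t in weeks]   # alternates +3, -3 each week
noisy = [s + f for s, f in zip(slow, fast)]

smoothed = low_pass(noisy)   # the 2-week oscillation is removed, the annual cycle remains
```

In practice, a library FFT routine would replace the naive transform; the filtering logic (zeroing bins whose period falls below the cutoff, then inverting) would be unchanged.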
One of the pillars of signal processing is the recognition that time-series data can be represented as the sum of many different sinusoidal waves, each with its own frequency, amplitude, and phase. The fast Fourier transform (FFT) is an algorithm that computes this representation, expressing a given time series (such as our 5-year Google Trends search volume) in the “frequency domain” by showing which sinusoidal waves would need to be added together to produce the same curve [
We first demonstrated our filtering process in detail using the term [“thermodynamics”], which was chosen because its strong academic cycling allows each step to be visualized. Before FFT could be applied, the Google Trends data were preprocessed by subtracting the mean value, which shifted the time series down. The resulting transformed data had the same shape as the original time series but consisted of positive and negative numbers with a mean value of 0. Although not strictly necessary, we also chose to filter out high-frequency “noise” with FFT to make patterns more visible to the naked eye. These 2 preprocessing steps were applied both to the term of interest and to the control terms that represent the academic cycling we wished to remove. We then estimated how much of the academic cycling component was present in the term of interest using a least squares regression analysis, subtracted that component in the frequency domain, and recreated the time series with inverse FFT. Following this demonstration, we applied the same technique to a selection of clinically relevant examples.
Google Trends time series data are freely downloadable and presented as the relative search volume (RSV) for the specified search terms per unit time (month, week, day, or hour). An RSV of 0 indicates little to no search volume, and an RSV of 100 indicates the highest volume for that term in the period of interest. We used weekly data for the 5-year period from July 3, 2016, to June 30, 2021. We restricted our analysis to the United States since it was the country with the largest internet search volume and since a single geographic region was needed for most residents to have a shared experience of the changing of the seasons and school year. Our 5-year window was selected to capture 5 full academic years. Although Google Trends provides the option of having search terms represent “topics” (in which case Google Trends aggregates a variety of searches they feel capture the same topic area), this option is not available for all search terms. Hence, for consistency, unless otherwise indicated, we did not use the “topic” search feature. Our search term nomenclature is in accordance with previous literature [
Our frequency filtering program was built using R (version 4.0.2) within the RStudio integrated development environment (version 1.4.1106). The process for filtering out academic cycling, every time it was applied, used the following steps. We will illustrate each step using the example term [“thermodynamics”], which displays strong academic cycling. When we refer to the time domain, we mean how the data look as a time series (ie, the way Google Trends initially presents the data in their web browser). When we refer to the frequency domain, we mean the way the data are visualized using the FFT, which is as a series of spikes showing how much of each frequency is present in the data for all of the sinusoids that would need to be combined to create it (
(A) Time series representation of [“thermodynamics”] Google Trends data both before and after removal of academic cycling; color indicates a calendar year. (B) Frequency domain representations of the same time series. Each frequency domain spike is the amplitude of the sinusoids that would need to be combined to produce the time series shown.
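To make the idea of frequency domain spikes concrete, the following minimal Python sketch computes the amplitude of each sinusoid in a synthetic weekly series. The series (an annual plus a semiannual sinusoid, loosely mimicking a biphasic pattern), the function name, and the use of a naive discrete Fourier transform instead of a library FFT are assumptions for illustration only.

```python
import cmath
import math


def amplitude_spectrum(series):
    """Return {period_in_weeks: amplitude} for the sinusoids in a mean-centered series."""
    n = len(series)
    mean = sum(series) / n
    centered = [v - mean for v in series]
    spectrum = {}
    # Skip f = 0 (the mean); the Nyquist bin is omitted for brevity.
    for f in range(1, n // 2):
        coeff = sum(centered[t] * cmath.exp(-2j * cmath.pi * f * t / n) for t in range(n))
        spectrum[n / f] = 2 * abs(coeff) / n   # amplitude of the sinusoid with period n / f
    return spectrum


# Synthetic 5-year weekly series (assumed data): baseline of 50,
# a 52-week sinusoid of amplitude 10 and a 26-week sinusoid of amplitude 4.
n_weeks = 260
series = [50
          + 10 * math.sin(2 * math.pi * t / 52)
          + 4 * math.sin(2 * math.pi * t / 26 + 0.5)
          for t in range(n_weeks)]

spectrum = amplitude_spectrum(series)
# Periods of the two largest spikes, strongest first.
top_two = sorted(spectrum, key=spectrum.get, reverse=True)[:2]
```

Here the two largest spikes fall at periods of 52 and 26 weeks, with heights equal to the amplitudes used to build the series.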
We first shifted and scaled the data so that they fluctuated around a mean value of 0, using the following formula:
Transformed RSV = [RSV – mean(RSV)]/mean(RSV)
Once filtering was complete, we applied this transformation in reverse to return to the original scaling.
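As a minimal sketch of this transformation and its reversal (written in Python for illustration; the function names and the 5-point example series are hypothetical):

```python
def transform(rsv):
    """Shift and scale RSV values: (RSV - mean(RSV)) / mean(RSV)."""
    mean = sum(rsv) / len(rsv)
    if mean == 0:
        raise ValueError("mean RSV is 0; the scaling step is undefined")
    return [(v - mean) / mean for v in rsv], mean


def inverse_transform(transformed, mean):
    """Undo transform(): multiply by the original mean, then add it back."""
    return [v * mean + mean for v in transformed]


rsv = [40, 55, 100, 70, 35]              # hypothetical weekly RSV values (mean 60)
transformed, mean_rsv = transform(rsv)   # values now fluctuate around 0
restored = inverse_transform(transformed, mean_rsv)
```

The original mean has to be stored alongside the transformed series, because it is needed to return to the original scaling after filtering.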
Assuming that most high-frequency fluctuations in search volume (ie, sudden changes) are not biologically driven [
After high-frequency filtering, we applied FFT as natively encoded in R [
The academic cycling pattern that we want to filter out could look different for different disciplines because school years and examination schedules can differ. As such, we chose different control terms for our [“thermodynamics”] example than for our medically relevant examples (choosing [“binomial” + “integral” + “derivative”] as control terms for [“thermodynamics”] and [“gram stain” + “gram positive” + “gram negative”] as control terms for biomedical searches). In the Google Trends browser, a “+” sign means “or”; that is, [“cat” + “dog”] would count any Google search in which the word “cat” or “dog” was included in the search phrase entered by the user.
Similar to our search terms of interest (“thermodynamics” in this example), the search volume for the control terms (ie, [“binomial” + “integral” + “derivative”]) also underwent the first 3 aforementioned steps. The frequency domain pattern of spikes for the control term is the “fingerprint” we intend to filter out of the data for our terms of interest.
To best estimate how much of the academic fingerprint was present in a signal, we minimized a sum of squares using the optimize function in R. That is, we took the frequency domain representations of both the term of interest and the control term and scaled the control term components by an amount k, such that the sum of the squared differences in frequency components between the term of interest and the control was minimized (note that as shown in
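$$\text{SS2} = \sum \left[ \left( \text{Real}_{\text{Test}} - \text{Real}_{\text{Control}} \cdot k \right)^{2} + \left( \text{Imaginary}_{\text{Test}} - \text{Imaginary}_{\text{Control}} \cdot k \right)^{2} \right]$$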
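The scaling and subtraction steps can be sketched in Python as follows. This is an illustration under stated assumptions, not the authors' R script: it uses the closed-form least squares solution for k instead of a numerical optimizer, a naive discrete Fourier transform instead of a library FFT, and synthetic data in which the non-academic component is deliberately built out of phase with the control so that the fit is exact. All function names are hypothetical.

```python
import cmath
import math


def dft(x):
    """Naive discrete Fourier transform (O(n^2))."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * cmath.pi * f * t / n) for t in range(n))
            for f in range(n)]


def idft(coeffs):
    """Inverse of dft(); returns the real part of the reconstructed series."""
    n = len(coeffs)
    return [(sum(coeffs[f] * cmath.exp(2j * cmath.pi * f * t / n) for f in range(n)) / n).real
            for t in range(n)]


def sum_of_squares(test_coeffs, control_coeffs, k):
    """Sum of squared differences between test and k-scaled control coefficients."""
    return sum((t.real - c.real * k) ** 2 + (t.imag - c.imag * k) ** 2
               for t, c in zip(test_coeffs, control_coeffs))


def fit_scale(test_coeffs, control_coeffs):
    """Closed-form k that minimizes sum_of_squares (ordinary least squares, real k)."""
    num = sum(t.real * c.real + t.imag * c.imag
              for t, c in zip(test_coeffs, control_coeffs))
    den = sum(abs(c) ** 2 for c in control_coeffs)
    return num / den


def remove_control(test, control):
    """Subtract the k-scaled control from the test series in the frequency domain."""
    test_coeffs, control_coeffs = dft(test), dft(control)
    k = fit_scale(test_coeffs, control_coeffs)
    filtered = [t - k * c for t, c in zip(test_coeffs, control_coeffs)]
    return k, idft(filtered)


# Synthetic, mean-centered data (assumed): the control holds the academic pattern;
# the test term holds 0.6 x that pattern plus an out-of-phase annual component.
n = 104
w = 2 * math.pi / 52
control = [math.sin(w * t) + 0.5 * math.sin(2 * w * t) for t in range(n)]
residual = [2 * math.cos(w * t) for t in range(n)]
test = [0.6 * c + r for c, r in zip(control, residual)]

k, filtered = remove_control(test, control)   # k recovers 0.6; filtered matches residual
```

In this constructed case, the annual component that survives filtering is orthogonal to the control. A seasonal component in phase with the control would instead be partly absorbed into k, which is one reason the choice of control term matters.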
The resultant filtered Fourier coefficients were back-transformed to the time domain using the inverse FFT algorithm, which is part of the same native R function. This allowed us to visualize the time series without the academic cycling, which appears to be eliminated in the “thermodynamics” example (
The genus names of pathogenic organisms could be searched for by both patients and providers, who encounter the organism in the usual course of care, and by students learning about such organisms during their training. It is also possible that the abundance of these organisms, their vectors, or the environments in which they are most easily transmitted undergo seasonal modulation. As such, we identified and analyzed 58 pathogenic bacterial genus terms discussed in a common medical microbiology textbook [
We also applied our filtering technique to 3 conditions that appeared to have academic cycling and for which previous observational evidence suggests some seasonal modulation; these include depression, myocardial infarction, and hypertension [
Post filtering, for bacterial genus terms, we selected the 6 terms (top 10%) with the strongest annual cycling component (ie, genus names with the highest amplitude frequency domain peaks at 52 weeks) and displayed them graphically. To do this, since these terms generally had a low search volume, and hence a relatively high amount of noise (ie, more seemingly random fluctuations), we graphed the average monthly volume to help average out random fluctuations and make any annual patterns more visible. In order to demonstrate how much the academic searches were driving the search volume for bacterial genus terms, we also calculated the squared Spearman rank correlation coefficient between the time series for each bacterial term and the time series for our control term (ie, [“gram stain” + “gram positive” + “gram negative”]). The squared Spearman rank correlation coefficient was used to estimate the amount of variation in the test data set, which was explained by the variation in the control.
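The squared Spearman rank correlation coefficient used here can be computed as the squared Pearson correlation of the ranks of the two series. The following Python sketch (standard library only; the function names and the small example series are hypothetical) shows one way to do this, with tied values receiving the average of their ranks.

```python
import math


def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1          # mean of the 1-based ranks i+1 .. j+1
        for pos in range(i, j + 1):
            ranks[order[pos]] = shared
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(vx * vy)


def squared_spearman(x, y):
    """Squared Spearman rank correlation between two time series."""
    return pearson(average_ranks(x), average_ranks(y)) ** 2


control_rsv = [1, 2, 3, 4, 5]    # hypothetical control term values
genus_rsv = [2, 1, 4, 3, 5]      # hypothetical genus term values
r2 = squared_spearman(control_rsv, genus_rsv)   # rho = 0.8, so rho squared = 0.64
```

Because the coefficient is squared, the sign of the association is lost; a strongly inverse relationship gives the same value as a strongly direct one.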
Our filtering technique successfully removed academic cycling from a wide variety of Google Trends data in which it was evident. Although the terms [“depression”], [“hypertension”], and [“myocardial infarction”] all had annual cycling prefiltering, this was only evident in searches for [“depression”] once academic cycling was removed. Of 56 pathogenic bacterial genus names, largely because of low search volumes, only 5 displayed substantial annual cycling prefiltering ([“Clostridium”], [“Escherichia”], [“Mycobacterium”], [“Staphylococcus”], and [“Streptococcus”]), and none of these 5 genus names displayed seasonality after academic cycling was removed. After filtering all genus terms, the 10% with the strongest seasonality (ie, strongest 1-year periodicity in the frequency domain) were [“Aeromonas” + “Plesiomonas”], [“Moraxella”], [“Haemophilus”], [“Ehrlichia”], [“Legionella”], and [“Vibrio”], each of which had search volume peaks consistent with what the clinical literature would predict.
Owing to the relatively low search volume, few of our 56 bacterial genus terms displayed obvious academic cycling, with only 5 having a squared Spearman rank correlation coefficient of ≥0.5 with their corresponding control term. Academic cycling was clearly present, however, when the bacterial genus terms were averaged together and in the term [“Staphylococcus”] (
The top 10% of genus terms with the most annual cycling (ie, highest 52-week frequency domain peaks) after filtering out academic cycling are shown in
(A) High-frequency filtered Google Trends internet search volumes for [“Staphylococcus”], the aggregate mean of 56 pathogenic bacterial genus term data (excluding [“Staphylococcus”]), and the [“gram stain” + “gram positive” + “gram negative”] control term used to identify academic cycling in such terms; color indicates a calendar year. (B) The frequency domain representation of the same time series, showing the amplitude of each sinusoid that would need to be summed to obtain the original signal.
Google Trends internet relative search volume for various pathogenic bacteria, filtered to remove academic cycling, and averaged for each month over a 5-year span from July 3, 2016, to June 30, 2021. (A) [“Aeromonas” + “Plesiomonas”] (combined out of convenience owing to similar reservoirs, similar modes of infection, and historically common taxonomy). (B) [“Ehrlichia”]. (C) [“Haemophilus”]. (D) [“Legionella”]. (E) [“Moraxella”]. (F) [“Vibrio”]. The dotted line is the mean search volume across all 261 data points that are available for averaging. Numbers being averaged are the weekly search volume, obtained as a percentage value of the maximum weekly search volume for that term over the 5-year period.
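The monthly averaging described in this caption can be sketched in Python as follows. The assignment of each week to the calendar month of its start date is an assumption (the text does not state how weeks spanning 2 months were handled), and the function names and placeholder values are hypothetical.

```python
from collections import defaultdict
from datetime import date, timedelta


def weekly_dates(start, n_weeks):
    """Start dates of n_weeks consecutive weeks beginning at start."""
    return [start + timedelta(weeks=i) for i in range(n_weeks)]


def monthly_means(dates, values):
    """Average weekly values by calendar month (1-12), pooling all years.

    Assumption: a week belongs to the month in which its start date falls.
    """
    buckets = defaultdict(list)
    for d, v in zip(dates, values):
        buckets[d.month].append(v)
    return {month: sum(vs) / len(vs) for month, vs in sorted(buckets.items())}


# 261 weekly points beginning July 3, 2016, as in the 5-year window used here.
dates = weekly_dates(date(2016, 7, 3), 261)

# Placeholder values (not real search data): each week's value is its month number,
# so the mean for month m should come out as m.
values = [float(d.month) for d in dates]

means = monthly_means(dates, values)
overall_mean = sum(values) / len(values)   # analogous to the dotted line in the figure
```

Pooling the same calendar month across all 5 years averages out week-to-week noise, but it would equally reinforce any periodic component that repeats annually, whatever its origin.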
Google Trends internet relative search volume before and after filtering out academic cycling for the terms [“Clostridium”], [“Escherichia”], [“Mycobacterium”], [“Staphylococcus”], and [“Streptococcus”]. (A) These terms in the time domain. (B) The same terms in the frequency domain after applying the fast Fourier transform tool.
Academic cycling is evident in searches for information on all 3 of these common conditions (
Google Trends internet relative search volume before and after filtering out academic cycling for the terms [“depression”], [“hypertension”], and [“myocardial infarction”]. (A) These terms in the time domain. (B) The same terms in the frequency domain after applying the fast Fourier transform tool.
Biphasic academic cycling is commonly seen in Google Trends data when technical search terms are used. When this is the case, it can potentially be filtered out using FFT and an appropriate control. Although initially confounded by academic cycling, true seasonality in the public’s searches for information on depression seems to be present. It is less obvious that seasonality is present in searches for information on myocardial infarction and hypertension. Seasonality is also present in searches for information on a variety of pathogenic bacteria.
Biphasic academic cycling patterns are clearly present in some published Google Trends data, but to date, those patterns have been overlooked or given other interpretations. This includes an exploration of the influence of public health campaigns on searches for information on marijuana use, colorectal cancer, and HIV [
The months during which we observed higher interest in internet searches on specific bacterial pathogens are consistent with those reported in the microbiology literature. In Hungary, cases of
We chose [“depression”], [“hypertension”], and [“myocardial infarction”] as terms to explore because each has both academic cycling in Google Trends data and epidemiologic evidence of seasonal modulation. Depression and myocardial infarctions have been shown to be more common in winter [
Our filtering technique is limited by our ability to use an appropriate control. If the shape of the academic cycling in our control term does not match that of the term of interest, its removal would be imperfect or would introduce other seemingly seasonal components. We chose to use the control term [“gram stain” + “gram positive” + “gram negative”] for all our clinical examples because we believed microbiology-related searches would track with health care searching in general. While future researchers could choose to use this same control term to identify and filter out academic cycling, they may alternatively wish to build control terms that display strong academic cycling and are more specific to the relevant specialty area. We can also only remove academic cycling when it is obviously present. For lower search volume terms, where there is substantial higher-frequency “noise,” our filtering method essentially left the waveform intact. As such, our method of averaging together the search volume on a monthly basis to remove some of the noise and reinforce the seasonal component would have also reinforced any academic cycling component that was present.
Google Trends internet search volume is a useful tool for detecting disease seasonality when symptoms, or diagnoses, can be expressed in lay terms that have no alternate meaning. Care should be taken, however, to ensure that any emerging cyclic patterns do not have the biphasic pattern that is highly characteristic of searches driven by the academic school year. This is particularly relevant when researchers use more technical terms, such as proper diagnoses. When this is the case, consideration could be given to using the filtering technique we present here, the R script for which is available in
Supplementary figures.
Open-source R code for our Fourier analysis filtering methodology adapted for reader use.
FFT: fast Fourier transform
RSV: relative search volume
SS2: sum of squared differences
The authors would like to thank the Faculty of Medicine and Dentistry at the University of Alberta for providing TG with The David and Beatrice Reidford Research Scholarship to pursue this work.
None declared.