The unadjusted correlation between search activity and the following measures was statistically significant: dengue outbreaks (r=0.632, p<0.001) and cases (r=0.673, p<0.001); PUO/fever outbreaks (r=0.315, p=0.026) and cases (r=0.323, p=0.022); chikungunya cases (r=0.374, p=0.007); and total outbreaks (r=0.581, p<0.001) and cases (r=0.615, p<0.001).
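As an illustration of how an unadjusted correlation of this kind can be computed, the following is a minimal sketch in plain Python. The weekly values are invented for the example and are not the study data. The function names `pearson_r` and `t_statistic` are hypothetical helpers, not part of the original analysis.

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    if n != len(y) or n < 3:
        raise ValueError("need two sequences of equal length >= 3")
    mx = sum(x) / n
    my = sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def t_statistic(r, n):
    """t statistic for H0: no correlation, with n - 2 degrees of freedom."""
    return r * math.sqrt((n - 2) / (1 - r * r))

# Hypothetical weekly values, invented for illustration only.
search_volume = [52, 55, 61, 70, 66, 58, 54, 63]
dengue_cases = [3, 4, 9, 15, 12, 6, 4, 10]

r = pearson_r(search_volume, dengue_cases)
t = t_statistic(r, len(search_volume))
print(f"r = {r:.3f}, t = {t:.2f}")
```

The p-value is obtained by comparing the t statistic with a Student t distribution on n - 2 degrees of freedom. The sketch leaves that step out because the standard library has no t distribution.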
Both models used the Google search query volume as the dependent variable. In Model 1, the independent variables were the outbreaks of dengue, PUO/fever, chikungunya, and typhoid. In Model 2, the independent variables were the case counts (incidence) of dengue, PUO/fever, chikungunya, and typhoid.
Tables 1 and 3 show the results of the regressions performed in the first step of the Engle-Granger method for Models 1 and 2, respectively. The residuals from each model were saved as a new variable in Gretl, and the results of the Augmented Dickey-Fuller tests run on those residuals are shown in Tables 2 and 4 for Models 1 and 2, respectively.
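The two steps of the Engle-Granger method can be sketched as follows. This is a simplified illustration, not the Gretl procedure used in the study. It assumes a single regressor rather than the four used in each model, and it uses a plain Dickey-Fuller regression without the lag terms of the augmented test. The data are simulated, and all function and variable names are hypothetical.

```python
import math
import random

def ols_simple(x, y):
    """Step 1: fit y = a + b*x by least squares; return (a, b, residuals)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    b = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / sxx
    a = my - b * mx
    resid = [yi - (a + b * xi) for xi, yi in zip(x, y)]
    return a, b, resid

def dickey_fuller_stat(u):
    """Step 2: t statistic of rho in du_t = rho * u_{t-1} + e_t (no constant, no lags)."""
    lagged = u[:-1]
    diff = [u[t] - u[t - 1] for t in range(1, len(u))]
    s_ll = sum(v * v for v in lagged)
    rho = sum(l * d for l, d in zip(lagged, diff)) / s_ll
    sse = sum((d - rho * l) ** 2 for l, d in zip(lagged, diff))
    dof = len(diff) - 1  # one estimated parameter
    se = math.sqrt(sse / dof / s_ll)
    return rho / se

def engle_granger(x, y):
    """Run both steps; return the step-1 slope and the residual test statistic."""
    _, b, resid = ols_simple(x, y)
    return b, dickey_fuller_stat(resid)

# Simulated series, for illustration only: x_sim is a random walk and
# y_sim follows it with stationary noise, so the residuals should be stationary.
rng = random.Random(42)
x_sim = [0.0]
for _ in range(119):
    x_sim.append(x_sim[-1] + rng.gauss(0, 1))
y_sim = [2.0 + 0.5 * v + rng.gauss(0, 0.5) for v in x_sim]

slope, tau = engle_granger(x_sim, y_sim)
print(f"slope = {slope:.3f}, tau = {tau:.2f}")
```

A strongly negative statistic argues against a unit root in the residuals. Because the residuals come from an estimated regression, the statistic has to be compared with cointegration critical values (for example, MacKinnon's tables) rather than standard t tables. The sketch does not compute them.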
Figure 1 shows the time series of Google search query volume (measured on the secondary axis on the right) and the number of cases of infectious diseases causing fever. Figure 2 shows the time series of Google search query volume (measured on the secondary axis on the right) and the total number of outbreaks every week. Figure 3 shows the time series of Google search query volume and the total numbers of outbreaks and cases per week.
The time series plots show a weak temporal association between Google search query volume and both the total number of outbreaks every week and the total number of cases of the different infectious diseases causing fever every week. Search activity rises slightly before each spike in disease incidence. However, because the baseline search activity is very high, the specific effect of incident disease is not clear unless the spikes in the disease metrics are large enough. For example, in Figure 3, a large spike in cases around weeks 34-36 is accompanied by a corresponding spike in search volume in the preceding weeks (around weeks 31-34).
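One way to quantify a lead of this kind, rather than reading it off the plots, is to correlate search volume with case counts shifted by a number of weeks. The sketch below is illustrative only and was not part of the original analysis. The case counts are invented, and the search series is deliberately constructed to lead them by two weeks. The names `lagged_correlation` and `best_lead` are hypothetical.

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def lagged_correlation(search, cases, lead):
    """Correlate search[t] with cases[t + lead]; lead >= 0 weeks."""
    if lead < 0:
        raise ValueError("lead must be non-negative")
    n = len(search) - lead
    return pearson_r(search[:n], cases[lead:])

# Invented weekly case counts with one large spike.
cases = [2, 3, 2, 4, 3, 5, 9, 20, 14, 6, 3, 2, 4, 3]
# Search series built to lead the cases by two weeks (baseline of 50),
# padded with two made-up trailing values to keep the lengths equal.
search = [50 + c for c in cases[2:]] + [52, 53]

by_lead = {k: lagged_correlation(search, cases, k) for k in range(5)}
best_lead = max(by_lead, key=by_lead.get)
print(best_lead, round(by_lead[best_lead], 3))
```

With real data the peak would be far less sharp, and a high baseline of search activity of the kind described above would flatten the correlations at every lead.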
When the time series are examined statistically, the residuals are found to be stationary, which means that the Google search query volume responds predictably to a change in the number of outbreaks or case counts. However, the relationship is not as strong as that previously demonstrated in studies from developed countries: the rejection of the unit root null hypothesis barely reaches statistical significance for both the outbreak-related model (p=0.048) and the case-count-related model (p=0.041).
These discrepancies can be explained by two main issues. First, the data obtained from the IDSP is probably not representative of the incidence of the diseases in the country; because of the method by which it is collected, this data is sourced primarily from the rural and peri-urban areas of India. Second, although mobile connectivity has become ubiquitous in India, Internet penetration has been poor, owing to lop-sided economic development. Moreover, use of the Internet to search for healthcare information is not a well-recognized pattern. This behavior is even more restricted in rural areas, where Internet utilization rates remain low owing to lower socioeconomic status, poorer awareness levels and lower literacy rates.
Considering the limitations of the available data, it would therefore be inappropriate to extrapolate the findings into concrete conclusions about the utility of the data for modeling, forecasting, or “now-casting” disease outbreaks. The trend is nevertheless encouraging: despite the myriad limitations, the data shows some consistency, which can be expected to improve once better health statistics are available and Internet use for healthcare information needs reaches peri-urban and rural India.