## Abstract

Search engine query data deliver insight into the behaviour of individuals, who act on the smallest possible scale of our economic life. Individuals submit several hundred million search engine queries around the world each day. We study weekly search volume data for various search terms from 2004 to 2010 that are offered by the search engine Google for scientific use, providing information about our economic life on an aggregated collective level. We ask whether there is a link between search volume data and financial market fluctuations on a weekly time scale. Both the collective ‘swarm intelligence’ of Internet users and the group of financial market participants can be regarded as complex systems of many interacting subunits that react quickly to external changes. We find clear evidence that weekly transaction volumes of S&P 500 companies are correlated with the weekly search volume of the corresponding company names. Furthermore, we apply a recently introduced method for quantifying complex correlations in time series, with which we find a clear tendency for search volume time series and transaction volume time series to show recurring patterns.

## 1. Introduction

Econophysics research—*econophysics* forms the interdisciplinary interface between the two disciplines economics^{1} and physics^{2}—has been addressing a key question of interest in the subfield of financial markets: quantifying and understanding large stock market fluctuations. Previous work was focused on the challenge of quantifying the behaviour of the probability distributions of large fluctuations of relevant variables such as returns, volumes and the number of transactions. Sampling the far tails of such distributions requires a large amount of data. However, there is a truly gargantuan amount of pre-existing precise financial market data already collected, many orders of magnitude more than for other complex systems. Accordingly, financial markets are becoming a paradigm of complex systems, and increasing numbers of scientists are analysing and modelling market data (Stanley *et al.* 1995; Vandewalle & Ausloos 1997; Cont & Bouchaud 2000; Krawiecki *et al.* 2002; Plerou *et al.* 2002*b*; Gabaix *et al.* 2003; Lillo *et al.* 2003; Kiyono *et al.* 2006; Preis *et al.* 2006, 2007; Watanabe *et al.* 2007; Podobnik *et al.* 2009). Empirical analyses have been focused on quantifying and testing the robustness of power-law distributions that characterize large movements in stock market activity. The use of estimators that are designed for serially and cross-sectionally independent data supports the hypothesis that the power-law exponents that characterize fluctuations in stock price, trading volume and the number of trades (Fama 1963; Lux & Marchesi 1999; Plerou *et al.* 2002*a*) are seemingly ‘universal’ in the sense that they do not change their values significantly for different markets, different time periods or different market conditions.

One reason why the economy is of interest to statistical physicists is that, like an Ising model (a model of ferromagnetism), it is a system made up of many subunits. The subunits in an Ising model are the interacting spins, and the subunits in the economy are market participants: buyers and sellers. During any time interval, these subunits of the economy may be either positive or negative as regards perceived market opportunities. People interact with each other, and this fact often produces what economists call the herd effect. Whether they buy or sell is influenced not only by their neighbours but also by news, which plays the role of an external field. If we hear bad news, we may be tempted to sell. So the state of any subunit is a function of the states of all the other subunits and of a field parameter (Preis & Stanley 2010).

One very illustrative example of the herd effect is shown in figure 1. The search engine Google offers the possibility to extract information about how popular specific search terms are.^{3} Thus, one can compare the interest in keywords related to the financial crisis, such as ‘Subprime’, ‘Lehman Brothers’ and ‘Financial Crisis’, with the fluctuations of the S&P 500 index, which has the rank of an international benchmark index. It is readily apparent that peaks in the search volume for the term Subprime coincide with dips in the S&P 500 index time series. At the climax of the crisis, the collapse of Lehman Brothers caused a sell-out of stocks, and the public was talking about the Financial Crisis afterwards. Figure 1 documents this sequence of events and shows that people acted with steadily increasing intensity. The search volume profiles track the levels of escalation, which can be seen as a prominent example of the herd effect.

This kind of data provides insights into our economic life on different scales. A steadily increasing number of Internet users visit the websites of search engines every day. Each query can be seen as an individual vote: using search engines, we leave information about our interests codified as search terms. Thus, search engines can collect our interests on the smallest possible scale, the scale of individual requests. On larger time scales, our interests form trends. Aggregated search volume data can be used for uncovering such trends that affect our economic life on large scales. As seen before, the international financial crisis is one prominent example. However, product trends can be extracted as well; one example is the cell phone market. Search volume data provided by Google can also be used to predict the spreading of seasonal influenza (Ginsberg *et al.* 2009). In addition, correlations were found linking the current level of economic activity in given industries to search volume data of industry-based query terms (Choi & Varian 2009).

The ‘experimental basis’ of the interdisciplinary science *econophysics* is given by time series that can be used in their raw form or from which one can derive observables. Such historical price curves can be understood as a macroscopic variable for underlying microscopic processes. The price fluctuations are produced by the superposition of individual actions of market participants, thereby generating cumulative supply and demand for a traded asset—e.g. a stock. The analogue in statistical physics is the emergence of macroscopic properties, which is caused by microscopic interactions among involved subunits.

In this paper, we ask whether there is a link between search volume data and financial market fluctuations. For this task, we study cross correlations between the ‘collective intelligence’ of Internet users and the change of financial market quantities, namely weekly stock prices and weekly stock volume. In addition, we apply a method to find complex correlations in search volume data, which was recently introduced by Preis *et al.* (2008). Uncovering mechanisms and dependencies that help to explain the formation of financial crises is of crucial importance, as an effective crisis observatory could contribute to protecting the stability of financial systems.

This article is structured as follows. Section 2 describes the datasets that we analyse. In §3, we present correlation analyses between financial market fluctuations and search volume data. In §4, we analyse complex correlations in financial data and search volume data. Finally, §5 summarizes our results.

## 2. Data analysed

We use weekly closing prices of *N*=500 US stocks, which were constituents of the S&P 500 index on 31 May 2010. These weekly datasets also contain aggregated transaction volumes covering the time period from the calendar week of 4 January 2004 until the calendar week of 30 May 2010. Thus, each time series has a length of *T*=335 weeks, so that *T*×*N*=167 500 weekly closing prices and weekly transaction volumes are available for analysis. A detailed list of the S&P 500 index components can be found in the electronic supplementary material. This list contains the exchange trading symbols and the company names.

In order to investigate whether Internet search volume is correlated with financial market fluctuations, we use search volume data provided by the search engine Google, which are available for the same period of time. This service, called Google Trends, analyses a portion of Google Web searches to compute how many searches have been done for specific terms, relative to the total number of searches done on Google over time. Here, we use all 500 company names of the S&P 500 components. As exact company names (e.g. *Microsoft Corporation*) may result in a weaker search volume quality in comparison to common abbreviations (e.g. *Microsoft*), we optimize the list of company names in order to improve the data quality and availability. The company names that are used for our search volume data requests can be found in the electronic supplementary material.

## 3. Linear autocorrelations and linear cross correlations

The *Pearson* product-moment correlation coefficient is a measure of the correlation between two variables *X*(*t*) and *Y*(*t*), giving a value between +1 and −1 inclusive (Pearson 1895). This correlation coefficient is widely used as a measure of the strength of linear dependence between two variables. In our case, *X*_{n}(*t*) and *Y*_{n}(*t*) are time series of stock *n* with length *T*−1: the change of closing price, *p*(*t*+1)−*p*(*t*), the change of volume, *v*(*t*+1)−*v*(*t*), or the change of search volume, *s*(*t*+1)−*s*(*t*). As we would like to determine the correlation coefficient as a function of a time lag parameter **Δ***t*, we use *t*∈{1,2,…,*T*−1−**Δ***t*}. Thus, the correlation coefficient for stock *n* (*n*∈{1,2,…,*N*}) is given by

$$R_n^{X,Y}(\Delta t) = \frac{\langle X_n(t)\,Y_n(t+\Delta t)\rangle - \langle X_n(t)\rangle\,\langle Y_n(t+\Delta t)\rangle}{\sqrt{\langle X_n^2(t)\rangle - \langle X_n(t)\rangle^2}\;\sqrt{\langle Y_n^2(t+\Delta t)\rangle - \langle Y_n(t+\Delta t)\rangle^2}} \tag{3.1}$$
with 〈…〉 denoting the expectation value. Only non-vanishing changes of the time series *X*_{n}(*t*) and *Y*_{n}(*t*) are considered because, for example, search volume data are not available for a few search terms at all observation times. Thus, let *T*′_{n}(**Δ***t*) be the number of non-vanishing time series changes of stock *n* as a function of **Δ***t*. The aggregated correlation coefficient of the set of stocks is calculated from the per-stock coefficients $R_n^{X,Y}(\Delta t)$ by

$$R^{X,Y}(\Delta t) = \frac{\sum_{n=1}^{N} T'_n(\Delta t)\,R_n^{X,Y}(\Delta t)}{\sum_{n=1}^{N} T'_n(\Delta t)} \tag{3.2}$$
For the analysis of cross correlations and autocorrelations (*Y*_{n}(*t*)=*X*_{n}(*t*)), we assume that the underlying variables *X*_{n}(*t*) and *Y*_{n}(*t*) have a bivariate normal distribution. Thus, we can use the Fisher transformation (Fisher 1915) for the determination of time lag-dependent confidence intervals. The Fisher transformation of a correlation coefficient $R$ is given by

$$F(R) = \frac{1}{2}\ln\frac{1+R}{1-R} = \operatorname{artanh}(R) \tag{3.3}$$
For the *z*-score,
3.4
we obtain the confidence intervals from cumulative distribution function values for the standard normal distribution. An inverted Fisher transformation provides confidence intervals on a correlation scale.
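The correlation analysis described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names are hypothetical, a pair is skipped when either change is zero (one reading of "non-vanishing changes"), the confidence interval uses the textbook Fisher standard error 1/√(n−3), and the aggregate is assumed to be a mean weighted by the number of pairs per stock.

```python
import math
from statistics import NormalDist


def changes(series):
    """First differences x(t+1) - x(t) of a weekly series."""
    return [b - a for a, b in zip(series, series[1:])]


def lagged_correlation(x, y, lag):
    """Pearson correlation between x(t) and y(t + lag), for lag >= 0.

    Assumption: a pair is dropped if either change is zero.
    Returns (R, number of pairs used).
    """
    pairs = [(a, b) for a, b in zip(x, y[lag:]) if a != 0 and b != 0]
    n = len(pairs)
    if n < 4:
        raise ValueError("need at least 4 pairs")
    mean_x = sum(a for a, _ in pairs) / n
    mean_y = sum(b for _, b in pairs) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in pairs) / n
    var_x = sum((a - mean_x) ** 2 for a, _ in pairs) / n
    var_y = sum((b - mean_y) ** 2 for _, b in pairs) / n
    return cov / math.sqrt(var_x * var_y), n


def fisher_interval(r, n, level=0.95):
    """Confidence interval for r on the correlation scale.

    Assumption: textbook standard error 1 / sqrt(n - 3) on the Fisher scale.
    Requires |r| < 1, since artanh diverges at +1 and -1.
    """
    z_crit = NormalDist().inv_cdf(0.5 + level / 2)
    centre = math.atanh(r)                      # Fisher transformation
    half_width = z_crit / math.sqrt(n - 3)
    # The inverse Fisher transformation (tanh) maps back to correlations.
    return math.tanh(centre - half_width), math.tanh(centre + half_width)


def aggregate(results):
    """Weighted mean of per-stock (R, n) results (weighting is an assumption)."""
    total = sum(n for _, n in results)
    return sum(r * n for r, n in results) / total


# Made-up weekly data for one hypothetical stock.
volume = [10, 12, 11, 15, 14, 18, 17, 21, 19, 24]
search = [50, 53, 52, 55, 56, 60, 58, 63, 62, 66]
r, n = lagged_correlation(changes(volume), changes(search), lag=0)
low, high = fisher_interval(r, n)
```

An autocorrelation is obtained by passing the same series twice, for example `lagged_correlation(dv, dv, lag=1)` for a list of volume changes `dv`.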

First, we study autocorrelations. In figure 2*a*, the autocorrelation coefficients of weekly closing price changes are shown as a function of **Δ***t*. Almost all values are practically negligible and are located close to the 95% confidence interval. Only the negative autocorrelation coefficient at time lag **Δ***t*=1 week seems to be relevant. It recalls the fact that high-frequency financial market transaction prices exhibit a strong negative autocorrelation at the smallest possible time lag (larger than the trivial case of **Δ***t*=0) on time scales of individual transactions; Preis *et al.* (2008) report a value of roughly −0.30 for the German DAX Futures contract. In contrast, the autocorrelation functions of volume changes (figure 2*b*) and search volume changes (figure 2*c*) show significantly negative values for small time lags (**Δ***t*<4 weeks).

Figure 3 illustrates cross correlations between weekly closing price changes and search volume changes and between weekly transaction volume changes and search volume changes for one constituent of the S&P 500 index, Apple Incorporated. There are no significant correlations between price changes and search volume changes (figure 3*a*). All values are within the 95% confidence interval. However, increasing/decreasing transaction volumes of this stock coincide with increasing/decreasing search volumes, as one can see at time lag **Δ***t*=0 weeks in figure 3*b*. Thus, one can conclude that search volume reflects the present attractiveness of trading a stock. But it seems that neither buying transactions nor selling transactions are preferred. This example shows that the commonly accepted reasons for financial market movements, ‘news moves the market’ and ‘volume moves the market’, are clearly linked together because news should be the most probable reason for searching company names in Internet search engines. The same effect can be found for the aggregated correlation coefficients of all S&P 500 constituents (figure 4), even if the correlation coefficient at time lag **Δ***t*=0 weeks (figure 4*b*) is smaller than for the single stock, Apple Incorporated. In figure 4*a*, a few correlation coefficients (**Δ***t*≈4 weeks) are not in the 95% confidence interval, indicating a non-random correlation between weekly closing price changes and search volume changes. In fact, present price movements seem to influence the search volume in the following weeks. However, the correlation coefficients are very small. Thus, confirming analyses with more records are necessary. Unfortunately, Google Trends offers search volume data only on a weekly basis.

## 4. Pattern conformity

These results raise hopes that complex correlations exist on weekly time scales in the data. A sophisticated observable to quantify them was introduced in a recent work (Preis *et al.* 2008).^{4} This work focused on finding complex correlations in high-frequency financial market datasets. In such a context, the existence of complex correlations implies that market participants (human traders and, most notably, automated trading algorithms) react to a given time series pattern just as they reacted to comparable patterns in the past (figure 5). However, this concept is transferable to medium and large time scales. To quantify additional correlations, we will define a pattern conformity (PC) observable.

The aim is to compare the current reference pattern of time interval length **Δ***t*^{−} with all previous patterns in the time series *p*(*t*). The current observation time shall be denoted by , then the reference interval is given by . The forward evolution after this current reference interval—the distance to is expressed by **Δ***t*^{+}—is compared with the prediction derived from historical patterns. As the standard deviation is not constant in time, all comparison patterns have to be normalized with respect to the current reference pattern. Thus, we use the true range—the difference between high and low. Let be the maximum value of a pattern of length **Δ***t*^{−} at time and analogously be the minimum value. Note that *p*(*t*), and depend also on *n*, the specific stock. However, we waive the corresponding superscript to improve the readability. We construct a modified time series, which is true range adapted in the appropriate time interval, through
4.1
with , as illustrated in figure 6. At this point, the fit quality between the current reference sequence and a comparison sequence for has to be determined by a least mean square fit through
4.2
with as a result of the true range adaption. With these elements, one can define an observable for the PC, which is not yet normalized by
4.3
as motivated in figure 6. Furthermore, we use the definition
4.4
The parameter *χ* weights terms according to their qualities (Preis *et al.* 2008). The larger *χ* is, the stricter the pattern weighting in order to use only sequences with good agreement to the reference pattern. The expression in equation (4.3), which takes into account the value of reference and comparison pattern after for a proposed **Δ***t*^{+} relative to , is given by the following expression:
4.5
We normalize the observable for PC and obtain for stock *n*
4.6
where denotes the PC of a stock *n*. In order to obtain an aggregated quantity of all S&P 500 stocks, we define
4.7

The PC for a standard random walk time series, which exhibits no correlations by construction, is 0 for all pairs of **Δ***t*^{+} and **Δ***t*^{−}. The PC for a perfectly correlated time series—a straight line—is 1. With this method, it is possible to search for complex correlations in various time series.
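To make the construction more concrete, the following Python sketch computes a PC-like quantity for a single series. It is an illustration under several assumptions rather than the exact observable of equations (4.1)–(4.7): the reference window is taken to end at, and include, the current time; each comparison pattern is normalized by its own true range; the fit quality is the mean squared difference of the normalized patterns; each term is weighted by exp(−*χ*·quality); and the names `normalise`, `pattern_conformity` and `max_lookback` are hypothetical.

```python
import math


def normalise(window):
    """Rescale a window to [0, 1] by its true range (high minus low)."""
    low, high = min(window), max(window)
    if high == low:
        return None  # flat window: the true range vanishes
    return [(value - low) / (high - low) for value in window]


def pattern_conformity(p, dt_minus, dt_plus, chi=100.0, max_lookback=None):
    """Illustrative pattern conformity of the series p.

    dt_minus: length of the reference pattern (at least 2 points).
    dt_plus:  horizon of the forward evolution.
    chi:      larger values favour only closely matching patterns.
    """
    weighted_sum = 0.0
    weight_total = 0.0
    for t in range(dt_minus - 1, len(p) - dt_plus):
        reference = normalise(p[t - dt_minus + 1 : t + 1])
        if reference is None:
            continue
        forward_ref = p[t + dt_plus] - p[t]
        furthest = t - dt_minus + 1  # largest shift that stays inside the data
        if max_lookback is not None:
            furthest = min(furthest, max_lookback)
        for tau in range(1, furthest + 1):
            s = t - tau
            comparison = normalise(p[s - dt_minus + 1 : s + 1])
            if comparison is None:
                continue
            quality = sum(
                (a - b) ** 2 for a, b in zip(reference, comparison)
            ) / dt_minus
            forward_cmp = p[s + dt_plus] - p[s]
            # The normalization divides by a positive range, so the raw
            # differences carry the same signs as the normalized ones.
            omega = forward_ref * forward_cmp
            if omega == 0:
                continue
            weight = math.exp(-chi * quality)
            weighted_sum += math.copysign(weight, omega)
            weight_total += weight
    return weighted_sum / weight_total if weight_total > 0 else 0.0


# A straight line repeats the same pattern and the same forward move.
line_pc = pattern_conformity(list(range(50)), dt_minus=5, dt_plus=3)
```

For the straight line, every pattern matches perfectly and every forward move has the same sign, so the sketch returns 1, in line with the limiting case stated above. The double loop over observation times and historical shifts also makes clear why the approach is computationally expensive for long series.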

Figure 7 shows the PCs of weekly closing prices (figure 7*a*), weekly search volumes (figure 7*d*) and weekly transaction volumes (figure 7*g*). The tendency to reproduce historic price patterns is very small for weekly closing prices (figure 7*a*). It is difficult to distinguish the given PC from a completely random behaviour (Preis *et al.* 2008). So far, the comparison between reference and historic patterns was only based on the price time series, . Now, we also incorporate the time series of transaction volumes *v*(*t*), i.e. , to improve the pattern selection. In the same way, it is possible to include the search volume time series *s*(*t*) for the pattern selection, i.e. . If we include transaction volumes for the selection process (figure 7*b*), then we obtain a noisier PC profile. A still noisier profile can be achieved by using search volume time series as an additional pattern selection criterion (figure 7*c*). Clear recurring tendencies can be found for the search volume time series. Figure 7*d* shows significant non-zero values for the PC. In contrast to results obtained for high-frequency transactions, parameter pairs with large time lags **Δ***t*^{+} and **Δ***t*^{−} provide the highest level of PC of roughly 0.42—due to the given amount of data points we limit the analyses to the range from one week to three months. The additional incorporation of weekly transaction volumes (figure 7*e*) increases the maximum value of the PC in the range that we analyse. The maximum value is roughly 0.66. This fact supports our finding that there is a clear link between weekly transaction volumes and weekly search volumes. More important, there is not only a linear dependence as found in §3 but also complex dependencies uncovered by the PC approach. Thus, it is evidence that search volume time series and transaction volume time series show recurring patterns. On the contrary, the inclusion of weekly closing prices does not alter the PC significantly (figure 7*f*). 
Analogously, transaction volume time series are characterized by large PC values (figure 7*g*) that are slightly smaller than in figure 7*d*. If one also incorporates closing price time series (figure 7*h*) or search volume time series (figure 7*i*) for the pattern selection, then a slightly increased PC can be observed.

## 5. Conclusion

Search engine query data offer insights into our economic life on the smallest possible scale of individual actions. In order to investigate whether Internet search volume is correlated with financial market fluctuations—the largest possible scale of our economic life—we used search volume data provided by the search engine Google. We studied weekly search volume data for various search terms from 2004 to 2010. We asked the question whether there is a link between search volume data and financial market fluctuations on the same, weekly time scale and found clear evidence that weekly transaction volumes of S&P 500 companies are correlated with weekly search volume of the corresponding company names. Increasing transaction volumes of stocks coincide with an increasing search volume and vice versa. Thus, one can conclude that search volume reflects the present attractiveness of trading a stock. But it seems that neither buying transactions nor selling transactions are preferred when one detects an increased search volume. Thus, the commonly accepted reasons for financial market movements—news and volume—are clearly linked together because news should be the most likely reason for searching company names in Internet search engines. In addition, we have seen that present price movements seem to influence the search volume of the corresponding company name in the following weeks.

Furthermore, we applied a recently introduced method for quantifying complex correlations in time series, with which we find a clear tendency for search volume time series and transaction volume time series to show recurring patterns. This fact supports our finding that there is a clear link between weekly transaction volumes and weekly search volumes. More importantly, there is not only a linear dependence but also complex dependencies, which raises hopes that search volume data can contribute to understanding financial crises.

## Acknowledgements

The authors are very grateful for helpful discussions with D. Helbing, P. Virnau and K. Yamasaki. In addition, T.P. would like to thank D. Diefenbach for insightful comments.

## Footnotes

1 Ancient Greek: οἰκονομία (management).

2 Ancient Greek: φυσική τέχνη (art of handling nature).

3 More details can be found at http://www.google.com/trends.

4 This approach consumes a huge amount of computing time. However, an accelerated calculation is possible on graphics card architectures (Preis *et al.* 2009*a*,*b*), which can also be used in computational physics (Block *et al.* 2010).

One contribution of 13 to a Theme Issue ‘Complex dynamics of life at different scales: from genomic to global environmental issues’.

- This journal is © 2010 The Royal Society