Data collection and modelling are increasingly important in social science and science-based policy, but threaten to crowd out other ways of thinking. Economists recognize that markets embody and shed light on human sentiments. However, their ethical consequences have been difficult to interpret, let alone manage. Although economic mechanisms are changed by data intensity, they can be redesigned to restore their benefits. We conclude with four cautions: if data are good, more may not be better; scientifically desirable data properties may not help policy; consent is a double-edged tool; and data exist only because someone thought to capture and codify them.
This article is part of the themed issue ‘The ethical impact of data science’.
The generation, collection, analysis and use of data have become pervasive. This raises a host of practical and theoretical issues, especially for economics. The natural inclination of economists is therefore to argue either that there is nothing essentially new, or that everything has changed. Economics has been regarded by some as more ‘scientific’ than other social sciences in part because of the vast amounts of (quantified) data it generates and because its most useful methodologies and striking findings are so closely tied to analysis of these data. At the same time, economists have tended to lead the way in recognizing the critical importance of endogeneity—at least in recognizing that economic data exist because someone, with suitable access and some kind of model in mind, thought them worthy of capture. In consequence, these data are not neutral observations of an objectively defined world, but by-products of its subjectively experienced function. From this, it can be argued that data science tells us as much about the data scientists as it does about the world from which the data were drawn. A ‘good’ model, therefore,1 is one that establishes whether the data came from a world in which the underlying theory (or structural model) was valid, not a set of equations that ‘do what the data (are observed to) do’. But economic models, whether mathematically expressed or not, are also bound up with human behaviour and typically incorporate a complex network of ethical considerations, e.g. that rational behaviour and informed consent enable the invisible hand of competitive markets to identify and realize outcomes that, if not fair in any specific sense, at least have the property that no other arrangement would make everyone better off—by their own account. 
Of course, this orthodoxy was developed for a very different economy, where informed consent was possible and where rationality and efficiency, if not demonstrable, did at least represent directions in which the economy and the society it supported were heading. But we inhabit a very different world in which data have tended to supplant the realities to which they ostensibly pertain.
Formally, this raises the questions of whether the use of data always makes individual behaviour more ‘rational’ in the neoclassical sense and whether this, in turn, will lead to efficiency. Indeed, we argue that it is not even evident that data or the information they represent have positive value to the individual or to society. If rationality is ‘bounded’ by the speed, size and complexity of choices we face, what forces will bring our limited attention, data reduction, rules of thumb and default conventions into the kind of balance needed to respond ‘optimally’ to changes in the world?
Beyond this formal question lie two sorts of ethical issue. The first is a version of the fallacy of reification; whether the abstract rational or collectively coherent behaviour of humans can reasonably be reduced to concrete—and especially quantified—data structures. The second is whether normative inferences based on analysis of data should be used to make decisions, set policies and otherwise shape or constrain human and societal behaviour. The salience and tractability of these questions are influenced in turn by: the sheer scale and variety of data available; the structure and meaning such data may (be imagined to) have; and the extent, speed and form of access to the data and the analytic techniques that are applied to them.
A simple example will illustrate these questions. We are accustomed, when faced with decisions whose consequences are uncertain, to equate rational behaviour with: anticipating the outcomes of different options; evaluating them according to our interests or preferences; and choosing a ‘best response’ accordingly. But modern data-intensive and interactive systems are inherently complex. In consequence, their behaviours may be emergent in the sense that outcomes cannot be anticipated, but only predicted with a precision that depends critically on the data available. This does not automatically mean that more data lead to better decisions; small—and possibly unknowable—details of the data used may make huge differences to the predicted outcomes, even before interactions among people are taken into account. This can give rise to a pernicious go/no go paradigm that frustrates rational choice. In such systems it is not reasonable to adopt the ‘oceanic’ perspective of perfectly competitive markets where each individual's actions are insignificant to the system as a whole and can freely be made without regard to systemic impacts. On the other hand, the ‘monopolistic’ perspective in which individuals fully comprehend and ‘internalize’ the broader consequences of their actions is frustrated by the challenge of emergence. Put simply, the individual knows that his or her actions may have profound and widespread consequences, but knows equally that these cannot themselves be known. Moving away from the self-interested frame of neoclassical economics explicitly to consider the interests of others and such abstractions as virtue, duty and belief only magnifies these problems.
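The 'anticipate, evaluate, choose' account of rational choice described above can be written down in a few lines. The actions, states and payoffs below are purely hypothetical; the point is only to show how a best response is computed when outcomes *can* be anticipated.

```python
# Minimal sketch of rational choice under uncertainty: anticipate the
# outcomes of each option, evaluate them against a payoff function and
# choose a best response.  All actions, states and numbers here are
# hypothetical.

def best_response(actions, prob, payoff):
    """Return the action with the highest expected payoff.

    prob   -- prob[state]: anticipated probability of each state
    payoff -- payoff[action][state]: the agent's evaluation of outcomes
    """
    return max(actions, key=lambda a: sum(prob[s] * payoff[a][s] for s in prob))

prob = {"good": 0.7, "bad": 0.3}
payoff = {
    "act":  {"good": 10, "bad": -20},   # risky option
    "wait": {"good": 2,  "bad": 2},     # safe option
}

choice = best_response(["act", "wait"], prob, payoff)
# Expected payoffs: act -> 0.7*10 + 0.3*(-20) = 1; wait -> 2.
# The best response here is therefore "wait".
```

The difficulty raised in the text is precisely that, in emergent data-intensive systems, neither `prob` nor `payoff` can be written down with any stability, so this calculation loses its footing.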
In this essay, we examine in turn a series of considerations associated with the data intensity of economic activity and the application of data science to economic phenomena.
2. Social aspects of data
Some of the most commonly raised considerations in relation to data in economic settings stem from the linkage between individuals and data pertaining to them. Connections between data privacy and privacy more generally, and between fundamental and economic rights to data, have been discussed extensively elsewhere. For present purposes, we note the following points.
Data privacy is no longer an effective substitute for personal privacy in online settings. Many if not most of the ‘personal’ data protected by regulation were never truly ‘personal’ in the first place, but were provided to individuals by government or private sector organizations, typically to aid the functioning of those organizations. Personal data therefore include data that are granted (provided by systems), observed, induced or elicited and uttered or used to signal to others. The very different structures of access, control and value attached to these make it unlikely that a single system of governance will work for them all. Private data are also nuanced, albeit in different ways. If the quality of being personal is a characteristic of the data themselves in relation to their subjects, the quality of privacy2 is concerned with access and control. In this sense, data may become more or less private if they are joint3 or contingent,4 if their observation depends on otherwise-unobservable characteristics, if maintaining their private status depends on burdensome or risky curation efforts by data subjects or controllers or if they are processed for new or multiple purposes.
At a deeper level, people whose online activities are profiled or ‘nudged’ into profitable alignment by structured flows of data may have lost a more essential form of privacy—a place free of unrecognized or unconsented influence from which decisions can be made that have normative significance. The data presented to people navigating a series of Web pages, for instance, may be simply an algorithmic consequence of their responses to a sequence of cues—these responses are not protected data, and are created by the interaction of the person and the system. Nonetheless, a well-structured interface may allow the Web designer to control the user's purchase or other behaviour, which might otherwise be deemed a private choice.5 Conflation of data privacy with privacy, and of the private with the personal, creates considerable difficulties for legal and commercial reform and should be clarified. This reflects two operational considerations. The first seems quite mundane: who owns the data? There is a common belief that data pertaining to a natural person should ‘belong’ to that person but, at least under European Union (EU) law, personal data cannot be owned (though they can be controlled). Obviously, the situation is even cloudier in respect of data that are personal to multiple individuals. The second concerns the distinction between data and information. Context or an interpretive frame is needed to convert mere data to potentially valuable or harmful information, so data subjects, controllers and ‘users’ may not understand them in the same way; additional clarity is needed regarding the ownership and control of these frames and the extent to which different parties should anticipate or negotiate the ethical consequences of applying them to specific data. Where the context involves others’ personal data, the information may not be personal in the same way as the underlying data.6
A fundamental right of data privacy—as implemented in the European context—gives to data subjects exclusive, non-transferrable and essentially inalienable control over ‘their’ data. This is reinforced by the rights of data subjects to know what data pertaining to them are held by data controllers, to insist that these data be correct and used for authorized purposes, to grant or withhold consent to further processing, new purposes and automated processing, and (under the General Data Protection Regulation7) to ask that data be deleted under certain circumstances. Such data may acquire economic value from ‘personalization’ of goods and services to match the data subject or ‘impersonally’ from application of data analytics to larger datasets containing the subjects' data in ways that do not permit or rely on identification; this value is generally not addressed by such rights. In other jurisdictions, data are regarded as subjects' beneficial property; they (or access to them) can be exchanged, traded and generally subjected to the same contractual conditions and productive processes as other information assets. Neither extreme is without ethical drawbacks, but both have developed work-arounds to cover significant exceptions. The fundamental data right can be compromised when necessary for health, safety, law enforcement, security and other specified concerns. The economic right has been supplemented by, for example, domain-specific provisions for mandatory erasure and for breach notification. However, their coexistence in a globalized data economy has led to issues that neither of the canonical forms seems able to handle.
A second aspect of the ‘new oil’ use of data arises from the application of contractual or informal privacy protections to data that are created jointly by individuals' interactions with each other or with automated systems; these data pertain or belong to relationships rather than individuals. In this sense, privacy ought perhaps to be replaced by a suitable concept of ‘privity’—a collection of differentiated (relational as well as property) rights and consents. This could extend both privacy rights and complementary access rights to provide structures for access and permitted operations (use, processing and modification) that are shared, time-bound, role-defined, state-contingent and purpose- and/or process-limited.
3. Data, communication and knowledge
The foregoing considerations apply mainly to the use of data in economic and other interactive contexts. A related issue concerns the use of data to: analyse human behaviour; draw inferences as to human preferences, capabilities and other attributes; and devise, implement and evaluate policies that affect behaviour. This essay has little to say on this topic, but we note that any such scientific use of data should take into account the underlying structures of access and knowledge, because the behaviour of an individual in response to information or data will often depend on what that individual knows (or believes) about the quality of the data and the extent to which others know them, know that they are known, ad infinitum. Consider a simple example of trade formation. One individual chooses whether or not to make an offer to another. In some states of nature, this will be beneficial to one or both parties; in others not. If the individual receives information suggesting that the trade will be beneficial to himself (and possibly to the other individual) he might make the offer. But this depends on what he thinks the other person knows. Suppose that the first individual believes that the counterparty knows only whether the trade is beneficial to himself; conditional on the counterparty accepting the offer, the first individual would not want the trade so the offer would not be made. If, on the other hand, the information observed by the first party is common knowledge between the two, an accepted offer can be interpreted as one that is beneficial to both sides, so the offer would be made (and, if suitable, accepted). Therefore, the analytic significance of data used to model or interpret behaviour and to predict the future may depend on the entire structure of knowledge and belief about the data themselves.
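The offer game just described can be made concrete with a toy calculation. The payoff numbers are hypothetical; what matters is that conditioning on acceptance flips the sign of the proposer's expectation.

```python
# Toy version of the trade-formation example: each state gives the
# (proposer, counterparty) payoffs from trade.  The three states are
# hypothetical and taken to be equally likely.
states = [(+2, -1), (-3, +1), (+2, +1)]

# The proposer's unconditional expectation is positive, so his private
# signal suggests the trade is beneficial on average.
unconditional = sum(u1 for u1, _ in states) / len(states)     # +1/3

# But the counterparty accepts only in states where he gains...
accepted = [(u1, u2) for u1, u2 in states if u2 > 0]

# ...so, conditional on acceptance, the proposer expects to lose, and
# no offer is made when the signal is private.
conditional = sum(u1 for u1, _ in accepted) / len(accepted)   # -0.5

# If instead the proposer's information is common knowledge, both can
# condition on the mutually beneficial states, and trade goes ahead.
mutually_beneficial = [s for s in states if s[0] > 0 and s[1] > 0]
```

The same data (the list of states) thus support opposite decisions depending on the surrounding structure of knowledge, which is the point of the example in the text.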
In cases where common knowledge and belief are critical, or where the accuracy or reliability of information is essential in order to obtain beneficial outcomes, there may be an ethical case for the creation and (even mandated) use of authoritative data.
In order to generate observable consequences, data are typically communicated between individuals. Much of our policy and many of the models used to analyse data and to develop improved methods of handling them are predicated on a ‘content transfer’ function for communication; data are transmitted in order to share them. Ultimately, this can have a broader purpose, such as: obtaining payment; coordinating joint action; shifting liability or responsibility; signalling intention; or eliciting further information (in exchange, or by observing the recipients' reactions to the new data). To determine how data should be managed and communicated, it is useful (if extraordinarily difficult) to consider the purposes for which communication occurs (e.g. as an informational transfer or an instrumental or expressive act) and the value to which communication and its consequences give rise.
But communication is also burdensome; many of the data used to analyse human behaviour and to formulate or implement policy are supplied in response to requests or obligations. This type of form-filling has several potential weaknesses. One is that the requestor and the supplier of such data may have very different understandings of the value and quality—or even the definition and meaning—of the data supplied. This can lead to inaccuracies in the data themselves or in the structures of interpretation applied to them. These errors may proliferate and diffuse as the data or their derivatives (information, ratings, models and decisions) propagate through economic, scientific and societal networks. An example is provided by routine government requests for information to be provided by businesses and citizens. If these requests are repeated, they may be mis-reported; in consequence, many versions of the same information may find their way into public databases. This might not be a problem in itself if the errors are random; most data are noisy or uncertain. But differences among such observations may well not be exogenous, fixed or products of conscious manipulation. Rather, their structure is derived from a host of societal and organizational factors, which are seldom if ever recorded with the data, let alone taken into account. A further problem arises when data provided in one context (purpose, organization or format) are re-used for another. In this case, data that were fit for the original purpose (and for which appropriate consents may have been obtained) may be misleading in the new context—again for reasons that are only jointly visible if all parties involved consider them together. In such cases, there are few systematic arrangements for ensuring configuration control—e.g. a single authoritative source accompanied by a clear and unambiguous data model or description.
Moreover, possibilities for such data sharing are often constrained by legal impediments or by misalignment of the incentives and costs involved. Finally, when such data change, there is often little incentive for the office holding the original data to inform those who used earlier versions, or for them to reconsider their decisions or actions.8
A second example is provided by a communication game on a network.9 Consider three individuals, each of whom has access to private data about the state of the world. The structure of these data is common knowledge; each knows what the other might have observed, but not the precise contents of the observation. Suppose first that the individuals are all together in a (small) room. They communicate iteratively, as follows: (i) each publicly reveals a message reflecting his initial data; (ii) each revises his information based on the others’ revelations and utters a second signal; (iii) knowing each other's possible initial observations and knowing with certainty the messages heard by each other participant on the first round, each can then further refine their information. At each iteration, their knowledge of the true state becomes more precise, until it converges to a common knowledge assessment, which reflects the true state, their initial data structures and the precision of the language they use to communicate. Now replace the ‘small room’ public messages with private ones along a network. Each party sends a message to each of the others and receives a message from them, but does not know the messages exchanged between the two others. At the end of the first round, everything is as it was before. But to interpret a message received from one of the others in any subsequent iteration a player must now form a conjecture as to what that player might have heard from the remaining participant. Thus, while information converges in the first case (where everything is common knowledge), in the networked case (with semi-private messages) information may diverge. From the data science perspective, models seeking to link behaviour observed by the scientist to ‘objective’ data observed by the subjects may face fatal difficulties unless more is known about the structures of knowledge and belief surrounding their communication.
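The 'small room' case can be sketched minimally by modelling each agent's private data as a partition of a finite state set; the states and partitions below are hypothetical.

```python
# Each agent's private information is a partition of the states; a
# public message reveals the cell containing the true state, and every
# listener intersects what is heard.  States and partitions are
# hypothetical.
states = {1, 2, 3, 4, 5, 6}
partitions = {
    "A": [{1, 2}, {3, 4}, {5, 6}],
    "B": [{1, 3}, {2, 4}, {5, 6}],
    "C": [{1, 2, 3, 4}, {5, 6}],
}

def cell(partition, state):
    """The block of `partition` containing `state`."""
    return next(c for c in partition if state in c)

true_state = 2

# Public round: all three cells are announced and pooled by everyone,
# so the shared assessment of the state refines.
pooled = set(states)
for p in partitions.values():
    pooled &= cell(p, true_state)

# Here pooling pins the state down exactly: pooled is {2}.
```

With pairwise private messages, by contrast, each recipient must conjecture what the sender heard from the third party; the pooled intersection above is unavailable, and assessments can diverge as described in the text.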
4. Economic and market forces
Most economic systems operate within legal structures based on the reliability of meaningful consent; people are assumed to have acted in their own interests unless it can be shown that they have been: subject to duress (including market power); affected by the actions of others over which they have no influence (externalities and public goods); or systematically and materially misinformed. Problems arising can therefore be ‘solved’ by limiting undue influence, internalizing externalities and improving information. However, it cannot always be safely assumed that individuals can meaningfully consent to the collection and use of data that affect them, that they can understand decisions taken on the basis of data analysis well enough to challenge them or even that they can understand whether any problems, inefficiencies or injustices stem from inaccuracy in the data or problems with the algorithms or models used to process or act on them.
Moreover, even where individuals can understand their data and its use, giving and maintaining consent may be extraordinarily burdensome, to the extent of wiping out the net benefits of protection. Without an assurance of net benefit, it cannot be assumed that even voluntary arrangements have desirable efficiency properties, or that a data controller who provides data access as a means of obtaining or inferring consent has usefully or effectively transferred responsibility back to the data subject. Therefore, consent does not moot all ethical problems; we should consider whether informed consent is practical and whether it might, in turn, create additional difficulties for data science and analytics. For instance, if the ability meaningfully to consent is correlated with a feature of scientific interest, selection bias in consent-based experimental or empirical data-collection protocols may lead to erroneous conclusions. If individuals respond to what they regard as burdensome or unwarranted data requests or activities by falsifying or randomizing data, the implications for scientific analysis of these data may be even worse.
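The selection-bias worry can be illustrated with a deterministic toy population; the trait values and the consent rule are hypothetical.

```python
# Sketch of consent-driven selection bias: if willingness to consent is
# correlated with the trait of scientific interest, the consenting
# sample misrepresents the population.  All numbers are hypothetical.
population = list(range(1, 11))                  # trait values 1..10

# Suppose only high-trait individuals are willing to consent.
consenting = [x for x in population if x >= 6]

true_mean = sum(population) / len(population)    # 5.5
sample_mean = sum(consenting) / len(consenting)  # 8.0

# The consent-based estimate overstates the population trait by 2.5.
```

Falsified or randomized responses would compound this: the observed values themselves, not just the sampling, would then be corrupted.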
To some extent, problems that arise when the volume and intensity of data flows or emotional loading distort individual decisions can be handled by automation, through ‘algorithmic agency’. This is not without its problems; in addition to the well-known difficulties involved in algorithmic regulation, consider a dilemma that might face a wearable electronic device that administers medication to a patient when the health data that it monitors indicate the need for a new dose. Suppose the patient, for temporary reasons or in response to unmetered side effects, instructs the machine not to give the dose. Should the device obey the ‘somatic’ or the ‘intentional’ version of the patient? If the device were a human being who had established a relationship with the patient, there might be a basis for resolving this dilemma. Even a machine might be able to model the patient's ‘true intent’ given enough historical data and suitable machine-learning techniques. The machine might even prepare for such an eventuality by presenting the patient with a series of experimental problems in order to gather the requisite information. But each such problem is an experience, likely to change the behaviour of the individual (however subtly); the machine might easily slide from finding out how best to serve the patient to programming the patient to be easier to serve. The carrier of this paradox is the narrow data channel through which the machine and the patient interact.
This example also sheds light on the nature of intelligence. Concepts like informed consent and rational choice depend not merely on process mechanics but on the ability and willingness of those seeking or giving consent to understand the implications and act accordingly. In this respect, it is increasingly difficult to differentiate between the decision, the ‘mind’ that decides and the flows of data and information that trigger the mind into action. Moreover, the need to make a decision may cause further communication and data transfer in order to obtain advice or opinions, coordinate action or manage and balance consequences. In this setting, it is not the intelligence and rationality of the individual that should concern us, but rather the collective intelligence or rationality of the complex of interacting individuals and the flows of data through this dynamic network.
In such situations, where strategic interaction among multiple parties with different information and objectives produces results that serve none of the participants, the economist reaches for the mechanism design toolkit. One might hope for a mature science of data and data-mediated interactions that used such tools to rewrite the rules of the data science game or the data-driven (evidence-based) policy game in order to minimize such problems or to employ subtler mechanisms based on a combination of slight nudges and the self-organization of complex socio-technical systems. But it is still too early for precise recommendations or predictions; collective or coercive ‘nudges’ have their own ethical problems and the tools needed to assess their impacts are still in their infancy, especially in cases of self-regulation.
5. Post-human responsibility: algorithms and algorithmic regulation
Increasingly, decisions that affect human destiny—or shape human behaviour—are made by algorithms, or complex networks involving algorithms, sometimes in interaction with human beings. As with any such socio-technical system, there is a need to understand what these bits of code are doing, and to control, complement, exploit or compensate for their activity. However, human regulation of algorithms is not always easy. For instance, knowing the source code of an algorithm that has been exposed to flows of data does not allow an auditor to evaluate its performance, and programming algorithms for operation in self-organized and dynamic contexts where they may be exposed to arbitrary flows of data and behaviour of other components is forbiddingly complex. Therefore, placing responsibility on programmers may not solve the problem of regulating algorithms.
A complementary problem concerns the regulation of human behaviour by algorithms. Again, the difficulty of auditing or designing algorithmic and automated decision-making systems gets in the way. But there is a further complication. Over time, experience accumulates and the technological and economic spheres evolve; algorithms ‘learn’ and adapt, human behaviour changes and the systems in which they are embedded rewire themselves. This evolutionary process may be affected by complexity considerations—in particular whether intelligent behaviour and learning in each part of the system will lead to intelligent behaviour and learning overall. Some simplified models (like the competitive economy) have this kind of ‘scale-free’ property; the interaction of perfectly competitive and rational individuals leads to an economic system that behaves like a large ‘compromise’ rational individual. But this breaks down in other contexts; the emergent consequence of micro-scale intelligence may be systemic stupidity or chaos. For instance, speed tends to trump model complexity and sophistication in computerized trading systems. Therefore, fast, simple moving average models hard-wired into silicon have a slight but significant evolutionary performance advantage over econometrically and theoretically superior models for price discovery and trade execution; ultimately, the former may take over the population (at least for significant volumes of trade). This will in turn affect the overall informational efficiency of markets. Again, we have a ‘false monotonicity’; a little improvement in speed and data processing capacity may be good, but it does not follow that more is better, or that ‘improvement’ can be reversed if it does not deliver the hoped-for benefits.
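The speed-versus-sophistication point can be caricatured in a few lines: a crude moving-average rule that updates at every tick, against a 'better' signal that only arrives after a computation lag. The price path, the averaging window and the lag are all hypothetical.

```python
# Caricature of fast-but-crude versus slow-but-sophisticated trading
# rules.  Prices, SMA window and lag are hypothetical.

def sma(prices, window):
    """Trailing simple moving average (shorter window at the start)."""
    return [
        sum(prices[max(0, t + 1 - window): t + 1]) / min(t + 1, window)
        for t in range(len(prices))
    ]

prices = [100, 101, 103, 102, 105, 107, 106, 108]

# Fast rule: long (+1) when price is above its 3-tick average, short
# (-1) otherwise, recomputed instantly at every tick.
fast = [1 if p > m else -1 for p, m in zip(prices, sma(prices, 3))]

# Slow rule: a 'perfect' one-step forecast (sign of the next price
# move) that only becomes usable `lag` ticks after it is made; until
# then it holds no position (0).
lag = 2
forecast = [1 if b > a else -1 for a, b in zip(prices, prices[1:])]
slow = [0] * lag + forecast[: len(prices) - lag]
```

Whatever the slow model's in-principle accuracy, its positions always trail the market by `lag` ticks; the text's point is that this latency penalty can dominate statistical superiority, so 'better' models need not win the evolutionary race.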
Moreover, the prevalence of ‘cybernetic’ (human and machine) systems raises fundamental questions about how (if at all) human governance mechanisms, especially those relying on intangibles like rationality, privacy and consent, can be extended.
It is not obvious that a network of interacting rational individuals will behave in a collectively rational way. As noted above, rational collective behaviour is more likely if all the individuals are insignificant or if a few key entities have the right combination of information, rationality and power. Beyond this, the system must rely on self-correcting mechanisms such as learning, experimentation, negotiation and bargaining. These, in turn, work best when the components of the system are within ‘touching distance’ of each other; operating at similar scales and speeds and aware of substantially overlapping information. Where these conditions break down—as for instance when the automated parts operate at vastly greater speed and/or scale than the human parts—rationality cannot be assumed and fixing responsibility on the human participants or the programmers of the automated parts may provide neither efficiency nor ethical clarity.
It is also not obvious that objects and code can be made fully to respect the privacy of human beings without granting a level of access (to the automated agent and potentially to others) that violates privacy. Even in the simplest case of a ‘thing’ or algorithm that serves the interest of a single individual, its potentially unobservable or ungovernable interactions with other parts of the system mean that it may not be sufficiently trustworthy to inherit the privacy interests of its ‘owner’.
Even consent is subject to these issues; it may be difficult for individuals meaningfully to consent to the capture and re-use of their data when the ultimate consequences may be beyond anticipation or comprehension, let alone to do so in high-speed, dynamic market environments. There is therefore a tendency to use default policies, in either one-off or automated decision-making. The degree to which this can provide meaningful consent is being actively studied.10
From the data science research perspective, it remains an open question whether—in the presence of such problems—the awareness and consent of data subjects to data collection and processing meets the requirements of responsible research and innovation. This applies in particular to studies of the behaviour of human beings in situations where awareness or consent falls well short of even the best feasible standard. In other words, making subjects aware and asking for consent may affect their behaviour in ways that undermine the external validity11 of the results; alternatively, some subjects may refuse to participate or withhold or distort their data in ways that influence the findings. On the other hand, problems associated with rational choice may become less important if data are automatically generated and/or collected. But then it would be critical to establish in advance how much human attention has been paid to the generation of the data.
The most realistic results would, of course, be obtained by mimicking the conditions of ‘real life’. But it is not wholly obvious that the advance of scientific knowledge provides adequate justification.12 If science is regarded as a justification, do studies that use personal data have to follow scientific method and/or prove to be useful? In particular, if endogeneity is an issue and—as a result—the data do not provide neutral and unstructured observations of ‘the world’ how are the resulting observation, interpretation and exploitation to be dealt with ethically, let alone analytically? How do the scientific and ethical frames interact?
Finally, data of different types or drawn from different domains of activity may expose variations in the nature and impact of explicit or implied consent. One would not expect consent to operate in the same way for data that are observed, elicited, voluntarily submitted, produced by models (including profiling and form-fitting), masked or obfuscated, encrypted, encoded or embedded in other objects. The human privacy and data protection concerns may be limited to the recoverability of the original data, the identification of the data subject or the legitimacy of re-use; but the scientific concerns may be far deeper. Yet both are ethically significant.
Most proposed codes fall well short of the nuances these considerations imply. However, some protections—in law or convention—address these concerns. Mechanisms that enable or even require data subjects to take control of their data include breach and access notification (which facilitate but do not compel further action by data subjects) and rights to demand erasure or correction (which do require subjects to become involved, but generally do not apply to cases where data are not used in ways likely to directly affect the vital interests of the individual subject). Other approaches entail passive arrangements offering specific blanket (and often unavoidable) data protections, including limitations on access, the time for which they can be stored, the uses to which they can be put and ways in which they can be combined.
The fundamental questions, therefore, are whether such complex systems of control can balance data subject interests with the needs of science, and whether it is possible and proportionate to customize or modify them in order to accommodate new forms and uses of data.
6. Endogeneity and other scientific reflections
To summarize some of the issues raised above, an economic perspective on the ethics of data science research should consider whether data intensity calls into question the ethical underpinnings of economic structures and economics as a science. In doing so, it must further consider whether tensions between the interests of data subjects and those of science can be reconciled and whether policy and decisions based on data science can preserve this compromise. The separation of scientific and personal interests rests in large part on the ‘separateness’ of data from both the personal and the scientific domains. In other words, data capture processes should not perturb the actions of individuals, and the data should represent unbiased observations of an independent ‘real’ world.
However, endogeneity may render this a vain hope. Profiles based on observations may be self-fulfilling, especially if data subjects are made aware of them. In other words, the collection of data shapes the world to which they pertain. If the enforcement of notification and consent is sufficiently ubiquitous as to remake the world in the image of the experiment, this may cause no problems. But any failure of real notification and collection policies to meet scientific standards exposes a critical weakness in evidence-based policy. This is analogous to criticisms that have been made of ‘gold standard’ randomized clinical trials (RCTs) in biomedical (and latterly policy) research. RCT populations are rarely as ‘messy’ (e.g. subject to multiple debilities) as real ones and, while participants may not know the treatments they provide or receive, it is common knowledge that they are participating in an RCT.
Along with this, the balance of empirical and experimental methods may create an extreme form of reductionism; there are already signs that, as far as commerce and the law are concerned, people are no more than the data recorded about them. Because data can be used to create or capture value in so many ways, only some of which involve the data subject directly, there is an emerging ‘datavore’ tendency to capture as much data as possible, using any available access or relationships, in the hopes that they will turn out to be valuable in future. This may be regarded as abusive or even illegal in some jurisdictions if the data go beyond the minimum necessary13 to fulfil the original purpose. In essence, this tendency involves a shift from data collection as a means to an end (e.g. delivering better services to data subjects) to an end in itself, with the usual troubling consequences. Moreover, it may shade over into more active forms of abuse if the relationship is used to ‘nudge’ or programme data subjects into revealing more information or taking actions that do not serve their interests.
From the analysts' perspective, endogeneity and the use of purely mathematical or statistical tools create another issue: imprecision of data. In addition to quantifiable risk and noise, data or the phenomena to which they ostensibly relate may be uncertain14 or even ambiguous.15 This applies to analysis of previously collected (e.g. empirical) data and also to the collection of new data, since the experiment itself may change the underlying reality. Along the way, certain assumptions underlying data-driven scientific methods are called into question. These include: the necessity of reproducibility; the use of statistical concepts of robustness; the benefits of more data over less; and the proper way to factor in distributional coverage, depth and ‘statistical representativeness’. Note also that different disciplines address these issues in different ways; in consequence, interdisciplinary data science may develop fragmented conventions.
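The doubt cast on ‘the benefits of more data over less’ can be made concrete with a toy simulation (the selection mechanism and all numbers below are invented for illustration, not drawn from this paper): when the collection process is itself biased—here, because consent is more likely among high-outcome individuals—an arbitrarily large collected sample converges on the wrong answer, while a small unbiased sample does not.

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical population outcome, e.g. some score of interest.
pop = rng.normal(50, 10, 1_000_000)

# Invented selection mechanism: the probability of consenting to data
# collection rises with the outcome itself (logistic in the score).
consent_prob = 1 / (1 + np.exp(-(pop - 50) / 5))
observed = pop[rng.random(pop.size) < consent_prob]

# A small but unbiased random sample for comparison.
small_unbiased = rng.choice(pop, 200, replace=False)

true_mean = pop.mean()             # population truth, close to 50
big_biased_mean = observed.mean()  # large n, but biased upward by selection
small_mean = small_unbiased.mean() # noisy, but centred on the truth
```

However large the consented sample grows, its mean stays biased; the 200-point unbiased sample, despite its sampling noise, lands closer to the truth.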
One final consideration arises from the use of data-driven methods in financial market settings. A variety of asset pricing models have been developed to help investors to predict asset prices or identify the ‘true’ value of an investment. Many started life as theoretical models, but most were implemented as (regression-based) empirical models. Perhaps the best known is the Capital Asset Pricing Model, which identifies a specific parameter (the asset's ‘beta’) measuring the contribution of a given asset to the performance of a reference portfolio of assets. This reflects not simply the variability of the asset's returns, but their correlation with the returns of the other assets in the portfolio. The original equations were found not to reproduce empirical behaviour very well, so the model was modified, first by adding additional ‘factors’ (e.g. the Fama–French three-factor model) and later by recasting investment decisions to recognize that holding an asset provides both a stream of returns and the opportunity to sell at a later date. A related empirical approach involves using a formula like the Gaussian copula to derive default probabilities of the assets underlying a derivative from the prices of credit default swaps.
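In its empirical, regression-based form, the beta parameter above is simply the slope of the asset's returns on the market's. A minimal sketch, using simulated returns with invented (uncalibrated) parameters:

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated daily returns for a market portfolio and one asset;
# the asset is generated with a true beta of 1.3 (illustrative only).
market = rng.normal(0.0004, 0.01, 1000)
asset = 0.0001 + 1.3 * market + rng.normal(0, 0.005, 1000)

# Empirical CAPM beta: the regression slope of asset returns on market
# returns, i.e. beta = Cov(r_asset, r_market) / Var(r_market).
beta = np.cov(asset, market)[0, 1] / np.var(market, ddof=1)
# The estimate recovers a value close to the 1.3 used to generate the data.
```

In practice the same slope would be estimated from observed excess returns rather than simulated ones; the point is only that beta is an empirical regression coefficient, not a directly observed quantity.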
These models all raise a theoretical possibility that should concern data science. Each model began life as a way of reducing the volume of data for analytic purposes, and performed acceptably. Eventually, these models began to be used to shape trades and to design assets. When the volume of trades based on the use of the model passed a (notional) threshold, the model moved from describing the world to shaping the world. In particular, the use of the model generated empirical data histories that continued to affirm the model's validity past the point where its underlying assumptions had ceased to hold—as a direct result of its use.
This raises the following question: how (if at all) can data science produce models that retain validity when they are used to shape the predominant behaviour on which the data are based? This is not simply a matter of self-consistency, because it asks the practical question of whether data science would converge to such models, given: (a) an arbitrary starting point, (b) ethical and practical limitations on the observability of the system, and (c) the potential for new ‘species’ of models (and the data they use) to penetrate the economic ecosystem.
This paper raises more questions than it answers. It sought to address the ethical practice of data science by considering the use of data science to shed light on ethical issues in the economic domain, especially in relation to such intangibles as rationality and consent and the benefits of derived properties like efficiency, privacy and competition. At the same time, it raised some questions regarding the implications of using economic evidence, questions and methods for the ethical content of data science applied to human behavioural data. In doing so, it questioned the extent to which the standard canons of economics and of empirical science can usefully be applied to social sciences (or at least economics) and the degree to which endogeneity may prevent a separation of the ‘scientific’ and ‘human’ domains.
In lieu of findings or definitive answers, this leads to four warnings from the coal face.
— Beware of false monotonicity—even if speed, data, consent, privacy, etc., are good, more may not be better.
— Recognize purpose limitation—controls on collection and use and desirable properties of data for one branch of science or one type of decision may not be equally desirable in all contexts.
— Acknowledge that consent is a double-edged tool—while it helps ensure a measure of accuracy and enhances responsibility by ensuring that data subjects understand what they are sharing and the purposes to which it is put, it may undermine the utility of the data and the science and policy that result.
— Pay particular attention to endogeneity—data records exist because someone thought to capture and codify them.
I declare I have no competing interests.
I received no funding for this study.
I am indebted to participants at Alan Turing Institute workshops at the University of Warwick, the University of Oxford and the LSE and to colleagues in the EU-funded EINS Network of Excellence on Internet Science (especially Chris Marsden, Tamas David-Barrett and Robin Dunbar) and to seminar participants at the Telecommunications Policy Research Conference series. All errors remain my own personal, if not private, property.
One contribution of 15 to a theme issue ‘The ethical impact of data science’.
↵1 This topic was extensively examined in the mid-1980s and subsequently (e.g. [1,2]) and revisited in the wake of Bayesian developments in mathematical statistics and data science/machine learning [4–6].
↵2 The EU General Data Protection Regulation (footnote 7) applies additional conditions to processing of sensitive data.
↵3 Pertaining to multiple people, for instance data that relate to me and to others or data used to characterize my connections, or groups or populations to which I may be deemed to belong.
↵4 Depending for their meaning on circumstances of generation, collection, storage or use.
↵5 This is directly related to voting theory results that suggest (roughly speaking) that, in the absence of an equilibrium outcome, control of the order in which alternatives are considered by a voting body allows the design of an agenda that will ensure the selection of any given choice starting from any other choice.
↵7 The General Data Protection Regulation (Regulation (EU) 2016/679) is intended to strengthen and unify personal data protection for individuals within the EU. It replaces the Data Protection Directive 95/46/EC; it was adopted on 27 April 2016, and will take full effect on 25 May 2018 after a 2-year transition period to allow adaptation of existing national laws and practices that are not (yet) in line with this Regulation.
↵8 These considerations are expressed in the so-called ‘Once-only Principle’ that forms part of the European Commission's current (2016) eGovernment Action Plan.
↵11 That is, generalizability of results to populations not subject to the experimental information and consent protocols.
↵12 Examples can be found in experiments on privacy preferences in which individuals are induced to reveal sensitive information by a variety of cues. Before engaging in the experiment they could not anticipate the information they would be asked to reveal or the consequences of recalling and revealing it, which include subjective cost as well as potential external privacy risk.
↵13 For instance, the GDPR (see footnote 7) stipulates, in Chapter II, Article 5, paragraph 1(c), that data controllers must ensure that ‘Personal data shall be … (c) adequate, relevant and limited to what is necessary in relation to the purposes for which they are processed (“data minimisation”)’.
↵14 In the sense of Knight : ‘Uncertainty must be taken in a sense radically distinct from the familiar notion of Risk, from which it has never been properly separated…. The essential fact is that “risk” means in some cases a quantity susceptible of measurement, while at other times it is something distinctly not of this character; and there are far-reaching and crucial differences in the bearings of the phenomena depending on which of the two is really present and operating…. It will appear that a measurable uncertainty, or “risk” proper, as we shall use the term, is so far different from an unmeasurable one that it is not in effect an uncertainty at all’.
↵15 People are often more willing to take risks when the odds can (plausibly) be known than when the results are ambiguous, even when the known probability offers a near-certain guarantee of failure and the ambiguous prospect could be a guarantee of success .
- Accepted September 13, 2016.
- © 2016 The Author(s)
Published by the Royal Society. All rights reserved.