
Could theorising become a thing of the past? Does the future herald a world with no need to ask ‘Why?’ Writing in the July 2008 edition of Wired magazine, Chris Anderson explores these questions and asks whether the emergence of the “Petabyte Age” spells the end of theory.
Anderson is Editor-in-Chief of Wired magazine; he holds a Bachelor of Science degree in physics from George Washington University and studied quantum mechanics and science journalism at the University of California, Berkeley. He writes that the arrival of the digital computer some sixty years ago provided the means by which information could be made readable. Forty years later, the arrival of the Internet made that information reachable. The development of search engines has since drawn this “net” of information into one global database. Consequently, as a child of the Petabyte Age, Google now treats this wealth of data as a laboratory within which to develop faster and more accurate data handling and analysis.
Proceeding to explain the context of the Petabyte Age, Anderson condenses his explanation to this – “more is different”. If one imagines a path beginning with the folder, progressing to the filing cabinet and then the library (representing the transition from kilo- to mega- to terabytes), we have now reached the end of the line, run out of organisational analogies and arrived at the epoch of the petabyte. Anderson argues that a new way of viewing data is now required, one which first treats data mathematically and only later establishes a context for it. Google’s success in the marketing and advertising world is the perfect example of this approach. Google does not know why one page is better than another; if the statistics of incoming links say that it is, then it is.
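As a toy illustration of that logic (this is not Google’s actual algorithm, and the link data below is invented purely for the example), one could rank pages on nothing but the statistics of incoming links, with no model of why any page is better:

```python
# Toy sketch: rank pages by counting incoming links only.
# The link graph below is hypothetical, invented for this illustration.
from collections import Counter

links = [
    ("page_a", "page_c"),  # page_a links to page_c
    ("page_b", "page_c"),
    ("page_c", "page_a"),
    ("page_d", "page_c"),
]

# Count how many links point at each page.
incoming = Counter(target for _source, target in links)

# Rank pages by incoming-link count, highest first.
ranking = sorted(incoming, key=incoming.get, reverse=True)
print(ranking)  # ['page_c', 'page_a']
```

The top page “wins” simply because the numbers say it does; nothing in the sketch explains what makes it better.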
Why?
Turning to the implications of this new age, Anderson conjectures that massive amounts of data combined with applied mathematics could make obsolete every other tool previously used to ascertain why people do what they do. The question of why has no place in this kind of world, simply because it is irrelevant. If you have data that tells you what people do, and the precision to measure it accurately, then with enough of it the numbers literally speak for themselves.
But more than just revolutionising our approach to advertising and marketing, Anderson suggests that this new approach has a much more fundamental target in mind – the field of science. From our earliest recollections of science lessons at school, most will remember that the scientific method is built upon the idea of testable hypotheses. The models we form in our minds paint a picture of what we expect to happen in a given situation between particular agents. The next stage in the process is for the model to be tested through experimentation, which then confirms or refutes the theory being proposed.
Correlation versus causation
The mere existence of a correlation between x and y does not provide sufficient grounds for drawing a conclusion; the correlation could be due to any number of unknown or coincidental factors. Some form of relationship between x and y needs to be identified, creating a model by which the data collected can be confidently and accurately linked together, in turn helping to build our understanding of a particular phenomenon. As Anderson notes, “data without a model is just noise”. However, the Petabyte Age seemingly makes a very ‘noisy’ entrance.
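A minimal sketch of the point, using entirely simulated data, shows how x and y can correlate strongly when both are driven by a hidden third factor, even though neither causes the other:

```python
# Simulated example: x and y both depend on a hidden confounder,
# so they correlate strongly despite having no causal link to each other.
import random

random.seed(0)
confounder = [random.gauss(0, 1) for _ in range(1000)]
x = [c + random.gauss(0, 0.3) for c in confounder]  # x driven only by the confounder
y = [c + random.gauss(0, 0.3) for c in confounder]  # y driven only by the confounder

def pearson(a, b):
    """Plain Pearson correlation coefficient, no external libraries."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((ai - mean_a) * (bi - mean_b) for ai, bi in zip(a, b))
    var_a = sum((ai - mean_a) ** 2 for ai in a)
    var_b = sum((bi - mean_b) ** 2 for bi in b)
    return cov / (var_a * var_b) ** 0.5

print(round(pearson(x, y), 2))  # roughly 0.9: strong correlation, no causation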
Presented with access to such incredibly large amounts of data, Anderson argues, the traditional scientific method becomes obsolete. Put simply, correlation is enough. Data analysis no longer requires ponderous hypothesising when the data can be thrown “into the biggest computing clusters the world has ever seen and let statistical algorithms find patterns where science cannot”, as Anderson puts it.
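A bare-bones sketch of this “find patterns without a hypothesis” idea (the simulated data, the choice of two clusters and the plain-Python k-means below are all assumptions made for the illustration) might look like this:

```python
# Toy k-means clustering: group simulated, unlabelled 2-D points
# with no prior hypothesis about what the groups mean.
import random

random.seed(1)
points = [(random.gauss(0, 1), random.gauss(0, 1)) for _ in range(100)] + \
         [(random.gauss(5, 1), random.gauss(5, 1)) for _ in range(100)]

def kmeans(data, k, iterations=20):
    # Start the centres at the first and last points so this toy run is deterministic.
    centres = [data[0], data[-1]]
    for _ in range(iterations):
        # Assign each point to its nearest centre.
        clusters = [[] for _ in range(k)]
        for x, y in data:
            idx = min(range(k),
                      key=lambda i: (x - centres[i][0]) ** 2 + (y - centres[i][1]) ** 2)
            clusters[idx].append((x, y))
        # Move each centre to the mean of its assigned points.
        centres = [
            (sum(p[0] for p in c) / len(c), sum(p[1] for p in c) / len(c)) if c else centres[i]
            for i, c in enumerate(clusters)
        ]
    return centres

print([tuple(round(v, 1) for v in c) for c in kmeans(points, k=2)])
# Two centres near (0, 0) and (5, 5): a pattern surfaced from raw numbers alone.
```

The algorithm surfaces two groups in the data without being told what, if anything, they mean.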
The challenge to biology
The writer comments that biology is beginning to follow the same path that physics has already taken. Referring to the familiar biological models of “dominant” and “recessive” genes following a strict Mendelian process, he observes that these models are in fact a great simplification of reality. Discoveries concerning gene-protein interactions and other aspects of epigenetics have posed a stark challenge to the view of DNA as destiny, as well as to other previously held genetic assumptions. For example, evidence has now been put forward that the environment can influence heritable traits, something that was once considered impossible from a purely genetic perspective.
Let us accept that there is something of substance in what Anderson is setting out. It is true that scientific modelling can at times be incorrect and can present an oversimplified understanding of reality. What is more, forming a model takes time and can delay scientific endeavour and development. Transitioning to a more active, “act now, think later” correlation-based approach could be an advantage. However, does this also herald an end to the noble endeavour of growing and enriching our knowledge base? In this age of massive data accumulation, are fewer discoveries something to be favoured? Or does the future present a whole new perspective on what it means to discover something?
Criminal profiling
Criminal profiling provides a helpful example for assessing some of the implications of Anderson’s predictions. Here, behavioural and investigative tools are employed to help investigators build up a profile of unknown offenders. With the emergence of the Petabyte Age, the potential to access and analyse ever larger data sets and compound the results looks set to significantly affect this kind of profile construction. Consequently this kind of profiling serves as another example of how the biological sciences could well abandon traditional scientific approaches in favour of the Petabyte method. But how accurate is the data being used to build up the profile?
Just because there is a burgeoning abundance of data and information at our disposal does not necessarily mean that all of it is useful and reliable. Rather disturbingly, on more than one occasion the media have reported cases where innocent people became the victims of community abuse and rejection after wholly incorrect data was accessed which supposedly confirmed their paedophilic tendencies. At the same time, stories abound of supposedly “known” paedophiles slipping through the net of various data checks and screenings, allowing them to go on and commit horrific crimes of abuse. The accuracy with which data can be sifted and screened may well improve in the future, but at present, with widespread concerns over security and the handling of personal information, there is also reason to question the validity and viability of the information being accessed.
Secondly, formulating and acting on a profile based on correlation rather than causation allows the deeper analysis of the data to be omitted from the process. Knowing that someone is likely to be violent, or is 75% more likely to commit the same offence again, is useful, but without knowing why, the data does not help to constructively deal with and resolve the issue at hand. The “traditional” approach of modelling helps to test thinking and understanding in the pursuit of developing our knowledge further. Rather like completing a large-scale jigsaw puzzle, we begin to form a deeper appreciation of the world in which we live and why things happen as they do. In the case of criminal profiling, developing a model from the data can help to assess how future intervention might curb further violent outbursts, or identify the triggers which cause them. In so doing, a far more explicit and pragmatic profile can be established.
As someone once said, “we value what we can measure because what we measure can be given a value”. In the Petabyte Age some would suggest that the numbers speak for themselves, and as individuals we may feel that the supply of data lends a certain reliability and definition to life’s problems. Nevertheless, what data should be discarded and what data should be retained still needs to be decided, and to do this some form of theory is surely necessary.
Digging the grave of traditional science?
Anderson may well call for science to ask what it can learn from Google and, in effect, dig the grave of traditional science, but what does the Petabyte Age really promise? Anderson probably does go too far in his predictions and in his criticism of how science has traditionally been carried out. Yet the effects of the petabyte, as seen through the example of Google, are not something to be disregarded entirely. There are some very interesting social and ethical challenges which will require careful and comprehensive discussion within the science, science communication and public policy communities.
Rather than being mere futuristic thinking gone mad, the petabyte and faster computers could well provide a complementary path to traditional scientific endeavour, establishing an interdependent relationship between the petabyte and knowledge accumulation rather than one eclipsing the other. The future will still require some form of explanation and speculation, so the death of theory is far from imminent.