This article poses an interesting question: sometimes one has enough data to make accurate predictions without any understanding of what causes the phenomenon, i.e., without a model. Nowadays it is getting easier and easier to collect huge datasets, which are often sufficient for this.
For example, Google uses massive logs of misspellings to give 'on the fly' corrections. It also builds its translation engines from massive bilingual corpora: the French/English engine was fed Canadian documents, which are often released in both English and French versions. But there is no theory of language doing smart stuff in the background.
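The statistical approach to spelling correction can be sketched in a few lines. This is a minimal, hypothetical version of the idea (a toy word-frequency table standing in for Google's query logs, and only single-edit candidates), not Google's actual system: generate every string one edit away from the typo and pick the candidate seen most often in the corpus. No grammar, no phonology, just counts.

```python
from collections import Counter

# Toy corpus standing in for a massive query log (assumption for illustration).
corpus = "the cat sat on the mat the cat ate the rat".split()
counts = Counter(corpus)

def edits1(word):
    """All strings one edit (delete, transpose, replace, insert) away from word."""
    letters = "abcdefghijklmnopqrstuvwxyz"
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [l + r[1:] for l, r in splits if r]
    transposes = [l + r[1] + r[0] + r[2:] for l, r in splits if len(r) > 1]
    replaces = [l + c + r[1:] for l, r in splits if r for c in letters]
    inserts = [l + c + r for l, r in splits for c in letters]
    return set(deletes + transposes + replaces + inserts)

def correct(word):
    """Most frequent in-corpus word within one edit, else the word unchanged."""
    candidates = [w for w in edits1(word) | {word} if w in counts] or [word]
    return max(candidates, key=counts.get)

print(correct("teh"))  # → "the": purely statistical, no theory of language
```

The corrector never "knows" why "the" is right; the frequencies alone decide, which is exactly the point being made above.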
So are theories redundant, or obsolete, in a world where one can make accurate predictions without them?
Wired’s own Chris Anderson explores the idea:
Who knows why people do what they do? The point is they do it, and we can track and measure it with unprecedented fidelity. With enough data, the numbers speak for themselves.
Petabytes allow us to say: "Correlation is enough." We can stop looking for models. We can analyze the data without hypotheses about what it might show.
The point here is that statistics can find patterns in basically any domain, so maybe we don't need a specific science to take care of those problems.
There are issues with this line of thinking. Correlation doesn't imply causation, so by relying on correlation alone we would be blind to cause-and-effect relationships:
Google’s founding philosophy is that we don’t know why this page is better than that one: If the statistics of incoming links say it is, that’s good enough. No semantic or causal analysis is required.
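The classic illustration of why correlation alone can mislead: two quantities that rise together because of a hidden common cause will correlate almost perfectly with no causal link between them. A toy sketch with made-up numbers (the textbook ice-cream/drownings example, both driven by summer weather):

```python
# Made-up monthly figures; both rise with summer heat, neither causes the other.
ice_cream_sales = [10, 14, 19, 24, 30, 37]
drownings = [3, 4, 6, 7, 9, 11]

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5

print(round(pearson(ice_cream_sales, drownings), 3))  # → 0.998
```

A petabyte of this data would "speak for itself" and still say nothing about what causes what; only a model of the weather does that.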
Comments by Deepak:
We all know that more data means new approaches to science, especially since this has happened so quickly.
We’ve always worked with partial understanding, or in the case of medicine, less than partial understanding, but that’s precisely why medicine is beginning to fail. Not knowing mechanisms, etc., is what results in a Vioxx. Not knowing why is what creates the next disaster.
Trying to solve the exact same problems as Google, there is a camp that does think that knowing ‘why’ is important: the semantic web proponents. Under this paradigm, the web becomes a huge ontology, and machines operate on propositions (RDF triples) to deduce new knowledge. In this case, you do know how the machine reached a given conclusion. These systems face the same huge datasets (eventually they would try to operate on ‘the entire web’, though not yet, since only a small fraction of sites use RDF at all), but instead of using raw content prepared for human consumption, they use machine-ready content.
If, after plowing through petabytes of data, a semantic search engine reaches an interesting conclusion, at least it can show us the logical path it used. The promise for pharmaceutical companies is that they could find new drugs and interactions by just letting the algorithms traverse a corpus of, say, proteins. But, again, in this case there is no ‘human’ postulating a theory either.
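The deduce-and-explain idea can be sketched with a tiny triple store. Everything here is hypothetical: a made-up mini knowledge base of (subject, predicate, object) triples and one made-up inference rule, standing in for a real RDF/OWL reasoner. The key feature is that each derived fact carries the triples that justify it, i.e., the logical path.

```python
# Hypothetical mini knowledge base of (subject, predicate, object) triples.
triples = {
    ("aspirin", "inhibits", "COX-1"),
    ("COX-1", "produces", "thromboxane"),
}

# One toy rule: if X inhibits Y and Y produces Z, infer that X reduces Z.
def infer(kb):
    """Return derived triples, each paired with the triples that prove it."""
    derived = []
    for (s1, p1, o1) in kb:
        for (s2, p2, o2) in kb:
            if p1 == "inhibits" and p2 == "produces" and o1 == s2:
                conclusion = (s1, "reduces", o2)
                proof = [(s1, p1, o1), (s2, p2, o2)]  # the logical path
                derived.append((conclusion, proof))
    return derived

for conclusion, proof in infer(triples):
    print(conclusion, "because", proof)
```

Unlike the statistical engine, the answer comes with its own justification, which is exactly what the semantic web camp is promising.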
Probably, what all this means is that we scientists will need to adapt our methods to collaborate with these smart machines. Some things, like deep search, are better left to them; others, like tagging images, are really hard for machines but trivial for humans.