Everyone today is talking about big data. It is one of those buzzwords that everyone wishes (and many claim) to have used first. But what is it really? Remember that the term “Big Bang”, as applied to the formation of the universe, was coined by Fred Hoyle as a derogatory name for the theory it now denotes: Hoyle considered continuous creation a better model of the universe than a single creation event, which he objected to as too similar to religious interpretations. Nowadays the Big Bang is generally accepted as the prevailing model, though a growing number of scientists are looking towards alternative explanations.
So, is big data simply lots more data than ordinary data? Is it just a big database, or does it have a more profound meaning: is new science involved, for instance?
Advocates of the big data concept would certainly have us believe that there is new science. The accepted wisdom is that big data will allow us to do things that were previously impossible. Any organization that uses it will have a competitive advantage; it will be possible to track customer behaviour on a micro-scale; it will allow us to create new processes and products; it might even save the planet. What is undeniable is that huge amounts of data are being generated in science, in commerce and in social life, and knowing what to do with them is a huge problem. We think the data might be useful, and in a handful of cases it has proven to be so, but in the main we have very little clue what to do with it.
In fact there is a connection between the Big Bang and big data. Both are about information, or if you prefer, entropy. From the moment of the Big Bang the entropy (disorder, if you prefer) of the universe began to increase; it is still increasing 13.77 billion years later, and it will continue to increase until the universe ends with a bang or a whimper. Big data goes the other way. Initially it has a high degree of entropy, but as we sift through it and sort out what is meaningful and what is not, in other words as we differentiate between the signal and the noise, its entropy decreases. The more information that can be extracted from the data, the lower its entropy and the higher its value.
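This entropy framing can be made concrete with a toy Python sketch (the data and the notion of what counts as “signal” here are our illustrative assumptions, not anything from a real archive): Shannon entropy measures the disorder of a stream of symbols, and filtering the stream down to its meaningful symbols lowers that measure.

```python
import math
from collections import Counter

def shannon_entropy(values):
    """Shannon entropy (bits per symbol) of a sequence of observations."""
    counts = Counter(values)
    total = len(values)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

# A toy "archive": eight symbols appearing uniformly, i.e. maximum disorder.
noisy = list("ABCDEFGH" * 4)

# Suppose only "A" and "B" carry meaning; sifting signal from noise
# discards the rest and leaves a far more ordered stream.
signal = [s for s in noisy if s in ("A", "B")]

print(shannon_entropy(noisy))   # 3.0 bits per symbol
print(shannon_entropy(signal))  # 1.0 bit per symbol
```

The numbers behave exactly as the analogy suggests: the raw uniform stream sits at the maximum of 3 bits per symbol, while the sifted stream drops to 1 bit, reflecting its lower disorder.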
A big data archive takes up a large amount of storage space, but what matters is the information density. There is always the possibility that there is little or no extractable information, and until that information is extracted, any value it has is merely potential. The dilemma is how to extract real (rather than potential) value. The problem is that traditional data analysis techniques are far too slow for big data. Sampling only works if the information density is high, which for big data it typically isn’t. And the problems are exacerbated by its unstructured nature.
For further information on data management and how archiving can bring together disparate data in a more seamless and accessible file archive, please see www.mimecast.co.uk/Products/Mimecast-file-archive.