I have a quibble with the following paragraph from Introducing Windows Azure for IT Professionals:
The problem with big data is that it’s difficult to analyze it when the data is stored in many different ways. How do you analyze data that is distributed across relational database management systems (RDBMS), XML flat-file databases, text-based log files, and binary format storage systems?
If data are in disparate file formats, that’s a pain. And from an IT perspective that may be as far as the difficulty goes. But why would data be in multiple formats? Because it’s different kinds of data! That’s the bigger difficulty.
It’s conceivable, for example, that a scientific study would collect exactly the same kinds of data at two locations, under conditions as similar as possible, but one site put its data in a relational database and the other put its data in XML files. More likely the differences go deeper. Maybe you have lab results for patients stored in a relational database and their phone records stored in flat files. How do you meaningfully combine lab results and phone records in a single analysis? That’s a much harder problem than converting storage formats.
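To make the easy case concrete, here is a minimal sketch of the two-site scenario, using only Python's standard library. Everything in it is made up for illustration: the table name, the XML layout, the field names, and the values are assumptions, not anything from the book or from a real study. The point is only that when both sites record the same kind of data, reconciling the formats is mechanical.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Hypothetical site A: readings kept in a relational database.
# An in-memory SQLite database stands in for the real RDBMS.
def load_site_a():
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE readings (subject_id TEXT, value REAL)")
    conn.executemany(
        "INSERT INTO readings VALUES (?, ?)",
        [("a1", 4.2), ("a2", 5.1)],  # made-up sample rows
    )
    rows = conn.execute(
        "SELECT subject_id, value FROM readings ORDER BY subject_id"
    ).fetchall()
    conn.close()
    return [{"site": "A", "subject_id": s, "value": v} for s, v in rows]

# Hypothetical site B: the same kind of readings, kept as XML.
SITE_B_XML = """
<readings>
  <reading subject_id="b1" value="3.9"/>
  <reading subject_id="b2" value="4.7"/>
</readings>
""".strip()

def load_site_b(xml_text=SITE_B_XML):
    root = ET.fromstring(xml_text)
    return [
        {
            "site": "B",
            "subject_id": r.get("subject_id"),
            "value": float(r.get("value")),  # XML attributes arrive as strings
        }
        for r in root.iter("reading")
    ]

# Both sources now share one schema, so combining them is list concatenation.
combined = load_site_a() + load_site_b()
```

Once both loaders return records with the same fields, the storage format has dropped out of the picture. No loader of this kind tells you how a lab result and a phone record belong in the same analysis; that question is about the data, not the files.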