Tuesday, November 19, 2013

What is Big Data?

I started my first winter book review with the book titled "Doing Data Science". The content is very rich, and reviewing the whole book in one post would break my rule of thumb that a book review post should never be over 1000 words. This goes back to how I review books: each piece of writing and reflection needs to stay at a reasonable length.

I like how this book is written from three perspectives: implementing big data, teaching people about data science, and the hands-on work of getting it done. I really liked this book from the outset.

The preface and introductory chapter contained a good deal of valuable information that helped put the whole data science subject into context. Each of these two chapters helped in the following ways:
  • Preface
    1. Origins - this fell into two main themes: the origins of the course the book was based upon and the origins of the book itself. What became clear is that the book was written to bring clarity to the subject of data science and big data, and to separate what is hype from what is concrete and historical. The book sees much of what is currently occurring within big data as hype; data science has existed for a while (> 10 years), and the current practices within big data have their origins in traditional statistics applied to big data sets...
    2. Supplemental Reading - this is an amazing reading list and provides valuable insight into the breadth of the data science subject domain. The supplemental reading fell into the following six categories, which speak volumes about the domain of data science: 1. Math, 2. Coding, 3. Data Analysis and Statistical Inference, 4. Artificial Intelligence and Machine Learning, 5. Experimental Design, and 6. Visualization. What surprised me was how many of the books listed were part of my statistics, visualization and collective intelligence readings from a few years back.
  • Introduction
    1. The hype - the book acknowledges that the hype around big data and data science is extensive, and that all this hype has a few drawbacks:
      • Currently there is no common terminology around big data or data science.
      • It shows disrespect to those who have been working in this field for many years.
      • It creates a noise-to-signal ratio that could turn people away the longer it continues.
      • It simplifies the breadth of what is required to be successful in data science.
      • Working with large volumes of data is as much a craft as a science.
      There is also some truth, and there are lessons to be learned, within all the hype. The important highlights are:
      • smart people with some of the required skills should be able to develop the other skills they need to be data scientists.
      • it's the integration of both on-line and off-line real-time data that is different; we now have a culturally saturated feedback loop.
      • there are ethical and technical responsibilities to be considered.
    2. The role - the data scientist, or the team doing the work, requires the following skills. Having these skills spread across a team is better than relying on an individual, since it is difficult to find all of them in one person.
      • Computer Science
      • Math
      • Statistics
      • Machine Learning
      • Domain Expertise
      • Communication and Presentation skills
      • Data Visualization
    3. Team structure - this is well described both by the skills listed above and by figure 1-4 from the book. For me it's a well-balanced team, with a variety of people from different fields working together to solve big data problems.
    4. Thought experiments - the introduction (and the whole book) uses well-articulated thought experiments. These get the reader thinking about the current subject through questions and problems to solve.
My initial thoughts, after skimming the whole book and considering the details of the first few chapters, took me back to all my readings on very large databases (VLDB) a number of years back. The VLDB SIG has been around for close to 40 years and if you view some of the early conference agendas... big data hasn't changed that much in 40 years. Extracting, Cleansing, Transforming and Loading the data is as important as it ever was, and there are many well-known practices in this domain. The heavy lifting isn't in the implementation of the technology, but in the nature of the content, the goals of the analysis effort, and well-designed reporting and visualizations. The statistics, and related approaches, are as important as ever.

So... what is big data?
In my opinion, it is the coalescing of many large (even humongous) data sources. These sources can be real-time, on-line, off-line and otherwise, and the reporting and visualization should represent this dynamic nature of the data. Really smart analysis (statistical and other) should be available ASAP so intelligent and automated decisions can be made. Big data projects should have a balanced team where team members each possess a number of the skills required to make the team complete. The team should be given free rein to experiment and explore while adhering rigorously to project management practices (Agile, Lean or otherwise).