Tuesday, May 06, 2014

Big Data: Similarities and Differences

Compare and contrast: VLDB and Big Data.
I've been a data guy for over 25 years. My undergrad degree is in Technology with a specialty in Database Management Systems (DBMS). The focus of my whole career has been on the data... I believe that if the data is wrong (even just a small amount of it), most of the other related IT is pretty much useless, and any reporting or analytics should be taken with more than a little skepticism. In my opinion, it's all about the data and its correctness, or accuracy. This has been a cornerstone of my career; everything I do has elements of advocating for data quality.

I continue to be entrenched in data-related projects. My current project is focused on opening up Machine-to-Machine (M2M) data exchange using satellite networks and RESTful APIs. Very cool and very relevant to open data / big data. I continue to work with and read about data (big and small)... but I don't see that much has changed from the Very Large Database (VLDB) discussions of the past 40 years. Don't get me wrong; the amount of data has never been so big, the ability to process data has never been greater, the algorithms and models have matured, and the technologies to support big data have never been better. Still, when I compare big data to VLDB and past data analytics, I see more similarities than differences.

This blog post sets out to describe the technologies and processes that support big data, and how these are more similar than different to the data processing of the past and present. The post describes the accompanying image from top to bottom, with descriptions of how I see the real world as it relates to data and the purpose of each step (or grey box) in the big data realm. I believe that many of these boxes (or processes) have remained the same in relation to the processing of data, big and small.

Real world
Sources of data
Data comes from many sources! It is good to keep in mind how little data is actually collected when considering the entirety of data creation vs. data collection. As an example, data is created in massive quantities as every person moves through their day: heart rate, body temperature, calories burned, eye movements, blood sugar levels, foods eaten, decisions made, walking pace, etc., etc... And all these data attributes change throughout each person's day. All of this, and a massive number of other data attributes, is what makes up data creation in the real world. When you consider that all this data is created by every person, every second of every day, it becomes a massive amount of data creation. And this example only includes people as the sample; data creation is even greater when every object on the planet is considered a potential data creator.
The point I want to make is that in the real world there is a lot of data being created all the time, and only a small amount of it is actually being collected. What is being collected is already considerable and comes from a plethora of sources (and this is only the beginning of data collection in an internet-of-everything world). This is a high-level list of what I see as the current set of data creators / collectors:
  • log and event data - server logs, click-through events, page views, APIs called, etc...
  • transactional data - traditional data processing systems across all industries / organizations / institutions.
  • multimedia data - movies, images, photographs, music, etc.
  • geolocation - latitude, longitude and other relevant location / movement data
  • unstructured data - unorganized or having no data model or pre-defined structure.
  • device data - data made available through small or handheld devices
  • sensor data - data coming from sensors attached to objects (remote or otherwise) - in time, this is where the greatest amount of data will originate.
  • streamed data - audio, video, astronomical, etc.
  • human data - data about people, in its broadest sense.
The methods of data collection have remained similar over the last 40 years (well, much longer, but...). I see data collection as capturing the details of a real-world event and making it digital by recording the event with an electronic device. This capture occurs in many ways, as described in the previous list of creators / collectors. The important part of collection is consolidation, where the relevant data sources and attributes are identified and brought together (either physically or virtually) for processing. I do agree that the greatest change with current big data is the three V's: volume, variety, and velocity. But I believe the collection methods are more similar than different over the past 40 years; they are just happening at greater volume, with more variety of sources, and coming into systems at greater velocity.

Processing (ECTL)
Preparing the data
Processing is about getting the data from many different sources into a state and place where it can be analysed. I cut my data processing teeth with Extract, Transform and Load (ETL) work. I consider ETL the harvester of data. I see processing as the consolidation of many data sources: Extracting the data from the data collection systems, Transforming the data from these different sources so it will fit together, and then Loading it into a system (usually a data cube type technology) for analysis and reporting. As time progressed I began to include a data Cleanse in the traditional ETL process. Cleansing is about preparing the data for a greater transformation rate; not all data can be transformed into a common data store, and cleansing increases the success of the transformation. Once the processing is complete, the raw (or originating) data may be discarded, since results have been calculated or the metadata determined. This discarding does not occur in all processing situations, but keep in mind the end result of processing may be cleansed and transformed raw data, or some amount of calculated data or metadata. Again, I don't see that this has changed much over the last 40 years, except for the increase in volume, variety, and velocity. The methods and approaches remain the same...
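As a minimal sketch of the ECTL flow described above (the source records, field names, and temperature domain are all illustrative assumptions, not anything from a real system), the steps might look like:

```python
# Hypothetical ECTL (Extract, Cleanse, Transform, Load) sketch.
# Two "collection systems" record temperature with different schemas.
raw_sources = [
    [{"temp_f": "72.5", "ts": "2014-05-06T10:00"}, {"temp_f": "bad", "ts": "2014-05-06T10:05"}],
    [{"temperature_c": 21.0, "timestamp": "2014-05-06T10:00"}],
]

def extract(sources):
    # Extract: pull records out of each collection system.
    for source in sources:
        yield from source

def cleanse(records):
    # Cleanse: discard records that cannot be transformed,
    # increasing the success rate of the next step.
    for rec in records:
        value = rec.get("temp_f") or rec.get("temperature_c")
        try:
            float(value)
        except (TypeError, ValueError):
            continue  # un-transformable data is dropped
        yield rec

def transform(records):
    # Transform: normalize the disparate schemas to one common
    # shape (timestamp + Celsius) so the sources fit together.
    for rec in records:
        if "temp_f" in rec:
            yield {"ts": rec["ts"], "temp_c": (float(rec["temp_f"]) - 32) * 5 / 9}
        else:
            yield {"ts": rec["timestamp"], "temp_c": float(rec["temperature_c"])}

def load(records, store):
    # Load: persist into the analysis store (here just a list).
    store.extend(records)
    return store

store = load(transform(cleanse(extract(raw_sources))), [])
print(len(store))  # 2 - one record was discarded during cleansing
```

The pipeline shape (a chain of generators) is a design choice, not a requirement; real ECTL tooling does the same consolidation at far greater volume and velocity.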

Storage
Storage is where the data resides after it has been captured and processed. This may occur in real time, with the data residing in in-memory storage rather than being physically stored on a disk or other solid-state device. The data may also reside in an Operational Data Store (ODS), in a Data Warehouse (DWHS), in files, or as raw data. The storage may be long-term or short-term, but there does need to be a place for the data to be stored so it can be analysed and used for calculating results or developing insights. I do see storage as one of the areas creating differences in the way big or very large data is stored, processed, and analysed. In the past, data needed to be stored on a physical device, like disk. But now memory is large enough, at a reasonable cost, that storage approaches allow the database (storage) to be entirely in-memory. This can fundamentally change how large volumes of data are processed, and how database technology is implemented. The traditional row-locking database is no longer required, as the latency created by disk input / output (IO) no longer exists when the whole data store can be in-memory. The approach to database technology design can fundamentally change without needing to manage the issues of on-disk storage as part of the traditional database.
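The idea of a data store that lives entirely in memory can be illustrated with SQLite's `:memory:` mode (the table, columns, and readings below are made-up examples): the database never touches disk, so there is no disk-I/O latency to manage.

```python
import sqlite3

# An entirely in-memory data store: no file, no disk I/O.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE readings (device_id TEXT, value REAL)")
conn.executemany(
    "INSERT INTO readings VALUES (?, ?)",
    [("dev-1", 10.0), ("dev-1", 14.0), ("dev-2", 7.5)],
)

# Analysis runs against memory; the whole store disappears on close().
avg = conn.execute(
    "SELECT device_id, AVG(value) FROM readings"
    " GROUP BY device_id ORDER BY device_id"
).fetchall()
print(avg)  # [('dev-1', 12.0), ('dev-2', 7.5)]
conn.close()
```

SQLite here is just a stand-in for the concept; purpose-built in-memory databases make the deeper design changes (around locking and persistence) that the paragraph above describes.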

Exploratory Data Analysis
Once you have your data all in one place (or as you are bringing the data together in one place, Storage), you can start with Exploratory Data Analysis (EDA). I think of this as play: laying all the data out on the table and looking at it from different perspectives, flipping it around, stacking it in different ways, molding it into the shapes it can hold. It means using tools and techniques designed and developed to begin sense-making with the data collected. Remember, it's exploratory. And other than the wonderful new technologies (or tools) that have emerged recently, the approaches to EDA haven't changed that much over the last 40 years.

Data Cubes / Multi-dimensional Cubes
I like cubes, particularly multi-dimensional cubes. Always have, even as merged data sets. There is something that makes sense to me about loading related data sets into a technology that builds insight into the data relationships. Online Analytical Processing (OLAP) is a multidimensional analysis approach that has also been around for some 40 years.
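The core cube idea can be sketched in a few lines: roll a fact (here, made-up sales figures) up across every combination of its dimensions, so any slice (a region, a product, or the grand total) is a single lookup.

```python
from collections import defaultdict
from itertools import combinations

# Made-up fact table with two dimensions (region, product) and one measure.
facts = [
    {"region": "west", "product": "a", "sales": 10},
    {"region": "west", "product": "b", "sales": 5},
    {"region": "east", "product": "a", "sales": 7},
]

dimensions = ("region", "product")
cube = defaultdict(int)
for fact in facts:
    # Roll each fact up into every subset of the dimensions
    # ("*" means the dimension is aggregated away).
    for r in range(len(dimensions) + 1):
        for dims in combinations(dimensions, r):
            key = tuple(fact[d] if d in dims else "*" for d in dimensions)
            cube[key] += fact["sales"]

print(cube[("*", "*")])     # 22 - grand total
print(cube[("west", "*")])  # 15 - one region, all products
print(cube[("*", "a")])     # 17 - one product, all regions
```

OLAP engines do exactly this pre-aggregation (plus indexing and storage tricks) at scale, which is why slicing and dicing feels instantaneous to the analyst.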

I do see EDA and OLAP as similar, but I believe the tools and techniques available to (or developed for) EDA are broader and deeper than the tools and techniques associated with OLAP.

Bringing intelligence to the data

Business Intelligence / Data Analysis
The term business intelligence has always fascinated me for historical reasons; it is well documented that the term was first articulated over 150 years ago. The idea of bringing together disparate sources of "large" data for competitive business advantage isn't new. What is relatively new (the last 60 years or so) is the use of computers and digital data storage for the processing and analysis. I consider business intelligence and data analysis to have been born out of the data warehousing stream of big data, and a lot of the algorithms and statistical models find their data processing roots in traditional large data initiatives. I consider machine learning the newcomer in the big data realm, for it is only recently that the volume of data and the commodity-priced hardware, software capacity, etc. have created the thinking / need behind machine learning.

Machine Learning, Algorithms, Statistical Models
I see the troika of machine learning, algorithms, and statistical models as the intelligence side of "big data". Collectively, the three hyperlinks in the previous sentence give great descriptions of these three parts of deriving intelligence and knowledge from data. The big part of these three is that they automate the creation of the "intelligence"; they allow data to be consumed and "knowledge" created, so decisions can be made in real-time (by the computer, or network, depending how you look at it) to impact the way further information is presented.
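A minimal example of that automation, with made-up data points: an ordinary least-squares fit learns a model (slope and intercept) from the data itself, and the machine can then make predictions on new inputs without a person in the loop.

```python
# Made-up observations, roughly following y = 2x.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 4.0, 6.2, 7.9, 10.1]

n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(ys) / n

# Ordinary least squares: slope = covariance(x, y) / variance(x).
slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / \
        sum((x - mean_x) ** 2 for x in xs)
intercept = mean_y - slope * mean_x

def predict(x):
    # The automated "intelligence": a decision the machine can now
    # make about data it has never seen.
    return slope * x + intercept

print(round(slope, 2))          # 1.99
print(round(predict(6.0), 2))   # 12.03
```

The point is not this particular model (a real system would pick from many) but the shape of the workflow: data in, parameters learned, decisions out, all without hand-coded rules.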

It is important to note that machine learning, algorithms, and statistical models (as indicated by the data flow arrows) often get data directly from storage (real-time or otherwise) and skip the EDA step. This is because once the data and its sources are understood, no more exploration is required and the data can be accessed directly for intelligent processing.

Data Product
A data product is information (precise or otherwise) that has been derived from Business Intelligence, Data Analysis, Machine Learning, Algorithms, or Statistical Models. These products can be added to the attributes of another product, or be used to focus a decision or alter a user experience to better suit the specific viewer. In its simplest terms, a data product is what is harvested from your Google search terms to display focused and specific ads on your views of subsequent (and seemingly unrelated) web pages.
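A toy version of that search-terms-to-ads example (the categories, keywords, and matching rule are all illustrative assumptions, not how any real ad network works): derive a targeting signal from a user's search terms, and that derived signal is the data product attached to their later page views.

```python
# Hypothetical ad categories and their keyword sets.
AD_CATEGORIES = {
    "travel": {"flight", "hotel", "beach"},
    "fitness": {"running", "heart", "calories"},
}

def data_product(search_terms):
    # Score each category by keyword overlap with the search terms;
    # the winning category is the derived information that follows
    # the user onto subsequent (seemingly unrelated) pages.
    terms = set(search_terms)
    scores = {
        cat: len(keywords & terms)
        for cat, keywords in AD_CATEGORIES.items()
    }
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None

print(data_product(["cheap", "flight", "to", "beach"]))  # travel
print(data_product(["weather"]))                         # None
```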

Reporting, Dashboards, Visualizations, and Communications
One of the areas where I continue to be amazed is the growth and innovation in graphical approaches to displaying data. A look into some of the open source frameworks will amaze - the D3.js library is an excellent example. I believe there are four main aspects to rendering data; these are:
  • Reporting - I consider reporting to be text or graphical documents (printed or otherwise), spreadsheets, emails, etc. that report on information or knowledge derived from the processing of data. 
  • Dashboards - I consider dashboards the digital single-page rendering of current and important information for the day-to-day activities of a business, enterprise, organization, etc...
  • Visualizations - the graphical visualizations of processed data 
  • Communications - outbound communications derived from data processing can be in many different forms. Using rich media and emerging technologies to increase variety, frequency and management of communication channels can assist with big data efforts.
Decisions differ from data products in that they are cognitive / intelligent activities done by people. People use the reports, dashboards, visualizations, and communications, which provide the knowledge to make informed decisions. All of the steps in gathering, processing, and analyzing the data down the right side of the image lead to supporting the human activity of decision making. Even though the processes are similar between big data and traditional large data, the end product on the traditional side is most often input to people, where the big data side creates data products used by machines / computers.

Similarities and Differences
I dislike using percentages to indicate differences, but I would say there is more than an 80% similarity between the tools, techniques, and approaches of big data and those of traditional large data. Over the last 40 years big data has been present (under different names) and used in decision making, research, science, etc... The tools, techniques, and approaches follow a trajectory that has built upon what happened before; what we have today with big data is built upon many technologies and techniques developed over the past 40 or more years.