Compare and contrast: VLDB and Big Data.
I continue to be entrenched in data-related projects. My current project is focused on opening up Machine-to-Machine (M2M) data exchange using satellite networks and RESTful APIs. Very cool and very relevant to open data / big data. I continue to work with and read about data (big and small)... but I don't see that much has changed from the Very Large Database (VLDB) discussions of the past 40 years. Don't get me wrong: the amount of data has never been so big, the ability to process data has never been greater, the algorithms and models have matured, and the technologies to support big data have never been better. Still, when I talk about big data alongside VLDB and past data analytics, I see more similarities than differences.
This blog post sets out to describe the technologies and processes that support big data, and how they are more similar than different to the large-scale data processing of the past. The post walks through the accompanying image from top to bottom, describing how I see the real world as a source of data and the purpose of each step (or grey box) in the big data realm. I do believe that many of these boxes (or processes) have remained the same with respect to the processing of data (big and large).
Real world
Sources of data
The point I want to make is that in the real world a lot of data is being created all the time, and only a small amount of it is actually being collected. What is being collected is already considerable and comes from a plethora of sources (and this is only the beginning of data collection in an internet-of-everything world). This is a high-level list of what I see as the current set of data creators / collectors:
- log and event data - server logs, click-through events, page views, APIs called, etc.
- transactional data - traditional data processing systems across all industries / organizations / institutions.
- multimedia data - movies, images, photographs, music, etc.
- geolocation - latitude, longitude and other relevant location / movement data
- unstructured data - unorganized or having no data model or pre-defined structure.
- device data - data made available through small or handheld devices
- sensor data - data coming from sensors attached to objects (remote or otherwise) - in time, this is where the greatest amount of data will originate.
- streamed data - audio, video, astronomical, etc.
- human data - data about people, in its broadest sense.
The methods of data collection have remained similar over the last 40 years (well, much longer, but...). I see data collection as capturing the details of a real world event and making it digital by recording the event using an electronic device. This capture occurs in many ways, as described in the previous list of creators / collectors. The important part of collection is consolidation, where the relevant data sources and attributes are identified and brought together (either physically or virtually) for processing. I do agree the greatest change for current big data is the three V's of big data: volume, variety, and velocity. I believe the collection methods are more similar than different over the past 40 years; they are just happening at greater volume, with more variety of sources, and coming into systems with greater velocity.
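As a minimal sketch of what capturing a real world event and "making it digital" can look like (the record shape, field names, and readings below are my own illustration, not part of the diagram), a collected event can be reduced to a simple structured record so that different sources can be consolidated later:

```python
import json
from datetime import datetime, timezone

def capture_event(source: str, payload: dict) -> str:
    """Wrap a raw reading from any source (sensor, log, device) in a
    consistent record so it can be consolidated with other sources."""
    record = {
        "source": source,                                       # where the event came from
        "captured_at": datetime.now(timezone.utc).isoformat(),  # when it was captured
        "payload": payload,                                      # the raw observation itself
    }
    return json.dumps(record)

# A hypothetical sensor reading and a hypothetical click-through event,
# both reduced to the same record shape for later processing.
print(capture_event("sensor/thermostat-42", {"temperature_c": 21.5}))
print(capture_event("web/clickstream", {"page": "/products", "referrer": "search"}))
```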
Processing (ECTL)
Preparing the data
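As a rough sketch of what an ECTL (extract, clean, transform, load) pass over collected records can look like; the field names and cleaning rules here are invented purely for illustration:

```python
# A rough ECTL (extract, clean, transform, load) pass over captured records;
# the fields and the cleaning rule are illustrative only.
raw_rows = [
    {"temperature_c": "21.5", "sensor": "thermostat-42"},
    {"temperature_c": "",     "sensor": "thermostat-43"},   # missing reading
]

def ectl(rows):
    loaded = []
    for row in rows:                                  # extract: iterate the raw source
        if not row["temperature_c"]:                  # clean: drop incomplete readings
            continue
        value = float(row["temperature_c"])           # transform: cast to a usable type
        loaded.append({"sensor": row["sensor"], "temperature_c": value})
    return loaded                                     # load: hand off to storage

print(ectl(raw_rows))
```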
Storage
Storage is where the data resides after it has been captured and processed. This may occur in real time, with the data held in in-memory storage rather than physically stored on a disk or other solid-state device. The data may also reside in an Operational Data Store (ODS), in a Data Warehouse (DWHS), in files, or as raw data. The storage may be long-term or short-term, but there does need to be a place for the data to be kept so it can be analysed and used for calculating results or developing insights. I do see storage as one of the areas creating differences in the way big or very large data is stored, processed, and analysed. In the past, data needed to be stored on a physical device, like disk. Now memory is large enough, at a reasonable cost, that storage approaches allow the database (storage) to be entirely in-memory. This can fundamentally change how large volumes of data are processed, and how database technology is implemented. The traditional row-locking database is no longer required, because the latency created by disk input / output (IO) no longer exists when the whole data store can be in memory. The approach to database technology design can change fundamentally when it no longer has to manage the issues of on-disk storage as a part of the traditional database.
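As a toy illustration of the in-memory point (not any particular database product, just my own sketch): when the whole working set fits in memory, a store can be as simple as a dictionary keyed by identifier, with no disk IO and no row locking in the read / write path.

```python
# A toy in-memory store: the entire data set lives in process memory,
# so reads and writes never touch disk. Real in-memory databases add
# durability, indexing, and concurrency control on top of this idea.
class InMemoryStore:
    def __init__(self):
        self._rows = {}

    def put(self, key, row):
        self._rows[key] = row

    def get(self, key):
        return self._rows.get(key)

    def scan(self, predicate):
        # A full scan is cheap when everything is already in memory.
        return [row for row in self._rows.values() if predicate(row)]

store = InMemoryStore()
store.put("thermostat-42", {"temperature_c": 21.5})
store.put("thermostat-44", {"temperature_c": 27.0})
print(store.scan(lambda row: row["temperature_c"] > 25))
```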
Exploratory Data Analysis
Once you have your data all in one place (or as you are bringing it together in one place, in Storage), you can start with Exploratory Data Analysis (EDA). I think of this as play: laying all the data out on the table and looking at it from different perspectives, flipping it around, stacking it in different ways, molding it into the shapes that it can stay in. It means using tools and techniques designed and developed to begin sense-making with the data collected. Remember, it's exploratory. And other than the wonderful new technologies (or tools) that have emerged recently, the approaches to EDA haven't changed that much over the last 40 years.
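A small sketch of what that play can look like in practice; I am using pandas here as my own choice of tool, on a made-up data set:

```python
import pandas as pd

# A made-up data set standing in for whatever has landed in storage.
df = pd.DataFrame({
    "source": ["sensor", "sensor", "clickstream", "clickstream", "sensor"],
    "value":  [21.5, 27.0, 1.0, 3.0, 19.2],
})

# "Laying the data out on the table and looking at it from different
# perspectives": summary statistics, then the same data re-stacked by source.
print(df.describe())
print(df.groupby("source")["value"].agg(["count", "mean", "max"]))
```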
Data Cubes / Multi-dimensional Cubes
I like cubes, particularly multi-dimensional cubes. Always have, even as merged data sets. There is something that makes sense to me about loading related data sets into a technology that builds insight into the relationships within the data. Online Analytical Processing (OLAP) is a multidimensional analysis approach that has also been around for some 40 years.
I do see EDA and OLAP as similar, but I believe the tools and techniques available to (or developed for) EDA are broader and deeper than the tools and techniques associated with OLAP.
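For the cube side, a pivot across two dimensions gives a feel for the multidimensional slicing OLAP offers. The example below uses pandas' pivot_table as a stand-in for a real OLAP engine, with invented dimensions and measures:

```python
import pandas as pd

# Invented sales facts with two dimensions (region, quarter) and one measure.
facts = pd.DataFrame({
    "region":  ["north", "north", "south", "south", "south"],
    "quarter": ["Q1", "Q2", "Q1", "Q1", "Q2"],
    "revenue": [120, 135, 80, 95, 110],
})

# A two-dimensional "cube": revenue aggregated by region x quarter,
# the same roll-up an OLAP tool would let you slice and dice interactively.
cube = facts.pivot_table(index="region", columns="quarter",
                         values="revenue", aggfunc="sum", margins=True)
print(cube)
```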
Bringing intelligence to the data
The term business intelligence always fascinates me for historical reasons; it is well documented that the term was first articulated over 150 years ago. The idea of bringing together disparate sources of "large" data for competitive business advantage isn't new. What is relatively new (the last 60 years or so) is the use of computers and digital data storage for the processing and analysis. I do consider that business intelligence and data analysis were born out of the data warehousing stream of big data, and a lot of the algorithms and statistical models find their data processing roots in traditional large data initiatives. I consider machine learning the newcomer in the big data realm, for it is only recently that the volume of data, the commodity-priced hardware, the software capacity, etc. have created the thinking / need behind machine learning.
Machine Learning, Algorithms, Statistical Models
I see the troika of machine learning, algorithms, and statistical models as the intelligence side of "big data". Collectively, the three links in the previous sentence give great descriptions of these three parts of deriving intelligence and knowledge from data. The big part of these three is that they automate the creation of the "intelligence": they allow data to be consumed and "knowledge" to be created so decisions can be made in real time (by the computer, or the network, depending how you look at it) to impact the way further information is presented.
It is important to note that machine learning, algorithms, and statistical models (as indicated by the data flow arrows) often get data directly from storage (real-time or otherwise) and skip the EDA step. This is often because, once the data and its sources are understood, no more exploration is required and the data can be accessed directly for intelligent processing.
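A minimal sketch of that troika in code, using scikit-learn (my own choice of tool, not something prescribed above) on synthetic data taken "directly from storage": an algorithm fits a statistical model, and the fitted model then makes decisions with no person in the loop.

```python
from sklearn.linear_model import LogisticRegression

# Synthetic "stored" data: feature rows and a known outcome for each.
X = [[0.2, 1.0], [0.4, 0.8], [0.9, 0.1], [0.8, 0.3]]  # e.g. engagement features
y = [0, 0, 1, 1]                                       # e.g. clicked an ad or not

# The algorithm (logistic regression) fits a statistical model to the data...
model = LogisticRegression().fit(X, y)

# ...and the model can then make decisions in real time, with no person in the loop.
print(model.predict([[0.7, 0.2]]))        # predicted outcome for a new observation
print(model.predict_proba([[0.7, 0.2]]))  # and the model's confidence in it
```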
Data Product
A data product is information (precise or otherwise) that has been derived from Business Intelligence, Data Analysis, Machine Learning, Algorithms, or Statistical Models. These products can be added to the attributes of another product, or be used to focus a decision or alter a user experience to better suit the specific viewer. In its simplest terms, a data product is what is harvested from your Google search terms to display focused and specific ads on your views of subsequent (and seemingly unrelated) web pages.
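As an illustration of the ad example (the categories, terms, and scoring below are entirely invented), a data product can be as small as a mapping from observed search terms to the ad a page should render next:

```python
# A toy data product: recent search terms are scored against ad categories,
# and the highest-scoring category decides which ad to show next.
AD_CATEGORIES = {
    "hiking boots": "outdoor-gear",
    "tent": "outdoor-gear",
    "flight deals": "travel",
}

def choose_ad(search_terms):
    scores = {}
    for term in search_terms:
        category = AD_CATEGORIES.get(term)
        if category:
            scores[category] = scores.get(category, 0) + 1
    # Fall back to a generic ad when nothing matches.
    return max(scores, key=scores.get) if scores else "generic"

print(choose_ad(["hiking boots", "tent", "weather"]))   # -> "outdoor-gear"
```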
http://selection.datavisualization.ch/
One of the areas where I continue to be amazed is the growth and innovation in the graphical approaches to displaying data. A look into some of the open source frameworks will amaze; the D3.js library is an excellent example. I believe there are four main aspects to rendering data; these are:
- Reporting - I consider reporting to be text or graphical documents (printed or otherwise), spreadsheets, emails, etc. that report on information or knowledge derived from the processing of data.
- Dashboards - I consider dashboards to be the digital, single-page rendering of current and important decisions for the day-to-day activities of a business, enterprise, organization, etc.
- Visualizations - the graphical visualizations of processed data
- Communications - outbound communications derived from data processing can be in many different forms. Using rich media and emerging technologies to increase variety, frequency and management of communication channels can assist with big data efforts.
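As a tiny example of the visualization aspect, here is a made-up set of readings rendered with matplotlib; I am using it instead of D3.js purely to stay in the same language as the other sketches in this post:

```python
import matplotlib.pyplot as plt

# Made-up daily readings; in practice this would come from storage or a data product.
days = ["Mon", "Tue", "Wed", "Thu", "Fri"]
temperature_c = [21.5, 22.1, 19.8, 23.4, 24.0]

plt.plot(days, temperature_c, marker="o")
plt.title("Average temperature by day (illustrative data)")
plt.ylabel("°C")
plt.savefig("temperature.png")   # or plt.show() in an interactive session
```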
Decisions differ from data products in that they are cognitive / intelligent activities done by people. People use the reports, dashboards, visualizations, and communications, which provide the knowledge to make informed decisions. All of the steps in gathering, processing, and analyzing the data down the right side of the image lead to supporting the human activity of decision making. Even though the processes are similar within big data and traditional large data, the end product on the traditional side is most often input to people, whereas the big data side creates data products used by machines / computers.
Similarities and Differences
I dislike using percentages to indicate differences, but I would say the tools, techniques, and approaches to big data are more than 80% similar to those of traditional large data. Over the last 40 years big data has been present (under different names) and used in decision making, research, science, etc. The tools, techniques, and approaches follow a trajectory that has built upon what has happened in the past; so what we have today with big data has built upon many technologies and techniques developed over the past 40 or more years.