The idea of building a data lab has emerged from my ocean data conversations and from thinking about how best to apply my knowledge and skillset to this opportunity. In my mind, the service offering would be twofold:
- Data Engineering / Software Development consulting and services with a focus on ocean data. We will do the heavy lifting of extracting, cleansing, transforming, and loading your data, and then help with analyzing and visualizing it. We are comfortable working in both the open source and Microsoft technology stacks.
- Standing up (and loading data into) the technology stack for the data lab. You are going to need to host all this compute power and storage somewhere. It could be on-premises; most likely, it will be in the cloud. We can help with this too. We could build it in Azure using the Microsoft technology stack, or we could build an open source stack on top of Linux in any of the hosted environments of Azure, AWS, or Rackspace.
- Use open source where you can. Fortunately, many of the infrastructure components, tools, frameworks, and programming languages for the data lab are open source.
- Automate the build so the stack can be stood up and torn down with ease. This eliminates the need for the stack to always be running.
- Store the data at its source, if possible, and fetch and load it when you automatically rebuild the stack. Keep in mind this limits the amount of big data you can store locally, and loading large amounts of data can cumulatively take days. Be mindful of this.
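The fetch-on-rebuild principle above can be sketched as a small freshness check. This is a minimal illustration, not part of the stack itself; the path and age threshold are assumptions for the example:

```python
import os
import time

def needs_fetch(local_path: str, max_age_days: float) -> bool:
    """Return True when the locally cached copy of a source dataset
    is missing or older than max_age_days, meaning an automated
    rebuild should re-fetch it from the source of record."""
    if not os.path.exists(local_path):
        return True
    age_seconds = time.time() - os.path.getmtime(local_path)
    return age_seconds > max_age_days * 86400
```

A rebuild script would call this per dataset and only re-download what is stale, which keeps teardown/rebuild cycles from re-loading days' worth of data unnecessarily.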
Note: this stack is meant to showcase the service's capabilities. A full data lab would also need the ability to both persist and fetch data. It's going to take some time to build the data lab!
The Data Lab Technology Stack
The deployment of this technology stack will use open source wherever possible, running on a Linux (Ubuntu) server hosted at Rackspace. The rationale for these decisions is:
- Little to no licensing costs
- Strong familiarity with Rackspace as hosting company
- Existing domain name (endeavours.com) hosted with Rackspace
- Extensive experience with Ubuntu Linux in a hosted environment
- Familiarity with deploying data intensive solutions using the ELK stack
- Experience programming in Python
Note: The deployment of this technology stack will happen in phases, where each phase will complete with some basic tests to ensure the stack behaves as desired.
Phase 0 will be a basic ELK stack (Elasticsearch, Logstash, Kibana) running on an Ubuntu Linux server hosted at Rackspace, with access via the endeavours.com domain. The use case (where the data comes from, how we transform it, and how we analyze and visualize it) is still to be determined. That use case will be used to test this first iteration of the newly stood up data lab. Exciting times!
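One basic test of the phase 0 stack could be to query Elasticsearch's `GET /_cluster/health` endpoint and check the `status` field. Below is a hedged sketch: the base URL is an assumption, and the helper simply treats `green` or `yellow` as usable:

```python
import json
from urllib.request import urlopen

ACCEPTABLE = {"green", "yellow"}  # "red" means unassigned primary shards

def cluster_is_healthy(health_json: str) -> bool:
    """Parse the body of Elasticsearch's GET /_cluster/health response
    and report whether the cluster is in a usable state."""
    status = json.loads(health_json).get("status")
    return status in ACCEPTABLE

def check_stack(base_url: str = "http://localhost:9200") -> bool:
    # Hypothetical entry point for the phase 0 smoke test;
    # assumes Elasticsearch is reachable at base_url.
    with urlopen(f"{base_url}/_cluster/health") as resp:
        return cluster_is_healthy(resp.read().decode("utf-8"))
```

A similar HTTP check against Kibana's port would round out the phase 0 tests.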
During phase 1 we will add the Python programming language to the technology stack and use it for two purposes:
- Apply a model to the data using Python.
- Present the processed data to a web page for display.
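The two phase 1 purposes can be sketched together. This is only an illustration of the shape of the work; the "model" here is a toy moving average over hypothetical ocean temperature readings, and the web output is a bare HTML fragment:

```python
from statistics import fmean

def moving_average(readings, window=3):
    """Toy stand-in for 'applying a model': smooth a series of
    (hypothetical) ocean temperature readings."""
    return [
        fmean(readings[i:i + window])
        for i in range(len(readings) - window + 1)
    ]

def to_html(values):
    """Render the processed values as a minimal HTML fragment,
    ready to drop into a web page for display."""
    rows = "".join(f"<li>{v:.2f}</li>" for v in values)
    return f"<ul>{rows}</ul>"
```

In the real lab the model would be driven by the chosen use case, and the presentation layer would likely be a small Python web framework rather than string templating.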
During phase 2 we will add Kafka as an infrastructure resource, identify additional data sources, and pre-process the data before it is loaded into Elasticsearch.
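The heart of that pre-processing step is a pure transform applied to each message pulled off a Kafka topic before indexing. A minimal sketch, with illustrative field names and a plausibility range that are assumptions, not a spec:

```python
import json

def preprocess(raw: bytes):
    """Clean one raw message (e.g. consumed from a Kafka topic)
    before it is indexed into Elasticsearch.
    Returns a dict ready for indexing, or None to drop the record."""
    try:
        record = json.loads(raw)
    except json.JSONDecodeError:
        return None  # drop malformed messages
    temp = record.get("sea_surface_temp_c")  # hypothetical field name
    if temp is None or not (-5.0 <= float(temp) <= 40.0):
        return None  # drop readings outside a plausible range
    return {
        "station": str(record.get("station", "unknown")).strip().lower(),
        "sea_surface_temp_c": float(temp),
    }
```

Keeping the transform pure like this means it can be unit tested without a running Kafka broker; the consumer loop that feeds it is just plumbing.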
Phase 3 and beyond
Investigate the Apache Data lab stack, add Spark to our lab, add a data workbench...