Monday, March 08, 2021

Building the Data Lab Technology Stack

The idea of building a data lab has emerged from my ocean data conversations and from thinking about how best to apply my knowledge and skill set to this opportunity. In my mind, the service offering would be twofold:

  1.  Data Engineering / Software Development consulting and services with a focus on ocean data. We will do the heavy lifting of extracting, cleansing, transforming, and loading your data, and then help with analyzing and visualizing it. We are comfortable working in both the open source and Microsoft technology stacks.
  2. Standing up (and loading data into) the technology stack for the data lab. You are going to need to host all this compute power and storage somewhere. It could be on-premises; most likely, it will be in the cloud. We can help with this also. We could build it in Azure using the Microsoft technology stack, or we could build an open source stack on top of Linux in any of the hosted environments of Azure, AWS, or Rackspace.
How do you build a low-priced, large-compute technology stack to support data engineering efforts, implement a data lab, and showcase these new service capabilities? The low price is the key factor given the current startup state of this ocean data endeavour, particularly when you consider the cost of compute for processing and storing large amounts of data. I believe the best way forward is as follows:

  1. Use open source where you can. Fortunately, many of the infrastructures, tools, frameworks, and programming languages for the data lab are open source.
  2. Automate the build so it can be built and torn down with ease. This would eliminate the need for the stack to always be running.
  3. Store the data at its source, if possible. Fetch and load the data when you automatically rebuild the stack. Keep in mind this limits the amount of big data you can store locally, and loading large amounts of data can cumulatively take days.
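To make points 2 and 3 concrete, here is a minimal sketch of the load step that could run on each automated rebuild: records fetched from their source are converted into the NDJSON payload that Elasticsearch's `_bulk` endpoint expects. The index name, field names, and sample readings are illustrative assumptions, not part of the actual lab.

```python
import json

def to_bulk_ndjson(records, index):
    """Convert an iterable of dicts into Elasticsearch bulk-API NDJSON.

    The _bulk endpoint expects each document to be preceded by an
    action line naming the target index.
    """
    lines = []
    for doc in records:
        lines.append(json.dumps({"index": {"_index": index}}))
        lines.append(json.dumps(doc))
    return "\n".join(lines) + "\n"

# Hypothetical ocean-sensor readings fetched at rebuild time.
readings = [
    {"station": "buoy-01", "temp_c": 8.4},
    {"station": "buoy-02", "temp_c": 7.9},
]
payload = to_bulk_ndjson(readings, "ocean-readings")
```

On a rebuild, this payload would be POSTed to the stack's `_bulk` endpoint; because the data still lives at its source, tearing the stack down loses nothing that a rebuild cannot refetch.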

Note: this stack is meant to showcase the service capabilities. A full data lab would also need the ability to both persist and fetch data. It's going to take some time to build the data lab!

The Data Lab Technology Stack

The deployment of this technology stack will use open source wherever possible, running on an Ubuntu Linux server hosted at Rackspace. The rationale for these decisions is:

  • Little to no licensing costs
  • Strong familiarity with Rackspace as a hosting company
  • Existing domain name (hosted with Rackspace)
  • Extensive experience with Ubuntu Linux in a hosted environment
  • Familiarity with deploying data intensive solutions using the ELK stack
  • Experience programming in Python

Note: The deployment of this technology stack will happen in phases, where each phase will complete with some basic tests to ensure the stack behaves as desired.

Phase 0

Phase 0 will be a basic ELK stack running on an Ubuntu Linux server hosted at Rackspace, with access via the web domain. The use case (where the data comes from, how we transform it, and what analysis and visualization we do) is still to be determined. This use case will be used to test the first iteration of the newly stood-up data lab. Exciting times!
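Since each phase is meant to end with basic tests, a Phase 0 smoke test could simply hit Elasticsearch's cluster-health endpoint. This is a sketch assuming the default port 9200 on the lab server; `parse_health` and `check_cluster` are hypothetical helper names, and "yellow" is treated as acceptable because a single-node lab has no replica shards.

```python
import json
from urllib.request import urlopen

# Assumed location of the Phase 0 Elasticsearch node.
ES_URL = "http://localhost:9200"

def parse_health(body):
    """Pull the cluster status string ('green', 'yellow', or 'red')
    out of an Elasticsearch _cluster/health response body."""
    return json.loads(body)["status"]

def check_cluster(url=ES_URL):
    """Fetch _cluster/health and report whether the stack is usable.

    'yellow' is fine for a single-node lab: indices are available,
    replicas just cannot be allocated anywhere.
    """
    with urlopen(url + "/_cluster/health") as resp:
        return parse_health(resp.read().decode()) in ("green", "yellow")

if __name__ == "__main__":
    print("stack ok" if check_cluster() else "stack unhealthy")
```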

Phase 1

During phase 1 we will add the Python programming language to the technology stack and use it for two purposes:

  • Apply a model to the data using Python.
  • Present the processed data to a web page for display.
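The model itself is still to be chosen, so as a stand-in, here is a sketch of the kind of Python step Phase 1 envisions: a trailing moving average smoothing a series of hypothetical ocean temperature readings before they are presented on a web page.

```python
def moving_average(values, window=3):
    """Smooth a numeric series with a trailing moving average.

    A deliberately simple stand-in for whatever model Phase 1
    ultimately applies to the data.
    """
    if window < 1:
        raise ValueError("window must be >= 1")
    out = []
    for i in range(len(values)):
        # Average over up to `window` points ending at position i.
        chunk = values[max(0, i - window + 1): i + 1]
        out.append(sum(chunk) / len(chunk))
    return out

# Hypothetical sea-surface temperatures in degrees Celsius.
temps = [8.0, 8.2, 8.9, 9.4, 9.1]
smoothed = moving_average(temps)
```

The smoothed series is what the second bullet would hand to the web page for display.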

Phase 2

During phase 2 we will add Kafka as an infrastructure resource, identify some additional data sources, and pre-process the data before it is loaded into Elasticsearch.
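A sketch of the kind of pre-processing Phase 2 envisions, written as a pure function so it could sit inside a Kafka consumer loop between the topic and Elasticsearch. The field names and the Fahrenheit-to-Celsius normalization are illustrative assumptions.

```python
import json

def preprocess(raw_message):
    """Pre-process one raw message (bytes, as a Kafka consumer would
    deliver it) before indexing into Elasticsearch: parse the JSON,
    drop records with no reading, and normalize field names and units.
    """
    record = json.loads(raw_message)
    if record.get("temp_f") is None:
        return None  # skip incomplete readings
    return {
        "station": record.get("station", "unknown"),
        "temp_c": round((record["temp_f"] - 32) * 5 / 9, 2),
    }

# In the real stack this would run inside a consumer loop, e.g. with
# the kafka-python library (an assumption, not yet chosen):
#   for msg in KafkaConsumer("ocean-raw"):
#       doc = preprocess(msg.value)
#       if doc is not None:
#           ...index doc into Elasticsearch...
```

Keeping the transform free of Kafka and Elasticsearch dependencies means it can be unit-tested on its own before the infrastructure pieces are wired in.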

Phase 3 and beyond

Investigate the Apache Data lab stack, add Spark to our lab, add a data workbench...