Tuesday, May 06, 2014

Big Data: Similarities and Differences

Compare and contrast: VLDB and Big Data.
I've been a data guy for over 25 years. My undergrad degree is in Technology with a specialty in Database Management Systems (DBMS). The focus of my whole career has been on the data... I believe if the data is wrong (even just a small amount of it), most of the other related IT is pretty much useless, and any reporting or analytics should be taken with more than a little skepticism. In my opinion, it's all about the data and its correctness, or accuracy. This has been a cornerstone of my career; everything I do has elements of advocating for data quality.

I continue to be entrenched in data-related projects. My current project is focused on opening up Machine-to-Machine (M2M) data exchange using satellite networks and RESTful APIs. Very cool and very relevant to open data / big data. I continue to work with and read about data (big and small)... but I don't see that much has changed from the Very Large Database (VLDB) discussions of the past 40 years. Don't get me wrong; the amount of data has never been so big, the ability to process data has never been greater, the algorithms and models have matured, and the technologies to support big data have never been better. Still, when I compare big data with VLDB and past data analytics, I see more similarities than differences.

This blog post sets out to describe the technologies and processes that support big data, and how these are more similar than different to the data processing of the past and present. The post describes the accompanying image from top to bottom, with descriptions of how I see the real world as related to data and the purpose of each step (or grey box) in the big data realm. I do believe that many of these boxes (or processes) have remained the same in relation to the processing of data (big and small).

Real world
Sources of data
Data comes from many sources! It is good to keep in mind how little of it is actually captured when considering the entirety of data creation vs. data collection. As an example, data is created in massive quantities as every person moves through their day. Heart rate, body temperature, calories burned, eye movements, blood sugar levels, foods eaten, decisions made, walking pace, etc., etc... And all these data attributes change throughout each person's day. All of this, and a massive number of other data attributes, is what makes up data creation in the real world. When you consider that all this data is created by every person, every second of every day, it becomes a massive amount of data creation. And this example only includes people; data creation is even greater when every object on the planet is considered a data creator.
The point I want to make is that in the real world there is a lot of data being created all the time, and only a small amount of it is actually being collected. What is being collected is already considerable and comes from a plethora of sources (and this is only the beginning of data collection in an internet-of-everything world). This is a high-level list of what I see as the current set of data creators / collectors:
  • log and event data - server logs, click-through events, page views, APIs called, etc...
  • transactional data - traditional data processing systems across all industries / organizations / institutions.
  • multimedia data - movies, images, photographs, music, etc.
  • geolocation - latitude, longitude and other relevant location / movement data
  • unstructured data - unorganized or having no data model or pre-defined structure.
  • device data - data made available through small or handheld devices
  • sensor data - data coming from sensors attached to objects (remote or otherwise) - in time, this is where the greatest amount of data will originate.
  • streamed data - audio, video, astronomical, etc.
  • human data - data about people, in its broadest sense.
The methods of data collection have remained similar over the last 40 years (well, much longer, but...). I see data collection as capturing the details of a real-world event and making it digital by recording the event using an electronic device. This capture occurs in many ways, as described in the previous list of creators / collectors. The important part of collection is consolidation, where the relevant data sources and attributes are identified and brought together (either physically or virtually) for processing. I do agree the greatest change with current big data is the three Vs of big data: volume, variety, and velocity. But I believe the collection methods are more similar than different over the past 40 years; they are just happening at greater volume, with more variety of sources, and coming into systems with greater velocity.
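To make the idea of collection concrete, here is a minimal sketch of turning a real-world event into a digital record: parsing a server log line (the first creator / collector in the list above) into a structured event. The log format and field names are illustrative; real servers vary.

```python
import re

# Hypothetical combined-log-style line; real formats vary by server.
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<ts>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) \S+" (?P<status>\d{3})'
)

def parse_log_line(line):
    """Turn one raw log line (a real-world event) into a structured record."""
    m = LOG_PATTERN.match(line)
    if m is None:
        return None  # unparseable line; real pipelines count or quarantine these
    rec = m.groupdict()
    rec["status"] = int(rec["status"])
    return rec

event = parse_log_line(
    '203.0.113.9 - - [06/May/2014:10:00:00 +0000] "GET /api/v1/items HTTP/1.1" 200'
)
```

The same pattern (match, extract, structure) applies whether the source is a click-stream, a sensor feed, or an API gateway; only the volume and velocity change.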

Processing (ECTL)
Preparing the data
Processing is about getting the data from many different sources into a state and place where it can be analysed. I cut my data processing teeth with Extract, Transform and Load (ETL) work. I consider ETL the harvester of data. I see processing as the consolidation of many data sources: Extracting the data from the data collection systems, Transforming the data from these different sources so it will fit together, and then Loading it into a system (usually a data cube type technology) for analysis and reporting. As time progressed I began to include a data Cleanse step in the traditional ETL process. Cleansing is about preparing the data so that more of it transforms successfully. Not all data can be transformed into a common data store; cleansing increases the success of the transformation. And once the processing is complete, the raw (or originating) data may be discarded once results have been calculated or the meta-data determined. This discarding does not occur in all processing situations, but keep in mind the end result of processing may be cleansed and transformed raw data or some amount of calculated or meta-data. Again, I don't see that this has changed much over the last 40 years, except for the increase in volume, variety, and velocity. The methods and approaches remain the same...
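The ECTL steps above can be sketched in a few lines. This is a toy pass over two made-up source feeds; the source names, fields, and the dictionary "store" are all illustrative, not any particular product.

```python
def extract():
    # Extract: pull raw records from two differently-shaped sources.
    crm = [{"cust": " Alice ", "spend": "120.50"},
           {"cust": "Bob", "spend": "n/a"}]
    web = [{"customer": "alice", "page_views": 42}]
    return crm, web

def cleanse(crm):
    # Cleanse: normalize names and drop records that cannot be transformed.
    out = []
    for rec in crm:
        try:
            out.append({"customer": rec["cust"].strip().lower(),
                        "spend": float(rec["spend"])})
        except ValueError:
            pass  # "n/a" spend cannot be parsed; discard (or quarantine)
    return out

def transform_and_load(crm, web, store):
    # Transform: merge the sources on a common key, then Load into the store.
    views = {r["customer"]: r["page_views"] for r in web}
    for rec in crm:
        rec["page_views"] = views.get(rec["customer"], 0)
        store[rec["customer"]] = rec

store = {}
raw_crm, raw_web = extract()
transform_and_load(cleanse(raw_crm), raw_web, store)
```

Note how the Cleanse step earns its place: without it, Bob's unparseable spend would break the Transform, which is exactly the "greater transformation success" argument above.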

Storage is where the data resides after it has been captured and processed. This may occur in real-time, with the data residing in in-memory storage rather than being physically stored on a disk or other solid-state device. The data may also reside in an Operational Data Store (ODS), in a Data Warehouse (DWHS), in files, or as raw data. The storage may be long-term or short-term, but there does need to be a place for the data to be stored so it can be analysed and used for calculating results or developing insights. I do see storage as one of the areas creating differences in the way big or very large data is stored, processed, and analysed. In the past, data needed to be stored on a physical device, like disk. But now memory is large enough, at a reasonable cost, that storage approaches allow the database (storage) to be entirely in-memory. This can fundamentally change how large volumes of data are processed, and how database technology is implemented. The traditional row-locking database is no longer required, as the latency created by disk input / output (IO) no longer exists when the whole data store is in-memory. Database technology design can fundamentally change when it no longer needs to manage on-disk storage as part of the traditional database.
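An entirely in-memory data store is easy to demonstrate with SQLite's ":memory:" mode, which keeps the whole database in RAM with no disk I/O at all; the table and readings here are made up for illustration.

```python
import sqlite3

# ":memory:" creates a database that lives entirely in RAM; when the
# connection closes, the data is gone. No disk I/O is ever involved.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE readings (sensor TEXT, value REAL)")
con.executemany("INSERT INTO readings VALUES (?, ?)",
                [("s1", 20.5), ("s1", 21.0), ("s2", 19.8)])

# Query it like any other database; the storage layer just happens
# to be memory rather than disk.
(avg,) = con.execute(
    "SELECT AVG(value) FROM readings WHERE sensor = 's1'").fetchone()
```

This toy store is volatile by design; production in-memory databases add replication or snapshotting to get durability back without reintroducing per-query disk latency.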

Exploratory Data Analysis
Once you have your data all in one place (or as you are bringing the data together in one place, Storage) you can start with Exploratory Data Analysis (EDA). I think of this as play: laying all the data out on the table and looking at it from different perspectives, flipping it around, stacking it in different ways, molding it into the shapes it can hold. It means using tools and techniques designed and developed to begin sense-making with the data collected. Remember, it's exploratory. And other than the wonderful new technologies (or tools) that have emerged recently, the approaches to EDA haven't changed that much over the last 40 years.
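A first exploratory pass usually starts with simple summary statistics, looking at the same column from a few angles (centre, spread, extremes). The numbers below are made up for illustration.

```python
import statistics

# A toy column of collected values; the 48 is a deliberate outlier.
values = [12, 15, 11, 48, 14, 13, 15, 12]

summary = {
    "n": len(values),
    "mean": statistics.mean(values),
    "median": statistics.median(values),
    "stdev": statistics.stdev(values),
    "min": min(values),
    "max": max(values),
}
# The gap between mean and median hints at an outlier (48) worth exploring
# further; that "hmm, flip it around and look again" moment is EDA.
```

Whether the tool is a 1970s statistics package or a modern notebook, this sense-making loop (summarize, spot the oddity, dig in) is the part that hasn't changed.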

Data Cubes / Multi-dimensional Cubes
I like cubes, particularly multi-dimensional cubes. Always have, even as merged data sets. There is something that makes sense to me about loading related data sets into a technology that builds insight into the data relationships. Online Analytical Processing (OLAP) is a multidimensional analysis approach that has also been around for 40 years.
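The core cube idea, pre-aggregating facts along every combination of dimensions so any roll-up is a single lookup, fits in a few lines. This is a toy two-dimensional "cube" over made-up sales facts; real OLAP engines do the same thing at vastly larger scale.

```python
from collections import defaultdict

# Toy fact table: one row per sale, with two dimensions (region, product).
facts = [
    {"region": "east", "product": "widget", "sales": 100},
    {"region": "east", "product": "gadget", "sales": 50},
    {"region": "west", "product": "widget", "sales": 75},
]

# Build the cube: aggregate at every level, using "*" for "all values"
# along a dimension (the full cell, each single-dimension roll-up, and
# the grand total).
cube = defaultdict(float)
for f in facts:
    cube[(f["region"], f["product"])] += f["sales"]
    cube[(f["region"], "*")] += f["sales"]
    cube[("*", f["product"])] += f["sales"]
    cube[("*", "*")] += f["sales"]
```

Once built, "total widget sales" or "everything sold in the east" is just `cube[("*", "widget")]` or `cube[("east", "*")]`; that instant slice-and-dice is what makes cubes feel insightful.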

I do see EDA and OLAP as similar, but I believe the tools and techniques available to (or developed for) EDA are broader and deeper than those associated with OLAP.

Bringing intelligence to the data

Business Intelligence / Data Analysis
The term business intelligence always fascinates me for historical reasons; it is well documented that the term was first articulated over 150 years ago. The idea of bringing together disparate sources of "large" data for competitive business advantage isn't new. What is relatively new (the last 60 years or so) is the use of computers and digital data storage for the processing and analysis. I do consider that business intelligence and data analysis were born out of the data warehousing stream of big data, and a lot of the algorithms and statistical models find their data processing roots in traditional large data initiatives. I consider machine learning the newcomer in the big data realm, for it is only recently that the volume of data and the commodity-priced hardware, software capacity, etc. have created the thinking / need behind machine learning.

Machine Learning, Algorithms, Statistical Models
I see the troika of machine learning, algorithms, and statistical models as the intelligence side of "big data". Collectively, the three hyper-links in the previous sentence give great descriptions of these three parts of deriving intelligence and knowledge from data. The big part of these three is that they automate the creation of the "intelligence"; they allow data to be consumed and "knowledge" created so decisions can be made in real-time (by the computer, or network, depending how you look at it) to impact the way further information is presented.
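As a minimal sketch of that automation: here is the simplest statistical model I know, ordinary least squares for a single feature, "learned" from data in pure Python. The training data is made up (a noiseless line, for illustration only); real pipelines would use a library, but the shape is the same: data in, fitted parameters out, predictions made automatically for new inputs.

```python
def fit_line(xs, ys):
    """Ordinary least squares for one feature: returns (slope, intercept)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return slope, my - slope * mx

# Toy training data lying exactly on y = 2x + 1.
xs = [1, 2, 3, 4]
ys = [3, 5, 7, 9]

slope, intercept = fit_line(xs, ys)

# The "machine" now decides for inputs it has never seen.
prediction = slope * 10 + intercept
```

The loop above, fit once, then predict in real-time without a human in the loop, is exactly the "knowledge created so decisions can be made by the computer" idea, just at toy scale.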

It is important to note that machine learning, algorithms, and statistical models (as indicated by the data flow arrows) often get data directly from storage (real-time or otherwise) and skip the EDA step. This is often because, once data and its sources are understood, no more exploration is required and the data can be accessed directly for intelligent processing.

Data Product
A data product is information (precise or otherwise) that has been derived from Business Intelligence, Data Analysis, Machine Learning, Algorithms, or Statistical Models. These products can be added to the attributes of another product, or be used to focus a decision or alter a user experience to better suit the specific viewer. In its simplest terms, a data product is what is harvested from your Google search terms to display focused and specific ads on your views of subsequent (and seemingly unrelated) web pages.

Reporting, Dashboards, Visualizations, and Communications
One of the areas where I continue to be amazed is the growth and innovation in graphical approaches to displaying data. A look into some of the open source frameworks will amaze; the D3.js library is an excellent example. I believe there are four main aspects to rendering data; these are:
  • Reporting - I consider reporting to be text or graphical documents (printed or otherwise), spreadsheets, emails, etc. that report on information or knowledge derived from the processing of data. 
  • Dashboards - I consider dashboards the digital single-page rendering of current and important information for the day-to-day decisions of a business, enterprise, organization, etc...
  • Visualizations - the graphical visualizations of processed data 
  • Communications - outbound communications derived from data processing can be in many different forms. Using rich media and emerging technologies to increase variety, frequency and management of communication channels can assist with big data efforts.
Decisions differ from data products as they are cognitive / intelligent activities done by people. People use the reports, dashboards, visualizations, and communications, which provide the knowledge to make informed decisions. All of the steps in gathering, processing, and analyzing the data down the right side of the image lead to supporting the human activity of decision making. Even though the processes are similar between big data and traditional large data, the end product on the traditional side is most often input to people, whereas the big data side creates data products used by machines / computers.

Similarities and Differences
I dislike using percentages to indicate differences, but I would say the tools, techniques, and approaches to big data are more than 80% similar to those of traditional large data. Over the last 40 years big data has been present (under different names) and used in decision making, research, science, etc... The tools, techniques, and approaches follow a trajectory built upon what has happened in the past; so what we have today with big data has been built upon many technologies and techniques developed over the past 40 or more years.

Saturday, May 03, 2014

The St. John's NL 40 million population

I was inspired by a conversation I had earlier this week. Actually, I was inspired by many conversations this week. A really great week all around. One of the great conversations was about the size of the St. John's NL business marketplace as defined by what is within reach of a single-hop flight. Well... I consider the St. John's market population to be over 40 million, and it includes two of the largest business cities on the planet (London & New York).

So when you look at the direct flights available from St. John's and consider the cumulative population of these cities [ London (12.6 million), New York (19.1 million), Toronto (6.4 million), Montreal (3.8 million) ] and their collective global financial influence, the market for St. John's is massive and on solid financial footing.

If you are growing a business or thinking about starting a business in St. John's (or any city within a single-hop flight of your own city), the market is a lot bigger than you think. So maybe shift how you perceive your market: reach out across your city's direct flights, consider what is at the other end, and book some flights. Use the global communications network to your advantage, visit each of these cities on a regular basis, and budget for it in your business planning. Given St. John's mid-Atlantic location, the future is indeed bright!

Managed Endorsements

I'm approaching my 10th anniversary on LinkedIn and I have found it a magnificent record of my professional life. The fact that it is published to the web is a positive side benefit. Surprisingly (or maybe not), I use it as my system of record for my professional life. When I enter into a situation that may require a resume, the first thing I do is make sure my LinkedIn profile is up to date. And I reconcile updates to my resume against my LinkedIn profile. It is the easiest and best organized place to keep my professional profile information.

So when a past associate from Mozilla pointed to the year-old blog post about "empty endorsements", I started reflecting on how I disagree with it. Don't get me wrong, I have the utmost respect for the work done by Erin and Alex. And the world is a whole lot better place because they are in it. As we move further into our digital, connected, and social media lives... the idea of online or digital endorsement becomes increasingly important. Staying connected with people is our connected knowledge (*we store our knowledge in our friends*); and really, over a life well lived, we don't know when things will come full circle. So staying connected to people in multiple ways, and acknowledging (or endorsing) a person's skills or knowledge you are familiar with, is the right thing to do. I do know I recently endorsed Erin for her leadership skills. I didn't do this lightly; I was mindful when I did it. I spoke with Erin a number of times during my time with Mozilla and I observed how she led a group; she is a good leader. So when I was prompted by LinkedIn (an option she has chosen to use) on a skill she has included with her profile, I thought about it and made the endorsement. From my experience, I will always consider Erin a good leader. If Erin (or anyone) truly believes LinkedIn endorsements are empty, I will politely suggest they turn off the ability to be endorsed.

From my perspective my LinkedIn endorsements are not empty, either given or received. They could be if I wasn't mindful when I gave an endorsement, or didn't consider which endorsements I displayed. (I regularly prune / update my online profiles). I believe in social media and paying it forward. I believe our personal reputations are an accumulation of all our contributions, recommendations, endorsements, badges, interactions, etc... across all the locations we participate and contribute online (and more importantly, offline).

Tuesday, April 15, 2014

Virtual Community of Practice User Stories

I continue to explore the conundrum of "How do you build a Community of Practice in a closed environment where you can't reach out due to client confidentiality?" The background to this can be found in my previous post titled "Virtual Community of Practice Conundrum". In this post I list what I consider to be the user stories for this cross-boundary community of practice. The purpose of these stories is so we can design the technical infrastructure to facilitate such a community. But first we need to identify the community needs using non-technical terms.
User Roles
I see three primary user roles in which to base the user stories, these are;
  1. Steward - this role provides stewardship (and administration, when necessary) of the community. Stewards are usually community members who mostly have an eye to keeping the community healthy and active. Sometimes they take on an administrative role when technical issues arise.
  2. Member - someone who participates by contributing and engaging with the community. This participation can come in many forms; leading discussions, adding rich media content, organizing companion face-to-face activities, speaking up and adding to discussion, linking to relevant and related materials, using the community hashtag, and consuming content from multiple devices.
  3. Lurker - someone who consumes the community content from many of their devices, yet never participates by contributing content. Don't underestimate the value of lurkers to your community!
User Stories
This is the set of user stories I have identified for the community of practice which crosses organizational boundaries while also honoring client confidentiality. Please feel free to add others as comments to this blog post...
  • As a member I want to participate in community discussion.
  • As a member I want to learn new things regarding the community's subject domain.
  • As a member I want a way to pull notifications.
  • As a member I want to be able to block notifications.
  • As a member I want to add content (text, images, video, presentations, etc) to the community.
  • As a member I want to link to external resources.
  • As a member I want to share openly without violating client confidentiality.
  • As a member I want a profile page or ability to link to a profile page so people can get to know me.
  • As a member I want to invite friends and peers to the community.
  • As a member I want a way to reach out to other members.
  • As a member I'd like multiple ways to participate (even face-to-face...)
  • As a lurker I want to view community content across all my devices.
  • As a lurker I want my read-only participation to remain anonymous.
  • As a lurker I want to have the ability to become a participating member.
  • As a lurker I'd like multiple ways to participate (even only a spectator)
  • As a steward I want a way to push notifications to community members.
  • As a steward I want to prevent confidential information entering the community.
  • As a steward I want to remove content and block members who are adding inappropriate content (i.e. spam, adult content, sales information, self promotion).
  • As a steward I want to reduce internet trolls.
  • As a steward I want a common hashtag(s) for the community.

Monday, April 14, 2014

Virtual Community of Practice Conundrum

-- T L D R ----------------

What do you need to consider when building a Community of Practice (CoP) that spans organizational boundaries, where client confidentiality needs to be honored? There are a plethora of things to be considered when building an online (virtual) community of practice. These include: the team's and the context's relationship with openness, the membership's ability to be self-determined, how online communication will be broadened and followed, and how the internet is the platform.


How does a Community of Practice (CoP) steward itself across organizational boundaries? What are the requirements and constraints to successfully building a CoP when openness and confidentiality are in conflict with one another? What technology platforms support a closed virtual learning community while integrating with a busy and confidential work schedule? Can you leverage all the innovative social media technologies, which can be very well applied to learning (CoP), from within a closed virtual community? This series of blog posts sets out to answer many of these questions... but first,

Some background
I've been building Communities of Practice (CoP) for over 10 years. It started during my M.Ed, where I was bringing together 15 years of professional technology experience with 10 years of college-level teaching and online learning. I now see my skills and knowledge being well applied in building large, complex information technology systems and in being a seasoned educational technologist. I also believe it is important to provide some of my personal background, beliefs, and experience on the subject of CoP. What I believe is particularly relevant to this post is my time spent with both open projects and corporate enterprise projects. These projects include: Mozilla, Mediawiki, WikiEducator, Wikiversity, P2Pu, Bowen eGovernment, UNESCO, Open Data, CLEBC, ICBC, Commonwealth of Learning, and other smaller projects. This experience has exposed me to open democracies, open spaces, open communities, open boards of directors, open, open, open. Where the sharing of information is crazy transparent and all meetings are open to everyone. This experience has also exposed me to large enterprises; closed, locked-down, proprietary, and obfuscated information exchanges. Having other people filter information is the standard practice in many of these closed types of organizations. My work has occurred at the extremes of both of these types of organizations, but much of my work has also happened in the middle ground. From all this, it is my belief that open communication is better than the alternative, particularly when wanting to encourage learning or when building a community of practice. It is best to let employees be their own filters of information, and exchanging information helps everyone learn. The current and emerging pedagogical approaches also support openness vs. being closed. I believe it is important for learners working within a CoP to be able to reach out to the larger community and draw in resources from these other external communities.

Some history (or key technologies)
What do I consider the key information technologies and attributes of learning communities that have emerged over the last 20 years and have the biggest impact on virtual and online learning communities of practice?
  • Communities of Practice - obviously the work of Étienne Wenger is big here. The idea that "most learning does not take place with the master, it takes place among the apprentices" is, IMO, important to building a virtual learning community.
  • Open Approaches - Open Space, Open Leadership, Open Data, Open Etc... In my opinion, and experience, openness is very important to learning and in successfully building a community of practice. New people, new opinions, ongoing mentorship and peer learning needs to be refreshed. Without openness it becomes closed and stagnant. I have yet to experience a CoP that remains active beyond a year without having new people involved and ideas coming in from outside the group. Openness is key.
  • Autodidacticism / Heutagogy - Participants in a CoP need to be self-determined learners. This means you can't just lurk in a community; at some point you need to participate. It is the internal commitment to learning in the subject domain that must come first. Then participation in the community becomes part of the self-determined learning. In my opinion this is one of the keys to a valuable CoP: the commitment level of the members.
  • Visual Communication - People need to share through a variety of mediums, this should not be restrained within the CoP. Often the results of a visual meeting (or otherwise) can be shared for record keeping, review and to prompt further discussion, etc...
  • Social Media - social media is not social learning, but it is important in building community and allowing people to participate online where they want and when they want.
  • Tagging - social tagging can be an excellent way to draw a community together by allowing members to share their learning and related reference materials across different social media platforms.
  • Platforms - having an online place to host the community is essential. But given a solid tagging approach this doesn't have to be a traditional platform, it could be the internet as a whole. What is important is that it is accessible by everyone - from everywhere, on any device. In the end, you need to consider the whole internet as the platform.
Some assumptions
Most learning occurs outside of traditional approaches; it occurs 24 hours a day and is a continuous activity that includes (and should not be limited to) the use of open social media tools. A community of practice is social learning and is further enhanced with access to online and virtual communities. Blended learning is important, as face-to-face time (when available) should be encouraged, even if the face time is among a sub-set of the community members.
Being a self-determined learner is important, as it provides the intrinsic motivation to deepen engagement within the CoP. It is a valid amount of participation to only lurk within the CoP, but there does come a point where members need to dive in and participate. Learning will then be deeper and broader, but the motivation needs to be there. And most often, long-term commitment to participate comes from intrinsic motivation.

The conundrum
How do you build a Community of Practice in a closed environment where you can't reach out due to client confidentiality?

In the next post on this theme we will discuss the requirements of a community of practice where client confidentiality is key. The thinking being if we can correctly identify the requirements, we will then be able to identify a platform best suited to cross organizational boundaries.

Thursday, January 02, 2014

Proposed ACAITA Event Schedule

Thursday January 16th Lunchtime
Location: Erin's Pub (1:00 - 2:30 pm)
- Roles of the IT Architect
- Approaches to building the ACAITA

Thursday January 30th Online 
Location: Google Hangout
- Professional Development of the IT Architect
- Approaches to building the ACAITA

Thursday February 13th Lunchtime
Location: TBD
- Certification paths available to the IT Architect
- Managing complexity within IT Architecture

Thursday February 27th Online 
Location: Google Hangout
- Certification paths available to the IT Architect
- Teamwork among the Solution and Enterprise Architects
- in other words: Roles and Responsibilities of the Solution and Enterprise Architects

Thursday March 13th Evening
Location: Erin's Pub
- Impact of Open Source on IT Architecture
- Professional Development of the IT Architect
- Approaches to building the ACAITA

Thursday March 27th Online
Location: Google Hangout
- Discussion / Review of the Snowman Architecture
- Approaches to building the ACAITA