It is always difficult to detach oneself from the immediate schedules and requirements of daily project work and take the time to write a company blog post. Often a change of scenery helps a lot, and as it is now beautiful springtime in Finland, I considered taking my laptop and materials either to some outdoor café or perhaps to the city library, for a properly inspiring environment in which to breathe and write up this piece about Big Data.
Nevertheless, I ended up reserving a quiet conference room for myself and for my trusty headphones, which is always a good alternative.
It seems there's quite a lot of buzz around Big Data in Finland, and all around you hear how important it is. However, there are precious few concrete examples of what it actually means in practice, and very little has been done to guide the industries and businesses towards this new future.
People give lots of reasons and excuses for why Big Data is not happening. It's either the lack of experts, the lack of a suitable business model or a customer, or it's all someone else's problem. The real reason it is not happening as much as it should is simply the lack of vision and initiative. That lack of vision, initiative and higher cognitive functions is often papered over with the mechanistic selection of tangentially relevant courses to point the engineers to, as part of competence development. Surely a couple of Hadoop courses can replace a technology strategy.
Industry and business won't be aligned for the future if we can't create a clear vision of what Big Data is, locate each business within that vision, and define its role in creating this future.
My aim is to clarify the Big Data concept, the vision and the current state of affairs in Finland. A blog post can only go so far, though, and anyone with an interest in a more thorough and context-dependent analysis should contact the Cybercom Advisory Team.
Big Data - What is it really?
The Big Data concept was originally coined by media companies, mainly Google and Yahoo, and most of the examples about Big Data come from this business domain. The examples talk about web documents and profiling advertisements, about PageRank and BigTable. Companies in other business fields are left with some tools and technologies, mainly around NoSQL and clouds, and without a clear vision of how these tools, or Big Data itself, are relevant in their respective fields.
From a bird's-eye perspective, Big Data means that the exponential growth of data in all domains and fields leads to a situation where new ways to manage the data are needed, and new ways to benefit from the data will emerge.
Big Data has several main segments relevant to most Finnish industries and businesses: Fast Data, Large Volumes of Existing Data, NoSQL and Cloud, Open Data, Deep Data, and Visualization.
Fast Data

Fast Data refers to data which is often collected by sensors and whose primary use is time-critical: the data gets old and useless pretty fast. I also include logs in this segment, as their primary purpose is alerting about the logged event as soon as it happens.
This does not mean that the data becomes totally invalid soon after collection; the historical data often still has a lot of value, especially for deeper analysis, but the application serving the data's primary purpose needs to get it fast. Good examples are industrial process sensors, intelligent traffic GPS sources, and video and sound streams.
Fast data poses two issues: the data needs to be distributed to all its consumers quickly and efficiently, and the cumulative amount of data is so large that its storage and use require special methods. In Finland, fast data is often simply discarded, and its longer-term value is lost.
Many logs are discarded, or simply left in the field, even though for example elevator use patterns might give us valuable business insights.
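The two-sided nature of fast data can be sketched in a few lines: react to a reading immediately while it is still fresh, but retain the history for later analysis instead of discarding it. This is a minimal illustration, not a real monitoring system; the threshold and window size are made-up parameters.

```python
from collections import deque

class FastDataMonitor:
    """Alert on fresh readings immediately, but keep history for deeper analysis."""

    def __init__(self, threshold, history_size=10000):
        self.threshold = threshold                 # made-up alert limit
        self.history = deque(maxlen=history_size)  # retained for later analysis

    def ingest(self, reading):
        """Primary use: react to the reading while it is still fresh."""
        self.history.append(reading)
        return reading > self.threshold            # True -> raise an alert now

monitor = FastDataMonitor(threshold=80.0)
alerts = [r for r in (72.5, 85.1, 60.0, 90.3) if monitor.ingest(r)]
```

The point of the sketch is that alerting and retention are separate concerns: discarding the `history` deque would lose exactly the longer-term value discussed above.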
It should be noted that fast data sources are almost always distributed out in the world, and their connectivity might not be perfect. This creates some of the distinctive challenges of the fast data segment.
There are several technologies that help with data distribution, depending on the type, volume and consumer of the data being distributed, for example Content Distribution Networks, REST/WebSocket, and XMPP.
Data storage for fast data is often row-oriented, and we can use many traditional Big Data solutions for it. In addition, there are special solutions for common use cases, especially related to log data, such as the Elasticsearch, Logstash and Kibana (ELK) stack. Cybercom has special competence in this area.
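The core job Logstash does for log data is turning raw lines into structured, timestamped records that can then be indexed and searched. A minimal sketch of that step, using Python's `re` in place of Logstash's grok patterns; the log line format here is an assumption for illustration:

```python
import re
from datetime import datetime, timezone

# Assumed log line format; real Logstash uses grok patterns for this job.
LOG_PATTERN = re.compile(
    r'(?P<timestamp>\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2})\s+'
    r'(?P<level>[A-Z]+)\s+'
    r'(?P<message>.*)'
)

def parse_log_line(line):
    """Turn a raw log line into a structured record ready for indexing."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    record = match.groupdict()
    # Normalize the timestamp to UTC so records from different hosts align.
    record['timestamp'] = datetime.fromisoformat(
        record['timestamp']).replace(tzinfo=timezone.utc)
    return record

record = parse_log_line('2014-05-12T09:30:00 ERROR Elevator door sensor timeout')
```

Once log lines are structured like this, indexing them (for example into Elasticsearch) and visualizing them (for example in Kibana) become straightforward downstream steps.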
Large Volumes of Existing Data
Sometimes we actually do have large volumes of data already collected, with potential value in analysis and use, if only we could get it into a form that allows analysis. Such data could be, for example, patient records of hospitals, old genealogy documents, or thousands of unsorted and forgotten documents on shared network drives and SharePoint sites.
This data often comes in document form, sometimes on paper or old media. However, the data can also be in video or sound form. Even whole spaces can be considered essentially data and digitized as big data, as we see in 3D virtual reconstructions of archeological sites, tourist destinations, museums and the like.
This segment has several distinctive characteristics. First, the bulk of the work is digitizing the information, or collecting it into a datastore that enables analysis, summarization and correlation. Second, the data is already old, so the new value we can derive comes through deeper analysis of it.
Digitizing large amounts of existing data has to rely on specialized companies that are good at this specific thing, for example Microtask, who specialize in distributing manual work to numerous independent human contributors. OCR and Natural Language Processing approaches can also be used to create explicit structure that enables further analysis. After the data has been retrieved, value can often be derived through Open Data or Deep Data approaches.
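After OCR, even a light processing pass can pull explicit structure out of free text so documents become sortable and correlatable. This toy sketch stands in for real NLP tooling: it extracts Finnish-style d.m.yyyy dates from a digitized genealogy document with a regular expression.

```python
import re

# A stand-in for real NLP tooling: pull explicit date fields out of
# OCR'd free text so the documents can be sorted and correlated.
DATE_RE = re.compile(r'\b(\d{1,2})\.(\d{1,2})\.(\d{4})\b')  # Finnish d.m.yyyy

def extract_dates(text):
    """Return (year, month, day) tuples found in a digitized document."""
    return [(int(y), int(m), int(d)) for d, m, y in DATE_RE.findall(text)]

dates = extract_dates('Born 3.5.1874 in Turku, died 12.11.1951.')
```

Real pipelines would add named-entity recognition, OCR error correction and validation on top, but the principle is the same: turn free text into fields you can query.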
NoSQL and Cloud
This is possibly the best understood part of the Big Data vision, because it is directly realized in all software companies through the software engineers' direct competence. It is easy to see how complex data systems need environments to deploy to, and hard disk, flash and memory capacity to store the data in. Often computational capacity in the form of powerful CPU and GPU instances is required as well, especially in relation to Deep Data.
Cybercom Cloud is well suited for Big Data applications, and we have related IaaS and PaaS initiatives.
Open Data

Most Big Data is out of reach, in closed systems and forgotten databases. Often the party responsible for the data never realizes that the data could and should be opened for common use. Yet the data holds a lot of potential value that remains inaccessible to the party holding it.
Open Data is often understood to mean governmental data, but it also includes a lot of data published by private companies and organizations. Opening it requires the executives of these organizations to understand how sharing the data will increase their market value.
Since Open Data is mostly about interconnecting multitudes of data sources, the core characteristics of this segment are interfaces and data fusion technologies. Interfaces, data endpoints and query endpoints need to be able to handle the volume of queries they will receive, and also enable rich combination with other data present on the internet.
Data combination requires natural IDs and proper timestamps (UTC) as preconditions, so data model design is a critical aspect of Open Data. Query interfaces need to enable rich filtering of results, so that data can be combined with other sources without downloading the entire dataset for each query. Technologies such as REST and RDF/Jena help in opening the data.
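Why natural IDs and UTC timestamps matter becomes clear in a minimal fusion sketch. The two datasets here are hypothetical (the station code `HEL-01` and the field names are invented for illustration); the point is that records from independent publishers can only be joined because they share those keys.

```python
from datetime import datetime, timezone

# Two hypothetical open datasets sharing a natural ID ('station')
# and UTC timestamps -- the preconditions that make fusion possible.
temperatures = [
    {'station': 'HEL-01',
     'time': datetime(2014, 5, 12, 9, 0, tzinfo=timezone.utc),
     'temp_c': 11.2},
]
traffic = [
    {'station': 'HEL-01',
     'time': datetime(2014, 5, 12, 9, 0, tzinfo=timezone.utc),
     'vehicles_per_h': 840},
]

def fuse(left, right, keys=('station', 'time')):
    """Join two record lists on their shared natural keys."""
    index = {tuple(r[k] for k in keys): r for r in right}
    merged = []
    for row in left:
        match = index.get(tuple(row[k] for k in keys))
        if match:
            merged.append({**row, **match})
    return merged

combined = fuse(temperatures, traffic)
```

Without the shared natural ID and normalized timestamps, the same join would require fuzzy matching, which is exactly the kind of friction good Open Data design avoids.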
Public APIs also need to be documented and publicized; otherwise they will simply be forgotten and left unused.
Deep Data

Deep Data refers to the value we can now derive from analyzing data with computationally expensive, data-intensive tools. This value was out of reach before, as we didn't have the computational capacity required to extract it. It is quite analogous to modern mining methods capable of exploiting resources that were previously out of reach.
The analysis is based on data-intensive algorithms, and often takes the form of finding new, previously unknown connections and patterns in the data. The algorithms can be roughly divided into stream-based algorithms, such as map-reduce-combine, and graph-based algorithms, such as Markov Random Fields and other probabilistic networks, deep neural networks, and semantic search.
Different distributed numerical methods and tools can be used, such as Pregel, cloud GPU instances and iterative map-reduce.
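The map-reduce idea mentioned above fits in a few lines when stripped of the distribution machinery. This single-process sketch shows the essential structure: a map phase that runs independently per document, and a reduce phase whose per-key sums could be combined across machines.

```python
from collections import defaultdict

def map_phase(document):
    """Map: emit (word, 1) pairs -- runs independently per document."""
    return [(word, 1) for word in document.lower().split()]

def reduce_phase(pairs):
    """Reduce: sum counts per key -- partial sums are combinable across machines."""
    counts = defaultdict(int)
    for word, count in pairs:
        counts[word] += count
    return dict(counts)

documents = ['big data is big', 'data has value']
pairs = [pair for doc in documents for pair in map_phase(doc)]
counts = reduce_phase(pairs)
```

Frameworks like Hadoop add the parts that actually make this scale: splitting the input, shuffling pairs by key between machines, and combining partial reductions.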
Typical applications would be diagnostics, sentiment analysis, and business intelligence. It should be noted that sentiment analysis should not be understood in a constrained fashion, as mining Twitter feeds for positive and negative comments, although it can mean that too. It also applies, for example, to harvesting potential risk events from patient records, and in general to any case where documents need to be sorted into positive and negative ones based on the overall flavor of their content. Similarly, diagnostics doesn't mean simply storing and analyzing logs; it can also mean storing and analyzing microphone audio streams to detect soon-to-be problems in escalators, automatic doors or engines.
Visualization

Big Data requires new user interface concepts, components, paradigms and methodologies to present the value found in the data to the user; otherwise that value is lost. For this purpose we have rich, dynamic interactive graphing components in both 2D and 3D, new kinds of 3D user interface tools such as Unity3D, and augmented and virtual reality interfaces.
In addition to seeing the data, the user needs to interact with it and with its projections and parameters. The simplest case is a user selecting a time interval in a long time series, but even this case highlights the importance of properly summarizing the whole dataset so that the user can pick out the meaningful segments. Data visualization is closely related to data analysis: the actual analysis and the discovery of new patterns take place through different views of the data. In the Big Data context, these views are not trivial and static, but highly dynamic, interactive and modifiable foldings, projections, clusterings, summarizations, correlations and transformations of the data.
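The time-interval selection case depends on summarizing the whole series first, so the overview never hides the interesting parts. A minimal sketch of min/max downsampling, a common trick behind interactive time-series views; the bucket count is an arbitrary parameter:

```python
def summarize(series, buckets):
    """Downsample a long series into per-bucket (min, max) pairs, so a user
    can spot interesting intervals in an overview before zooming into raw data."""
    size = max(1, len(series) // buckets)
    summary = []
    for start in range(0, len(series), size):
        chunk = series[start:start + size]
        summary.append((min(chunk), max(chunk)))
    return summary

# A spike hidden in the middle of a long, otherwise regular series.
series = list(range(100)) + [999] + list(range(100))
overview = summarize(series, buckets=4)
```

Keeping per-bucket minima and maxima, rather than averages, is the design choice that matters here: averaging would smooth the spike away, and the user would never know which interval to zoom into.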
User interface work in the Big Data context is highly context-sensitive, and Cybercom Design Studio is well prepared to create a window into your extremely high-dimensional data.
Big Data - Current Situation
Some say that most companies only have "not so big data". This refers to studies in which only one third of the companies using Big Data technologies have databases over 1 TB in size. Obviously the issue is not that the data does not exist; it's just not stored, for lack of an immediate requirement. So this metric actually indicates that the world is not done yet, and that these companies probably do not yet collect and store all the data they could benefit from.
We are starting to see big applications being implemented in all the Big Data segments, across multiple industries, in healthcare, in intelligent traffic systems, and in public Open Data.
Big Data - Immediate Future
I call this the immediate future to underscore that the Big Data vision as a whole is something that is happening right now, not in five years.
It is estimated that sequencing a whole human genome will cost $100 per person by the year 2015. To say that the Finnish medical community as a whole is grossly unprepared for this and its consequences is not an overstatement.
We are witnessing a growing tide of public data sources being opened as Open Data, as part of Government 2.0 and similar visions. The value of interconnecting the data grows all the time, and it grows faster than the amount of open data itself: the number of potential data item pairs we can make statements about grows as the square of the number of data items we have.
In the near future, Open Data will include open personal (big) data as well, distributed to third parties through OAuth-style authorization mechanisms. Personal genomic data will be one central application in this trend, but it includes other things too, such as personal GPS tracks provided securely to third parties for targeted advertising and services.
The software industry plays the role of an expert in this future vision, and its role is to employ evangelists to teach potential customers and industries as a whole about the potential value Big Data opens up. To give meaningful advice and guidance, the software industry needs an intimate dialog with the customer markets and industry segments.
Big Data is not about a bag of tools; it's about a vision we need to realize.