Principal Data Scientist and Big Data and Analytics Lead, Fujitsu Australia Limited


Pramod Singh, Principal Consultant – Big Data at Fujitsu, provides an insight into Big Data…

Big Data is one of the leading strategic technology trends of 2013. Hence, most of the leading Information Management vendors are building capabilities around Big Data, and some leading organizations have started planning to add Big Data to their data warehouse and data integration infrastructure. However, many IT leaders and Information Managers still ask, “What on earth is this?” Is Big Data a completely new concept, or an old concept with sugar coating?

What is the Big Data Concept?
Big Data involves data sets that are big enough to obscure their underlying meaning, or data sets for which traditional methods of storing, accessing, and analyzing are infeasible or are breaking down. Several definitions of Big Data have been proposed by various groups; however, the following three characteristics are almost universally mentioned in these definitions:
Volume: Refers to the fact that Big Data involves analysing comparatively huge amounts of information, typically starting at tens of terabytes.
Velocity: Reflects the sheer speed at which this data is generated and changes. For example, RFID tags and smart metering are driving an increasing need to deal with torrents of data in near-real time.
Variety: Data today comes in many formats – from traditional databases to hierarchical data stores created by end users and transactional systems, to text documents, email, meter-collected data, video, audio, stock ticker data and financial transactions.

Data can be tagged as Big Data if it satisfies any two of the three characteristics stated above. You can therefore check whether your own business data falls into the Big Data category.
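
The two-out-of-three rule can be written down as a small check. The sketch below is illustrative only: the `DataProfile` fields and the numeric volume threshold are assumptions made for the example, since the article gives "tens of terabytes" as a typical starting point for volume and no numeric cut-off for velocity or variety.

```python
from dataclasses import dataclass

# Assumed threshold for illustration; the article only says that Big Data
# volumes typically start at tens of terabytes.
VOLUME_THRESHOLD_TB = 10.0


@dataclass
class DataProfile:
    """Hypothetical summary of a data set, used only for this example."""
    volume_tb: float        # total size of the data set in terabytes
    near_real_time: bool    # True if data arrives or changes in near-real time
    format_count: int       # number of distinct data formats involved


def is_big_data(profile: DataProfile) -> bool:
    """Return True when at least two of the three characteristics hold."""
    has_volume = profile.volume_tb >= VOLUME_THRESHOLD_TB
    has_velocity = profile.near_real_time
    has_variety = profile.format_count > 1
    return sum([has_volume, has_velocity, has_variety]) >= 2
```

For instance, a 50 TB smart-meter feed arriving in near-real time in a single format would qualify on volume and velocity, while a small, static archive would not qualify on its variety of formats alone.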
What are the previous concepts and technologies?

There have been discrete efforts to manage and analyse data with the characteristics mentioned above.

  1. Advanced Data Warehouse Appliance: New data warehouse technology based on Massively Parallel Processing (MPP) architecture, columnar storage and in-memory analytics has evolved to provide high-speed query performance and platform scalability for the management of high-volume structured data. MPP database architectures have a long pedigree. Advances in columnar storage and in-memory processing have reduced the cost of storage and improved the performance of Data Warehouse (DW) solutions. The most popular advanced DW appliances in the market are Teradata, SAP HANA, EMC Greenplum, HP Vertica, Oracle Exadata and IBM Netezza. Open-source database products in this category, such as Ingres and PostgreSQL, reduce software-license costs and allow DW-appliance vendors to focus on optimization rather than on providing basic database functionality. Some DW appliances also introduce a new approach to analysis in which analytics algorithms run inside the database engine itself, so that data movement is limited and high analytics performance is achieved.
  2. Clickstream and Real-time Data: A clickstream is a record of a user’s activity on the internet. Clickstream data is generated by the succession of mouse clicks each visitor makes on a website, and it can be collected and processed in real time. Collecting, analyzing, and reporting such data can happen at two levels: traffic analysis and e-commerce analysis. Traffic analysis tracks how many pages were browsed, how long they took to load and how much data was transmitted, whilst e-commerce analysis uses clickstream data to determine the effectiveness of the site as a channel to market. Several technologies have been developed to capture, manage and analyse clickstream data. Some of the most popular tools for providing real-time insights from clickstream data are Webtrends, Adobe Omniture, and Google Analytics. Real-time data can also be captured and analysed from core business application systems as transactions are processed. Technologies for capturing and integrating transactional data through real-time replication include Oracle GoldenGate, Informatica PowerExchange and Attunity CDC. Popular appliances for processing and analysing real-time data are Oracle RTD, SAP HANA and Splunk.
  3. Unstructured Text Data: Unstructured data does not have a pre-defined data model. It is typically text-heavy, but may contain dates, numbers, and facts. The Unstructured Information Management Architecture (UIMA) provides a common framework for processing unstructured text to extract meaning and create structured data about the information. Some popular text mining and text analytics tools are IBM SPSS, AeroText, Omniviz and SAS. These tools can parse and extract relevant information from text files and analyse text data along with associated structured data. KNIME and RapidMiner are open-source tools for text analytics.
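
The traffic-analysis level of clickstream processing described in item 2 can be sketched in a few lines. This is a minimal illustration, not how any of the named products work: the `Click` record layout, its field names and the sample values are all assumptions made for the example.

```python
from collections import defaultdict
from typing import Dict, List, NamedTuple


class Click(NamedTuple):
    """Hypothetical clickstream record; field names are illustrative."""
    visitor_id: str
    page: str
    load_ms: int      # time taken to load the page, in milliseconds
    bytes_sent: int   # data transmitted for this page view


def traffic_summary(clicks: List[Click]) -> Dict[str, Dict[str, float]]:
    """Aggregate per-visitor traffic metrics from a list of click records."""
    totals: Dict[str, Dict[str, float]] = defaultdict(
        lambda: {"pages": 0, "load_ms": 0, "bytes": 0}
    )
    for click in clicks:
        t = totals[click.visitor_id]
        t["pages"] += 1
        t["load_ms"] += click.load_ms
        t["bytes"] += click.bytes_sent
    return {
        visitor: {
            "pages_browsed": t["pages"],
            "avg_load_ms": t["load_ms"] / t["pages"],
            "bytes_transmitted": t["bytes"],
        }
        for visitor, t in totals.items()
    }


# Made-up sample data for demonstration only.
sample = [
    Click("v1", "/home", 120, 2048),
    Click("v1", "/products", 180, 4096),
    Click("v2", "/home", 100, 2048),
]
summary = traffic_summary(sample)
```

Here `summary` holds, for each visitor, the three traffic metrics named in the text: pages browsed, load time and data transmitted. E-commerce analysis would build on the same records but relate them to outcomes such as completed purchases.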

Many other technologies have been developed to manage and analyse unstructured data from sources such as satellite images, radar data, and audio and video signals.

These concepts and technologies have existed for the past few years and perform their defined work very well within a limited scope. However, they are distinct from one another, and integrating them to obtain a holistic view of business insights from an organisation’s complete set of available data is nearly impossible.

What is New in the Big Data Approach?
The main objective in developing Big Data technology has been to incorporate all of the data management and analytics concepts mentioned above while keeping the Total Cost of Ownership (TCO) low. With the advent of open-source Big Data technologies, such as Hadoop and NoSQL stores like Cassandra, a new wave of data management has emerged. Big Data technologies make it possible to utilize all available data through an integrated system. Big Data can add structured, unstructured and/or semi-structured data captured from transaction, interaction and observation systems to the mix:

  1. Transactions Data: Historical transaction data from core business application systems, structured data at rest, and data that is captured but unused
  2. Interactions Data: Social media data – data from LinkedIn, Twitter, Facebook, etc., and Web content – clickstream, web logs, video, images, etc.
  3. Observations Data: “Internet of Things” – streaming data, and machine generated data – data generated from sensors, GPS, RFID, Mobile, etc.

A typical high-level Big Data solution architecture is shown below:

[Figure: BigData-Architecture]

Now a new question arises: is Big Data technology a replacement for a traditional data warehouse system? To answer this question, we need to understand the core concept of Hadoop technology. Hadoop is a distributed file management and analysis approach in which data is split into equal-sized flat files that are stored and processed on a distributed platform. It does not support updates or changes to existing records, which is an important process in a data warehouse system. Hence, at the current stage of the technology, Hadoop cannot be a replacement for a traditional data warehouse.
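
The storage model described above can be illustrated with a toy example. The sketch below is not Hadoop code and uses no Hadoop API; `WriteOnceStore`, its round-robin placement and the tiny block size are hypothetical, chosen only to show data being split into fixed-size pieces, spread across nodes, and never updated in place.

```python
from typing import Dict, List, Tuple

# Tiny block size for illustration; real distributed file systems use
# far larger blocks.
BLOCK_SIZE = 8  # bytes


def split_into_blocks(data: bytes, block_size: int = BLOCK_SIZE) -> List[bytes]:
    """Split data into fixed-size blocks; the last block may be shorter."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


class WriteOnceStore:
    """Hypothetical store: files can be created and read, never modified."""

    def __init__(self, node_count: int) -> None:
        # Each "node" is just an in-memory dict of block key -> block bytes.
        self.nodes: List[Dict[str, bytes]] = [{} for _ in range(node_count)]
        self.index: Dict[str, List[Tuple[int, str]]] = {}

    def create(self, name: str, data: bytes) -> None:
        if name in self.index:
            # No record-level update: an existing file cannot be changed.
            raise PermissionError(f"{name} already exists; updates are not supported")
        placements = []
        for i, block in enumerate(split_into_blocks(data)):
            node = i % len(self.nodes)  # simple round-robin placement
            key = f"{name}#{i}"
            self.nodes[node][key] = block
            placements.append((node, key))
        self.index[name] = placements

    def read(self, name: str) -> bytes:
        """Reassemble a file by reading its blocks back in order."""
        return b"".join(self.nodes[node][key] for node, key in self.index[name])
```

Changing one record in such a store means rewriting the whole file under a new name, which is why a workload built on frequent record updates, as in a data warehouse, fits this model poorly.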

Therefore, Big Data can be implemented as an extension of a traditional Data Warehouse to support real-time and unstructured data management and analysis. Several Business Intelligence (BI) and DW technology vendors have tried to integrate Hadoop with their DW technology so that structured and unstructured data stored in two different environments can be analysed seamlessly from a single user interface to produce integrated business analytics results.