Monday, 30 June 2014

Analyzing Big Data

Let us take a quick look at some of the ways in which big data is processed, maintained and analyzed to provide valuable insights.

A large volume of data has been available for over a decade, but far less of it was put to use than is today.  We have now started exploring both the data that was already available and the new data that arrives every few milliseconds (streaming data) to make valuable decisions.

Types of Data
Data can be categorized as structured, unstructured and semi-structured.
  1. Structured data is data that is in a pre-defined format.  Data stored in databases and spreadsheets are examples of structured data.  Structured data can be analyzed easily.
  2. Unstructured data refers to data that does not have a pre-defined format.  Sentences, texts, stories and pictures are all examples of unstructured data.  Text mining tools have to be used to extract information from data in unstructured format.
  3. Semi-structured data is data in which only some parts follow a pre-defined format.  Data in XML files (and other markup languages) is an example of semi-structured data.
Data Cleansing
Analysis performed on inaccurate, erroneous, or duplicate data is less valuable, because the errors carry over into the results.  Hence the available data has to be cleansed before analysis.

For example, suppose a person in India gives a 5-digit PIN code in his address.  Indian PIN codes have six digits, so the system has to immediately highlight that the data is inaccurate.
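
The PIN code check above can be sketched in a few lines of Python.  This is only a minimal illustration, not how any particular product does it.  It assumes a PIN code is exactly six digits and does not start with 0; the function names and the 'pin' field are made up for this example.

```python
import re

# Assumption: an Indian PIN code is exactly six digits and does not start with 0.
PIN_PATTERN = re.compile(r"[1-9][0-9]{5}")


def is_valid_pin(pin: str) -> bool:
    """Return True if the text has the format of a six-digit PIN code."""
    return PIN_PATTERN.fullmatch(pin.strip()) is not None


def find_invalid_pins(records):
    """Return the records whose (hypothetical) 'pin' field fails the format check."""
    return [r for r in records if not is_valid_pin(r.get("pin", ""))]


customers = [
    {"name": "A", "pin": "60001"},   # five digits, should be highlighted
    {"name": "B", "pin": "560034"},  # six digits, passes the format check
]
flagged = find_invalid_pins(customers)
```

A format check like this only catches malformed values.  Confirming that a well-formed PIN code actually exists would need a lookup against reference data.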

Maintaining a Golden Record for every entity
For every entity, 'a single version of true data' has to be stored.  Details of the operations that are performed on the data have to be stored, and a copy of the data before modification also needs to be maintained.

For example, a customer of a bank has a Savings account and a Fixed Deposit.  It is better to maintain one record of the customer, containing all his details (like name, date of birth, gender, address), than to keep two copies of this data.
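
As a rough sketch of this idea, the Python below keeps one record per customer, attaches every account to it, and saves a copy of the details before each change.  The class, function and field names (GoldenRecord, build_golden_records, customer_id and so on) are assumptions made for illustration and do not come from any product.

```python
from dataclasses import dataclass, field


@dataclass
class GoldenRecord:
    customer_id: str
    details: dict
    accounts: list = field(default_factory=list)
    history: list = field(default_factory=list)

    def update(self, changes: dict) -> None:
        # Keep a copy of the data before modification, plus the operation itself.
        self.history.append({"before": dict(self.details), "changes": dict(changes)})
        self.details.update(changes)


def build_golden_records(account_rows):
    """Collapse per-account rows into one record per customer."""
    records = {}
    for row in account_rows:
        cid = row["customer_id"]
        if cid not in records:
            records[cid] = GoldenRecord(cid, {"name": row["name"]})
        records[cid].accounts.append(row["account_type"])
    return records


rows = [
    {"customer_id": "C1", "name": "Customer One", "account_type": "Savings"},
    {"customer_id": "C1", "name": "Customer One", "account_type": "Fixed Deposit"},
]
golden = build_golden_records(rows)
golden["C1"].update({"address": "New address"})
```

Here the two account rows end up under a single customer record, and the history list still shows what the details looked like before the address was added.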

Analyzing Streaming Data
Data that arrives continuously and at high speed is known as streaming data.  An example that we might have noticed is the heart beat monitor attached to a patient.  Other examples include network signals and transactions over the internet.  In some cases, monitoring streaming data becomes very important.  Products that can ingest and analyze such data as it arrives can help when critical decisions have to be made using streaming data.
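
To make the heart beat example concrete, here is a small Python sketch that reads values one at a time and raises an alert when the average of the most recent readings goes out of range.  The window size and the limits of 50 and 120 are arbitrary values chosen for illustration, not medical guidance, and the function name is hypothetical.

```python
from collections import deque


def monitor_heart_rate(readings, window=5, low=50, high=120):
    """Yield (index, average) whenever the moving average leaves [low, high]."""
    recent = deque(maxlen=window)  # holds only the latest `window` readings
    for i, bpm in enumerate(readings):
        recent.append(bpm)
        if len(recent) == window:
            avg = sum(recent) / window
            if avg < low or avg > high:
                yield i, avg


# Made-up readings: a steady rate followed by a sudden rise.
stream = [80] * 5 + [150] * 5
alerts = list(monitor_heart_rate(stream))
```

Because the function is a generator, it never needs the whole stream in memory; it handles each reading as it comes in, which is the main point of stream processing.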

Data Integration and Governance
Software that can integrate data from multiple systems and provide a complete 360 degree view of each entity involved is required.  Also, managing data quality is important at each stage of the processing.

For example, let us again consider a bank customer who has a Home Loan of Rs. 50,00,000/-, a credit card balance of Rs. 60,000/- and a Savings account with a balance of less than Rs. 1,000/-.  If the bank can get a complete view of this customer and drill into his history records, it is easy for the bank manager to decide when the customer approaches for a car loan of Rs. 10,00,000/-.
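
A combined view like this can be sketched as below.  The three input lists stand in for separate systems (loans, credit cards, savings); their layout, the field names and the function customer_view are assumptions for illustration.  The savings balance of Rs. 900 is an assumed value, since the example only says it is below Rs. 1,000.

```python
def customer_view(customer_id, loans, cards, savings):
    """Combine rows from three (hypothetical) systems into one summary."""
    loan_total = sum(r["outstanding"] for r in loans if r["customer_id"] == customer_id)
    card_total = sum(r["balance"] for r in cards if r["customer_id"] == customer_id)
    savings_total = sum(r["balance"] for r in savings if r["customer_id"] == customer_id)
    return {
        "customer_id": customer_id,
        "total_debt": loan_total + card_total,
        "savings": savings_total,
    }


# Amounts in rupees, taken from the example above (savings balance assumed).
loans = [{"customer_id": "C1", "outstanding": 5000000}]
cards = [{"customer_id": "C1", "balance": 60000}]
savings = [{"customer_id": "C1", "balance": 900}]

view = customer_view("C1", loans, cards, savings)
```

The sketch only assembles the view.  Whether to approve the car loan remains a decision for the bank manager, who now sees the debt and the savings side by side.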

Data Exploration
Sometimes, data from outside the organization (e.g., the number of Likes on Facebook, or analysis done by government or third party agencies) also becomes crucial during analysis.  Software that helps to uncover value from data in both internal and external sources is a key component of big data analysis.

For example, a popular brand wants to compare its performance in various cities in a country.  It also wants to compare itself with its competitors based on the Likes on Facebook in those cities.

Predictive Analysis
Predictive Analysis makes use of historical data and current data to make predictions about the future.  This is one of the most frequently used analysis techniques.

For example, based on the number of enterprises that have started using big data and analysis, it is possible to predict the number of data scientists required after five years.
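
One very simple form of prediction is to fit a straight line through past values and extend it forward.  The Python sketch below does this with ordinary least squares.  It is only one of many possible techniques, and the yearly figures in it are made-up numbers used to show the mechanics, not real statistics.

```python
def fit_line(xs, ys):
    """Return (slope, intercept) of the least-squares line through the points."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predict(xs, ys, x_future):
    """Extend the fitted line to a future point."""
    slope, intercept = fit_line(xs, ys)
    return slope * x_future + intercept


# Made-up history: year number versus enterprises using big data.
years = [1, 2, 3, 4]
enterprises = [100, 200, 300, 400]

# Estimate for five years after the last observed year.
estimate = predict(years, enterprises, 9)
```

A straight line assumes that growth continues at the same pace, which rarely holds for long, so an estimate like this is best treated as a starting point.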

Some of the IBM products that provide these capabilities are listed below.
IBM InfoSphere Information Server
IBM InfoSphere Master Data Management
IBM Information Integration and Governance
IBM InfoSphere BigInsights
IBM Stream Computing
IBM InfoSphere Data Explorer
IBM SPSS Software
