The term ‘Big Data’ refers to a large volume of data. The scale is best understood by looking at the amount of data that certain companies produce and handle.
- Walmart handles up to 1 million customer transactions every hour.
- Facebook hosts 40 billion photos from its user base.
- Decoding the human genome originally took 10 years; it can now be done in one week.
To summarize, in a data-driven world, Big Data is a field that finds ways to analyse, extract and systematically process these huge volumes of data far more efficiently than conventional data-processing tools or software can. The power of Big Data is not limited to searching and analysing a dataset; it also covers sharing, transferring, visualizing, querying, updating and securing it.
Big Data is usually stored in a database and can be queried to analyse it and extract particular information. Big Data emerged in the market in the first decade of the 21st century, and firms like Google, eBay and LinkedIn were built on the concepts of Big Data from the beginning.
Key characteristics of Big Data
Big Data is characterized by three concepts:
- volume of data collected or stored
- speed with which it is collected
- variety of the information
The first characteristic is the enormous volume of data produced. For example, a typical PC might have 10 gigabytes of storage, while a Boeing 737 can generate 240 terabytes of flight data during a single flight across the US. Smartphones and sensors embedded in everyday objects constantly generate billions of new and updated data feeds containing environmental, location and other information.
The second characteristic is the speed at which data is captured. For example, clickstreams and ad impressions capture user behaviour at millions of events per second. Machine-to-machine processes exchange data between billions of devices. Sensors in everyday embedded devices can generate massive amounts of data in real time. Online games supporting millions of users simultaneously produce multiple inputs per second.
The third characteristic is the wide variety of data that is handled. Big Data can handle 3D data, audio and video, geospatial data, structured and unstructured data, log files and social media data.
Storing Big Data
Analysing the data characteristics
- Selecting data sources for analysis
- Eliminating redundant data
- Establishing the role of NoSQL
Overview of Big Data stores
- Data models like graph, key value, document, column family
- Hadoop Distributed File System
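The four data models listed above can be compared with a small sketch. This is only an illustration using plain Python dictionaries, not the API of any real data store; the employee record, the field names and the helper managers_of are all made up for the example.

```python
# One hypothetical employee record expressed in four NoSQL-style data models.
# Every name and value below is invented for illustration.

# Key-value: an opaque value looked up by a single key.
key_value_store = {
    "employee:101": '{"name": "Asha", "dept": "Sales"}',
}

# Document: a nested, self-describing structure that can be queried by field.
document_store = {
    "employees": [
        {"_id": 101, "name": "Asha", "dept": "Sales", "skills": ["sql", "excel"]},
    ]
}

# Column family: rows keyed by ID, with columns grouped into families.
column_family_store = {
    101: {
        "personal": {"name": "Asha"},
        "work": {"dept": "Sales", "manager": "102"},
    }
}

# Graph: nodes plus labelled edges that describe relationships.
graph_store = {
    "nodes": {101: {"name": "Asha"}, 102: {"name": "Ravi"}},
    "edges": [(101, "REPORTS_TO", 102)],
}


def managers_of(graph, employee_id):
    """Follow REPORTS_TO edges that start at the given employee node."""
    return [
        graph["nodes"][dst]["name"]
        for src, label, dst in graph["edges"]
        if src == employee_id and label == "REPORTS_TO"
    ]


print(managers_of(graph_store, 101))
```

The point of the comparison is that the same information supports different access patterns: a key-value store answers lookups by key, while a graph model makes relationship queries such as "who does this person report to" straightforward.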
Selecting Big Data Stores
- Choosing the correct data store based on the characteristics of the data collected
- Moving the code to data
- Implementing polyglot data store solutions
- Aligning business goals with the appropriate data store
Processing Big Data
Integrating disparate data stores
- Mapping data to the programming framework
- Connecting and extracting data from storage
- Transforming data for processing
- Subdividing data in preparation for Hadoop MapReduce
Employing Hadoop MapReduce
- Creating the components of Hadoop MapReduce jobs
- Distributing data processing across server farms
- Executing Hadoop MapReduce jobs
- Monitoring the progress of job flows
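The MapReduce flow outlined above (subdividing the data, mapping, grouping and reducing) can be sketched in a few lines of plain Python. This is an in-process simulation of the idea, not Hadoop's API: the function names are hypothetical, and on a real cluster each chunk would be processed on a different machine.

```python
from collections import defaultdict


def split_input(lines, num_splits):
    """Subdivide the input into chunks, one per (simulated) mapper."""
    return [lines[i::num_splits] for i in range(num_splits)]


def map_phase(chunk):
    """Emit a (word, 1) pair for every word in the chunk."""
    return [(word.lower(), 1) for line in chunk for word in line.split()]


def shuffle(mapped_outputs):
    """Group all emitted values by key, as the framework does between phases."""
    groups = defaultdict(list)
    for pairs in mapped_outputs:
        for key, value in pairs:
            groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Combine the values of each key into a single count."""
    return {key: sum(values) for key, values in groups.items()}


def word_count(lines, num_splits=2):
    chunks = split_input(lines, num_splits)
    # On a cluster, these map calls would run in parallel on separate nodes.
    mapped = [map_phase(chunk) for chunk in chunks]
    return reduce_phase(shuffle(mapped))


sample = ["big data is big", "data is everywhere"]
print(word_count(sample))
```

Because each mapper only sees its own chunk and each reducer only sees the values for its own keys, the work can be spread across many servers without the steps needing to share state.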
Types of data in Big Data
When the volume of data flowing in is enormous, the structure of the data becomes crucial to processing it meaningfully. All data goes through a process called ETL (Extract, Transform and Load) before it is analysed. Data can be structured, unstructured or semi-structured.
Structured data is highly organized and has a particular format. Its dimensions are defined by a set of parameters. Structured data can be readily stored in and accessed from a database using simple search engine algorithms. For example, a spreadsheet containing employee details such as name, salary, employee ID and address is stored in a tabular format. The first step in processing structured data is ‘cleaning the data’, which narrows it down to only the relevant points. Structured enterprise data can be easily merged with relational data, and very little preparation is needed to make the raw data compatible.
Unstructured data does not conform to any fixed structure, so analysing and processing it is tedious and time-consuming. Emails, for example, are an unstructured form of Big Data. It is estimated that no more than 20% of data is structured. For unstructured data, the ETL process is not as simple as it is for structured data: the data is first cleansed and then transformed into some sort of structured form. This can be achieved with techniques like text parsing, natural language processing and developing content hierarchies via taxonomy. It is a complex process that blends scanning, interpreting and contextualising functions.
Semi-structured data is basically unstructured data that has metadata attached to it. For example, when a photo is taken with a smartphone, the image is automatically logged with the device time at the time of capture and the device ID. Another example is email: whenever an email is sent, it is logged with the to and from addresses, device time, IP address and other information.
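As a rough illustration of ETL on structured data, the sketch below extracts rows from a small employee table, cleans them and loads them into a relational table. It uses only Python's standard csv and sqlite3 modules; the sample rows, the table name and the cleaning rules (drop duplicates and rows without a salary) are assumptions made for the example.

```python
import csv
import io
import sqlite3

# Hypothetical spreadsheet export; note the duplicate row and the missing salary.
RAW_CSV = """employee_id,name,salary,address
101,Asha ,52000,Chennai
102,Ravi,,Mumbai
101,Asha ,52000,Chennai
103,Meena,61000,Delhi
"""


def extract(text):
    """Extract: read the tabular text into a list of dictionaries."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows):
    """Transform: drop duplicates and incomplete rows, fix types and whitespace."""
    seen = set()
    cleaned = []
    for row in rows:
        emp_id = int(row["employee_id"])
        if emp_id in seen or not row["salary"]:
            continue  # skip repeated IDs and rows with no salary
        seen.add(emp_id)
        cleaned.append(
            (emp_id, row["name"].strip(), int(row["salary"]), row["address"].strip())
        )
    return cleaned


def load(rows, conn):
    """Load: write the cleaned rows into a relational table."""
    conn.execute(
        "CREATE TABLE employees ("
        "employee_id INTEGER PRIMARY KEY, name TEXT, salary INTEGER, address TEXT)"
    )
    conn.executemany("INSERT INTO employees VALUES (?, ?, ?, ?)", rows)
    conn.commit()


conn = sqlite3.connect(":memory:")
load(transform(extract(RAW_CSV)), conn)
print(conn.execute("SELECT name, salary FROM employees ORDER BY employee_id").fetchall())
```

Because the data already has a fixed set of columns, the transform step stays small: most of the work is filtering and type conversion rather than interpretation.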
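Text parsing, the simplest of these techniques, can be sketched with regular expressions. The example below pulls a few fields out of a free-text note and returns them as a structured record. The note, the field names and the patterns are hypothetical, and real unstructured data would need far more robust handling than this.

```python
import re

# Hypothetical free-text message with no fixed layout.
NOTE = """Hi team, the order #4521 shipped on 2023-03-14.
Customer paid 1,250.50 USD and can be reached at asha@example.com.
"""

# One illustrative pattern per field we want to recover from the text.
PATTERNS = {
    "order_id": re.compile(r"#(\d+)"),
    "date": re.compile(r"\b(\d{4}-\d{2}-\d{2})\b"),
    "amount": re.compile(r"\b(\d[\d,]*\.\d{2})\s*USD\b"),
    "email": re.compile(r"\b([\w.+-]+@[\w-]+(?:\.[\w-]+)+)"),
}


def parse_free_text(text):
    """Return a structured record; fields that are not found are set to None."""
    record = {}
    for field, pattern in PATTERNS.items():
        match = pattern.search(text)
        record[field] = match.group(1) if match else None
    return record


print(parse_free_text(NOTE))
```

Each pattern encodes an assumption about how a field appears in the text, which is one reason this step takes more effort than it does for structured data.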
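The email case can be illustrated with Python's standard email module, which separates the metadata (headers) from the unstructured body. The message below is made up for the example, and split_email is a hypothetical helper name.

```python
from email import message_from_string

# Hypothetical raw email: structured headers followed by a free-text body.
RAW_EMAIL = """From: asha@example.com
To: ravi@example.com
Date: Tue, 14 Mar 2023 10:15:00 +0530
Subject: Quarterly numbers

Hi Ravi, the numbers look good this quarter. Let's talk tomorrow.
"""


def split_email(raw):
    """Separate the queryable metadata from the unstructured body text."""
    message = message_from_string(raw)
    metadata = {
        "from": message["From"],
        "to": message["To"],
        "date": message["Date"],
        "subject": message["Subject"],
    }
    body = message.get_payload()  # plain string for a non-multipart message
    return metadata, body


metadata, body = split_email(RAW_EMAIL)
print(metadata)
print(body)
```

The metadata can be filtered and sorted like structured data, while the body still needs techniques such as text parsing before it can be analysed.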
Advantages of Big Data
- Big Data is not just a way to store petabytes and exabytes of data; it also provides the ability to make better decisions and take meaningful actions at the right time.
- Technologies like Hadoop give the flexibility to store data without knowing in advance how that data will be processed.
- Technologies like Hive, MapReduce and Impala give the power to run queries without changing the structure of the underlying data.
- Big Data can be used to target customer-centric outcomes and build a better information ecosystem.
- It is one of the reasons for the Internet and social media boom.
M.Tech (VLSI Design and Embedded system)
BS Abdur Rahman University