How to store huge data that is beyond our capacity?

Sathvika Kolisetty
6 min read · Sep 17, 2020

INTRODUCTION

Everything was going fine before the Internet, but as Internet use grew, technology companies like Google and Facebook started facing a problem: the number of users is increasing day by day, and so is their data. There are approximately 4.57 billion Internet users in the world, and almost 346 million new users joined in a single year.

WHAT ARE THE ISSUES FACED?

Any entry made by a user and stored in a database is data, and industries can use that data for commercial purposes. The issue is that the amount of data is growing exponentially day by day, which raises several questions:

Where to store the data?
Once stored, how to process it?
How to retrieve data faster?
How to store and retrieve data in real time?
How to find raw data for the industry?
How to manage untapped data?

HOW IS DATA INCREASING?

  1. SOCIAL MEDIA: Social media is where people connect with each other online and share their emotions and journeys through images, audio, video, etc. Social media is one of the major sources of Big Data. Instagram, Facebook, and WhatsApp collect a lot of data, such as personal details, pictures, likes, and reactions.
  • FACEBOOK: Facebook is a social media platform that had almost 2.7 billion active users as of the second quarter of 2020. Facebook generates 4 petabytes of data per day. People can chat and upload images, videos, etc. on Facebook.
  2. GOOGLE: Google is a search engine with 4 billion users. It processes 3.5 billion searches per day, which works out to about 40,000 searches per second on average. Google processes approximately 20 petabytes of data per day through an average of 100,000 MapReduce jobs spread across its massive computing clusters.
  3. INTERNET OF THINGS (IoT): IoT connects devices and makes them smarter. Nowadays we have smart air conditioners, smart rooms, etc. IoT devices generate a humongous amount of data. It is estimated that by 2025 there will be 41.6 billion connected IoT devices generating data.
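
The per-second figures follow from the daily totals quoted above by simple division. A minimal sketch of that arithmetic (the helper name `per_second` is only illustrative, and a petabyte is assumed to be 10^15 bytes):

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400 seconds


def per_second(per_day: float) -> float:
    """Convert a daily total into an average per-second rate."""
    return per_day / SECONDS_PER_DAY


# 3.5 billion Google searches per day (figure from the article).
searches_per_second = per_second(3.5e9)

# 4 petabytes per day on Facebook, expressed in gigabytes per second.
# Assumption: 1 PB = 10**15 bytes and 1 GB = 10**9 bytes.
facebook_gb_per_second = per_second(4e15) / 1e9

print(f"{searches_per_second:,.0f} searches per second")
print(f"{facebook_gb_per_second:.1f} GB per second")
```

The first line comes out at roughly 40,500, which matches the rounded "40,000 searches per second" figure.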

Many other sources also contribute to the growth of data.

WHAT IS BIG DATA?

Big Data is a problem: a tsunami of data that is increasing exponentially day by day. Examples of big data come from science, astronomy, sensor networks, medical records, social data, etc.

Problems with big data:

  1. Huge volumes
  2. Data in different types and formats
  3. Impact on the business

CHALLENGES

  1. STORING THE DATA: Data is arriving in huge volumes, and where to store it is a big issue. Storing such a huge amount of data in a traditional system is not possible. Buying one expensive machine with huge storage capacity is not a good idea either, because it raises other issues. For example, if we have a 500 MB file but only 200 MB of storage left, what do we do?
  2. VARIOUS FORMATS OF DATA: Earlier, we used to store data in relational databases, but currently about 80% of data is unstructured. There are now different types of data:

A. STRUCTURED DATA
B. UNSTRUCTURED DATA
C. SEMI-STRUCTURED DATA

It's hard to handle this data in a traditional manner.

  3. PROCESSING DATA FASTER: Let's take an example. We have a 100 MB hard disk and store data on it. More data arrives, so we increase its size from 100 MB to 500 MB; still more data arrives, and we keep increasing the size again and again. Now all the data is stored, but have you thought about how we are going to retrieve or process it? Although CPU speed, RAM, and disk capacity have improved a lot, disk read/write speed has not kept pace: for the last 7–10 years it has stayed at around 80 MB/s.

These are the problems industries face when their data turns into Big Data.
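
To see why disk speed becomes the bottleneck, here is a rough back-of-the-envelope sketch. It uses the 80 MB/s figure from the text and assumes, purely for illustration, a 1 TB dataset and 100 disks that can be read fully in parallel; the function name `read_time_seconds` is made up for this example.

```python
def read_time_seconds(data_mb: float, disk_mb_per_s: float = 80.0, disks: int = 1) -> float:
    """Time to read data_mb when it is spread evenly over `disks` disks read in parallel."""
    if disks < 1:
        raise ValueError("disks must be at least 1")
    return data_mb / (disk_mb_per_s * disks)


ONE_TB_IN_MB = 1_000_000  # assumption: decimal units, 1 TB = 1,000,000 MB

single_disk = read_time_seconds(ONE_TB_IN_MB)              # one big disk
hundred_disks = read_time_seconds(ONE_TB_IN_MB, disks=100)  # data split over 100 disks

print(f"1 disk:    {single_disk:,.0f} s (about {single_disk / 3600:.1f} hours)")
print(f"100 disks: {hundred_disks:,.0f} s (about {hundred_disks / 60:.1f} minutes)")
```

Under these assumptions a single disk needs 12,500 seconds to read the data, while 100 disks working in parallel need 125 seconds. Real clusters add network and coordination overhead, so the speed-up is not this clean in practice.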

TYPES OF BIG DATA

A. STRUCTURED DATA

Structured data is data organized in rows and columns, as in a relational database.

Examples: stock information, credit card details, hospital medical records, bank records, etc.

Facebook developed its own SQL-based query language for handling Big Data, known as Hive Query Language (HiveQL).

B. UNSTRUCTURED DATA

Unstructured data includes images, audio, video, etc. Almost 80% of data is unstructured, and much of it is generated by social media.

C. SEMI-STRUCTURED DATA

JSON, XML, CSV files, tab-delimited files, log files, etc. are semi-structured data.

Log files record what happens from the moment we log in to an application until we log out. On Facebook, for example, when we log in, what activity we perform, and when we log out are all stored in a log file.
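
The difference between structured and semi-structured data is easier to see side by side. The sketch below is only an illustration: the table, the user name, and the log lines are made-up sample data, with an in-memory SQLite table standing in for a relational database and JSON lines standing in for an application log.

```python
import json
import sqlite3

# Structured: a relational table with a fixed schema (hypothetical bank records).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (account_id INTEGER, balance INTEGER)")
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [(101, 2500), (102, 900)])
total_balance = conn.execute("SELECT SUM(balance) FROM accounts").fetchone()[0]

# Semi-structured: JSON log lines whose fields vary from record to record.
log_lines = [
    '{"user": "asha", "event": "login"}',
    '{"user": "asha", "event": "post", "media": "photo"}',
    '{"user": "asha", "event": "logout"}',
]
events = [json.loads(line) for line in log_lines]

# .get() tolerates records that lack the optional "media" field.
media_events = [event["event"] for event in events if event.get("media")]
fields_seen = sorted({key for event in events for key in event})

print(total_balance)
print(media_events)
print(fields_seen)
```

Every row of the table has exactly the same columns, so it can be queried with SQL directly. The log records share some keys but not all, so the reading code has to cope with missing fields.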

CHARACTERISTICS OF BIG DATA

Big Data is defined by three important characteristics:

  1. Volume
  2. Velocity
  3. Variety

Volume

Data is generated from many sources:

  • Hospitals keeping records of all patients, doctors, nurses, medical staff, etc.
  • Social media
  • Google Drive, Dropbox
  • Organizations, etc.

The amount of data generated is called the volume, or size, of the data.

Velocity

Velocity is the speed at which data comes in and goes out.

Example: if I make a post on LinkedIn, velocity is how fast it is stored, processed, and retrieved by others.

Variety

Data comes in many formats, such as CSV, Excel, video, audio, text, PDF, photos, etc. This is called the variety of data.

DISTRIBUTED STORAGE

Distributed storage means that when a file can't be stored on one PC, we split the file and store the pieces on different PCs.

Let's understand with an example: we have a 100 MB file but only 50 MB of storage, so we can't store it as is. Rather than storing it by vertical scaling, we can store it in a horizontally scaled manner.
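
The splitting idea can be sketched in a few lines. This is a toy illustration, not how any particular system does it: 100 bytes stand in for the 100 MB file, the node names are made up, and chunks are handed out in simple round-robin order.

```python
def split_into_chunks(data: bytes, chunk_size: int) -> list[bytes]:
    """Cut data into consecutive pieces of at most chunk_size bytes."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    return [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]


def distribute(chunks: list[bytes], node_names: list[str]) -> dict[str, list[tuple[int, bytes]]]:
    """Assign chunks to nodes round-robin, remembering each chunk's position."""
    placement: dict[str, list[tuple[int, bytes]]] = {name: [] for name in node_names}
    for index, chunk in enumerate(chunks):
        placement[node_names[index % len(node_names)]].append((index, chunk))
    return placement


def reassemble(placement: dict[str, list[tuple[int, bytes]]]) -> bytes:
    """Collect the chunks from every node and join them in their original order."""
    indexed = [item for stored in placement.values() for item in stored]
    return b"".join(chunk for _, chunk in sorted(indexed, key=lambda item: item[0]))


file_data = bytes(range(100))                # stands in for the 100 MB file
chunks = split_into_chunks(file_data, 50)    # each node can hold only 50
placement = distribute(chunks, ["pc1", "pc2"])
restored = reassemble(placement)

print({name: [index for index, _ in stored] for name, stored in placement.items()})
print(restored == file_data)
```

Neither node could hold the whole file, but each holds one half, and keeping the chunk index alongside the data is what makes it possible to put the file back together.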

VERTICAL SCALING (SCALE UP): Add more storage to the same machine. This stores the data, but it increases the read/write (input/output) time when retrieving or processing the data.

HORIZONTAL SCALING (SCALE OUT): Add more PCs rather than adding storage to one. The advantage of horizontal scaling is that it not only stores the data but also retrieves and processes it faster, which is good for industries.

SOLUTION TO BIG DATA

The solution to Big Data was given by Doug Cutting, and it is Hadoop. Hadoop got its name from his son's toy elephant, which was called Hadoop.

Hadoop is a framework written in Java.

Hadoop stores and processes data in a distributed and parallel manner.

TWO MAIN COMPONENTS OF HADOOP ARE:

1. HDFS (Hadoop Distributed File System) -> for distributed storage
2. MapReduce -> for parallel processing
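
To give a feel for the MapReduce idea, here is a minimal single-process sketch of a word count in plain Python. It is not Hadoop's Java API: the function names `map_phase`, `shuffle`, and `reduce_phase` are illustrative, and on a real cluster the map and reduce steps would run in parallel on different machines.

```python
from collections import defaultdict


def map_phase(line: str) -> list[tuple[str, int]]:
    """Map: emit a (word, 1) pair for every word in one line of input."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs: list[tuple[str, int]]) -> dict[str, list[int]]:
    """Shuffle: group all emitted values by their key."""
    grouped: dict[str, list[int]] = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key: str, values: list[int]) -> tuple[str, int]:
    """Reduce: combine all values for one key into a single result."""
    return key, sum(values)


lines = ["big data is big", "hadoop stores big data"]  # made-up sample input

mapped = [pair for line in lines for pair in map_phase(line)]
counts = dict(reduce_phase(key, values) for key, values in shuffle(mapped).items())

print(counts)
```

Because each line is mapped independently and each key is reduced independently, both phases can be spread across many machines, which is what makes the model suitable for parallel processing.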

HADOOP ARCHITECTURE

MASTER/SLAVE ARCHITECTURE

The NameNode is the master and the DataNodes are the slaves.

The NameNode runs on expensive hardware and stores metadata.

DataNodes run on commodity hardware and store the files, split into blocks and replicated according to the replication factor.
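
The division of labour between the NameNode and the DataNodes can be sketched as a toy model. Everything here is a simplifying assumption for illustration: the class and method names are made up, the block size is 4 bytes, replicas are placed by simply rotating through the nodes, and the NameNode object copies the data itself (in real HDFS, clients send blocks to DataNodes directly and placement is more sophisticated).

```python
class DataNode:
    """Stores the actual block contents (commodity hardware in the article)."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.blocks: dict[str, bytes] = {}


class NameNode:
    """Stores only metadata: which DataNodes hold each block of each file."""

    def __init__(self, data_nodes: list[DataNode], block_size: int, replication: int) -> None:
        if block_size <= 0:
            raise ValueError("block_size must be positive")
        if not 1 <= replication <= len(data_nodes):
            raise ValueError("replication must be between 1 and the number of DataNodes")
        self.data_nodes = data_nodes
        self.block_size = block_size
        self.replication = replication
        self.metadata: dict[str, list[tuple[str, list[str]]]] = {}

    def put(self, filename: str, data: bytes) -> None:
        """Split data into blocks and store each block on `replication` nodes."""
        entries = []
        for index, offset in enumerate(range(0, len(data), self.block_size)):
            block_id = f"{filename}#block{index}"
            # Toy placement: rotate through the nodes so replicas land on different ones.
            targets = [
                self.data_nodes[(index + r) % len(self.data_nodes)]
                for r in range(self.replication)
            ]
            for node in targets:
                node.blocks[block_id] = data[offset:offset + self.block_size]
            entries.append((block_id, [node.name for node in targets]))
        self.metadata[filename] = entries

    def get(self, filename: str, failed: frozenset[str] = frozenset()) -> bytes:
        """Rebuild a file, skipping any DataNodes listed in `failed`."""
        by_name = {node.name: node for node in self.data_nodes}
        parts = []
        for block_id, holders in self.metadata[filename]:
            live = [name for name in holders if name not in failed]
            if not live:
                raise OSError(f"all replicas of {block_id} are unavailable")
            parts.append(by_name[live[0]].blocks[block_id])
        return b"".join(parts)


nodes = [DataNode("dn1"), DataNode("dn2"), DataNode("dn3")]
name_node = NameNode(nodes, block_size=4, replication=2)
payload = b"hello hadoop"
name_node.put("demo.txt", payload)

restored = name_node.get("demo.txt")
restored_after_failure = name_node.get("demo.txt", failed=frozenset({"dn2"}))
print(name_node.metadata["demo.txt"])
print(restored == payload, restored_after_failure == payload)
```

With a replication factor of 2, every block lives on two different DataNodes, so the file can still be read in full when one node is down. The NameNode itself never needs to hold file contents, only the map of where the blocks are.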

