What is Big Data? How Do Tech Giants Process Big Data?

Pradeep Kumar
3 min read · Sep 17, 2020

Hey Everyone….

In this blog you will gain an intuitive understanding of Big Data, distributed storage, and Hadoop, and learn how tech giants such as Google, Facebook, Instagram, and Amazon process Big Data.

What is Big Data?

In today's technology-driven world, data is everything. Data is a piece of information that has to be processed, stored, manipulated, and retrieved as per the user's requests. Big Data is simply a collection of data whose volume is much larger than the storage or processing capacity available on a single machine. Most companies struggle to process the huge amounts of data generated daily; Facebook, for example, has revealed that it receives around 500 TB of new data per day to process.

How do companies overcome Big Data?

Facebook has revealed that its system processes 2.5 billion pieces of content and 500+ TB of data each day. It processes 2.7 billion Like actions and 300 million photo uploads daily, and scans roughly 105 TB of data every half hour.

Google receives nearly 40,000 web searches per second, and that total keeps growing as more and more people gain access to the internet.

Tech giants like these overcome Big Data by using cloud computing technology to store and process such huge volumes of data. Concepts like distributed storage, distributed computing, and networking are deployed to overcome Big Data.

Distributed Storage :

Distributed storage is a cluster of interconnected systems that acts as one single system, pooling its hardware resources for processing Big Data. For example, if a master system with 10 GB of storage is connected to 4 slave systems of 10 GB each, the master can provide 50 GB of storage as one single unit. One software that implements this concept is Hadoop; GlusterFS, Amazon S3, Ceph, etc. can also be used.
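As a toy illustration of that capacity arithmetic (not a real storage system), here is a minimal Java sketch; the node names and sizes simply mirror the example above.

```java
// Minimal sketch (hypothetical, not a real storage system): a master
// that pools the capacity of its slave nodes into one logical unit.
import java.util.List;

public class StorageCluster {

    record Node(String name, int capacityGb) {}

    public static void main(String[] args) {
        // One master with 10 GB plus four slaves of 10 GB each,
        // matching the example in the paragraph above.
        Node master = new Node("master", 10);
        List<Node> slaves = List.of(
                new Node("slave-1", 10),
                new Node("slave-2", 10),
                new Node("slave-3", 10),
                new Node("slave-4", 10));

        int totalGb = master.capacityGb()
                + slaves.stream().mapToInt(Node::capacityGb).sum();

        // Prints: Cluster exposes 50 GB as one single storage unit
        System.out.println("Cluster exposes " + totalGb
                + " GB as one single storage unit");
    }
}
```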

Hadoop :

Hadoop is open-source software developed by the Apache Software Foundation to achieve distributed storage and processing of Big Data using the MapReduce programming model (a word-count example is sketched below). Hadoop is written in Java, so it requires a JDK on every system it runs on. The Hadoop Distributed File System (HDFS) is Hadoop's storage layer: it splits files into blocks and distributes them across the cluster. In this topology there is one master machine, called the NameNode, and many slave machines, called DataNodes.
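To make this concrete, here is a hedged sketch of how a client might write and read a file with Hadoop's Java FileSystem API. The NameNode address (hdfs://namenode:9000) and the file path are placeholder assumptions; with the Hadoop client libraries on the classpath, this would run against a real cluster.

```java
// Sketch of writing and reading a file through HDFS's Java API.
// The NameNode URI below is an assumed placeholder.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

public class HdfsHello {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Assumed NameNode address; replace with your cluster's.
        conf.set("fs.defaultFS", "hdfs://namenode:9000");

        try (FileSystem fs = FileSystem.get(conf)) {
            Path file = new Path("/tmp/hello.txt");

            // The client asks the NameNode for metadata, then streams
            // the actual blocks to/from DataNodes behind the scenes.
            try (FSDataOutputStream out = fs.create(file, true)) {
                out.write("Hello, HDFS!".getBytes(StandardCharsets.UTF_8));
            }

            try (FSDataInputStream in = fs.open(file);
                 BufferedReader reader = new BufferedReader(
                         new InputStreamReader(in, StandardCharsets.UTF_8))) {
                System.out.println(reader.readLine()); // Hello, HDFS!
            }
        }
    }
}
```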

HDFS Architecture :

  • NameNode : A daemon running on the master machine. It stores the directory tree of all files in the file system and holds the metadata for the blocks of data stored on the DataNodes.
  • DataNode : The data is actually stored on these nodes in blocks. These nodes send regular reports on their blocks back to the NameNode.
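To see the MapReduce model from the Hadoop section in action, here is the canonical word-count job, essentially the example shipped with the Hadoop documentation: mappers emit a (word, 1) pair for every word in their input split, and reducers sum those counts per word. The input and output paths are supplied on the command line.

```java
// The canonical MapReduce example: counting words.
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    public static class TokenizerMapper
            extends Mapper<Object, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private final Text word = new Text();

        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, one); // emit (word, 1)
            }
        }
    }

    public static class IntSumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        public void reduce(Text key, Iterable<IntWritable> values,
                           Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result); // emit (word, total)
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        // Input/output HDFS paths are supplied on the command line.
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

The map tasks run on the DataNodes close to the blocks they read, which is what lets Hadoop move the computation to the data instead of the data to the computation.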

Facebook has the world's largest Hadoop cluster, which it uses for data warehousing and data analytics.
