Big Data

Hey Everyone,
Today I'm going to share an interesting topic I've just learned about. Ever wondered where your data gets stored? I'm not talking about your internal storage or hard disk. Think bigger! How do Facebook, Google, Amazon, Netflix, and the other giants of the IT industry store their data? They generate insanely large amounts of it, on the order of petabytes every single day, because millions of users rely on their services around the clock. All of that data has to be handled within milliseconds to provide an efficient service.

Let's start on a creepy note. Many of you Android users might be using "OK Google", your own personal assistant; you can even unlock your phone just by saying those two magical words. But how does this work without any touch input? Well, the answer will give you chills! To catch those words, your phone has to keep listening to every single thing said around it, whether it's a small chat with your relatives and friends or you watching your favourite programme. EVERY single thing.

Now, while you let that sink in, I'll move on to the big question.

What is Big Data?

The data we process today is huge, arrives at very high speed, comes from lots of sources, and is largely unstructured, with huge variety. Before the concept of big data, we were mostly dealing with structured data, which was pretty easy to manage and process: pick a suitable database, apply some simple algorithms, and the task is accomplished. In the modern digital age, there is an insanely huge amount of unstructured data to be processed. For example, Facebook's data runs to hundreds of terabytes; Instagram, Twitter and Google+ generate hundreds of terabytes more, and new video is being uploaded as you read each sentence of this article. Netflix is streamed by millions of users at once, amounting to roughly 300,000 hours of video per minute. This mix of audio, video, pictures and text has no common structure, which is why it is termed unstructured data.

Now we can get an idea of how much data is being generated on a daily basis. Another example I've learned about is cross-country flights, which carry lots of sensors and generate lots of sensor data in transit. Each engine generates about 20 TB of data per hour. So, assuming roughly 30,000 twin-engine flights fly for 5 hours each day, that works out to about 6 million terabytes a day, or almost 2 billion terabytes (around 2 zettabytes) of engine data per year.
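
Here's a quick back-of-the-envelope check of that figure; the flight count, flight hours and per-engine rate above are rough assumptions, not official numbers:

```python
# Back-of-the-envelope estimate of yearly jet-engine sensor data.
# All inputs are the rough assumptions from the paragraph above.
flights_per_day = 30_000        # twin-engine flights per day
engines_per_flight = 2
tb_per_engine_hour = 20         # TB of sensor data per engine per hour
hours_per_flight = 5

tb_per_day = flights_per_day * engines_per_flight * tb_per_engine_hour * hours_per_flight
tb_per_year = tb_per_day * 365

print(f"{tb_per_day:,} TB per day")    # 6,000,000 TB per day (6 exabytes)
print(f"{tb_per_year:,} TB per year")  # 2,190,000,000 TB per year (~2.2 zettabytes)
```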

So, how is all this data processed and stored?

So now we know how the data is generated, and it has to be processed within a short span of time; no one wants their Instagram or Facebook profile to take forever to load. The data is generated by humans or by machines in industry, like sensor data, and each piece of data carries a certain level of privacy. IBM classifies big data along four dimensions, known as the 4 V's.

Volume
The sheer amount of data being generated, anywhere from bytes up to exabytes.

Velocity
The speed at which the data is generated.

Variety
The data may be structured, unstructured, or a mix of both.

Veracity
The data may contain anomalies or ambiguity; veracity refers to the accuracy and trustworthiness of the data.

Storage devices are cheaper than ever and processing speed is skyrocketing, with chips like Intel's i9 available in commercial PCs. The real challenge is moving data through input and output channels, pretty much like using your pen drive to store movies. Even with today's processing speeds, no single supercomputer can handle this much data. To see why, consider a computer with a transfer speed of 1 GB/s: moving 100 TB of data would take around 27 hours. So we use a distributed system instead.

In a distributed file system we use, say, 100 computers to analyse and process the data in parallel, which cuts the same job from about 27 hours down to roughly 16 minutes. So the solution is distributing the data across lots of computers, and this is exactly the approach Hadoop takes.
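
A tiny sanity check of those numbers; the 1 GB/s transfer rate and the 100-machine cluster are illustrative assumptions:

```python
# Serial vs. parallel transfer time for 100 TB at 1 GB/s per machine.
data_tb = 100
rate_gb_per_s = 1          # per-machine transfer rate (assumed)
machines = 100             # size of the cluster (assumed)

serial_seconds = data_tb * 1000 / rate_gb_per_s   # one machine does it all
parallel_seconds = serial_seconds / machines      # each machine handles 1/100th

print(f"serial:   {serial_seconds / 3600:.1f} hours")    # ~27.8 hours
print(f"parallel: {parallel_seconds / 60:.1f} minutes")  # ~16.7 minutes
```

Of course this ignores coordination overhead, but it shows why splitting the data across machines wins.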

Hadoop?

Hadoop is open-source big data analytics software from Apache that uses a distributed file system to manage large amounts of data. They literally have an elephant in their logo. It is used by several companies to manage their data; an example is LinkedIn, which generates over 100 billion personalized recommendations. The ads you see are customized to your needs by big data analytics software like Hadoop.
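
The processing half of Hadoop is the MapReduce model: a mapper turns raw input into key-value pairs, and a reducer combines the values for each key. Here is a minimal word-count sketch in that style; the single-script layout and in-memory shuffle are my own simplification for illustration, not Hadoop's actual internals:

```python
import sys
from itertools import groupby

def mapper(lines):
    """Emit (word, 1) for every word; Hadoop runs this step on each data node."""
    for line in lines:
        for word in line.strip().lower().split():
            yield word, 1

def reducer(pairs):
    """Sum the counts for each word; pairs arrive sorted by key."""
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        yield word, sum(count for _, count in group)

if __name__ == "__main__":
    # Simulate the shuffle-and-sort step Hadoop performs between map and reduce.
    mapped = sorted(mapper(sys.stdin))
    for word, total in reducer(mapped):
        print(f"{word}\t{total}")
```

Try it with `echo "big data big ideas" | python wordcount.py`. On a real cluster the mapper and reducer run on different machines, with Hadoop handling the sorting and data movement in between.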

Hadoop works by storing data in clusters, not in a single location. Now hang on with me, it's about to get technical. Two kinds of nodes are responsible for storing data: the master node, also called the NameNode, and the DataNodes. The NameNode doesn't hold the data itself; it holds the metadata describing where and what data is stored, while the DataNodes hold the actual blocks. To move or write data, a client first contacts the NameNode, which tells it which DataNodes should receive each block. Since DataNodes are prone to failure, each block is stored on multiple DataNodes rather than just one, for safety. This is how data is read and written, and the whole scheme is called HDFS (the Hadoop Distributed File System).
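
Here is a toy sketch of that read/write flow, a plain-Python simulation I wrote to mirror the description above; real HDFS is far more involved, but it does default to 3 replicas per block:

```python
import random

BLOCK_SIZE = 4   # bytes per block (tiny, just for the demo; real HDFS uses 128 MB)
REPLICAS = 3     # HDFS's default replication factor

datanodes = {f"datanode-{i}": {} for i in range(5)}  # node name -> {block_id: bytes}
namenode = {}    # file name -> list of (block_id, [nodes holding a replica])

def write(file_name, data):
    """Client asks the NameNode where to put each block, then stores replicas."""
    blocks = []
    for i in range(0, len(data), BLOCK_SIZE):
        block_id = f"{file_name}#{i // BLOCK_SIZE}"
        targets = random.sample(list(datanodes), REPLICAS)  # NameNode picks nodes
        for node in targets:
            datanodes[node][block_id] = data[i:i + BLOCK_SIZE]
        blocks.append((block_id, targets))
    namenode[file_name] = blocks        # NameNode keeps only the metadata

def read(file_name):
    """Client asks the NameNode for block locations, then reads any live replica."""
    data = b""
    for block_id, targets in namenode[file_name]:
        node = next(n for n in targets if block_id in datanodes[n])
        data += datanodes[node][block_id]
    return data

write("hello.txt", b"hello big data")
datanodes["datanode-0"].clear()         # simulate a node failure
print(read("hello.txt"))                # still recoverable from the other replicas
```
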
Similar to Hadoop, there are Redshift from Amazon and BigQuery from Google. Many people expect a managed service like BigQuery to eventually take over much of Hadoop's role, though Hadoop remains widely used today.

So, I hope you learned something new about big data. Glad I could help!

Fact Flash:

Hadoop's market is expected to grow to about $1 billion by 2020. On Google alone, 1.2 trillion searches are performed each year, and the numbers keep growing. About 1.7 MB of new data is generated every second for every person on the planet.


YOU CAN ALSO SUGGEST ANY TOPIC YOU NEED OR GIVE SOME SUGGESTIONS. IT WILL BE UPDATED WITHIN 48 HOURS.


.....................Keep Calm and Love Tech...................











