Why big data technologies are more accessible than ever

19 April 2021

With the shift to the cloud, big data processing is more accessible and affordable than ever before. Frameworks like Hadoop and Spark have played a central role and can’t be ignored in a modern data stack. To understand how you can get the most out of these big data frameworks, we take you through the story behind the technologies and the crucial role they play today.

A revolution in the world of data

Before we go into the why and how of Hadoop’s creation, we’ll take you all the way back to 2003. It was in October of that year that Google started a revolution in the world of data. They published a paper describing a new foundation for their search engine, called the Google File System (GFS for short).

While Google was crawling websites across the globe, storing and processing such large amounts of data on a single machine was expensive, time-consuming and in some cases even impossible. That’s why Google built GFS, a scalable, distributed file system designed specifically to store and process large volumes of data.

Google understood early on that distributed storage and processing capacity was necessary to enable further growth. It didn’t take long for Facebook, Yahoo and others to decide to follow in their footsteps.

What is a distributed file system?

When we speak of a file system, we’re talking about the storage of the data, not yet about accessing and processing the data. So, what exactly is a distributed file system and how does it differ from a local file system?

On a local file system, like the one your own operating system is using, the data in your folders is stored on the hard drive of a single machine, your local computer. With a distributed file system your data is stored within a cluster of machines (a cluster being two or more computers connected to each other).

With a distributed file system your data is stored within a cluster of machines, called nodes.

Imagine having a large data file of 150GB. If you store it on a local file system, it will take up 150GB on the hard drive of your local machine. On a distributed file system, the file would be split up into blocks (configurable in size) and stored throughout a cluster of machines. Say we have a cluster of 5 machines: the file could then be split up so that each machine stores roughly 30GB of it.
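
The splitting idea can be sketched in a few lines of plain Python. This is a toy calculation, not HDFS itself: real HDFS splits files into fixed-size blocks (128MB by default) and also replicates each block for fault tolerance, but the round-robin placement idea is the same.

```python
# Toy sketch: splitting a file into blocks and assigning them to nodes.
def distribute_blocks(file_size_gb, block_size_gb, nodes):
    """Split a file into blocks and assign them round-robin to nodes."""
    placement = {node: [] for node in nodes}
    block_id = 0
    remaining = file_size_gb
    while remaining > 0:
        size = min(block_size_gb, remaining)
        node = nodes[block_id % len(nodes)]
        placement[node].append((block_id, size))
        remaining -= size
        block_id += 1
    return placement

nodes = [f"node-{i}" for i in range(1, 6)]
placement = distribute_blocks(150, 30, nodes)
for node, blocks in placement.items():
    print(node, "stores", sum(size for _, size in blocks), "GB")
```

With a 150GB file, 30GB blocks and 5 nodes, each node ends up storing exactly one 30GB block, matching the example above.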

But what is the advantage of distributing files over a cluster of machines? The answer to that question brings us to distributed processing.

What is distributed processing and what are the advantages?

Distributed processing, also known as distributed computing, means using more than one computer or processor to perform the processing for an individual task.

Looking back at our previous example, the 150GB data file was distributed over a cluster of 5 nodes, each storing 30GB of data. Now imagine it takes a single machine 5 hours to process such a file. With distributed processing, each node in the cluster processes its 30GB part of the data file in parallel.

This results in much faster processing times (e.g. 1 hour instead of 5) and makes such systems scalable. Why scalable? A single computer can process a 150GB file in a couple of hours depending on its resources (CPU, memory, etc.), but if you’re dealing with terabytes of data, this quickly becomes impractical. With distributed processing, the system can scale simply by adding nodes to the cluster.
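
The split-process-combine pattern can be illustrated on a single machine with Python’s standard library. This is only a sketch of the idea: here the "workers" are threads in one process, whereas on a real cluster each worker is a separate machine holding its own block of the data.

```python
# Toy sketch of distributed processing: split the work into chunks,
# process the chunks in parallel, then combine the partial results.
from concurrent.futures import ThreadPoolExecutor

def process_chunk(chunk):
    # Stand-in for real work, e.g. parsing or aggregating records.
    return sum(x * x for x in chunk)

data = list(range(1_000))
n_workers = 5
chunk_size = len(data) // n_workers
chunks = [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]

with ThreadPoolExecutor(max_workers=n_workers) as pool:
    partials = list(pool.map(process_chunk, chunks))

total = sum(partials)  # combine the partial results into the final answer
assert total == sum(x * x for x in data)
```

The key property is that each chunk is processed independently, so adding more workers (or machines) lets you process more data in the same amount of time.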

What is Hadoop?

With GFS, distributed computing at this scale saw the light of day, and it laid the foundation for other solutions, like Hadoop. Let’s have a look at how Hadoop came to be.

Apache Hadoop is an open-source framework that allows distributed processing of large data sets across clusters of computers using simple programming models. Hadoop consists of multiple components, of which the Hadoop Distributed File System (HDFS) is one of the major ones.

It was Doug Cutting who started the Apache Hadoop project about three years after GFS was announced. He was working on a project called Apache Nutch, an open-source web crawler, and was searching for a way to process the large volumes of data that Nutch generated. Based on Google’s GFS paper, he created HDFS within Hadoop.

Around the same time that GFS was created, Google built a framework to process the data on their file system: a distributed processing framework called MapReduce. Doug Cutting wasn’t trying to reinvent the wheel, which is why HDFS closely follows the design of GFS and why he made MapReduce part of the Hadoop framework.

HDFS is the distributed storage layer of the Hadoop framework and MapReduce is the distributed processor.

So HDFS is the distributed storage layer of the Hadoop framework, and MapReduce (originally designed by Google) is its distributed processing engine.

The downsides of Hadoop

The fact that Apache Hadoop uses MapReduce to process data also has some downsides. MapReduce reads and writes its data from and to disk without caching. This means that MapReduce is dependent on the speed of the hard disks in each of the nodes, which generally leads to slower execution times. There are articles claiming that Hadoop applications spend more than 90% of their time doing HDFS read-write operations.

Something else you need to know about MapReduce is that it uses three major data operations to enable distributed processing: map, shuffle and reduce. Developers writing MapReduce applications need to hand-code their logic in terms of these three operations. This increases complexity and makes it harder to find people with the skills to write such applications.
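
To make the three operations concrete, here is a minimal single-machine sketch of the classic word-count example. Real MapReduce jobs run these steps across many nodes, with the shuffle moving data over the network, but the logic is the same.

```python
# Map, shuffle and reduce illustrated with a word count.
from collections import defaultdict

lines = ["big data", "big clusters", "data everywhere"]

# Map: emit (key, value) pairs, here (word, 1) for every word.
mapped = [(word, 1) for line in lines for word in line.split()]

# Shuffle: group all values by key. On a real cluster this step moves
# data across the network so that each key ends up on one node.
groups = defaultdict(list)
for word, count in mapped:
    groups[word].append(count)

# Reduce: combine the values for each key.
counts = {word: sum(values) for word, values in groups.items()}
print(counts)  # {'big': 2, 'data': 2, 'clusters': 1, 'everywhere': 1}
```

Notice that even this trivial job has to be expressed as map, shuffle and reduce steps; that is exactly the hand-coding burden described above.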

And then came Spark

The inconveniences of MapReduce became especially apparent when running machine learning applications on top of Hadoop. Apache Spark was initially developed in 2009 to improve the efficiency and address the shortcomings of Hadoop, while keeping the benefits.

Spark was initially developed to address the shortcomings of Hadoop.

Both frameworks are made for distributed processing. Spark, however, uses a different programming abstraction called the RDD (Resilient Distributed Dataset). The main difference is that RDDs support in-memory processing, which makes all the difference. With MapReduce, each node in the cluster constantly needs to read from and write to HDFS (there is no caching layer), while with Spark each node can read from the distributed storage, load parts of the data (or all of it, depending on available memory) into its RAM and use it directly during processing. This makes Spark in some cases up to 100x faster than MapReduce.
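
The contrast can be shown with a toy that simply counts trips to storage. This is not the Spark API (in real Spark you would call `cache()` on an RDD or DataFrame); it only illustrates why reusing data from memory beats re-reading it from disk on every pass.

```python
# Toy contrast: MapReduce-style disk round-trips vs Spark-style reuse.
disk_reads = 0

def read_from_disk():
    # Stand-in for an HDFS read; we count how often it happens.
    global disk_reads
    disk_reads += 1
    return list(range(10))

# MapReduce style: every pass over the data re-reads it from disk.
for _ in range(3):
    data = read_from_disk()
    _ = sum(data)
assert disk_reads == 3  # three passes, three disk reads

# Spark style: load once into memory and reuse it for every pass.
disk_reads = 0
cached = read_from_disk()
for _ in range(3):
    _ = sum(cached)
assert disk_reads == 1  # three passes, a single disk read
```

With each disk read costing orders of magnitude more than a RAM access, repeating this over hundreds of passes is where Spark’s speedup comes from.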

MapReduce was designed for batch processing, while the world was demanding more streaming data pipelines. Spark has a streaming component called Spark Streaming. It ingests the data in mini-batches (small chunks of data) and performs transformations on-the-fly.
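
The mini-batch idea itself is simple, as this sketch shows: incoming events are grouped into small fixed-size batches and each batch is transformed as a unit. Real Spark Streaming batches by time interval and distributes the work, but the principle is the same.

```python
# Toy mini-batch loop in the spirit of Spark Streaming.
events = list(range(10))  # stand-in for an incoming event stream
batch_size = 3

results = []
for start in range(0, len(events), batch_size):
    batch = events[start:start + batch_size]  # one mini-batch
    results.append(sum(batch))                # per-batch transformation

print(results)  # [3, 12, 21, 9]
```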

That leaves the question: what kind of storage layer does Spark use? Today, Spark is almost always used in combination with HDFS as its distributed storage layer. In the more recent Hadoop distributions, Spark even ships as one of the Hadoop components. So that’s where both frameworks come together, bringing out the best in each other.

Does that mean that MapReduce is dead?

Is MapReduce then completely dead? No. Hadoop is still one of the most used big data frameworks, and Spark is often simply part of it. MapReduce is still used in some cases for heavy-weight batch processing, especially when the data exceeds the size of the available RAM.

Cloud technologies and distributed processing for the win

We talked about distributed storage and distributed processing, as they are at the heart of big data processing. Remember that distributed means multiple computers working together in a cluster. Before the cloud, this meant that a company needed to have such a cluster on-premises, which resulted in high investment costs, not to mention the skills necessary to set up the cluster itself.

By using cloud technologies, we are able to store huge volumes of data cheaply on what are called object stores. “Object store” is just a fancy name for cloud file stores like AWS S3, Azure Blob/Data Lake Storage and Google Cloud Storage.

Nothing new, you might say, but what if I told you that these storage services can serve as your distributed file store (much like HDFS) without any additional investment? Each of the cloud providers has multiple services that allow you to consume the data directly from your object store and process it in a distributed way. You only pay for the processing resources at the moments you need them.

You need to remember that cloud solutions enable cheap storage in the form of an object store. We can use this layer as a central landing zone for the data (a data lake). This storage layer can be consumed on demand, in a distributed way, by multiple cloud services. This way of working makes big data processing available to everyone, instead of only those willing to invest big in on-premises architectures.

Which solution is best for you?

At Datashift, we typically work with an ecosystem of technologies and pick the ones that best fit the needs for a specific use case. Curious about what the best technologies are for your data needs? Get in touch and let’s discuss your challenges.