Building a centralized data hub in the cloud

Challenge

Set up a scalable data platform that interacts with a wide variety of data sources

Many of our clients have their data stored in various sources such as payment systems, CRM platforms, application trackers, etc. Before the cloud era, most businesses selected the primary sources that were crucial for their operational reporting and stored the related data in on-premises relational databases. But as access was limited to those databases only, leaving out the original data sources themselves, it was very hard to enable Big Data & Data Science use cases or analyze potentially valuable information. Some of the bigger corporations had already dealt with that challenge by implementing an on-premises data lake that centralized most of their data. The deployment of such technologies, however, didn’t just come at a considerable investment cost. It also required highly specialized skills to set up and maintain.

The current shift to cloud computing enables new and different ways of working with data, especially as data source landscapes are broadening. As our clients know that cloud storage comes at a lower price than on-premises data storage and is highly scalable, they are increasingly looking for cloud-based solutions to help them face the many challenges in growing their business and technical operations. That’s why one of our clients, a provider of subscription-based services, asked us to set up a multi-purpose scalable data platform that interacts with a wide variety of data sources in their fast-evolving business environment.

Approach

Combine an event-driven cloud data architecture with serverless compute resources

The multi-purpose scalable data platform that we developed for our client is built on five pillars:

1. Event-driven data processing

Specific events trigger data processing tasks (or other actions, for that matter). When one of our client’s employees creates a new customer in the CRM platform, for example, the CRM platform triggers a series of data processing tasks. Or, when a new file is stored on the data lake, the data platform itself provides the trigger. Whatever the data source or the trigger, event-driven processing enables real-time reporting use cases because small chunks of data are continuously being processed on the fly. Moreover, especially when using serverless compute resources like AWS Lambda, those operations are quite cheap to handle.

Error monitoring and logging is another excellent example of how event-driven data processing can create value for our client’s business. As all applications send their logs to an AWS S3 bucket, each log file is automatically read by an AWS Lambda function triggered each time a new log file is created on the data lake. Whenever the AWS Lambda function detects an error in the log file, it automatically sends a message to the related Amazon SNS topic and notifies all subscribers on the fly.
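The log-monitoring flow described above can be sketched as a single Lambda handler. This is a minimal illustration, not our client's actual function: the error markers, the topic ARN and the injectable `s3` and `sns` parameters are assumptions made so the example can run without AWS access.

```python
import urllib.parse

# Assumed log levels that count as an error (not taken from the client's setup).
ERROR_MARKERS = ("ERROR", "CRITICAL")

# Hypothetical topic ARN, for illustration only.
DEFAULT_TOPIC_ARN = "arn:aws:sns:eu-west-1:123456789012:log-errors"


def find_error_lines(log_text):
    """Return the lines of a log file that contain one of the error markers."""
    return [
        line
        for line in log_text.splitlines()
        if any(marker in line for marker in ERROR_MARKERS)
    ]


def handler(event, context, s3=None, sns=None, topic_arn=DEFAULT_TOPIC_ARN):
    """Read each newly created log file and notify SNS subscribers about errors."""
    if s3 is None or sns is None:
        import boto3  # imported lazily so the parsing logic can be tested offline

        s3 = s3 or boto3.client("s3")
        sns = sns or boto3.client("sns")

    files_with_errors = 0
    for record in event["Records"]:
        bucket = record["s3"]["bucket"]["name"]
        # Object keys arrive URL-encoded in S3 event notifications.
        key = urllib.parse.unquote_plus(record["s3"]["object"]["key"])
        body = s3.get_object(Bucket=bucket, Key=key)["Body"].read().decode("utf-8")

        errors = find_error_lines(body)
        if errors:
            sns.publish(
                TopicArn=topic_arn,
                Subject=f"Errors in {key}"[:100],  # SNS subjects are capped at 100 characters
                Message="\n".join(errors),
            )
            files_with_errors += 1
    return {"files_with_errors": files_with_errors}
```

When deployed, AWS Lambda calls `handler(event, context)` and the real clients are created; in a test, simple stand-in objects can be passed in instead.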

2. Operational data models for nested source objects

Some of our client’s data sources have pretty complex data structures and often store the data in a semi-structured JSON format. The data sources typically send objects via an API or a webhook to the new data platform, with objects such as customers, payments, and discounts existing independently from one another.

We focused on making the data usable for our client’s business. By creating an underlying normalized data model, we made the data accessible for further ETL processes and self-service tools such as Power BI or Tableau.
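To illustrate what normalizing a nested source object can look like, the sketch below splits one semi-structured payment object into flat rows for three tables linked by keys. The payload shape and every field name (`id`, `customer`, `discounts`, `code`, `percent`) are hypothetical; the real source objects will differ.

```python
# Hypothetical webhook payload: a payment with a nested customer and discounts.
sample_payment = {
    "id": "pay_001",
    "amount": 19.99,
    "currency": "EUR",
    "customer": {"id": "cus_42", "email": "jane@example.com"},
    "discounts": [{"code": "WELCOME", "percent": 10}],
}


def normalize_payment(payment):
    """Split one nested payment object into flat rows, grouped per target table."""
    payment_id = payment["id"]
    customer = payment.get("customer", {})

    payment_row = {
        "payment_id": payment_id,
        "customer_id": customer.get("id"),  # foreign key to the customers table
        "amount": payment["amount"],
        "currency": payment["currency"],
    }
    customer_row = {
        "customer_id": customer.get("id"),
        "email": customer.get("email"),
    }
    discount_rows = [
        {"payment_id": payment_id, "code": d["code"], "percent": d["percent"]}
        for d in payment.get("discounts", [])
    ]
    return {
        "payments": [payment_row],
        "customers": [customer_row],
        "discounts": discount_rows,
    }


tables = normalize_payment(sample_payment)
```

Each list in `tables` can then be appended to its own flat dataset, which is the shape that ETL jobs and BI tools tend to expect.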

3. Data partitioning and optimized file formats on AWS S3

AWS S3 serves as the data lake storage platform, providing the central landing zone for our client's data. As the amount of data stored on S3 increases rapidly, data is partitioned while being stored. Because partitions are related to the S3 folder structure and data can be filtered on partitions, there is no need to scan all S3 folders when accessing data on the S3 data lake.

Typically, data from a given source is stored per day on the S3 data lake. All data for a specific day are stored in one folder, which has the date as its name. To access the data for that day and scan as little data as possible, the filter only needs to include the related partition. Filtering out any redundant partitions enables faster processing times and reduces costs.
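The per-day folder layout can be sketched as follows. The `source/YYYY-MM-DD/` naming is an assumption based on the description above, and the helper names are hypothetical. The point is that a reader only needs the prefixes for the days it asks for, rather than the whole bucket.

```python
from datetime import date, timedelta


def partition_prefix(source, day):
    """Return the S3 key prefix (folder) that holds one day of data for a source."""
    return f"{source}/{day.isoformat()}/"


def prefixes_for_range(source, start, end):
    """Return the prefixes for every day from start to end, both inclusive."""
    number_of_days = (end - start).days + 1
    return [
        partition_prefix(source, start + timedelta(days=offset))
        for offset in range(number_of_days)
    ]


# Only these three folders need to be read for a three-day report.
wanted = prefixes_for_range("payments", date(2021, 4, 30), date(2021, 5, 2))
```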

In addition, larger files are always stored in Parquet format. As Parquet automatically compresses the data, files are smaller and cheaper to process. Parquet is also a columnar file format that holds its own metadata layer and is better suited for analytical purposes.

4. AWS Lambda serverless compute resources

Access to serverless compute resources such as AWS Lambda makes it possible for our client to use compute power only when needed and pay only for what is used. In particular, we set up several APIs that constantly ingest data originating from various event streams, triggering a series of AWS Lambda functions. While we wrote those functions in Python, AWS Lambda also supports several other programming languages, such as Node.js and Java.

As all endpoints run on Amazon API Gateway and AWS Lambda, without any servers to manage, there is no need for load balancing, auto-scaling groups, etc. Compute resources simply scale based on what is needed, and this happens automatically. Did you know that the AWS Lambda free tier includes 1 million requests per month?
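As an illustration of such an ingestion endpoint, here is a minimal sketch of a Lambda handler behind API Gateway, assuming the Lambda proxy integration (the request body arrives as a string in `event["body"]` and the function returns a status code and body). The `event_type` field and the response contents are hypothetical.

```python
import json


def _response(status_code, body):
    """Build a response in the shape the Lambda proxy integration expects."""
    return {
        "statusCode": status_code,
        "headers": {"Content-Type": "application/json"},
        "body": json.dumps(body),
    }


def handler(event, context):
    """Validate an incoming event and acknowledge it."""
    try:
        payload = json.loads(event.get("body") or "")
    except json.JSONDecodeError:
        return _response(400, {"error": "body must be valid JSON"})

    if not isinstance(payload, dict) or "event_type" not in payload:
        return _response(400, {"error": "missing event_type"})

    # A real function would store or forward the payload at this point.
    return _response(202, {"accepted": payload["event_type"]})
```

Because the handler is a plain function, it can be exercised locally with a dictionary such as `{"body": '{"event_type": "customer.created"}'}` before it is deployed.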

5. Amazon Athena interactive query service

Amazon Athena is an interactive query service that enables our client to directly query their data on the S3 data lake using standard SQL. Amazon Athena is built on Presto, which Facebook initially developed so that its data analysts could run interactive queries on its large data warehouse in Apache Hadoop. As a result, Presto is well suited to querying large amounts of data interactively.
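A query against the data lake could be built and submitted roughly as follows. This is a sketch under assumptions: the table layout, the `dt` partition column, the database name and the results location are all hypothetical. The Athena client is passed in as a parameter (in practice `boto3.client("athena")`) so the example does not need AWS credentials.

```python
from datetime import date


def daily_revenue_query(table, day):
    """Build a SQL query that only reads the partition for a single day."""
    # Table names cannot be bound as parameters, so accept simple identifiers only.
    if not table.replace("_", "").isalnum():
        raise ValueError(f"unexpected table name: {table!r}")
    return (
        "SELECT currency, SUM(amount) AS revenue "
        f"FROM {table} "
        f"WHERE dt = '{day.isoformat()}' "  # 'dt' is the assumed partition column
        "GROUP BY currency"
    )


def submit(query, athena, database="datahub", output="s3://example-athena-results/"):
    """Start the query on Athena and return its execution id."""
    response = athena.start_query_execution(
        QueryString=query,
        QueryExecutionContext={"Database": database},
        ResultConfiguration={"OutputLocation": output},
    )
    return response["QueryExecutionId"]


query = daily_revenue_query("payments", date(2021, 5, 1))
```

Filtering on the partition column in the `WHERE` clause is what lets the engine skip the folders for all other days.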

Amazon Athena is used as a serverless processing engine for our client’s bigger datasets on their S3 data lake. It enables them to reach impressive average execution times of three minutes (or less!) on 20-30 GB datasets. Furthermore, as they can quickly query their data without having to set up and manage any servers, our client only pays per query for the amount of data that was read. In addition, they save on per-query costs thanks to the data partitioning and optimized file formats that we implemented on their S3 data lake.
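The effect of partitioning on per-query cost can be made concrete with a small calculation. The numbers are illustrative assumptions, not client figures: a price of 5 USD per terabyte scanned (Athena pricing varies by region and over time), a terabyte counted as 1024^4 bytes, and a 30 GB dataset spread evenly over 30 daily partitions.

```python
BYTES_PER_GB = 1024 ** 3
BYTES_PER_TB = 1024 ** 4  # assumed definition of a terabyte for this estimate


def scan_cost(bytes_scanned, price_per_tb=5.0):
    """Estimate the cost of one query from the amount of data it reads."""
    return bytes_scanned / BYTES_PER_TB * price_per_tb


# Hypothetical dataset: 30 GB in total, split evenly across 30 daily partitions.
full_scan = scan_cost(30 * BYTES_PER_GB)
one_day = scan_cost(30 * BYTES_PER_GB / 30)
savings_factor = full_scan / one_day
```

Under these assumptions, a query that filters on a single day reads one thirtieth of the data and costs one thirtieth as much. Columnar, compressed Parquet files reduce the bytes read further, because only the referenced columns are scanned.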

Impact

A centralized data hub that serves as a gateway for other applications and helps create business insights and value

The centralized data hub that we built for our client does not just store data from a wide range of sources. It also models that data, casting it in a business-ready structure and combining sources to enable in-depth insights. An excellent example is how our client can now connect their application tracking data with their commercial data to check how churning customers typically behave before they churn.

The data platform is not a final landing zone for our client’s data. They can use technologies like Amazon Athena to investigate which data sources they want to invest in further, for example. Or they can easily deploy other applications, such as data enrichment by Machine Learning models, pushing the results to their app to make user-tailored recommendations. Their centralized data hub acts as a 360-degree layer surrounding their data sources and powers different use cases for generating impact and value.
