Setting up an automated data-driven solution

S3, Snowflake, and dbt empower a major media company’s viewer engagement reporting

Challenge

Boost viewer engagement reporting to deliver insights that impact the business

One of our clients, a major media company, was facing increased pressure to step up their viewer engagement reporting. As new market players continued to disrupt linear media consumption, it became increasingly important for our client to get up-to-date and actionable insights on viewer engagement.

Existing reports were not only static and too limited in scope, but also took too much time and too many resources to produce. That left too little time for data analysts to do what they do best: adding value to the data to deliver insights that impact the business.

Hence, our client turned to us to help them boost the scope of their viewer engagement reporting processes – and do this in a way that creates value for end-users as soon as possible while delivering a future-proof, secure, and high-performance platform.

Approach

Implement an automated data-driven solution built on S3, Snowflake, and dbt

Together with our client, we decided to go for an automated data-driven solution using cloud-based tools. We settled on three key technologies: S3 for cloud storage, Snowflake for its shared data architecture, and dbt for SQL-first transformation workflows. We opted for AWS as the cloud computing platform, mainly because its broad service offering allowed us to efficiently run, integrate, and monitor the tools needed to implement this solution.

S3 cloud storage

To ingest data into S3, we used both pull-based and push-based methods. We pulled data in by calling the APIs of external tools used by our client, and data was pushed into S3 as file exports from other tools. Some data was pushed from carefully curated data products managed by our client or its partners. As part of a staging workload, we then performed checks on CSV, JSON, or Parquet files stored on S3 before loading them into Snowflake tables. A typical example is checking the schema of CSV files. Because Snowflake natively reads CSV files by column position, a file whose columns were added, removed, or reordered would otherwise be loaded silently into the wrong target columns, so this kind of check was crucial to prevent data from being loaded incorrectly into Snowflake.
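To give an idea of what such a check can look like, here is a minimal sketch in Snowflake SQL that reads the header row of a staged CSV file so it can be compared against the expected column names before the actual load. The stage, file, and column names are illustrative, not our client’s actual setup:

    -- File format that keeps the header row, so we can inspect it
    create or replace file format csv_raw
      type            = csv
      skip_header     = 0
      field_delimiter = ',';

    -- Query the staged file directly; METADATA$FILE_ROW_NUMBER = 1
    -- isolates the header row of the file
    select $1 as col_1, $2 as col_2, $3 as col_3
    from @viewer_stage/engagement_events.csv (file_format => 'csv_raw')
    where metadata$file_row_number = 1;

The staging workload compares the returned header values against the expected layout (for example viewer_id, program_id, watched_at) and only proceeds with the load when the positions match.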

Event-driven data movement into scalable Snowflake warehouses

Snowflake was used as the primary data store because of its scalability, flexible cost management, and a range of powerful features that extend beyond a traditional database engine. An excellent example is how Snowflake manages and scales compute resources (such as CPU, memory, and temporary storage) in T-shirt-sized entities called warehouses. As warehouses can be resized at any time to accommodate changing compute needs, we started our client on the X-Small warehouse size, which provided a cost-effective solution for the initial development and reporting needs. Setting up separate warehouses for production and non-production workloads also made it easy to manage and track costs, and enabled us to continuously optimize for performance.
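In Snowflake SQL, that setup comes down to a few statements. The warehouse names below are illustrative:

    -- One warehouse per environment, starting small
    create warehouse if not exists reporting_prod_wh
      warehouse_size = 'XSMALL'
      auto_suspend   = 60    -- suspend after 60 seconds idle
      auto_resume    = true;

    create warehouse if not exists reporting_dev_wh
      warehouse_size = 'XSMALL'
      auto_suspend   = 60
      auto_resume    = true;

    -- Resizing later is a one-line change, with no data movement
    alter warehouse reporting_prod_wh set warehouse_size = 'SMALL';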

Snowpipe is another example of a powerful Snowflake feature we used a lot. Snowpipe made it possible for us to automatically process and load data from CSV, JSON, or Parquet files into Snowflake tables as soon as those files were available on S3. Thanks to this event-driven data movement from S3 into Snowflake, the data in Snowflake was always up-to-date and ready for further use downstream. Finally, we set up a cost-monitoring dashboard based on the usage views built into Snowflake to provide our client with complete transparency on Snowflake usage and help them make informed decisions on optimizing it.
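The sketch below shows what such a pipe looks like, together with the kind of query that can feed a cost-monitoring dashboard; the database, schema, stage, and table names are again illustrative:

    -- Snowpipe: auto_ingest = true makes the pipe fire on S3 event
    -- notifications, so new files are loaded as soon as they land
    create pipe raw_db.staging.engagement_pipe
      auto_ingest = true
    as
      copy into raw_db.staging.engagement_events
      from @raw_db.staging.viewer_stage/engagement/
      file_format = (type = csv skip_header = 1);

    -- Cost monitoring: credits consumed per warehouse per day,
    -- from Snowflake's built-in account usage views
    select warehouse_name,
           date_trunc('day', start_time) as usage_day,
           sum(credits_used)             as credits
    from snowflake.account_usage.warehouse_metering_history
    group by warehouse_name, usage_day;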

SQL-first data transformations with dbt

At this stage, we still faced one of the most challenging tasks in building an automated, data-driven solution: managing, implementing, and documenting the transformation of raw data into data that is ready for consumption by business users. For this, we used dbt, a technology that provides an innovative approach to managing and running data transformations from a central repository such as GitHub or GitLab. dbt makes it easy for analysts and engineers to collaborate on complex data transformations transparently.

dbt is a deceptively simple tool that did so much for us in this project. Consider, for example, the automatic management of dependencies between data transformations. No need to manually schedule transformations a, b, and c before transformations x, y, and z if the latter depend in some way on the former. dbt automatically figures out any dependencies and runs transformations in the correct order. That proved a massive time-saver in building streamlined data pipelines and helped reduce errors.
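As a simple sketch of how this works, consider two dbt models; the model and column names are made up for illustration. The ref() call is what lets dbt derive the execution order:

    -- models/staging/stg_engagement_events.sql
    select
        viewer_id,
        program_id,
        watched_at
    from raw_db.staging.engagement_events

    -- models/marts/fct_daily_engagement.sql
    -- ref() resolves to the right schema *and* tells dbt that this
    -- model must run after stg_engagement_events
    select
        viewer_id,
        date_trunc('day', watched_at) as watch_day,
        count(*)                      as events
    from {{ ref('stg_engagement_events') }}
    group by viewer_id, watch_day

Because fct_daily_engagement references stg_engagement_events through ref(), dbt always builds the staging model first, without any manual scheduling.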

We also used dbt extensively for data validation, such as checking identifiers that are supposed to be unique. That was particularly useful because Snowflake lets you declare uniqueness constraints but does not enforce them. By validating data with dbt, we were able to catch potential errors early in the process and avoid downstream issues. In addition, dbt enabled us to automate data integrity validation and documentation, killing two birds with one stone: saving time and ensuring data quality.
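A uniqueness check like this is typically declared in a model’s YAML file, but it can also be written as a standalone dbt test in plain SQL, which makes the mechanism easy to see. A minimal sketch, assuming a hypothetical dim_viewers model:

    -- tests/assert_viewer_id_unique.sql
    -- A dbt "singular" test: it fails when this query returns rows
    select
        viewer_id,
        count(*) as occurrences
    from {{ ref('dim_viewers') }}
    group by viewer_id
    having count(*) > 1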

The number one reason we love dbt so much is that it allows us to “push governance left”: building data governance best practices right into the data transformations themselves. Features such as integrated documentation down to the field level, automated data lineage, and validation are a godsend for any conscientious data engineer.

Impact

A high-performance data platform that can continue to evolve and grow with changing business needs

Now that an automated, data-driven solution has been built on S3, Snowflake, and dbt, our client can share actionable insights on viewer engagement with end-users – and do that efficiently. End-users have been freed from labor-intensive manual analyses, and data ingestion, storage, and transformation now run fully automated. Overall, we delivered a strong foundation for a secure and high-performance data platform that can continue to evolve and grow with changing business needs.

Aside from the technical achievements, the most important lesson from this project was the value of close collaboration with our client. Clear communication, focused training sessions, and dedicated support during all steps helped us ensure that changes went smoothly for all involved. In addition, we keep the implemented solution up-to-date and aligned with current and new business requirements through ongoing support and platform maintenance.

Thanks to the unique mix of proven technologies such as S3, Snowflake, and dbt, and our deep expertise, we helped our client overcome the many challenges of boosting viewer engagement reporting. If you, too, are looking at how to get insights from your data to impact your business, get in touch to chat about how we can help.

Shift from data to impact today

Contact datashift