Transforming Data Processing with AWS Data Mesh and Databricks

Summary

The client, a Fortune 500 maintenance, repair, and operations (MRO) distributor, faced significant challenges in real-time data processing and machine learning. To address these issues, they sought to build a data product-oriented Data Mesh using AWS and Databricks. This initiative aimed to improve data accessibility, processing efficiency, and governance across the new architecture.

Goal

Deploy Databricks for real-time data processing and adopt a Data Mesh architecture on AWS to enhance the distributor's operational agility and scalability.

Approach

In the AWS environment, Tredence implemented a robust Databricks architecture using a hub-and-spoke model:

The spoke was built and managed by the Tredence team in collaboration with the client's platform hub team, with the goal of constructing a scalable backend infrastructure that supports the Databricks platform while adhering to best-in-class standards.

According to the design (a hedged provisioning sketch in Python follows this list):

  • Three cluster-provisioning subnets and three transit subnets, spread across different Availability Zones (AZs), provide high availability and fault tolerance for network traffic management.
  • VPC endpoints establish secure, private connections to AWS services, bypassing public internet routes.
  • Route tables direct traffic efficiently and enforce compliance with security policies.
  • Network configurations, including CIDR blocks and subnet associations, prevent IP conflicts and align with the client's network strategy.
  • Security groups and network ACLs enforce stringent traffic controls within the spoke, maintaining compliance and operational standards for Databricks workloads on AWS.
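
The sketch below shows how a spoke network of this shape might be provisioned with boto3. It is a minimal illustration, not the client's actual automation: the VPC ID, CIDR ranges, region, and resource names are all assumptions, and a real deployment would typically live in infrastructure-as-code tooling.

```python
# Hypothetical sketch: provisioning a Databricks "spoke" network with boto3.
# All IDs, CIDR blocks, and the region are illustrative assumptions.
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")  # assumed region

VPC_ID = "vpc-0123456789abcdef0"  # assumed existing spoke VPC
AZS = ["us-east-1a", "us-east-1b", "us-east-1c"]

# Three cluster-provisioning subnets and three transit subnets across AZs.
cluster_subnets, transit_subnets = [], []
for i, az in enumerate(AZS):
    cluster = ec2.create_subnet(
        VpcId=VPC_ID, AvailabilityZone=az, CidrBlock=f"10.0.{i}.0/24"
    )["Subnet"]["SubnetId"]
    transit = ec2.create_subnet(
        VpcId=VPC_ID, AvailabilityZone=az, CidrBlock=f"10.0.{i + 16}.0/28"
    )["Subnet"]["SubnetId"]
    cluster_subnets.append(cluster)
    transit_subnets.append(transit)

# A dedicated route table, associated with the cluster subnets, keeps
# traffic segregated and auditable.
rt_id = ec2.create_route_table(VpcId=VPC_ID)["RouteTable"]["RouteTableId"]
for subnet_id in cluster_subnets:
    ec2.associate_route_table(RouteTableId=rt_id, SubnetId=subnet_id)

# Gateway endpoint for S3 so storage traffic bypasses the public internet.
ec2.create_vpc_endpoint(
    VpcId=VPC_ID,
    ServiceName="com.amazonaws.us-east-1.s3",
    VpcEndpointType="Gateway",
    RouteTableIds=[rt_id],
)

# Security group restricting the spoke to intra-cluster traffic.
sg_id = ec2.create_security_group(
    GroupName="databricks-spoke-sg",  # assumed name
    Description="Intra-cluster traffic for Databricks workloads",
    VpcId=VPC_ID,
)["GroupId"]
ec2.authorize_security_group_ingress(
    GroupId=sg_id,
    IpPermissions=[{
        "IpProtocol": "-1",
        "UserIdGroupPairs": [{"GroupId": sg_id}],  # self-referencing rule
    }],
)
```

The self-referencing security group rule reflects Databricks' requirement that cluster nodes be able to talk to each other within the workspace's security group.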

Tools and AWS Services Used

  • Amazon S3
  • Amazon EC2
  • Amazon VPC and VPC endpoints
  • Subnets
  • Route tables
  • Transit Gateway attachments
  • Security groups and network ACLs
  • Route 53 outbound Resolver rules
  • IAM roles and policies
  • AWS KMS customer managed keys (CMKs)

Key Benefits

  • Developed 70+ reusable library functions to ensure standardized and efficient data processing.
  • Achieved a 4x reduction in cost and runtime by migrating workflows to Databricks and AWS.
  • Operationalized 10+ near real-time data pipelines to enable continuous data processing.
  • Automated the provisioning of 6 shared Databricks workspaces for streamlined operations.
  • Created dashboards on Databricks for enhanced data visualization and insights.

Results

Developed a library of 70+ reusable functions to standardize data processing.

Implemented a micro application for handling late-arriving data, ensuring timely processing.
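
The internals of that micro application aren't detailed here; one common pattern for late-arriving data on Databricks is a Structured Streaming watermark, sketched below with assumed table names, column names, and lateness threshold.

```python
# Hypothetical sketch: tolerating late-arriving events with a watermark.
# Table names, column names, and the 2-hour threshold are assumptions.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

events = spark.readStream.table("raw.orders")  # assumed Delta source table

hourly_counts = (
    events
    # Accept events up to 2 hours late; anything older is dropped here and
    # can be caught downstream by the audit tables described next.
    .withWatermark("event_time", "2 hours")
    .groupBy(F.window("event_time", "1 hour"), "region")
    .agg(F.count("*").alias("order_count"))
)

query = (
    hourly_counts.writeStream
    .outputMode("append")
    .option("checkpointLocation", "s3://bucket/checkpoints/orders")  # assumed
    .toTable("curated.orders_hourly")  # assumed target table
)
```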

Created audit tables with integrated alerts to reconcile data downstream after ingestion.
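
A minimal sketch of what such a post-ingestion reconciliation check could look like follows; the table names and the alerting hook are assumptions, not the client's implementation.

```python
# Hypothetical sketch: reconcile source vs. target row counts after ingestion
# and record the outcome in an audit table. All table names are assumptions.
from datetime import datetime, timezone
from pyspark.sql import SparkSession, Row

spark = SparkSession.builder.getOrCreate()

src_count = spark.table("raw.orders").count()
tgt_count = spark.table("curated.orders").count()
matched = src_count == tgt_count

audit_row = Row(
    run_ts=datetime.now(timezone.utc),
    source_table="raw.orders",
    target_table="curated.orders",
    source_count=src_count,
    target_count=tgt_count,
    matched=matched,
)
spark.createDataFrame([audit_row]).write.mode("append").saveAsTable("ops.audit_log")

if not matched:
    # Placeholder for the integrated alert (e.g., a webhook or email hook).
    raise RuntimeError(
        f"Reconciliation mismatch: {src_count} source vs {tgt_count} target rows"
    )
```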

Implemented a domain-oriented architecture that delivers domain-specific insights and strengthens overall data management.

Enabled API-based job deployment through GitHub Actions, enhancing deployment efficiency.
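
A hedged sketch of the kind of step a GitHub Actions workflow might run is shown below, triggering a job through the Databricks Jobs API 2.1 over plain HTTP. The workspace host, token, and job ID are assumptions wired in as CI secrets, not details from the engagement.

```python
# Hypothetical sketch: trigger a Databricks job from CI via the Jobs API 2.1.
# DATABRICKS_HOST / DATABRICKS_TOKEN would come from GitHub Actions secrets;
# the job ID is an assumption.
import os
import requests

host = os.environ["DATABRICKS_HOST"]    # e.g. https://<workspace>.cloud.databricks.com
token = os.environ["DATABRICKS_TOKEN"]  # injected as a CI secret

resp = requests.post(
    f"{host}/api/2.1/jobs/run-now",
    headers={"Authorization": f"Bearer {token}"},
    json={"job_id": 123},  # assumed job ID
    timeout=30,
)
resp.raise_for_status()
print("Triggered run:", resp.json()["run_id"])
```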

Enhanced Tredence Accelerator to support Near Real-Time (NRT) and streaming data processing.

Designed and deployed a Notebook-based tool for flexible, user-driven data exports based on distribution rules.
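
The distribution rules themselves aren't documented here; the sketch below assumes they live in a small rules table mapping a dataset and filter expression to an export format and path. Every name in it is hypothetical.

```python
# Hypothetical sketch: rule-driven data export from a notebook.
# The rules table schema (dataset, filter_expr, fmt, target_path) is assumed.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

rules = spark.table("ops.distribution_rules").collect()  # assumed rules table

for rule in rules:
    df = spark.table(rule["dataset"]).where(rule["filter_expr"])
    (
        df.write.mode("overwrite")
        .format(rule["fmt"])          # e.g. "parquet" or "csv"
        .save(rule["target_path"])    # e.g. an S3 prefix per consumer
    )
```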

Developed dashboards on Databricks for insightful data visualization and enhanced decision-making.

Achieved a 4x cost and runtime reduction by migrating workflows to Databricks and AWS during the proof of concept (POC).

Operationalized 10+ Near Real-Time data pipelines to support continuous and efficient data processing.

Automated the setup of 6 shared Databricks workspaces, improving resource management and collaboration.
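
Workspace automation of this kind can go through the Databricks Account API; the sketch below is a hedged illustration of one such call, with the account ID and the pre-created credential, storage, and network configuration IDs all assumed rather than taken from the engagement.

```python
# Hypothetical sketch: create a Databricks workspace via the Account API.
# Account ID and configuration IDs are assumptions; credentials come from env.
import os
import requests

ACCOUNT_ID = "00000000-0000-0000-0000-000000000000"  # assumed
BASE = f"https://accounts.cloud.databricks.com/api/2.0/accounts/{ACCOUNT_ID}"

resp = requests.post(
    f"{BASE}/workspaces",
    auth=(os.environ["DATABRICKS_USER"], os.environ["DATABRICKS_PASSWORD"]),
    json={
        "workspace_name": "shared-analytics-01",    # assumed
        "aws_region": "us-east-1",                  # assumed
        "credentials_id": "ccc-111",                # assumed, pre-created
        "storage_configuration_id": "sss-222",      # assumed, pre-created
        "network_id": "nnn-333",                    # assumed, pre-created
    },
    timeout=60,
)
resp.raise_for_status()
print("Workspace ID:", resp.json()["workspace_id"])
```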
