Transforming Data Processing with AWS Data Mesh and Databricks

Summary

The client, a Fortune 500 maintenance, repair, and operations (MRO) distributor, faced significant challenges in real-time data processing and machine learning. To address these issues, they sought to build a data product-oriented Data Mesh using AWS and Databricks. This initiative aimed to improve data accessibility, processing efficiency, and governance across the new architecture.

Goal

Deploy Databricks for real-time data processing and adopt a Data Mesh architecture on AWS to enhance the distributor's operational agility and scalability.

Approach

In the AWS environment, Tredence implemented a robust Databricks architecture using a hub-and-spoke model:

The spoke was built and managed by the Tredence team in collaboration with the client's platform hub team, with the goal of constructing a scalable backend infrastructure that supports the Databricks platform while adhering to best-in-class standards.

According to the design (a hedged provisioning sketch in Python follows this list):

  • Three cluster-provisioning subnets and three transit subnets, spread across different Availability Zones (AZs), provide high availability and fault tolerance for network traffic management.
  • VPC endpoints establish secure, private connections to AWS services, bypassing public internet routes.
  • Route tables direct traffic efficiently and enforce compliance with security policies.
  • Network configurations, including CIDR blocks and subnet associations, prevent IP conflicts and align with the client's network strategy.
  • Security groups and network ACLs enforce stringent traffic controls within the spoke, maintaining compliance and operational standards for Databricks workloads on AWS.
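
The sketch below shows how a spoke network of this shape might be provisioned with boto3. It is a minimal illustration, not the client's actual automation: the VPC ID, CIDR ranges, region, and resource names are all assumptions, and a real deployment would typically live in infrastructure-as-code tooling.

```python
# Hypothetical sketch: provisioning a Databricks "spoke" network with boto3.
# All IDs, CIDR blocks, and the region are illustrative assumptions.
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")  # assumed region

VPC_ID = "vpc-0123456789abcdef0"  # assumed existing spoke VPC
AZS = ["us-east-1a", "us-east-1b", "us-east-1c"]

# Three cluster-provisioning subnets and three transit subnets across AZs.
cluster_subnets, transit_subnets = [], []
for i, az in enumerate(AZS):
    cluster = ec2.create_subnet(
        VpcId=VPC_ID, AvailabilityZone=az, CidrBlock=f"10.0.{i}.0/24"
    )["Subnet"]["SubnetId"]
    transit = ec2.create_subnet(
        VpcId=VPC_ID, AvailabilityZone=az, CidrBlock=f"10.0.{i + 16}.0/28"
    )["Subnet"]["SubnetId"]
    cluster_subnets.append(cluster)
    transit_subnets.append(transit)

# A dedicated route table, associated with the cluster subnets, keeps
# traffic segregated and auditable.
rt_id = ec2.create_route_table(VpcId=VPC_ID)["RouteTable"]["RouteTableId"]
for subnet_id in cluster_subnets:
    ec2.associate_route_table(RouteTableId=rt_id, SubnetId=subnet_id)

# Gateway endpoint for S3 so storage traffic bypasses the public internet.
ec2.create_vpc_endpoint(
    VpcId=VPC_ID,
    ServiceName="com.amazonaws.us-east-1.s3",
    VpcEndpointType="Gateway",
    RouteTableIds=[rt_id],
)

# Security group restricting the spoke to intra-cluster traffic.
sg_id = ec2.create_security_group(
    GroupName="databricks-spoke-sg",  # assumed name
    Description="Intra-cluster traffic for Databricks workloads",
    VpcId=VPC_ID,
)["GroupId"]
ec2.authorize_security_group_ingress(
    GroupId=sg_id,
    IpPermissions=[{
        "IpProtocol": "-1",
        "UserIdGroupPairs": [{"GroupId": sg_id}],  # self-referencing rule
    }],
)
```

The self-referencing security group rule reflects Databricks' requirement that cluster nodes be able to talk to each other within the workspace's security group.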

Tools and AWS Services Used

  • Amazon S3
  • Amazon EC2
  • Amazon VPC and VPC endpoints
  • Subnets
  • Route tables
  • Transit Gateway attachments
  • Security groups and network ACLs
  • Route 53 outbound Resolver rules
  • IAM roles and policies
  • AWS KMS customer managed keys (CMKs)

Key Benefits

  • Developed 70+ reusable library functions to ensure standardized and efficient data processing.
  • Achieved a 4x reduction in cost and runtime by migrating workflows to Databricks and AWS.
  • Operationalized 10+ near real-time data pipelines to enable continuous data processing.
  • Automated the provisioning of 6 shared Databricks workspaces for streamlined operations.
  • Created dashboards on Databricks for enhanced data visualization and insights.

Results

Developed a library of 70+ reusable functions to standardize data processing.

Implemented a micro application for handling late-arriving data, ensuring timely processing.
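
The internals of that micro application aren't detailed here; one common pattern for late-arriving data on Databricks is a Structured Streaming watermark, sketched below with assumed table names, column names, and lateness threshold.

```python
# Hypothetical sketch: tolerating late-arriving events with a watermark.
# Table names, column names, and the 2-hour threshold are assumptions.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

events = spark.readStream.table("raw.orders")  # assumed Delta source table

hourly_counts = (
    events
    # Accept events up to 2 hours late; anything older is dropped here and
    # can be caught downstream by the audit tables described next.
    .withWatermark("event_time", "2 hours")
    .groupBy(F.window("event_time", "1 hour"), "region")
    .agg(F.count("*").alias("order_count"))
)

query = (
    hourly_counts.writeStream
    .outputMode("append")
    .option("checkpointLocation", "s3://bucket/checkpoints/orders")  # assumed
    .toTable("curated.orders_hourly")  # assumed target table
)
```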

Created audit tables with integrated alerts to reconcile data downstream after ingestion.
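
A minimal sketch of what such a post-ingestion reconciliation check could look like follows; the table names and the alerting hook are assumptions, not the client's implementation.

```python
# Hypothetical sketch: reconcile source vs. target row counts after ingestion
# and record the outcome in an audit table. All table names are assumptions.
from datetime import datetime, timezone
from pyspark.sql import SparkSession, Row

spark = SparkSession.builder.getOrCreate()

src_count = spark.table("raw.orders").count()
tgt_count = spark.table("curated.orders").count()
matched = src_count == tgt_count

audit_row = Row(
    run_ts=datetime.now(timezone.utc),
    source_table="raw.orders",
    target_table="curated.orders",
    source_count=src_count,
    target_count=tgt_count,
    matched=matched,
)
spark.createDataFrame([audit_row]).write.mode("append").saveAsTable("ops.audit_log")

if not matched:
    # Placeholder for the integrated alert (e.g., a webhook or email hook).
    raise RuntimeError(
        f"Reconciliation mismatch: {src_count} source vs {tgt_count} target rows"
    )
```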

Implemented a domain-oriented architecture that delivers domain-specific insights and strengthens overall data management.

Enabled API-based job deployment through GitHub Actions, enhancing deployment efficiency.
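
A hedged sketch of the kind of step a GitHub Actions workflow might run is shown below, triggering a job through the Databricks Jobs API 2.1 over plain HTTP. The workspace host, token, and job ID are assumptions wired in as CI secrets, not details from the engagement.

```python
# Hypothetical sketch: trigger a Databricks job from CI via the Jobs API 2.1.
# DATABRICKS_HOST / DATABRICKS_TOKEN would come from GitHub Actions secrets;
# the job ID is an assumption.
import os
import requests

host = os.environ["DATABRICKS_HOST"]    # e.g. https://<workspace>.cloud.databricks.com
token = os.environ["DATABRICKS_TOKEN"]  # injected as a CI secret

resp = requests.post(
    f"{host}/api/2.1/jobs/run-now",
    headers={"Authorization": f"Bearer {token}"},
    json={"job_id": 123},  # assumed job ID
    timeout=30,
)
resp.raise_for_status()
print("Triggered run:", resp.json()["run_id"])
```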

Enhanced Tredence Accelerator to support Near Real-Time (NRT) and streaming data processing.

Designed and deployed a Notebook-based tool for flexible, user-driven data exports based on distribution rules.
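
The distribution rules themselves aren't documented here; the sketch below assumes they live in a small rules table mapping a dataset and filter expression to an export format and path. Every name in it is hypothetical.

```python
# Hypothetical sketch: rule-driven data export from a notebook.
# The rules table schema (dataset, filter_expr, fmt, target_path) is assumed.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

rules = spark.table("ops.distribution_rules").collect()  # assumed rules table

for rule in rules:
    df = spark.table(rule["dataset"]).where(rule["filter_expr"])
    (
        df.write.mode("overwrite")
        .format(rule["fmt"])          # e.g. "parquet" or "csv"
        .save(rule["target_path"])    # e.g. an S3 prefix per consumer
    )
```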

Developed dashboards on Databricks for insightful data visualization and enhanced decision-making.

Achieved a 4x cost and runtime reduction by migrating workflows to Databricks and AWS during the proof of concept (POC).

Operationalized 10+ Near Real-Time data pipelines to support continuous and efficient data processing.

Automated the setup of 6 shared Databricks workspaces, improving resource management and collaboration.
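
Workspace automation of this kind can go through the Databricks Account API; the sketch below is a hedged illustration of one such call, with the account ID and the pre-created credential, storage, and network configuration IDs all assumed rather than taken from the engagement.

```python
# Hypothetical sketch: create a Databricks workspace via the Account API.
# Account ID and configuration IDs are assumptions; credentials come from env.
import os
import requests

ACCOUNT_ID = "00000000-0000-0000-0000-000000000000"  # assumed
BASE = f"https://accounts.cloud.databricks.com/api/2.0/accounts/{ACCOUNT_ID}"

resp = requests.post(
    f"{BASE}/workspaces",
    auth=(os.environ["DATABRICKS_USER"], os.environ["DATABRICKS_PASSWORD"]),
    json={
        "workspace_name": "shared-analytics-01",    # assumed
        "aws_region": "us-east-1",                  # assumed
        "credentials_id": "ccc-111",                # assumed, pre-created
        "storage_configuration_id": "sss-222",      # assumed, pre-created
        "network_id": "nnn-333",                    # assumed, pre-created
    },
    timeout=60,
)
resp.raise_for_status()
print("Workspace ID:", resp.json()["workspace_id"])
```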
