Advanced · Data Engineering · Cloud · Analytics · ETL

Designing a Modern Data Stack Architecture

Olivier Soudée

Design a modern data stack with warehouse, ingestion, dbt, and BI layers. A phased rollout ships the core in 8-12 weeks.

5 min read

Direct answer
  • Modern data stack combines cloud data warehouses (Snowflake, Databricks, BigQuery), ingestion tools (Fivetran, Airbyte), transformation layer (dbt), and BI tools (Looker, Tableau) with orchestration (Airflow, Dagster)
  • Key benefits: rapid deployment (days vs months), pay-as-you-go pricing, infinite scalability, separation of storage and compute, and rich ecosystem of integrated tools
  • Start with warehouse selection based on workload (Snowflake for general purpose, Databricks for ML/AI, BigQuery for Google Cloud native), then add ingestion and transformation before analytics layer

The modern data stack has revolutionized how organizations collect, process, and analyze data. This guide explores the key components and design principles for building a scalable, cost-effective data architecture.

Prerequisites

This guide assumes familiarity with:

  • Cloud computing concepts (AWS, GCP, or Azure)
  • SQL and basic data modeling
  • ETL/ELT concepts
  • Data warehousing fundamentals

What is the Modern Data Stack?

The modern data stack is a cloud-native collection of tools that work together to handle the complete data lifecycle. Key characteristics include:

  • Cloud-native: Built for cloud scalability and economics
  • Modular: Best-of-breed tools that integrate via APIs
  • Managed: SaaS or managed services reduce operational burden
  • SQL-centric: Familiar interface for analysts and engineers

Core Components

1. Data Sources

Modern architectures must handle diverse data sources:

  • Operational databases: PostgreSQL, MySQL, MongoDB
  • SaaS applications: Salesforce, HubSpot, Stripe
  • Event streams: Mobile apps, websites, IoT devices
  • Files: S3, GCS, Azure Blob Storage

2. Data Ingestion

Move data from sources to your data platform.

Batch Ingestion Tools:

  • Fivetran
  • Airbyte
  • Stitch

Real-time Streaming:

  • Kafka
  • AWS Kinesis
  • Confluent Cloud

Design Considerations: batch and streaming paths typically run side by side and converge in the lake or lakehouse:

┌────────────────────────────────────────────────────┐
│                    Data Sources                    │
│  ┌──────┐ ┌──────┐ ┌──────┐ ┌──────┐ ┌──────┐      │
│  │  DB  │ │ API  │ │Events│ │Files │ │ SaaS │      │
│  └──┬───┘ └──┬───┘ └──┬───┘ └──┬───┘ └──┬───┘      │
└─────┼────────┼────────┼────────┼────────┼──────────┘
      │        │        │        │        │
      ▼        ▼        ▼        ▼        ▼
┌────────────────────────────────────────────────────┐
│                  Ingestion Layer                   │
│  ┌──────────────────┐   ┌──────────────────┐       │
│  │ Batch (Fivetran) │   │  Stream (Kafka)  │       │
│  └────────┬─────────┘   └────────┬─────────┘       │
└───────────┼──────────────────────┼─────────────────┘
            │                      │
            ▼                      ▼
┌────────────────────────────────────────────────────┐
│                Data Lake / Lakehouse               │
└────────────────────────────────────────────────────┘
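Batch connectors like those above typically sync incrementally: each run pulls only rows changed since a stored high-water mark. A minimal stdlib-only sketch of that pattern, assuming an in-memory source table and an illustrative `extract_incremental` helper (this is not any vendor's API):

```python
from datetime import datetime, timezone

# Illustrative in-memory "source table"; a real connector would query a
# database or SaaS API instead.
ORDERS = [
    {"id": 1, "updated_at": datetime(2024, 1, 1, tzinfo=timezone.utc)},
    {"id": 2, "updated_at": datetime(2024, 1, 5, tzinfo=timezone.utc)},
    {"id": 3, "updated_at": datetime(2024, 1, 9, tzinfo=timezone.utc)},
]

def extract_incremental(rows, watermark):
    """Return rows changed since the last sync, plus the new watermark."""
    fresh = [r for r in rows if r["updated_at"] > watermark]
    new_watermark = max((r["updated_at"] for r in fresh), default=watermark)
    return fresh, new_watermark

# First sync pulls everything; later syncs pull only changed rows.
epoch = datetime.min.replace(tzinfo=timezone.utc)
batch, wm = extract_incremental(ORDERS, epoch)
batch2, wm2 = extract_incremental(ORDERS, wm)  # nothing new -> empty batch
```

The watermark must be persisted between runs; losing it forces a full re-sync, which is why managed connectors store sync state for you.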

3. Data Storage

The storage layer is the foundation of your architecture.

Cloud Data Warehouses:

Platform     Strengths                                    Best For
Snowflake    Separation of compute/storage, ease of use   Multi-cloud, SQL-heavy
BigQuery     Serverless, ML integration                   GCP ecosystem
Databricks   Lakehouse, ML/AI                             Advanced analytics
Redshift     AWS integration                              AWS-native shops

Lakehouse Considerations:

  • Combines data lake flexibility with warehouse performance
  • Supports structured and unstructured data
  • Enables ML workloads alongside BI

4. Data Transformation

Transform raw data into analytics-ready models.

dbt (data build tool): The standard for transformation in the modern stack.

-- models/marts/customers/dim_customers.sql
{{ config(materialized='table') }}

with customers as (
    select * from {{ ref('stg_customers') }}
),

orders as (
    select * from {{ ref('stg_orders') }}
),

customer_orders as (
    select
        customer_id,
        count(*) as order_count,
        sum(amount) as lifetime_value
    from orders
    group by customer_id
)

select
    c.customer_id,
    c.name,
    c.email,
    co.order_count,
    co.lifetime_value
from customers c
left join customer_orders co on c.customer_id = co.customer_id
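For intuition, the model's logic restated in plain Python (the sample data is made up; dbt runs the equivalent SQL inside the warehouse):

```python
from collections import defaultdict

# Hypothetical staging data standing in for stg_customers / stg_orders.
customers = [
    {"customer_id": 1, "name": "Ada", "email": "ada@example.com"},
    {"customer_id": 2, "name": "Grace", "email": "grace@example.com"},
]
orders = [
    {"customer_id": 1, "amount": 30.0},
    {"customer_id": 1, "amount": 70.0},
]

# Aggregate orders per customer (the customer_orders CTE).
agg = defaultdict(lambda: {"order_count": 0, "lifetime_value": 0.0})
for o in orders:
    agg[o["customer_id"]]["order_count"] += 1
    agg[o["customer_id"]]["lifetime_value"] += o["amount"]

# Left join onto customers (the final select); customers without
# orders get None, mirroring SQL NULLs.
dim_customers = [
    {
        **c,
        "order_count": agg[c["customer_id"]]["order_count"]
        if c["customer_id"] in agg else None,
        "lifetime_value": agg[c["customer_id"]]["lifetime_value"]
        if c["customer_id"] in agg else None,
    }
    for c in customers
]
```

The left join matters: dropping customers with no orders (an inner join) would silently shrink the dimension table.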

5. Data Orchestration

Coordinate and schedule data pipelines.

Popular Tools:

  • Airflow / Astronomer
  • Dagster
  • Prefect
  • dbt Cloud

Key Features:

  • Dependency management
  • Failure handling and alerting
  • Monitoring and observability
  • Backfill capabilities
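Under the hood, every one of these orchestrators models a pipeline as a DAG of tasks and runs each task only after its upstream dependencies finish. A stdlib-only sketch of that core idea (the task names and `run_pipeline` helper are illustrative, not Airflow's API):

```python
from graphlib import TopologicalSorter

# Pipeline DAG: each task maps to the set of tasks it depends on.
DAG = {
    "ingest_orders": set(),
    "ingest_customers": set(),
    "dbt_staging": {"ingest_orders", "ingest_customers"},
    "dbt_marts": {"dbt_staging"},
    "refresh_dashboards": {"dbt_marts"},
}

def run_pipeline(dag):
    """Execute tasks in dependency order. A real orchestrator would also
    retry failures, alert on errors, and parallelize independent tasks."""
    executed = []
    for task in TopologicalSorter(dag).static_order():
        executed.append(task)  # stand-in for actually running the task
    return executed

order = run_pipeline(DAG)
```

A production scheduler adds the features listed above on top of this ordering: retries, alerting, observability, and backfills over historical date ranges.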

6. Business Intelligence

Enable self-service analytics and reporting.

BI Platforms:

  • Looker
  • Tableau
  • Power BI
  • Metabase
  • Mode

Semantic Layer: Consider adding one so every BI tool reports metrics from a single set of definitions:

  • dbt Metrics
  • Cube
  • AtScale

7. Data Observability

Monitor data quality and pipeline health.

Observability Tools:

  • Monte Carlo
  • Anomalo
  • Great Expectations
  • Soda

Key Capabilities:

  • Freshness monitoring
  • Volume tracking
  • Schema change detection
  • Anomaly detection
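The first two capabilities can be approximated with simple checks before adopting a dedicated tool. A stdlib-only sketch of freshness and volume monitoring (the `check_table` helper and thresholds are illustrative assumptions, not any product's API):

```python
from datetime import datetime, timedelta, timezone

def check_table(last_loaded_at, row_count, expected_rows,
                max_staleness=timedelta(hours=24), volume_tolerance=0.5):
    """Return alert strings for a table based on freshness and volume."""
    alerts = []
    now = datetime.now(timezone.utc)
    if now - last_loaded_at > max_staleness:
        alerts.append("stale: no load in the last 24h")
    # Flag loads deviating more than 50% from the expected daily volume.
    if expected_rows:
        drift = abs(row_count - expected_rows) / expected_rows
        if drift > volume_tolerance:
            alerts.append(
                f"volume anomaly: got {row_count}, expected ~{expected_rows}"
            )
    return alerts

healthy = check_table(datetime.now(timezone.utc), 1000, 1100)
broken = check_table(datetime.now(timezone.utc) - timedelta(days=2), 100, 1100)
```

Dedicated observability platforms add the harder parts: learned baselines instead of fixed thresholds, schema change detection, and lineage-aware alert routing.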

Reference Architecture

┌─────────────────────────────────────────────────────────────┐
│                         DATA SOURCES                         │
│   Databases    SaaS Apps    Events    Files    APIs         │
└──────────────────────────┬──────────────────────────────────┘
                           │
                           ▼
┌─────────────────────────────────────────────────────────────┐
│                    INGESTION LAYER                           │
│              Fivetran / Airbyte / Kafka                     │
└──────────────────────────┬──────────────────────────────────┘
                           │
                           ▼
┌─────────────────────────────────────────────────────────────┐
│                    STORAGE LAYER                             │
│         Snowflake / BigQuery / Databricks                   │
│   ┌─────────────┐  ┌─────────────┐  ┌─────────────┐        │
│   │  Raw Layer  │→ │ Staging     │→ │ Marts       │        │
│   └─────────────┘  └─────────────┘  └─────────────┘        │
└──────────────────────────┬──────────────────────────────────┘
                           │
                           ▼
┌─────────────────────────────────────────────────────────────┐
│                 TRANSFORMATION LAYER                         │
│                       dbt                                    │
└──────────────────────────┬──────────────────────────────────┘
                           │
                           ▼
┌─────────────────────────────────────────────────────────────┐
│                   CONSUMPTION LAYER                          │
│   BI Tools    Notebooks    Applications    ML Platforms     │
└─────────────────────────────────────────────────────────────┘

Summary

The modern data stack provides a flexible, scalable foundation for data-driven organizations. Key principles:

  1. Choose cloud-native tools that scale with your needs
  2. Embrace modularity with best-of-breed components
  3. Invest in transformation with dbt for maintainable data models
  4. Prioritize observability to catch issues before they impact decisions
  5. Enable self-service with semantic layers and quality BI tools

Next Steps

  1. Assess your current architecture against modern patterns
  2. Identify your highest-priority use cases
  3. Select your core storage platform
  4. Implement dbt for transformations
  5. Add observability early, not as an afterthought

Process Flow Diagram

flowchart TD
  A["Designing a Modern Data Stack Architecture"] --> B["Define scope & priority use cases"]
  B --> C["Select core storage platform"]
  C --> D["Add ingestion & dbt transformations"]
  D --> E["Layer on BI & observability, then iterate"]


Frequently asked questions

What is the difference between modern data stack and traditional data warehouse?

Traditional: on-premise, tightly coupled storage/compute, rigid schemas, long implementation. Modern: cloud-native, decoupled architecture, flexible schemas (ELT vs ETL), rapid deployment, consumption-based pricing.

Which cloud data warehouse should I choose?

Snowflake: best all-around, multi-cloud. Databricks: best for ML/AI workloads and data science teams. BigQuery: best for Google Cloud ecosystems and serverless simplicity. Redshift: best for existing AWS customers with cost constraints.

What is dbt and why is it important?

dbt (data build tool) is the transformation layer that turns raw data into analytics-ready models using SQL and Jinja. It brings software engineering best practices (version control, testing, documentation) to analytics engineering.