AI Data ETL Pipeline
AI Data ETL Pipeline Architecture

Services used:

IAM, VPC, S3, AWS Batch, Fargate, ECR, ECS, Step Functions, CloudWatch

Languages used:

Python, Terraform (HCL), JSON

Objective: Orchestrate a scalable, fault-tolerant ETL pipeline that ingests Common Crawl WET files into S3, extracts and filters high-quality text, deduplicates, applies toxicity screening, normalizes and tokenizes, and writes AI-ready shards back to S3. Containerize each stage with Docker and publish the images to ECR. Automate orchestration with Terraform, using a Step Functions Map state to fan out per-file jobs and submit them to AWS Batch running on ECS Fargate in private subnets. Provide egress via NAT for ECR pulls, use an S3 gateway endpoint for data I/O, stream logs to CloudWatch, and secure access with least-privilege IAM. The result is an automated, resumable pipeline that prepares large-scale datasets for AI model development.

AWS Architecture

IAM

Configured least-privilege IAM roles for every component of the pipeline, including separate execution roles for AWS Batch, Step Functions, and ECS tasks. Attached only the minimum policies needed; containers obtain temporary credentials through their ECS task role at runtime, so no static AWS keys are ever baked into images or passed as environment variables.

Roles: AWSBatchServiceRole, BatchJobExecutionRole, EcsTaskExecutionRole, EcsTaskRole, StepFunctionsExecutionRole.

Key permissions (least privilege): the Step Functions role can submit and track Batch jobs; the ECS task execution role can pull images from ECR and write to CloudWatch Logs; the ECS task role can read and write only the pipeline's S3 prefixes (see the policy sketch below).

Guardrails: deny wildcard s3:* and unrestricted iam:PassRole; enforce MFA for console access; encrypt logs and buckets with KMS CMKs.
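
As a rough illustration, the task role's S3 access could be expressed in Terraform along these lines; the bucket name, prefixes, and role reference are placeholders rather than the project's actual values:

```hcl
# Hypothetical bucket and prefix names; scope access to the pipeline data only.
data "aws_iam_policy_document" "etl_task_s3" {
  statement {
    sid     = "PipelineDataReadWrite"
    actions = ["s3:GetObject", "s3:PutObject"]
    resources = [
      "arn:aws:s3:::example-ai-etl-data/wet/*",
      "arn:aws:s3:::example-ai-etl-data/processed/*",
    ]
  }

  statement {
    sid       = "ListPipelineBucket"
    actions   = ["s3:ListBucket"]
    resources = ["arn:aws:s3:::example-ai-etl-data"]
  }
}

resource "aws_iam_role_policy" "etl_task_s3" {
  name   = "etl-task-s3-access"
  role   = aws_iam_role.ecs_task_role.id  # task role assumed to be defined elsewhere
  policy = data.aws_iam_policy_document.etl_task_s3.json
}
```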

VPC

Deployed the pipeline inside a VPC with private subnets for compute (AWS Batch on ECS Fargate) to isolate processing from the public internet. Egress for ECR image pulls goes through a NAT gateway, while S3 data I/O uses a gateway VPC endpoint so dataset traffic stays on the AWS network.
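
A minimal Terraform sketch of that layout, assuming a single private subnet and a public-subnet NAT gateway and route table defined elsewhere; region and CIDRs are placeholders:

```hcl
resource "aws_vpc" "etl" {
  cidr_block = "10.0.0.0/16"
}

resource "aws_subnet" "private" {
  vpc_id            = aws_vpc.etl.id
  cidr_block        = "10.0.1.0/24"
  availability_zone = "us-east-1a"
}

# Gateway endpoint keeps S3 data I/O on the AWS network instead of the NAT path.
resource "aws_vpc_endpoint" "s3" {
  vpc_id            = aws_vpc.etl.id
  service_name      = "com.amazonaws.us-east-1.s3"
  vpc_endpoint_type = "Gateway"
  route_table_ids   = [aws_route_table.private.id]  # private route table assumed elsewhere
}

# Default route through a NAT gateway (in a public subnet, not shown) for ECR image pulls.
resource "aws_route" "private_egress" {
  route_table_id         = aws_route_table.private.id
  destination_cidr_block = "0.0.0.0/0"
  nat_gateway_id         = aws_nat_gateway.egress.id  # assumed defined elsewhere
}
```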

S3

Used Amazon S3 as the primary storage layer for every stage: raw Common Crawl WET files land in an ingest prefix, and each stage (extraction/filtering, deduplication, toxicity screening, normalization/tokenization) reads the previous stage's prefix and writes its output to its own, ending with AI-ready shards.
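
Sketched in Terraform, the data bucket and the KMS encryption guardrail from the IAM section might look like this; the bucket name and key reference are placeholders:

```hcl
resource "aws_s3_bucket" "etl" {
  bucket = "example-ai-etl-data"  # placeholder name; prefixes mirror the stage order above
}

# Guardrail from the IAM section: encrypt objects with a customer-managed KMS key.
resource "aws_s3_bucket_server_side_encryption_configuration" "etl" {
  bucket = aws_s3_bucket.etl.id

  rule {
    apply_server_side_encryption_by_default {
      sse_algorithm     = "aws:kms"
      kms_master_key_id = aws_kms_key.etl.arn  # CMK assumed to be defined elsewhere
    }
  }
}
```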

AWS Batch

Orchestrated compute workloads with AWS Batch, which handled job queuing, retries, and scaling.

Defined a Job Definition per stage, each specifying the stage's ECR image, Fargate vCPU/memory requirements, a retry strategy, the task and execution IAM roles, and CloudWatch log configuration (one is sketched below).
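
A hedged sketch of one such job definition for a hypothetical "extract" stage; the image tag, resource sizes, and role references are assumptions:

```hcl
resource "aws_batch_job_definition" "extract" {
  name                  = "etl-extract"
  type                  = "container"
  platform_capabilities = ["FARGATE"]

  retry_strategy {
    attempts = 3  # Batch retries transient failures before marking the job failed
  }

  container_properties = jsonencode({
    image            = "${aws_ecr_repository.extract.repository_url}:latest"  # repository shown in the ECR section
    jobRoleArn       = aws_iam_role.ecs_task_role.arn       # S3 data access
    executionRoleArn = aws_iam_role.ecs_task_execution.arn  # ECR pulls + log writes
    resourceRequirements = [
      { type = "VCPU", value = "2" },
      { type = "MEMORY", value = "4096" },
    ]
    networkConfiguration = { assignPublicIp = "DISABLED" }  # jobs stay in private subnets
    logConfiguration     = { logDriver = "awslogs" }        # streams to CloudWatch Logs
  })
}
```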

ECS Fargate

Executed each Batch job on ECS Fargate, a serverless container runtime with no EC2 management overhead.
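
The Fargate compute environment and job queue could be declared roughly as below; sizing, names, and the security group and service role references are assumptions:

```hcl
resource "aws_batch_compute_environment" "fargate" {
  compute_environment_name = "etl-fargate"
  type                     = "MANAGED"
  service_role             = aws_iam_role.batch_service.arn  # AWSBatchServiceRole assumed elsewhere

  compute_resources {
    type               = "FARGATE"
    max_vcpus          = 64
    subnets            = [aws_subnet.private.id]
    security_group_ids = [aws_security_group.etl_tasks.id]  # assumed defined elsewhere
  }
}

resource "aws_batch_job_queue" "etl" {
  name     = "etl-queue"
  state    = "ENABLED"
  priority = 1

  compute_environment_order {
    order               = 1
    compute_environment = aws_batch_compute_environment.fargate.arn
  }
}
```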

ECR

Hosted all container images in Amazon Elastic Container Registry (ECR); each pipeline stage is built with Docker and pushed as its own tagged image.
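
For illustration, one repository per stage with image scanning enabled; only the "extract" repository is shown and the name is a placeholder:

```hcl
resource "aws_ecr_repository" "extract" {
  name = "ai-etl/extract"

  # Flag known vulnerabilities at push time.
  image_scanning_configuration {
    scan_on_push = true
  }
}
```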

Step Functions

Automated orchestration using an AWS Step Functions Map state to fan out processing across multiple WET files in parallel, with each iteration submitting a per-file job to AWS Batch.
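
A simplified sketch of the state machine in Terraform, showing only the first per-file stage inside the Map iterator; the input shape ($.wet_files), concurrency, and names are assumptions:

```hcl
resource "aws_sfn_state_machine" "etl" {
  name     = "ai-etl-pipeline"
  role_arn = aws_iam_role.sfn_exec.arn  # Step Functions execution role assumed elsewhere

  definition = jsonencode({
    StartAt = "ProcessFiles"
    States = {
      ProcessFiles = {
        Type           = "Map"
        ItemsPath      = "$.wet_files"  # list of WET file keys passed as execution input
        MaxConcurrency = 20
        Iterator = {
          StartAt = "ExtractFilter"
          States = {
            ExtractFilter = {
              # .sync waits for the Batch job to finish before the iteration completes.
              Type     = "Task"
              Resource = "arn:aws:states:::batch:submitJob.sync"
              Parameters = {
                JobName       = "extract"
                JobQueue      = aws_batch_job_queue.etl.arn
                JobDefinition = aws_batch_job_definition.extract.arn
                ContainerOverrides = {
                  Environment = [{ Name = "WET_KEY", "Value.$" = "$" }]
                }
              }
              End = true
            }
          }
        }
        End = true
      }
    }
  })
}
```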

CloudWatch

Centralized logging and monitoring with Amazon CloudWatch Logs and CloudWatch Metrics.
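
As an example, the default Batch log group and an alarm on failed Step Functions executions might be declared like this; retention and threshold values are placeholders:

```hcl
# Log group the awslogs driver in the job definitions writes to by default.
resource "aws_cloudwatch_log_group" "batch_jobs" {
  name              = "/aws/batch/job"
  retention_in_days = 30
}

# Alarm when any pipeline execution fails.
resource "aws_cloudwatch_metric_alarm" "pipeline_failures" {
  alarm_name          = "ai-etl-executions-failed"
  namespace           = "AWS/States"
  metric_name         = "ExecutionsFailed"
  statistic           = "Sum"
  period              = 300
  evaluation_periods  = 1
  threshold           = 1
  comparison_operator = "GreaterThanOrEqualToThreshold"

  dimensions = {
    StateMachineArn = aws_sfn_state_machine.etl.arn
  }
}
```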