AI Data ETL Pipeline Architecture
Services used:
IAM, VPC, S3, AWS Batch, Fargate, ECR, ECS, Step Functions, CloudWatch
Languages used:
Python, Terraform (HCL), JSON
Objective: Orchestrate a scalable, fault-tolerant ETL pipeline that ingests Common Crawl WET files into S3,
extracts/filters high-quality text, deduplicates, applies toxicity screening, normalizes and tokenizes, and writes AI-ready shards
back to S3. Containerize stages with Docker and publish images to ECR. Provision the infrastructure with Terraform and orchestrate with Step Functions (Map)
to fan out per-file jobs submitted to AWS Batch running on ECS Fargate in private subnets. Provide egress via NAT for ECR pulls, use
an S3 gateway endpoint for data I/O, stream logs to CloudWatch, and secure access with least-privilege IAM. The result is an automated,
resumable pipeline that prepares large-scale datasets for AI model development.
AWS Architecture
IAM
Configured least-privilege IAM roles for every component of the pipeline, including separate execution roles for AWS Batch,
Step Functions, and ECS tasks. Attached only the minimum policies each role needs; containers obtain short-lived credentials
through their task roles at runtime, so no static AWS keys are ever baked into images or environment variables.
Roles: AWSBatchServiceRole, BatchJobExecutionRole, EcsTaskExecutionRole, EcsTaskRole, StepFunctionsExecutionRole.
Key permissions (least privilege):
- BatchJobExecutionRole: s3:GetObject/PutObject scoped to arn:aws:s3:::my-cc-pipeline-s3/*, logs:CreateLogStream/PutLogEvents to stage log groups.
- EcsTaskExecutionRole: ecr:GetAuthorizationToken, ecr:BatchCheckLayerAvailability, ecr:GetDownloadUrlForLayer, and ecr:BatchGetImage (the full action set needed to pull private images).
- StepFunctionsExecutionRole: batch:SubmitJob, batch:DescribeJobs, and narrow iam:PassRole only to BatchJobExecutionRole, as sketched below.
Guardrails: Deny wildcard s3:* and unrestricted iam:PassRole; enforce MFA for console; use KMS CMKs on logs and buckets.
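To make the guardrails concrete, here is a minimal Terraform sketch of the StepFunctionsExecutionRole described above. The resource names and the variable supplying the BatchJobExecutionRole ARN are illustrative; the EventBridge statement reflects what the submitJob.sync integration's managed rule requires.

```hcl
# ARN of BatchJobExecutionRole: the only role Step Functions is allowed to pass.
variable "batch_job_execution_role_arn" {
  type = string
}

# Trust policy: only the Step Functions service may assume this role.
data "aws_iam_policy_document" "sfn_trust" {
  statement {
    actions = ["sts:AssumeRole"]
    principals {
      type        = "Service"
      identifiers = ["states.amazonaws.com"]
    }
  }
}

resource "aws_iam_role" "step_functions_execution" {
  name               = "StepFunctionsExecutionRole"
  assume_role_policy = data.aws_iam_policy_document.sfn_trust.json
}

data "aws_iam_policy_document" "sfn_permissions" {
  # Batch job control only; add batch:TerminateJob if executions may be aborted.
  statement {
    actions   = ["batch:SubmitJob", "batch:DescribeJobs"]
    resources = ["*"]
  }
  # Narrow PassRole: no wildcard, only the job role.
  statement {
    actions   = ["iam:PassRole"]
    resources = [var.batch_job_execution_role_arn]
  }
  # Managed EventBridge rule the .sync integration uses to watch job completion.
  statement {
    actions   = ["events:PutTargets", "events:PutRule", "events:DescribeRule"]
    resources = ["arn:aws:events:*:*:rule/StepFunctionsGetEventsForBatchJobsRule"]
  }
}

resource "aws_iam_role_policy" "sfn_least_privilege" {
  name   = "sfn-batch-least-privilege"
  role   = aws_iam_role.step_functions_execution.id
  policy = data.aws_iam_policy_document.sfn_permissions.json
}
```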
VPC
Deployed the pipeline inside a VPC with private subnets for compute (AWS Batch on ECS Fargate) to isolate processing
from the public internet.
- Compute in private subnets; no public IPs.
- NAT Gateway only for controlled egress (ECR auth/image pulls).
- Endpoints: S3 Gateway Endpoint (with a bucket policy that restricts access via the aws:SourceVpce condition), plus Interface Endpoints for CloudWatch Logs and Secrets Manager; see the Terraform sketch after this list.
- Security groups: Egress-only to S3/CloudWatch endpoints; no inbound from internet.
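A Terraform sketch of that endpoint wiring, assuming the VPC, private route table and subnets, and an endpoint security group (aws_vpc.pipeline, aws_route_table.private, aws_subnet.private, aws_security_group.endpoints) are defined elsewhere:

```hcl
data "aws_region" "current" {}

# Gateway endpoint: S3 data traffic stays on the AWS network and bypasses the NAT,
# avoiding per-GB NAT processing charges on bulk WET-file I/O.
resource "aws_vpc_endpoint" "s3" {
  vpc_id            = aws_vpc.pipeline.id
  service_name      = "com.amazonaws.${data.aws_region.current.name}.s3"
  vpc_endpoint_type = "Gateway"
  route_table_ids   = [aws_route_table.private.id]
}

# Interface endpoints: tasks in private subnets reach CloudWatch Logs and
# Secrets Manager without touching the NAT Gateway.
resource "aws_vpc_endpoint" "interface" {
  for_each            = toset(["logs", "secretsmanager"])
  vpc_id              = aws_vpc.pipeline.id
  service_name        = "com.amazonaws.${data.aws_region.current.name}.${each.key}"
  vpc_endpoint_type   = "Interface"
  subnet_ids          = aws_subnet.private[*].id
  security_group_ids  = [aws_security_group.endpoints.id]
  private_dns_enabled = true
}
```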
S3
Used Amazon S3 as the primary storage layer for every stage:
- Layout: a single pipeline bucket with per-stage prefixes (…/extracted/, …/filtered/, …/detoxified/, …/deduplicated/, …/global_deduplicated/, …/normalized/, …/tokenized/).
- Bucket Policies: require aws:SecureTransport=true, restrict data access to the VPC endpoint, and allow only the pipeline roles (sketched after this list).
- Encryption: SSE-KMS with a customer-managed key; Bucket Key enabled.
- Versioning + lifecycle to control cost; access logs to a separate log bucket.
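A sketch of those bucket policies in Terraform. The bucket name matches the ARN in the IAM section; the endpoint reference is illustrative, and the VPCe-only statement would need an exception (for example an aws:PrincipalArn condition) for any administration performed outside the VPC:

```hcl
data "aws_iam_policy_document" "bucket" {
  # Deny any request that arrives over plain HTTP.
  statement {
    sid     = "DenyInsecureTransport"
    effect  = "Deny"
    actions = ["s3:*"]
    resources = [
      "arn:aws:s3:::my-cc-pipeline-s3",
      "arn:aws:s3:::my-cc-pipeline-s3/*",
    ]
    principals {
      type        = "*"
      identifiers = ["*"]
    }
    condition {
      test     = "Bool"
      variable = "aws:SecureTransport"
      values   = ["false"]
    }
  }

  # Deny object access that does not come through the S3 gateway endpoint.
  statement {
    sid       = "DenyOutsideVpce"
    effect    = "Deny"
    actions   = ["s3:GetObject", "s3:PutObject"]
    resources = ["arn:aws:s3:::my-cc-pipeline-s3/*"]
    principals {
      type        = "*"
      identifiers = ["*"]
    }
    condition {
      test     = "StringNotEquals"
      variable = "aws:SourceVpce"
      values   = [aws_vpc_endpoint.s3.id]
    }
  }
}

resource "aws_s3_bucket_policy" "pipeline" {
  bucket = "my-cc-pipeline-s3"
  policy = data.aws_iam_policy_document.bucket.json
}
```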
AWS Batch
Orchestrated compute workloads with AWS Batch, which handled job queuing, retries, and scaling.
Job Definitions per stage with:
- Task size (vCPU/mem) tuned per stage.
- BatchJobExecutionRole as the job role.
- Secrets and configuration injected from Secrets Manager (no credentials baked into images).
- Retry strategies and timeouts per stage, as in the job-definition sketch below.
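A sketch of one such job definition; the stage name, image URI, and var.* inputs are placeholders:

```hcl
resource "aws_batch_job_definition" "extract" {
  name                  = "cc-extract"
  type                  = "container"
  platform_capabilities = ["FARGATE"]

  # Re-run transient failures (network hiccups, S3 throttling) up to 3 times.
  retry_strategy {
    attempts = 3
  }

  # Kill stuck jobs after one hour so a bad WET file cannot wedge the pipeline.
  timeout {
    attempt_duration_seconds = 3600
  }

  container_properties = jsonencode({
    image            = "123456789012.dkr.ecr.us-east-1.amazonaws.com/cc-extract:latest"
    jobRoleArn       = var.batch_job_role_arn           # BatchJobExecutionRole
    executionRoleArn = var.ecs_task_execution_role_arn  # EcsTaskExecutionRole
    fargatePlatformConfiguration = { platformVersion = "LATEST" }
    networkConfiguration         = { assignPublicIp = "DISABLED" }  # private subnets
    resourceRequirements = [
      { type = "VCPU",   value = "2" },     # tuned per stage
      { type = "MEMORY", value = "4096" },  # MiB
    ]
    # Pull runtime configuration from Secrets Manager rather than baking it in.
    secrets = [
      { name = "PIPELINE_CONFIG", valueFrom = var.pipeline_secret_arn }
    ]
  })
}
```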
ECS Fargate
Executed each Batch job on ECS Fargate, a serverless container runtime with no EC2 management overhead.
- Runs Batch jobs in private subnets; the compute environment and job queue are sketched after this list.
- Pulls from ECR via NAT.
- Task execution role only for image pulls/logs; separate task role for S3 data access.
- No host volumes; /tmp ephemeral only; read-only root FS where possible.
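A sketch of the Fargate compute environment and job queue, again with var.* standing in for values defined elsewhere:

```hcl
# Managed Fargate compute environment pinned to the private subnets; Batch
# launches and scales tasks against this capacity on demand.
resource "aws_batch_compute_environment" "fargate" {
  compute_environment_name = "cc-pipeline-fargate"
  type                     = "MANAGED"
  service_role             = var.batch_service_role_arn  # AWSBatchServiceRole

  compute_resources {
    type               = "FARGATE"
    max_vcpus          = 256                     # ceiling on concurrent capacity
    subnets            = var.private_subnet_ids  # no public IPs assigned
    security_group_ids = [var.task_security_group_id]
  }
}

resource "aws_batch_job_queue" "main" {
  name     = "cc-pipeline-queue"
  state    = "ENABLED"
  priority = 1

  compute_environment_order {
    order               = 1
    compute_environment = aws_batch_compute_environment.fargate.arn
  }
}
```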
ECR
Hosted all container images in Amazon Elastic Container Registry (ECR).
- Private repositories per stage/image.
- Lifecycle policy to expire old tags and keep storage costs down (see the sketch after this list).
- Repository policy limits pulls to the account's pipeline principals (the ECS task execution role).
- Image scanning on push; signed images.
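A sketch of the repository setup; the repo names here are shortened placeholders for the pipeline's stages:

```hcl
# One private repository per stage, scanned on every push.
resource "aws_ecr_repository" "stage" {
  for_each             = toset(["extract", "filter", "detox", "dedup", "normalize", "tokenize"])
  name                 = "cc-${each.key}"
  image_tag_mutability = "IMMUTABLE"  # tags cannot be silently repointed

  image_scanning_configuration {
    scan_on_push = true
  }
}

# Expire old images so the registry does not grow without bound.
resource "aws_ecr_lifecycle_policy" "stage" {
  for_each   = aws_ecr_repository.stage
  repository = each.value.name

  policy = jsonencode({
    rules = [{
      rulePriority = 1
      description  = "Keep only the 10 most recent images"
      selection = {
        tagStatus   = "any"
        countType   = "imageCountMoreThan"
        countNumber = 10
      }
      action = { type = "expire" }
    }]
  })
}
```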
Step Functions
Automated orchestration using AWS Step Functions (Map state) to fan out processing across multiple WET files in parallel.
- Map state fans out over the WET file list; MaxConcurrency capped at 100 parallel iterations.
- Service integration via batch:submitJob.sync, so each iteration blocks until its Batch job completes; see the state-machine sketch after this list.
- Backoff/retry for transient errors; catch to route to a failure/notification path.
- IAM: StepFunctionsExecutionRole limited to Batch APIs and narrow iam:PassRole to the job role only.
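A condensed Terraform sketch of the state machine. The input shape ($.wet_files), job names, and var.* references are assumptions; the structure (Map fan-out, .sync task, retry/catch to a notification state) follows the description above:

```hcl
resource "aws_sfn_state_machine" "pipeline" {
  name     = "cc-etl-pipeline"
  role_arn = var.sfn_role_arn  # StepFunctionsExecutionRole

  definition = jsonencode({
    StartAt = "ProcessFiles"
    States = {
      ProcessFiles = {
        Type           = "Map"
        ItemsPath      = "$.wet_files"  # list of WET file keys in the input
        MaxConcurrency = 100            # cap on parallel per-file jobs
        Iterator = {
          StartAt = "ExtractJob"
          States = {
            ExtractJob = {
              Type     = "Task"
              # .sync: the state blocks until the Batch job succeeds or fails.
              Resource = "arn:aws:states:::batch:submitJob.sync"
              Parameters = {
                JobName       = "extract"
                JobQueue      = var.job_queue_arn
                JobDefinition = var.job_definition_arn
                Parameters    = { "wet_path.$" = "$" }
              }
              # Exponential backoff for transient Batch/API errors.
              Retry = [{
                ErrorEquals     = ["Batch.AWSBatchException", "States.TaskFailed"]
                IntervalSeconds = 30
                BackoffRate     = 2.0
                MaxAttempts     = 3
              }]
              # Anything still failing is routed to the notification path.
              Catch = [{
                ErrorEquals = ["States.ALL"]
                Next        = "NotifyFailure"
              }]
              End = true
            }
            NotifyFailure = {
              Type     = "Task"
              Resource = "arn:aws:states:::sns:publish"
              Parameters = {
                TopicArn    = var.failure_topic_arn
                "Message.$" = "States.JsonToString($)"
              }
              End = true
            }
          }
        }
        End = true
      }
    }
  })
}
```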
CloudWatch
Centralized logging and monitoring with Amazon CloudWatch Logs and CloudWatch Metrics.
- Log groups per stage with KMS encryption.
- Metrics/Alarms: job failure rate, Step Functions execution status, and runtime SLOs, backed by log-based metric filters.
- Retention tuned per stage; alarms notify Slack via SNS. A sketch of the log group, metric filter, and alarm follows.
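A sketch of one stage's logging and alerting wiring; names, thresholds, and the filter pattern are illustrative:

```hcl
# Per-stage log group: KMS-encrypted, retention tuned to that stage's needs.
resource "aws_cloudwatch_log_group" "extract" {
  name              = "/cc-pipeline/extract"
  retention_in_days = 30
  kms_key_id        = var.logs_kms_key_arn  # customer-managed KMS key
}

# Metric filter: turn ERROR lines into a countable CloudWatch metric.
resource "aws_cloudwatch_log_metric_filter" "extract_errors" {
  name           = "extract-errors"
  log_group_name = aws_cloudwatch_log_group.extract.name
  pattern        = "ERROR"

  metric_transformation {
    name      = "ExtractErrorCount"
    namespace = "CCPipeline"
    value     = "1"
  }
}

# Alarm: fires into the SNS topic that forwards to Slack.
resource "aws_cloudwatch_metric_alarm" "extract_errors" {
  alarm_name          = "cc-extract-error-rate"
  namespace           = "CCPipeline"
  metric_name         = "ExtractErrorCount"
  statistic           = "Sum"
  period              = 300
  evaluation_periods  = 1
  threshold           = 5
  comparison_operator = "GreaterThanThreshold"
  treat_missing_data  = "notBreaching"  # no logs is not an outage
  alarm_actions       = [var.slack_topic_arn]
}
```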