AI Data ETL Pipeline

AI Data ETL Pipeline Code

Stack:

Python, Terraform (HCL), JSON

GitHub:

View on GitHub

Description: Built a Python-based, containerized ETL pipeline that processes large-scale Common Crawl WET files into AI-ready datasets. All data processing stages (ingestion, filtering, toxicity screening, deduplication, normalization, and tokenization) are implemented in modular Python scripts, each running in its own Docker container on AWS Fargate via AWS Batch. Infrastructure, including VPC networking, IAM roles, S3 storage, Batch compute environments, and Step Functions orchestration, is fully provisioned and managed with Terraform for repeatable deployments.

Python

Structure: Modular, stage-based scripts (ingest.py, filter.py, toxicity_filter.py, deduplicate.py, global_deduplicate.py, normalize.py, tokenize_llama.py), each isolated in its own container image.
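
For illustration, each stage script could share an entrypoint pattern like the sketch below; the CLI flags are assumptions about how AWS Batch might invoke the containers, not the project's exact interface:

import argparse

def run_stage(input_uri: str, output_uri: str) -> None:
    # Stage-specific logic (filtering, deduplication, etc.) goes here.
    ...

def main() -> None:
    parser = argparse.ArgumentParser(description="One pipeline stage, run in its own container")
    parser.add_argument("--input-uri", required=True, help="S3 prefix produced by the previous stage")
    parser.add_argument("--output-uri", required=True, help="S3 prefix for this stage's output")
    args = parser.parse_args()
    run_stage(args.input_uri, args.output_uri)

if __name__ == "__main__":
    main()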

Ingestion: Streams WET files from Common Crawl via smart_open and stores extracted text in S3.
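
A minimal ingestion sketch, assuming warcio is used to parse WET records and that the URL, bucket, and key shown are placeholders; the actual ingest.py may differ:

from smart_open import open as sopen
from warcio.archiveiterator import ArchiveIterator

WET_URL = "https://data.commoncrawl.org/crawl-data/example.warc.wet.gz"  # hypothetical WET file path
OUTPUT_URI = "s3://example-etl-bucket/ingested/example.txt"              # hypothetical S3 destination

def ingest(wet_url: str, output_uri: str) -> None:
    # Stream the gzipped WET file without downloading it to disk and write
    # each record's extracted text to S3 as plain text.
    with sopen(wet_url, "rb") as src, sopen(output_uri, "w", encoding="utf-8") as dst:
        for record in ArchiveIterator(src):
            if record.rec_type != "conversion":  # WET text records use the 'conversion' type
                continue
            text = record.content_stream().read().decode("utf-8", errors="replace")
            dst.write(text.strip() + "\n\n")

if __name__ == "__main__":
    ingest(WET_URL, OUTPUT_URI)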

Filtering: Uses fastText for language detection and jusText for boilerplate removal.
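
A sketch of how the two libraries could be combined; the model filename, language label, and confidence threshold are assumptions, and jusText expects HTML input, so the exact point where it runs in the pipeline may differ:

import fasttext
import justext

LID_MODEL = fasttext.load_model("lid.176.bin")  # pretrained fastText language-ID model (assumed filename)

def is_english(text: str, threshold: float = 0.8) -> bool:
    # fastText returns labels such as '__label__en' with a confidence score;
    # newlines must be removed before calling predict().
    labels, probs = LID_MODEL.predict(text.replace("\n", " "), k=1)
    return labels[0] == "__label__en" and probs[0] >= threshold

def strip_boilerplate(html: str) -> str:
    # jusText classifies paragraphs as boilerplate or main content.
    paragraphs = justext.justext(html, justext.get_stoplist("English"))
    return "\n".join(p.text for p in paragraphs if not p.is_boilerplate)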

Toxicity Screening: Integrates Detoxify for content filtering.
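
A minimal screening sketch; the 0.5 threshold and the choice of the "original" Detoxify checkpoint are assumptions:

from detoxify import Detoxify

DETECTOR = Detoxify("original")  # pretrained 'original' checkpoint

def keep_document(text: str, threshold: float = 0.5) -> bool:
    # predict() returns scores such as {'toxicity': 0.02, 'insult': 0.01, ...};
    # documents scoring above the threshold are dropped.
    scores = DETECTOR.predict(text)
    return scores["toxicity"] < threshold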

Deduplication: Implements MinHash-based near-duplicate detection across datasets.
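
A sketch of per-file near-duplicate detection using the datasketch library, which is an assumption (the description only states that MinHash is used); the shingle size, permutation count, and similarity threshold are illustrative:

from datasketch import MinHash, MinHashLSH

def minhash_of(text: str, num_perm: int = 128, shingle: int = 5) -> MinHash:
    # Hash overlapping word shingles into a MinHash signature.
    m = MinHash(num_perm=num_perm)
    tokens = text.split()
    for i in range(max(len(tokens) - shingle + 1, 1)):
        m.update(" ".join(tokens[i:i + shingle]).encode("utf-8"))
    return m

def deduplicate(docs: dict, threshold: float = 0.8) -> list:
    # Keep a document only if no previously kept document is near-identical.
    lsh = MinHashLSH(threshold=threshold, num_perm=128)
    kept = []
    for doc_id, text in docs.items():
        m = minhash_of(text)
        if not lsh.query(m):        # no near-duplicate indexed yet
            lsh.insert(doc_id, m)
            kept.append(doc_id)
    return kept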

Global Deduplication: A separate stage merges the MinHash signatures from all previously processed files and compares them across the entire corpus, removing duplicates that span files and ensuring uniqueness at scale.
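
A sketch of the global pass, assuming each per-file stage writes its MinHash signatures to S3 as pickled dictionaries and that datasketch's LSH index is reused here; the shard layout is illustrative:

import pickle
from smart_open import open as sopen
from datasketch import MinHashLSH

def global_deduplicate(shard_uris: list, threshold: float = 0.8) -> set:
    # Merge per-file signature shards into one corpus-wide LSH index and keep
    # only document IDs with no near-duplicate anywhere in the corpus.
    lsh = MinHashLSH(threshold=threshold, num_perm=128)
    unique_ids = set()
    for uri in shard_uris:                  # e.g. s3://example-etl-bucket/signatures/shard-000.pkl
        with sopen(uri, "rb") as f:
            signatures = pickle.load(f)     # {doc_id: MinHash} written by the per-file stage
        for doc_id, m in signatures.items():
            if not lsh.query(m):
                lsh.insert(doc_id, m)
                unique_ids.add(doc_id)
    return unique_ids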

Logging: Each script writes structured logs to CloudWatch via stdout for centralized monitoring.
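
A minimal structured-logging sketch; field names are illustrative. Since AWS Batch forwards container stdout to CloudWatch Logs, one JSON object per line is enough:

import json
import sys
import time

def log(event: str, **fields) -> None:
    # Emit one JSON object per line on stdout; the awslogs driver ships it to CloudWatch.
    record = {"ts": time.time(), "event": event, **fields}
    sys.stdout.write(json.dumps(record) + "\n")
    sys.stdout.flush()

log("stage_started", stage="filter", input_uri="s3://example-etl-bucket/ingested/")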

Security: S3 paths, credentials, and API keys are injected via AWS Secrets Manager; no hardcoded secrets.
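
One way a stage could read those values at runtime, sketched with boto3; the secret name and JSON layout are hypothetical, and in practice the values may instead arrive as environment variables injected by the Batch job definition:

import json
import boto3

def load_secret(name: str) -> dict:
    # Fetch and parse a JSON secret from AWS Secrets Manager.
    client = boto3.client("secretsmanager")
    response = client.get_secret_value(SecretId=name)
    return json.loads(response["SecretString"])

config = load_secret("etl/pipeline-config")   # hypothetical secret name
output_bucket = config["output_bucket"]       # hypothetical key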

Terraform

IaC: Defines all AWS resources: VPC, subnets, NAT Gateway, IAM roles/policies, S3 buckets, ECR repos, AWS Batch compute environments, job definitions, and the Step Functions state machine.

Security in Code:

Parameterization: Variables for WET file list, container image tags, and stage concurrency limits.

Docker

Images: One per pipeline stage, built from minimal Python base images for security and faster cold starts.

Build Process: Multi-stage Dockerfiles for small final image sizes.

ECR Integration: Images tagged by stage and version, pushed to ECR via IAM-authenticated CLI.

Security: Non-root user in all containers, read-only filesystem where applicable, pinned dependencies in requirements.txt.

Optimization: Layer caching for dependencies to reduce rebuild time during iteration.

Outcome

A reproducible, automated pipeline capable of processing hundreds of GBs of web text in parallel into a clean, deduplicated, and tokenized corpus suitable for large-scale AI model development. The codebase is fully modular, allowing new stages to be added without altering existing infrastructure.