AI / NLP Engineering: LLM Pretraining Data Pipeline

View on GitHub

Overview

This project is a scalable, cloud-native AI data processing pipeline designed to prepare high-quality multilingual datasets for LLM pretraining. It combines advanced NLP techniques with robust AWS infrastructure to ingest, filter, clean, deduplicate, normalize, and tokenize billions of tokens from large-scale text sources such as Common Crawl.

The system ensures that only clean, relevant, and safe text is passed into the model training stage, directly impacting downstream LLM accuracy, safety, and efficiency.
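
As an illustration of the ingestion step, the sketch below iterates over the pages in a locally downloaded Common Crawl WARC segment. The warcio library and the file name are assumptions made for this example only; they are not part of the project's documented stack.

    # Minimal ingestion sketch (illustrative): iterate fetched pages in a
    # Common Crawl WARC segment and hand the raw HTML to the NLP stages.
    from warcio.archiveiterator import ArchiveIterator

    with open("CC-MAIN-example.warc.gz", "rb") as stream:   # hypothetical local segment
        for record in ArchiveIterator(stream):
            if record.rec_type != "response":               # keep only fetched pages
                continue
            url = record.rec_headers.get_header("WARC-Target-URI")
            html = record.content_stream().read()           # raw HTML bytes
            # ...pass `html` (with `url` kept for provenance) to the NLP components below...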

AI / NLP Components

Language Identification: fastText

Content Extraction: jusText

Toxicity Filtering: Detoxify

Normalization: ftfy / Moses

Tokenization: SentencePiece (LLaMA style)
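
A minimal single-document sketch of how these components chain together is shown below. The model paths, the English-only stoplist, and both score thresholds are illustrative assumptions (the actual pipeline is multilingual and runs each stage as a separate job), and Moses normalization is shown via the sacremoses port.

    # Illustrative single-document pass through the five components above.
    # Model files, thresholds, and the English-only focus are assumptions.
    import fasttext
    import ftfy
    import justext
    import sentencepiece as spm
    from detoxify import Detoxify
    from sacremoses import MosesPunctNormalizer

    lid_model = fasttext.load_model("lid.176.bin")           # fastText language-ID model
    toxicity_model = Detoxify("original")                     # Detoxify classifier
    normalizer = MosesPunctNormalizer(lang="en")              # Moses punctuation rules (sacremoses)
    tokenizer = spm.SentencePieceProcessor(model_file="tokenizer.model")  # LLaMA-style SentencePiece model

    def process_document(html: bytes) -> list[int] | None:
        """Return SentencePiece token IDs for one page, or None if it is filtered out."""
        # 1. Content extraction: drop boilerplate paragraphs with jusText.
        paragraphs = justext.justext(html, justext.get_stoplist("English"))
        text = "\n".join(p.text for p in paragraphs if not p.is_boilerplate)
        if not text.strip():
            return None

        # 2. Language identification with fastText (predict() does not accept newlines).
        labels, probs = lid_model.predict(text.replace("\n", " "))
        if labels[0] != "__label__en" or probs[0] < 0.8:      # threshold is an assumption
            return None

        # 3. Toxicity filtering: drop documents Detoxify scores as toxic.
        if toxicity_model.predict(text)["toxicity"] > 0.5:    # threshold is an assumption
            return None

        # 4. Normalization: repair mojibake with ftfy, then apply Moses punctuation rules.
        text = normalizer.normalize(ftfy.fix_text(text))

        # 5. Tokenization: encode to token IDs with the SentencePiece model.
        return tokenizer.encode(text, out_type=int)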

Cloud Integration (AWS)

View Full Cloud Page
  • AWS Batch (on ECS Fargate) runs each AI/NLP task in an isolated compute environment.
  • Amazon ECR stores Docker images for each AI/NLP pipeline stage, ensuring reproducible deployments.
  • AWS Step Functions orchestrates the stages, processing multiple files in parallel with per-stage monitoring.
  • Amazon S3 stores intermediate outputs at every stage for reproducibility.
  • CloudWatch Logs provides traceability for AI processing steps.
  • Terraform provisions all cloud resources as code for consistent deployment.
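
A short boto3 sketch of the per-file fan-out is shown below. The bucket, prefix, and state machine ARN are placeholders, and starting one execution per input file is an assumption about how the parallelism is wired (a Step Functions Map state would be an alternative).

    # Illustrative fan-out: one Step Functions execution per raw input object in S3.
    # Bucket, prefix, and ARN are placeholders, not the project's real values.
    import json
    import boto3

    s3 = boto3.client("s3")
    sfn = boto3.client("stepfunctions")

    STATE_MACHINE_ARN = "arn:aws:states:us-east-1:123456789012:stateMachine:llm-data-pipeline"

    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket="example-raw-data", Prefix="common-crawl/"):
        for obj in page.get("Contents", []):
            # Each execution walks one file through the Batch-backed NLP stages.
            sfn.start_execution(
                stateMachineArn=STATE_MACHINE_ARN,
                input=json.dumps({"input_key": obj["Key"]}),
            )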

Code Implementation

View Full Code Page
  • Python 3.10 modular ETL scripts for each NLP stage.
  • Hydra configuration for reproducible parameter tuning.
  • boto3 for seamless AWS service integration.
  • S3-based logging for per-file metrics (kept/skipped line counts).
  • Docker for fully containerized portable execution in AWS Batch.
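
The sketch below shows the rough shape of one Hydra-driven stage entry point that writes per-file kept/skipped counts back to S3. The config keys (input_key, metrics_bucket, stage_name), the metrics key layout, and the presence of a conf/config.yaml supplying them are assumptions for illustration.

    # Illustrative stage entry point: Hydra supplies the config, boto3 writes
    # per-file metrics to S3. Config keys and the key layout are assumptions.
    import json
    import boto3
    import hydra
    from omegaconf import DictConfig

    @hydra.main(config_path="conf", config_name="config", version_base=None)
    def main(cfg: DictConfig) -> None:
        s3 = boto3.client("s3")
        kept, skipped = 0, 0

        # ...run this stage over cfg.input_key, counting kept and skipped lines...

        metrics = {"input_key": cfg.input_key, "kept": kept, "skipped": skipped}
        s3.put_object(
            Bucket=cfg.metrics_bucket,
            Key=f"metrics/{cfg.stage_name}/{cfg.input_key}.json",
            Body=json.dumps(metrics).encode("utf-8"),
        )

    if __name__ == "__main__":
        main()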

Impact

This project bridges cloud infrastructure and AI data engineering:

  • Handles large-scale noisy web data efficiently.
  • Improves dataset quality for safer and more accurate LLM pretraining.
  • Showcases end-to-end ownership: cloud provisioning, AI/NLP integration, and scalable deployment.