AI / NLP Engineering: LLM Pretraining Data Pipeline
Overview
This project is a scalable, cloud-native AI data processing pipeline designed to prepare high-quality multilingual datasets for
LLM pretraining. It combines advanced NLP techniques with robust AWS infrastructure to ingest, filter, clean, deduplicate,
normalize, and tokenize billions of tokens from large-scale text sources such as Common Crawl.
The system ensures that only clean, relevant, and safe text is passed into the model training stage, directly impacting downstream
LLM accuracy, safety, and efficiency.
AI / NLP Components
Language Identification: fastText
- Detects and retains only English content from raw text data (configurable for other languages).
- Confidence thresholding reduces false positives on ambiguous or mixed-language text.
- Handles the noisy, multilingual documents common in web-scale data (see the sketch below).
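A minimal sketch of the confidence-threshold check described above, assuming the pretrained lid.176.bin language-ID model; the model path, threshold value, and function name are illustrative, not the project's exact code:

```python
# Illustrative sketch: keep a document only if fastText identifies it as
# English above a confidence threshold. Model path and threshold are examples.
import fasttext

LID_MODEL = fasttext.load_model("lid.176.bin")  # pretrained language-ID model
CONFIDENCE_THRESHOLD = 0.90                     # tune per dataset / language

def is_english(text: str, threshold: float = CONFIDENCE_THRESHOLD) -> bool:
    """Return True if the document is confidently identified as English."""
    # fastText expects single-line input; collapse newlines before predicting.
    labels, probs = LID_MODEL.predict(text.replace("\n", " "), k=1)
    return labels[0] == "__label__en" and probs[0] >= threshold
```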
Content Extraction: jusText
- Filters out non-content and boilerplate text blocks from WET file extractions.
- Retains only the main body text to improve dataset relevance and reduce noise.
- Enhances semantic quality by discarding repetitive or low-value sections.
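A minimal sketch of the boilerplate-removal step. jusText scores paragraphs from page markup, so this assumes access to the raw HTML of each record; the function name and English stoplist are illustrative choices:

```python
# Illustrative sketch: drop boilerplate paragraphs and keep only main body text.
import justext

def extract_main_text(html: bytes) -> str:
    """Return the non-boilerplate paragraphs of a page, one per line."""
    paragraphs = justext.justext(html, justext.get_stoplist("English"))
    return "\n".join(p.text for p in paragraphs if not p.is_boilerplate)
```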
Toxicity Filtering: Detoxify
- Identifies and removes toxic, offensive, or unsafe content.
- Reduces harmful bias propagation in LLM outputs.
- Configurable toxicity threshold for task-specific safety requirements.
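A minimal sketch of the configurable-threshold filter, using Detoxify's pretrained "original" checkpoint; the 0.5 threshold and function name are example values, not the project's settings:

```python
# Illustrative sketch: score a batch of lines with Detoxify and keep only
# those below a configurable toxicity threshold (0.5 is an example value).
from detoxify import Detoxify

TOXICITY_MODEL = Detoxify("original")  # pretrained toxicity classifier
TOXICITY_THRESHOLD = 0.5

def filter_toxic(lines: list[str], threshold: float = TOXICITY_THRESHOLD) -> list[str]:
    """Return the lines whose predicted toxicity is below the threshold."""
    scores = TOXICITY_MODEL.predict(lines)["toxicity"]  # one score per line
    return [line for line, score in zip(lines, scores) if score < threshold]
```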
Normalization: ftfy / Moses
- Unicode normalization and cleanup.
- Consistent punctuation, casing, and spacing for model tokenization.
- Prepares data for uniform embedding/tokenization without format errors.
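A minimal sketch of the normalization pass, combining ftfy's Unicode repair with Moses-style punctuation normalization via sacremoses; the language code and function name are illustrative:

```python
# Illustrative sketch: repair mojibake/Unicode issues with ftfy, then apply
# Moses-style punctuation normalization via sacremoses.
import ftfy
from sacremoses import MosesPunctNormalizer

PUNCT_NORMALIZER = MosesPunctNormalizer(lang="en")

def normalize(text: str) -> str:
    """Fix broken Unicode and standardize punctuation and spacing."""
    text = ftfy.fix_text(text)               # repair mis-decoded characters
    text = PUNCT_NORMALIZER.normalize(text)  # unify quotes, dashes, spacing
    return text.strip()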
Tokenization: SentencePiece (LLaMA style)
- Subword tokenization for efficient LLM vocabulary coverage.
- Optimized for multi-million token batch processing.
- Output sharded and stored for parallel training ingestion.
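A minimal sketch of batch subword encoding with a LLaMA-style SentencePiece model; the model path is a placeholder for the project's trained tokenizer:

```python
# Illustrative sketch: encode cleaned documents into subword token IDs with a
# LLaMA-style SentencePiece model. The model path is a placeholder.
import sentencepiece as spm

sp = spm.SentencePieceProcessor(model_file="tokenizer.model")

def tokenize(docs: list[str]) -> list[list[int]]:
    """Return one list of token IDs per document, ready for sharding."""
    return sp.encode(docs, out_type=int)  # batch encoding over the documents
```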
Cloud Integration (AWS)
View Full Cloud Page
- AWS Batch (running on AWS Fargate) orchestrates AI/NLP tasks in isolated compute environments.
- Amazon ECR stores Docker images for each AI/NLP pipeline stage, ensuring reproducible deployments.
- AWS Step Functions orchestrates per-file processing in parallel, with per-stage monitoring (see the sketch after this list).
- Amazon S3 stores intermediate outputs at every stage for reproducibility.
- CloudWatch Logs provides traceability for AI processing steps.
- Terraform provisions all cloud resources as code for consistent deployment.
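A minimal sketch of how a run could be kicked off from Python with boto3, starting one Step Functions execution over a batch of input files; the state machine ARN, bucket name, and input schema are placeholders, not the project's actual resources:

```python
# Illustrative sketch: start the Step Functions state machine over a batch of
# S3 input files with boto3. ARN, bucket, and payload shape are placeholders.
import json
import boto3

sfn = boto3.client("stepfunctions")

def start_pipeline(input_keys: list[str]) -> str:
    """Start one pipeline execution that fans out over the given S3 objects."""
    response = sfn.start_execution(
        stateMachineArn="arn:aws:states:us-east-1:123456789012:stateMachine:nlp-pipeline",
        input=json.dumps({"bucket": "raw-commoncrawl-input", "keys": input_keys}),
    )
    return response["executionArn"]
```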
Code Implementation
View Full Code Page
- Python 3.10 modular ETL scripts for each NLP stage.
- Hydra configuration for reproducible parameter tuning.
- boto3 for seamless AWS service integration.
- S3-based logging for per-file metrics (kept/skipped line counts).
- Docker for fully containerized portable execution in AWS Batch.
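A minimal sketch of a Hydra-configured stage entry point that writes per-file metrics to S3; the config fields, bucket, and key layout are illustrative assumptions, not the project's exact interface:

```python
# Illustrative sketch: a Hydra-configured stage entry point that logs per-file
# kept/skipped counts to S3. Config fields and bucket names are placeholders.
import json
import boto3
import hydra
from omegaconf import DictConfig

@hydra.main(config_path="conf", config_name="pipeline", version_base=None)
def run_stage(cfg: DictConfig) -> None:
    # ... run the configured NLP stage here and count kept/skipped lines ...
    metrics = {"input": cfg.input_key, "kept": 0, "skipped": 0}  # dummy values
    boto3.client("s3").put_object(
        Bucket=cfg.metrics_bucket,
        Key=f"metrics/{cfg.input_key}.json",
        Body=json.dumps(metrics).encode("utf-8"),
    )

if __name__ == "__main__":
    run_stage()
```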
Impact
This project bridges cloud infrastructure and AI data engineering:
- Handles large-scale noisy web data efficiently.
- Improves dataset quality for safer and more accurate LLM pretraining.
- Showcases end-to-end ownership: cloud provisioning, AI/NLP integration, and scalable deployment.