Operations Analytics Portfolio Project

ETL Automation & Weekly Report

Automated end-to-end data cleaning pipeline for 3 operational data sources. It cuts about 3 hours of manual work per week down to an 18-second run.

Data Sources: 3
Runtime: 18s
Annual Savings: 156 hrs
Cost Savings: ₹62.4K
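
The headline figures are consistent with one 3-hour manual run per week. The short check below reproduces them. The 52-week cadence and the ₹400 hourly rate are assumptions: the page does not state either, and the rate is simply ₹62,400 divided by 156 hours.

```python
# Back-of-envelope check of the headline numbers.
MANUAL_HOURS_PER_WEEK = 3       # from the project summary
PIPELINE_SECONDS = 18           # from the project summary
WEEKS_PER_YEAR = 52             # assumption: one report every week of the year
ASSUMED_HOURLY_RATE_INR = 400   # assumption: inferred as 62,400 / 156

# The 18-second runtime is ignored here because it is negligible next to 3 hours.
annual_hours_saved = MANUAL_HOURS_PER_WEEK * WEEKS_PER_YEAR
annual_cost_saved = annual_hours_saved * ASSUMED_HOURLY_RATE_INR
speedup = MANUAL_HOURS_PER_WEEK * 3600 / PIPELINE_SECONDS

print(annual_hours_saved)  # 156
print(annual_cost_saved)   # 62400
print(speedup)             # 600.0
```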

Interactive Pipeline Demo

Watch the ETL pipeline extract messy data, transform it programmatically, and generate the weekly report. Click tabs to explore raw vs. cleaned data.

ETL Pipeline Demo

Live data cleaning simulation

Extract
Transform
Load

Raw Orders Data (Sample)

3 different date formats, duplicates, typos
Order ID    Date         Product  Price      Qty  Status
ORD-001     01/15/2024   Laptop   $1,299.00  2.0  Compleeted
ORD-002     2024-01-16   Phone    $899.00    1.0  shiped
ORD-001 🔁  15-Jan-2024  Laptop   $1,299.00  2.0  CANCELLED
ORD-003     NULL         Tablet   NULL       1.0  Pending

🔁 Duplicates | ⚠️ NULL values | ✏️ Typos | 📅 Mixed formats
Pipeline runtime: ~18 seconds
3 CSV files generated

What This Automates

Orders Data

  • Duplicate order IDs
  • 3 date formats (YYYY-MM-DD, MM/DD/YYYY, DD-Mon-YYYY)
  • Price strings with $ prefix
  • Float quantities
  • Status typos (Compleeted, shiped)
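
The order fixes listed above could be sketched as follows. This is a minimal illustration in plain Python (standard library only, so it runs without dependencies), not the project's actual code. The column names, the typo map, and the "first occurrence wins" rule for duplicates are all assumptions.

```python
import re
from datetime import datetime

# The three date formats named in the list above.
DATE_FORMATS = ("%Y-%m-%d", "%m/%d/%Y", "%d-%b-%Y")
# Hypothetical typo map; it covers only the two typos shown in the sample.
STATUS_FIXES = {"compleeted": "Completed", "shiped": "Shipped"}

def parse_date(raw):
    """Return an ISO date string, or None if no known format matches."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(raw.strip(), fmt).date().isoformat()
        except ValueError:
            continue
    return None

def parse_price(raw):
    """Strip '$' and thousands separators; return None for NULL or junk."""
    try:
        return float(re.sub(r"[$,\s]", "", raw))
    except ValueError:
        return None

def clean_orders(rows):
    """Clean each row, keeping the first occurrence of every order ID."""
    seen, cleaned = set(), []
    for row in rows:
        if row["order_id"] in seen:
            continue  # assumption: first occurrence wins
        seen.add(row["order_id"])
        status = row["status"].strip()
        cleaned.append({
            "order_id": row["order_id"],
            "date": parse_date(row["date"]),
            "product": row["product"].strip(),
            "price": parse_price(row["price"]),
            "qty": int(float(row["qty"])),  # "2.0" -> 2
            "status": STATUS_FIXES.get(status.lower(), status.capitalize()),
        })
    return cleaned

# The four sample rows from the raw orders table.
sample = [
    {"order_id": "ORD-001", "date": "01/15/2024", "product": "Laptop",
     "price": "$1,299.00", "qty": "2.0", "status": "Compleeted"},
    {"order_id": "ORD-002", "date": "2024-01-16", "product": "Phone",
     "price": "$899.00", "qty": "1.0", "status": "shiped"},
    {"order_id": "ORD-001", "date": "15-Jan-2024", "product": "Laptop",
     "price": "$1,299.00", "qty": "2.0", "status": "CANCELLED"},
    {"order_id": "ORD-003", "date": "NULL", "product": "Tablet",
     "price": "NULL", "qty": "1.0", "status": "Pending"},
]

for order in clean_orders(sample):
    print(order)
```

Keeping the first duplicate is only one possible policy. In the sample, the second ORD-001 row carries a different status (CANCELLED), so a real pipeline might prefer the most recent row or flag the conflict for review instead.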

Inventory Data

  • Extra whitespace in names
  • Negative stock levels
  • Mixed timezones (IST/UTC)
  • Comma-separated cost prices
  • Reorder threshold violations
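
A possible shape for the inventory checks above, again in plain Python. The field names, the timestamp layout ("YYYY-MM-DD HH:MM ZONE"), clamping negative stock to zero, and reading a "reorder threshold violation" as stock below the reorder level are all assumptions made for this sketch.

```python
from datetime import datetime, timedelta, timezone

# Assumption: every timestamp ends in a zone label, either "IST" or "UTC".
ZONES = {
    "UTC": timezone.utc,
    "IST": timezone(timedelta(hours=5, minutes=30)),
}

def to_utc(raw):
    """Convert 'YYYY-MM-DD HH:MM ZONE' to an ISO 8601 UTC timestamp."""
    stamp, zone = raw.strip().rsplit(" ", 1)
    local = datetime.strptime(stamp, "%Y-%m-%d %H:%M").replace(tzinfo=ZONES[zone])
    return local.astimezone(timezone.utc).isoformat()

def clean_inventory(row):
    """Return the cleaned row plus the list of issues found in it."""
    issues = []
    stock = int(row["stock"])
    if stock < 0:
        issues.append("negative_stock")
        stock = 0  # assumption: clamp to zero rather than drop the row
    reorder_level = int(row["reorder_level"])
    if stock < reorder_level:
        issues.append("below_reorder_threshold")
    cleaned = {
        "name": " ".join(row["name"].split()),  # collapse extra whitespace
        "stock": stock,
        "last_updated": to_utc(row["last_updated"]),
        "cost_price": float(row["cost_price"].replace(",", "")),
        "reorder_level": reorder_level,
    }
    return cleaned, issues

# Made-up row for illustration only.
sample_row = {
    "name": "  USB-C   Cable ",
    "stock": "-4",
    "last_updated": "2024-01-15 10:30 IST",
    "cost_price": "1,250.50",
    "reorder_level": "10",
}

print(clean_inventory(sample_row))
```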

Employee Data

  • Mixed name casing (ALL CAPS, lowercase)
  • Mixed overtime flags (0/1/Yes/No)
  • Impossible hours (>24)
  • Blank date rows
  • Inconsistent boolean formats
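
The employee rules above might look like this in code. The sketch assumes hypothetical column names (name, date, hours, overtime) and rejects bad rows rather than repairing them; the project may handle them differently.

```python
TRUE_FLAGS = {"1", "yes", "y", "true"}
FALSE_FLAGS = {"0", "no", "n", "false"}

def parse_flag(raw):
    """Map 0/1/Yes/No style values to a bool; None if unrecognised."""
    value = str(raw).strip().lower()
    if value in TRUE_FLAGS:
        return True
    if value in FALSE_FLAGS:
        return False
    return None

def clean_employees(rows):
    """Split rows into cleaned records and (row, reason) rejections."""
    cleaned, rejected = [], []
    for row in rows:
        if not row["date"].strip():
            rejected.append((row, "blank_date"))
            continue
        hours = float(row["hours"])
        if not 0 <= hours <= 24:
            rejected.append((row, "impossible_hours"))
            continue
        cleaned.append({
            # str.title() is a simplification; it mangles names like "McDonald".
            "name": row["name"].strip().title(),
            "date": row["date"].strip(),
            "hours": hours,
            "overtime": parse_flag(row["overtime"]),
        })
    return cleaned, rejected

# Made-up rows for illustration only.
sample = [
    {"name": "PRIYA SHARMA", "date": "2024-01-15", "hours": "8", "overtime": "Yes"},
    {"name": "rahul verma", "date": "2024-01-15", "hours": "26", "overtime": "0"},
    {"name": "Anita Rao", "date": "", "hours": "7.5", "overtime": "1"},
]

cleaned, rejected = clean_employees(sample)
print(cleaned)
print([reason for _, reason in rejected])
```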

How It Works

01

Generate Messy Data

Creates 3 CSV files with realistic data quality issues: duplicates, nulls, typos, and format inconsistencies.

python data/generate_messy_files.py
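
The contents of generate_messy_files.py are not shown here. As a rough idea of how such a generator could inject the issues described, the sketch below builds messy order rows with a seeded random generator and renders them as CSV text. Every function name and value in it is hypothetical.

```python
import csv
import io
import random
from datetime import date

def messy_date(d, rng):
    """Render a date in one of three formats, chosen at random."""
    return d.strftime(rng.choice(["%Y-%m-%d", "%m/%d/%Y", "%d-%b-%Y"]))

def generate_orders(n, seed=0):
    """Build n order rows with mixed formats and typos, plus one duplicate."""
    rng = random.Random(seed)  # seeded so runs are reproducible
    statuses = ["Completed", "Compleeted", "Shipped", "shiped", "Pending"]
    rows = []
    for i in range(1, n + 1):
        price = rng.choice([1299.0, 899.0, 499.0])
        rows.append({
            "order_id": f"ORD-{i:03d}",
            "date": messy_date(date(2024, 1, 1 + i % 28), rng),
            "price": f"${price:,.2f}",         # price as a "$1,299.00" string
            "qty": float(rng.randint(1, 3)),   # quantity stored as a float
            "status": rng.choice(statuses),
        })
    rows.append(dict(rng.choice(rows)))  # inject one duplicate order ID
    return rows

def to_csv(rows):
    """Serialise the rows to CSV text with a header line."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=list(rows[0]))
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()

print(to_csv(generate_orders(5)))
```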
02

Run ETL Pipeline

Extracts, cleans, and transforms all 3 files. Logs every issue found and fixed.

python etl/clean_and_report.py
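
One way to "log every issue found and fixed" is to route each fix through a small recorder that both writes a log line and keeps a running count for the report. The IssueLog class and dedupe helper below are hypothetical names used for illustration; they are not taken from clean_and_report.py.

```python
import logging
from collections import Counter

logging.basicConfig(level=logging.INFO, format="%(levelname)s %(message)s")
log = logging.getLogger("etl")

class IssueLog:
    """Counts issues per (source, issue type) and logs each occurrence."""

    def __init__(self):
        self.counts = Counter()

    def record(self, source, issue, detail=""):
        self.counts[(source, issue)] += 1
        log.info("%s: %s %s", source, issue, detail)

def dedupe(rows, key, issues, source):
    """Drop rows whose key was already seen, recording each drop."""
    seen, kept = set(), []
    for row in rows:
        if row[key] in seen:
            issues.record(source, "duplicate", row[key])
            continue
        seen.add(row[key])
        kept.append(row)
    return kept

issues = IssueLog()
orders = [{"order_id": "ORD-001"}, {"order_id": "ORD-002"}, {"order_id": "ORD-001"}]
kept = dedupe(orders, "order_id", issues, "orders")
print(len(kept), dict(issues.counts))
```

Collecting counts alongside the log lines means the reporting step can summarise issues without re-parsing the log file.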
03

Review Report

Get a structured weekly report with executive summary, issue counts, and cost savings.

cat reports/weekly_report.txt
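
The actual layout of weekly_report.txt is not shown. As a sketch only, a report with the three parts mentioned (executive summary, issue counts, cost savings) could be assembled from per-issue counts like this. The section titles, the build_report name, and the hourly rate passed in are assumptions.

```python
def build_report(issue_counts, hours_saved, hourly_rate):
    """Assemble a plain-text report from {(source, issue): count} totals."""
    total = sum(issue_counts.values())
    lines = [
        "WEEKLY DATA QUALITY REPORT",
        "",
        "Executive summary",
        f"  Issues found and fixed: {total}",
        f"  Manual hours saved: {hours_saved}",
        f"  Estimated cost savings: INR {hours_saved * hourly_rate:,.0f}",
        "",
        "Issue counts",
    ]
    # Sort so the report is stable from run to run.
    for (source, issue), count in sorted(issue_counts.items()):
        lines.append(f"  {source:<10} {issue:<24} {count}")
    return "\n".join(lines)

# Made-up counts; the rate of 400 per hour is an assumed figure.
counts = {("orders", "duplicate"): 1, ("inventory", "negative_stock"): 2}
report = build_report(counts, hours_saved=3, hourly_rate=400)
print(report)
```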

Tech Stack

Python, Pandas, ETL Pipeline, Data Quality, Logging, Automation, Operations Analytics