Vishal Sreepada

About Me

I'm a Lead Data Engineer at Amazon with 5 years of experience designing and owning large-scale data infrastructure. I build end-to-end pipelines, event-driven architectures, and AI-powered automation systems that serve thousands of stakeholders across global operations.

I specialize in turning complex, high-volume data problems into reliable, scalable solutions. Highlights include cutting a 40-hour pipeline to 30 minutes, automating 6,000 dataset registrations, and leading a security certification that unlocked confidential-data processing at scale.

I hold a Master's in Computer Engineering from George Mason University and am currently pursuing an MBA at Ottawa University.

5+ Years Experience

40+ Pipelines Delivered

600+ Users Served

6K+ Datasets Automated

Experience

Amazon

Data Engineer II

May 2024 – Present · Austin, TX

Architected and delivered an AI-powered data catalog service automating metadata registration and enrichment for 6,000+ datasets. Reduced onboarding from ~60 minutes per table to zero human intervention using Amazon Bedrock and Step Functions orchestration.
Designed a self-service conversational AI interface for enterprise data lake onboarding, featuring a custom 4-factor confidence scoring algorithm and smart routing that handles 60% of queries in <10ms, targeting 200+ daily table onboardings.
Led org-wide migration to a next-generation dataset platform, owning documentation and cross-team coordination to ensure zero-disruption adoption across the Engineering Plan & Analytics organization.

Data Engineer

Apr 2022 – Apr 2024 · Seattle, WA

Cut a 1TB executive reporting pipeline's runtime from 40 hours to 30 minutes (98% reduction) by identifying that only 10GB was required for reporting, then implementing cross-account S3 crawlers with SparkSQL partition pruning.
Led consolidation of 5 legacy project management tools into a single unified data platform serving 600+ users across Europe. Owned schema mapping for 50+ tables, backfill scripting, and coordinated a planned single-day production cutover with zero data loss.
Built an event-driven warehouse safety monitoring system (Lambda + CDK) that automated ticket creation for collision-avoidance sensor failures, eliminating ~200 daily manual interventions and contributing to a 95% reduction in serious industrial incidents across sites with 800+ operators.
Engineered multi-vendor fleet telemetry pipelines extracting from 40+ API endpoints, executed a 2-year historical backfill into the enterprise data lake, and replaced a legacy Redshift cluster with a lake-native workflow for faster processing.
Owned end-to-end security certification to enable Confidential-classified data processing in the enterprise data lake. Authored architecture documentation and threat mitigations, and worked across organizations to achieve clearance.

Blue.cloud

Associate Data Engineer

Jun 2021 – Apr 2022 · Greater Chicago Area

Built PySpark data pipelines in Azure Databricks for ETL from diverse file formats to Azure Data Lake Storage and Amazon S3, providing insights into customer usage patterns.
Developed and deployed AWS Lambda functions for data extraction, parsing, and ingestion of nested JSON from S3 into Snowflake.
Engineered Snowflake Snowpipe for real-time data loading with internal and external stages, and defined roles, privileges, and virtual warehouse sizing for various workload types.
Provided L2/L3 production support for 100+ clients in the JLL Azara platform and contributed to integration infrastructure design across multiple teams.

George Mason University

Graduate Teaching Assistant

Jan 2020 – May 2021 · Fairfax, VA

Assisted in Computer Science and IST departments across courses including Big Data Technologies, Network Security, Security Accreditation of Information Systems, and Essentials of Computer Science.
Held weekly office hours, graded assignments, and developed curriculum materials in collaboration with faculty.

Projects

◆

AI-Powered Data Catalog Automation

Eliminated manual dataset onboarding for 6,000+ data tables. Reduced time from ~60 min/table to zero human intervention. Built an AI enrichment pipeline using a Strands Agent on Bedrock AgentCore Runtime (Claude Sonnet 4) to auto-generate table descriptions, column metadata, and READMEs with PII guardrails. Step Functions Distributed Map orchestrates bulk registration, cross-account EventBridge detects schema drift in real time, and incremental AI re-enrichment runs only for new columns.

Amazon Bedrock · Bedrock AgentCore · Strands Agents SDK · AWS Step Functions · Lambda · DynamoDB Streams · AWS Glue · EventBridge · S3 · Athena · Lake Formation · SNS / SQS · AWS CDK · CloudWatch
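
The incremental re-enrichment step described above can be illustrated with a small schema diff. This is a minimal sketch, not the production logic: the function names and the column-name-to-type dictionary shape are assumptions made for illustration.

```python
from typing import Dict, List


def has_schema_drift(registered: Dict[str, str], current: Dict[str, str]) -> bool:
    """Return True when columns were added, removed, or changed type.

    Both arguments map column name -> data type (a hypothetical shape).
    """
    return registered != current


def columns_needing_enrichment(
    registered: Dict[str, str], current: Dict[str, str]
) -> List[str]:
    """Return only the columns that are new since the last registration."""
    return sorted(name for name in current if name not in registered)


if __name__ == "__main__":
    registered = {"order_id": "string", "amount": "double"}
    current = {"order_id": "string", "amount": "double", "currency": "string"}
    if has_schema_drift(registered, current):
        # Only the new column would be sent for AI enrichment.
        print(columns_needing_enrichment(registered, current))
```

Restricting enrichment to the new columns keeps model calls proportional to the size of the change rather than the size of the table.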

◆

Executive Reporting Pipeline: 98% Runtime Reduction

Diagnosed and resolved a 1TB executive analytics pipeline running 40+ hours. Root cause: only 10GB of data was needed for reporting. Redesigned the pipeline with cross-account S3 crawlers, SparkSQL partition pruning, and targeted data filtering at the transform stage. Cut runtime to 30 minutes and unblocked leadership reporting workflows.

AWS Glue · S3 Cross-Account · Lambda · Athena · QuickSight · Apache Airflow · SES · CloudFormation
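
The idea behind partition pruning can be sketched in plain Python: select only the partitions a report needs before any data is read. The path layout, the `report_date` partition key, and the function names below are hypothetical; the actual pipeline used SparkSQL, which this sketch does not reproduce. The second function checks the runtime arithmetic.

```python
from datetime import date
from typing import Iterable, List


def prune_partitions(
    paths: Iterable[str], start: date, end: date, key: str = "report_date"
) -> List[str]:
    """Keep only paths whose partition value falls within [start, end]."""
    kept = []
    for path in paths:
        for segment in path.strip("/").split("/"):
            if segment.startswith(key + "="):
                value = date.fromisoformat(segment.split("=", 1)[1])
                if start <= value <= end:
                    kept.append(path)
                break  # the partition key was found; stop scanning this path
    return kept


def percent_reduction(before_minutes: int, after_minutes: int) -> float:
    """Percentage by which the runtime dropped."""
    return (before_minutes - after_minutes) * 100 / before_minutes


if __name__ == "__main__":
    paths = [
        "s3://example-bucket/reports/report_date=2023-12-31/",
        "s3://example-bucket/reports/report_date=2024-01-01/",
        "s3://example-bucket/reports/report_date=2024-01-02/",
    ]
    print(prune_partitions(paths, date(2024, 1, 1), date(2024, 1, 2)))
    print(percent_reduction(40 * 60, 30))  # 40 hours down to 30 minutes
```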

◆

PMO Data Platform Consolidation

Decommissioned 5 legacy project management tools and migrated 600+ Europe-PMO users to a single unified data platform in a planned one-day cutover. Owned schema mapping for 50+ tables, authored SQL and Glue backfill scripts, engineered retrofit pipelines for downstream tool dependencies, and built a DocumentDB-backed risk module with hourly S3 flattening for near-real-time visibility.

AWS Glue · Step Functions · Lambda · DocumentDB · Aurora MySQL · DynamoDB · S3
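
Flattening nested documents into tabular rows, as the risk module's hourly job does, might look like the sketch below. The document shape, the dotted key convention, and the `flatten` helper are assumptions for illustration, not the module's actual code.

```python
from typing import Any, Dict


def flatten(document: Dict[str, Any], parent_key: str = "", sep: str = ".") -> Dict[str, Any]:
    """Collapse nested dictionaries into a single level with dotted keys.

    Lists and scalar values are kept as-is.
    """
    flat: Dict[str, Any] = {}
    for key, value in document.items():
        full_key = f"{parent_key}{sep}{key}" if parent_key else key
        if isinstance(value, dict):
            flat.update(flatten(value, full_key, sep))
        else:
            flat[full_key] = value
    return flat


if __name__ == "__main__":
    risk = {"id": 7, "owner": {"name": "pmo", "region": {"code": "EU"}}}
    print(flatten(risk))
```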

◆

Warehouse Safety Monitoring Automation

Replaced a fully manual process (~200 daily anchor failure notifications) with an event-driven ticketing system using Lambda and AWS CDK. Tickets are created within seconds of collision-avoidance sensor failures and deduplicated on repeats. Eliminated manual overhead and contributed to a 95% reduction in serious powered industrial truck incidents across facilities with 800+ operators.

AWS Lambda · AWS CDK · SNS / SQS · SES · CodePipeline
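
Deduplicating tickets on repeated failures can be sketched with an in-memory map of open tickets per sensor. The class name, ticket ID format, and in-memory storage are assumptions; a deployed version would need a durable store shared across Lambda invocations.

```python
from typing import Dict, Optional


class TicketDeduplicator:
    """Create one ticket per failing sensor and ignore repeat events."""

    def __init__(self) -> None:
        self._open: Dict[str, str] = {}  # sensor_id -> open ticket ID
        self._counter = 0

    def handle_failure(self, sensor_id: str) -> Optional[str]:
        """Return a new ticket ID, or None if one is already open."""
        if sensor_id in self._open:
            return None
        self._counter += 1
        ticket_id = f"TICKET-{self._counter}"
        self._open[sensor_id] = ticket_id
        return ticket_id

    def resolve(self, sensor_id: str) -> None:
        """Close the sensor's ticket so a later failure opens a new one."""
        self._open.pop(sensor_id, None)


if __name__ == "__main__":
    dedup = TicketDeduplicator()
    print(dedup.handle_failure("sensor-a"))  # new ticket
    print(dedup.handle_failure("sensor-a"))  # repeat, no new ticket
```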

◆

AI-Powered Data Lake Onboarding

Designed a conversational AI interface to automate table onboarding into the enterprise data lake. Architected a 4-factor confidence scoring algorithm (explicit mentions, field completeness, ambiguity penalty, context clarity) with smart routing that handles 60% of queries in <10ms using Bedrock AgentCore. DynamoDB Streams trigger Lambda orchestration for automated S3 directory creation, Glue catalog setup, and Athena table provisioning. Dynamic Airflow DAG builder generates pipeline tasks at runtime for each onboarded table.

Amazon Bedrock · Bedrock AgentCore · Lambda · DynamoDB Streams · Step Functions · API Gateway · CloudFront · S3 · Glue Catalog · Athena · Apache Airflow · SNS · SES · CloudWatch
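
A 4-factor confidence score with threshold-based routing might be combined as in the sketch below. The four factor names come from the description above; the weights, the 0.8 threshold, the route labels, and the assumption that high-confidence queries skip the model call are all illustrative, not the actual algorithm.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Factors:
    """Each factor is a value in [0, 1]; how they are derived is not shown."""

    explicit_mentions: float
    field_completeness: float
    ambiguity: float
    context_clarity: float


def confidence(f: Factors) -> float:
    """Weighted sum of the positive factors minus an ambiguity penalty."""
    raw = (
        0.35 * f.explicit_mentions
        + 0.35 * f.field_completeness
        + 0.30 * f.context_clarity
        - 0.25 * f.ambiguity  # assumed penalty weight
    )
    return max(0.0, min(1.0, raw))  # clamp to [0, 1]


def route(f: Factors, threshold: float = 0.8) -> str:
    """Send confident queries down a fast path, the rest to the model."""
    return "fast_path" if confidence(f) >= threshold else "model_path"


if __name__ == "__main__":
    print(route(Factors(1.0, 1.0, 0.0, 1.0)))
    print(route(Factors(0.2, 0.3, 0.9, 0.2)))
```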

◆

Multi-Vendor Fleet Telemetry Pipelines

Engineered data extraction pipelines for fleet vehicle telemetry across three vendor sources (Raymond EU, Raymond NA, Hyster NA), pulling from 40+ paginated API endpoints with varying retention periods of 1 to 5 days. Executed a 2-year historical backfill into the enterprise data lake and replaced a legacy Redshift cluster with a lake-native workflow for faster processing. Pipeline template reused across multiple vendor integrations.

AWS Glue · S3 · Apache Airflow · Athena · EventBridge · SNS · Redshift
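
Pulling from paginated endpoints with short retention periods can be sketched as a cursor loop plus a helper that splits a date range into windows no longer than the retention period. The page shape (`records`, `next_cursor`), the injected `fetch_page` callable, and both function names are hypothetical; the vendors' actual APIs are not described in the source.

```python
from datetime import date, timedelta
from typing import Any, Callable, Dict, Iterator, List, Optional, Tuple


def fetch_all(fetch_page: Callable[[Optional[str]], Dict[str, Any]]) -> Iterator[Any]:
    """Yield every record, following cursors until none is returned."""
    cursor: Optional[str] = None
    while True:
        page = fetch_page(cursor)
        yield from page["records"]
        cursor = page.get("next_cursor")
        if cursor is None:
            break


def extraction_windows(
    start: date, end: date, retention_days: int
) -> List[Tuple[date, date]]:
    """Split [start, end] into inclusive windows of at most retention_days."""
    if retention_days < 1:
        raise ValueError("retention_days must be at least 1")
    windows = []
    cursor = start
    while cursor <= end:
        window_end = min(cursor + timedelta(days=retention_days - 1), end)
        windows.append((cursor, window_end))
        cursor = window_end + timedelta(days=1)
    return windows


if __name__ == "__main__":
    pages = {
        None: {"records": [{"id": 1}, {"id": 2}], "next_cursor": "p2"},
        "p2": {"records": [{"id": 3}], "next_cursor": None},
    }
    print(list(fetch_all(lambda cursor: pages[cursor])))
    print(extraction_windows(date(2024, 1, 1), date(2024, 1, 7), 5))
```

Passing the page fetcher in as a callable keeps the loop reusable across vendors, which matches the template reuse mentioned above.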

◆

Construction Management Data Pipeline

Built dual-source pipelines to support a phased migration from legacy Excel macros to a modern web application for construction project management, covering purchase orders, change orders, weather logs, and cost summaries. Reverse-engineered VBA macro logic, retrofitted 15 tables with complex multi-dataset joins, and handled sequential update dependencies to maintain data integrity. Pipeline template was adopted by two additional regional teams.

AWS Glue · Aurora MySQL · S3 · Apache Airflow · CloudFormation
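
Sequential update dependencies between tables can be resolved with a topological sort, sketched here with the standard library's `graphlib` (Python 3.9+). The table names echo the data domains listed above, but the dependency edges are assumptions for illustration.

```python
from graphlib import TopologicalSorter
from typing import Dict, List, Set


def update_order(dependencies: Dict[str, Set[str]]) -> List[str]:
    """Return tables in an order where each one follows its prerequisites.

    `dependencies` maps a table to the set of tables that must be
    updated before it. Raises graphlib.CycleError on circular dependencies.
    """
    return list(TopologicalSorter(dependencies).static_order())


if __name__ == "__main__":
    # Hypothetical edges: cost summaries roll up both order tables.
    deps = {
        "purchase_orders": set(),
        "weather_logs": set(),
        "change_orders": {"purchase_orders"},
        "cost_summaries": {"purchase_orders", "change_orders"},
    }
    print(update_order(deps))
```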

Skills & Tools

Languages

Data Processing

Cloud

Databases & Storage

Orchestration & IaC

AWS Services

AI & Agents

Certifications