AI Glossary

Data Pipeline

An automated series of steps that collect, process, transform, and deliver data from source systems to where it's needed for AI model training or inference.

Components

A typical ML data pipeline includes: data ingestion (APIs, databases, files), cleaning (handling missing values, deduplication), transformation (feature engineering, normalization), validation (schema checks, quality tests), and storage (data lakes, feature stores).
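The cleaning, transformation, and validation stages above can be sketched as a chain of functions. Everything here is illustrative: the records, field names, and imputation/normalization choices are assumptions, not a prescribed implementation.

```python
from statistics import mean

# Hypothetical records as ingested from a source system.
raw_records = [
    {"user_id": 1, "age": 34},
    {"user_id": 1, "age": 34},   # duplicate row
    {"user_id": 2, "age": None}, # missing value
    {"user_id": 3, "age": 58},
]

def clean(records):
    """Deduplicate by user_id and impute missing ages with the mean."""
    seen, deduped = set(), []
    for r in records:
        if r["user_id"] not in seen:
            seen.add(r["user_id"])
            deduped.append(dict(r))
    known = [r["age"] for r in deduped if r["age"] is not None]
    fill = mean(known)
    for r in deduped:
        if r["age"] is None:
            r["age"] = fill
    return deduped

def transform(records):
    """Feature engineering: min-max normalize age to [0, 1]."""
    ages = [r["age"] for r in records]
    lo, hi = min(ages), max(ages)
    for r in records:
        r["age_scaled"] = (r["age"] - lo) / (hi - lo)
    return records

def validate(records):
    """Schema and quality checks before data reaches storage."""
    for r in records:
        assert set(r) >= {"user_id", "age", "age_scaled"}
        assert 0.0 <= r["age_scaled"] <= 1.0
    return records

processed = validate(transform(clean(raw_records)))
```

In production each stage would typically be a separate, independently testable job rather than an in-process function call, but the ordering and contracts between stages are the same.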

Tools

Apache Airflow and Prefect for orchestration. dbt for transformation. Great Expectations for data quality. Apache Spark for large-scale processing. Feature stores like Feast for serving features to models.
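At their core, orchestrators like Airflow and Prefect execute a dependency graph (DAG) of tasks in topological order. A minimal sketch of that pattern, using only the Python standard library (the task names and bodies here are made up for illustration):

```python
from graphlib import TopologicalSorter

results = {}

# Hypothetical pipeline tasks; each reads its upstream task's output.
def ingest():    results["ingest"] = [1, 2, 3]
def clean():     results["clean"] = [x for x in results["ingest"] if x > 1]
def transform(): results["transform"] = [x * 10 for x in results["clean"]]
def load():      results["load"] = sum(results["transform"])

tasks = {"ingest": ingest, "clean": clean, "transform": transform, "load": load}

# Each task declares its upstream dependencies, mirroring how
# orchestration frameworks let you declare DAG edges.
deps = {
    "ingest": set(),
    "clean": {"ingest"},
    "transform": {"clean"},
    "load": {"transform"},
}

# Run tasks in an order that respects every dependency.
for name in TopologicalSorter(deps).static_order():
    tasks[name]()
```

Real orchestrators add the parts this sketch omits: scheduling, retries, backfills, logging, and distributed execution.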

Why It Matters

Data quality issues are among the most common causes of ML project failures. A robust data pipeline ensures consistent, reliable data reaches your models; the adage 'garbage in, garbage out' applies with particular force in ML, where models silently absorb whatever flaws their training data contains.


Last updated: March 5, 2026