Data Engineering and AI Infrastructure

108 terms in the Data Engineering and AI Infrastructure domain, each bilingual (TR/EN) with a related-term graph.

ETL / ELT | Data Pipelines | Data Lake | Data Warehouse | Batch Processing | Stream Processing | Metadata Management | Feature Store | Vector Databases | Data Lineage

All Terms (108)

3 terms

🧮

Aggregate Table

A table structure that stores summarized results derived from detailed data in order to accelerate analytical queries.

🎯

Approximate Nearest Neighbor Search

A search approach in high-dimensional vector spaces that prioritizes speed and acceptable proximity over exactness.

📜

Audit Trail

A tracking mechanism that records who changed data or processes, when, and how.

6 terms

⏪

Backfill

The process of reprocessing historical periods or filling missing historical data gaps after the fact.
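
As a rough illustration of the gap-filling side of a backfill, the sketch below lists which daily partitions in a date range are missing and would need reprocessing. The function name `missing_partitions` and the daily granularity are assumptions made for this example, not part of any particular tool.

```python
from datetime import date, timedelta

def missing_partitions(start, end, existing):
    """Return every day in [start, end] that has no partition in `existing`."""
    missing = []
    day = start
    while day <= end:
        if day not in existing:
            missing.append(day)
        day += timedelta(days=1)
    return missing

# Hypothetical state: partitions already loaded for the first week of January.
loaded = {date(2024, 1, 1), date(2024, 1, 3), date(2024, 1, 4)}
to_backfill = missing_partitions(date(2024, 1, 1), date(2024, 1, 5), loaded)
```

Each returned date could then be handed to the regular batch job as its processing period.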

📚

Batch Backlog

A load condition in which scheduled batch jobs accumulate in the processing queue because they cannot complete on time.

🧰

Batch Job

A bulk data processing task that runs according to a schedule or trigger.

📦

Batch Processing

A processing approach in which data is handled in bulk at scheduled intervals and results are produced periodically.

💥

Blast Radius Analysis

A risk analysis approach that measures how many assets and which critical processes may be affected by a data change or failure.

💼

Business Metadata

The metadata layer that explains the business meaning, usage purpose, and enterprise definitions of data assets.

7 terms

🌐

Change Propagation Analysis

The process of analyzing how a change in a data asset or business logic will propagate across the platform.

🧊

Cold Storage Tier

An approach in which rarely accessed but still required data is kept in a low-cost storage tier.

📑

Column-Level Lineage

A detailed lineage level that traces which source columns each field was derived from and how it was transformed.

🧭

Conformed Dimension

A dimension structure reused across different fact tables and business domains with shared meaning.

👥

Consumer Group

A group of consumer instances that read from the same stream in a balanced and parallel way.

✨

Curated Zone

The data lake layer where data has been cleaned, structured, and made more suitable for analytical use.

🕛

Cutoff Time

A time boundary that determines until what moment a batch job will accept records for a given data period.

14 terms

🕸️

DAG

A core orchestration structure that models data processing tasks as a directed acyclic dependency graph.
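
A minimal sketch of how such a graph yields a run order, using Python's standard `graphlib` module. The task names are hypothetical; each task maps to the set of tasks it depends on.

```python
from graphlib import TopologicalSorter

# Hypothetical pipeline: task -> set of upstream tasks it depends on.
dependencies = {
    "extract": set(),
    "clean": {"extract"},
    "aggregate": {"clean"},
    "publish": {"aggregate", "clean"},
}

def execution_order(deps):
    """Return one valid run order; raises graphlib.CycleError if deps contain a cycle."""
    return list(TopologicalSorter(deps).static_order())

order = execution_order(dependencies)
```

Because the graph must be acyclic, a circular dependency is rejected with an error instead of producing an order.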

📊

Dashboard Lineage

A trace structure that shows which datasets, queries, and transformations feed the metrics and visuals inside a dashboard.

📜

Data Contract Enforcement

An approach in which schema, quality, and delivery expectations are not only defined, but actively enforced by the system.

🌊

Data Lake

A storage layer where structured and unstructured data is kept in raw or lightly processed form at scale.

🗂️

Data Lifecycle Tiering

An approach in which data is moved across storage tiers as its access frequency, age, and business value change.

🛤️

Data Pipeline

A processing chain that reliably moves data from a source, through transformations, into one or more target systems.

📦

Data Product Lineage

A trace structure showing which sources feed a data product, how it is produced, and which consumers it serves.

🪪

Data Provenance

A source-reliability perspective that describes the origin, creation conditions, and processing history of a data element.

🏢

Data Warehouse

A structured, integrated, query-optimized data storage environment built for reporting, analytics, and decision support.

🗺️

Dataset Dependency Map

A mapping structure that systematically shows dependency relationships among datasets.

🔗

Dependency Management

The process of managing the dependencies among tasks, datasets, and execution orders within a data workflow.

🔗

Dependency Resolution

The process of determining in what order and under what conditions tasks and data assets in a workflow should execute.

📐

Dimensional Modeling

A modeling approach that organizes analytical data structures around facts and dimensions.

⚠️

Downstream Breakage Risk

A risk measure describing the likelihood that a change in a data asset will cause breakage in connected reports, models, or services.

6 terms

☁️

ELT

A modern approach in which data is loaded into the target platform first, and transformations are performed later inside the storage or compute layer.

🔄

ETL

A classic data integration approach in which data is extracted from source systems, transformed, and then loaded into a target analytical environment.

🏷️

Embedding Versioning

An approach for managing different embedding models or updated embedding-generation processes through versions.

📚

Event Schema Registry

A structure in which event schemas are centrally stored and their evolution is managed in stream-based systems.

🕒

Event Time

A time concept expressing when an event actually occurred, independent of when it reached the system.

1️⃣

Exactly-Once Semantics

A processing model that aims to guarantee each data event is logically handled exactly once by the system.

7 terms

✅

Feature Consistency Check

A validation process that verifies whether training-side and serving-side feature values are produced with the same logic and definition.

🗑️

Feature Deprecation Policy

A policy that governs the controlled retirement of feature definitions that are no longer recommended or supported.

📚

Feature Registry

A registry layer where feature definitions, versions, ownership, and usage states are centrally maintained.

🔌

Feature Serving API

A service layer that delivers the required features through a standardized interface during live prediction.

🏪

Feature Store

An infrastructure layer where machine learning features are centrally managed with reuse and training-serving consistency.

🏷️

Feature Versioning

An approach for managing changes in feature definitions as traceable versions over time.

✂️

File Pruning

An optimization technique that improves data lake performance by avoiding scans of unnecessary files during queries.

1 term

📚

Glossary Alignment

The process of aligning business glossary terms with technical data assets in a semantically consistent way.

2 terms

🕸️

HNSW Index

An indexing method that uses a hierarchical graph structure for fast approximate neighbor search in high-dimensional vectors.

🔀

Hybrid Search

An approach that combines semantic vector search with keyword-based and filter-based classical retrieval techniques.

2 terms

♻️

Idempotency

The property of producing a stable, non-duplicated result even when the same operation is run multiple times.
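
One common way to get this property is a keyed upsert: writing the same batch twice leaves the target unchanged. The sketch below uses an in-memory dict as a stand-in for a table; the `upsert` helper and the `id` key are assumptions for illustration.

```python
def upsert(table, records, key="id"):
    """Write records keyed by `key`; replaying the same batch leaves the table unchanged."""
    for record in records:
        table[record[key]] = dict(record)
    return table

batch = [{"id": 1, "amount": 10}, {"id": 2, "amount": 25}]
table = {}
upsert(table, batch)
snapshot = dict(table)
upsert(table, batch)  # simulated retry of the same batch
```

An append-only insert would instead leave four rows after the retry, which is the duplication that idempotent writes avoid.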

💥

Impact Analysis

The process of assessing in advance which reports, models, tables, or workflows may be affected by a data change.

2 terms

⛓️

Job Chaining

An execution model in which batch jobs trigger one another sequentially based on output-input relationships.

⏰

Job Scheduler

A system component that governs when and under what conditions batch or hybrid jobs should run.

8 terms

🏞️

Lakehouse

A modern architectural approach that combines the flexibility of data lakes with the manageability and performance characteristics of warehouses.

🧾

Late Data Reconciliation

A correction process that brings late-arriving data into alignment with previously produced batch outputs.

🧩

Lineage Completeness

A quality dimension describing how fully lineage information covers all critical steps and dependencies in the data flow.

📏

Lineage Confidence Score

A quality indicator that expresses the reliability level of automatically or semi-automatically inferred lineage information.

🤝

Lineage Reconciliation

The process of reconciling trace information coming from different lineage sources into a consistent view.

🔐

Lineage-Driven Access Impact

An approach that analyzes the impact of access permission changes on connected data products and consumer systems.

🔄

Lineage-Metadata Sync

A synchronization approach that keeps metadata definitions and lineage relationships consistent and up to date.

🪟

Load Window

A boundary structure that defines which time range a load process covers and when it runs.

8 terms

🥇

Medallion Architecture

A layered data-processing model that progresses from raw data toward reliable analytical data.

🧬

Merge Policy

A loading logic that defines which record should be preserved under which rule when incoming and existing records conflict.
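
As one possible rule, the sketch below implements a "latest wins" policy: for each key, the record with the greater version value is kept, and ties keep the existing record. The field names `id` and `updated_at` are assumptions for this example; real policies may also compare source priority or merge field by field.

```python
def merge_latest_wins(existing, incoming, key="id", version="updated_at"):
    """Keep, per key, the record with the greatest `version`; ties keep the existing record."""
    merged = {record[key]: record for record in existing}
    for record in incoming:
        current = merged.get(record[key])
        if current is None or record[version] > current[version]:
            merged[record[key]] = record
    return list(merged.values())

existing = [{"id": 1, "updated_at": 5, "v": "old"}, {"id": 2, "updated_at": 9, "v": "keep"}]
incoming = [
    {"id": 1, "updated_at": 7, "v": "new"},
    {"id": 2, "updated_at": 3, "v": "stale"},
    {"id": 3, "updated_at": 1, "v": "first"},
]
result = {r["id"]: r["v"] for r in merge_latest_wins(existing, incoming)}
```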

📝

Metadata

The body of descriptive, source, usage, and technical information that exists about data.

🧷

Metadata Filtering in Vector Search

An approach that narrows vector similarity results using additional fields such as date, source, user, or category.

🏅

Metadata Quality Score

A quality score used to measure the completeness, freshness, clarity, and governance maturity of metadata.

🗄️

Metadata Registry

A central registry where metadata objects are stored in a standardized, governable, and accessible form.

🕰️

Metadata Versioning

An approach that stores changes in metadata definitions as traceable versions over time.

🤖

Model Lineage

A traceability structure that shows which data, features, code version, and training workflow produced a machine learning model.

1 term

🧱

Namespace Isolation

A structure that logically separates vector collections by use case, tenant, or security boundary.

5 terms

🗃️

Offline Feature Store

A historical and large-scale feature storage layer used for model training, backtesting, and batch feature generation.

⚡

Online Feature Store

A feature store layer optimized for low-latency feature serving at live prediction time.

🗃️

Open Table Format

An open-standard table structure that supports versioning, transactions, and metadata management for large-scale data lake tables.

📡

Operational Metadata

The metadata layer containing operational information such as run status, refresh timing, error history, and processing performance.

🧹

Orphaned Asset Detection

A lineage-based control process for identifying data assets that no longer have meaningful upstream or downstream connections.

7 terms

🛡️

PII Lineage Tracking

A specialized lineage approach focused on tracking where personal data comes from and where it flows across the data platform.

🌿

Partition Pruning

An optimization technique that reduces cost by processing only relevant partitions in batch jobs and queries.
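
A simplified sketch of the idea, assuming partitions are keyed by ISO date strings (which compare correctly as text). The partition paths and the `prune_partitions` helper are hypothetical; query engines normally do this from partition metadata rather than in user code.

```python
def prune_partitions(partitions, start, end):
    """Keep only partitions whose date key falls within [start, end]."""
    return {day: path for day, path in partitions.items() if start <= day <= end}

# Hypothetical layout: one directory per day.
partitions = {
    "2024-01-01": "sales/dt=2024-01-01/",
    "2024-01-02": "sales/dt=2024-01-02/",
    "2024-01-03": "sales/dt=2024-01-03/",
    "2024-02-01": "sales/dt=2024-02-01/",
}
to_scan = prune_partitions(partitions, "2024-01-02", "2024-01-31")
```

Only two of the four partitions are scanned for the January filter; the rest are skipped without being read.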

🧱

Partitioning

A technique for improving read, write, and processing efficiency by splitting large datasets into logical partitions.

📡

Pipeline Observability

An approach that continuously monitors the health, latency, volume, and failure behavior of data pipelines.

⏱️

Pipeline SLA

A service standard that defines the expected delivery time, success rate, and availability level of a data pipeline.

🕰️

Point-in-Time Join

An approach for generating training data using only the historical features that would actually have been available at prediction time.
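
The core lookup can be sketched as "latest feature value at or before the prediction timestamp". The example below assumes the feature history is a list of `(timestamp, value)` pairs sorted by timestamp; the function name is illustrative only.

```python
from bisect import bisect_right

def point_in_time_value(feature_history, as_of):
    """Return the latest value with timestamp <= as_of, or None if none existed yet.

    feature_history: list of (timestamp, value) pairs sorted by timestamp.
    """
    timestamps = [ts for ts, _ in feature_history]
    idx = bisect_right(timestamps, as_of)
    return feature_history[idx - 1][1] if idx else None

history = [(1, 0.2), (5, 0.4), (9, 0.9)]
value_at_6 = point_in_time_value(history, 6)
```

A label observed at time 6 is joined with the value written at time 5, never the one written at time 9, which is what prevents future information from leaking into training data.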

⬇️

Pushdown Transformation

An approach in which transformations are executed inside the engine or platform where the data already resides rather than in a separate layer.

1 term

⚡

Query Acceleration

An optimization approach that enables data warehouse queries to run with lower latency and higher efficiency.

4 terms

🪵

Raw Zone

The data lake layer where source data is first accepted with minimal alteration.

⚡

Real-Time Feature Computation

An approach in which feature values are computed close to prediction time instead of being fully precomputed.

♻️

Rerun Strategy

An approach that defines how failed or incomplete data jobs should be rerun and under what safety rules.

↩️

Reverse ETL

An integration approach that moves curated data from analytics platforms back into operational systems.

13 terms

📖

Schema-on-Read

A flexible data-processing approach in which schema is applied when data is read rather than when it is written.

🧠

Semantic Layer

A layer that abstracts business metrics, definitions, and query logic consistently above technical data structures.

🧠

Semantic Lineage

A lineage approach that shows how data assets are derived and connected not only technically, but also at the business-meaning level.

📐

Similarity Metric

The core retrieval criterion that defines how proximity between vectors is computed.
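
Two widely used choices are cosine similarity and Euclidean distance. The plain-Python sketch below shows both for illustration; production systems typically rely on vectorized or index-native implementations.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between a and b; assumes neither vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def euclidean_distance(a, b):
    """Straight-line distance between a and b."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
```

Cosine similarity ignores vector length and compares direction only, while Euclidean distance is sensitive to magnitude, so the two can rank the same candidates differently.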

🕰️

Slowly Changing Dimension

A warehouse approach that defines how changing dimension attributes should be preserved historically over time.
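
One well-known variant, usually called Type 2, keeps history by closing the current row and appending a new one. The sketch below assumes rows are dicts with `key`, `attrs`, `valid_from`, and `valid_to` fields, where `valid_to = None` marks the current version; these field names are illustrative.

```python
def apply_scd2(rows, key, new_attrs, effective_date):
    """Close the current row for `key` if its attributes changed, then append the new version."""
    current = next((r for r in rows if r["key"] == key and r["valid_to"] is None), None)
    if current is not None:
        if current["attrs"] == new_attrs:
            return rows  # nothing changed, keep history as-is
        current["valid_to"] = effective_date
    rows.append({"key": key, "attrs": dict(new_attrs),
                 "valid_from": effective_date, "valid_to": None})
    return rows

dim_customer = []
apply_scd2(dim_customer, "c1", {"city": "Ankara"}, "2024-01-01")
apply_scd2(dim_customer, "c1", {"city": "Izmir"}, "2024-06-01")
apply_scd2(dim_customer, "c1", {"city": "Izmir"}, "2024-07-01")  # no change, no new row
```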

🪞

Source System Replication

An approach in which data from a source system is replicated into another environment for analytical or operational use.

🪜

Staging Area

An intermediate preparation layer where source data is temporarily held before final transformation.

⭐

Star Schema

A classic analytical warehouse design with a central fact table surrounded by dimension tables.

🗄️

State Store

A persistent or semi-persistent data structure that stores historical context and intermediate computation state during stream processing.

🪢

Stream Join

The operation of joining multiple continuous data streams by key and time logic to create meaningful event context.

🐢

Stream Lag

A core stream-health metric that expresses the delay gap between produced events and consumed events.
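
In log-based systems, lag is often measured in offsets rather than in time. The sketch below assumes per-partition offsets are available as plain dicts; the `consumer_lag` helper is hypothetical and does not correspond to any specific client API.

```python
def consumer_lag(latest_offsets, committed_offsets):
    """Per-partition lag: newest produced offset minus last committed consumer offset."""
    return {
        partition: latest_offsets[partition] - committed_offsets.get(partition, 0)
        for partition in latest_offsets
    }

lag = consumer_lag({0: 100, 1: 250}, {0: 90, 1: 250})
total_lag = sum(lag.values())
```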

⚡

Stream Processing

A processing approach based on handling continuously arriving data events with low latency.

🪟

Stream Windowing

An approach that groups continuous data streams into defined time or event intervals for computation.
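
The simplest case is a tumbling (fixed, non-overlapping) time window. The sketch below counts events per window, assuming integer event times and a hypothetical `(event_time, key)` event shape; sliding and session windows would need different bucketing logic.

```python
from collections import defaultdict

def tumbling_window_counts(events, window_size):
    """Count events per fixed window; returns {window_start: count}."""
    counts = defaultdict(int)
    for event_time, _key in events:
        window_start = (event_time // window_size) * window_size
        counts[window_start] += 1
    return dict(counts)

events = [(1, "a"), (4, "b"), (10, "a"), (12, "c"), (25, "a")]
counts = tumbling_window_counts(events, window_size=10)
```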

3 terms

🧾

Technical Metadata

The technical metadata layer that describes schema, data types, source structures, and storage characteristics of data assets.

📜

Transformation Audit Chain

A trace structure that stores, in auditable form, in what order, under what logic, and under which version data transformations were applied.

🧱

Transformation Layer

A rule-driven processing layer that reshapes raw data for analytical or operational use.

1 term

👁️

Usage Metadata

A type of metadata showing who uses a data asset, how often, and for what purposes.

3 terms

🚀

Vector Cache

A performance layer that temporarily stores frequently requested embeddings or retrieval results for faster access.

🧲

Vector Database

A storage and retrieval system optimized for high-dimensional embedding data and similarity search.

⚖️

Vector Normalization

The process of controlling embedding magnitude effects to produce more stable retrieval behavior.
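
A common form is L2 normalization, which scales each vector to unit length so that dot product and cosine similarity coincide. The sketch below assumes that form; leaving the zero vector unchanged is a choice made for this example.

```python
import math

def l2_normalize(vector):
    """Scale the vector to unit Euclidean length; the zero vector is returned unchanged."""
    norm = math.sqrt(sum(x * x for x in vector))
    if norm == 0:
        return list(vector)
    return [x / norm for x in vector]

unit = l2_normalize([3.0, 4.0])
```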

4 terms

🗝️

Warehouse Partition Key

The primary partitioning field used to split warehouse tables into logical segments.

💧

Watermarking

A mechanism in stream systems that defines a tolerance boundary for time progression in order to manage late-arriving events.
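
One simple strategy sets the watermark to the highest event time seen so far minus a fixed allowed lateness, and treats anything older as late. The sketch below assumes that strategy; the `WatermarkTracker` class is hypothetical and far simpler than what stream engines provide.

```python
class WatermarkTracker:
    """Tracks the highest event time seen and flags events older than the watermark."""

    def __init__(self, allowed_lateness):
        self.allowed_lateness = allowed_lateness
        self.max_event_time = None

    @property
    def watermark(self):
        if self.max_event_time is None:
            return None
        return self.max_event_time - self.allowed_lateness

    def accept(self, event_time):
        """Return True if the event is on time, False if it falls behind the watermark."""
        current = self.watermark
        if current is not None and event_time < current:
            return False
        if self.max_event_time is None or event_time > self.max_event_time:
            self.max_event_time = event_time
        return True

tracker = WatermarkTracker(allowed_lateness=5)
results = [tracker.accept(t) for t in [10, 20, 16, 12, 15]]
```

After the event at time 20 the watermark sits at 15, so the event at 16 is still accepted while the one at 12 is flagged as late.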

🎼

Workflow Orchestration

An approach for centrally managing multiple data processing steps through dependency, sequencing, and scheduling rules.

🚧

Workload Isolation

A warehouse approach in which resources are separated to prevent different query and compute workloads from interfering with one another.
