Data Engineering and AI Infrastructure
109 terms in the Data Engineering and AI Infrastructure domain, each presented bilingually (TR/EN) with a related-term graph.
All Terms (109)
Aggregate Table
A table structure that stores summarized results derived from detailed data in order to accelerate analytical queries.
Approximate Nearest Neighbor Search
A search approach in high-dimensional vector spaces that prioritizes speed and acceptable proximity over exactness.
Audit Trail
A tracking mechanism that records who changed data or processes, when, and how.
Backfill
The process of reprocessing historical periods or filling missing historical data gaps after the fact.
Batch Backlog
A load condition in which scheduled batch jobs accumulate in the processing queue because they cannot complete on time.
Batch Job
A bulk data processing task that runs according to a schedule or trigger.
Batch Processing
A processing approach in which data is handled in bulk at scheduled intervals and results are produced periodically.
Blast Radius Analysis
A risk analysis approach that measures how many assets and which critical processes may be affected by a data change or failure.
Business Metadata
The metadata layer that explains the business meaning, usage purpose, and enterprise definitions of data assets.
Change Propagation Analysis
The process of analyzing how a change in a data asset or business logic will propagate across the platform.
Cold Storage Tier
An approach in which rarely accessed but still required data is kept in a low-cost storage tier.
Column-Level Lineage
A detailed lineage level that traces which source columns each field was derived from and how it was transformed.
Conformed Dimension
A dimension structure reused across different fact tables and business domains with shared meaning.
Consumer Group
A group of consumer instances that read from the same stream in a balanced and parallel way.
Curated Zone
The data lake layer where data has been cleaned, structured, and made more suitable for analytical use.
Cutoff Time
A time boundary that sets the latest moment up to which a batch job accepts records for a given data period.
DAG
A directed acyclic graph: the core orchestration structure that models data processing tasks and their dependencies as a graph with no cycles.
Dashboard Lineage
A trace structure that shows which datasets, queries, and transformations feed the metrics and visuals inside a dashboard.
Data Contract Enforcement
An approach in which schema, quality, and delivery expectations are not only defined, but actively enforced by the system.
Data Lake
A storage layer where structured and unstructured data is kept in raw or lightly processed form at scale.
Data Lifecycle Tiering
An approach in which data is moved across storage tiers as its access frequency, age, and business value change.
Data Lineage
The traceable lifecycle journey of a data element from source through transformations to final usage.
Data Pipeline
A processing chain that reliably moves data from a source, through transformations, into one or more target systems.
Data Product Lineage
A trace structure showing which sources feed a data product, how it is produced, and which consumers it serves.
Data Provenance
A source-reliability perspective that describes the origin, creation conditions, and processing history of a data element.
Data Warehouse
A structured, integrated, query-optimized data storage environment built for reporting, analytics, and decision support.
Dataset Dependency Map
A mapping structure that systematically shows dependency relationships among datasets.
Dependency Management
The process of managing the dependencies among tasks, datasets, and execution orders within a data workflow.
Dependency Resolution
The process of determining in what order and under what conditions tasks and data assets in a workflow should execute.
Dimensional Modeling
A modeling approach that organizes analytical data structures around facts and dimensions.
Downstream Breakage Risk
A risk measure describing the likelihood that a change in a data asset will cause breakage in connected reports, models, or services.
ELT
Extract, Load, Transform: a modern approach in which data is loaded into the target platform first and transformed afterward inside the storage or compute layer.
ETL
Extract, Transform, Load: a classic data integration approach in which data is extracted from source systems, transformed, and then loaded into a target analytical environment.
Embedding Versioning
An approach for managing different embedding models or updated embedding-generation processes through versions.
Event Schema Registry
A structure in which event schemas are centrally stored and their evolution is managed in stream-based systems.
Event Time
A time concept expressing when an event actually occurred, independent of when it reached the system.
Exactly-Once Semantics
A processing model that aims to guarantee each data event is logically handled exactly once by the system.
Feature Consistency Check
A validation process that verifies whether training-side and serving-side feature values are produced with the same logic and definition.
Feature Deprecation Policy
A policy that governs the controlled retirement of feature definitions that are no longer recommended or supported.
Feature Registry
A registry layer where feature definitions, versions, ownership, and usage states are centrally maintained.
Feature Serving API
A service layer that delivers the required features through a standardized interface during live prediction.
Feature Store
An infrastructure layer where machine learning features are centrally managed so that they can be reused and stay consistent between training and serving.
Feature Versioning
An approach for managing changes in feature definitions as traceable versions over time.
File Pruning
An optimization technique that improves data lake performance by avoiding scans of unnecessary files during queries.
Lakehouse
A modern architectural approach that combines the flexibility of data lakes with the manageability and performance characteristics of warehouses.
Late Data Reconciliation
A correction process that brings late-arriving data into alignment with previously produced batch outputs.
Lineage Completeness
A quality dimension describing how fully lineage information covers all critical steps and dependencies in the data flow.
Lineage Confidence Score
A quality indicator that expresses the reliability level of automatically or semi-automatically inferred lineage information.
Lineage Reconciliation
The process of reconciling trace information coming from different lineage sources into a consistent view.
Lineage-Driven Access Impact
An approach that analyzes the impact of access permission changes on connected data products and consumer systems.
Lineage-Metadata Sync
A synchronization approach that keeps metadata definitions and lineage relationships consistent and up to date.
Load Window
A boundary structure that defines which time range a load process covers and when it runs.
Medallion Architecture
A layered data-processing model that progresses from raw data toward reliable analytical data, commonly through bronze, silver, and gold layers.
Merge Policy
A loading logic that defines which record should be preserved under which rule when incoming and existing records conflict.
Metadata
The body of descriptive, source, usage, and technical information that exists about data.
Metadata Filtering in Vector Search
An approach that narrows vector similarity results using additional fields such as date, source, user, or category.
Metadata Quality Score
A quality score used to measure the completeness, freshness, clarity, and governance maturity of metadata.
Metadata Registry
A central registry where metadata objects are stored in a standardized, governable, and accessible form.
Metadata Versioning
An approach that stores changes in metadata definitions as traceable versions over time.
Model Lineage
A traceability structure that shows which data, features, code version, and training workflow produced a machine learning model.
Offline Feature Store
A historical and large-scale feature storage layer used for model training, backtesting, and batch feature generation.
Online Feature Store
A feature store layer optimized for low-latency feature serving at live prediction time.
Open Table Format
An open-standard table structure that supports versioning, transactions, and metadata management for large-scale data lake tables.
Operational Metadata
The metadata layer containing operational information such as run status, refresh timing, error history, and processing performance.
Orphaned Asset Detection
A lineage-based control process for identifying data assets that no longer have meaningful upstream or downstream connections.
PII Lineage Tracking
A specialized lineage approach focused on tracking where personal data comes from and where it flows across the data platform.
Partition Pruning
An optimization technique that reduces cost by processing only relevant partitions in batch jobs and queries.
Partitioning
A technique for improving read, write, and processing efficiency by splitting large datasets into logical partitions.
Pipeline Observability
An approach that continuously monitors the health, latency, volume, and failure behavior of data pipelines.
Pipeline SLA
A service standard that defines the expected delivery time, success rate, and availability level of a data pipeline.
Point-in-Time Join
An approach for generating training data using only the historical features that would actually have been available at prediction time.
Pushdown Transformation
An approach in which transformations are executed inside the engine or platform where the data already resides rather than in a separate layer.
Raw Zone
The data lake layer where source data is first accepted with minimal alteration.
Real-Time Feature Computation
An approach in which feature values are computed close to prediction time instead of being fully precomputed.
Rerun Strategy
An approach that defines how failed or incomplete data jobs should be rerun and under what safety rules.
Reverse ETL
An integration approach that moves curated data from analytics platforms back into operational systems.
Schema-on-Read
A flexible data-processing approach in which schema is applied when data is read rather than when it is written.
Semantic Layer
A layer that abstracts business metrics, definitions, and query logic consistently above technical data structures.
Semantic Lineage
A lineage approach that shows how data assets are derived and connected not only technically, but also at the business-meaning level.
Similarity Metric
The core retrieval criterion that defines how proximity between vectors is computed, such as cosine similarity, dot product, or Euclidean distance.
Slowly Changing Dimension
A warehouse modeling approach that defines how changes to dimension attributes are tracked and preserved over time.
Source System Replication
An approach in which data from a source system is replicated into another environment for analytical or operational use.
Staging Area
An intermediate preparation layer where source data is temporarily held before final transformation.
Star Schema
A classic analytical warehouse design with a central fact table surrounded by dimension tables.
State Store
A persistent or semi-persistent data structure that stores historical context and intermediate computation state during stream processing.
Stream Join
The operation of joining multiple continuous data streams by key and time logic to create meaningful event context.
Stream Lag
A core stream-health metric that expresses the delay gap between produced events and consumed events.
Stream Processing
A processing approach based on handling continuously arriving data events with low latency.
Stream Windowing
An approach that groups continuous data streams into defined time or event intervals for computation.
Technical Metadata
The metadata layer that describes the schema, data types, source structures, and storage characteristics of data assets.
Transformation Audit Chain
A trace structure that records, in auditable form, the order, logic, and version in which data transformations were applied.
Transformation Layer
A rule-driven processing layer that reshapes raw data for analytical or operational use.
Vector Cache
A performance layer that temporarily stores frequently requested embeddings or retrieval results for faster access.
Vector Database
A storage and retrieval system optimized for high-dimensional embedding data and similarity search.
Vector Normalization
The process of rescaling embedding vectors, typically to unit length, so that differences in magnitude do not distort similarity scores and retrieval behaves more stably.
Warehouse Partition Key
The primary partitioning field used to split warehouse tables into logical segments.
Watermarking
A mechanism in stream systems that tracks event-time progress and sets a tolerance boundary for how late an event may arrive and still be included in a computation.
Workflow Orchestration
An approach for centrally managing multiple data processing steps through dependency, sequencing, and scheduling rules.
Workload Isolation
A warehouse approach in which resources are separated to prevent different query and compute workloads from interfering with one another.
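Example: Point-in-Time Join
The Point-in-Time Join entry above can be illustrated with a minimal sketch. It assumes a single entity, integer timestamps, and the rule that a feature value counts as available when its event time is at or before the prediction time. The names feature_history, labels, and point_in_time_value are hypothetical and exist only for this illustration.

```python
from bisect import bisect_right
from typing import List, Optional, Tuple

# Hypothetical feature history for one entity: (event_time, value),
# sorted by event_time in ascending order.
feature_history: List[Tuple[int, float]] = [(1, 10.0), (5, 12.5), (9, 20.0)]

# Hypothetical training labels: (prediction_time, label).
labels: List[Tuple[int, int]] = [(0, 0), (5, 1), (7, 0), (12, 1)]


def point_in_time_value(
    history: List[Tuple[int, float]], prediction_time: int
) -> Optional[float]:
    """Return the latest value whose event_time <= prediction_time.

    Returns None when no value existed yet, so that future values
    never leak into the training row.
    """
    times = [t for t, _ in history]
    # Index of the first event strictly after prediction_time.
    idx = bisect_right(times, prediction_time)
    if idx == 0:
        return None
    return history[idx - 1][1]


# Each training row pairs a label with the feature value as of its own time.
training_rows = [
    (t, point_in_time_value(feature_history, t), y) for t, y in labels
]

for row in training_rows:
    print(row)
```

The row at time 7 receives the value recorded at time 5 rather than the later value from time 9, which is the leakage that a point-in-time join is meant to prevent. A real feature store would typically perform this lookup per entity key and at much larger scale.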