Data Science and Data Management
98 terms in the Data Science and Data Management domain, each bilingual (TR/EN) with a related-term graph.
All Terms (98)
Accuracy
A quality dimension expressing how correctly a data field reflects the real-world value it represents.
Active Labeling
An approach that aims to optimize labeling cost by selecting the most useful or uncertain examples for annotation.
Adjudication Workflow
A quality assurance workflow in which conflicting or ambiguous labels are resolved through higher-level review.
Aggregation Feature
A feature structure that summarizes lower-level records into higher-level signals meaningful for modeling.
Anonymization
The process of transforming personal data so that it can no longer be linked back to a specific individual.
Canonicalization
The process of converting different representations of the same information into one standard canonical form.
Category Standardization
The process of unifying different spellings, abbreviations, or formats representing the same concept into one standard form.
Change Data Capture
An approach for tracking data changes in source systems and propagating them to downstream systems in near real time.
Class Imbalance
A condition in which some classes are heavily represented while others are represented only sparsely in a dataset.
Class Weighting
An approach that rebalances model learning by increasing the error cost of underrepresented classes.
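As a minimal sketch of the idea above, assuming the common inverse-frequency heuristic `n_samples / (n_classes * class_count)`: the function names below are illustrative, not part of any particular library.

```python
from collections import Counter

def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}

def weighted_error(y_true, y_pred, weights):
    """Total misclassification cost, each error weighted by its true class."""
    return sum(weights[t] for t, p in zip(y_true, y_pred) if t != p)

labels = [0, 0, 0, 1]                      # class 1 is underrepresented
weights = balanced_class_weights(labels)   # class 1 receives the larger weight
```

With these weights, missing the single minority example costs as much as misclassifying all three majority examples.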
Completeness
A data quality dimension describing how fully expected fields, records, or business scope are present in a dataset.
Consensus Labeling
An approach in which multiple annotators’ judgments are combined to determine the final label for a data instance.
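One simple way to combine judgments is a majority vote. The sketch below assumes unweighted annotators and returns no label on a tie, so that tied items can be routed to review; `majority_label` is a hypothetical helper name.

```python
from collections import Counter

def majority_label(votes):
    """Return (label, agreement share) for one item; label is None on a tie."""
    counts = Counter(votes).most_common()
    top_label, top_count = counts[0]
    share = top_count / len(votes)
    # A tie for first place means there is no clear consensus.
    if len(counts) > 1 and counts[1][1] == top_count:
        return None, share
    return top_label, share

print(majority_label(["spam", "spam", "ham"]))  # clear majority for "spam"
print(majority_label(["spam", "ham"]))          # tie, no consensus label
```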
Consent Management
The management of personal data processing according to the purposes, scope, and duration to which individuals have consented.
Data Augmentation
An approach that expands the training set by transforming existing data to improve model robustness.
Data Catalog
A centralized catalog structure that presents definitions, ownership, usage, and discovery information for enterprise data assets.
Data Collection
The systematic process of acquiring data for analysis, reporting, and modeling workflows.
Data Collection SLA
An operational service-level framework that defines timeliness, completeness, and availability standards for data flows.
Data Contracts
An agreement approach that explicitly defines schema, quality, and delivery expectations between data producers and consumers.
Data Governance
The enterprise framework for managing data through ownership, quality, access, usage, and control principles.
Data Lineage
The traceable record of every movement and transformation a data element undergoes from its source to a report or model.
Data Minimization
The principle of collecting and processing only the data that is truly necessary for a defined purpose.
Data Observability
A monitoring approach that aims to detect data issues, anomalies, and silent quality degradation early.
Data Ownership
The principle that defines which business or technical role is responsible for the quality, definition, and use of specific data domains.
Data Profiling
The process of systematically examining a dataset’s content, distribution, missingness, uniqueness, and rule violations.
Data Source
The system, platform, or operational touchpoint where data is generated, stored, or retrieved.
Data Stewardship
An operational practice in which designated stewards maintain the definitions, quality, and appropriate use of specific data domains.
Data Type Mismatch
A problem arising when the expected data type of a field differs from the actual stored content type.
Derived Feature
A new feature computed or transformed from existing fields rather than taken directly from raw data.
Differential Privacy
A mathematical privacy framework that limits the extent to which any single individual’s data can affect published results.
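A common way to realize this limit is the Laplace mechanism. The sketch below assumes a counting query (sensitivity 1) and a privacy parameter `epsilon`; the data and function names are hypothetical.

```python
import random

def laplace_noise(scale, rng):
    # The difference of two independent Exp(1) draws follows Laplace(0, 1).
    return scale * (rng.expovariate(1.0) - rng.expovariate(1.0))

def private_count(values, predicate, epsilon, rng):
    """Noisy count of values matching predicate, with noise scale 1 / epsilon."""
    true_count = sum(1 for v in values if predicate(v))
    # Adding or removing one person changes a count by at most 1 (sensitivity 1).
    return true_count + laplace_noise(1.0 / epsilon, rng)

ages = [23, 35, 41, 52, 67, 29]  # illustrative data only
noisy = private_count(ages, lambda a: a >= 40, epsilon=0.5, rng=random.Random(0))
```

A smaller `epsilon` adds more noise and therefore gives stronger privacy at the cost of accuracy.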
Diffusion-Based Synthetic Data
A modern synthetic data generation approach that reconstructs data distributions through noise injection and reverse sampling.
Domain Randomization
An approach that varies environmental factors in synthetic data generation to make models more robust to the real world.
Duplicate Record
A repeated data record that represents the same real-world entity or event more than once.
Encoding
The process of converting categorical data into numerical representations that models can process.
Entity Resolution
The process of determining whether different records actually refer to the same real-world entity.
Event Tracking
A tracking approach that records user or system behaviors as discrete events.
Feature Hashing
A method that maps features into a fixed-dimensional space using hash functions to provide scalable representation.
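A minimal sketch of the hashing trick, assuming string tokens, SHA-256 as the hash function, and a signed update to reduce the bias from collisions; the bucket count is an arbitrary illustration.

```python
import hashlib

def hash_features(tokens, n_buckets=8):
    """Map an arbitrary set of tokens into a fixed-length vector."""
    vector = [0.0] * n_buckets
    for token in tokens:
        digest = hashlib.sha256(token.encode("utf-8")).digest()
        # The first 8 bytes choose the bucket; another byte chooses the sign.
        index = int.from_bytes(digest[:8], "big") % n_buckets
        sign = 1.0 if digest[8] % 2 == 0 else -1.0
        vector[index] += sign
    return vector

row = hash_features(["color=red", "size=large", "brand=acme"])
```

The output length stays at `n_buckets` no matter how many distinct tokens appear, at the price of occasional collisions.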
Feature Selection
The process of selecting the most informative variables for a model in order to reduce noise, cost, and complexity.
Fuzzy Matching
A matching approach that uses similarity-based rules to find near-matching records instead of exact matches.
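As an illustration, similarity can be scored with Python's standard `difflib.SequenceMatcher`. The normalization step and the 0.85 threshold below are assumptions to tune per dataset.

```python
from difflib import SequenceMatcher

def normalize(text):
    """Lowercase and collapse whitespace before comparing."""
    return " ".join(text.lower().split())

def similarity(a, b):
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()

def fuzzy_matches(query, candidates, threshold=0.85):
    """Return (candidate, score) pairs at or above threshold, best first."""
    scored = [(c, similarity(query, c)) for c in candidates]
    kept = [pair for pair in scored if pair[1] >= threshold]
    return sorted(kept, key=lambda pair: pair[1], reverse=True)

matches = fuzzy_matches("Jon Smith", ["John Smith", "Jane Doe"])
```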
Imbalance-Aware Calibration
An approach that helps model probabilities reflect true risk levels more accurately under class imbalance.
Imputation
The process of filling missing observations using statistical, rule-based, or model-driven methods.
Instrumentation Design
A design approach that defines which events and fields should be recorded, and how, in order to measure product, process, or system behavior correctly.
Inter-Annotator Agreement
A quality measure indicating how consistently different annotators make similar decisions on the same data.
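One widely used measure for two annotators is Cohen's kappa, which corrects raw agreement for agreement expected by chance. This sketch assumes categorical labels; returning 1.0 when both annotators use a single identical label is a convention chosen here, since kappa is undefined in that case.

```python
from collections import Counter

def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labeling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label lists must be non-empty and of equal length")
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    # Chance agreement: probability both pick the same label independently.
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # convention for the degenerate single-label case
    return (observed - expected) / (1 - expected)

kappa = cohen_kappa(["y", "y", "n", "n"], ["y", "n", "n", "n"])
```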
Interaction Feature
A combined variable created to capture the joint effect of two or more features.
Label Ontology
A classification framework that defines the hierarchical, relational, and conceptual structure of labels.
Labeling Guideline
A formal instruction document defining the rules, examples, and exceptions to be used during labeling.
Lag Feature
A type of feature that brings time-dependent patterns into the model using values from previous time steps.
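A minimal sketch, assuming an evenly spaced series already sorted by time; positions with no earlier value are filled with `None`.

```python
def lag(values, k=1):
    """Shift a time-ordered series back by k steps, padding the start with None."""
    if k < 1:
        raise ValueError("k must be at least 1")
    padded = [None] * k + list(values)
    return padded[:len(values)]

sales = [10, 12, 15, 11]       # illustrative data only
sales_lag_1 = lag(sales, 1)    # each row sees the previous period's value
```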
Leakage Prevention
A preprocessing discipline that prevents information unavailable at real usage time from leaking into model training.
Leakage-Aware Feature Engineering
An approach to feature creation that preserves time, target, and operational usage boundaries to avoid leakage.
l-Diversity
A privacy model that requires sufficient diversity of sensitive values within anonymized groups.
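The simplest variant, often called distinct l-diversity, can be checked as sketched below. The rows are assumed to be dictionaries, and the column names are hypothetical.

```python
from collections import defaultdict

def is_l_diverse(rows, quasi_ids, sensitive, l):
    """True if every quasi-identifier group has at least l distinct sensitive values."""
    groups = defaultdict(set)
    for row in rows:
        key = tuple(row[q] for q in quasi_ids)
        groups[key].add(row[sensitive])
    return all(len(values) >= l for values in groups.values())

rows = [  # illustrative, already-generalized records
    {"age_band": "30-39", "zip3": "341", "diagnosis": "flu"},
    {"age_band": "30-39", "zip3": "341", "diagnosis": "asthma"},
    {"age_band": "40-49", "zip3": "342", "diagnosis": "flu"},
    {"age_band": "40-49", "zip3": "342", "diagnosis": "flu"},
]
ok = is_l_diverse(rows, ["age_band", "zip3"], "diagnosis", l=2)
```

Here the second group contains only one diagnosis, so the check fails for `l=2`.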
Master Data Management
An approach for managing core enterprise entities such as customers, products, and suppliers in a unified and consistent way.
Metadata Management
The systematic management of descriptions, sources, usage, and technical structure information about data.
Missing Data
A condition in which fields expected in an observation appear as empty, null, or unknown.
Mode Collapse
A problem in synthetic data generation where the model loses distributional diversity and produces only limited types of samples.
Monotonic Binning
A feature transformation technique that bins continuous variables while preserving a monotonic relationship with the target.
One-Class Classification
A modeling approach, used when the minority class is extremely rare, that learns the normal pattern and treats deviations from it as anomalous.
Outlier
An observation or value that deviates noticeably from the general pattern of the dataset.
Oversampling
An approach that increases the number of minority-class examples to make them more visible in the dataset.
Passive Data Collection
An approach in which data is collected through behavior, sensor output, and system traces rather than direct user input.
Policy as Code
An approach in which data access, usage, and security policies are defined and enforced through code instead of manual processes.
Preprocessing Pipeline
A sequenced, reproducible, and automation-friendly workflow of data transformation steps.
Privacy Budget
A concept that quantitatively governs how much privacy loss is allowed in differential privacy applications.
Privacy-Preserving Synthetic Data
A synthetic data generation approach designed to create analytical value without exposing real individuals.
Programmatic Labeling
An approach in which labels are generated automatically through code, rules, or functions rather than manual entry.
Pseudonymization
An approach that replaces direct identifiers with substitute representations that can be re-linked only through controlled additional information.
Rare Event Modeling
The modeling of low-frequency but high-impact events, which calls for specialized strategies.
Re-identification Risk
A privacy risk describing the possibility of identifying individuals again from anonymized or restricted datasets.
Reconciliation Control
The process of verifying alignment of records, totals, and business logic across different data systems or layers.
Record Linkage
The process of linking records belonging to the same person, organization, or event across multiple data sources.
Reference Data Management
The centralized and consistent management of controlled data sets such as code lists, classes, and shared dictionaries.
Retention Policy
A governance policy that defines how long data is retained, and when it should be archived or deleted.
Rolling Window Features
A feature structure that summarizes past observations within a defined window to generate time-dependent signals.
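A minimal sketch of a rolling mean, assuming a time-ordered series. The window here deliberately excludes the current observation so that the feature uses only past values; that choice is an assumption, not the only convention.

```python
def rolling_mean(values, window=3):
    """Mean of the previous `window` observations; None until enough history exists."""
    out = []
    for i in range(len(values)):
        past = values[max(0, i - window):i]   # strictly before position i
        out.append(sum(past) / window if len(past) == window else None)
    return out

print(rolling_mean([1, 2, 3, 4, 5], window=2))  # [None, None, 1.5, 2.5, 3.5]
```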
Rule-Based Data Cleansing
A cleansing approach that improves data quality through explicit business rules and validation conditions.
SMOTE
Synthetic Minority Over-sampling Technique: a widely used balancing technique that generates new synthetic minority-class examples by interpolating between existing ones and their nearest neighbors.
Sampling Frame
The source list or coverage structure that defines which units can enter the sampling process.
Schema Drift
Change in data structure over time that can break existing processing and analytics workflows.
Simulation Data
Data generated by imitating the behavior of real systems through mathematical or rule-based models.
Standardization
The process of transforming a variable so that it has mean zero and standard deviation one.
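In code, this is the z-score transformation. The sketch below assumes the population standard deviation; using the sample standard deviation instead is an equally common choice.

```python
from statistics import fmean, pstdev

def standardize(values):
    """Rescale values to mean 0 and (population) standard deviation 1."""
    mu = fmean(values)
    sigma = pstdev(values)
    if sigma == 0:
        raise ValueError("cannot standardize a constant variable")
    return [(v - mu) / sigma for v in values]

z = standardize([2, 4, 4, 4, 5, 5, 7, 9])  # mean 5, standard deviation 2
print(z[0])  # -1.5
```

When the transformation feeds a model, `mu` and `sigma` are normally computed on the training data only and then reused for new data.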
Streaming Data Collection
An approach for ingesting continuously generated data in real time or near real time.
Synthetic Data
Artificially generated data designed to imitate real data distributions for analysis or modeling purposes.
Synthetic Data Fidelity
A property indicating how well synthetic data preserves the statistical, structural, and use-case-relevant characteristics of real data.
Synthetic Data Leakage
A risk in which synthetic data leaks membership or privacy-sensitive information because it preserves too much trace of the real data.
Target Encoding
An advanced feature engineering technique that converts categorical levels into numerical representations using target-related summary statistics.
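A minimal sketch of one common variant, the smoothed target mean, assuming a numeric target. The `smoothing` parameter is an illustrative choice that pulls rare categories toward the global mean.

```python
from collections import defaultdict

def fit_target_encoding(categories, targets, smoothing=10.0):
    """Return ({category: smoothed target mean}, global mean)."""
    global_mean = sum(targets) / len(targets)
    sums, counts = defaultdict(float), defaultdict(int)
    for category, y in zip(categories, targets):
        sums[category] += y
        counts[category] += 1
    encoding = {
        c: (sums[c] + smoothing * global_mean) / (counts[c] + smoothing)
        for c in counts
    }
    return encoding, global_mean

encoding, fallback = fit_target_encoding(["a", "a", "b", "b"], [1, 1, 0, 1], smoothing=2.0)
encoded_unseen = encoding.get("z", fallback)  # unseen levels fall back to the global mean
```

Because the encoding is built from the target, it is usually fitted on training data only (or out-of-fold) to avoid leaking the target into the features.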
Threshold Moving
An approach that adjusts the classification threshold according to business goals and error costs in imbalanced settings.
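As a sketch, the threshold can be chosen to minimize total error cost on a validation set. The cost values and function names below are assumptions for illustration.

```python
def total_cost(threshold, probs, labels, cost_fp, cost_fn):
    """Cost of predicting positive whenever probability >= threshold."""
    fp = sum(1 for p, y in zip(probs, labels) if p >= threshold and y == 0)
    fn = sum(1 for p, y in zip(probs, labels) if p < threshold and y == 1)
    return cost_fp * fp + cost_fn * fn

def best_threshold(probs, labels, cost_fp=1.0, cost_fn=5.0):
    """Try each observed probability as a threshold and keep the cheapest."""
    candidates = sorted(set(probs))
    return min(candidates, key=lambda t: total_cost(t, probs, labels, cost_fp, cost_fn))

probs = [0.1, 0.4, 0.35, 0.8]   # illustrative model scores
labels = [0, 0, 1, 1]
threshold = best_threshold(probs, labels)  # misses cost 5x more than false alarms
```

Making false negatives more expensive pushes the chosen threshold below the default 0.5.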
Time-Based Split
An approach in which training and evaluation sets are split chronologically for time-dependent data.
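A minimal sketch, assuming each record is a dictionary with a date field; the field name and cutoff are hypothetical.

```python
from datetime import date

def time_based_split(records, cutoff, key="date"):
    """Records strictly before cutoff go to training; the rest go to evaluation."""
    train = [r for r in records if r[key] < cutoff]
    test = [r for r in records if r[key] >= cutoff]
    return train, test

records = [  # illustrative data only
    {"date": date(2024, 1, 5), "y": 0},
    {"date": date(2024, 2, 9), "y": 1},
    {"date": date(2024, 3, 1), "y": 0},
    {"date": date(2024, 3, 20), "y": 1},
]
train, test = time_based_split(records, cutoff=date(2024, 3, 1))
```

Unlike a random split, no evaluation record is older than any training record.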
Timeliness
The property of data being sufficiently current and available at the moment it is needed.
Train-Serve Skew
A mismatch between the data seen during training and the data encountered in production at serving time.
t-Closeness
A model that requires sensitive-value distributions within anonymized groups to remain close to the overall dataset distribution.
Weak Supervision
An approach that generates approximate labels through rules, heuristics, or weak sources instead of full manual labeling.
Web Scraping
A method for programmatically collecting structured or semi-structured data from web pages.
Winsorization
An approach that caps extreme values at defined thresholds instead of removing them entirely.
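A minimal sketch, assuming the lower and upper thresholds are already known. In practice they are often taken from percentiles of the training data.

```python
def winsorize(values, lower, upper):
    """Cap values below `lower` and above `upper` instead of dropping them."""
    if lower > upper:
        raise ValueError("lower must not exceed upper")
    return [min(max(v, lower), upper) for v in values]

print(winsorize([1, 5, 7, 250], lower=2, upper=100))  # [2, 5, 7, 100]
```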