Data Science and Data Management

97 terms in the Data Science and Data Management domain, each bilingual (TR/EN) with a related-term graph.

Data Collection · Data Cleaning · Data Preprocessing · Feature Engineering · Data Labeling · Data Quality · Data Governance · Data Privacy · Synthetic Data · Imbalanced Data Problems

All Terms (97)

5 terms

🎯

Accuracy

A quality dimension expressing how correctly a data field reflects the real-world value it represents.

🎯

Active Labeling

An approach that aims to optimize labeling cost by selecting the most useful or uncertain examples for annotation.

⚖️

Adjudication Workflow

A quality assurance workflow in which conflicting or ambiguous labels are resolved through higher-level review.

🧺

Aggregation Feature

A feature structure that summarizes lower-level records into higher-level signals meaningful for modeling.

🎭

Anonymization

The process of transforming personal data so that it can no longer be linked back to a specific individual.

1 term

📦

Balanced Batch Sampling

A sampling strategy that balances learning by maintaining a more controlled class distribution within each training batch.

8 terms

🧱

Canonicalization

The process of converting different representations of the same information into one standard canonical form.

🧹

Category Standardization

The process of unifying different spellings, abbreviations, or formats representing the same concept into one standard form.

🔄

Change Data Capture

An approach for tracking data changes in source systems and propagating them to downstream systems in near real time.

⚖️

Class Imbalance

A condition in which some classes are heavily represented while others are represented only sparsely in a dataset.

🏋️

Class Weighting

An approach that rebalances model learning by increasing the error cost of underrepresented classes.
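One common way to set such error costs is to weight each class inversely to its frequency. The sketch below assumes that heuristic (weight = n_samples / (n_classes × class_count)); the function name `balanced_class_weights` is hypothetical, and other weighting schemes are equally valid.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Return one weight per class, inversely proportional to class frequency."""
    if not labels:
        raise ValueError("labels must not be empty")
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    # Rarer classes receive larger weights, so their errors cost more.
    return {cls: n_samples / (n_classes * cnt) for cls, cnt in counts.items()}
```

With three "ok" labels and one "fraud" label, the "fraud" class receives three times the weight of "ok".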

🧩

Completeness

A data quality dimension describing how fully expected fields, records, or business scope are present in a dataset.

🗳️

Consensus Labeling

An approach in which multiple annotators’ judgments are combined to determine the final label for a data instance.
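A simple way to combine judgments is a majority vote. The sketch below assumes that rule and returns `None` on a tie so the item can be routed to further review; the function name `majority_vote` is hypothetical, and weighted or model-based aggregation would work differently.

```python
from collections import Counter


def majority_vote(annotations):
    """Return the most frequent label, or None when the top labels are tied."""
    if not annotations:
        raise ValueError("annotations must not be empty")
    ranked = Counter(annotations).most_common()
    # A tie between the two most frequent labels means no consensus was reached.
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return None
    return ranked[0][0]
```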

✅

Consent Management

The consent-based management of the purposes, scope, and duration under which personal data may be processed.

18 terms

📚

Data Catalog

A centralized catalog structure that presents definitions, ownership, usage, and discovery information for enterprise data assets.

📥

Data Collection

The systematic process of acquiring data for analysis, reporting, and modeling workflows.

⏱️

Data Collection SLA

An operational service-level framework that defines timeliness, completeness, and availability standards for data flows.

📄

Data Contracts

An agreement approach that explicitly defines schema, quality, and delivery expectations between data producers and consumers.

🏛️

Data Governance

The enterprise framework for managing data through ownership, quality, access, usage, and control principles.

🔍

Data Lineage

The visible trace of all movements and transformations a data element undergoes from source to report or model.

📉

Data Minimization

The principle of collecting and processing only the data that is truly necessary for a defined purpose.

📡

Data Observability

A monitoring approach that aims to detect data issues, anomalies, and silent quality degradation early.

👤

Data Ownership

The principle that defines which business or technical role is responsible for the quality, definition, and use of specific data domains.

🔬

Data Profiling

The process of systematically examining a dataset’s content, distribution, missingness, uniqueness, and rule violations.

🗂️

Data Source

The system, platform, or operational touchpoint where data is generated, stored, or retrieved.

🧑‍💼

Data Stewardship

An operational approach in which specific data domains are actively stewarded for definition, quality, and appropriate use.

🔠

Data Type Mismatch

A problem arising when the expected data type of a field differs from the actual stored content type.

🧪

Derived Feature

A new feature computed or transformed from existing fields rather than directly coming from raw data.

🔐

Differential Privacy

A mathematical privacy framework that limits the extent to which any single individual’s data can affect published results.
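As an illustration, a counting query can be released with Laplace noise scaled to sensitivity / epsilon. The sketch below assumes a sensitivity of 1 (one person changes the count by at most one); the function names are hypothetical, and this is not a hardened implementation.

```python
import random


def laplace_noise(scale, rng):
    """Draw one sample from Laplace(0, scale)."""
    # The difference of two i.i.d. exponentials with mean `scale`
    # follows a Laplace distribution centered at zero.
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def noisy_count(true_count, epsilon, rng, sensitivity=1.0):
    """Return the count plus Laplace noise; smaller epsilon means more noise."""
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    return true_count + laplace_noise(sensitivity / epsilon, rng)
```

Calling `noisy_count(100, epsilon=0.5, rng=random.Random())` returns a value near 100 whose spread grows as epsilon shrinks.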

🌫️

Diffusion-Based Synthetic Data

A modern synthetic data generation approach that reconstructs data distributions through noise injection and reverse sampling.

🎲

Domain Randomization

An approach that varies environmental factors in synthetic data generation to make models more robust to the real world.

🪞

Duplicate Record

A repeated data record that represents the same real-world entity or event more than once.

3 terms

🔢

Encoding

The process of converting categorical data into numerical representations that models can process.
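One-hot encoding is one familiar instance. The sketch below assumes a small, fixed category set and uses a hypothetical function name `one_hot_encode`; other encodings (ordinal, hashing, target-based) make different trade-offs.

```python
def one_hot_encode(values):
    """Return (sorted categories, one 0/1 row per input value)."""
    categories = sorted(set(values))
    index = {cat: i for i, cat in enumerate(categories)}
    rows = []
    for value in values:
        row = [0] * len(categories)
        row[index[value]] = 1  # exactly one position is set per row
        rows.append(row)
    return categories, rows
```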

🧩

Entity Resolution

The process of determining whether different records actually refer to the same real-world entity.

📍

Event Tracking

A tracking approach that records user or system behaviors as discrete events.

3 terms

🧮

Feature Hashing

A method that maps features into a fixed-dimensional space using hash functions to provide a scalable representation.
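A minimal sketch of the idea, assuming token counts as features and MD5 as a stable hash (Python's built-in `hash` is salted per process); the function name `hash_features` and the bucket count are illustrative choices.

```python
import hashlib


def hash_features(tokens, n_buckets=8):
    """Map tokens to a fixed-length count vector via a stable hash."""
    vector = [0] * n_buckets
    for token in tokens:
        digest = hashlib.md5(token.encode("utf-8")).digest()
        bucket = int.from_bytes(digest[:8], "big") % n_buckets
        vector[bucket] += 1  # distinct tokens may collide in the same bucket
    return vector
```

The vector length stays at `n_buckets` no matter how many distinct tokens appear, at the cost of possible collisions.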

🎯

Feature Selection

The process of selecting the most informative variables for a model in order to reduce noise, cost, and complexity.

🪄

Fuzzy Matching

A matching approach that uses similarity-based rules to find near-matching records instead of exact matches.
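A small sketch using the standard library's `difflib.SequenceMatcher`; the 0.85 threshold and the function names are assumptions chosen for illustration, and production matching usually tunes the threshold per field.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Return a similarity ratio in [0, 1] after light normalization."""
    return SequenceMatcher(None, a.lower().strip(), b.lower().strip()).ratio()


def is_fuzzy_match(a, b, threshold=0.85):
    """Treat two strings as the same record key when they are similar enough."""
    return similarity(a, b) >= threshold
```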

2 terms

🧠

GAN-Based Synthetic Data

A synthetic data approach based on generating new data samples similar to the real distribution using generative adversarial networks.

✅

Ground Truth

The trusted reference label or verification information considered correct for a data instance.

5 terms

🎯

Imbalance-Aware Calibration

An approach that helps model probabilities reflect true risk levels more accurately under class imbalance.

🩹

Imputation

The process of filling missing observations using statistical, rule-based, or model-driven methods.
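The simplest statistical variant fills gaps with the median of the observed values. The sketch assumes missing entries are represented as `None`; the function name `impute_median` is hypothetical.

```python
from statistics import median


def impute_median(values):
    """Replace None entries with the median of the observed values."""
    observed = [v for v in values if v is not None]
    if not observed:
        raise ValueError("cannot impute when every value is missing")
    fill = median(observed)
    return [fill if v is None else v for v in values]
```

For a model, the fill value would normally be computed on the training split only and reused at serving time.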

🎛️

Instrumentation Design

A design approach that defines which events and fields should be recorded, and how, in order to measure product, process, or system behavior correctly.

🤝

Inter-Annotator Agreement

A quality measure indicating how consistently different annotators make similar decisions on the same data.
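Cohen's kappa is one widely used agreement measure for two annotators; it corrects raw agreement for agreement expected by chance. The sketch below is a plain implementation with a hypothetical function name, and it treats the degenerate case (both annotators use one identical label throughout) as full agreement, which is an assumption.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labeling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    # Chance agreement: probability both pick the same label independently.
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # kappa is undefined here; treated as full agreement
    return (observed - expected) / (1 - expected)
```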

🔗

Interaction Feature

A combined variable created to capture the joint effect of two or more features.

1 terms


👥

k-Anonymity

A privacy protection model that aims to make each individual indistinguishable from at least k-1 others with respect to quasi-identifying attributes.
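In practice this is checked by grouping records on their quasi-identifier values and finding the smallest group: a table is k-anonymous when every group holds at least k records. The sketch below assumes records are dictionaries; the function name and field names are hypothetical.

```python
from collections import Counter


def k_anonymity_level(records, quasi_identifiers):
    """Return the size of the smallest group sharing the same quasi-identifiers."""
    if not records:
        raise ValueError("records must not be empty")
    groups = Counter(
        tuple(record[field] for field in quasi_identifiers) for record in records
    )
    return min(groups.values())
```

A result of 1 means at least one person is unique on the chosen attributes and therefore easiest to single out.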

6 terms

🌳

Label Ontology

A classification framework that defines the hierarchical, relational, and conceptual structure of labels.

📘

Labeling Guideline

A formal instruction document defining the rules, examples, and exceptions to be used during labeling.

⏮️

Lag Feature

A type of feature that brings time-dependent patterns into the model using values from previous time steps.
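A minimal sketch, assuming the series is already sorted by time and that positions without enough history are filled with `None`; the function name `lag_feature` is hypothetical.

```python
def lag_feature(series, k=1):
    """Return the series shifted back by k steps, padded with None."""
    if k < 1:
        raise ValueError("k must be at least 1")
    # Position i sees the value from k steps earlier, never a future value.
    return [None if i < k else series[i - k] for i in range(len(series))]
```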

🚫

Leakage Prevention

A preprocessing discipline that prevents information unavailable at real usage time from leaking into model training.

🚧

Leakage-Aware Feature Engineering

An approach to feature creation that preserves time, target, and operational usage boundaries to avoid leakage.

🌈

l-Diversity

A privacy model that requires sufficient diversity of sensitive values within anonymized groups.

5 terms

🧭

Master Data Management

An approach for managing core enterprise entities such as customers, products, and suppliers in a unified and consistent way.

📝

Metadata Management

The systematic management of descriptions, sources, usage, and technical structure information about data.

🕳️

Missing Data

A condition in which fields expected in an observation appear as empty, null, or unknown.

🌀

Mode Collapse

A problem in synthetic data generation where the model loses distributional diversity and produces only limited types of samples.

📊

Monotonic Binning

A feature transformation technique that bins continuous variables while preserving a monotonic relationship with the target.

1 term

📏

Normalization

The process of bringing numerical variables to a defined scale to make them more suitable for modeling and comparison.
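Min-max scaling is one common form: values are mapped linearly onto a target range such as [0, 1]. The sketch below assumes that variant; the function name `min_max_scale` is hypothetical, and constant inputs are mapped to the lower bound by choice.

```python
def min_max_scale(values, new_min=0.0, new_max=1.0):
    """Linearly rescale values so the minimum maps to new_min and the maximum to new_max."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [new_min for _ in values]  # no spread to rescale
    span = hi - lo
    return [new_min + (v - lo) * (new_max - new_min) / span for v in values]
```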

3 terms

🕳️

One-Class Classification

A modeling approach that learns the normal pattern and treats deviations as anomalous when the minority class is extremely rare.

🚨

Outlier

An observation or value that deviates noticeably from the general pattern of the dataset.

⬆️

Oversampling

An approach that increases the number of minority-class examples to make them more visible in the dataset.
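The most basic variant duplicates minority rows at random until every class matches the largest one. The sketch assumes that variant, with a fixed seed for reproducibility; the function name `random_oversample` is hypothetical.

```python
import random
from collections import Counter


def random_oversample(rows, labels, seed=0):
    """Duplicate rows of smaller classes until all classes have equal counts."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())
    out_rows, out_labels = list(rows), list(labels)
    for cls, count in counts.items():
        pool = [row for row, label in zip(rows, labels) if label == cls]
        for _ in range(target - count):
            out_rows.append(rng.choice(pool))  # sample with replacement
            out_labels.append(cls)
    return out_rows, out_labels
```

Resampling should be applied to the training split only, so that evaluation data keeps its original class distribution.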

7 terms

👀

Passive Data Collection

An approach in which data is collected through behavior, sensor output, and system traces rather than direct user input.

💻

Policy as Code

An approach in which data access, usage, and security policies are defined and enforced through code instead of manual processes.

🛤️

Preprocessing Pipeline

A sequenced, reproducible, and automation-friendly workflow of data transformation steps.

💰

Privacy Budget

A concept that quantitatively governs how much privacy loss is allowed in differential privacy applications.

🔐

Privacy-Preserving Synthetic Data

A synthetic data generation approach designed to create analytical value without exposing real individuals.

💻

Programmatic Labeling

An approach in which labels are generated automatically through code, rules, or functions rather than manual entry.

🪪

Pseudonymization

An approach that replaces direct identifiers with substitute representations that can be re-linked only through controlled additional information.
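One possible realization is a keyed hash: the same identifier always maps to the same token, and only someone holding the secret key can recompute the mapping. The sketch below assumes HMAC-SHA256 for this; the function name is hypothetical, and a separately stored lookup table is an equally valid design.

```python
import hashlib
import hmac


def pseudonymize(identifier, secret_key):
    """Replace an identifier with a keyed, deterministic token."""
    # secret_key (bytes) is the controlled additional information:
    # it must be stored apart from the pseudonymized data.
    return hmac.new(secret_key, identifier.encode("utf-8"), hashlib.sha256).hexdigest()
```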

1 term

📈

Quantile Transformation

A transformation that reshapes data through rank-based mapping to make it more regular or closer to a target distribution.

8 terms

🚨

Rare Event Modeling

The modeling of low-frequency but high-impact events, which typically requires specialized strategies.

🕵️

Re-identification Risk

A privacy risk describing the possibility of identifying individuals again from anonymized or restricted datasets.

🧾

Reconciliation Control

The process of verifying alignment of records, totals, and business logic across different data systems or layers.

🔗

Record Linkage

The process of linking records belonging to the same person, organization, or event across multiple data sources.

🗃️

Reference Data Management

The centralized and consistent management of controlled data sets such as code lists, classes, and shared dictionaries.

🗄️

Retention Policy

A governance policy that defines how long data is retained, and when it should be archived or deleted.

🪟

Rolling Window Features

A feature structure that summarizes past observations within a defined window to generate time-dependent signals.
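A minimal sketch of a trailing mean that uses only the `window` observations strictly before each position, so the current value never feeds its own feature. It assumes a time-ordered list and returns `None` until a full window of history exists; the function name is hypothetical.

```python
def rolling_mean_past(values, window):
    """Mean of the previous `window` values at each position (current value excluded)."""
    if window < 1:
        raise ValueError("window must be at least 1")
    features = []
    for i in range(len(values)):
        if i < window:
            features.append(None)  # not enough history yet
        else:
            past = values[i - window:i]
            features.append(sum(past) / window)
    return features
```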

📜

Rule-Based Data Cleansing

A cleansing approach that improves data quality through explicit business rules and validation conditions.

9 terms

🧬

SMOTE

A widely used balancing technique that generates new synthetic examples for the minority class from existing ones.
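The core idea is interpolation: pick a minority point, pick one of its nearest minority neighbors, and place a new point somewhere on the segment between them. The sketch below is a simplified illustration of that idea, not a reference implementation; the function name `smote_like` and its defaults are assumptions.

```python
import math
import random


def smote_like(minority, n_new, k=2, seed=0):
    """Generate n_new synthetic points by interpolating between minority neighbors."""
    if len(minority) < 2:
        raise ValueError("need at least two minority points")
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        base = minority[i]
        # k nearest minority neighbors of the base point (excluding itself).
        others = [p for j, p in enumerate(minority) if j != i]
        neighbors = sorted(others, key=lambda p: math.dist(base, p))[:k]
        neighbor = rng.choice(neighbors)
        gap = rng.random()  # position along the segment base -> neighbor
        synthetic.append(tuple(b + gap * (n - b) for b, n in zip(base, neighbor)))
    return synthetic
```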

🧭

Sampling Frame

The source list or coverage structure that defines which units can enter the sampling process.

🧱

Schema Drift

The risk that changes in data structure over time will break existing processing and analytics workflows.

🕹️

Simulation Data

Data generated by imitating the behavior of real systems through mathematical or rule-based models.

⚖️

Standardization

The process of transforming a variable so that it has mean zero and standard deviation one.
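In code this is the z-score: subtract the mean and divide by the standard deviation. The sketch below assumes the population standard deviation; the function name `standardize` is hypothetical.

```python
from statistics import mean, pstdev


def standardize(values):
    """Return z-scores: (value - mean) / population standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        raise ValueError("cannot standardize a constant variable")
    return [(v - mu) / sigma for v in values]
```

For modeling, the mean and standard deviation would normally be estimated on training data and reused unchanged on new data.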

🌊

Streaming Data Collection

An approach for ingesting continuously generated data in real time or near real time.

🧬

Synthetic Data

Artificially generated data designed to imitate real data distributions for analysis or modeling purposes.

🧪

Synthetic Data Fidelity

A property indicating how well synthetic data preserves the statistical, structural, and use-case-relevant characteristics of real data.

🧯

Synthetic Data Leakage

A risk in which synthetic data leaks membership or privacy-sensitive information because it preserves too much trace of the real data.

6 terms

🎯

Target Encoding

An advanced feature engineering technique that converts categorical levels into numerical representations using target-related summary statistics.
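A common summary statistic is the per-category target mean, shrunk toward the global mean so that rare categories are not over-trusted. The sketch assumes that smoothed-mean form; the function name and the `smoothing` parameter are illustrative.

```python
from collections import defaultdict


def target_encode(categories, targets, smoothing=1.0):
    """Map each category to a smoothed mean of its target values."""
    global_mean = sum(targets) / len(targets)
    sums = defaultdict(float)
    counts = defaultdict(int)
    for cat, target in zip(categories, targets):
        sums[cat] += target
        counts[cat] += 1
    # Small categories are pulled toward the global mean by `smoothing` pseudo-rows.
    return {
        cat: (sums[cat] + smoothing * global_mean) / (counts[cat] + smoothing)
        for cat in counts
    }
```

Because the encoding uses the target, it should be fitted on training data only; fitting it on the full dataset leaks label information.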

🎚️

Threshold Moving

An approach that adjusts the classification threshold according to business goals and error costs in imbalanced settings.
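As a sketch, the threshold can be chosen by scanning candidate cut-offs and keeping the one with the lowest total error cost on validation data. The cost values and the function name `best_threshold` are assumptions for illustration.

```python
def best_threshold(scores, labels, cost_fp=1.0, cost_fn=5.0):
    """Return (threshold, cost) minimizing cost_fp * FP + cost_fn * FN."""
    # Candidates: every observed score, plus infinity (predict nothing positive).
    candidates = sorted(set(scores)) + [float("inf")]
    best_t, best_cost = None, float("inf")
    for t in candidates:
        fp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 0)
        fn = sum(1 for s, y in zip(scores, labels) if s < t and y == 1)
        cost = cost_fp * fp + cost_fn * fn
        if cost < best_cost:
            best_t, best_cost = t, cost
    return best_t, best_cost
```

When missed positives are much more expensive than false alarms, the selected threshold tends to fall below the default 0.5.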

🕒

Time-Based Split

An approach in which training and evaluation sets are split chronologically for time-dependent data.
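A minimal sketch, assuming each row is a dictionary with a comparable timestamp field; the function name and the `ts` key are hypothetical.

```python
from datetime import date


def time_based_split(rows, cutoff, time_key="ts"):
    """Rows before the cutoff go to training; rows at or after it go to evaluation."""
    ordered = sorted(rows, key=lambda row: row[time_key])
    train = [row for row in ordered if row[time_key] < cutoff]
    evaluation = [row for row in ordered if row[time_key] >= cutoff]
    return train, evaluation
```

Unlike a random split, every evaluation row is later than every training row, which mirrors how the model is used after deployment.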

⏰

Timeliness

The property of data being sufficiently current, timely, and available when needed.

⚙️

Train-Serve Skew

A mismatch between the data seen during training and the data encountered in production at serving time.

📊

t-Closeness

A model that requires sensitive-value distributions within anonymized groups to remain close to the overall dataset distribution.

1 term

⬇️

Undersampling

An approach that reduces the number of majority-class examples to produce a more balanced class distribution.
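The simplest variant keeps a random subset of each larger class, sized to match the smallest class. The sketch assumes that variant with a fixed seed; the function name `random_undersample` is hypothetical.

```python
import random
from collections import Counter


def random_undersample(rows, labels, seed=0):
    """Randomly drop rows of larger classes until all classes have equal counts."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = min(counts.values())
    kept = []
    for cls in counts:
        indices = [i for i, label in enumerate(labels) if label == cls]
        kept.extend(rng.sample(indices, target))  # sample without replacement
    kept.sort()  # preserve the original row order
    return [rows[i] for i in kept], [labels[i] for i in kept]
```

The trade-off is that discarded majority rows may carry useful information, so this fits best when the majority class is large and redundant.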

1 term

🛡️

Validity

A quality dimension indicating whether a data value conforms to defined formats, ranges, vocabularies, or business rules.

3 terms

🪄

Weak Supervision

An approach that generates approximate labels through rules, heuristics, or weak sources instead of full manual labeling.

🕸️

Web Scraping

A method for programmatically collecting structured or semi-structured data from web pages.

✂️

Winsorization

An approach that caps extreme values at defined thresholds instead of removing them entirely.
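A minimal sketch with explicit lower and upper thresholds; in practice the thresholds are often taken from percentiles of the training data. The function name `winsorize` here is a local helper, not a library call.

```python
def winsorize(values, lower, upper):
    """Clamp each value into [lower, upper] while keeping every observation."""
    if lower > upper:
        raise ValueError("lower must not exceed upper")
    return [min(max(v, lower), upper) for v in values]
```

For example, `winsorize([-50, 1, 2, 3, 100], 0, 10)` returns `[0, 1, 2, 3, 10]`: the row count is unchanged and only the extremes move.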
