Deep Learning
119 terms in the Deep Learning domain, each bilingual TR/EN with a related-term graph.
All Terms (119)
Additive Attention
An early attention approach that compares query and context representations through a learnable combination function.
Attention
A mechanism that enables a model to learn which parts of the input deserve more focus during prediction.
Attention Mask
A control mechanism that determines which positions a model may or may not attend to during attention computation.
Attention Score Matrix
A matrix structure that numerically represents how much each element in a sequence attends to the others.
Autoencoder
A neural architecture that learns low-dimensional representations by compressing and reconstructing data.
Backpropagation
The core learning mechanism that propagates loss gradients backward through layers to update weights.
Backpropagation Through Time
A method in sequential models where the network is unrolled across time steps and gradients are computed backward.
Batch Normalization
A technique that normalizes intermediate activations at the mini-batch level to accelerate training and provide partial regularization.
Beta-VAE
A variational model that strengthens VAE regularization to learn more disentangled factors in latent space.
Bidirectional RNN
An RNN structure that processes sequence information in both forward and backward directions to provide richer context.
Bottleneck Layer
A narrow intermediate layer that forces the model to compress information and learn more compact representations.
Causal Attention
An autoregressive attention structure that allows a token to attend only to positions at or before itself.
Cell State
A memory pathway in LSTM architectures that carries long-term information more directly.
Chain Rule
The rule for computing derivatives of composed functions and the mathematical foundation of backpropagation.
Channel Attention
An attention mechanism that emphasizes more informative feature channels rather than treating all of them equally.
Checkpointed Backpropagation
A training technique that reduces memory usage by not storing all intermediate activations and recomputing them when needed.
Computational Graph
A structure that represents model operations as nodes and edges, making automatic differentiation easier.
Context Window
The maximum sequence length that a Transformer model can process in a single pass.
Contractive Autoencoder
A type of autoencoder that uses an additional penalty to learn more stable latent representations under input perturbations.
Convolution
The fundamental operation of CNNs that captures spatial patterns through local filters.
Cross-Attention
An attention mechanism that allows one representation set to draw context from another representation set.
CutMix
A visual regularization technique that mixes image regions and labels together to improve robustness.
Data Augmentation
A regularization approach that improves generalization by expanding training data through meaningful transformations.
Decoder-Only Transformer
A modern large-language-model architecture that generates autoregressively by predicting the next token.
Deep Neural Network
A general neural network structure that learns hierarchical features through multiple hidden layers.
Denoising Autoencoder
A type of autoencoder that learns more robust representations by reconstructing clean outputs from corrupted inputs.
Depthwise Separable Convolution
An efficient convolution structure that reduces CNN computation by separating spatial and channel transformations.
Dilated Convolution
A convolution technique that enlarges the receptive field by inserting gaps between filter elements.
Dropout
A regularization technique that reduces overfitting by randomly disabling a fraction of neurons at each training step.
ELU
An activation function that uses smooth exponential behavior in the negative region to encourage a more balanced activation distribution.
Early Stopping
A strategy that prevents overfitting by stopping training when validation performance begins to deteriorate.
Encoder-Decoder RNN
A classical sequential architecture that compresses an input sequence into context and generates an output sequence from it.
Encoder-Decoder Transformer
The classical Transformer architecture that encodes an input sequence and contextually generates an output sequence.
Encoder-Only Transformer
A Transformer architecture focused on contextual representation learning and used mainly for understanding tasks.
Exploding Gradients
An optimization problem in which gradients grow excessively during backpropagation and destabilize training.
Exposure Bias
The problem in which a model trained on correct past context must face its own imperfect history during inference.
Feature Hierarchy
The structure in which representations become increasingly abstract from lower to higher layers.
Feature Map
The spatial activation representation produced by a convolution layer through specific filters.
Feature Pyramid Network
An architectural design that combines visual information across scales to improve multi-scale object understanding.
Feedforward Neural Network
The classical family of neural networks in which information flows one-way from input to output.
GELU Activation
A modern activation function that weights each input by the standard Gaussian CDF of its value, giving a smooth alternative to a hard threshold.
GRU
A recurrent unit that learns sequence dependencies through a simpler gating structure than LSTM.
Gated Linear Unit
An activation-like structure that filters linear signals through a gating mechanism to enable more selective information flow.
Gradient Checking
A debugging technique that validates analytical gradients by comparing them with numerical approximations.
Gradient Flow
A core training-dynamics concept describing how effectively the learning signal moves across network layers.
Gradient Noise Scale
A training-dynamics measure that characterizes how noisy gradient estimates are in stochastic optimization.
Graph Attention Network
A GNN architecture that combines neighboring nodes with learned attention weights rather than treating them equally.
Graph Classification
A graph learning task focused on assigning a single label to the entire graph.
Graph Convolutional Network
A foundational GNN architecture that learns representations over graphs by using neighborhood information.
Graph Isomorphism Network
A GNN architecture designed to be as powerful as the Weisfeiler-Lehman test at distinguishing graph structures.
Graph Pooling
A GNN operation that aims to compress node information into more compact and task-relevant representations.
GraphSAGE
A GNN method that makes representation learning scalable on large graphs through neighborhood sampling.
Hard-Swish Activation
An efficient activation function designed to approximate Swish-like behavior at lower computational cost.
Hessian-Vector Product
A computational technique that accesses second-order information without explicitly forming the full Hessian matrix.
Heterogeneous Graph Neural Network
An advanced GNN architecture capable of modeling different node and relation types within the same graph.
Hidden Layer Width
An architectural concept referring to the number of neurons in a layer and directly affecting model capacity.
Hidden State
An internal representation vector in sequence models that carries past information and is updated over time.
LSTM
An advanced recurrent architecture that uses gating mechanisms to learn long-term dependencies.
Latent Manifold
The idea that the meaningful variation in data lies on a smooth, low-dimensional manifold that the latent space captures.
Latent Space Interpolation
A technique for exploring the continuity of learned structure by moving between points in latent representation space.
Layer Normalization
A technique that normalizes activations across the features of each individual sample, independent of the batch, and provides more stable training, especially in sequence models.
Leaky ReLU
An activation function that leaves a small nonzero slope in the negative region to alleviate the dying ReLU problem.
Linear Attention
An approach that aims to make attention computation more scalable by reducing its cost from quadratic to approximately linear in sequence length.
Link Prediction
A task aimed at predicting edges that are not currently present in a graph but are likely to exist.
Masked Language Modeling
A pretraining objective based on masking some input tokens and predicting them from context.
Message Passing Neural Network
A general GNN framework that updates information over graphs through message exchange among nodes.
Mish Activation
A modern activation function noted for its smooth, non-monotonic shape and well-behaved gradients.
Mixture-of-Experts Transformer
A Transformer approach that improves scaling efficiency by activating selected expert subnetworks rather than the full model on every input.
Mixup
A data-driven regularization technique that mixes training examples and labels so the model learns smoother decision boundaries.
Multi-Head Attention
A structure that runs attention in parallel across multiple subspaces to learn different types of relationships.
Multilayer Perceptron
A fully connected neural network structure containing multiple hidden layers.
Neighborhood Aggregation
The core GNN operation by which a node updates its representation by collecting information from its neighbors.
Neural Tangent Kernel
A theoretical framework that connects the training dynamics of very wide neural networks with kernel methods.
Node Classification
A core GNN task focused on predicting a label for each node in a graph.
Parameter Sharing
An efficient learning principle in which the same weights are reused across multiple positions or structures.
Perceptron
The most basic artificial neuron model that learns a linear decision boundary through weighted inputs.
Pooling Layer
A layer that summarizes feature maps, reduces dimensionality, and provides robustness to local variations.
Positional Encoding
A method that injects order information into Transformer models so sequence positions become visible.
Post-Norm Transformer
The classical Transformer variant that applies normalization after the residual addition around each attention or FFN block.
Posterior Collapse
A VAE training issue in which the decoder ignores the latent variable, weakening representation learning.
Pre-Norm Transformer
A Transformer design variant that places normalization before the main attention or FFN block.
ReLU Activation
The most common modern activation function, which zeros negative inputs and passes positive inputs through unchanged.
Receptive Field
A concept describing which region of the input contributes information to a neuron or feature activation.
Recurrent Neural Network
A foundational neural network family that models sequential data by carrying information from past time steps.
Reparameterization Trick
A core VAE technique that writes a latent sample as a deterministic function of the distribution parameters and independent noise, so gradients can flow through the sampling step.
Residual Block
A building block that eases the training of deep CNNs by carrying information directly through identity connections.
Rotary Positional Embedding
A modern positional representation method that encodes order information through rotations in vector space.
SELU Activation
An activation function designed to support self-normalizing network behavior.
Scaled Dot-Product Attention
The fundamental Transformer operation that computes attention weights from query-key dot products, scaled by the square root of the key dimension and normalized with softmax.
Scheduled Sampling
A method that gradually reduces teacher forcing to bring training conditions closer to inference conditions.
Self-Attention
An attention mechanism in which each element in a sequence directly models its relationship with all others.
Sequence-to-Sequence Learning
A general modeling approach focused on converting one input sequence into another output sequence.
Sharpness-Aware Minimization
An optimization approach that seeks not only low loss but also flatter and more generalizable solution regions.
Sigmoid Activation
A classical activation function that squashes input values into the range between 0 and 1.
Skip Connection
An architectural connection that allows information to bypass certain layers and improves training stability.
Softmax Activation
An output activation that expresses multiclass outputs as a normalized probability distribution.
Sparse Attention
An attention approach that reduces cost by allowing each element to attend only to selected regions rather than the full sequence.
Sparse Autoencoder
A type of autoencoder that encourages only a small number of latent neurons to activate, leading to more selective features.
Squeeze-and-Excitation
A CNN module that reweights feature channels using global context.
Stochastic Depth
A method that provides stronger regularization in very deep networks by randomly skipping some layers during training.
Stochastic Weight Averaging
A method that averages parameter states from different stages of training in order to obtain more robust generalization.
Stride
A CNN hyperparameter that determines how far a filter shifts between successive positions on the input, which affects output resolution.
Swish Activation
A modern activation function that multiplies the input by its own sigmoid to create a smooth nonlinear transformation.
Tanh Activation
A zero-centered activation function that maps inputs into the range from -1 to 1.
Teacher Forcing
A training strategy in sequence generation where the model is fed the true previous output instead of its own prediction.
Transformer Feed-Forward Network
A Transformer sub-block that applies the same small fully connected network to each token independently, transforming its representation after attention.
Transposed Convolution
A learnable upsampling layer that maps feature maps to higher spatial resolution.
Truncated BPTT
A method that makes training more tractable on long sequences by applying backpropagation over a limited window.
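Several of the activation entries in this list (ReLU, Leaky ReLU, Sigmoid, Swish, GELU, Softmax) reduce to one-line formulas. The following is a minimal sketch in plain Python, not taken from any particular library. The function names and the default `slope=0.01` for Leaky ReLU are illustrative assumptions, and GELU is written in its exact Gaussian-CDF form rather than the tanh approximation some frameworks use.

```python
import math
from typing import List


def relu(x: float) -> float:
    # Zero for negative inputs, identity for positive inputs.
    return max(0.0, x)


def leaky_relu(x: float, slope: float = 0.01) -> float:
    # Small nonzero slope on the negative side (slope value is an assumption).
    return x if x > 0 else slope * x


def sigmoid(x: float) -> float:
    # Squashes x into (0, 1); branches on sign to avoid overflow in exp().
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def swish(x: float) -> float:
    # Input multiplied by its own sigmoid.
    return x * sigmoid(x)


def gelu(x: float) -> float:
    # Input weighted by the standard Gaussian CDF, computed via erf.
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def softmax(xs: List[float]) -> List[float]:
    # Normalized probability distribution; subtracting the max keeps exp() stable.
    m = max(xs)
    exps = [math.exp(v - m) for v in xs]
    total = sum(exps)
    return [e / total for e in exps]
```

Tanh is available directly as `math.tanh`. Calling `softmax([1.0, 2.0, 3.0])` returns three positive values that sum to 1, with the largest weight on the last element.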