Deep Learning

ELU

An activation function that passes positive inputs through unchanged and follows a smooth, saturating exponential curve for negative inputs, which pushes mean activations closer to zero.
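
A minimal sketch of the usual ELU formula, assuming the common default `alpha = 1.0` (the function name and sample inputs are illustrative, not taken from any particular library):

```python
import math

def elu(x: float, alpha: float = 1.0) -> float:
    """ELU: identity for x > 0, alpha * (exp(x) - 1) otherwise."""
    return x if x > 0 else alpha * (math.exp(x) - 1.0)

# Negative inputs saturate towards -alpha instead of being clipped to zero.
values = [elu(v) for v in (-5.0, -1.0, 0.0, 2.0)]
```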

⏹️

Early Stopping

A regularization strategy that limits overfitting by halting training once validation performance stops improving or begins to deteriorate.
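
One common way to implement this is a patience rule. The sketch below assumes a recorded list of validation losses and hypothetical `patience` and `min_delta` parameters; real training loops usually also restore the best checkpoint.

```python
def early_stopping_epoch(val_losses, patience=2, min_delta=0.0):
    """Return the epoch index at which training would stop,
    or None if the patience budget is never exhausted."""
    best = float("inf")
    bad_epochs = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best - min_delta:
            # Validation loss improved: remember it and reset the counter.
            best = loss
            bad_epochs = 0
        else:
            bad_epochs += 1
            if bad_epochs >= patience:
                return epoch
    return None

# Illustrative losses: improvement stops after the third epoch.
losses = [0.90, 0.70, 0.60, 0.62, 0.65, 0.70]
stop_at = early_stopping_epoch(losses, patience=2)
```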

🔄

Encoder-Decoder RNN

A classical sequential architecture that compresses an input sequence into context and generates an output sequence from it.

🔀

Encoder-Decoder Transformer

The classical Transformer architecture that encodes an input sequence and contextually generates an output sequence.

📚

Encoder-Only Transformer

A Transformer architecture focused on contextual representation learning and used mainly for understanding tasks.

💥

Exploding Gradients

An optimization problem in which gradients grow excessively during backpropagation and destabilize training.

⚠️

Exposure Bias

A train-inference mismatch in which a model trained on ground-truth history must condition on its own, possibly erroneous, predictions during inference.

4 terms

🏗️

Feature Hierarchy

The structure in which representations become increasingly abstract from lower to higher layers.

🗺️

Feature Map

The spatial activation representation produced by a convolution layer through specific filters.

🗻

Feature Pyramid Network

An architectural design that combines visual information across scales to improve multi-scale object understanding.

➡️

Feedforward Neural Network

The classical family of neural networks in which information flows one-way from input to output.

12 terms

GELU Activation

A modern activation function that weights each input by the standard normal CDF, x·Φ(x), giving a smooth, probabilistic gate rather than a hard threshold.
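
A short sketch of the exact (erf-based) form of GELU, x·Φ(x), where Φ is the standard normal CDF. Many frameworks also offer a tanh approximation, which is not shown here.

```python
import math

def gelu(x: float) -> float:
    """Exact GELU: x times the standard normal CDF evaluated at x."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

# Small negative inputs are damped rather than zeroed outright.
samples = [gelu(v) for v in (-1.0, 0.0, 1.0)]
```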

⚙️

GRU

A recurrent unit that learns sequence dependencies through a simpler gating structure than LSTM.

🚪

Gated Linear Unit

An activation-like structure that filters linear signals through a gating mechanism to enable more selective information flow.
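
A minimal sketch of the gating idea, assuming the two linear projections (here called `values` and `gates`, both hypothetical names) have already been computed; the gate passes through a sigmoid and scales the value elementwise.

```python
import math

def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))

def glu(values, gates):
    """Elementwise gating: each value is scaled by sigmoid(gate)."""
    return [v * sigmoid(g) for v, g in zip(values, gates)]

# A very negative gate blocks the signal, a very positive one passes it.
gated = glu([2.0, 2.0, 2.0], [-20.0, 0.0, 20.0])
```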

✅

Gradient Checking

A debugging technique that validates analytical gradients by comparing them with numerical approximations.
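
The comparison can be sketched with central differences. The loss `f` and its hand-derived gradient below are hypothetical stand-ins for a real model; the step size `eps` is an assumed, typical choice.

```python
def f(w):
    # Hypothetical loss: f(w) = w0^2 + 3 * w0 * w1
    return w[0] ** 2 + 3.0 * w[0] * w[1]

def analytic_grad(w):
    # Gradient of f derived by hand.
    return [2.0 * w[0] + 3.0 * w[1], 3.0 * w[0]]

def numerical_grad(func, w, eps=1e-5):
    """Central-difference estimate of the gradient of func at w."""
    grad = []
    for i in range(len(w)):
        plus = list(w)
        minus = list(w)
        plus[i] += eps
        minus[i] -= eps
        grad.append((func(plus) - func(minus)) / (2.0 * eps))
    return grad

def max_relative_error(a, b):
    return max(abs(x - y) / max(1e-12, abs(x) + abs(y)) for x, y in zip(a, b))

w = [1.5, -2.0]
err = max_relative_error(analytic_grad(w), numerical_grad(f, w))
```

A small `err` suggests the analytical gradient is implemented correctly; a large one usually points to a bug in the derivative code.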

Gradient Flow

A core training-dynamics concept describing how effectively the learning signal moves across network layers.

Gradient Noise Scale

A training-dynamics measure that characterizes how noisy gradient estimates are in stochastic optimization.

🎯

Graph Attention Network

A GNN architecture that combines neighboring nodes with learned attention weights rather than treating them equally.

🕸️

Graph Classification

A graph learning task focused on assigning a single label to the entire graph.

🕸️

Graph Convolutional Network

A foundational GNN architecture that learns representations over graphs by using neighborhood information.

🧬

Graph Isomorphism Network

A GNN architecture designed to strengthen the theoretical power of distinguishing graph structures.

📦

Graph Pooling

A GNN operation that aims to compress node information into more compact and task-relevant representations.

🌐

GraphSAGE

A GNN method that makes representation learning scalable on large graphs through neighborhood sampling.

5 terms

📱

Hard-Swish Activation

An efficient activation function designed to approximate Swish-like behavior at lower computational cost.

📐

Hessian-Vector Product

A computational technique that accesses second-order information without explicitly forming the full Hessian matrix.
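
One simple way to do this, sketched below, is a finite difference of gradients: Hv ≈ (∇f(w + εv) − ∇f(w − εv)) / 2ε. The gradient function is a hypothetical example; automatic-differentiation frameworks typically compute the same product exactly instead.

```python
def grad(w):
    # Gradient of the hypothetical function f(w) = w0^2 + 3 * w0 * w1.
    return [2.0 * w[0] + 3.0 * w[1], 3.0 * w[0]]

def hessian_vector_product(grad_fn, w, v, eps=1e-5):
    """Approximate H(w) @ v using two gradient evaluations, no Hessian."""
    plus = [wi + eps * vi for wi, vi in zip(w, v)]
    minus = [wi - eps * vi for wi, vi in zip(w, v)]
    g_plus, g_minus = grad_fn(plus), grad_fn(minus)
    return [(gp - gm) / (2.0 * eps) for gp, gm in zip(g_plus, g_minus)]

# The true Hessian here is [[2, 3], [3, 0]], so H @ [1, 2] = [8, 3].
hv = hessian_vector_product(grad, [1.5, -2.0], [1.0, 2.0])
```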

🧩

Heterogeneous Graph Neural Network

An advanced GNN architecture capable of modeling different node and relation types within the same graph.

📏

Hidden Layer Width

The number of neurons in a hidden layer, an architectural choice that directly affects model capacity.

Hidden State

An internal representation vector in sequence models that carries past information and is updated over time.

2 terms

Implicit Differentiation

An approach for computing derivatives through solutions or equilibrium conditions that are not written explicitly.

🧭

Inductive Bias

The structural tendency that determines which kinds of patterns a model is naturally more likely to learn.

1 term

📐

Jacobian Matrix

A matrix representing the derivative structure of vector-valued functions and playing an important role in multidimensional backpropagation.

1 term

💾

Key-Value Cache

A mechanism that speeds up autoregressive Transformer inference by storing the key and value projections of earlier tokens so they are not recomputed at every decoding step.
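
A toy single-head sketch of the idea. The `KVCache` class and the per-token query, key, and value vectors are hypothetical; in a real model they come from learned projections, and the cache is kept per layer and per head.

```python
import math

def attend(query, keys, values):
    """Scaled dot-product attention for one query over stored keys/values."""
    scale = 1.0 / math.sqrt(len(query))
    scores = [scale * sum(q * k for q, k in zip(query, key)) for key in keys]
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]

class KVCache:
    """Grows by one key/value pair per generated token."""
    def __init__(self):
        self.keys = []
        self.values = []

    def append(self, key, value):
        self.keys.append(key)
        self.values.append(value)

# Hypothetical (query, key, value) vectors for three decoding steps.
steps = [
    ([1.0, 0.0], [1.0, 0.0], [1.0, 2.0]),
    ([0.0, 1.0], [0.0, 1.0], [3.0, 4.0]),
    ([1.0, 1.0], [1.0, 1.0], [5.0, 6.0]),
]

cache = KVCache()
outputs = []
for q, k, v in steps:
    cache.append(k, v)  # only the new token's key and value are added
    outputs.append(attend(q, cache.keys, cache.values))
```

Each step reuses the stored keys and values, so the per-step work grows with the sequence length rather than recomputing every earlier projection.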

7 terms

LSTM

An advanced recurrent architecture that uses gating mechanisms to learn long-term dependencies.

🌐

Latent Manifold

The idea that the meaningful, low-dimensional structure of data forms a smooth manifold in latent space.

🌈

Latent Space Interpolation

A technique for exploring the continuity of learned structure by moving between points in latent representation space.

📏

Layer Normalization

A technique that normalizes activations across the feature dimension of each individual sample, independent of batch size, and provides more stable training, especially in sequence models.
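
A minimal sketch for a single sample's feature vector, assuming the usual learnable scale `gamma` and shift `beta` (defaulting to 1 and 0 here) and a small `eps` for numerical stability:

```python
import math

def layer_norm(x, gamma=None, beta=None, eps=1e-5):
    """Normalize one sample across its features, then scale and shift."""
    n = len(x)
    if gamma is None:
        gamma = [1.0] * n
    if beta is None:
        beta = [0.0] * n
    mean = sum(x) / n
    var = sum((v - mean) ** 2 for v in x) / n
    inv_std = 1.0 / math.sqrt(var + eps)
    return [g * (v - mean) * inv_std + b for v, g, b in zip(x, gamma, beta)]

normed = layer_norm([1.0, 2.0, 3.0, 4.0])
```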

⚡

Leaky ReLU

An activation function that leaves a small nonzero slope in the negative region to alleviate the dying ReLU problem.

📈

Linear Attention

An approach that aims to make attention more scalable by reducing its cost from quadratic to approximately linear in sequence length.

🔗

Link Prediction

A task aimed at predicting edges that are not currently present in a graph but are likely to exist.

7 terms

🕳️

Masked Language Modeling

A pretraining objective based on masking some input tokens and predicting them from context.

📨

Message Passing Neural Network

A general GNN framework that updates information over graphs through message exchange among nodes.

Mish Activation

A smooth, non-monotonic activation function, x·tanh(softplus(x)), noted for its well-behaved gradients.

Mixture-of-Experts Transformer

A Transformer approach that improves scaling efficiency by activating selected expert subnetworks rather than the full model on every input.

🧪

Mixup

A data-driven regularization technique that mixes training examples and labels so the model learns smoother decision boundaries.
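
A sketch of mixing one pair of examples, assuming one-hot labels and a mixing coefficient drawn from a Beta(alpha, alpha) distribution; the `alpha = 0.2` default and the toy vectors are illustrative choices.

```python
import random

def mixup(x_a, y_a, x_b, y_b, alpha=0.2, rng=None):
    """Blend two examples and their labels with the same coefficient."""
    if rng is None:
        rng = random.Random()
    lam = rng.betavariate(alpha, alpha)
    x = [lam * a + (1.0 - lam) * b for a, b in zip(x_a, x_b)]
    y = [lam * a + (1.0 - lam) * b for a, b in zip(y_a, y_b)]
    return x, y, lam

# Two toy examples from different classes (one-hot labels).
x_mix, y_mix, lam = mixup(
    [0.0, 1.0], [1.0, 0.0],
    [4.0, 5.0], [0.0, 1.0],
    rng=random.Random(0),
)
```

The mixed label remains a valid probability distribution, which is what lets the model be trained on it with a standard cross-entropy loss.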

🧩

Multi-Head Attention

A structure that runs attention in parallel across multiple subspaces to learn different types of relationships.

Multilayer Perceptron

A fully connected neural network structure containing multiple hidden layers.

3 terms

📨

Neighborhood Aggregation

The core GNN operation by which a node updates its representation by collecting information from its neighbors.
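
A minimal sketch using a mean aggregator over an adjacency list. Including the node's own features in the average is an assumption made here; actual GNN layers differ in the aggregator and usually apply a learned transformation afterwards.

```python
def mean_aggregate(features, neighbors):
    """Replace each node's features with the mean over itself and its neighbors."""
    updated = {}
    for node, feat in features.items():
        group = [feat] + [features[n] for n in neighbors[node]]
        updated[node] = [sum(col) / len(group) for col in zip(*group)]
    return updated

# Toy path graph a - b - c with one feature per node.
features = {"a": [1.0], "b": [2.0], "c": [6.0]}
neighbors = {"a": ["b"], "b": ["a", "c"], "c": ["b"]}
updated = mean_aggregate(features, neighbors)
```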

📐

Neural Tangent Kernel

A theoretical framework that connects the training dynamics of very wide neural networks with kernel methods.

🔹

Node Classification

A core GNN task focused on predicting a label for each node in a graph.

2 terms

📦

Overparameterization

The condition in which a model has far more trainable parameters than the amount of available training data.

🌫️

Oversmoothing in GNN

A problem in which node representations become too similar after excessive message passing, reducing discriminative power.

7 terms

🔁

Parameter Sharing

An efficient learning principle in which the same weights are reused across multiple positions or structures.

🔹

Perceptron

The most basic artificial neuron model that learns a linear decision boundary through weighted inputs.

📦

Pooling Layer

A layer that summarizes feature maps, reduces dimensionality, and provides robustness to local variations.

📍

Positional Encoding

A method that injects order information into Transformer models so sequence positions become visible.
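
One well-known variant is the fixed sinusoidal encoding, sketched below under the assumption of the usual base of 10000; learned position embeddings are a common alternative and are not shown.

```python
import math

def sinusoidal_encoding(position, d_model):
    """Sine on even dimensions, cosine on odd ones, at geometrically spaced frequencies."""
    enc = []
    for i in range(d_model):
        angle = position / (10000 ** (2 * (i // 2) / d_model))
        enc.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
    return enc

# Each position gets a distinct vector that is added to the token embedding.
enc0 = sinusoidal_encoding(0, 4)
enc1 = sinusoidal_encoding(1, 4)
```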

⬇️

Post-Norm Transformer

The original Transformer variant that applies layer normalization after the residual addition of each attention or FFN sublayer.

Posterior Collapse

A VAE training issue in which the decoder ignores the latent variable, weakening representation learning.

⬆️

Pre-Norm Transformer

A Transformer variant that applies layer normalization to the input of each attention or FFN sublayer, before the residual addition.
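
The difference from the Post-Norm ordering comes down to where the normalization sits relative to the residual addition. In this sketch `sublayer` is a hypothetical stand-in for attention or the FFN, and the normalization omits learnable parameters for brevity.

```python
import math

def layer_norm(x, eps=1e-5):
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def sublayer(x):
    # Hypothetical stand-in for an attention or feed-forward sublayer.
    return [2.0 * v for v in x]

def post_norm_block(x):
    # Post-Norm: normalize after adding the residual.
    return layer_norm([a + b for a, b in zip(x, sublayer(x))])

def pre_norm_block(x):
    # Pre-Norm: normalize only the sublayer input; the residual path is untouched.
    return [a + b for a, b in zip(x, sublayer(layer_norm(x)))]

x = [1.0, 2.0, 3.0]
post = post_norm_block(x)
pre = pre_norm_block(x)
```

In the Pre-Norm form the input `x` reaches the output unnormalized through the residual path, which is often cited as the reason deep Pre-Norm stacks are easier to train.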

1 term

🔑

Query-Key-Value Representation

A representation scheme in attention mechanisms that projects each input into a query, a key, and a value: queries are matched against keys to decide how much of each value to retrieve.

6 terms

⚡

ReLU Activation

One of the most widely used modern activation functions, which zeros negative inputs and passes positive inputs through unchanged.

👁️

Receptive Field

A concept describing which region of the input contributes information to a neuron or feature activation.

🔄

Recurrent Neural Network

A foundational neural network family that models sequential data by carrying information from past time steps.

🎲

Reparameterization Trick

A core VAE technique that makes latent-variable models with stochastic sampling differentiable.
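
A minimal sketch, assuming a diagonal Gaussian latent parameterized by `mu` and `log_var` (names chosen for illustration): the randomness is moved into an auxiliary noise variable so the sample is a deterministic, differentiable function of the parameters.

```python
import math
import random

def reparameterize(mu, log_var, rng):
    """Sample z = mu + sigma * eps with eps ~ N(0, 1)."""
    eps = [rng.gauss(0.0, 1.0) for _ in mu]
    return [m + math.exp(0.5 * lv) * e for m, lv, e in zip(mu, log_var, eps)]

rng = random.Random(0)
z = reparameterize([2.0, -1.0], [0.0, 0.0], rng)
```

Because `eps` does not depend on `mu` or `log_var`, gradients with respect to both can flow through `z` in a framework with automatic differentiation.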

➕

Residual Block

A building block that eases the training of deep CNNs by carrying information directly through identity connections.

🌀

Rotary Positional Embedding

A modern positional representation method that encodes order information through rotations in vector space.

16 terms

📊

SELU Activation

An activation function designed to support self-normalizing network behavior.

🎯

Scaled Dot-Product Attention

The fundamental Transformer operation that computes attention weights from query-key dot products, scaled by 1/√d_k and normalized with softmax, and uses them to form a weighted average of the value vectors.
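
A plain-Python sketch of softmax(QKᵀ/√d_k)·V for small matrices stored as nested lists, without masking or batching; the inputs are illustrative.

```python
import math

def softmax(row):
    top = max(row)
    exps = [math.exp(s - top) for s in row]
    total = sum(exps)
    return [e / total for e in exps]

def scaled_dot_product_attention(Q, K, V):
    """Return one output vector per query row in Q."""
    d_k = len(K[0])
    out = []
    for q in Q:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in K]
        weights = softmax(scores)
        out.append([sum(w * v[j] for w, v in zip(weights, V)) for j in range(len(V[0]))])
    return out

K = [[1.0, 0.0], [0.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0]]
# A zero query scores every key equally, so the output is the mean of V.
uniform = scaled_dot_product_attention([[0.0, 0.0]], K, V)
# A query aligned with the first key leans towards the first value.
focused = scaled_dot_product_attention([[10.0, 0.0]], K, V)
```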

🗓️

Scheduled Sampling

A method that gradually reduces teacher forcing to bring training conditions closer to inference conditions.

🪞

Self-Attention

An attention mechanism in which each element in a sequence directly models its relationship with all others.

🔀

Sequence-to-Sequence Learning

A general modeling approach focused on converting one input sequence into another output sequence.

🪶

Sharpness-Aware Minimization

An optimization approach that seeks not only low loss but also flatter and more generalizable solution regions.

Sigmoid Activation

A classical activation function that squashes input values into the range between 0 and 1.

⤴️

Skip Connection

An architectural connection that allows information to bypass certain layers and improves training stability.

🎛️

Softmax Activation

An output activation that expresses multiclass outputs as a normalized probability distribution.
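
A short sketch of a numerically stable softmax; subtracting the maximum logit before exponentiating is a standard trick that leaves the result unchanged but avoids overflow.

```python
import math

def softmax(logits):
    """Map raw scores to probabilities that sum to 1."""
    top = max(logits)
    exps = [math.exp(z - top) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([2.0, 1.0, 0.1])
```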

🧾

Sparse Attention

An attention approach that reduces cost by allowing each element to attend only to selected regions rather than the full sequence.

🧬

Sparse Autoencoder

A type of autoencoder that encourages only a small number of latent neurons to activate, leading to more selective features.

📡

Squeeze-and-Excitation

A CNN module that reweights feature channels using global context.

🎲

Stochastic Depth

A method that provides stronger regularization in very deep networks by randomly skipping some layers during training.

🧮

Stochastic Weight Averaging

A method that averages parameter states from different stages of training in order to obtain more robust generalization.
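
The averaging step can be sketched as a running mean over checkpoints. The flat weight lists below are hypothetical; in practice the average is taken per parameter tensor, typically over the later part of training.

```python
def update_swa(swa_weights, new_weights, n_averaged):
    """Fold one more checkpoint into a running average of n_averaged checkpoints."""
    return [
        (s * n_averaged + w) / (n_averaged + 1)
        for s, w in zip(swa_weights, new_weights)
    ]

# Hypothetical checkpoints collected at different points in training.
checkpoints = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
swa = list(checkpoints[0])
for n, weights in enumerate(checkpoints[1:], start=1):
    swa = update_swa(swa, weights, n)
```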

👣

Stride

A CNN hyperparameter that sets how many positions a filter shifts between successive applications across the input, which determines output resolution.
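
The effect on output resolution follows the standard formula floor((n + 2p − k) / s) + 1 along each spatial dimension, sketched here without dilation; the function name is illustrative.

```python
def conv_output_size(input_size, kernel_size, stride=1, padding=0):
    """Output length along one spatial dimension of a convolution."""
    if stride < 1:
        raise ValueError("stride must be a positive integer")
    return (input_size + 2 * padding - kernel_size) // stride + 1

same = conv_output_size(32, kernel_size=3, stride=1, padding=1)    # resolution kept
halved = conv_output_size(32, kernel_size=3, stride=2, padding=1)  # resolution halved
```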