Category

Speech, Voice and Audio AI

73 terms in the Speech, Voice and Audio AI domain, each presented bilingually (TR/EN) with a related-term graph.

Speech to Text, Text to Speech, Speaker Recognition, Diarization, Audio Classification, Speech Emotion Analysis, Keyword Spotting, Noise Reduction, Audio Signal Processing

All Terms (73)

S
17 terms
🔄

Sample Rate Conversion

A process that adapts an audio signal to different sampling rates for model and system compatibility.
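
As a rough illustration of the idea, the sketch below resamples a list of samples by linear interpolation. The function name `resample_linear` is hypothetical, and the method is deliberately naive: production resamplers normally apply a low-pass (anti-aliasing) filter, which is omitted here.

```python
def resample_linear(samples, src_rate, dst_rate):
    """Naive sample rate conversion by linear interpolation (no anti-aliasing)."""
    if not samples:
        return []
    # Number of output samples covering the same duration at the new rate.
    n_out = max(1, round(len(samples) * dst_rate / src_rate))
    step = src_rate / dst_rate  # distance between output samples, in input samples
    out = []
    for i in range(n_out):
        pos = i * step
        left = int(pos)
        if left >= len(samples) - 1:
            # Past the last input sample: hold the final value.
            out.append(float(samples[-1]))
            continue
        frac = pos - left
        out.append(samples[left] * (1 - frac) + samples[left + 1] * frac)
    return out


upsampled = resample_linear([0.0, 1.0, 2.0, 3.0], src_rate=4, dst_rate=8)
downsampled = resample_linear([0.0, 1.0, 2.0, 3.0], src_rate=4, dst_rate=2)
```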

⚖️

Score Normalization

A process that makes similarity scores in speaker verification systems more stable and comparable.
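
One common form of this is zero normalization (Z-norm), where a raw score is shifted and scaled by the statistics of scores obtained against a cohort of impostor utterances. The sketch below assumes that variant; the function name `z_norm` and the cohort values are illustrative only.

```python
import statistics


def z_norm(raw_score, cohort_scores):
    """Z-norm: express a raw score in standard deviations above the cohort mean."""
    mu = statistics.fmean(cohort_scores)
    sigma = statistics.pstdev(cohort_scores)
    if sigma == 0:
        # Degenerate cohort (all scores equal): only remove the offset.
        return raw_score - mu
    return (raw_score - mu) / sigma


# Hypothetical impostor scores for one enrolled speaker model.
normalized = z_norm(4.0, [1.0, 3.0])
```

After normalization, a single decision threshold can be applied across speakers whose raw score ranges differ.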

📐

Short-Time Fourier Transform

A core transform that enables windowed analysis of audio frequency content over time.
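
A minimal sketch of the transform, written with the standard library only so the steps stay visible: slice the signal into overlapping frames, apply a Hann window, and take a DFT of each frame. The function name `stft` and the default frame and hop sizes are assumptions for illustration; real systems would use an FFT-based implementation.

```python
import cmath
import math


def stft(signal, frame_len=64, hop=32):
    """Return a list of frames, each a list of complex bins 0..frame_len//2."""
    # Periodic Hann window to reduce spectral leakage at frame edges.
    window = [0.5 - 0.5 * math.cos(2 * math.pi * n / frame_len)
              for n in range(frame_len)]
    frames = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        seg = [signal[start + n] * window[n] for n in range(frame_len)]
        # Direct DFT of the windowed frame (O(N^2); fine for a small demo).
        spectrum = [
            sum(seg[n] * cmath.exp(-2j * math.pi * k * n / frame_len)
                for n in range(frame_len))
            for k in range(frame_len // 2 + 1)
        ]
        frames.append(spectrum)
    return frames


# A pure tone that completes 8 cycles per 64-sample frame.
tone = [math.sin(2 * math.pi * 8 * n / 64) for n in range(256)]
frames = stft(tone)
peak_bins = [max(range(len(f)), key=lambda k: abs(f[k])) for f in frames]
```

Each frame's strongest bin lands on the tone's frequency, which is the time-frequency picture the definition describes.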

📟

Small-Footprint Keyword Spotting

An approach focused on designing lightweight keyword spotting models for devices with limited memory and compute.

📡

Sound Event Localization and Detection

An advanced environmental audio task that detects which sound events occur and when, and also estimates their spatial location or direction of arrival.

🧩

Source Separation

A task that aims to separate a mixed audio signal into components such as speech, music, or individual speakers.

🧩

Speaker Clustering

A diarization subtask that groups similar speech segments so they correspond to the same speaker.

👥

Speaker Diarization

The task of determining who spoke when over the timeline of an audio recording.

🧠

Speaker Embeddings

Dense vector representations that capture speaker identity in a discriminative form.

🪪

Speaker Identification

A task that determines which enrolled speaker in a known set produced a given voice sample.

Speaker Verification

A binary decision problem that verifies whether a voice sample belongs to the claimed speaker.
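
A hedged sketch of the decision step, assuming speaker embeddings are already available as plain vectors and that cosine similarity is the scoring function. The names `cosine_similarity` and `verify` and the threshold of 0.7 are hypothetical; real thresholds are tuned on held-out data.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # treat an all-zero embedding as matching nothing
    return dot / (norm_a * norm_b)


def verify(enrolled_embedding, test_embedding, threshold=0.7):
    """Accept the identity claim if the similarity reaches the threshold."""
    return cosine_similarity(enrolled_embedding, test_embedding) >= threshold


same = verify([1.0, 0.0], [1.0, 0.0])
different = verify([1.0, 0.0], [0.0, 1.0])
```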

🧠

Speaker-Independent Emotion Recognition

An approach that aims for emotion models to learn general affective cues without overfitting to speaker-specific voice traits.

🎭

Speech Emotion Recognition

A task that attempts to infer emotional state by extracting affective acoustic cues from speech.

🧼

Speech Enhancement

A processing task that aims to make speech more intelligible from noisy or degraded audio.

⏹️

Streaming Endpoint Detection

A mechanism that determines when speech has truly ended in order to provide correct response timing in streaming ASR systems.
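
To make the idea concrete, here is a simple energy-based heuristic: declare the endpoint once speech has been observed and is followed by a minimum run of silent frames. This is only one possible approach, not necessarily what any given streaming ASR system uses; the function name `detect_endpoint` and both default parameters are assumptions.

```python
def detect_endpoint(frame_energies, energy_threshold=0.01, min_trailing_silence=3):
    """Return the index of the frame at which the endpoint is declared, or None."""
    speech_seen = False
    silent_run = 0
    for i, energy in enumerate(frame_energies):
        if energy >= energy_threshold:
            speech_seen = True
            silent_run = 0  # a short pause ended; keep listening
        elif speech_seen:
            silent_run += 1
            if silent_run >= min_trailing_silence:
                return i
    return None  # speech never started, or has not ended yet


# One-frame pause at index 3 is ignored; the longer silence ends the utterance.
endpoint = detect_endpoint([0.0, 0.5, 0.6, 0.0, 0.4, 0.0, 0.0, 0.0, 0.3])
```

A larger `min_trailing_silence` avoids cutting speakers off mid-pause at the cost of slower responses, which is the timing trade-off the definition refers to.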

Streaming TTS

A real-time speech synthesis approach that begins generating audio with low latency without waiting for the full text.

⚠️

Stress Detection from Speech

A task that attempts to extract stress or cognitive-load signals from acoustic variations in speech.