AI Glossary

Speech-to-Text (ASR)

Automatic Speech Recognition (ASR) is technology that converts spoken audio into written text. Modern systems use neural networks to transcribe human speech.

Modern Approaches

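End-to-end neural models (like OpenAI's Whisper) map audio directly to text, typically with transformer architectures, rather than chaining separate acoustic, pronunciation, and language models. They handle multiple languages, accents, and varying audio quality. The two main decoding paradigms are CTC (Connectionist Temporal Classification), which predicts one label per audio frame and then collapses repeats and removes blank symbols, and attention-based decoding, in which a decoder generates the transcript token by token while attending to the encoded audio.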
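
As a rough illustration of the CTC paradigm, the sketch below performs greedy CTC decoding on a toy example: pick the best label for each frame, merge consecutive repeats, then drop the blank symbol. The function name `ctc_greedy_decode`, the `_` blank symbol, the four-symbol vocabulary, and the per-frame scores are all made up for illustration; real systems use far larger vocabularies and often beam search rather than greedy decoding.

```python
from itertools import groupby

BLANK = "_"  # hypothetical blank symbol, chosen for this example


def ctc_greedy_decode(frame_scores, vocab, blank=BLANK):
    """Greedy CTC decoding: argmax per frame, merge repeats, drop blanks."""
    # 1. Pick the highest-scoring label for every audio frame.
    best = [vocab[max(range(len(row)), key=row.__getitem__)] for row in frame_scores]
    # 2. Merge runs of identical consecutive labels into one.
    collapsed = [label for label, _ in groupby(best)]
    # 3. Remove blanks, which only separate genuine repeated characters.
    return "".join(label for label in collapsed if label != blank)


# Toy vocabulary and per-frame scores (illustrative values, not model output).
vocab = ["_", "c", "a", "t"]
frame_scores = [
    [0.10, 0.70, 0.10, 0.10],  # c
    [0.10, 0.60, 0.20, 0.10],  # c (repeat, merged)
    [0.80, 0.10, 0.05, 0.05],  # blank
    [0.10, 0.10, 0.70, 0.10],  # a
    [0.10, 0.10, 0.60, 0.20],  # a (repeat, merged)
    [0.10, 0.10, 0.10, 0.70],  # t
]

print(ctc_greedy_decode(frame_scores, vocab))  # prints "cat"
```

The blank symbol is what lets CTC produce doubled letters: the frame sequence a, blank, a decodes to "aa", while a, a decodes to a single "a".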

Key Models

Whisper (OpenAI): Open-source, multilingual, robust to noise.

Deepgram: Low-latency commercial API.

Google Speech-to-Text: Cloud API with streaming support.

AssemblyAI: Feature-rich transcription API.

Applications

Meeting transcription, voice assistants, accessibility (captions/subtitles), call center analytics, podcast indexing, and voice-controlled interfaces.

Last updated: March 5, 2026