Speech-to-Text (ASR)
Automatic Speech Recognition (ASR) converts spoken audio into written text. Modern systems use neural networks to transcribe human speech.
Modern Approaches
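End-to-end neural models (such as OpenAI's Whisper) map audio directly to text, typically using transformer architectures. Many handle multiple languages, accents, and varying audio quality. The two main decoding paradigms are CTC (Connectionist Temporal Classification), which emits a label or a blank for every audio frame and then collapses repeats, and attention-based decoding, which generates output tokens one at a time conditioned on the encoded audio.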
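To make the CTC paradigm concrete, here is a minimal sketch of greedy CTC decoding in plain Python. It assumes the acoustic model has already produced one score vector per audio frame. The vocabulary, the "_" blank symbol, the function name ctc_greedy_decode, and the toy scores are all illustrative assumptions, not taken from any particular model or library.

```python
from itertools import groupby

BLANK = "_"  # assumed blank symbol; real models reserve a dedicated token id


def ctc_greedy_decode(frame_scores, vocab, blank=BLANK):
    """Greedy CTC decoding: best label per frame, merge repeats, drop blanks."""
    # 1. Pick the highest-scoring symbol for each frame.
    best = [vocab[max(range(len(scores)), key=scores.__getitem__)]
            for scores in frame_scores]
    # 2. Merge runs of consecutive identical symbols into one.
    collapsed = [symbol for symbol, _ in groupby(best)]
    # 3. Remove blanks, which only serve to separate repeated labels.
    return "".join(symbol for symbol in collapsed if symbol != blank)


# Toy example: six frames over a hypothetical four-symbol vocabulary.
VOCAB = ["_", "c", "a", "t"]
frames = [
    [0.1, 0.7, 0.1, 0.1],    # c
    [0.2, 0.6, 0.1, 0.1],    # c (repeat, merged)
    [0.8, 0.1, 0.05, 0.05],  # blank
    [0.1, 0.1, 0.7, 0.1],    # a
    [0.1, 0.1, 0.6, 0.2],    # a (repeat, merged)
    [0.1, 0.1, 0.1, 0.7],    # t
]
print(ctc_greedy_decode(frames, VOCAB))  # prints "cat"
```

Greedy decoding is the simplest option. Production systems often use beam search, sometimes combined with a language model, in place of the per-frame argmax.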
Key Models
Whisper (OpenAI): Open-source, multilingual, robust to noise.
Deepgram: Low-latency commercial API.
Google Speech-to-Text: Cloud API with streaming support.
AssemblyAI: Feature-rich transcription API.
Applications
Meeting transcription, voice assistants, accessibility (captions/subtitles), call center analytics, podcast indexing, and voice-controlled interfaces.