AI Glossary

Speech Recognition

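AI technology that converts spoken language into text, also called automatic speech recognition (ASR) or speech-to-text. It powers voice interfaces, transcription services, and real-time captioning, and its accuracy has improved steadily with each generation of models.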

Evolution

Hidden Markov Models (1990s-2010s) → Deep neural networks (2012+) → End-to-end models (2016+) → Whisper (2022, OpenAI's multilingual model). Modern systems approach human-level accuracy, usually measured as word error rate (WER), on clean audio; accuracy drops in noisier or more varied conditions.
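Accuracy claims for speech recognition are usually stated as word error rate (WER): the number of word substitutions, deletions, and insertions needed to turn the system's output into the reference transcript, divided by the number of reference words. The sketch below is one minimal way to compute it, assuming simple lowercase whitespace tokenization; the function name `word_error_rate` is illustrative and not taken from any particular library, and production evaluations typically apply more careful text normalization first.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Return WER = (substitutions + deletions + insertions) / reference words.

    Assumes lowercase, whitespace-split tokens (an illustrative simplification).
    """
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the word-level edit distance between the reference
    # prefix processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion: reference word missing from output
                curr[j - 1] + 1,     # insertion: extra word in output
                prev[j - 1] + cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # One deleted word ("the") and one substitution ("lights" -> "light")
    # against a five-word reference.
    print(word_error_rate("turn on the kitchen lights", "turn on kitchen light"))
```

Lower is better: 0.0 means the output matches the reference exactly, and values above 1.0 are possible when the output contains many inserted words.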

Modern Systems

Whisper: OpenAI's open-source model handling 99 languages.
Google Speech-to-Text: Cloud API with real-time streaming.
Deepgram: Enterprise-focused with speaker diarization.
AssemblyAI: API with summarization and topic detection.

Challenges

Noisy environments, accents, and dialects.
Multiple speakers (diarization).
Domain-specific vocabulary (medical, legal).
Real-time processing latency.
Code-switching (mixing languages mid-sentence).


Last updated: March 5, 2026