There’s a lot of hype around the new Microsoft announcement claiming human parity on the conversational speech recognition task. I don’t doubt for one second that the folks at Microsoft Research are brilliant and working on new and effective techniques. But… that paper basically just says a huge ensemble model == better ASR. While that’s true, it’s maybe not the most useful measurement point.

That said, I’ve compiled a list of papers I’ve been reading that start to touch on the many facets of large vocabulary continuous speech recognition (LVCSR). Many of these aren’t hot off the presses, but each lays a foundation for thinking about the problem in different ways and along different axes of success. Hopefully this can serve as a reality check, and highlight something the Microsoft paper doesn’t mention: the contributions of very smart and innovative organizations that speak specifically to the question, “could we run an industry-leading model in Prod?”

A big data approach to acoustic model training corpus selection
http://193.6.4.39/~czap/letoltes/IS14/IS2014/PDF/AUTHOR/IS140948.PDF

The Deep Speech Trilogy

Deep Speech
https://arxiv.org/abs/1412.5567
Deep Speech 2
https://arxiv.org/abs/1512.02595
Reducing Bias in Production Speech Models
https://arxiv.org/abs/1705.04400
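
If you want a feel for what these three build on: the heart of Deep Speech is an RNN trained with CTC loss over characters, which sidesteps frame-level alignments entirely. Here’s a toy sketch of that setup using PyTorch’s built-in CTC loss; the shapes, layer sizes, and vocabulary are invented for illustration and aren’t Baidu’s actual architecture.

```python
import torch
import torch.nn as nn

BLANK = 0                       # CTC blank symbol
VOCAB = 29                      # blank + 26 letters + space + apostrophe
T, N, FEATS = 200, 8, 161       # frames, batch size, spectrogram bins

# Toy acoustic model: bidirectional RNN over spectrogram frames, projected
# to per-character log-probabilities at every frame.
rnn = nn.GRU(FEATS, 256, bidirectional=True)
proj = nn.Linear(512, VOCAB)
ctc = nn.CTCLoss(blank=BLANK)

x = torch.randn(T, N, FEATS)                  # fake utterance batch
hidden, _ = rnn(x)
log_probs = proj(hidden).log_softmax(dim=-1)  # (T, N, VOCAB)

targets = torch.randint(1, VOCAB, (N, 50))    # fake transcripts, no blanks
input_lengths = torch.full((N,), T, dtype=torch.long)
target_lengths = torch.full((N,), 50, dtype=torch.long)

loss = ctc(log_probs, targets, input_lengths, target_lengths)
loss.backward()
```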

1-Bit Stochastic Gradient Descent and its Application to Data-Parallel Distributed Training of Speech DNNs

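The core trick here is quantizing each gradient value down to a single bit before exchanging it between workers, while feeding the quantization error back into the next minibatch so nothing is lost on average. Below is a rough numpy sketch of that error-feedback loop on a single toy parameter matrix; the real implementation (in CNTK) is fancier about how it picks reconstruction values and, of course, actually runs across many workers.

```python
import numpy as np

def one_bit_quantize(grad, residual):
    """Quantize the error-compensated gradient to one bit per value."""
    g = grad + residual                        # add back last step's error
    scale = np.mean(np.abs(g))                 # single reconstruction value
    quantized = np.where(g >= 0, scale, -scale)
    residual = g - quantized                   # carried to the next minibatch
    return quantized, residual

rng = np.random.default_rng(0)
w = rng.normal(size=(4, 4))                    # toy parameter matrix
residual = np.zeros_like(w)
lr = 0.1

for step in range(100):
    grad = 2 * w                               # gradient of the toy loss ||w||^2
    q, residual = one_bit_quantize(grad, residual)
    w -= lr * q                                # update with the quantized gradient
```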

Scalable Modified Kneser-Ney Language Model Estimation
http://www.aclweb.org/anthology/P13-2#page=738
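
This is the paper behind KenLM’s lmplz estimator, so the fastest way to kick the tires is to build an ARPA model with something like `lmplz -o 5 < corpus.txt > model.arpa` and query it from the Python bindings. A quick sketch; the file name is made up.

```python
import kenlm

# Score text with a model built by lmplz (modified Kneser-Ney n-gram LM).
model = kenlm.Model("model.arpa")
print(model.score("the cat sat on the mat", bos=True, eos=True))

# Per-word breakdown: (log10 probability, n-gram order used, is-OOV flag).
for prob, ngram_length, oov in model.full_scores("the cat sat on the mat"):
    print(prob, ngram_length, oov)
```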

Building an Efficient Neural Language Model Over a Billion Words
https://research.fb.com/building-an-efficient-neural-language-model-over-a-billion-words/
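
The main idea there, the adaptive softmax, has since made it into PyTorch as nn.AdaptiveLogSoftmaxWithLoss, so it’s easy to try without Facebook’s code. A minimal sketch with a made-up vocabulary size and cutoffs, not their actual training setup:

```python
import torch
import torch.nn as nn

VOCAB = 100_000
HIDDEN = 512

# Frequent words live in the cheap "head"; rarer words fall into
# progressively smaller tail clusters.
adaptive = nn.AdaptiveLogSoftmaxWithLoss(
    in_features=HIDDEN,
    n_classes=VOCAB,
    cutoffs=[2_000, 20_000],
)

hidden_states = torch.randn(32, HIDDEN)        # fake LM hidden states
targets = torch.randint(0, VOCAB, (32,))       # fake next-word ids

out = adaptive(hidden_states, targets)
out.loss.backward()                            # the usual NLL training loss
```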

Personalized Speech Recognition On Mobile Devices
https://research.google.com/pubs/pub44631.html

Exploring Sparsity in Recurrent Neural Networks
https://arxiv.org/abs/1704.05119
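
The recipe here is gradual magnitude pruning: as training goes on, zero out an ever-growing fraction of the smallest recurrent weights and keep them zeroed. A toy numpy sketch of a single pruning step; the schedule and layer shape are invented, and the paper’s threshold ramp is more careful than this.

```python
import numpy as np

def prune_smallest(weights, sparsity):
    """Zero out the fraction `sparsity` of weights with the smallest magnitude."""
    threshold = np.quantile(np.abs(weights), sparsity)
    mask = (np.abs(weights) >= threshold).astype(weights.dtype)
    return weights * mask, mask

rng = np.random.default_rng(0)
w_recurrent = rng.normal(size=(256, 256))      # toy recurrent weight matrix

# Ramp sparsity up gradually instead of pruning everything at once.
for sparsity in np.linspace(0.0, 0.9, 10):
    w_recurrent, mask = prune_smallest(w_recurrent, sparsity)
    # ... training updates would go here, multiplied by `mask` so that
    # pruned weights stay at zero ...
```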

Purely sequence-trained neural networks for ASR based on lattice-free MMI
https://pdfs.semanticscholar.org/6ce6/a9a30cd69bd2842a4b581cf48c6815bdfdd8.pdf