Everything you need to know

Speech to text, automatic transcription, automatic subtitles, automatic speech recognition (ASR), computer speech recognition, and voice to text all belong to the same field of AI technology, which turns spoken words into written text!

What is speech to text?

Speech to text is an automatic speech recognition (ASR) system that consists primarily of statistical models which map continuous spoken utterances, or speech waveforms, to text in human language. The ASR system is built from a language model, a pronunciation model (lexicon/dictionary), and an acoustic model. When the ASR system is consistently trained with new speech data from multiple speakers, its vocabulary grows and the accuracy of its transcripts increases. In other words, the more the ASR is used and trained, the more accurate it becomes. Accuracy is measured by the Word Error Rate (WER), the share of words in a transcript that the system gets wrong.

For an ASR model to be considered highly accurate, its WER needs to be below 10%. Txtplay's ASR model is considered highly accurate because it delivers an accuracy of 94%, which, taking accuracy as 100% minus WER, corresponds to a WER of about 6%.
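To make the WER figure more concrete, here is a minimal Python sketch of how WER is commonly computed: the word-level edit distance (substitutions, deletions, and insertions) between a reference transcript and the ASR output, divided by the number of words in the reference. The function name and the simple lowercase-and-split normalization are assumptions made for illustration; this is not a description of how txtplay measures its own accuracy.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words (classic dynamic programming table,
    # kept one row at a time).
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from output
            insertion = curr[j - 1] + 1  # extra word in the output
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


reference = "the quick brown fox jumps over the lazy dog today"
hypothesis = "the quick brown fox jumped over the lazy dog today"
wer = word_error_rate(reference, hypothesis)
print(f"WER: {wer:.0%}, accuracy: {1 - wer:.0%}")  # WER: 10%, accuracy: 90%
```

One wrong word out of ten gives a WER of 10%, so under this definition an accuracy of 94% means roughly six errors per hundred words.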

Why do we need it?

  • Language barriers: subtitles let people follow content in languages they don't speak by translating words or sentences into the user's preferred language.
  • Improved concentration: people with reading or learning difficulties can use subtitles to understand the content better and faster.
  • Noisy environments: in noisy places such as public spaces, aeroplanes or trains, subtitles make it possible to follow the dialogue.
  • Language learning: subtitles are an excellent resource for language learning.
  • Avoid disturbing others: most people watch videos on mute while in public.
  • Citations and references: for researchers, writers and journalists, subtitles can be a valuable source of citations and references when writing about film or media content.
  • Transcription of long interviews: 1 hour of audio content takes about 6 hours to transcribe manually.

Speech-to-text and automatic speech recognition (ASR) make audio content visually accessible by adding text. Speech-to-text and subtitles give hearing-impaired audiences access to content they would otherwise be excluded from. Automatic speech recognition (ASR) has therefore become a necessity for making content accessible to everyone.

Making online video content accessible to everyone is now required by an EU directive, the Web Accessibility Directive (Directive (EU) 2016/2102), which is reflected in the European Disability Forum Guidelines and builds on the Web Content Accessibility Guidelines (WCAG).

EU directive dates and legislation:

  • September 23, 2019: accessibility requirements apply to public sector websites published after September 23, 2018.
  • September 23, 2020: the requirements also apply retroactively, that is, to all public sector websites, including older ones.
  • June 23, 2021: the legislation applies to mobile applications too.

How does speech to text with speech recognition work?

With automatic speech recognition (ASR), an algorithm recognizes the words spoken in your video and delivers machine-generated text for indexing, subtitling, and searching. The result is good and usable, but not always perfect, and it depends heavily on the sound quality of the source material.

If there is a lot of noise, or if many people are talking over each other, the result will not be perfect. The subtitles therefore often need to be reworked, but the most time-consuming work, especially the timing of the subtitles, has already been done. Txtplay can provide post-processing of the subtitles, billed by the hour.



Search for a word or click on the text to navigate. The subtitles of this example are not post-processed and are provided directly from our speech-to-text algorithm.
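As a rough sketch of how word search and click-to-navigate can work on top of ASR output, the Python example below builds an index from each recognized word to the times at which it is spoken. The input shape, a list of (word, start time in seconds) pairs, and the function names are assumptions made for illustration and do not describe txtplay's actual data format.

```python
from collections import defaultdict


def build_word_index(timed_words):
    """Map each normalized word to the start times (in seconds) where it occurs."""
    index = defaultdict(list)
    for word, start in timed_words:
        # Strip surrounding punctuation and ignore case so "World," matches "world".
        key = word.strip(".,!?;:\"'").lower()
        if key:
            index[key].append(start)
    return dict(index)


def find_word(index, query):
    """Return every start time for the query word, or an empty list."""
    return index.get(query.lower(), [])


# Hypothetical ASR output: each word paired with its start time in seconds.
timed_words = [("Hello", 0.0), ("world,", 0.4), ("hello", 1.2), ("again.", 1.6)]
index = build_word_index(timed_words)
print(find_word(index, "hello"))  # [0.0, 1.2]
```

A player can then jump to any of the returned times, which is the same idea as clicking a word in the transcript to move the playhead.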

Getting started!

Step 1: Create your txtplay account
Click any "Get started" button on our website to go directly to our "create account" page. There, enter your name, email address, and a password of your choice. Once you have created your account, we will send an activation link to your email address. If you don't receive the activation email, check your junk or spam folder. Otherwise, email our support at contact@imgplay.com.

Step 2: Choose your payment method
Txtplay accepts most payment methods, including American Express, Visa, and Mastercard.

Step 3: Choose one of our pricing plans
Choose between Pay as you go, Pro, and Enterprise.

Pay as you go: €0.25 per minute of transcribed media. Perfect for fast, easy access to transcripts and subtitles!
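For a quick estimate of what a file would cost at the Pay as you go rate, the arithmetic can be sketched in Python as below. The sketch assumes the price is strictly proportional to the media duration and rounded to the nearest cent; the actual rounding and billing rules are not stated here, and the function name is hypothetical.

```python
from decimal import Decimal, ROUND_HALF_UP

# Pay as you go rate quoted above, in euros per minute of media.
PRICE_PER_MINUTE_EUR = Decimal("0.25")


def estimate_cost(duration_seconds: int) -> Decimal:
    """Estimate the price in euros, assuming proportional per-minute billing."""
    minutes = Decimal(duration_seconds) / Decimal(60)
    cost = minutes * PRICE_PER_MINUTE_EUR
    # Assumption: round to whole cents, halves rounded up.
    return cost.quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)


print(estimate_cost(60 * 60))  # one hour of media: 15.00
```

Under these assumptions, a one-hour interview costs €15.00.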

Pro: For professionals with daily transcription needs. The more transcription you include, the greater the discount. Contact our sales manager for details!

Enterprise: High-volume transcription discounts with advanced features and settings built for teams. Contact our sales executive to discuss a collaboration!

Step 4: Upload your audio or video file
Upload your media and select your language. Our speech recognition engine will take care of the job and notify you when it's done. You can continue working while our AI is doing the magic.

Step 5: Edit your transcript
We connect your media to the transcript in our online text editor, where you can update and highlight text, detect speakers, search through your text, and scroll through your audio or video.

Step 6: Export to over 20 formats
We support over 20 formats, including SRT, VTT, and .docx. You can fine-tune the export with details such as timecodes, Atlas format, and speakers. We also have developer-friendly options.
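To show what an exported subtitle file contains, here is a minimal Python sketch that turns timed transcript segments into SRT text: a running number, a line with start and end times in HH:MM:SS,mmm form, the subtitle text, and a blank line between entries. The Segment class and function names are assumptions made for illustration; this is not txtplay's export code.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    start: float  # start time in seconds
    end: float    # end time in seconds
    text: str     # subtitle text for this time span


def srt_timestamp(seconds: float) -> str:
    """Format seconds as an SRT timestamp, HH:MM:SS,mmm."""
    millis = round(seconds * 1000)
    hours, millis = divmod(millis, 3_600_000)
    minutes, millis = divmod(millis, 60_000)
    secs, millis = divmod(millis, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{millis:03d}"


def to_srt(segments) -> str:
    """Render segments as numbered SRT entries separated by blank lines."""
    entries = []
    for number, seg in enumerate(segments, start=1):
        times = f"{srt_timestamp(seg.start)} --> {srt_timestamp(seg.end)}"
        entries.append(f"{number}\n{times}\n{seg.text}\n")
    return "\n".join(entries)


segments = [
    Segment(0.0, 2.5, "Welcome to the interview."),
    Segment(2.5, 5.0, "Thank you for having me."),
]
print(to_srt(segments))
```

The resulting text can be saved with an .srt extension and loaded by most video players and editing tools.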

Free versus paid speech to text

Free speech-to-text services are available online, but there are some differences between free and paid versions that we think are important to keep in mind. First, the transcripts that free ASR services generate often have low accuracy. At txtplay we always strive to increase our accuracy so that we can deliver high-quality transcripts to our customers. Second, free services come with a price, and that price is your data, which they keep for their own interests. At txtplay we believe in privacy and safety, which is why we always delete all our customers' data as soon as processing is done. Third, free services sometimes do not support the file format you need. At txtplay we support transcript export to over 20 formats, so our customers can, for example, download the transcript as a Word file to include in an interview or as an SRT file to add subtitles to a video.

Free speech to text services

  • Low accuracy
  • Limited processing
  • No customer support
  • No clear data storage guidelines
  • No online platform editor
  • Limited file export possibilities

Txtplay's paid speech to text services

  • High accuracy
  • Unlimited processing
  • Customer support
  • Clear data storage guidelines
  • Online platform editor
  • Export to over 20 file formats