skeegan at htctu.net
Thu May 8 12:47:17 PDT 2008
> What experience/hands-on knowledge might any of you have with
> Docsoft software used for captioning?
First the short version:
As a tool to generate basic text information that could be used for
searching an audio file, this is pretty good. As a tool to automatically
generate captioned text information, I do not believe this tool is
sufficient by itself (i.e., you *will* need someone to review and edit the
Here is a longer version:
We did some testing with the system on an audio file that was recorded in a
studio quality environment with multiple takes to get the audio track "just
right". When we ran the file through the DocSoft engine, we got about 91-92%
accuracy. That is roughly one wrong word out of every 11 to 12. Others who were testing
the system (for podcasting-type situations) were able to get similar levels
of recognition provided that they scripted out what they planned to say
We then ran some audio clips that were recorded in more of a "classroom"
type environment (i.e., non-studio, more dynamic interaction, etc.) and were
able to get about 80-85% accuracy, which was similar to what others were
getting. In reality, the problem was not so much that the system failed to
recognize individual words; it was mis-recognizing whole
phrases/sentences. Because it is automated speech recognition, it is not so
much an issue of misspelling a word, but mis-recognizing the spoken word as
something altogether different. We ended up with text content that was very
different from what was spoken.
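For anyone who wants to check accuracy figures like these on their own clips, here is a minimal sketch in Python. It assumes you have a hand-made reference transcript to compare against, and it counts word-level substitutions, insertions and deletions (a common way to score speech recognition). I do not know how DocSoft computes its own numbers, so the function names and the sample sentences below are purely illustrative.

```python
def word_errors(reference, hypothesis):
    """Count the word-level edits (substitutions, insertions, deletions)
    needed to turn the recognized text into the reference transcript."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    # previous[j] holds the edit distance between the reference words
    # seen so far (minus the current one) and the first j recognized words.
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            current.append(min(previous[j] + 1,          # word missing from output
                               current[j - 1] + 1,       # extra word in output
                               previous[j - 1] + cost))  # match or substitution
        previous = current
    return previous[-1]


def word_accuracy(reference, hypothesis):
    """Fraction of reference words recognized correctly (1 - error rate)."""
    ref_len = len(reference.split())
    if ref_len == 0:
        raise ValueError("reference transcript is empty")
    return 1.0 - word_errors(reference, hypothesis) / ref_len


# Made-up example: one substituted word ("box" for "fox") in ten.
spoken = "the quick brown fox jumps over the lazy dog today"
recognized = "the quick brown box jumps over the lazy dog today"
print(f"{word_accuracy(spoken, recognized):.0%}")  # 90%
```

Note that a score like this treats every wrong word equally, so it will not show whether the errors are scattered or clustered into whole mis-recognized phrases, which is what made editing hard for us.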
We found that when the recognition went below 90%, it became much more
difficult to edit the generated transcript. The generated text content was
very different from the spoken audio content to the point it did not make
sense. This was not an issue of correcting just one or two words, but
having to repeatedly review whole sentences/phrases to edit the text content
vs. the spoken content. For content that was in the 80% region, there were
significant problems with the content being totally out of context. It may
be more effective to transcribe/parrot the audio file separately as opposed
to using an automated solution.
At the time, DocSoft had an editor tool in beta development that we did
not get a chance to use, but their developers estimated that it would take
a person approximately 1-1.5 times the length of the audio clip to
edit the recognized text (this is after the audio clip has already been
processed). So, for a 30-minute audio clip, you would be looking at a
total processing time of 1 hour to 1 hour 15 minutes (30 minutes for audio
clip processing, and 30-45 minutes for editing the recognized text).
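To work out the same estimate for other clip lengths, the arithmetic can be sketched as below. This assumes the engine processes audio in roughly real time (as in the 30-minute example) and uses the developers' 1-1.5x editing estimate; the function and parameter names are my own, not anything from DocSoft.

```python
def turnaround_minutes(clip_minutes, edit_factor_low=1.0, edit_factor_high=1.5):
    """Return (low, high) total minutes: engine processing plus human editing."""
    # Assumption: recognition takes about as long as the clip itself.
    processing = clip_minutes
    return (processing + clip_minutes * edit_factor_low,
            processing + clip_minutes * edit_factor_high)


low, high = turnaround_minutes(30)
print(low, high)  # 60.0 75.0
```

Keep in mind the editing factor was an estimate for an editor tool still in beta, and our experience suggests it would be higher for content recognized at below 90%.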
The DocSoft tool is basically running the Dragon NaturallySpeaking
engine (from Nuance), which is probably one of the better automated
speech recognition engines commercially available. There is an option to
train the engine for a particular user, and this may improve recognition. I
was unable to test this component.
So, as a tool to create basic text from an audio file for searching,
I think DocSoft is a good option. As a tool to automatically
create transcripts (or captioned files), I think there is a lot more
work that needs to be done. From what I have seen, someone would need to
go back through and proof the generated text AND audio to
The question I ask is, what is the intent of using such a system? If
you have an accurate transcript, then there are various vendor options
for creating the time-stamped text file (and the level of searchability is
FAR more granular). If all you are interested in doing is basic audio
mining to add searchability to the audio content, then I think DocSoft has a
very useful platform.