Speech to Text: timesaver or time waster?
October 7th, 2007 by jose

We academics should be obsessed with the amount of stuff we write, and it could be that one bottleneck of our output is simply the speed at which we type. We have provided some tools to help you write faster (see our review of an autocompleter here), but audio can also be a very good way to get your ideas into a more manageable form, whether that ends up as text or simply as an audio file. For example, it is very, very easy to do a brain dump using audio: you just start talking about the idea you just had and try to put it in a way that sounds reasonable, so that if you go to other people and play it, they will understand what you are saying.
In that sense, audio is a lot better simply because you speak much faster than you type.
Actually, this post has been dictated into Audacity, the free audio recorder I use for dictating. One of the things that changed my mind and made me try dictation was Peter Fisher's podcast series; Peter Fisher is a professor at MIT, and he has a series of podcasts on academic productivity. I seriously recommend his stuff (see my review here); I think there is plenty of very valuable advice in his podcasts. But anyway, I want to go through the advantages of using audio as a means of getting your ideas down on paper.
The first advantage is that audio forces linearity on you. When I write text, I can jump around freely: I can go to the introduction, then add to the end of the paper; I can work on the Methods section, go back to the intro, then back to Methods, and so on. This is not something you can do with audio; you really have to start at point A and run all the way to point Z. This could be an advantage or a disadvantage, but for short pieces like a blog post or a quick note, it should be an advantage.
The second advantage is that dictating also prevents multi-tasking; that is, when you are recording audio, you cannot be working on other things at the same time, like reading email while writing a paper (most people do that). Setting up the recording program means that you will do only dictating for a while. I don't know about you, but having my undivided attention on just one task helps. I have been known to do several tasks at the same time, and it is really not a good idea. We will talk about multi-tasking at some point on Academic Productivity, but right now I am pretty happy knowing that I have to talk into a microphone and that is the only thing I have to do, so there is no easy way I can be doing two things at once, like listening to music and writing, for example.
The third advantage of audio is that it removes a barrier of entry for developing an idea. For example, sometimes I feel too lazy to fully develop an idea into a manuscript because it is just too complicated: I can see way too many steps I would have to take until the full paper is written, and it is a pain to type the idea down, correct the spelling, and so on. When you are dictating, you don't have that; you don't have to worry about spelling, you don't have to worry much about structure, and you can simply get your idea down as quickly as possible. Of course, for longer-form pieces you are going to need a structure (but interestingly, not for something like a note, a blog post, or a quick introduction for a letter).

For longer-form pieces, I would recommend you work on your outline with paper and pencil or a computer, and dedicate as much time as you want to the outline. Then, and only then, dictate. You may even have some standard structure that you just copy, paste, and repeat over and over. Take experimental papers: all experimental papers have (1) an introduction, (2) a theory part where you connect your ideas with previous theories or compare the predictions of two existing theories, (3) a Method section that describes what you actually did, (4) a Results section, and (5) a Discussion. This is a pretty common structure, and if you have many experiments you may well have it repeated x times. It is good to have that structure in place before you start dictating, and you can even go into fine-grained detail and say, "Okay, the introduction must have three paragraphs; in the first one I am going to explain idea y, in the second one idea z," and so on. You may even make notes on what each paragraph is going to contain and then dictate, elaborating on those ideas. So you can get a detailed outline and then dictate to fill in the gaps in that outline.
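To make the idea concrete, here is a minimal sketch in Python of what such a reusable skeleton could look like (my own illustration, not a tool mentioned in this post; the section names and the per-paragraph notes are just the example from the paragraph above). It prints a plain-text outline, repeated once per experiment, that you can keep in front of you while dictating.

```python
# A minimal outline generator: the standard experimental-paper skeleton described
# above, repeated once per experiment, with optional per-paragraph notes to dictate from.

SECTIONS = ["Introduction", "Theory", "Method", "Results", "Discussion"]

def make_outline(n_experiments, notes=None):
    """Return a plain-text outline to print out and keep in view while dictating."""
    notes = notes or {}
    lines = []
    for exp in range(1, n_experiments + 1):
        lines.append(f"Experiment {exp}")
        for section in SECTIONS:
            lines.append(f"  {section}")
            # Attach any paragraph-level notes prepared for this section.
            for note in notes.get(section, []):
                lines.append(f"    - {note}")
    return "\n".join(lines)

print(make_outline(2, notes={"Introduction": ["paragraph 1: explain idea y",
                                              "paragraph 2: explain idea z"]}))
```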
For outlining, the best tool I have found is Microsoft OneNote. But that is worth a blog post in itself, so I am not going to elaborate on OneNote here.
A fourth advantage of dictating is that it is hands-free, so you could be doing something else at the same time (again, only if you really have to multi-task while dictating). In an extreme case of both multitasking and having no life, you could be feeding your baby, or doing something else that requires your hands, while dictating an idea to your computer; you could even use a portable voice recorder. You might be waiting for the bus, or commuting somehow, and dictating. I have not tried driving and dictating, or doing delicate things while dictating, but I guess that with practice you may even be able to do that.
Of course, dictating has disadvantages; let us cover those too. For a start, it is very difficult to start dictating and get a perfectly fine manuscript out of your brain dump. It is going to take several iterations; I have only just started dictating, so I don't really know how good the result will be. I think it may be worth it, because you get a mass of text that you can then work on: you can remove paragraphs, paste them together, and clean things up, which should not be that difficult. With a few iterations you can get a pretty good manuscript that you can just send off somewhere.
Finally, I'd like to compare sending your ideas out to the world in pure audio form (what is called a podcast) versus a text post. I mean: if I have dictated all this, why would I not just post the audio file?
I think there are several advantages to text. First, text can definitely be consumed faster: we are trained to skim, and we do it constantly, whereas it is a lot more difficult to skim audio (although possible). The non-skimmable character of some speed-reading methods is precisely what made me abandon them. Second, text is indexed. No search engine indexes audio or video yet (although I'm sure it's coming), so it is harder to get noticed if you produce only audio, and people will have trouble tracking down where you said what, which makes quotation difficult.
A nice advantage of audio, though, is that it can be consumed anywhere; for example, I listen to podcasts while jogging or commuting.
I know of at least one successful academic who dictates: Robin Dawes, in the Department of Social and Decision Sciences at Carnegie Mellon University. Drop a comment if you know of anyone else, or if dictation has made a big impact on the way you write.
Now, finally, I want to talk about how you can get your audio into text form, if you want to do that. The first way is to use a transcription service, which is what I am using right now. It can be costly (about $1 a minute), and it takes a few days for the person to do the job, which is the downside of this option. The person I am using right now takes three to five days, and then you have to go through the text and make sure it is the way you wanted it to be. There will always be mistakes in the transcription, but I still think it is the most convenient way, as you will see below.
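To give a rough sense of the cost, here is a quick back-of-envelope sketch in Python (my own illustration; the $1-a-minute rate is the one quoted above, while the dictation speed of about 130 words per minute is an assumption, and your own pace will differ):

```python
# Rough cost estimate for outsourced transcription (assumed numbers; see text above).
RATE_PER_MINUTE = 1.00   # USD per minute of audio, the ~$1/minute quoted in the post
SPEAKING_WPM = 130       # assumed dictation speed, in words per minute

def transcription_cost(word_count):
    """Estimate audio length (minutes) and transcription cost (USD) for a draft."""
    minutes = word_count / SPEAKING_WPM
    return minutes, minutes * RATE_PER_MINUTE

for words in (800, 6000):  # roughly a blog post vs. a full manuscript
    minutes, cost = transcription_cost(words)
    print(f"{words} words: about {minutes:.0f} minutes of audio, about ${cost:.0f}")
```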
The second way is to use software that does the transcription automatically. The best one is supposed to be Dragon NaturallySpeaking, now at version 9.1. To tell you the truth, I haven't tried their software; I haven't really spent time working with Dragon to see how well it can transcribe. They report accuracy rates of around 99%, but I know you have to train the system, and it may take a while to get there. It is pretty CPU intensive, and it may be a bit disappointing the first time you use it, because you have to correct errors on the fly: as you talk, you see the text appear on your screen, in almost any program, like Word or OneNote; you can even compose an email with your voice and watch the words appear as you speak, which is pretty cool. But you will see that plenty of things are misspelt or just plain wrong; sometimes it is really hilarious. So it takes quite a bit of work; you have to type corrections, which is pretty attention-demanding, so I would say transcription services are actually better.
UPDATE: Now I have given Dragon a try. And unfortunately, my first impression is that Dragon shows an impressive collection of fatal mistakes.
- The program asks permission to scan things you have written to improve and personalize its vocabulary. That's great; but then it only scans the 'My Documents' folder. What if you use a different folder to save your writing? Then Dragon will miss it. First fatal design flaw.
- When you fire up Dragon, it installs a large toolbar across your screen. Well, if you close the program, the space that used to be reserved for that toolbar is never released (!). Talk about civic behavior! All other applications, even after closing Dragon, will have lost a few inches of screen real estate. This is just rude.
- In use. The initial accuracy was pretty poor, to the point of making me laugh out loud. But then the laughs would show up as text (Dragon's best guesses!). You obviously need to do some work on the produced text, but if you correct the mistakes on the fly with the keyboard, the process is so slow and so distracting that you are better off writing the thing yourself.
- Performance. I wonder how, with current technology and fast CPUs, this program needs several seconds to figure out what the hell you were saying.
Conclusion: honestly, if I were the CEO of this company, I would never have released this program. You may get some use out of Dragon in extreme circumstances, i.e., if you have had an accident that prevents you from using your hands and you cannot type, or if you have to hold your children while writing your paper and you really need to dictate. But overall, an extremely disappointing experience.
October 13th, 2007 at 10:35 pm
I’m surprised that you had such a bad experience with Dragon. I remember that the New York Times gave Dragon an outstanding review. The accuracy was above 90%, if I remember correctly.
October 14th, 2007 at 11:18 am
I did my PhD on (among other things) how people who can’t type make Dragon (or similar software) useful and useable day-to-day.
It requires so much effort to be a productive user of desktop speech recognition software that, unless you are absolutely compelled to use it for reasons of injury or other disability, you are better off not using it, if only to save yourself the frustration.
October 15th, 2007 at 12:31 pm
Thanks for the extensive write-up. I came to the same conclusion (more at: The 4-hour workweek applied: How I spent $100, saved hours, and boosted my reading workflow
http://ideamatt.blogspot.com/2007/08/4-hour-workweek-applied-how-i-spent-100.html)
For general capture, I found voice unwieldy – I capture a lot during the day, and a 1 or 2 day turnaround time is the max I can live with. I think outsourcing this kind of day-to-day information would be expensive and complex. I find the plain old pad of paper works great. Write anytime during the day (or night), tear off, toss into inbox, and process daily!
October 15th, 2007 at 1:39 pm
I’ve no experience of it, but I’m interested in Speech Dasher. It’s only a prototype so far, but it could really be sweet for transcribing:
http://www.inference.phy.cam.ac.uk/kv227/speechdasher/
Quote:
Speech Dasher is a novel interface for the input of text using a combination of speech and gestures. A speech recognizer provides the initial guess of the user’s desired text while a gesture-based interface allows the user to confirm and correct the recognizer’s output.
It is hoped that Speech Dasher will provide a text input interface which is:
More efficient – allowing faster input than either speech or gestures alone.
More fun – providing a consistent and less frustrating method of correcting speech recognition errors.
More accessible – enabling text input by people unable to use a keyboard and by those using mobile devices.
December 10th, 2007 at 3:40 am
Responding to Terri Yu: I agree that David Pogue at the NYT did a review of the latest Dragon in which he raved about its accuracy. I think he claimed about 90 percent before training, and about 95 percent after that. Ninety percent sounds impressive, but consider the following (I am sorry I don’t remember where I first read this): The typical English word is about five characters, so 20 words would equal 100 characters–and, with 95 percent accuracy, several of those words (on average, five of twenty) would have some sort of error. Of course, the errors wouldn’t distribute this way–my guess is that those of us in academia would see errors in technical terms in our work, not in the sort of words I am using in this post, for example. I am a social scientist, so I see words like “postmodernity” and “reify” and “ontological” perhaps a bit too often–I suspect that these are the sort of words that will give Dragon or other voice recognition fits.
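If the 95 percent figure is read as per-character accuracy and each character error lands in a different word, the "five of twenty" estimate comes out about right. Here is a minimal check in Python under exactly those assumptions (the numbers are the ones used above, nothing measured):

```python
# Back-of-envelope check of the word-level error estimate above (assumed numbers).
char_accuracy = 0.95   # per-character accuracy, as assumed above
chars_per_word = 5     # typical English word length assumed above
n_words = 20

# Probability that a five-character word contains at least one mis-recognized character.
p_word_has_error = 1 - char_accuracy ** chars_per_word
print(f"P(a word contains an error) = {p_word_has_error:.2f}")                               # ~0.23
print(f"Expected words with an error out of {n_words} = {n_words * p_word_has_error:.1f}")   # ~4.5
```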
I have a colleague who had arthritis and had to use voice, but she put an awful lot of time into it. I can type very fast, so as much as I think the idea is super cool, I just cannot justify it.
My dream is to have highly accurate voice recognition that can take dictation from me–so that when I am driving, or working out, or whatever, I can make a voice note that I upload to a computer, and then get back as text. I’ve seen recorders that allegedly do this, but I strongly doubt it works.
December 23rd, 2007 at 4:31 am
I had a very bad experience with Dragon also. It took me 5 days of playing with their speech recognition, and I still had some issues. I tried their support, but they are not very good at helping. So I do not recommend buying this program. I am still looking for good software.
October 11th, 2008 at 5:51 am
In my opinion, it all depends on calibration. Of course there are better pieces of equipment out there, and you can't expect all of them to function the same. At the end of the day, it is all about how the voice recorder, for example, is calibrated.