Speech Synthesis & Speech Recognition
Brian Long (www.blong.com)
Introduction
This article will look at support for speech in Microsoft Windows and see what's involved in incorporating aspects of speech technology in Windows applications. In particular we examine the Microsoft Speech API (SAPI) to see what it offers developers in terms of letting applications speak to users and also understand what users say. Since there is a lot of information (text and code) here, the article has been split over a number of pages. This page is an introduction to the subject whilst the other two pages (which have enough information to make them individual articles) look in detail at using SAPI 4 and SAPI 5.1 within Delphi applications.
Once upon a time the task of making your application speak or understand its user's commands was science fiction or at the very least involved lots of computing power. We will see that today's computers and speech technology enable any application to use speech technology with relative ease and good performance.
Microsoft have been researching and implementing speech technology for some
years and they have an area of their Web site dedicated to the matter at http://www.microsoft.com/speech.
Speech Technology
The speech capabilities that can be added to an application are text-to-speech synthesis (TTS) and speech recognition (SR).
Text-To-Speech Synthesis (TTS)
This involves turning a string into spoken language that is played through the computer speakers. The complexities of turning words into phonemes, adding appropriate emphasis and translating the result into digital audio are beyond the scope of this paper and are catered for by a TTS engine installed on your machine.
The end result is that the computer talks to the user to save the user having to read some text on the screen.
Speech Recognition (SR)
This involves the computer taking the user's speech and interpreting what has been said. This allows the user to control the computer (or certain aspects of it) by voice, rather than having to use the mouse and keyboard, or alternatively to dictate the contents of a document.
The complex nature of translating the raw audio into phonemes involves a lot
of signal processing and is not focused on here. These details are taken care
of by an SR engine that will be installed on your machine. SR engines are often
called recognisers and these days typically implement continuous speech
recognition (older recognisers implemented isolated or discrete speech recognition,
where pauses were required between words).
Speech recognition usually means one of two things. The application can understand
and follow simple commands that it has been educated about in advance. This
is known as command and control (sometimes seen abbreviated as CnC, or
simply SR).
Alternatively an application can support dictation (sometimes abbreviated
to DSR). Dictation is more complex as the engine has to try and identify arbitrary
spoken words, and will need to decide which spelling of similarly sounding words
is required. It develops context information based on the preceding and following
words to try and help decide. Because this context analysis is not required
with Command and Control recognition, CnC is sometimes referred to as context-free
recognition.
Speaker Profiles
Dictation speech recognition is speaker-dependent, meaning that because of different people's enunciation, accent, pitch and so on, recognisers require a speaker profile to be set up for decent results. This profile results from training sessions that educate the recogniser about the nuances of the speaker's voice.
On the other hand, command and control speech recognition is usually speaker-independent.
The Previous State Of Affairs
As the incorporation of speech technology became more realistic, more vendors released TTS and SR engines. Unfortunately each engine had its own API and so interchanging them was not possible. Programming for multiple engines meant a lot of recoding and the whole situation was very similar to the database API programming problem before the advent of the BDE and ADO.
The Current State Of Affairs
In late 1995 the Microsoft Speech API (SAPI) was introduced as part of the Windows Open Services Architecture (WOSA) services. This was intended to simplify matters and has done a good job of doing so, at least relatively speaking. Depending on what you wish to do it can involve some tricky coding, but that's the same story with the basic Windows API.
SAPI is currently (at the time of writing) at version 5.1 and now offers a single API, made up of a set of interfaces that you can program against to get TTS and/or SR in your application. However, up until version 4 an alternative API was in use. In fact there were two APIs defined, but neither of these is now documented and Microsoft recommends the new API.
This means that TTS and SR engines will have to be labelled as either SAPI 4 compliant (if they use the old interfaces) or SAPI 5 compliant (if they use the new interfaces). However, given the widespread use of the older interfaces you shouldn't expect Microsoft to stop them being available any time soon.
The Microsoft Speech SDK can be obtained via Microsoft's Web site and when installed provides documentation on the APIs. Because of the complete differences between SAPI 4 and SAPI 5.1 you can install both SDKs on a single machine and take advantage of any of the available APIs.
SAPI applications programmers call the interfaces defined in the API and SAPI-compliant TTS and SR engines implement those interfaces. SAPI supports text to speech (TTS), speech recognition (SR), dictation speech recognition (DSR) and also telephony (TEL). We will explore TTS and also see what's involved with both types of SR in this paper.
SAPI 4
The SAPI 4 SDK is available in two flavours from http://www.microsoft.com/speech/download/old.
You can download the SDK itself (the download file is called SAPI SDK 4.exe
and is around 8 MB) or the SDK Suite (SAPI SDK 4 Suite.exe, around 40 MB). The
SDK contains the runtime binaries and documentation, but no speech engines.
The SDK Suite also contains Microsoft's TTS and SR engines, as well as a couple
of useful applications (Microsoft Voice and Microsoft Dictation). You would
be advised to download and install the SDK Suite.
In order to get anywhere we need some Pascal representation of the various
interfaces, constants and structures defined by SAPI. You can get everything
you need from those helpful people in the JEDI project (http://delphi-jedi.org).
A translated version of the needed SAPI files can be obtained from http://delphi-jedi.org/api/sapi.zip.
This provides two Delphi import units, speech.pas and spchtel.pas, which correspond
to speech.h and spchtel.h from the SDK.
Of the two, speech.pas is the key unit, as it defines all the important interfaces you will need that are not defined anywhere in type libraries.
There are various issues, anomalies and bugs in SAPI 4, which
are mentioned as notes in this paper where they crop up. The number of issues
in the entire API was one of the reasons Microsoft decided to start from scratch
with SAPI 5 (a directive from the upper echelons of the company started the
SAPI project afresh with new developers at version 5).
Windows 2000 has the SAPI 4 runtime binaries installed
by default (in C:\WINNT\Speech) along with the SAPI 4 compliant Microsoft TTS
engine (although with only the Sam voice available). Installing the SAPI 4 SDK
Suite gives additional voices and also the Microsoft SR engine.
SAPI 5.1
You can download the latest SAPI SDK from http://www.microsoft.com/speech/download/SDK51.
There are no specific import units required to program with SAPI 5.1. Most of
the key functionality is exposed through a number of rich Automation objects
and the type libraries contain all the constants, types and interfaces required
to implement SAPI 5.1 applications.
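As a rough illustration of the Automation route, the following minimal sketch creates the SAPI 5.1 voice object by late binding and asks it to speak a string. It assumes SAPI 5.1 is installed and a Delphi 6 or later compiler (for the Variants unit); the program name and the sample text are purely illustrative. Importing the type library and using early binding is an alternative that gives compile-time checking.

```delphi
program SpeakHello;

{$APPTYPE CONSOLE}

uses
  ActiveX,   // CoInitialize / CoUninitialize
  ComObj,    // CreateOleObject
  SysUtils,  // Exception
  Variants;  // Unassigned (Delphi 6 and later)

var
  Voice: OleVariant;
begin
  // A console program must initialise COM itself
  CoInitialize(nil);
  try
    try
      // Late-bound creation of the SAPI 5.1 voice Automation object
      Voice := CreateOleObject('SAPI.SpVoice');
      // Second argument holds the speak flags; 0 requests the default
      // (synchronous) behaviour, so the call returns when speech ends
      Voice.Speak('Hello from Delphi', 0);
    except
      on E: Exception do
        Writeln('Speech failed: ', E.Message);
    end;
  finally
    // Release the Automation object before shutting COM down
    Voice := Unassigned;
    CoUninitialize;
  end;
end.
```

Late binding keeps the sketch free of any import units, at the cost of method names only being checked at run time.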
Windows XP has the SAPI 5.1 runtime binaries installed
by default (in C:\Program Files\Common Files\Microsoft Shared\Speech) along
with the SAPI 5.x compliant Microsoft TTS engine (although with only the Sam
voice available). The downloadable version of SAPI 5.1 is more recent than the
version shipping with Windows XP.
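To check which SAPI 5.x voices a particular machine actually offers, something along the following lines could be used. This is only a sketch under the same assumptions as any late-bound Automation code (SAPI 5.1 installed, Delphi 6 or later); it relies on the GetVoices method of the voice object and the GetDescription method of each returned token, and the program and variable names are made up for the example.

```delphi
program ListVoices;

{$APPTYPE CONSOLE}

uses
  ActiveX,   // CoInitialize / CoUninitialize
  ComObj,    // CreateOleObject
  SysUtils,  // Exception
  Variants;  // Unassigned (Delphi 6 and later)

var
  Voice, Tokens: OleVariant;
  I, Count: Integer;
  Desc: string;
begin
  CoInitialize(nil);
  try
    try
      Voice := CreateOleObject('SAPI.SpVoice');
      // Empty attribute strings mean "no filtering": return every voice
      Tokens := Voice.GetVoices('', '');
      Count := Tokens.Count;
      for I := 0 to Count - 1 do
      begin
        // 0 asks for the description in the default locale
        Desc := Tokens.Item(I).GetDescription(0);
        Writeln(I, ': ', Desc);
      end;
    except
      on E: Exception do
        Writeln('Could not enumerate voices: ', E.Message);
    end;
  finally
    // Release the Automation objects before shutting COM down
    Tokens := Unassigned;
    Voice := Unassigned;
    CoUninitialize;
  end;
end.
```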
Using SAPI 4 In Delphi Applications
The older SAPI 4 interfaces are defined in two ways. There are high
level interfaces, intended to make implementation easier, but which sacrifice
some of the control. These are intended for quick results but can be quite effective.
There are also low
level interfaces, which give full control but involve more work to get going.
These are intended for the serious programmer to work with.
The high level interfaces are implemented
by Microsoft in COM objects to call the lower
level interfaces, taking care of all the nitty-gritty. The low
level interfaces themselves are implemented by the TTS and SR engines that
you obtain and install.
You can find coverage of using the SAPI 4 high level interfaces to build
speech-enabled Delphi applications on the accompanying SAPI 4 High Level
Interfaces page.
Coverage of using the low level interfaces can be found on the accompanying
SAPI 4 Low Level Interfaces page.
Using SAPI 5.1 In Delphi Applications
SAPI 5.1 consists of low level COM interfaces and rich, high level Automation
interfaces. There are no Delphi translations of the COM interfaces, so we are
limited to using the Automation interfaces (these were not present in the original
SAPI 5.0 release but were added in the SAPI 5.1 update). Information on using
the SAPI 5.1 Automation interfaces to build speech-enabled Delphi applications
can be found on the accompanying SAPI 5.1 page.
Summary
Adding various speech capabilities into a Delphi application does not take an awful lot of work, particularly if you do the background work to understand the SAPI concepts.
There is much to the Speech API that we have not looked at in these pages but hopefully the areas covered will be enough to whet your appetite and get you exploring further on your own.
Acknowledgements
Thanks are due to Alec Bergamini of O&A Productions for help getting out of
a number of holes whilst writing these articles. O&A Productions develop a set
of native Delphi components that make SAPI application development much simpler
- you can find more information at http://www.o2a.com.
About Brian Long
Brian Long used to work at Borland
UK, performing a number of duties including Technical Support on all the programming
tools. Since leaving in 1995, Brian has been providing training and consultancy
on Borland's RAD products, and is now moving into the .NET world.
Besides authoring a
Borland Pascal problem-solving book published in 1994, Brian is a regular
columnist in The
Delphi Magazine and has had numerous articles published in Developer's Review,
Computing, Delphi
Developer's Journal and EXE Magazine. He was nominated for the Spirit
of Delphi 2000 award and was voted Best Speaker at Borland's BorCon
2002 conference in Anaheim, California by the conference delegates.
There are a growing number of conference papers and articles available on Brian's
Web site, so feel free to have a browse.
In his spare time (and waiting for his C++ programs to compile) Brian has learnt
the art of juggling and
making inflatable origami
paper frogs.