The simplicity of the speech server abstraction described above meant that version 0 of the speech server was running within an hour after I started implementing the system. This meant that I could then move on to the more interesting part of the project: producing good quality spoken output. Version 0 of the speech server was by no means perfect; it was improved as I built the Emacspeak speech client.
A friend of mine had pointed me at the marvels of Emacs Lisp advice a few weeks earlier. So when I sat down to speech-enable Emacs, advice was the natural choice. The first task was to have Emacs automatically speak the line under the cursor whenever the user pressed the up/down arrow keys.
In Emacs, all user actions invoke appropriate Emacs Lisp functions. In standard editing modes, pressing the down arrow invokes function next-line, while pressing the up arrow invokes previous-line. To speech-enable these commands, version 0 of Emacspeak implemented the following rather simple advice fragment:
(defadvice next-line (after emacspeak)
  "Speak line after moving."
  (when (interactive-p) (emacspeak-speak-line)))
The emacspeak-speak-line function implemented the necessary logic to grab the text of the line under the cursor and send it to the speech server. With the previous definition in place, Emacspeak 0.0 was up and running; it provided the scaffolding for building the actual system.
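As a rough illustration of that logic, here is a minimal sketch that grabs the line under the cursor and hands it to a stand-in for the speech server. The names demo-spoken and demo-speak-line are hypothetical, and the list merely records what would have been sent; the real emacspeak-speak-line is not shown in this chapter.

```elisp
(defvar demo-spoken nil
  "Hypothetical stand-in for the speech server: text sent so far, newest first.")

(defun demo-speak-line ()
  "Grab the text of the line under the cursor and hand it to the stub server.
Return the text that was sent."
  (let ((line (buffer-substring-no-properties
               (line-beginning-position)
               (line-end-position))))
    ;; A real client would write LINE to the speech server process here.
    (push line demo-spoken)
    line))
```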
The next iteration returned to the speech server to enhance it with a well-defined eventing loop. Rather than simply executing each speech command as it was received, the speech server queued client requests and provided a launch command that caused the server to execute queued requests.
The server used the select system call to check for newly arrived commands after sending each clause to the speech engine. This enabled immediate silencing of speech; with the somewhat naïve implementation described in version 0 of the speech server, the command to stop speech would not take immediate effect since the speech server would first process previously issued speak commands to completion. With the speech queue in place, the client application could now queue up arbitrary amounts of text and still get a high degree of responsiveness when issuing higher-priority commands such as requests to stop speech.
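The queue-then-launch behavior can be simulated in a few lines of Emacs Lisp. This is only a sketch of the idea, not the speech server itself: all demo- names are hypothetical, and the POLL argument stands in for the select check that the server performs after each clause.

```elisp
(defvar demo-queue nil
  "Clauses waiting to be spoken, oldest first.")

(defvar demo-output nil
  "Clauses handed to the simulated speech engine, newest first.")

(defun demo-q (clause)
  "Queue CLAUSE without speaking it."
  (setq demo-queue (append demo-queue (list clause))))

(defun demo-stop ()
  "Discard everything that is still queued."
  (setq demo-queue nil))

(defun demo-launch (&optional poll)
  "Speak queued clauses one at a time.
After each clause, call POLL, a stand-in for checking for newly
arrived commands, so that a stop request takes effect right away."
  (while demo-queue
    (push (pop demo-queue) demo-output)
    (when poll (funcall poll))))
```

Because the queue is consulted again after every clause, a stop request that arrives while the first clause is being spoken prevents the remaining clauses from ever reaching the engine.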
Implementing an event queue inside the speech server also gave the client application finer control over how text was split into chunks before synthesis. This turns out to be crucial for producing good intonation structure. The rules by which text should be split up into clauses vary depending on the nature of the text being spoken. As an example, newline characters in programming languages such as Python are statement delimiters and determine clause boundaries, but newlines do not constitute clause delimiters in English text.
As an example, a clause boundary is inserted after each line when speaking the following Python code:
i=1
j=2
See the section "Augmenting Emacs to create aural display lists," later in this chapter, for details on how Python code is distinguished and its semantics are transferred to the speech layer.
With the speech server now capable of smart text handling, the Emacspeak client could become more sophisticated with respect to its handling of text. The emacspeak-speak-line function turned into a library of speech-generation functions that implemented the following steps:
Parse text to split it into a sequence of clauses.
Preprocess text—e.g., handle repeated strings of punctuation marks.
Carry out a number of other functions that got added over time.
Queue each clause to the speech server, and issue the launch command.
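The first two steps can be sketched as follows. The splitting rule for prose and the set of punctuation marks are illustrative assumptions, and the demo- function names are hypothetical; Emacspeak's actual rules are driven by the buffer's syntax and are more elaborate.

```elisp
(defun demo-split-clauses (text code-p)
  "Split TEXT into a list of clauses.
When CODE-P is non-nil, every newline ends a clause, as for Python
source.  Otherwise newlines are treated as spaces, and a clause ends
at a period, semicolon, question mark, or exclamation mark that is
followed by whitespace (an illustrative rule only)."
  (if code-p
      (split-string text "\n" t)
    (split-string
     (replace-regexp-in-string
      "\\([.;?!]\\)[ \t]+" "\\1\n"
      (replace-regexp-in-string "\n" " " text))
     "\n" t)))

(defun demo-collapse-punctuation (clause)
  "Replace runs of three or more identical marks in CLAUSE by a count."
  (replace-regexp-in-string
   "\\([-=*_#]\\)\\1\\{2,\\}"
   (lambda (run) (format "%d %s" (length run) (substring run 0 1)))
   clause t t))

(defun demo-prepare (text code-p)
  "Return the preprocessed clauses that would be queued for TEXT."
  (mapcar #'demo-collapse-punctuation
          (demo-split-clauses text code-p)))
```

Each element of the list returned by demo-prepare would then be queued to the speech server, followed by a single launch command.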
From here on, the rest of Emacspeak was implemented using Emacspeak as the development environment. This has been significant in how the code base has evolved. New features are tested immediately because badly implemented features can render the entire system unusable. Lisp's incremental code development fits naturally with the former; to cover the latter, the Emacspeak code base has evolved to be "bushy"—i.e., most parts of the higher-level system are mutually independent and depend on a small core that is carefully maintained.
Lisp advice is key to the Emacspeak implementation, and this chapter would not be complete without a brief overview. The advice facility allows one to modify existing functions without changing the original implementation. What's more, once a function f has been modified by advice m, all calls to function f are affected by advice.
Advice comes in three flavors:
before
The advice body is run before the original function is invoked.
after
The advice body is run after the original function has completed.
around
The advice body is run instead of the original function. The around advice can call the original function if desired.
All advice forms get access to the arguments of the adviced function; in addition, around and after get access to the return value computed by the original function. The Lisp implementation achieves this magic by:
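Caching the original implementation of the function
Evaluating the advice form to generate a new function definition
Storing this definition as the adviced function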
Thus, when the advice fragment shown in the earlier section "A Simple First-Cut Implementation" is evaluated, Emacs' original next-line function is replaced by a modified version that speaks the current line after the original next-line function has completed its work.
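To make the three flavors concrete, here is a small self-contained sketch that advises a toy function. The function demo-double and the variable demo-log are hypothetical; defadvice, ad-get-arg, ad-do-it, and ad-return-value are the constructs provided by the Emacs advice library.

```elisp
(require 'advice)

(defvar demo-log nil
  "Trace of advice invocations, newest first.")

(defun demo-double (n)
  "Return N doubled."
  (* 2 n))

(defadvice demo-double (before demo-trace-before activate)
  "Record the argument before the original function runs."
  (push (list 'before (ad-get-arg 0)) demo-log))

(defadvice demo-double (around demo-clamp activate)
  "Run the original via `ad-do-it', then cap the result at 10."
  ad-do-it
  (setq ad-return-value (min ad-return-value 10)))

(defadvice demo-double (after demo-trace-after activate)
  "Record the return value after the original function has run."
  (push (list 'after ad-return-value) demo-log))
```

Every caller of demo-double now sees the advised behavior, even though the body of demo-double itself was never edited.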
At this point in its evolution, here is what the overall design looked like:
Text is preprocessed by placing the text in a special scratch buffer. Buffers acquire specialized behavior via buffer-specific syntax tables that define the grammar of buffer contents and buffer-local variables that affect behavior. When text is handed off to the Emacspeak core, all of these buffer-specific settings are propagated to the special scratch buffer where the text is preprocessed. This automatically ensures that text is meaningfully parsed into clauses based on its underlying grammar.
Emacs uses font-lock to syntactically color text. For creating the visual presentation, Emacs adds a text property called face to text strings; the value of this face property specifies the font, color, and style to be used to display that text. Text strings with face properties can be thought of as a conceptual visual display list.
Emacspeak augments these visual display lists with personality text properties whose values specify the auditory properties to use when rendering a given piece of text; this is called voice-lock in Emacspeak. The value of the personality property is an Aural CSS (ACSS) setting that encodes various voice properties—e.g., the pitch of the speaking voice. Notice that such ACSS settings are not specific to any given TTS engine. Emacspeak implements ACSS-to-TTS mappings in engine-specific modules that take care of mapping high-level aural properties—e.g., mapping pitch or pitch-range to engine-specific control codes.
The next few sections describe how Emacspeak augments Emacs to create aural display lists and to process these aural display lists to produce engine-specific output.
Emacs modules that implement font-lock call the Emacs built-in function put-text-property to attach the relevant face property. Emacspeak defines an advice fragment that advices the put-text-property function to add in the corresponding personality property when it is asked to add a face property. Note that the value of both display properties (face and personality) can be lists; values of these properties are thus designed to cascade to create the final (visual or auditory) presentation. This also means that different parts of an application can progressively add display property values.
The put-text-property function has the following signature:
(put-text-property START END PROPERTY VALUE &optional OBJECT)
The advice implementation is:
(defadvice put-text-property (after emacspeak-personality pre act)
  "Used by emacspeak to augment font lock."
  (let ((start (ad-get-arg 0)) ;; Bind arguments
        (end (ad-get-arg 1))
        (prop (ad-get-arg 2)) ;; name of property being added
        (value (ad-get-arg 3))
        (object (ad-get-arg 4))
        (voice nil)) ;; voice it maps to
    (when (and (eq prop 'face) ;; avoid infinite recursion
               (not (= start end)) ;; non-nil text range
               emacspeak-personality-voiceify-faces)
      (condition-case nil ;; safely look up face mapping
          (progn
            (cond
             ((symbolp value)
              (setq voice (voice-setup-get-voice-for-face value)))
             ((ems-plain-cons-p value)) ;; pass on plain cons
             ((listp value)
              (setq voice
                    (delq nil
                          (mapcar #'voice-setup-get-voice-for-face value))))
             (t (message "Got %s" value)))
            (when voice ;; voice holds list of personalities
              (funcall emacspeak-personality-voiceify-faces
                       start end voice object)))
        (error nil)))))
Here is a brief explanation of this advice definition:
Bind arguments
First, the function uses the advice built-in ad-get-arg to locally bind a set of lexical variables to the arguments being passed to the adviced function.
Personality setter
The mapping of faces to personalities is controlled by the user-customizable variable emacspeak-personality-voiceify-faces. If non-nil, this variable specifies a function with the following signature:
(emacspeak-personality-put START END PERSONALITY OBJECT)
Emacspeak provides different implementations of this function that either append or prepend the new personality value to any existing personality properties.
Guard
Along with checking for a non-nil emacspeak-personality-voiceify-faces, the function performs additional checks to determine whether this advice definition should do anything. The function continues to act if:
The text range is nonempty.
The property being added is face.
The first of these checks is required to avoid edge cases where put-text-property is called with a zero-length text range. The second ensures that we attempt to add the personality property only when the property being added is face. Notice that failure to include this second test would cause infinite recursion because the eventual put-text-property call that adds the personality property also triggers the advice definition.
Get mapping
Next, the function safely looks up the voice mapping of the face (or faces) being applied. If applying a single face, the function looks up the corresponding personality mapping; if applying a list of faces, it creates a corresponding list of personalities.
Apply personality
Finally, the function checks that it found a valid voice mapping and, if so, calls emacspeak-personality-voiceify-faces with the set of personalities saved in the voice variable.
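The lookup step can be tried in isolation with a small sketch. To keep it self-contained, it uses a plain wrapper function rather than advising put-text-property, and the face-to-voice table is a hypothetical stand-in for whatever voice-setup-get-voice-for-face consults; only the shape of the logic (single face versus list of faces, empty-range guard) follows the description above.

```elisp
(defvar demo-face-voice-alist
  '((bold . voice-bolden)
    (font-lock-comment-face . voice-monotone))
  "Hypothetical face-to-personality table used only in this sketch.")

(defun demo-voice-for-face (face)
  "Return the personality mapped to FACE, or nil if there is none."
  (cdr (assq face demo-face-voice-alist)))

(defun demo-put-face (start end face)
  "Apply FACE to the text from START to END and add a matching personality.
FACE is either a single face or a list of faces."
  (put-text-property start end 'face face)
  (let ((voice (cond
                ((symbolp face) (demo-voice-for-face face))
                ((listp face)
                 ;; Faces without a mapping are dropped from the list.
                 (delq nil (mapcar #'demo-voice-for-face face))))))
    (when (and voice (/= start end)) ; skip empty ranges and unmapped faces
      (put-text-property start end 'personality voice))))
```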
With the advice definitions from the previous section in place, text fragments that are visually styled acquire a corresponding personality property that holds an ACSS setting for audio formatting the content. The result is to turn text in Emacs into rich aural display lists. This section describes how the output layer of Emacspeak is enhanced to convert these aural display lists into perceptible spoken output.
The Emacspeak tts-speak module handles text preprocessing before finally sending it to the speech server. As described earlier, this preprocessing comprises a number of steps, including splitting the text into clauses and handling repeated strings of punctuation marks.
This section describes the tts-format-text-and-speak function, which handles the conversion of aural display lists into audio-formatted output. First, here is the code for the function tts-format-text-and-speak:
(defsubst tts-format-text-and-speak (start end)
  "Format and speak text between start and end."
  (when (and emacspeak-use-auditory-icons
             (get-text-property start 'auditory-icon)) ;; queue icon
    (emacspeak-queue-auditory-icon
     (get-text-property start 'auditory-icon)))
  (tts-interp-queue (format "%s\n" tts-voice-reset-code))
  (cond
   (voice-lock-mode ;; audio format only if voice-lock-mode is on
    (let ((last nil) ;; initialize
          (personality (get-text-property start 'personality)))
      (while (and (< start end) ;; chunk at personality changes
                  (setq last
                        (next-single-property-change
                         start 'personality (current-buffer) end)))
        (if personality ;; audio format chunk
            (tts-speak-using-voice personality
                                   (buffer-substring start last))
          (tts-interp-queue (buffer-substring start last)))
        (setq start last ;; prepare for next chunk
              personality (get-text-property last 'personality)))))
   ;; no voice-lock just send the text
   (t (tts-interp-queue (buffer-substring start end)))))
The tts-format-text-and-speak function is called one clause at a time, with arguments start and end set to the start and end of the clause. If voice-lock-mode is turned on, this function further splits the clause into chunks at each point in the text where there is a change in value of the personality property. Once such a transition point has been determined, tts-format-text-and-speak calls the function tts-speak-using-voice, passing the personality to use and the text to be spoken. This function, described next, looks up the appropriate device-specific codes before dispatching the audio-formatted output to the speech server:
(defsubst tts-speak-using-voice (voice text)
  "Use voice VOICE to speak text TEXT."
  (unless (or (eq 'inaudible voice) ;; not spoken if voice inaudible
              (and (listp voice) (member 'inaudible voice)))
    (tts-interp-queue
     (format
      "%s%s %s \n"
      (cond
       ((symbolp voice)
        (tts-get-voice-command
         (if (boundp voice) (symbol-value voice) voice)))
       ((listp voice)
        (mapconcat #'(lambda (v)
                       (tts-get-voice-command
                        (if (boundp v) (symbol-value v) v)))
                   voice
                   " "))
       (t ""))
      text tts-voice-reset-code))))
The tts-speak-using-voice function returns immediately if the specified voice is inaudible. Here, inaudible is a special personality that Emacspeak uses to prevent pieces of text from being spoken. The inaudible personality can be used to advantage when selectively hiding portions of text to produce more succinct output.
If the specified voice (or list of voices) is not inaudible, the function looks up the speech codes for the voice, wraps the text to be spoken between those codes and tts-voice-reset-code, and queues the result to the speech server.
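The chunking and wrapping described in this section can be exercised without a speech server. In the following sketch the control codes and all demo- names are hypothetical placeholders, since real codes are engine-specific, and only single-symbol personalities are handled; the function returns the strings that would be queued rather than sending them anywhere.

```elisp
(defvar demo-voice-codes
  '((voice-bolden . "[bold]")
    (voice-monotone . "[mono]"))
  "Hypothetical control codes; real codes depend on the TTS engine.")

(defconst demo-reset-code "[reset]"
  "Hypothetical stand-in for the engine's voice reset code.")

(defun demo-format-region (start end)
  "Return the strings that would be queued for the text from START to END.
The region is split wherever the personality property changes."
  (let ((out nil))
    (while (< start end)
      (let* ((next (next-single-property-change start 'personality nil end))
             (voice (get-text-property start 'personality))
             (text (buffer-substring-no-properties start next)))
        (cond
         ((eq voice 'inaudible))        ; hidden text is never queued
         (voice
          (push (format "%s%s %s"
                        (or (cdr (assq voice demo-voice-codes)) "")
                        text demo-reset-code)
                out))
         (t (push text out)))
        (setq start next)))
    (nreverse out)))
```

Passing END as the limit to next-single-property-change means the call returns END itself when no further change is found, which is what lets the loop terminate cleanly on the last chunk.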
I first formalized audio formatting within AsTeR, where rendering rules were written in a specialized language called Audio Formatting Language (AFL). AFL structured the available parameters in auditory space—e.g., the pitch of the speaking voice—into a multidimensional space, and encapsulated the state of the rendering engine as a point in this multidimensional space.
AFL provided a block-structured language that encapsulated the current rendering state by a lexically scoped variable, and provided operators to move within this structured space. When these notions were later mapped to the declarative world of HTML and CSS, dimensions making up the AFL rendering state became Aural CSS parameters, provided as accessibility measures in CSS2 (http://www.w3.org/Press/1998/CSS2-REC).
Though designed for styling HTML (and, in general, XML) markup trees, Aural CSS turned out to be a good abstraction for building Emacspeak's audio formatting layer while keeping the implementation independent of any given TTS engine.
Here is the definition of the data structure that encapsulates ACSS settings:
(defstruct acss
  family gain left-volume right-volume
  average-pitch pitch-range stress richness punctuations)
Emacspeak provides a collection of predefined voice overlays for use within speech extensions. Voice overlays are designed to cascade in the spirit of Aural CSS. As an example, here is the ACSS setting that corresponds to voice-monotone:
[cl-struct-acss nil nil nil nil nil 0 0 nil all]
Notice that most fields of this acss structure are nil—that is, unset. The setting creates a voice overlay that:
Sets pitch-range to 0 to create a monotone voice with no inflection.
Sets stress to 0 to create a flat voice.
This setting is used as the value of the personality property for audio formatting comments in all programming language modes. Because its value is an overlay, it can interact effectively with other aural display properties. As an example, if portions of a comment are displayed in a bold font, those portions can have the voice-bolden personality (another predefined overlay) added; this results in setting the personality property to a list of two values: (voice-bolden voice-monotone). The final effect is for the text to get spoken with a distinctive voice that conveys both aspects of the text: namely, a sequence of words that are emphasized within a comment.
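One way to picture how a list of overlays combines is to merge them field by field. The sketch below is an assumption-laden illustration, not Emacspeak's implementation: the demo-acss structure mirrors the fields of acss, the rule that the first non-nil value in list order wins is assumed for the example, and the values given to demo-bolden are invented. The demo-monotone values are read positionally from the voice-monotone vector shown earlier.

```elisp
(require 'cl-lib)

(cl-defstruct demo-acss
  family gain left-volume right-volume
  average-pitch pitch-range stress richness punctuations)

(defun demo-acss-cascade (overlays)
  "Combine OVERLAYS, a list of `demo-acss' values, into one setting.
For each field the first non-nil value in list order wins; this
precedence rule is an assumption made for the sketch."
  (cl-flet ((pick (accessor) (cl-some accessor overlays)))
    (make-demo-acss
     :family (pick #'demo-acss-family)
     :gain (pick #'demo-acss-gain)
     :left-volume (pick #'demo-acss-left-volume)
     :right-volume (pick #'demo-acss-right-volume)
     :average-pitch (pick #'demo-acss-average-pitch)
     :pitch-range (pick #'demo-acss-pitch-range)
     :stress (pick #'demo-acss-stress)
     :richness (pick #'demo-acss-richness)
     :punctuations (pick #'demo-acss-punctuations))))

(defvar demo-monotone
  (make-demo-acss :pitch-range 0 :stress 0 :punctuations 'all)
  "Field values read positionally from the voice-monotone vector.")

(defvar demo-bolden
  (make-demo-acss :average-pitch 1 :richness 9)
  "Hypothetical values; not Emacspeak's actual voice-bolden.")
```

Evaluating (demo-acss-cascade (list demo-bolden demo-monotone)) yields a single setting that carries the pitch and richness of the bold overlay together with the zero pitch range of the monotone overlay, because unset (nil) fields fall through to the next overlay in the list.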
Rich visual user interfaces contain both text and icons. Similarly, once Emacspeak had the ability to speak intelligently, the next step was to increase the bandwidth of aural communication by augmenting the output with auditory icons.
Auditory icons in Emacspeak are short sound snippets (no more than two seconds in duration) and are used to indicate frequently occurring events in the user interface. As an example, every time the user saves a file, the system plays a confirmatory sound. Similarly, opening or closing an object (anything from a file to a web site) produces a corresponding auditory icon. The set of auditory icons was arrived at iteratively and covers common events such as objects being opened, closed, or deleted. This section describes how these auditory icons are injected into Emacspeak's output stream.
Auditory icons are produced in two situations: to cue the results of explicit user actions, and to augment what is being spoken.
Auditory icons that confirm user actions—e.g., a file being saved successfully—are produced by adding an after advice to the various Emacs built-ins. To provide a consistent sound and feel across the Emacspeak desktop, such extensions are attached to code that is called from many places in Emacs.
Here is an example of such an extension, implemented via an advice fragment:
(defadvice save-buffer (after emacspeak pre act)
  "Produce an auditory icon if possible."
  (when (interactive-p)
    (emacspeak-auditory-icon 'save-object)
    (or emacspeak-last-message
        (message "Wrote %s" (buffer-file-name)))))
Extensions can also be implemented via an Emacs-provided hook. As explained in the brief advice tutorial given earlier, advice allows the behavior of existing software to be extended or modified without having to modify the underlying source code. Emacs is itself an extensible system, and well-written Lisp code has a tradition of providing appropriate extension hooks for common use cases. As an example, Emacspeak attaches auditory feedback to Emacs' default prompting mechanism (the Emacs minibuffer) by adding the function emacspeak-minibuffer-setup-hook to Emacs' minibuffer-setup-hook:
(defun emacspeak-minibuffer-setup-hook ()
  "Actions to take when entering the minibuffer."
  (let ((inhibit-field-text-motion t))
    (when emacspeak-minibuffer-enter-auditory-icon
      (emacspeak-auditory-icon 'open-object))
    (tts-with-punctuations 'all (emacspeak-speak-buffer))))

(add-hook 'minibuffer-setup-hook 'emacspeak-minibuffer-setup-hook)
This is a good example of using built-in extensibility where available. However, Emacspeak uses advice in a lot of cases because the Emacspeak requirement of adding auditory feedback to all of Emacs was not originally envisioned when Emacs was implemented. Thus, the Emacspeak implementation demonstrates a powerful technique for discovering extension points.
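The contrast between the two routes can be seen side by side in a toy example. Everything named demo- below is hypothetical: demo-save stands for code whose author anticipated extensions and provided a hook, while demo-close stands for code that offers no hook and is therefore extended from the outside with advice.

```elisp
(require 'advice)

(defvar demo-events nil
  "Trace of what happened, newest first.")

(defvar demo-after-save-hook nil
  "Hypothetical extension point: functions run after `demo-save'.")

(defun demo-save ()
  "Pretend to save, then run the hook its author provided."
  (push 'saved demo-events)
  (run-hooks 'demo-after-save-hook))

(defun demo-play-save-icon ()
  "Stand-in for playing an auditory icon."
  (push 'icon-save-object demo-events))

;; Route 1: the extension point exists, so simply use it.
(add-hook 'demo-after-save-hook #'demo-play-save-icon)

(defun demo-close ()
  "Pretend to close an object; no hook was provided here."
  (push 'closed demo-events))

;; Route 2: no extension point exists, so add one with advice.
(defadvice demo-close (after demo-icon activate)
  "Add the cue from outside, without editing `demo-close'."
  (push 'icon-close-object demo-events))
```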
Lack of an advice-like feature in a programming language often makes experimentation difficult, especially when it comes to discovering useful extension points. This is because software engineers are faced with the following trade-off:
Make the system arbitrarily extensible (and arbitrarily complex)
Guess at some reasonable extension points and hardcode these
Once extension points are implemented, experimenting with new ones requires rewriting existing code, and the resulting inertia often means that over time, such extension points remain mostly undiscovered. Lisp advice, and its Java counterpart Aspects, offer software engineers the opportunity to experiment without worrying about adversely affecting an existing body of source code.
In addition to using auditory icons to cue the results of user interaction, Emacspeak uses auditory icons to augment what is being spoken, for example when moving across a source line that has a breakpoint set.
Auditory icons are implemented by attaching the text property emacspeak-auditory-icon with a value equal to the name of the auditory icon to be played on the relevant text.
As an example, commands to set breakpoints in the Grand Unified Debugger Emacs package (GUD) are adviced to add the property emacspeak-auditory-icon to the line containing the breakpoint. When the user moves across such a line, the function tts-format-text-and-speak queues the auditory icon at the right point in the output stream.
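A minimal sketch of this mechanism follows. It assumes the property name emacspeak-auditory-icon given in the text; the icon name mark-object, the demo- function names, and the list used as an output stream are hypothetical.

```elisp
(defvar demo-stream nil
  "Simulated output stream, newest first.")

(defun demo-mark-breakpoint ()
  "Attach an icon name to the text of the current line."
  (put-text-property (line-beginning-position) (line-end-position)
                     'emacspeak-auditory-icon 'mark-object))

(defun demo-speak-current-line ()
  "Queue the current line's icon, if any, followed by its text."
  (let* ((start (line-beginning-position))
         (icon (get-text-property start 'emacspeak-auditory-icon)))
    (when icon
      (push (list 'icon icon) demo-stream))
    (push (list 'text (buffer-substring-no-properties
                       start (line-end-position)))
          demo-stream)))
```

Because the icon travels with the text as a property, the code that speaks a line needs no knowledge of the debugger; it only has to check for the property before queuing the text.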
To summarize the story so far, Emacspeak has the ability to:
Produce auditory output from within the context of an application
Audio-format output to increase the bandwidth of spoken communication
This section explains some of the enhancements that the design makes possible.
I started implementing Emacspeak in October 1994 as a quick means of developing a speech solution for Linux. It was when I speech-enabled the Emacs Calendar in the first week of November 1994 that I realized that in fact I had created something far better than any other speech-access solution I had used before.
A calendar is a good example of a visual layout that is optimized both for the visual medium and for the information being conveyed. We intuitively think in terms of weeks and months when reasoning about dates; using a tabular layout that organizes dates in a grid with each week appearing on a row by itself matches this perfectly. With this form of layout, the human eye can rapidly move by days, weeks, or months through the calendar and easily answer such questions as "What day is it tomorrow?" and "Am I free on the third Wednesday of next month?"
Notice, however, that simply speaking this two-dimensional layout does not transfer the efficiencies achieved in the visual context to auditory interaction. This is a good example of where the right auditory feedback has to be generated directly from the underlying information being conveyed, rather than from its visual representation. When producing auditory output from visually formatted information, one has to rediscover the underlying semantics of the information before speaking it.
In contrast, when producing spoken feedback via advice definitions that extend the underlying application, one has full access to the application's runtime context. Thus, rather than guessing based on visual layout, one can essentially instruct the underlying application to speak the right thing!
The emacspeak-calendar module speech-enables the Emacs Calendar by defining utility functions that speak calendar information and advising all calendar navigation commands to call these functions. Thus, Emacs Calendar produces specialized behavior by binding the arrow keys to calendar navigation commands rather than the default cursor navigation found in regular editing modes. Emacspeak specializes this behavior by advising the calendar-specific commands to speak the relevant information in the context of the calendar.
The net effect is that from an end user's perspective, things just work. In regular editing modes, pressing up/down arrows speaks the current line; pressing up/down arrows in the calendar navigates by weeks and speaks the current date.
The emacspeak-calendar-speak-date function, defined in the emacspeak-calendar module, is shown here. Notice that it uses all of the facilities described so far to access and audio-format the relevant contextual information from the calendar:
(defsubst emacspeak-calendar-entry-marked-p ()
  (member 'diary
          (mapcar #'overlay-face (overlays-at (point)))))

(defun emacspeak-calendar-speak-date ()
  "Speak the date under point when called in Calendar Mode."
  (let ((date (calendar-date-string (calendar-cursor-to-date t))))
    (cond
     ((emacspeak-calendar-entry-marked-p)
      (tts-speak-using-voice mark-personality date))
     (t (tts-speak date)))))
Emacs marks dates that have a diary entry with a special overlay. In the previous definition, the helper function emacspeak-calendar-entry-marked-p checks this overlay to implement a predicate that can be used to test if a date has a diary entry. The emacspeak-calendar-speak-date function uses this predicate to decide whether the date needs to be rendered in a different voice; dates that have calendar entries are spoken using the mark-personality voice. Notice that the emacspeak-calendar-speak-date function accesses the calendar's runtime context in the call:
(calendar-date-string (calendar-cursor-to-date t))
The emacspeak-calendar-speak-date function is called from advice definitions attached to all calendar navigation functions. Here is the advice definition for function calendar-forward-week:
(defadvice calendar-forward-week (after emacspeak pre act)
  "Speak the date."
  (when (interactive-p)
    (emacspeak-calendar-speak-date)
    (emacspeak-auditory-icon 'large-movement)))
This is an after advice, because we want the spoken feedback to be produced after the original navigation command has done its work.
The body of the advice definition first calls the function emacspeak-calendar-speak-date to speak the date under the cursor; next, it calls emacspeak-auditory-icon to produce a short sound indicating that we have successfully moved.
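The benefit of working from the application's own data rather than from the visual grid can be seen in a small sketch that needs no calendar buffer. It relies on calendar-date-string from the Emacs calendar library, which takes a date in the list form (MONTH DAY YEAR); the list of diary dates and the function demo-speak-date are hypothetical, and mark-personality is used here simply as a symbol naming the voice.

```elisp
(require 'calendar)

(defvar demo-diary-dates '((11 1 1994))
  "Hypothetical list of dates, as (MONTH DAY YEAR), that have diary entries.")

(defun demo-speak-date (date)
  "Return (VOICE . TEXT) describing how DATE would be spoken.
DATE is a list (MONTH DAY YEAR), the form used by the Emacs Calendar.
VOICE is the symbol mark-personality for dates with a diary entry
and nil otherwise."
  (cons (if (member date demo-diary-dates) 'mark-personality nil)
        (calendar-date-string date)))
```

The text to speak is a full date string produced by the calendar code itself, so nothing has to be inferred from row and column positions in the displayed month.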