Facial Animation
Combining aspects of HCI with VoiceXML might require real-time synchronization of spoken audio (for example, speech synthesized from text) with the mouth and jaw motion of an avatar.
Definition: a phoneme is the smallest unit of sound that distinguishes one spoken word from another. [English is said to have between 35 and 50 phonemes.]
Definition: a viseme is a visible facial (mainly mouth) shape that corresponds to one or more phonemes produced by the speaker.
A naive approach to animation might create a separate facial expression for each possible phoneme in a language. Disney animators found that several sounds correspond "well enough" to a single expression, and they produced a chart of twelve archetypal mouth positions. Any sound in a character's speech maps to one of these twelve.
Alternatively, the deaf community, which does not hear phonemes, relies on lip reading (also known as speechreading) to recognize spoken language. Lip reading bases speech recognition on 18 speech postures [this may be outdated]. Some of these mouth postures differ so subtly that a hearing individual may not notice the difference.
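The many-to-one mapping from phonemes to visemes described above can be sketched as a lookup table plus a pass over a timed phoneme sequence. The sketch below is only an illustration under stated assumptions: the ARPAbet-style phoneme symbols, the viseme labels, the `Keyframe` type, the `visemes_for` function, and the timings are all hypothetical, and the table is neither the Disney chart nor the set of 18 lip-reading postures.

```python
from dataclasses import dataclass

# Hypothetical many-to-one table: ARPAbet-style phoneme -> viseme label.
# Illustrative only; NOT the Disney chart or a lip-reading standard.
PHONEME_TO_VISEME = {
    "P": "closed_lips", "B": "closed_lips", "M": "closed_lips",
    "F": "lip_to_teeth", "V": "lip_to_teeth",
    "AA": "open_jaw", "AE": "open_jaw", "AH": "open_jaw",
    "UW": "rounded", "OW": "rounded", "W": "rounded",
    "IY": "spread", "EH": "spread",
}
REST = "rest"  # fallback for silence or phonemes missing from the table


@dataclass
class Keyframe:
    start: float  # seconds from the start of the audio
    viseme: str   # mouth shape the avatar should show from this time on


def visemes_for(timed_phonemes):
    """Turn (start_time, phoneme) pairs into viseme keyframes.

    Consecutive phonemes that share a viseme collapse into one keyframe,
    so the avatar's mouth changes only when the shape changes.
    """
    keyframes = []
    for start, phoneme in timed_phonemes:
        viseme = PHONEME_TO_VISEME.get(phoneme, REST)
        if not keyframes or keyframes[-1].viseme != viseme:
            keyframes.append(Keyframe(start, viseme))
    return keyframes


# Made-up timings for a short phoneme sequence.
timeline = visemes_for(
    [(0.00, "M"), (0.08, "AA"), (0.20, "B"), (0.26, "P"), (0.30, "IY")]
)
for kf in timeline:
    print(f"{kf.start:.2f}s  {kf.viseme}")
```

Here "B" and "P" share one viseme, so the five phonemes produce only four keyframes. A real-time system would also need the phoneme timings from the speech synthesizer or recognizer and some blending between mouth shapes, neither of which this sketch attempts.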
An HCI or computer speech project might seek answers to questions such as: