An exploration of the potential of Automatic Speech Recognition to assist and enable receptive communication in higher education
M. Wald

The potential use of Automatic Speech Recognition to assist receptive communication is explored. The opportunities and challenges that this technology presents to students and staff are also discussed and evaluated, both in providing captioning of speech online or in classrooms for deaf or hard of hearing students and in assisting blind, visually impaired or dyslexic learners to read and search learning material more readily by augmenting synthetic speech with natural recorded speech. The automatic provision of online lecture notes, synchronised with speech, enables staff and students to focus on learning and teaching issues, while also benefiting learners who are unable to attend the lecture or who find it difficult or impossible to take notes at the same time as listening, watching and thinking.


Introduction
Students in the United Kingdom who find it difficult or impossible to write using a keyboard may use Automatic Speech Recognition (ASR) to assist or enable their written expressive communication (Banes & Seale, 2002; Draffan, 2002; Hargrave-Wright, 2002). A report to the English Higher Education Funding Council noted that one of the 'key issues for teaching' with regard to information and communications technology was the opportunities for such technologies as speech recognition software. No mention was made, however, of the use of ASR to assist the receptive communication of students who find it difficult or impossible to understand speech in class or online.
UK Disability Discrimination Legislation states that reasonable adjustments should be made to ensure that disabled students are not disadvantaged (HMSO, 2001), and so it would appear reasonable to expect that adjustments should be made to ensure that multimedia materials including speech are accessible if a simple and inexpensive method to achieve this was available. Since providing a text transcript of a video does not necessarily provide equivalent information for a disabled learner, the government-funded Skills for Access website, which describes itself as 'the comprehensive guide to creating accessible multimedia for e-learning', currently advises that the most desirable accessibility solution is to:
[…] provide the video with text captions for all spoken and other audio content […] There is no 'reasonable' reason for not captioning video clips from a 'widening access' point of view.
and that if resource limitations prohibit providing a reasonable alternative experience for those who cannot hear the video, the 'reasonable adjustment' argument:
[…] begs the question: should you be using video clips at all?
The Skills for Access website reports that it took a number of skilled workers many tens of hours to caption the video clips used on its site. As video and speech become more common components of online learning materials, the need for captioned multimedia with synchronised speech and text, as recommended by the Web Content Accessibility Guidelines (WAI, 1999), can be expected to increase, and so finding an affordable method of captioning will become more important to help support a 'reasonable adjustment'.
This paper explores how using ASR can help provide a cost-effective way to assist and enable receptive communication, help ensure e-learning is accessible and enhance the quality of learning and teaching.

Use of captions and transcription in education
Deaf and hard of hearing people can find it difficult to follow speech through hearing alone, or to take notes while they are lip-reading or watching a sign-language interpreter. Although summarised notetaking and sign language interpreting are currently available, notetakers can only record a small fraction of what is being said, while qualified sign language interpreters with a good understanding of the relevant higher education subject content are in very scarce supply (RNID, 2005):
There will never be enough sign language interpreters to meet the needs of deaf and hard of hearing people, and those who work with them.
Some deaf and hard of hearing students may also not have the necessary higher education subject-specific sign language skills. Students may consequently find it difficult to study in a higher education environment or to obtain the qualifications required to enter higher education.
Although UK government funding is available to deaf and hard of hearing students in higher education for interpreting or notetaking services, real-time captioning has not yet been used because of the shortage of trained stenographers wishing to work in universities. Since universities in the United Kingdom do not have direct responsibility for funding or providing interpreting or notetaking services, there would appear to be less incentive for them to investigate the use of ASR in classrooms as compared with universities in Canada, Australia and the United States.
ASR offers the potential to provide automatic real-time verbatim captioning for deaf and hard of hearing students or for any user of systems when speech is not available, suitable or audible. Students, especially those whose first language is not English, may also find it easier to follow the captions and transcript than to follow the speech of a lecturer who may speak with a dialect or accent, or who may not have English as their first language.
In lectures/classes students can spend much of their time and mental effort trying to take notes. This is a very difficult skill to master for any student, especially if the material is new and they are unsure of the key points, as it is difficult to simultaneously listen to what the lecturer is saying, read what is on the screen, think carefully about it and write concise and useful notes. The automatic provision of a live verbatim displayed transcript of what the teacher is saying, archived as accessible lecture notes, would therefore enable staff and students to concentrate on learning and teaching issues (e.g. students could be asked searching questions in the knowledge that they had the time to think) as well as benefiting students who find it difficult or impossible to take notes at the same time as listening, watching and thinking or those who are unable to attend the lecture (e.g. for mental or physical health reasons). Lecturers would also have the flexibility to stray from a pre-prepared 'script', safe in the knowledge that their spontaneous communications will be 'captured' permanently.

Enhancing teaching and learning through reflection
Poor oral presentation skills on the part of teachers can affect all students, but are a particular disadvantage for disabled students and students whose first language is not English. Using ASR to capture all presentations in synchronised and transcribed form allows teachers to monitor and review what they have said and to reflect on it in order to improve their teaching and the quality of their spoken communication.

Access to preferred modality of communication
Teachers may have preferred teaching styles involving the spoken or written word that may differ from learners' preferred learning styles (e.g. teacher prefers spoken communication, while student prefers reading). Speech, text and images have communication qualities and strengths that may be appropriate for different content, tasks, learning styles and preferences. Speech can express feelings that are difficult to convey through text (e.g. presence, attitudes, interest, emotion and tone) and that cannot be reproduced through speech synthesis. Images can communicate information permanently and holistically, and can simplify complex information and portray moods and relationships. Students can usually read much faster than a teacher speaks and so find it possible to switch between listening and reading. When a student becomes distracted or loses focus it is easy to miss or forget what has been said, whereas text reduces the memory demands of spoken language by providing a lasting written record that can be reread.

Benefits of synchronised multimedia for learning and teaching
Synchronising multimedia means that text, speech and images can be linked together by the stored timing of information, and this enables all the communication qualities and strengths of speech, text and images to be available as appropriate for different content, tasks, learning styles and preferences. Some students, for example, may find the more colloquial style of verbatim-transcribed text from spontaneous speech easier to follow than an academic written style.

Creating synchronised multimedia
Synchronised multimedia learning and teaching materials can offer many benefits for students but can be difficult to create, access, manage and exploit. Tools that synchronise pre-prepared text and corresponding audio files, either for the production of electronic books (e.g. Dolphin) based on the DAISY specifications or for the captioning of multimedia (e.g. MAGpie) using, for example, the Synchronized Multimedia Integration Language (SMIL), are not normally suitable or cost-effective for use by teachers for the 'everyday' production of learning materials with accessible speech captions or transcriptions. This is because they depend on either a teacher reading a prepared script aloud, which can make a presentation less natural sounding and therefore less effective, or on obtaining a written transcript of the lecture, which is expensive and time consuming to produce. ASR can improve the usability and accessibility of e-learning through the cost-effective production of synchronised and captioned multimedia.

Advantages of recorded speech compared with synthetic speech
Synchronised speech and text can assist blind, visually impaired or dyslexic learners to read and search text-based learning material more readily by augmenting unnatural synthetic speech with natural recorded speech. Although speech synthesis can provide access to some text-based materials for blind, visually impaired or dyslexic learners, it can be difficult and unpleasant to listen to for long periods. It also cannot match synchronised recorded speech in conveying 'pedagogical presence', attitudes, interest, emotion and tone, or in communicating words in a foreign language and descriptions of pictures, equations, tables, diagrams, and so on.

Automatic formatting
It is very difficult to punctuate transcribed spontaneous speech automatically in a useful way, as ASR systems can only recognise words and cannot understand the concepts being conveyed. Further investigations and trials demonstrated that it was possible to develop an ASR application that automatically formatted the transcription by breaking up the continuous stream of text based on the length of the pauses/silences in the speech stream. Since people do not naturally speak spontaneously in complete sentences, attempts to insert conventional punctuation (e.g. a comma for a shorter pause and a full stop for a longer pause) in the same way as normal written text did not provide a very readable and comprehensible display of the speech. A more readable approach was achieved by providing a visual indication of pauses showing how the speaker grouped words together (e.g. one new line for a short pause and two for a long pause; it is, however, possible to select any symbols as pause markers).
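The pause-based formatting described above can be illustrated with a minimal sketch. It is not the ViaScribe implementation: the word-timing representation, the function name `format_by_pauses` and the two pause thresholds are all assumptions chosen for illustration, since the paper does not give the values used.

```python
from typing import List, Tuple

# Assumed representation of recogniser output: (word, start_seconds, end_seconds).
TimedWord = Tuple[str, float, float]


def format_by_pauses(
    words: List[TimedWord],
    short_pause: float = 0.3,   # hypothetical threshold, in seconds
    long_pause: float = 1.0,    # hypothetical threshold, in seconds
    short_marker: str = "\n",   # one new line for a short pause
    long_marker: str = "\n\n",  # two new lines for a long pause
) -> str:
    """Join words, replacing the space between two words with a pause marker
    when the silence between them is at least as long as a threshold."""
    pieces: List[str] = []
    previous_end = None
    for text, start, end in words:
        if previous_end is not None:
            gap = start - previous_end
            if gap >= long_pause:
                pieces.append(long_marker)
            elif gap >= short_pause:
                pieces.append(short_marker)
            else:
                pieces.append(" ")
        pieces.append(text)
        previous_end = end
    return "".join(pieces)


example = [
    ("so", 0.0, 0.2), ("today", 0.25, 0.6),
    ("we", 1.0, 1.1), ("look", 1.15, 1.4), ("at", 1.45, 1.5),
    ("speech", 3.0, 3.4),
]
print(format_by_pauses(example))
```

Because the markers are ordinary parameters, any other symbols can be substituted for the new lines, in keeping with the point that any symbols may be selected as pause markers.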
Text created automatically from spontaneous speech using ASR usually has a more colloquial style than academic written text and, although students may prefer this, some teachers were concerned that it would make it appear that they had poor writing skills. Editors were therefore used to correct and punctuate the transcripts before making them available to students online after the lectures. However, lecturers' concerns that a transcript of their spontaneous utterances will not look as good as carefully prepared and hand-crafted written notes can be met with the response that students at present can tape a lecture and then get it transcribed. Students are capable of understanding the different purposes and expectations of a verbatim transcript of spontaneous speech and pre-prepared written notes.

The 'Liberated Learning' concept
The potential of using ASR to provide automatic captioning of speech in higher education classrooms has now been demonstrated in 'Liberated Learning' classrooms in the United States, Canada and Australia (Bain et al., 2002; Wald, 2002; Leitch & MacMillan, 2003). Lecturers spend time developing their ASR voice profile by training the ASR software to understand the way they speak. This involves speaking the enrolment scripts, adding new vocabulary not in the system's dictionary and training the system to correct errors it has already made so that it does not make them in the future. Lecturers wear wireless microphones, providing the freedom to move around as they are talking, while the text is displayed in real time on a screen using a data projector so students can simultaneously see and hear the lecture as it is delivered. After the lecture the text is edited for errors and made available for students on the Internet.
To make the Liberated Learning vision a reality, the prototype ASR application 'Lecturer', developed in 2000 in collaboration with IBM, was superseded the following year by 'IBM ViaScribe'. Both applications used the ViaVoice 'engine' and its corresponding training of voice and language models, and automatically provided text displayed in a window and stored for later reference synchronised with the speech. ViaScribe used a standard file format (e.g. SMIL) enabling synchronised audio and the corresponding text transcript and slides to be viewed on an Internet browser or through media players that support the SMIL 2.0 standard for accessible multimedia.
ViaScribe can automatically produce a synchronised captioned transcription of spontaneous speech using automatically triggered formatting from live lectures, or in the office, or even, using speaker-independent recognition, from recorded speech files on a website (Bain et al., 2005).

Accuracy
Studies to date have shown that it is difficult to obtain an accuracy of over 85% in all higher education classroom environments directly from the speech of all teachers (Leitch & MacMillan, 2003). It was also observed that lecturers' ASR accuracy rates were lower in classes than those achieved in the office environment. This has also been noted elsewhere (Bennett et al., 2002). Informal investigations have suggested this might be because the rate of delivery varied more in a live classroom situation than in the office, resulting in the ends of words being run into the start of subsequent words. It is important to note that the standardised statistical measurement of recognition accuracy counts recognition 'errors' without regard to whether an error actually affected readability or understanding (e.g. substitution of 'the' for 'a'). It is difficult, however, to devise a standard measure for ASR accuracy that takes readability and comprehension into account.
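To make the point about error counting concrete, the sketch below computes a word-level accuracy as one minus the word edit distance divided by the number of reference words. This is the conventional word-error-rate style of measure; the paper does not state exactly how the cited accuracy figures were computed, so the function name `word_accuracy` and the normalisation used here are assumptions.

```python
def word_accuracy(reference: str, hypothesis: str) -> float:
    """Return 1 - (substitutions + deletions + insertions) / reference length,
    using a word-level Levenshtein distance. Comparison is case-insensitive."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # previous[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            current.append(min(
                previous[j] + 1,                      # deletion
                current[j - 1] + 1,                   # insertion
                previous[j - 1] + substitution_cost,  # match or substitution
            ))
        previous = current

    errors = previous[-1]
    return 1.0 - errors / len(ref)


# A single substitution ('a' recognised in place of 'the') lowers the score
# even though the meaning is barely affected.
print(word_accuracy("the cat sat on the mat", "a cat sat on the mat"))
```

Every error carries the same weight in this measure, which is why a harmless substitution and a meaning-changing one are indistinguishable in the resulting figure. Note also that the value can fall below zero if the hypothesis contains many inserted words.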

Student and teacher feedback
Detailed feedback (Leitch & MacMillan, 2003) from 44 students with a wide range of physical, sensory and cognitive disabilities and interviews with lecturers showed that both students and teachers generally liked the Liberated Learning concept and felt it improved teaching and learning as long as the text was reasonably accurate (i.e. >85%). Many students developed strategies to cope with errors in the text and the majority of students used the text as an additional resource to verify and clarify what they heard (e.g. 87% of students surveyed reported watching the screen in class, 69% reported comparing their own notes with the digitised text and 63% reported retrieving the online notes). Typical comments were:
It gives you something to compare your notes to and if you miss a class the notes are still accessible.
It's very helpful when the lecturer moves on while you're still writing down a point as you can look at the screen.

Coping with multiple speakers
In Liberated Learning classrooms, lecturers repeated questions from students so they appeared transcribed on the screen. In interactive group sessions, in order that contributions, questions and comments from all speakers could be transcribed directly into text, each speaker would at present need to have their own separate personal ASR system trained to their voice.

Current and planned developments
Liberated Learning research and development has continued with the aim of improving usability and performance through training users, simplifying the interface and improving the display readability. In addition to continuing classroom trials in the United States, Canada and Australia, new trials will occur in the United Kingdom, China and Japan. Research and development also continues on the ASR application itself. MIT is a member of the Liberated Learning collaboration and is working to share information to assist the incorporation of speech recognition technology into MIT OpenCourseWare to help students find and review lecture materials (Hazen & Barzilay, 2005).

Improving accuracy through editing and/or re-voicing
Although it can be expected that developments in ASR will continue to improve accuracy rates, the use of a human intermediary to improve accuracy through re-voicing and/or correcting mistakes in real time as they are made by the ASR software could, where necessary, help compensate for some of ASR's current limitations. It is possible to edit errors in the synchronised speech and text to insert, delete or amend the text with the timings being automatically adjusted. For example, an 'editor' correcting 15 words per minute would improve the accuracy of the transcribed text from 80% to 90% for a speaker talking at 150 words per minute. Not all errors are equally important, and so the editor can use their initiative to prioritise those that most affect readability and understanding.
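The arithmetic behind the editing example can be checked with a short sketch: at 150 words per minute and 80% accuracy there are 30 errors per minute, and correcting 15 of them leaves 15 errors in 150 words, i.e. 90% accuracy. The function name `accuracy_after_editing` is illustrative, and the calculation assumes that each correction fixes exactly one word error and introduces no new ones.

```python
def accuracy_after_editing(
    speaking_rate_wpm: float,
    initial_accuracy: float,
    corrections_per_minute: float,
) -> float:
    """Estimate transcript accuracy once an editor has corrected some errors.

    Assumes one correction removes one word error and that the editor
    cannot correct more errors than actually occur.
    """
    errors_per_minute = speaking_rate_wpm * (1.0 - initial_accuracy)
    remaining_errors = max(0.0, errors_per_minute - corrections_per_minute)
    return 1.0 - remaining_errors / speaking_rate_wpm


# The example from the text: 150 words per minute, 80% accurate, 15 corrections per minute.
print(round(accuracy_after_editing(150, 0.80, 15), 2))
```

The same function shows how the benefit depends on speaking rate: the fixed number of corrections per minute represents a smaller share of the words when the speaker talks faster.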
An experienced trained 're-voicer' using ASR by repeating very carefully and clearly what has been said can improve accuracy over the original speaker using ASR where the original speech is not of sufficient volume/quality or when the system is not trained (e.g. telephone, Internet, television, indistinct speaker, multiple speakers, meetings, panels, audience questions). Re-voiced ASR is sometimes used for live television subtitling in the United Kingdom (Lambourne et al., 2004) and in classrooms and courtrooms in the United States (Francis & Stinson, 2003) using a mask to reduce background noise and disturbance to others.
While one person acting as both the re-voicer and editor could attempt to create real-time edited re-voiced text, this would be more problematic if a lecturer attempted to edit ASR errors while they were giving their lecture. However, a person editing their own ASR errors to increase accuracy might be more acceptable when using ASR to communicate one-to-one with a deaf person.

Improving usability and performance
Current unrestricted vocabulary ASR systems are normally speaker dependent and so require the speaker to train the system to the way they speak, any special vocabulary they use and the words they most commonly employ when writing. This normally involves initially reading aloud from a provided training script, providing written documents to analyse, and then continuing to improve accuracy by improving the voice and language models by correcting existing words that are not recognised and adding any new vocabulary not in the dictionary. Current research includes a new integrated speech recognition engine ('Lecturer' and 'ViaScribe' required the ViaVoice ASR engine) and providing 'pre-trained' voice models (the most probable speech sounds corresponding to the acoustic waveform) and language models (the most probable words spoken corresponding to the phonetic speech sounds) from samples of speech, so the user does not need to spend the time reading training scripts to improve the voice or language models. This should also help ensure better accuracy for a speaker's specialist subject vocabularies and also spontaneous spoken structures, which will differ from their more formal written structures. Speaker-independent systems currently usually generate lower accuracy than trained models, but systems can improve accuracy with exposure to the speaker's voice.

Personalised displays
Liberated Learning's research has shown that, although projecting the text onto a large screen in the classroom has been used successfully, in many situations an individual personalised and customisable display would be preferable or essential. An application is therefore being developed to provide users with their own personal display on their own web-enabled wireless systems (e.g. computers, PDAs, mobile phones, etc.) customised to their preferences (e.g. font, size, colour, text formatting and scrolling).

Highlighting, annotation and manipulation of synchronised material
Since it would take students a long time to read through a verbatim transcript after a lecture and summarise it for future use, it would be valuable for students to be able to create an annotated summary for themselves in real time through highlighting, selecting and saving key sections of the transcribed text and adding their own words time linked with the synchronised transcript.

Managing, searching and indexing multimedia
It is difficult to search multimedia materials (e.g. speech, video, PowerPoint files), and using ASR to synchronise speech with transcribed text files can assist learners and teachers to manipulate, index, bookmark, manage and search for online digital multimedia resources that include speech by means of the synchronised text. Standard synchronised multimedia streams do not currently offer a simple way to achieve this.
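As a rough illustration of searching speech by means of its synchronised text, the sketch below looks up a phrase in a time-stamped transcript and returns the points in the recording at which playback could start. The data layout (a list of words with start times) and the function name `find_phrase` are assumptions made for this example and do not describe any particular product or file format.

```python
from typing import List, Tuple

# Assumed layout of a synchronised transcript: (word, start time in seconds).
TimedWord = Tuple[str, float]


def find_phrase(transcript: List[TimedWord], phrase: str) -> List[float]:
    """Return the start time of every occurrence of `phrase` in the transcript.

    Matching is case-insensitive and ignores punctuation attached to
    transcript words.
    """
    target = [word.lower() for word in phrase.split()]
    if not target:
        return []
    words = [word.lower().strip(".,;:!?") for word, _ in transcript]
    hits: List[float] = []
    for i in range(len(words) - len(target) + 1):
        if words[i:i + len(target)] == target:
            hits.append(transcript[i][1])
    return hits


lecture = [
    ("Speech", 0.0), ("recognition", 0.5), ("helps", 1.2), ("students", 1.6),
    ("because", 2.4), ("speech", 3.0), ("recognition", 3.4), ("is", 4.1), ("fast", 4.3),
]
print(find_phrase(lecture, "speech recognition"))
```

The returned times could serve as bookmarks or index entries, so that a learner jumps directly to the relevant part of the audio or video rather than listening through the whole recording.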

Conclusion
It would appear to be reasonable to expect educational material produced by staff and students to be accessible to disabled students whenever possible, and audiovisual material in particular can benefit from captioning. Screen readers using speech synthesis can provide access to many materials but it will also sometimes be helpful to provide real synchronised speech. ASR enables academic staff to take a proactive rather than a reactive approach to teaching students with disabilities by providing practical, economic methods to make their teaching accessible and assist learners to manage and search online digital multimedia resources. This can improve the quality of education for all students because the automatic provision of accessible synchronised lecture notes enables students to concentrate on learning and enables teachers to monitor and review what they said and reflect on it to improve their teaching.
The only ASR application that is currently being used in classrooms to provide a real-time synchronised and editable transcription would appear to be IBM ViaScribe; therefore, to further research and develop the use of ASR, applications need to continue to be developed for use by researchers, staff and students. For example, ViaScribe automatically produces a phonetic transcription, and this could be used for 'phonetic searching' (Clements et al., 2002) without the need to correct ASR errors in the transcript. Phonetic searching is faster than searching the original speech and can also help overcome ASR 'out of vocabulary' errors that occur when words spoken are not known to the ASR system, as it searches for words based on their phonetic sounds, not their spelling.
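The idea of searching on sounds rather than spelling can be sketched with a toy example. This is not the method of Clements et al. (2002) and not ViaScribe's phonetic output format: the phone symbols, the tiny pronunciation lexicon, the timings and the function name `phonetic_search` are all hypothetical, and matching here is exact, whereas a practical system would need to tolerate recognition errors in the phone stream.

```python
from typing import Dict, List, Tuple

# Hypothetical pronunciation lexicon using ARPAbet-like symbols (illustration only).
LEXICON: Dict[str, List[str]] = {
    "wald": ["W", "AO", "L", "D"],
    "paper": ["P", "EY", "P", "ER"],
}

# Assumed layout of a phonetic transcription: (phone symbol, start time in seconds).
TimedPhone = Tuple[str, float]


def phonetic_search(
    stream: List[TimedPhone],
    query_word: str,
    lexicon: Dict[str, List[str]] = LEXICON,
) -> List[float]:
    """Return start times at which the query's phone sequence occurs in the stream."""
    query = lexicon.get(query_word.lower())
    if not query:
        return []  # no pronunciation available for the query
    symbols = [symbol for symbol, _ in stream]
    n = len(query)
    return [
        stream[i][1]
        for i in range(len(symbols) - n + 1)
        if symbols[i:i + n] == query
    ]


# Phones for a spoken phrase such as "the Wald paper". A word-level transcript
# might have misspelt the name, but the sounds are still present in the stream.
phones = [
    ("DH", 11.6), ("AH", 11.7),
    ("W", 12.0), ("AO", 12.1), ("L", 12.2), ("D", 12.3),
    ("P", 12.6), ("EY", 12.7), ("P", 12.8), ("ER", 12.9),
]
print(phonetic_search(phones, "Wald"))
```

Because the lookup never consults the word-level transcript, a name that the recogniser's vocabulary lacks can still be located, provided a pronunciation for the query is available.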