HMMs in Speech Recognition and Word Spotting

1 Introduction

Speech Recognition (SR), also known as Automatic Speech Recognition (ASR), is a domain of computer science concerned with transcribing speech into text and machine-readable commands. Such systems can be speaker dependent, meaning they are trained to recognize a specific speaker's voice, or speaker independent, meaning they require no speaker-specific training. In practice, building an SR system means teaching machines to recognize patterns in a speech continuum. Most SR systems and speech synthesis systems are based on Hidden Markov Models (HMMs). The HMM is a widely used statistical method, because a speech signal can be treated as stationary over short time intervals, a property on which the HMM (with its unobserved, hidden states) relies. In this way we can create an automaton able to detect and isolate any instance of given key-words. Such applications can be found in aerospace and telecommunications, automotive, robotics and programming, translation, video games and many other industries. This project was implemented using the following software: Praat, the Hidden Markov Model Toolkit (HTK) and the Perl programming language. Before continuing, it is necessary to give some background on HTK and HMMs, since understanding their structure and the way they work is essential to the details of this project.

  • 1.1 Hidden Markov Model Toolkit

HTK consists of a set of library modules and tools available in C source form. The tools provide sophisticated facilities for speech analysis, HMM training, testing and results analysis. HTK supports HMMs with both continuous-density Gaussian mixtures and discrete distributions, and can be used to build complex HMM systems. In the figure below we can see the two fundamental functions of HTK. First, the training tools are used to estimate the parameters of a set of HMMs from training utterances and their associated transcriptions. Second, the recognizer tools are used to transcribe unknown utterances [1, 7].

Figure 1. The two basic functions of HTK

2 Basic principles of Hidden Markov Models

In HMM-based speech recognition it is often assumed that the sequence of observed speech vectors corresponding to each word is generated by a Markov model. In figure 2 we can see a simple Markov chain example: S1 to S5 represent states and α represents state transitions. In practice the transition between states is probabilistic, and we only know the observation sequence produced by each state. The underlying state sequence is hidden, which is why the model is called a Hidden Markov Model. Given that the underlying sequence is unknown, the required likelihood is computed either by summing over all possible state sequences or by finding the most likely state sequence. In HTK, the entry and exit states of an HMM are non-emitting; this facilitates the construction of more composite models [1].

Figure 2.
A Markov chain with states (S1, S2, S3, S4, S5) and selected state transitions α
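The "summing over all possible state sequences" computation is exactly the forward recursion. Here is a minimal sketch in Python; the toy three-state discrete model and all its numbers are invented for illustration (the real system works on HTK's continuous-density HMMs):

```python
# Toy discrete HMM (all numbers invented): 3 emitting states,
# observation symbols 0 and 1.
A  = [[0.6, 0.4, 0.0],   # a_ij: state transition probabilities
      [0.0, 0.7, 0.3],
      [0.0, 0.0, 1.0]]
B  = [[0.9, 0.1],        # b_j(o): emission probabilities
      [0.2, 0.8],
      [0.5, 0.5]]
pi = [1.0, 0.0, 0.0]     # initial state distribution

def forward_likelihood(obs):
    """P(O | model): sum over all state sequences via the forward recursion."""
    N = len(pi)
    alpha = [pi[j] * B[j][obs[0]] for j in range(N)]
    for o in obs[1:]:
        alpha = [sum(alpha[i] * A[i][j] for i in range(N)) * B[j][o]
                 for j in range(N)]
    return sum(alpha)

print(forward_likelihood([0, 1, 1]))
```

Replacing the sum over predecessor states by a max (and remembering the winning predecessor) turns this same recursion into the Viterbi algorithm, which yields the most likely state sequence instead.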
  • 2.1 Isolated word recognition – Keyword detection

In the same way, given a set of training examples corresponding to a particular model, the parameters of that model can be determined automatically by a robust and efficient re-estimation procedure. This means that if a sufficient number of representative examples of each word is collected, an HMM can be constructed which implicitly models all of the many sources of variability inherent in real speech. First, an HMM is trained for each vocabulary word using a number of examples of that word; in our case the vocabulary consists of the different speech turns. To recognise an unknown word, the likelihood of each model generating that word is calculated, and the most likely model identifies the word.

Figure 3. Isolated word recognition problem
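The decision rule just described reduces to an argmax over the per-word model likelihoods. A trivial sketch, where the words and log-likelihood scores are invented for illustration:

```python
def recognize(scores):
    """Pick the vocabulary word whose HMM gave the highest log-likelihood."""
    return max(scores, key=scores.get)

# Hypothetical log-likelihoods of one unknown utterance under each word model.
scores = {"oui": -412.7, "non": -398.2, "temps": -405.9}
print(recognize(scores))  # prints: non
```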

3 Vocabulary of terms

Although this project's native language is French, we decided to write this report in English, despite some difficulties of written expression, knowing that the best results would be achieved this way. In this section we list some fundamental terms used in SR systems, translated from the French literature into English. Cataloguing such terms is helpful: it makes the project available to a wider audience and consolidates the terminology used throughout the report.

  1. Apprentissage – Learning, training
  2. Reconnaissance de la Parole – Speech Recognition
  3. Modèle de Markov caché (MMC) – Hidden Markov Model (HMM)
  4. Détection de mots clés – Word spotting
  5. Locuteur – Speaker
  6. Tours de la parole – Speech turns
  7. Transcription phonétique – Phonetic transcription
  8. Répertoire – Directory
  9. Arborescence de travail – Process tree
  10. Étiquetage – Labeling
  11. Entités nommées – Named Entities
  12. Déclarations, expressions – Utterances

4 Project directories

In order to build the HMM speech recognition system, each student was given a number of directories to work with. These directories contained the process tree of our work: analysing the speech signal, learning to use the Markov machines and, finally, recognizing the speech signal. The directories we had to study and modify to accomplish our goal are described below, together with their basic function and category. This gives an overview of the project, while some crucial parts of it are explained thoroughly.

  • 1 Perl scripts functions

    • 1.1 Vector calculation

  • "": Vector calculation – files in directory /apprentissage
  • "": Vector calculation – files in directory /test

    • 1.2 Markov Models

  • "": Viterbi segmentation & parameter estimation (or K-means)
  • "": Baum-Welch re-estimation
  • "": Embedded Baum-Welch training
  • "": Augmentation of gaussian numbers
  • "": Runs all the above scripts in sequence

    • HTK tools behind the Markov Model scripts

  • HInit: Viterbi algorithm is used to find the most likely state sequence corresponding to each training example, then the HMM parameters are estimated. The likelihood of the training data can be computed and the whole process can be repeated until no further increase in likelihood is obtained.
  • HRest: Involves finding the probability of being in each state at each time frame using the Forward-Backward algorithm. This probability is then used to form weighted averages for the HMM parameters.
  • HERest: Loads in a complete set of HMM definitions. Each training file has an associated label file which gives a transcription for that file. It processes each training file in turn and after loading it into memory, it uses the associated transcription to construct a composite HMM which spans the whole utterance.
  • HHEd: A definition editor which will clone models into context-dependent sets, apply a variety of parameter tyings and increment the number of mixture components in specified distributions.
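As a rough illustration of what HRest's Forward-Backward step computes, here is a minimal sketch in Python. The toy two-state discrete model and its numbers are invented; the real system works on HTK's Gaussian-mixture HMMs, where these state-occupation probabilities weight the parameter averages:

```python
# Toy two-state discrete HMM (all numbers invented).
A  = [[0.6, 0.4],
      [0.0, 1.0]]
B  = [[0.8, 0.2],
      [0.3, 0.7]]
pi = [1.0, 0.0]
obs = [0, 1, 1]
N, T = len(pi), len(obs)

# Forward pass: alpha[t][j] = P(o_1..o_t, state j at time t)
alpha = [[pi[j] * B[j][obs[0]] for j in range(N)]]
for t in range(1, T):
    alpha.append([sum(alpha[t-1][i] * A[i][j] for i in range(N)) * B[j][obs[t]]
                  for j in range(N)])

# Backward pass: beta[t][j] = P(o_{t+1}..o_T | state j at time t)
beta = [[1.0] * N for _ in range(T)]
for t in range(T - 2, -1, -1):
    beta[t] = [sum(A[j][k] * B[k][obs[t+1]] * beta[t+1][k] for k in range(N))
               for j in range(N)]

p_obs = sum(alpha[-1])                      # total likelihood P(O)
gamma = [[alpha[t][j] * beta[t][j] / p_obs  # P(state j at time t | O)
          for j in range(N)] for t in range(T)]
print(gamma)
```

Each row of gamma sums to one: at every time frame the probability mass is distributed over the states, and it is these values that serve as the weights in HRest's re-estimation formulas.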

Here is an example of one of these Perl scripts:


$hviteConf = "-T 1";

&HVite("lists/SampaToHTK.dic", "lists/", "lists/phonesHTK",
       "donnees/Locuteur9", "lists/lex1.txt", "hmms/hmm.3/HMMmacro");

# HVite: decoding with the Markov models
sub HVite {
    local($fileDic, $fileNet, $hmmList, $refDir, $paramList, $HMM) = @_;
    system("$HPARSE $fileNet tmp/net2");
    system("$HVITE $hviteConf -H $refDir/$HMM -i dap.rec -S $refDir/$paramList -w tmp/net2 $fileDic $hmmList");
}

    • 1.3 Speech recognition

  • "": Decoding – Acoustico/phonetic
  • "": Named-entities recognition

    • 1.4 Segmentation & phonetic labeling

  • "": Constrained decoding of the apprentissage data

  • 2 Lists directory (files to modify)

  • "list0.txt": Phonetic list (SAMPA) of apprentissage vocabulary
  • "list1.txt": Phonetic list (SAMPA) of recognition vocabulary
  • "phonesHTK": Markov machine list (HTK code)
  • "SampaToHTK.dic": Transcoding of SAMPA phonemes into HTK phonetic codes
  • "": Acoustico/phonetic decoding grammar
  • "": Grammar for recognizing key-words

  • 3 Données directory (data for speakers)

  • "wav": Apprentissage and test directories (.wav format)
  • "param": Analysis parameters of apprentissage and test (.mfcc)
  • "lab": Segmentation and labeling of apprentissage
  • "hmms": Hidden Markov machines trained with the analysis parameters (param)

5 Implemented methodology

  • 5.1 Elaboration of resources

The resources processed during this project came from slicing synthRadio117.wav into smaller pieces. This file contains a three-minute sound recording from a radio broadcast on a random topic (in our case the FIFA World Cup). We can therefore assume that all captured speech, in the form of conversation, is spontaneous and thus suitable for our purpose. We segmented this audio file according to the speech turns of the conversation. For the first two minutes of the recorded broadcast we created a series of files for each speaker (e.g. App.L1-tour1 for speaker 1, App.L2-tour4 for speaker 2, etc.), which were stored under wav/apprentissage. We followed exactly the same method for the last minute of the broadcast and stored these series under wav/test. Based on these two directories we will later construct the HMM machine. This step was carried out using Praat, as seen in figure 4.

Figure 4. Annotated selected sound
  • 5.2 Phonetic transcription

So the first step, after creating our resources, is the phonetic transcription. Phonetic transcription, also called phonetization or grapheme-to-phoneme conversion, is the process in which sounds are represented by specific signs. Phonetization is equivalent to a sequence of dictionary look-ups: assuming that all words of the speech transcription appear in the pronunciation dictionary, the quality of this process depends on that resource. We used the standard encoding of the Speech Assessment Methods Phonetic Alphabet (SAMPA) [6], a machine-readable phonetic alphabet based on the International Phonetic Alphabet (IPA). The product of the phonetic transcription is loaded into a dictionary file (e.g. list0.txt) so that the vector calculation, using the different HTK modules, can take place. This product is later translated automatically, via the SampaToHTK dictionary file, into HTK's own phonetic codes.

SAMPA encoding example:
e m e R Z e Z @ p e~ s l a d i f i k y l t e a~ d O t R t a~ s e t e Z @ k R w a d @ R e j 2 n i R t u s e t a l a~ e Z a k E e b l a v i E v i t S o~ s u l @ f E R Z @ p a~ s c O m l e b r e z i l j e~ e l e o l a~ d E s a v l @ f E R d @ p w i l e~ t a~
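The "sequence of dictionary look-ups" view of phonetization can be sketched as follows. The lexicon below is a hypothetical four-word excerpt with SAMPA pronunciations, not the project's actual list0.txt:

```python
# Hypothetical pronunciation dictionary (word -> SAMPA phoneme string).
lexicon = {
    "je":    "Z @",
    "pense": "p a~ s",
    "le":    "l @",
    "temps": "t a~",
}

def phonetize(transcription):
    """Map each word of an orthographic transcription to its SAMPA form."""
    return " ".join(lexicon[w] for w in transcription.lower().split())

print(phonetize("je pense le temps"))  # prints: Z @ p a~ s l @ t a~
```

Any word missing from the lexicon raises a KeyError here, which mirrors the remark above: the quality of phonetization is bounded by the coverage of the pronunciation dictionary.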

*Note that in our case we only had to deal with male speakers, so there was no need to train our set on female data or to create a gender-dependent Markov model. Otherwise we would have had to construct another model based on female voices and speech, since there are significant physiological differences between the two genders.

  • 5.3 Segmentation & Training

In this phase of the project we launch the “” and “” commands. The first is used for the acoustic-phonetic decoding, while the second is used, as explained in chapter 4, to train the phonetic Markov machine. The decoding process is strictly constrained to the acquired data (apprentissage). In this way we obtain .lab and .alg files (lab directory), which contain the segments of the speech continuum represented by time values and HTK phonetic signs; see the example in figure 5 below. Each word used has to be processed by a Markov machine dedicated to it, and this process can be checked by verifying the segments with the Praat software. Thereafter, our data set of words, phonemes and segments can be processed by “”, which initializes the Markov machine learning.

Figure 5. Segment of word "temps"  in “temps.lab”  file
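A .lab file of this kind is plain text, one segment per line: start and end times as integers in HTK's 100 ns units, followed by the label. A small sketch of reading such a file; the sample content is invented, not the actual temps.lab:

```python
# Invented HTK-style label content (times in 100 ns units).
sample = """\
0 2300000 sil
2300000 3800000 t
3800000 5600000 a~
5600000 6900000 sil
"""

def read_lab(text):
    """Return (start_sec, end_sec, phone) triples from HTK label lines."""
    segments = []
    for line in text.strip().splitlines():
        start, end, label = line.split()[:3]
        segments.append((int(start) / 1e7, int(end) / 1e7, label))
    return segments

for seg in read_lab(sample):
    print(seg)
```

Converting the 100 ns ticks to seconds makes the boundaries directly comparable with the cursor positions shown by Praat when verifying the segmentation.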

  • 5.4 Key-word detection

The aim of recognition is to map sequences of speech vectors onto the wanted underlying symbol sequences. Two problems make this process extremely difficult. First, the mapping from symbols to speech is not one-to-one, since different underlying symbols can give rise to similar speech sounds; moreover, there are large variations in the realised speech waveform due to speaker variability, mood, environment, etc. Second, the boundaries between symbols cannot be identified explicitly from the speech waveform, so it is not possible to treat the waveform as a sequence of concatenated static patterns. This second problem, of not knowing the word-boundary locations, can be avoided by restricting the task to isolated word recognition, which implies that the speech waveform corresponds to a single underlying symbol (e.g. a word) chosen from a fixed vocabulary. To this end we composed a list of words selected from the speech continuum of our test data. A keyword loop was also constructed, illustrated in the figure below, and tested with the “” command.

([sil] <$KEYWORD> [sil])
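In HTK's grammar notation, the loop above is completed by a variable definition listing the chosen key-words. A sketch of what such a complete grammar file could look like, with a hypothetical word list (not the project's actual one):

```
$KEYWORD = bresilien | hollandais | temps | coupe;

( [sil] <$KEYWORD> [sil] )
```

HParse compiles a grammar of this form into a word network (tmp/net2 in the script of chapter 4) that HVite then decodes against; [...] marks an optional item and <...> one or more repetitions, so the loop accepts any sequence of key-words with optional surrounding silence.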

6 Conclusion

Speech recognition will revolutionize the way people interact with machines and vice versa. In the scientific field of pattern recognition, and more specifically that of speech recognition, the intelligence acquired so far has to be re-estimated. The telecommunications industry is currently a major field of application for SR systems, and in this constantly evolving domain there are numerous services to be provided globally. These needs point in the direction of SR systems; it is therefore widely believed that the future holds significantly higher performance in almost every area of speech recognition technology, with more robustness to speakers, background noise, etc. This will ultimately lead to reliable, robust voice interfaces for every telecommunications service offered, thereby making them universally available.

7 Bibliography

  1. Young S. et al., The HTK Book (Revised for HTK Version 3.3), Cambridge University Engineering Department, April 2005.
  2. Caraty M. J., Montacié C., Reconnaissance de la Parole (lecture notes), Université Paris Descartes M2 Pro IIP, Université Paris Sorbonne M2 Pro ILGII, 2012-2013.
  3. Caraty M. J., Montacié C., Synthèse de la Parole (lecture notes), Université Paris Descartes M2 Pro IIP, Université Paris Sorbonne M2 Pro ILGII, 2012-2013.
  4. Salvi G., HTK Tutorial, Royal Institute of Technology, Dept. of Speech, Music and Hearing, November 2003.
  5. Rabiner L. R., A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition, Proceedings of the IEEE, vol. 77, no. 2, February 1989.
  6. SAMPA Computer Readable Phonetic Alphabet (French), University College London:
  7. Hidden Markov Model Toolkit Homepage, Cambridge University Engineering Department (CUED):

This work is licensed under a Creative Commons Attribution-ShareAlike 3.0 Unported License