Automatic Speech Recognition (ASR) is the process of converting an acoustic signal captured by a microphone into written text. The motivation of this paper is to create a speech-based Integrated Development Environment (IDE) for C programs. The paper proposes a technique that enables visually impaired people, or people with arm injuries, who have excellent programming skills to code C programs through voice input. The proposed system accepts a C program as voice input and produces the compiled C program as output. The user utters each line of the C program. First, the voice input is recognized as text. The recognized text is then converted into a C program using the syntactic constructs of the C language, and the resulting program is passed as input to the IDE. IDE commands such as open, save, close, compile and run are also given through voice input. If an error occurs during compilation, it is corrected through voice input by specifying the line number. The performance of the speech recognition system is analyzed by varying the vocabulary size and the number of mixture components in the HMM.
Speech is one form of communication used by humans to exchange information. Each spoken word is created from a phonetic combination of vowel and consonant sound units. Speech processing is the study of speech signals and of the methods used to process them. Speech signals are usually processed in a digital representation. Speech recognition is the process of converting a speech signal into human-readable text. Nowadays speech recognition is used in a variety of applications. People with disabilities can benefit from speech recognition programs. For individuals who are deaf or hard of hearing, speech recognition software is used to automatically generate closed captions of conversations such as conference room discussions and classroom lectures. Speech recognition is also very useful for people who have difficulty using their hands, ranging from mild repetitive stress injuries to disabilities that preclude the use of conventional computer input devices. Our proposed system is developed to enable visually impaired people, or people with arm injuries, who have excellent programming skills to code C programs through voice input. The paper is organized into existing systems, proposed system, implementation and performance. Literature related to the proposed system is discussed in Section 2. Section 3 deals with the proposed framework, followed by the implementation in Section 4. The performance analysis is detailed in Section 5. Section 6 concludes with a few points on the scope for future enhancement.
Speech recognition [
Different types of spectral features that [
A few of the speech based applications developed are mentioned below.
In [
Two major modules of the proposed framework are shown in
Module 1: Speech recognition.
Module 2: Building IDE for C program.
In the training phase of speech recognition, feature vectors are extracted from the given speech signal and used to build the acoustic model. In the testing phase, features are extracted from the test speech signal and compared with the acoustic model to produce the recognized text. The speech recognition system is implemented using Sphinx [
The recognized text from Module 1 is preprocessed to convert it into a proper C program using the syntactic constructs of the C language. This C program is passed as input to the IDE, which produces the compiled output of the recognized C program.
The training phase consists of the following modules.
1) Data collection
Speech utterances corresponding to C keywords are collected from different speakers. For each keyword, sixty speech utterances are collected. All speech samples are recorded in WAV format. After the voice samples are collected, the dictionary file, fileids file, transcription file and language model file are generated. The fileids file contains the location of each WAV file; the transcription file contains the text corresponding to each WAV file; the dictionary file lists every word with its phoneme sequence; and the language model file contains the probability of occurrence of each word in the speech corpus.
Example of dictionary file:
A AH
ADD AE D
AMBERSAND AE M B ER S AE N D
2) Feature Extraction
Features are extracted from the voice samples. MFCC features [
Pre-emphasis and framing: The signal is divided into 20-40 ms frames. In this paper the frame size is taken as 25 ms, so the frame length for a 16 kHz signal is 0.025 × 16,000 = 400 samples. The frame step is usually 10 ms to 15 ms, which allows overlap between the frames.
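The framing arithmetic can be illustrated with a short Python sketch. It assumes a 10 ms frame step (the lower end of the range quoted above) and simply drops a trailing partial frame; the function name `frame_signal` and its parameters are illustrative and not taken from the paper.

```python
def frame_signal(samples, sample_rate=16000, frame_ms=25, step_ms=10):
    """Split a list of samples into overlapping fixed-length frames."""
    frame_len = int(sample_rate * frame_ms / 1000)  # 0.025 * 16000 = 400 samples
    step = int(sample_rate * step_ms / 1000)        # 0.010 * 16000 = 160 samples
    frames = []
    start = 0
    # Assumption: a final partial frame is discarded rather than zero-padded.
    while start + frame_len <= len(samples):
        frames.append(samples[start:start + frame_len])
        start += step
    return frames


# One second of silence at 16 kHz, used only to show the frame counts.
one_second = [0.0] * 16000
frames = frame_signal(one_second)
print(len(frames[0]), len(frames))
```

With these settings each frame holds 400 samples and consecutive frames overlap by 240 samples.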
Hamming windowing: A Hamming window is applied to each frame to minimize the discontinuities at the start and at the end of the frame.
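Fast Fourier transform: The conversion from the time domain to the frequency domain is carried out by the discrete Fourier transform (DFT). For the i-th speech frame $s_i(n)$,

$$S_i(k) = \sum_{n=1}^{N} s_i(n)\, h(n)\, e^{-j 2 \pi k n / N}, \qquad 1 \le k \le K$$

where h(n): N sample long analysis window,
K: the length of the DFT.
The periodogram-based power spectral estimate for the speech frame $s_i(n)$ is given by

$$P_i(k) = \frac{1}{N} \left| S_i(k) \right|^2$$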
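As a rough illustration of the windowing and power spectrum steps, the following Python sketch applies a Hamming window to one frame and computes a periodogram with a direct DFT. It is a minimal sketch under stated assumptions: the 0.54/0.46 Hamming coefficients, zero-based sample indexing, the 1/N normalization and the function names are not taken from the paper, and a real system would use an FFT routine rather than this O(N·K) loop.

```python
import cmath
import math


def hamming(n_samples):
    """Hamming window with the conventional 0.54 / 0.46 coefficients."""
    if n_samples == 1:
        return [1.0]
    return [0.54 - 0.46 * math.cos(2 * math.pi * n / (n_samples - 1))
            for n in range(n_samples)]


def periodogram(frame, n_bins):
    """Power estimate |S(k)|^2 / N for the first n_bins DFT bins of a frame."""
    n_samples = len(frame)
    window = hamming(n_samples)
    power = []
    for k in range(n_bins):
        # Direct DFT of the windowed frame at bin k (zero-based n).
        s_k = sum(frame[n] * window[n]
                  * cmath.exp(-2j * math.pi * k * n / n_samples)
                  for n in range(n_samples))
        power.append(abs(s_k) ** 2 / n_samples)
    return power


# A pure tone placed exactly on DFT bin 4 of a 64-sample frame.
tone = [math.cos(2 * math.pi * 4 * n / 64) for n in range(64)]
spectrum = periodogram(tone, 33)
peak_bin = max(range(len(spectrum)), key=lambda k: spectrum[k])
print(peak_bin)
```

The strongest bin coincides with the frequency of the tone; the window only spreads a little energy into the neighbouring bins.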
Mel filter bank processing: The filters compute a weighted sum of spectral components, so that the filter bank output approximates the Mel scale.
Mel scale: The Mel scale relates the perceived frequency, or pitch, of a pure tone to its actual measured frequency. Humans are better at discerning small changes in pitch at low frequencies than at high frequencies. Incorporating this scale makes the features match human perception more closely.
The formula for converting frequencies into Mel scale is:
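A widely used form of this conversion is m = 2595 log10(1 + f / 700), with f in Hz. The Python sketch below assumes that form; whether the paper uses exactly these constants is an assumption, and the function names are illustrative.

```python
import math


def hz_to_mel(f_hz):
    """Convert a frequency in Hz to Mels (assumed form: 2595 * log10(1 + f/700))."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)


def mel_to_hz(mel):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


# The same 100 Hz step covers more Mels at low frequencies than at high ones.
low_step = hz_to_mel(200) - hz_to_mel(100)
high_step = hz_to_mel(4100) - hz_to_mel(4000)
print(low_step > high_step)
```

The comparison at the end reflects the perceptual property described above: equal steps in Hz are compressed on the Mel axis as frequency rises.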
Discrete cosine transform: It is used to convert the log Mel spectrum back to a time-like (cepstral) domain, which yields the MFCCs.
Delta energy and delta spectrum: It is necessary to add features related to the change in cepstral characteristics over time. Delta energy and delta spectrum are also known as differential and acceleration coefficients. The MFCC feature vector describes only the power spectral envelope of a single frame, but speech also carries information in its dynamics, i.e. in the trajectories of the MFCC coefficients over time. Calculating these trajectories and appending them to the original feature vector noticeably improves ASR performance.
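Delta coefficients are computed as follows:

$$d_t = \frac{\sum_{n=1}^{N} n \left( c_{t+n} - c_{t-n} \right)}{2 \sum_{n=1}^{N} n^2}$$

where $d_t$ is the delta coefficient of frame $t$, computed in terms of the static coefficients $c_{t-N}$ to $c_{t+N}$.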
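A minimal Python sketch of a delta computation follows. It assumes the common regression form, in which the delta of frame t is the sum over n = 1..N of n·(c[t+n] − c[t−n]) divided by 2·Σn², with N = 2, and it handles the edges by repeating the first and last frames. Both choices, and the function name, are assumptions rather than details given in the paper.

```python
def delta_coefficients(frames, n_window=2):
    """Delta features for a list of per-frame coefficient vectors."""
    denom = 2 * sum(n * n for n in range(1, n_window + 1))
    last = len(frames) - 1
    deltas = []
    for t in range(len(frames)):
        row = []
        for j in range(len(frames[t])):
            num = 0.0
            for n in range(1, n_window + 1):
                # Assumption: clamp indices so edge frames reuse the first/last frame.
                ahead = frames[min(t + n, last)][j]
                behind = frames[max(t - n, 0)][j]
                num += n * (ahead - behind)
            row.append(num / denom)
        deltas.append(row)
    return deltas


# A coefficient that rises by 1 per frame has a delta of 1 away from the edges.
ramp = [[0.0], [1.0], [2.0], [3.0], [4.0], [5.0]]
deltas = delta_coefficients(ramp)
print(deltas[2], deltas[3])
```

Applying the same function to the delta features themselves would give acceleration (delta-delta) coefficients.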
3) Building HMM model
A hidden Markov model (HMM) [
For example: INCLUDE IH N K L UW D
The word INCLUDE has 6 states, one per phoneme. The sequence of state transitions from IH to D yields the word INCLUDE.
4) Language Model
The language model assigns a probability to each word according to its frequency in the corpus. It helps to predict the subsequent word during the testing phase of speech recognition: if the HMM does not predict a word correctly, the language model is used to find the most likely word sequence from the calculated probability values. Different types of language models can be built, e.g. unigram, bigram and trigram models. A unigram model scores a single word independently of its context, a bigram model predicts a word from its predecessor, and a trigram model predicts a word from the two words that precede it.
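The unigram model can be calculated by:

$$P(w_i) = \frac{C(w_i)}{N}$$

where
$C(w_i)$ represents the number of occurrences of the word $w_i$ in the corpus,
N represents the total number of words in the corpus.
The bigram model can be calculated by:

$$P(w_i \mid w_{i-1}) = \frac{C(w_{i-1}\, w_i)}{C(w_{i-1})}$$

where
$C(w_{i-1}\, w_i)$ represents the number of times the word $w_{i-1}$ is followed by the word $w_i$.
The trigram model can be calculated by:

$$P(w_i \mid w_{i-2}, w_{i-1}) = \frac{C(w_{i-2}\, w_{i-1}\, w_i)}{C(w_{i-2}\, w_{i-1})}$$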
Example of language model:
(unigram)-3.8770 IF
(bigram)-3.3998 <s>OPENBRACKET
(trigram)-0.3010 <s>OPENBRACKET </s>
<s> marks the start of a sentence (utterance) and </s> marks its end, so an entry such as "<s> OPENBRACKET" gives the probability that OPENBRACKET is the first word of an utterance. The probability values are stored as base-10 logarithms.
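To make the log-probability entries concrete, here is a toy Python sketch that estimates unigram and bigram probabilities by simple counting and reports them as base-10 logarithms. It assumes plain maximum-likelihood estimates with no smoothing or back-off, which a real language modelling toolkit would add; the two-sentence corpus and the function names are invented for illustration.

```python
import math
from collections import Counter


def train_ngrams(sentences):
    """Count unigrams and bigrams, padding each sentence with <s> and </s>."""
    unigrams, bigrams = Counter(), Counter()
    for words in sentences:
        tokens = ["<s>"] + words + ["</s>"]
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))
    return unigrams, bigrams


def log10_unigram(word, unigrams):
    """log10 of count(word) / total tokens (sentence markers included)."""
    return math.log10(unigrams[word] / sum(unigrams.values()))


def log10_bigram(prev, word, unigrams, bigrams):
    """log10 of count(prev word) / count(prev); fails for unseen pairs."""
    return math.log10(bigrams[(prev, word)] / unigrams[prev])


# Hypothetical toy corpus of two recognized utterances.
corpus = [["IF", "OPENBRACKET"], ["OPENBRACKET", "CLOSEBRACKET"]]
unigrams, bigrams = train_ngrams(corpus)
print(round(log10_bigram("<s>", "OPENBRACKET", unigrams, bigrams), 4))
```

In this toy corpus one of the two utterances starts with OPENBRACKET, so the bigram probability is 0.5, whose base-10 logarithm is about -0.3010.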
1) Feature Extraction
The MFCC features are extracted from the test speech utterance. The procedure for extracting MFCC feature is explained in Section 3.3.1.
2) Acoustic Model
After the MFCC features are extracted from the test speech signal, the text is recognized using the acoustic model. To incorporate the syntax of the C program, the recognized text is given to the IDE preprocessing module. Speech testing is implemented in two ways.
Real-time speech testing: Voice input is given live and recognized as text immediately.
Recorded WAV file speech testing: Recorded WAV files are given as input.
The algorithm for real-time speech testing is shown in Algorithm 1, where the IDE commands are open, save, new, compile and run. The doIDEpreprocessing and doIDEcommands functions are explained in Algorithms 3 and 6, respectively. The algorithm for recorded WAV file speech testing is given in Algorithm 2. The algorithms are explained in the Appendix.
In the IDE preprocessing module, the recognized text from speech testing is converted into a C program using the syntactic constructs of the C language. First, the recognized text is divided into tokens. If a token is recognized as the name of a symbol, it is replaced with the corresponding symbol. If a token is a number word, it is converted into the equivalent number. Otherwise the token is left as it is. After preprocessing, the text is passed as input to the IDE module.
Text to Symbol and Number Conversion
Create two look-up tables, one for symbols (operators in the C language) and one for numbers. Compare the token with the entries in the symbol look-up table. If one of them matches, replace the token with the corresponding symbol. Otherwise, compare the token with the entries in the number look-up table. If one of them matches, replace the token with the corresponding value. If the token matches none of the symbols or numbers, leave it unchanged. This process is repeated for all tokens. A few symbols and all numbers are listed in look up
The algorithms for IDE preprocessing are shown in Algorithms 3, 4 and 5.
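The look-up step can be sketched in Python as follows. This is not the paper's algorithm: it assumes that multi-word keys such as "less than" are matched greedily, longest phrase first, and that the recognized text is lower-cased before matching. The table contents mirror the symbol and number look-up tables of this paper; the function name `preprocess` is illustrative.

```python
SYMBOLS = {
    "less than": "<", "equal to": "=", "greater than": ">",
    "open curly braces": "{", "close curly braces": "}",
    "dot": ".", "open bracket": "(", "close bracket": ")",
    "hash": "#", "backslash": "\\", "plus": "+",
}
NUMBERS = {
    "zero": "0", "one": "1", "two": "2", "three": "3", "four": "4",
    "five": "5", "six": "6", "seven": "7", "eight": "8", "nine": "9",
}


def preprocess(recognized_text):
    """Replace spoken symbol and number names with their characters."""
    words = recognized_text.lower().split()
    table = {**SYMBOLS, **NUMBERS}
    max_len = max(len(key.split()) for key in table)
    tokens = []
    i = 0
    while i < len(words):
        # Try the longest candidate phrase first so "less than" wins over "less".
        for size in range(min(max_len, len(words) - i), 0, -1):
            phrase = " ".join(words[i:i + size])
            if phrase in table:
                tokens.append(table[phrase])
                i += size
                break
        else:
            # Neither a symbol nor a number: keep the word unchanged.
            tokens.append(words[i])
            i += 1
    return tokens


print(preprocess("hash include less than stdio dot h greater than"))
```

The function returns a token list; how tokens are joined back into a source line (for example, without spaces around the dot) is left open here because the text above does not specify it.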
In this module, the preprocessed text is taken as input. IDE commands are also given through voice input. The IDE commands used in our proposed work are open file, save file, new file, compile file and run file.
The new file command creates a new file in the IDE when the corresponding voice command is given.
Look-up table for symbols:

Key | Value |
---|---|
Less than | < |
Equal to | = |
Greater than | > |
Open curly braces | { |
Close curly braces | } |
Dot | . |
Open bracket | ( |
Close bracket | ) |
Hash | # |
Backslash | \ |
Plus | + |

Look-up table for numbers:

Key | Value |
---|---|
Zero | 0 |
One | 1 |
Two | 2 |
Three | 3 |
Four | 4 |
Five | 5 |
Six | 6 |
Seven | 7 |
Eight | 8 |
Nine | 9 |
The open file command is used to open an existing file. To open a file, the user provides the command "open file <filename> dot c" or "open file location <location of the file>" through voice input. If the file exists, the IDE opens it. If it does not exist, an error message is shown to the user.
If we utter the following lines:
Open file example dot c (or) open file location D colon backslash example backslash example dot c.
This will be converted into the text as follows:
Open file example.c
Open file location D:\example\example.c
The voice command for open file is converted into text, and the file name is extracted from the text. If the command does not contain the keyword "location", the IDE looks for the file in the current directory. If the command contains the keyword "location", it looks for the file at the specified location.
The save command is used to save a file. For saving a file the user has to provide the command “save file <filename> dot c ‘or’ save file location <location of the file>” through voice input. It will save the contents of the IDE into the file specified by the user.
Example: save file example dot c (or) save file location D colon backslash example backslash example dot c.
These are converted into the following text:
Save file example.c
Save file location D:\example\example.c
The voice command is converted into text, and the file name is extracted from the text. If the command is save file without the keyword "location", the file is saved in the current directory. If the command contains the keyword "location", the file is saved at the specified location.
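The handling of the "location" keyword for the open and save commands can be sketched in Python as follows. The sketch assumes the command has already been converted to text such as "Open file example.c"; the function name `parse_file_command` and its `current_dir` parameter are hypothetical.

```python
import os


def parse_file_command(command_text, current_dir="."):
    """Return (action, path) for an 'open file' or 'save file' command."""
    words = command_text.split()
    if (len(words) < 3 or words[0].lower() not in ("open", "save")
            or words[1].lower() != "file"):
        raise ValueError("not an open/save file command: " + command_text)
    action = words[0].lower()
    if words[2].lower() == "location":
        # Keyword present: the rest of the command is the full location.
        path = " ".join(words[3:])
        if not path:
            raise ValueError("missing location in: " + command_text)
    else:
        # No keyword: resolve the file name against the current directory.
        path = os.path.join(current_dir, " ".join(words[2:]))
    return action, path


print(parse_file_command("Open file example.c"))
print(parse_file_command("Save file location D:\\example\\example.c"))
```

Checking whether the resolved path exists, and reporting an error if it does not, would follow as a separate step for the open command.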
The compile command is used to compile the C file. To compile a file, the user provides the command "compile file" through voice input. The file is compiled using the gcc compiler, which produces a .exe file or a .o file depending on the operating system: a .exe file on Windows and a .o file on Unix-based operating systems.
The run command is used to run the compiled C program. To run it, the user provides the command "run file" through voice input. If the .exe or .o file is found, it is run in the command prompt or terminal, according to the operating system. If it is not found, an error message is displayed. After the C program has run, the .exe or .o file is deleted.
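One way to drive the compile and run steps from Python is sketched below. It assumes gcc is on the PATH and follows the naming described above (.exe on Windows, .o elsewhere) by passing the output name to gcc's `-o` option. The helper names are hypothetical, and the paper's own implementation may differ.

```python
import os
import subprocess
import sys


def build_commands(source_path, platform=sys.platform):
    """Return (compile command, run command, output file) for a C source file."""
    base, _ = os.path.splitext(source_path)
    output = base + (".exe" if platform.startswith("win") else ".o")
    compile_cmd = ["gcc", source_path, "-o", output]
    # A bare file name needs an explicit directory to be executable from a shell.
    runnable = output if os.path.dirname(output) else os.path.join(".", output)
    return compile_cmd, [runnable], output


def compile_and_run(source_path):
    """Compile with gcc, run the result, then delete the produced file."""
    compile_cmd, run_cmd, output = build_commands(source_path)
    compiled = subprocess.run(compile_cmd, capture_output=True, text=True)
    if compiled.returncode != 0:
        return compiled.stderr  # compiler diagnostics, which include line numbers
    try:
        finished = subprocess.run(run_cmd, capture_output=True, text=True)
        return finished.stdout
    finally:
        os.remove(output)  # mirrors the clean-up step described above


print(build_commands("example.c", platform="linux")[0])
```

Only `build_commands` is exercised in the example, so the sketch can be read without a compiler installed; `compile_and_run` shows where the diagnostics used for later error correction would come from.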
The goto line number command is used to correct errors that occurred during compilation of the C program. To correct an error, the user provides the command "goto line number <lineno>" through voice input. The line number is extracted from the voice command and the text on that line is cleared. The newly recognized text from the user's voice input is then placed on that line.
Example: goto line number six six.
From this, six six is extracted and converted to the number 66. This text-to-number conversion is done by the IDE preprocessing module. The algorithms for performing the IDE commands are shown in Algorithms 6, 7, 8, 9 and 10.
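The line-number extraction and line replacement can be sketched in Python as follows. The sketch assumes the line number is spoken digit by digit, as in the "six six" example, and that lines are numbered from 1; the function names are illustrative.

```python
DIGITS = {
    "zero": "0", "one": "1", "two": "2", "three": "3", "four": "4",
    "five": "5", "six": "6", "seven": "7", "eight": "8", "nine": "9",
}


def parse_goto(command_text):
    """Extract the line number from 'goto line number <digit words>'."""
    words = command_text.lower().split()
    if words[:3] != ["goto", "line", "number"] or len(words) == 3:
        raise ValueError("not a goto line number command: " + command_text)
    try:
        digits = [DIGITS[word] for word in words[3:]]
    except KeyError as unknown:
        raise ValueError("not a digit word: " + str(unknown))
    return int("".join(digits))


def replace_line(lines, line_number, new_text):
    """Return a copy of lines with the 1-based line replaced by new_text."""
    if not 1 <= line_number <= len(lines):
        raise IndexError("line number out of range: " + str(line_number))
    updated = list(lines)
    updated[line_number - 1] = new_text  # clear the old text, place the new one
    return updated


print(parse_goto("goto line number six six"))
```

The replacement text passed to `replace_line` would be the newly recognized and preprocessed line spoken by the user.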
We collected 217 C programming language keywords. Speech utterances corresponding to these keywords are collected from 25 speakers, with each keyword uttered 20 times. The speech samples are collected using a microphone. The speech data has a sampling rate of 16 kHz with a single bit mono channel and is stored in WAV format. After the data collection, the transcription file, dictionary file and language model files are generated. MFCC features are extracted from the collected speech samples, and the HMM is built from the extracted features. During testing, the test utterance is recognized using the HMM. In IDE preprocessing, the recognized text is converted into a C program using the syntactic constructs of the C language. The IDE commands open file, save file, compile file, run file and goto line number are also implemented using voice input.
The performance measure used to evaluate the proposed system is discussed below.
Performance Measures
The Word Error Rate (WER) is a metric used to measure the performance of an ASR system. It compares the reference word sequence with the recognized word sequence and is defined as follows:

$$\mathrm{WER} = \frac{S + D + I}{N}$$

where:
S is the number of substitutions,
D is the number of deletions,
I is the number of insertions, and
N is the number of words in the reference (actual) word sequence.
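A minimal Python sketch of a WER computation follows. It assumes the usual definition, (S + D + I) / N, where the substitutions, deletions and insertions are those of a minimum-edit (Levenshtein) alignment between the reference and the recognized word sequences; the function name and the example utterances are illustrative.

```python
def word_error_rate(reference, hypothesis):
    """(S + D + I) / N using a minimum-edit alignment over words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # previous[j] = edits needed to turn the first i-1 reference words
    # into the first j hypothesis words.
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, 1):
        current = [i]
        for j, hyp_word in enumerate(hyp, 1):
            current.append(min(
                previous[j] + 1,                          # deletion
                current[j - 1] + 1,                       # insertion
                previous[j - 1] + (ref_word != hyp_word)  # substitution or match
            ))
        previous = current
    return previous[-1] / len(ref)


print(word_error_rate("open file example dot c", "open file sample dot c"))
```

In the example one of the five reference words is substituted, giving a WER of 0.2, i.e. 20%.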
Word Error Rate calculation for the entire system is as follows:
Word Error Rate calculation for a C program is
where:
i is the line number of the program,
WERi is the Word Error Rate for the ith line of the C program,
n is the total number of lines in the C program.
Word Error Rate calculation for IDE commands is
where:
c is the voice input for IDE command,
WERc is the Word Error Rate for the IDE command c,
m is the total number of IDE commands uttered by the user.
The performance of speech recognition, analyzed by varying the number of mixture components with a vocabulary of 150 words, is tabulated in
The performance of speech recognition, analyzed by varying the number of words, is presented in
Our proposed system captures a C program through voice input and produces the compiled C program as output. During the training phase, speech utterances corresponding to C keywords are collected, MFCC features are extracted from the speech samples, and the HMM is built from the extracted features. During testing, MFCC features are extracted from the test utterance and the text is recognized using the HMM. The recognized text is converted into a C program using the syntactic constructs of the C language. The IDE commands for saving, opening, compiling and running a file are also given through voice input. The proposed speech-based IDE is implemented for C programs only, but it can be extended to other programming languages. In our proposed work, word-based speech recognition is implemented. When the work is extended to other programming languages, phoneme-based speech recognition can be applied, since it supports large-vocabulary data sets.
No. of mixtures | Word error rate (in %) |
---|---|
2 | 58.8 |
4 | 42.0 |
8 | 27.5 |
16 | 13.0 |
32 | 4.6 |
64 | 3.8 |
128 | 9.2 |