Patients with severe hearing loss have the option to get a cochlear implant device to regain their hearing. Yet, the implantation process is not always optimal, which in some cases results in a shallow insertion depth or an accidental insertion into the wrong cochlear duct. As a consequence, the patients' pitch discrimination ability is suboptimal, which further degrades vowel identification, which in turn is vital for speech recognition. This paper presents a technical approach to this problem: the adaptive pitch transposition module modifies the frequency content such that the pitch is fixed to an optimal value. To determine this value, a patient-individual best pitch is determined experimentally by evaluating speech recognition at different pitches. This best pitch is subsequently called the comfort pitch. Based on these considerations, a technical implementation is presented in principle. A system comprising pitch detection, pitch transposition, and an arbitrarily chosen comfort pitch is described in depth. It has been implemented prototypically in Matlab/Octave and tested with an example audio file. The system itself is designed as a preprocessing stage preceding cochlear implant processing.
Hearing is one of our five senses, and losing it constitutes a severe impairment. This loss might happen simply due to age, gene defects, or excessive exposure to loud noise. Affected patients are normally helped by wearing hearing aid devices.
On the functional level, deafness might be caused by the destruction of the hair cells within the cochlea. The cochlea is a central organ deep inside the ear. Its very function is the transformation of mechanical waves, i.e., the sound, into electrical signals, i.e., the action potentials of the nerve cells. The hearing nerve maps these action potentials onto the auditory cortex, which is responsible for further processing.
Often, the term “destruction of the hair cells” does not refer to the entire destruction of those cells but to destruction of the hairs. This kind of deafness is usually called “sensorineural hearing loss”. Without these hairs, the cells cannot sense the mechanical oscillations within the cochlea, and can thus not invoke any action potential to send toward the auditory cortex. Those patients receive help from cochlear implants. A cochlear implant bypasses the processing chain from the earpiece up to the cochlea. It employs a microphone, a speech processor, and a small number of electrodes, with which it stimulates a certain area of hair cells. For every incoming sound sample, the speech processor creates separate frequency filtered signals, which are turned into pulse train signals, that stimulate the hair cells directly. Section 1 provides the background on both the hearing sense and cochlear implants as far as required for the understanding of this paper.
Even with properly implanted and operating cochlear implants, not all patients enjoy sufficient speech recognition. In addition to the small number of electrodes and limited brain plasticity, the cochlea itself might be the origin of further problems: due to its construction and possible additional ossifications within it, the electrodes cannot be inserted deep enough; some of the (usually) twelve available electrodes remain outside the cochlea. Consequently, a very small number of potential frequencies remains available, which does not necessarily map the frequency spectrum of the incoming sound. This problem is described in more detail in Section 3.
As an alleviation, this paper proposes to insert a signal preprocessor between the microphone and the actual speech processor (see, also,
As is already suggested by its name, pitch is the central concept of the adaptive pitch transposition module. Pitch is a concept that is rather hard to understand, as it is not a frequency that is contained in the sound sample, but rather a neurophysiological phenomenon that reflects the inherent periodicity of the sensed sound. Therefore, Section 4 provides a brief description of pitch and its occurrence in speech.
Depending on the actual situation of the implant, every cochlear implant patient has a very specific frequency area that provides the most comfortable and best speech recognition. This personal frequency area is called the comfort pitch range and does not necessarily match the pitch of an incoming speech sample. This frequency mismatch is a main reason for limited speech recognition. As a remedy, Section 8 proposes the adaptive pitch transposition module. Its very purpose is to transpose, i.e., shift and scale, the frequency spectrum of all incoming speech samples to one pitch in the comfort pitch area. To this end, the adaptive pitch transposition module consists of a pitch detection unit and a pitch transposition unit. To achieve the adaptivity of the module, a test to acquire the comfort pitch has been developed as well.
The adaptive pitch transposition module has been prototypically implemented in both Matlab® and C++. These prototypes are described in Section 1, and allow for the offline evaluation of recorded sound samples as well as the online demonstration of its effects.
The preliminary practical experiences clearly indicate that the chosen approach functions. These experiences, however, also indicate room for improvements and several limitations. The main drawback of the current approach is that it heavily hinders speaker separation. Section 11 discusses these aspects and concludes this paper.
This section provides some background information about the ear, its cochlea, and cochlear implants as far as relevant for the understanding of this paper. This description is illustrated by
From a medical point of view, the ear consists of three parts: the outer ear (earpiece (pinna) and eardrum (tympanic membrane)), the middle ear (the ossicles: malleus, incus, and stapes), and the inner ear (cochlea and hearing nerve). With respect to this paper, the cochlea is the most important part, since it converts incoming mechanical signals into electrical signals. This is accomplished by its hair cells: The entire functional chain consisting of the outer and middle ear injects mechanical sound waves into the liquid of the cochlea. Due to the strong coupling, these oscillations excite the hair cells’ hairs. Depending on the actual excitation, the hair cells release certain action potentials, which are subsequently processed by the hearing nerve and the auditory cortex. This process is shown in
All incoming sound waves are reflected at the cochlea’s end. This leads to a superposition of several sources. Due to its construction and the damping inside its tubes, the cochlea performs a frequency-to-spatial transformation: Every single hair cell is particularly sensitive to one frequency of the perceived sound signal, with higher frequencies being detected at the cochlea’s entry (also known as the base, i.e., near the oval window (fenestra vestibuli)) and lower frequencies being detected at the cochlea’s end (apex).
Due to age, diseases, inheritance, and excessively loud sound sources, the hair cells, particularly their hairs, might be damaged or even broken. As a result, the hair cells’ hairs cannot stimulate the hair cells’ nerves. In the long run, this leads to partial or even total loss of the hearing sense, even though the hair cells' bodies are still functioning. This is known as sensorineural hearing loss, which is the most common form of hearing loss. Other forms are conductive hearing loss, where the mechanical forwarding mechanism is broken (e.g., if the middle ear is damaged), or a mixed form of sensorineural and conductive hearing loss. With conductive hearing loss, cochlear implants are not useful, but other devices are available.
Cochlear implants, also called CIs, have been developed since the 1950s. Their very purpose is to stimulate the nerve cells electrically. That is, a small number n ≈ 12 of electrodes substitutes the mechanical excitation of the nerve cells with electrical signals.
A cochlear implant is not just a bunch of electrodes. Rather, it is a complete processing chain as shown in
to an implanted receiver coil. From there, the signals are then relayed to the electrodes to invoke the hearing sensation.
In operation, the cochlear implant processes the acoustic signals, decomposes them into independent frequency bands, and uses those to stimulate the nerve cells via the electrodes. From that description, it should be clear that cochlear implants require repeated recalibration of all electrodes and frequency bands, since the transition resistance of the electrodes or even the characteristics of the hair cells themselves might change (age) over time.
Even with such sophisticated cochlear implants as mentioned, not all patients can understand speech to a sufficient degree. One of the major problems is a reduced insertion depth due to a partly ossified cochlea. The upper part of
However, research aiming at enhancing pitch perception was recently done by Laneau et al. [
Pitch is the perceived “height” of a sound and constitutes a neurophysiological phenomenon. It is not a frequency in the traditional sense. Instead, it is a perceptive phenomenon in which the human brain extracts harmonic information from a signal, and turns it into the illusion of being a frequency. This harmonic information is measurable and is called the fundamental frequency. The fundamental frequency is detectable in a sound signal, whereas the pitch sensation is only created in the brain. In other words: the “fundamental frequency” refers to the incoming sound, whereas “pitch” refers to the physiological phenomenon in the auditory system.
This section explains the concept behind pitch, and shows how to estimate it. In addition, it outlines the role of pitch in speech and describes some pitfalls in dealing with pitch in digital speech processing systems. This section also briefly describes the transposition of the incoming sound signal, since the adaptive pitch transposition module aims at a perfect match between the frequency at which the patient has the best speech comprehension and the fundamental frequency of the incoming sound.
Pitch is an attribute of sound, which only manifests itself in the brain of the listener. The perception of pitch is and has been subject to intensive research.
Two theories have emerged: The place coding and the temporal coding theory [
A distinction should be made between sounds that have a perceivable pitch and sounds that do not. A pitched sound has an inherent periodicity (a fundamental frequency). This periodicity usually appears as distinct, easily observable peaks in the frequency spectrum. Non-pitched sounds, however, are all forms of noise or noise-like sounds, such as a flowing river or rustling leaves. Since those have no detectable periodicity, they also have no distinct fundamental frequency. Thus, according to the theories of pitch perception, those sounds have no perceivable pitch.
The human body can create sounds that have a perceivable pitch. What is usually called the voice is actually the vocal folds vibrating and giving the exhaled air a periodic oscillation characteristic. That is, by using the vocal folds, humans create speech with a perceivable pitch, which is commonly referred to as voiced speech. However, if the vocal folds are kept still, the air does not vibrate, and thus the created sound does not have a fundamental frequency. These sounds, which are usually only shaped by the current shape of the mouth, are called unvoiced. As a third category, silence is also considered a sound, but it does not play a role in the current discussion.
In speech, voiced sounds are typically vowels, whereas unvoiced sounds are usually consonants (or whispered speech). Examples of unvoiced sounds are [ʃ] as in fish and [θ] as in teeth.
On a side note, the decreasing peaks in the vowel spectrum constitute another attribute of sounds, namely the timbre, which is defined by the shape that the individual peaks form.
Pitch cannot be determined directly. Rather, it is normally estimated via the fundamental frequency $f_0$ of the considered sound sample. The fundamental frequency $f_0$ is the lowest frequency in the spectrum that forms a set of harmonic frequencies $f_n$, which are integer multiples of the fundamental frequency. The frequency $f_0$ itself is also a harmonic. Any harmonic sound is composed of one or more harmonics and is thus periodic with respect to $f_0$. If a sound has no fundamental frequency, it cannot be considered periodic (or harmonic, respectively).
As already shown, pitch can be estimated by evaluating the distance between the individual peaks. Due to several side effects, it is usually not possible to simply use the largest peak of the spectrum as an $f_0$ estimate. Thus, an alternative approach to measuring the periodicity of a signal is required. The autocorrelation function (ACF, $R_{xx}$) is one of the widely used algorithms for this very purpose. Autocorrelation slides a portion of the data over itself with an offset lag $\tau$ to determine a similarity measure. A maximum in this function indicates a harmonic frequency determined by the corresponding lag $\tau$. The equation for the autocorrelation function reads as follows:
$$R_{xx}(\tau) = \sum_{n=-N}^{N} x(n)\, x(n-\tau) \tag{1}$$
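The following is a minimal Python sketch of pitch estimation via Equation (1), not the authors' implementation. It assumes a finite frame (so the sum only runs over the samples that are available), a hypothetical sample rate of 8 kHz, a synthetic test tone, and a lag search restricted to fundamental frequencies between 50 and 500 Hz. The names `autocorrelation` and `estimate_f0` are chosen for illustration only.

```python
import math


def autocorrelation(frame, lag):
    """R_xx(lag): sum of x(n) * x(n - lag) over the samples available in the frame."""
    return sum(frame[n] * frame[n - lag] for n in range(lag, len(frame)))


def estimate_f0(frame, sample_rate, f_min=50.0, f_max=500.0):
    """Return the frequency whose lag maximizes the autocorrelation.

    The search range f_min..f_max is an assumption of this sketch.
    """
    lag_min = max(1, int(sample_rate / f_max))
    lag_max = min(int(sample_rate / f_min), len(frame) - 1)
    best_lag = max(range(lag_min, lag_max + 1),
                   key=lambda lag: autocorrelation(frame, lag))
    return sample_rate / best_lag


# Hypothetical test signal: a 200 Hz fundamental with two weaker harmonics.
SAMPLE_RATE = 8000
frame = [
    math.sin(2 * math.pi * 200 * n / SAMPLE_RATE)
    + 0.5 * math.sin(2 * math.pi * 400 * n / SAMPLE_RATE)
    + 0.25 * math.sin(2 * math.pi * 600 * n / SAMPLE_RATE)
    for n in range(1024)
]
f0 = estimate_f0(frame, SAMPLE_RATE)  # expected to lie close to 200 Hz
```

Because the lag is an integer number of samples, the estimate is quantized to `sample_rate / lag`; a practical detector would typically interpolate around the maximum.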
The application of the autocorrelation function to the original signal (not the spectrum) yields a good pitch estimate. However, this is only true for signals that have a harmonic structure. For unvoiced signals, the autocorrelation results appear to be pseudo-random. Therefore, the results have to be verified. This verification process is called voice activity detection (VAD) or voiced/unvoiced decision (V/UV) for speech audio signals, and is used in speech recognition software and other speech-related tasks. Voice activity detection usually classifies a given signal part as voiced, unvoiced, or silence. Indicators for voiced speech are high signal energy combined with a low zero crossing rate, i.e., a low number of sign changes in the digitized signal. The opposite constellation, i.e., low energy and a high number of zero crossings, is a good indicator for noise. However, a variety of different algorithms of different complexities and results are offered by the pertinent literature [
Pitch shifting or pitch transposition is the act of changing the fundamental frequency $f_0$ to change the perceived pitch of a signal. Since pitch is an elaborate concept, this is not a trivial task. The term pitch transposition is usually preferred over pitch shifting, since a shift means moving everything in the signal by an absolute value. Transposition, however, considers the relationships of the harmonics in the spectrum and keeps those intact. In other words: pitch transposition preserves the harmonic structure, whereas pitch shifting does not.
Transposing the fundamental frequency of a signal is usually a two-step method. The first step consists of changing the signal length, without affecting the pitch. Then, the second step simply consists of altering the replay speed. This step can be imagined as a player for vinyl records, which is intentionally set to the wrong speed. In case of a higher-than-intended speed, both the record's pitch increases and the duration shortens. In the opposite case, the pitch decreases, whereas the playtime increases.
In terms of digital signal processing, these two steps are called 1) time stretching and 2) resampling. First, the signal is stretched in time while its original pitch is maintained. Then, the data is resampled to the original signal length. A signal stretched by a factor of two and then resampled to the original length will thus have a pitch at double the frequency.
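The effect of the resampling step can be illustrated with a small Python sketch. It is a simplified illustration under stated assumptions: the time-stretching step is not implemented but simulated by synthesizing a pure 100 Hz tone of twice the original duration, the resampler uses plain linear interpolation without anti-aliasing filtering, and the frequency is estimated from zero crossings, which is only valid for a pure tone. All names and values are hypothetical.

```python
import math


def resample_linear(signal, new_length):
    """Resample `signal` to `new_length` samples by linear interpolation."""
    if new_length < 2 or len(signal) < 2:
        raise ValueError("need at least two input and two output samples")
    step = (len(signal) - 1) / (new_length - 1)
    out = []
    for i in range(new_length):
        pos = i * step
        left = min(int(pos), len(signal) - 2)
        frac = pos - left
        out.append((1.0 - frac) * signal[left] + frac * signal[left + 1])
    return out


def estimate_frequency(signal, sample_rate):
    """Zero-crossing frequency estimate (two crossings per period, pure tones only)."""
    crossings = sum(1 for a, b in zip(signal, signal[1:]) if (a < 0) != (b < 0))
    return crossings * sample_rate / (2 * len(signal))


SAMPLE_RATE = 8000
ORIGINAL_LENGTH = 4000  # 0.5 s of audio

# Stand-in for the output of the time-stretching step:
# twice the original duration, pitch still at 100 Hz.
stretched = [math.sin(2 * math.pi * 100 * n / SAMPLE_RATE + 0.1)
             for n in range(2 * ORIGINAL_LENGTH)]

# Resampling back to the original length compresses the waveform in time,
# so the tone is now heard at roughly twice its original frequency.
transposed = resample_linear(stretched, ORIGINAL_LENGTH)
```

Calling `estimate_frequency` on `stretched` and `transposed` gives values near 100 Hz and 200 Hz, respectively, which mirrors the factor-of-two example in the text.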
For this task, the pertinent literature offers several algorithms. The PSOLA algorithm [
This section describes both the idea and the functionality of the adaptive pitch transposition module. From a top-level perspective, adaptive pitch transposition is a module that is inserted between the microphone and the cochlear implant’s speech processor. Thus, this module changes all incoming utterances, which are then processed by the implant as usual. After a brief overview of the general idea, the individual parts of the concept are outlined. These are the concept of comfort pitch, which constitutes the voice pitch of best comprehension, the actual pitch detection, and the final pitch transposition.
Adaptive pitch transposition, or APT for short, is a method that shifts (or transposes) the pitch of an arbitrary signal to a fixed, predetermined value. The resulting signal will thus have only a single pitch for each of the voiced signal parts. Based on the former explanations, this will most likely help cochlear implant patients in their speech comprehension, since it accounts for both the low insertion depth as well as their low brain plasticity. The required functional units are depicted in
The APT concept works as follows: It assumes that a particular patient has a certain voice pitch that yields the best recognition rates. This particular pitch is called the comfort pitch. Then, APT “smartly” transposes all incoming voiced sound samples to that pitch. The “smart” part is to measure the current value of the pitch and adjust it adaptively to the patient’s comfort pitch. This approach requires the following processing stages: 1) determination of the current pitch, 2) determination of the required shift amount, and 3) shift/transposition of all incoming signals to the target comfort pitch. The transposed signals may then be forwarded to an arbitrary cochlear implant speech processing algorithm.
To perform the actual pitch transposition, the pitch of the current signal has to be determined (estimated). This is necessary because the signal pitch usually varies (due to intonation), which leads to varying conversion factors, since the signal always has to be transposed to the very same target pitch. In other words: the target pitch is constant, whereas the transposition factor is not. This varying conversion factor is called the transposition coefficient $k$, which in turn defines how much shift has to be applied.
Pitch detection itself is executed as explained in Section 6. In addition, voice activity detection is required, since unvoiced speech does not require any further processing.
The comfort pitch is the frequency of best comprehension. For humans with an intact auditory system, the comfort pitch may not have a distinct value, but rather consist of a certain range of frequencies. Given that the brain is accustomed to the human voice, this frequency most likely lies within the frequency range of the human voice. Cochlear implant patients, however, suffer from the shortcomings explained above: Their comfort pitch might be a single value, which presumably lies within the cochlear implant’s electrode frequency range. The relationship of these different pitches is depicted in
This paper defines the comfort pitch $f_c$ as the one among a given set of pitches with which a particular listener achieves the best speech recognition rates. This definition directly implies how the comfort pitch can be estimated.
The comfort pitch will be determined by a simple test. This test will consist of voice samples of different pitches, uttering random phrases from a fixed set of sample phrases. A sample speaker pitch configuration might be as shown in
In this table, every speaker was recorded with a certain set of phrases, which are then presented to the patient in a randomized manner. Subsequently, the patient repeats the content of each phrase as he or she understood it. The pitch-score pairs are then ranked by the patient's speech recognition rates. The voice pitch with the best score becomes the new comfort pitch for optimal recognition when using the system. If two pitches achieve the best score, the lower one is picked, since it results in lower computational costs during the subsequent pitch transposition. Through this test, the comfort pitch adapts to the patient.
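The ranking step of the comfort pitch test can be sketched in a few lines of Python. This is an illustrative sketch only: the scores below are invented placeholders (no patient data is available), the pitches are taken from the speaker table and assumed to be in Hz, and the tie rule is read as "prefer the lower pitch". The function name `select_comfort_pitch` is hypothetical.

```python
def select_comfort_pitch(score_by_pitch):
    """Return the pitch with the best recognition score.

    `score_by_pitch` maps a speaker pitch (assumed to be in Hz) to the
    recognition score obtained with that speaker. Ties are resolved in
    favor of the lower pitch (an assumed reading of the tie rule).
    """
    if not score_by_pitch:
        raise ValueError("no test results available")
    best_score = max(score_by_pitch.values())
    return min(pitch for pitch, score in score_by_pitch.items()
               if score == best_score)


# Invented example scores (fraction of correctly repeated phrases).
results = {95: 0.40, 110: 0.55, 140: 0.70, 200: 0.70,
           240: 0.50, 290: 0.35, 320: 0.30}
comfort_pitch = select_comfort_pitch(results)  # 140: tie with 200, lower pitch wins
```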
The comfort pitch itself is a robust feature, since it is determined with an already implanted cochlear implant, and just depends on the nerve ends in the cochlea, which should not deteriorate in an unusual manner. The speech recognition rates themselves will probably rise even further, since the principle of the comfort pitch relies on the brain's adaptivity to equal-sounding stimuli. It is adapted to the existing structures, and may benefit from the remaining brain plasticity.
As the name already suggests, the pitch transposition is the vital part of the APT module. The source audio signal is changed to a new pitch within this block (second block in the overview on
| Speaker | M1 | M2 | M3 | F1 | F2 | F3 | F4 |
|---|---|---|---|---|---|---|---|
| Pitch (Hz) | 95 | 110 | 140 | 200 | 240 | 290 | 320 |
The transposition coefficient $k = f_0 / f_c$ is determined from the comfort pitch $f_c$ and the fundamental frequency $f_0$ (see also Section 6 and the previous section). In APT, the pitch transposition utilizes $k$ to calculate the relative change in pitch. This ensures a steady pitch at the output. The output signal is a high-definition broadband audio signal with almost the same characteristics as the input signal, except for the altered pitch.
For test purposes, the APT algorithm was implemented in both Matlab and C++. Both programs share the same basic structure and algorithms. The implementations focus on proof-of-concept and maximal flexibility in order to allow further experimentation. This section provides a brief overview of both implementations.
The implementation and testing have been done using an Intel® Core™ i7 Q720 CPU at 1.6 GHz. The input of the programs may be any file in the Waveform Audio File Format (WAV). The C++ implementation also allows for live microphone input with simultaneous output (best suited for headphone use). The comfort pitch is currently assumed to be a fixed value, and is to be set as a parameter.
After the audio data has been obtained, it is segmented into overlapping frames. Each of the following steps is executed per frame of size $N$ (e.g., with $N = 1024$). The output signal has the modified fundamental frequency $f_c$.
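The segmentation into overlapping frames may be sketched as follows in Python. The frame size of 1024 follows the example value in the text; the hop size of 512 samples (50% overlap) is an assumption of this sketch, as is the choice to drop a trailing remainder that does not fill a complete frame. The function name `split_into_frames` is hypothetical.

```python
def split_into_frames(signal, frame_size=1024, hop_size=512):
    """Split `signal` into overlapping frames of `frame_size` samples.

    Consecutive frames start `hop_size` samples apart. A trailing
    remainder shorter than `frame_size` is dropped in this sketch.
    """
    if frame_size <= 0 or hop_size <= 0:
        raise ValueError("frame_size and hop_size must be positive")
    last_start = len(signal) - frame_size
    return [signal[start:start + frame_size]
            for start in range(0, last_start + 1, hop_size)]


# Example: 4096 samples yield 7 frames with starts at 0, 512, ..., 3072.
samples = list(range(4096))
frames = split_into_frames(samples)
```

A real-time variant would process one frame at a time from a ring buffer instead of materializing all frames up front.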
The given hardware is under considerable load when running the program. Therefore, it can be concluded that a microprocessor solution would have to be highly optimized, or alternatively, fundamentally improved algorithms have to be developed.
A simple autocorrelation has been used for the pitch detection, as already described in Section 6. The resulting signal is then searched for its maximum with respect to the lag $\tau$ within the range that corresponds to pitches of 50 - 500 Hz. As already explained, the results are not reliable, since not every frame has a pitch. Therefore, a verification function evaluating both the zero crossing rate (ZCR) and the energy (E) has been used. Equation (2) calculates a validity measure, with $p > 0$ indicating a voiced frame. This equation achieves good results for pre-normalized signals, but not for normalized frames.
$$p_n = \frac{E_n + 1}{\mathrm{ZCR}_n + 1} - 1, \quad \mathrm{ZCR}, E \in [0, \dots, 1] \tag{2}$$
If the voice activity detection indicates a voiced frame, APT uses the $f_0$ estimated by the autocorrelation algorithm.
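A minimal Python sketch of such a voicing check is given below. It rests on several assumptions that are not confirmed by the text: Equation (2) is read as p = (E + 1) / (ZCR + 1) - 1, the energy E is computed as the mean squared amplitude of samples in [-1, 1], and the zero crossing rate is the fraction of adjacent sample pairs with a sign change. The test signals and all function names are hypothetical.

```python
import math
import random


def frame_energy(frame):
    """Mean squared amplitude; lies in [0, 1] for samples in [-1, 1]."""
    return sum(x * x for x in frame) / len(frame)


def zero_crossing_rate(frame):
    """Fraction of adjacent sample pairs with a sign change; lies in [0, 1]."""
    changes = sum(1 for a, b in zip(frame, frame[1:]) if (a < 0) != (b < 0))
    return changes / (len(frame) - 1)


def voicing_measure(frame):
    """Assumed reading of Equation (2): p = (E + 1) / (ZCR + 1) - 1."""
    return (frame_energy(frame) + 1.0) / (zero_crossing_rate(frame) + 1.0) - 1.0


def is_voiced(frame):
    """A frame counts as voiced if the validity measure is positive."""
    return voicing_measure(frame) > 0.0


SAMPLE_RATE = 8000
# Voiced-like frame: strong 200 Hz tone (high energy, few zero crossings).
voiced_like = [0.9 * math.sin(2 * math.pi * 200 * n / SAMPLE_RATE)
               for n in range(1024)]
# Noise-like frame: weak random samples (low energy, many zero crossings).
rng = random.Random(0)
noise_like = [rng.uniform(-0.1, 0.1) for _ in range(1024)]
```

With this reading, the measure is positive exactly when the energy exceeds the zero crossing rate, which matches the described indicators: high energy with few sign changes for voiced speech, and the opposite for noise.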
However, even with voice activity detection, not every frame is classified correctly. Because of that, the voice activity results of some of the recently calculated frames are taken into account to determine the result of the current frame. Using this, a meaningful pitch trajectory in time can be established (since, if the last frame was pitched, the next will most likely also be).
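One simple way to take recent frames into account is a majority vote over a short history of raw decisions. The Python sketch below illustrates this idea; the majority rule, the history length of five frames, and the class name `VoicingSmoother` are assumptions of this sketch and not necessarily the scheme used in the prototypes.

```python
from collections import deque


class VoicingSmoother:
    """Majority vote over the most recent raw voiced/unvoiced decisions."""

    def __init__(self, history=5):
        # Holds the current decision plus up to history - 1 previous ones.
        self._recent = deque(maxlen=history)

    def update(self, raw_decision):
        """Add the raw decision for the current frame and return the smoothed one."""
        self._recent.append(bool(raw_decision))
        return 2 * sum(self._recent) > len(self._recent)


# A single misclassified frame (index 3) inside a voiced region is corrected,
# while a sustained change to unvoiced is followed after a short delay.
raw = [True, True, True, False, True, True, False, False, False, False]
smoother = VoicingSmoother(history=5)
smoothed = [smoother.update(decision) for decision in raw]
```

The price of this smoothing is a delay of a few frames at the boundaries of voiced regions.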
The real-world application made clear that the pitch detection algorithm (especially the voice activity detection part) is the bottleneck of the implementation. Wrongly detected pitches result in jumps in the created pitch trajectory. This destroys the quasi-steady character of the pitched regions, and leads to worse comprehension, even for hearing people.
Since no patient data is available yet, the comfort pitch is included as an arbitrary value in the programs, which might be set later according to the determined $f_c$ of selected patients. Currently, it is used as an indicator of whether or not the algorithm works. This evaluation is done manually by listening tests.
The transposition coefficient for frame n is calculated according to
$$k_n = \frac{f_{0,n}}{f_c} \tag{3}$$
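As a small worked example of Equation (3), the following Python sketch computes the per-frame coefficient for a short pitch trajectory. The trajectory values and the comfort pitch of 140 Hz are invented for illustration. Marking unvoiced frames with `None` and assigning them a coefficient of 1 (no change) is an assumption of this sketch, motivated by the statement that unvoiced speech requires no further processing.

```python
def transposition_coefficients(f0_per_frame, comfort_pitch):
    """Compute k_n = f0_n / f_c for every frame.

    `f0_per_frame` holds the estimated fundamental frequency per frame,
    with None marking unvoiced frames, which are left unchanged (k = 1).
    """
    if comfort_pitch <= 0:
        raise ValueError("comfort pitch must be positive")
    return [1.0 if f0 is None else f0 / comfort_pitch
            for f0 in f0_per_frame]


# Invented pitch trajectory (Hz) with one unvoiced frame.
trajectory = [190.0, 200.0, None, 210.0]
COMFORT_PITCH = 140.0
k = transposition_coefficients(trajectory, COMFORT_PITCH)

# Dividing each voiced frame's pitch by its coefficient yields the comfort pitch.
targets = [f0 / kn for f0, kn in zip(trajectory, k) if f0 is not None]
```

The coefficient varies from frame to frame while the target stays fixed, which is the adaptive part of the concept.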
Both implementations use a modified phase vocoder technique based on [
Knowing the transposition coefficient $k$, it is possible to prolong the signal using such a phase vocoder. As a last step, the signal has to be resampled. Conceptually, this means transforming the signal back from digital to analogue and sampling it again with a different sampling frequency. Digital signal processing systems, however, already provide the required filtering and interpolation, so that the resampling can be done entirely in the digital domain.
After resampling, the signal has a changed pitch and the original length, and is now ready to be fed into the actual speech processing of a cochlear implant.
Not every cochlear implant patient has sufficient speech comprehension to get through his or her everyday life. Due to ossifications in the cochlea, the implant's electrodes do not reach deep enough into the cochlea to provide the full frequency spectrum to the patient. To alleviate this problem, this paper has presented the concept of adaptive pitch transposition. The concept rests on the consideration that words can be distinguished on a coarse level through vowels alone; it exploits this by shifting the determining information in the audio signal into the perceptive range of the cochlear implant patient.
One major drawback is, however, the loss of speaker differentiation, since every speaker will sound alike after the modification. The only indicators left will be the timbre and visual cues (e.g., moving mouth). Nevertheless, in prospect of regaining the ability to understand speech, this seems negligible.
On the other hand, through the ever-same stimuli, the brain will adapt to the comfort pitch over time. Under the assumption that a comfort pitch exists for every patient, the brain is already accustomed to being stimulated in that particular area of the cochlea. This enables the brain to learn to differentiate these stimuli even better, and thus, to use the remaining brain plasticity.
As for the introduced parameter called comfort pitch, it is yet to be investigated whether a single best pitch exists for cochlear implant patients (as is the case for normal hearing). The comfort pitch is not likely to change after it has been determined, since it depends on the remaining nerves and is fitted to the established structures in the brain.
The extent to which the concept, together with the proposed system, enhances the speech recognition rates in patients is a subject of future research. This research will be directed to evaluation and testing with hearing people as well as cochlear implant users. Furthermore, improvements in the performance and quality of the developed implementation are required in order to allow for real-time usage and testing. The current implementations are good foundations for doing so.
The authors gratefully thank the welisa graduate school for its support. Part of this research was funded by the German Research Foundation (DFG) grant number GRK 1505.
Struwe, K. and Salomon, R. (2017) Adaptive Pitch Transposition: Smart Auditory Spectral Shifts in Cochlear Implants. Engineering, 9, 739-754. https://doi.org/10.4236/eng.2017.99045