Auditory inspired spectro-temporal features
Organiser
Martin Heckmann, Honda Research Institute Europe GmbH
Email: martin "put a dot" heckmann "put an at" honda-ri "put a dot" de
Themes and Issues
The influence of psychoacoustic findings on the development of
speech processing algorithms was and still is very strong.
However, despite tremendous progress over the years, the
performance of automatic speech recognition systems still
falls far behind human performance (see also the
Consonant Challenge
for a different perspective on this subject). Consequently,
there still seems to be a lot to learn from the human
auditory system.
With the advances in methodology, neuroscience also provides new insights into the organization of the mammalian brain, which can be of use for this endeavour.
Measurements have shown that the receptive fields in the primary auditory cortex of ferrets have Gabor-like shapes and respond to modulations in the time-frequency domain (compare Shamma 2001). Current features used for speech analysis, e.g. MFCCs, only cover the spectral properties of the signal (this difference is visualized in Fig. 1).
Figure 1: Comparison between pure spectral (a), pure temporal (b) and spectro-temporal features (c) (adapted from Kleinschmidt 2002).
This lack of fine temporal resolution, together with the inability to capture the dynamic aspects of speech, is a well-known limitation of MFCCs and of almost all other features conventionally used for speech recognition.
The above-mentioned problems should in principle be overcome by the use of spectro-temporal features similar to those found in the primary auditory cortex. They can operate on different temporal resolutions and represent transitions in the time-frequency domain. However, their use leads to new problems, especially the selection of the right features from a huge set of candidates. Nevertheless, in recent years quite a few researchers have started to use spectro-temporal features for tasks such as speech recognition, speech/non-speech discrimination, and source separation, and at the same time have developed different approaches to feature selection.
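To make the idea of a spectro-temporal feature more concrete, the following is a minimal sketch of a two-dimensional Gabor filter applied to a spectrogram. It is an illustration under stated assumptions, not the method of any of the cited authors: the Gaussian envelope, the function names (gabor_filter, gabor_response, ripple) and all parameter values are hypothetical choices, and a synthetic moving ripple stands in for a real speech spectrogram. The filter is tuned to a joint spectral and temporal modulation, so it responds strongly to a ripple drifting in one direction and only weakly to one drifting in the opposite direction, a distinction that a purely spectral or purely temporal feature cannot make.

```python
import numpy as np

def gabor_filter(n_freq, n_time, omega_f, omega_t):
    """Complex 2D Gabor filter (hypothetical parameterisation).

    omega_f: spectral modulation frequency in rad per frequency channel
    omega_t: temporal modulation frequency in rad per frame
    """
    f = np.arange(n_freq) - (n_freq - 1) / 2.0
    t = np.arange(n_time) - (n_time - 1) / 2.0
    F, T = np.meshgrid(f, t, indexing="ij")
    # Assumed Gaussian envelope; the filter spans about +/- 3 standard deviations.
    envelope = np.exp(-0.5 * ((F / (n_freq / 6.0)) ** 2 + (T / (n_time / 6.0)) ** 2))
    carrier = np.exp(1j * (omega_f * F + omega_t * T))
    g = envelope * carrier
    # Remove the DC component so a constant offset in the input gives no response.
    return g - envelope * (g.sum() / envelope.sum())

def gabor_response(spec, g):
    """Slide the filter along the time axis and return the response magnitude per frame."""
    n_freq, n_time = g.shape
    if spec.shape[0] != n_freq:
        raise ValueError("filter must span all frequency channels in this sketch")
    n_out = spec.shape[1] - n_time + 1
    out = np.empty(n_out)
    for k in range(n_out):
        patch = spec[:, k:k + n_time]
        out[k] = np.abs(np.sum(patch * np.conj(g)))
    return out

def ripple(n_freq, n_frames, omega_f, omega_t):
    """Synthetic moving ripple used as a stand-in for a spectrogram."""
    F, T = np.meshgrid(np.arange(n_freq), np.arange(n_frames), indexing="ij")
    return 1.0 + np.cos(omega_f * F + omega_t * T)

# Illustrative values only: 23 channels, filter length 31 frames.
omega_f, omega_t = 2 * np.pi / 8, 2 * np.pi / 10
g = gabor_filter(23, 31, omega_f, omega_t)
matched = gabor_response(ripple(23, 100, omega_f, omega_t), g)
opposite = gabor_response(ripple(23, 100, omega_f, -omega_t), g)
print("mean response, matched ripple: ", matched.mean())
print("mean response, opposite ripple:", opposite.mean())
```

A bank of such filters with different values of omega_f and omega_t quickly becomes very large, which illustrates why feature selection is a central problem for this class of features.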
In this special session, we try to provide a forum for people working on the modelling of the representations of speech in the auditory cortex and those who develop speech processing algorithms based on abstractions of these models. We are seeking contributions that take into account the dynamic behaviour of speech via a joint spectro-temporal representation. Areas of research ideally suited to this session are:
- Measurements of the processing in the mammalian auditory cortex.
- Models of the representation of speech in the auditory cortex.
- Algorithmic abstractions of such models deployed for speech enhancement, separation or recognition.
In particular, our goal is to raise awareness of the impact that neuroscience can have on speech algorithms and to bring together traditional INTERSPEECH participants and people investigating the neurobiological basis of speech. In doing so, we follow the spirit of the plenary talk by Sophie Scott at INTERSPEECH 2007 and the panel discussion by Anne Cutler and Roger Moore at INTERSPEECH 2005, and we want to give momentum to the exchange between these two fields of research.
If you are interested in the special session and would like more information, just send the organiser an email.