Seminar on Computational Auditory Scene Analysis

Full-day seminar on CASA, concerned with the understanding and interpretation of auditory sources.


To register for this event you must be logged in as a member of ASIP-NET.

Time and place

May 24, 2007, 9.00-17.00 (registration opens 8:30)
Technical University of Denmark, Building 208, Auditorium 51, 2800 Kongens Lyngby, Denmark
Directions: DTU Map, Google Map


Computational auditory scene analysis aims to develop computational models for the understanding and interpretation of auditory information sources. Current core research challenges include:

  • Conversion of real-time acoustical signals into abstract symbolic/semantic representations, with applications in, e.g., audio retrieval, search, and organization.
  • Separation of individual sources from a mixture of auditory streams in a way that meets human listeners' expectations. This has many applications in hearing aids and general interference cancellation.
  • A major obstacle for CASA systems is that human listening tests are often required. Modeling human perception is thus essential, both as prior knowledge to build into the CASA model and for system evaluation in general.

The seminar will address such issues through presentations by leading researchers in the field as well as highlights of current university research projects.

Download lecture slides from the file archive.




Registration and coffee


Welcome by Jan Larsen, IMM/DTU


Mechanisms and models of normal and impaired hearing

Professor Dr.rer.nat Torsten Dau, Center for Applied Hearing Research, Ørsted, Technical University of Denmark

There are at least two main reasons why auditory processing models are constructed: to represent the results from a variety of experiments within one framework, and to explain the functioning of the system. Specifically, processing models help generate hypotheses that can be explicitly stated and quantitatively tested for complex systems. The models can also help determine how a deficit in one or more components affects the overall operation of the system. The development of auditory models has been hampered by the complexity of the individual auditory processing stages and their interactions. This resulted in a multiplicity of auditory models described in the literature which differ in their degree of complexity and quantification. Most of the models can be broadly referred to as functional models, that is, they simulate some experimentally observed input-output behavior of the auditory system without explicitly modeling the precise internal physical mechanisms involved. In this talk, several models of normal and impaired auditory processing are summarized. The emphasis will be on biologically inspired perception models and the perceptual consequences of hearing impairment with respect to frequency analysis, temporal fine-structure and envelope (modulation) analysis as well as speech perception. Some of the models can be useful for technical applications, such as improved man-machine communication by employing auditory-model-based processing techniques, new processing strategies in digital hearing aids, or as a basis for computational auditory scene analysis.
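The envelope (modulation) analysis mentioned above can be illustrated with a minimal sketch of one stage common to such functional models: half-wave rectification followed by low-pass filtering to extract the temporal envelope. This is an illustrative simplification, not the actual model presented in the talk; the cutoff frequency and filter order are arbitrary choices.

```python
import numpy as np

def envelope(signal, fs, cutoff_hz=150.0):
    """Extract the temporal envelope of a single-channel signal by
    half-wave rectification followed by a first-order low-pass filter,
    a stage found in many functional auditory processing models."""
    rectified = np.maximum(signal, 0.0)
    # First-order IIR low-pass: y[n] = a*x[n] + (1-a)*y[n-1]
    a = 1.0 - np.exp(-2.0 * np.pi * cutoff_hz / fs)
    env = np.empty_like(rectified)
    acc = 0.0
    for n, x in enumerate(rectified):
        acc = a * x + (1.0 - a) * acc
        env[n] = acc
    return env

fs = 16000
t = np.arange(fs) / fs
# 1 kHz carrier with 8 Hz amplitude modulation
x = (1.0 + 0.5 * np.sin(2 * np.pi * 8 * t)) * np.sin(2 * np.pi * 1000 * t)
env = envelope(x, fs)
```

In a fuller model this stage would sit after a cochlear (e.g. gammatone) filterbank and before a modulation filterbank; here it simply shows how a slowly varying envelope is recovered from a fast carrier.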

Brief Bio: Torsten Dau is professor at the Technical University of Denmark (department Ørsted-DTU, section Acoustic Technology) and head of the Centre for Applied Hearing Research, established in 2003. His background (Ph.D., Dr. habil.) is in applied physics. His main research interests are within psychoacoustics, computational auditory modeling, neural correlates of auditory function, perceptual consequences of hearing impairment, speech perception, and technical/clinical applications of auditory models.


University Flash - ongoing university projects

  • Binaural processing of fluctuating interaural level differences, Ph.D. student Eric Thompson, CAHR, Ørsted, Technical University of Denmark
  • Speech intelligibility for normal hearing and hearing impaired listeners in simulated room acoustic conditions, Research Assistant Iris Arweiler, CAHR, Ørsted, Technical University of Denmark
  • Monaural audio separation, Ph.D. student Mikkel N. Schmidt, ISP, IMM, Technical University of Denmark
  • Harmonic structure of musical instruments, Ph.D. student Andreas B. Nielsen, ISP, IMM, Technical University of Denmark


Coffee break, networking and posters


Within-channel and across-channel processing in computational models of auditory grouping

Senior Lecturer PhD Guy Brown, Sheffield University, United Kingdom

A number of recent psychophysical studies suggest that processing within individual frequency channels plays a significant role in auditory sound separation. For example, Edmonds (2004) has shown that listeners do not segregate concurrent sounds by grouping frequency regions that have a common interaural time difference (ITD). This finding is problematic for computational auditory scene analysis (CASA) systems that exploit binaural grouping cues, since they invariably use across-frequency grouping by ITD to segregate a target sound from spatially separated interfering sounds. This talk will review previous binaural approaches to CASA, and discuss their shortcomings in relation to Edmonds' findings. An auditory model that exploits within-channel binaural cues is proposed as an alternative. The model is compared with human performance on the same task, in which the speech reception threshold (SRT) is measured for speech and noise stimuli that have consistent or inconsistent ITDs in different frequency bands. Human and model performance are shown to be in qualitative agreement.
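The ITD cue underlying this discussion can be sketched in a few lines: within a single frequency channel, the ITD is commonly estimated as the lag of the peak of the interaural cross-correlation. This is a generic textbook estimator, not the specific model from the talk; the 1 ms lag range and the test signal are illustrative assumptions.

```python
import numpy as np

def itd_per_channel(left, right, fs, max_lag_ms=1.0):
    """Estimate the interaural time difference (ITD) for one frequency
    channel as the lag (in seconds) that maximises the cross-correlation
    between the left- and right-ear signals, searched over a
    physiologically plausible lag range."""
    max_lag = int(round(max_lag_ms * 1e-3 * fs))
    lags = np.arange(-max_lag, max_lag + 1)
    # Trim the edges so circular shifts do not bias the correlation.
    corr = [np.dot(left[max_lag:-max_lag],
                   np.roll(right, -lag)[max_lag:-max_lag])
            for lag in lags]
    return lags[int(np.argmax(corr))] / fs

fs = 16000
rng = np.random.default_rng(0)
s = rng.standard_normal(4000)
left, right = s, np.roll(s, 8)  # right-ear signal delayed by 8 samples
itd = itd_per_channel(left, right, fs)  # positive: left ear leads
```

Across-frequency grouping by ITD, as used in the binaural CASA systems the talk critiques, would apply this estimator per channel and group channels with matching lags.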

Brief Bio:  Guy Brown is a Senior Lecturer in the Department of Computer Science at the University of Sheffield, UK. He has research interests in computational auditory scene analysis, auditory modelling and noise-robust automatic speech recognition.


Perceptual Features of Music – Measured by Listening Experiments, and Computer Modelled from Signal Analysis

Researcher, PhD Esben Skovenborg, TC Electronic, Århus, Denmark

Numerous properties of musical sound are perceived when the sound is listened to. The research presented here involves perceptual properties that can be adjusted by the audio processing that takes place in the (post‑)production of music.

One investigation concerns loudness, a fundamental property of sound, and another investigation concerns a set of properties corresponding to music mastering. Both investigations comprise three parts: 1) Conducting listening experiments, in which ratings from different listeners are used to measure properties of a set of sound segments; 2) Extracting relevant features, based on a signal analysis of the segments; 3) Modelling the relationship between the signal features and the subjective ratings, by means of pattern recognition techniques.
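The third part of this procedure — relating signal features to subjective ratings — can be sketched as a simple linear least-squares fit. The data, features, and weights below are entirely hypothetical stand-ins for the listening-test ratings and signal features described above; the actual work used more elaborate pattern recognition techniques.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical data: 40 sound segments, 3 signal features each
# (imagine, e.g., RMS level, spectral centroid, modulation depth).
features = rng.standard_normal((40, 3))
true_weights = np.array([0.8, -0.3, 0.5])
# Mean listener ratings per segment, with simulated rater noise
# (stands in for the listening-experiment data of step 1).
ratings = features @ true_weights + 0.05 * rng.standard_normal(40)

# Step 3: model the feature-to-rating relationship (here, linearly).
X = np.column_stack([features, np.ones(len(features))])  # add intercept
weights, *_ = np.linalg.lstsq(X, ratings, rcond=None)
predicted = X @ weights
```

A fitted model of this kind can then predict a perceptual rating (e.g. loudness, or a position on a semantic scale) for a new segment directly from its signal analysis, without running a new listening test.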

Mastering processing typically entails some modifications of the music's spectral and dynamic properties, in order to fine-tune the sound. Apart from loudness, however, the multiple perceptual dimensions relevant to mastering are not well-established. Therefore, a set of semantic scales was first developed to represent additional perceptual dimensions. Each of these seven bipolar semantic scales is defined by attributes such as airy, punchy, and fat.

Brief Bio: In 2004, Esben Skovenborg completed an Industrial PhD at TC Electronic, and the Dept. of Computer Science, Aarhus Universitet.  His PhD thesis is titled: "Perceptual Features of Music and Speech – Measured by Listening Experiments, and Computer Modelled from Signal Analysis".
He holds an M.Sc. degree in Music Technology, from The University of York. Esben has also studied music at the Musicians Institute, in California, and enjoys playing the guitar.
Since 2005, Esben Skovenborg has been employed at the Research Dept. of TC Group, where he works with statistical data modelling, listening experiments, and virtual prototyping.


Lunch, networking and posters


Using sound source models to organize mixtures

Professor PhD Dan Ellis, LabROSA, Columbia University, USA

Recovering individual source signals from sound mixtures is almost always highly underconstrained, and is made possible only when additional assumptions are made about the form of the sources, mixture process, or both. Many perceptual phenomena, including restoration and illusions, reveal how strongly human listeners can rely on prior expectations to solve perceptual challenges.  The basis of our computational work is to equate these expectations with internal models of source behavior, delineating the limited subset of possible sounds that are expected to occur, and thereby providing the constraints to solve the problem.  I will review some relevant perceptual phenomena, then discuss how source models, of different degrees of complexity, can be used to help to understand and separate sound mixtures, including speech mixed with nonstationary interference.
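One way to make the "source models as constraints" idea concrete is a toy factorial-codebook search: each source is represented by a small set of spectral templates, and the pair of templates that best explains the mixture determines how each frequency bin is assigned. This is a deliberately minimal sketch of the general idea, not the models presented in the talk; the codebooks and spectra below are invented.

```python
import numpy as np

def separate_frame(mix, codebook_a, codebook_b):
    """Toy model-based separation of one magnitude-spectrum frame.
    Search all pairs of templates (one per source) for the pair whose
    element-wise maximum best matches the mixture, then assign each
    frequency bin to the dominant source via a binary mask."""
    best, best_err = None, np.inf
    for a in codebook_a:
        for b in codebook_b:
            err = np.sum((mix - np.maximum(a, b)) ** 2)
            if err < best_err:
                best_err, best = err, (a, b)
    a, b = best
    mask_a = a >= b  # bins where source A's model dominates
    return np.where(mask_a, mix, 0.0), np.where(mask_a, 0.0, mix)

# Invented 4-bin spectral codebooks for two sources.
cb_a = np.array([[5., 1., 0., 0.], [0., 4., 2., 0.]])
cb_b = np.array([[0., 0., 3., 6.], [1., 0., 0., 5.]])
mix = np.maximum(cb_a[0], cb_b[0])  # mixture built from one template of each
est_a, est_b = separate_frame(mix, cb_a, cb_b)
```

The point of the sketch is the role of the models: the codebooks delineate the limited subset of spectra each source is expected to produce, and that prior expectation is what makes the otherwise underconstrained assignment of mixture energy possible.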

Brief Bio: Dan Ellis is an Associate Professor of Electrical Engineering at Columbia University in New York, where he heads the Laboratory for Recognition and Organization of Speech and Audio (LabROSA).  Before joining Columbia, he worked on robust speech recognition at the International Computer Science Institute in Berkeley, CA, where he remains an external fellow.  His dissertation on Computational Auditory Scene Analysis was completed at the MIT Media Lab.  Current research in his group covers many aspects of extracting information from sound spanning environmental audio, music, and marine mammal vocalizations.


Monaural cocktail party processing

Professor PhD DeLiang Wang, Oticon A/S (on leave from the Ohio State University, USA)

Speech segregation, or the cocktail party problem, has proven to be highly challenging. This talk describes a CASA approach to the cocktail party problem. This approach performs auditory segmentation and grouping in a two-dimensional time-frequency representation that encodes proximity in frequency and time, periodicity, amplitude modulation, and onset/offset. In segmentation, our model decomposes the input mixture into contiguous time-frequency segments. Grouping is first performed for voiced speech where detected pitch contours are used to group voiced segments into a target stream and the background. In grouping voiced speech, resolved and unresolved harmonics are dealt with differently. Grouping of unvoiced segments is based on the Bayesian classification of acoustic-phonetic features. This CASA approach has led to major advances in speech segregation.  
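The time-frequency grouping described above is often evaluated against the ideal binary mask, which keeps each time-frequency unit where the target's local energy exceeds the interference. The sketch below illustrates that notion on invented toy spectrograms; it is an evaluation target commonly used in this line of work, not the segmentation-and-grouping system of the talk itself.

```python
import numpy as np

def ideal_binary_mask(target_tf, interference_tf, lc_db=0.0):
    """Ideal binary mask over a time-frequency energy representation:
    keep each unit where the local target-to-interference ratio exceeds
    the local criterion (lc_db, in dB); discard the rest."""
    eps = 1e-12
    snr_db = 10 * np.log10((target_tf + eps) / (interference_tf + eps))
    return (snr_db > lc_db).astype(float)

# Invented energy spectrograms (frequency x time):
# the target dominates the low bins early and the top bin late.
target = np.array([[9., 8., 1.],
                   [7., 1., 1.],
                   [1., 1., 6.]])
interference = np.array([[1., 1., 5.],
                         [1., 6., 5.],
                         [5., 5., 1.]])
mask = ideal_binary_mask(target, interference)
segregated = mask * (target + interference)  # target stream estimate
```

A CASA system of the kind described in the abstract has no access to the clean target, of course; its segmentation and pitch- or feature-based grouping stages amount to estimating such a mask from the mixture alone.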

Brief Bio: DeLiang Wang is currently a visiting scholar at Oticon A/S, Denmark. He is a Professor in the Department of Computer Science & Engineering and at the Center for Cognitive Science at the Ohio State University, USA, where he directs the Perception and Neurodynamics Laboratory.


Coffee and networking


ASA calling CASA

Chief, Speech and Hearing Research, Pierre Divenyi, VA Medical Center and EBIRE, Martinez, California, USA

CASA obtained its name, and earned its reputation, from using up-to-date auditory models to emulate, and where possible surpass, the separation of principally speech sources by human listeners under adverse acoustic conditions. This brief presentation will first focus on a few selected processes adopted by CASA that are thought to be used by the auditory system for the separation of multiple simultaneous sources, such as (amplitude-) modulation detection interference, pitch discrimination interference, and concurrent localization of sources. In the second part, the talk will list a few areas where CASA may benefit from incorporating other processes observed in human listeners, such as prediction of FM trajectories linked to an ability to perform frequency velocity estimation, high sensitivity to frequency change acceleration, and economic use of higher-level linguistic information. Finally, a suggestion will be made to merge CASA principles with ideas germane to ICA-type separation, such as temporal decomposition and basis functions.

Brief Bio: Pierre Divenyi has a great future behind him as a classical musician. Since diverging from piano to science and obtaining his doctorate in systematic musicology, he has been working in hearing and speech research first investigating temporal processing and localization and, for the last 20 years, separation of non-speech and speech signals and its decline in aging. He has also organized and directed workshops on speech separation and speech dynamics.


Wrap up


The seminar is supported by the Signal Processing Chapter of the IEEE Denmark Section


Member Comments

Mikkel Schmidt Wednesday, 02.05.2007 14:56
Yes, the correct date is May 24, 2007. Thank you.
Orges Balla Tuesday, 01.05.2007 21:39
Time and place
February 24, 2007, 9.00-17.10

Don't you mean May 24, 2007?
