Extraction of a single voice from a sung harmony

Audio Signal Processing Research

Source Signal Separation

Since 1998, a series of projects has been undertaken at York investigating the challenges involved in, and the potential applications of, extracting selected sound 'structures' from a monophonic audio signal - each structure being associated with an individual 'source', voice or instrument within a mixture.

The idea of isolating just a selected part of an audio signal is not itself new (illustrated best perhaps by the 'cocktail party problem', where it is desired to extract just one conversation from a general mixture), but most existing work in telecommunications, teleconferencing, forensic and defence applications has concentrated on the 'many-to-many' problem, where multiple input signals are available. In the case of audio, however, a tremendous amount of material is available only in mono form - effectively all film, radio and music output from before the 1960s.

The work at York has initially concentrated on the 'one-to-many' problem of essentially 'demixing' a mono source to multiple tracks, with the next research stage being the 'two-to-many' problem of handling stereo signals in a similar fashion. In either case, the basic issue is the limited amount of data available - if only one or two input tracks are available, additional information will be required to drive the separation process. This additional information can take the form of instrument-specific models, physical constraints, or a priori user-specified information about the nature and acceptable parameters associated with a particular sound source, for example.

The general approach developed at York here involves the processing of the raw data via a combination of multiple time-frequency, wavelet, structural and statistical analysis techniques together with the use of a variety of application-specific a priori types of information, in the form of parametric models, physical constraints, user-specified information or additional data derived from other sources. These are used in combination with an interpretative stage where the multiple information streams provided by the processed data and the assorted priors/constraints must be combined in an adaptive fashion using Bayesian methods, forward modelling and/or optimization/fitting techniques.
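As a concrete illustration of the time-frequency analysis underpinning this approach, the following sketch computes a short-time Fourier transform. This is a minimal, illustrative implementation, not the York system (which combines several such analyses); the window and hop sizes are arbitrary choices.

```python
import numpy as np

def stft(x, win_len=512, hop=128):
    """Short-time Fourier transform: one row per frame, one column per frequency bin."""
    window = np.hanning(win_len)
    n_frames = 1 + (len(x) - win_len) // hop
    frames = np.stack([x[i * hop : i * hop + win_len] * window
                       for i in range(n_frames)])
    return np.fft.rfft(frames, axis=1)

# A 440 Hz tone sampled at 8 kHz: its energy should concentrate
# near bin 440 / 8000 * 512 = 28.16.
sr = 8000
t = np.arange(sr) / sr
tone = np.sin(2 * np.pi * 440 * t)
spec = stft(tone)
peak_bin = int(np.argmax(np.abs(spec).mean(axis=0)))
```

The resulting time-frequency grid is the raw material on which masks, models and constraints can then operate.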

Using some or all of these methods, the overall aim is to separate the original signal into two or more separate channels, with each channel being related to a desired structure within the original audio - such as individual musical instruments, the human voice, sound sources, noise, etc.

The various potential applications offer different challenges in terms of the complexity of the original mixture, the number of channels required, and the acceptable audio fidelity. Several applications are discussed in more detail below.

The common feature of all of these applications is that each has the potential to outperform established techniques to a very significant degree, and/or offers the means to achieve results that would not be possible using normal methods. At the same time there are general issues in signal analysis, instrument/sound source modelling and optimization that need investigation, but which have implications for all such applications. For example:

  • Multi-Rate Fourier Approaches;
  • Wavelet Approaches;
  • Adaptive Hybrid Approaches;
  • Inverse Approaches to Audio Separation;
  • Adaptive Parametric Transforms for Audio;
  • Coding of User-Specified Constraints and Information about Audio Signals;
  • Non-Linear Parametric Models of Musical Instruments.

Remastering of monophonic audio material to stereo or surround sound

Preliminary studies at York have confirmed that it is possible to separate a monophonic musical recording into multiple tracks by the use of prior parametric source models as part of a structured analysis stage, the output of which is then used to define the controlling parameters of a suite of adaptive filters. Separations have been carried out of up to seven simultaneous pitched (violin) notes, as well as for instrument mixtures (such as saxophone, violin, clarinet and piano), and for the extraction of individual instruments from a commercial recording, allowing the remixing of that recording. The potential for high fidelity remastering of audio and film soundtrack material to stereo and surround sound is considerable.
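The idea of prior-model-driven filtering can be sketched as follows, assuming (for illustration only) that the pitch of the target note is known a priori; in the real system such parameters come from the structured analysis stage, and a fixed harmonic mask here stands in for the suite of adaptive filters.

```python
import numpy as np

def harmonic_mask(freqs, f0, n_harm=10, width=15.0):
    """Boolean mask selecting narrow bands around the harmonics of f0 (Hz)."""
    mask = np.zeros_like(freqs, dtype=bool)
    for k in range(1, n_harm + 1):
        mask |= np.abs(freqs - k * f0) < width
    return mask

def extract_source(mix, sr, f0):
    """Keep only spectral content near the harmonics of a known pitch f0."""
    spectrum = np.fft.rfft(mix)
    freqs = np.fft.rfftfreq(len(mix), 1.0 / sr)
    spectrum[~harmonic_mask(freqs, f0)] = 0.0
    return np.fft.irfft(spectrum, n=len(mix))

# Two simultaneous 'notes'; the 220 Hz one is recovered from the mixture.
sr = 16000
t = np.arange(sr) / sr
note_a = np.sin(2 * np.pi * 220 * t)
note_b = 0.5 * np.sin(2 * np.pi * 330 * t)
est_a = extract_source(note_a + note_b, sr, 220.0)
```

Real instrument mixtures of course involve overlapping harmonics and time-varying pitch, which is precisely what the adaptive filters and prior models must handle.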

Remastering of stereophonic audio material to surround sound

The extraction of multi-channel information from stereo signals is a natural extension of the work already carried out on monophonic signals. As a minimum, the availability of stereo data doubles the information available to the interpretative stage of the algorithm, and it may be useful simply to treat the signals as two separate mono sources. A major extra complexity of a general algorithm, however, is the synchronization of the data content of the two signals: the stereo signals effectively encode some of the spatial information about the relative positions of the various instruments, meaning that information about any one sound source will appear at different places in the two signals. To fully exploit this information, it will be necessary not only to identify and isolate the sound sources, but also to reconstruct their relative spatial positions.
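One simple way to exploit the inter-channel amplitude differences described above is 'pan'-based selection: keep only those spectral bins whose left/right balance matches a target stereo position. This is an illustrative sketch, not the York algorithm, and it ignores inter-channel time delays.

```python
import numpy as np

def pan_select(left, right, target_ratio, tol=0.1, n_fft=1024):
    """Keep only the spectral bins whose left/(left+right) amplitude
    balance matches a target stereo position."""
    L = np.fft.rfft(left, n_fft)
    R = np.fft.rfft(right, n_fft)
    balance = np.abs(L) / (np.abs(L) + np.abs(R) + 1e-12)
    mask = np.abs(balance - target_ratio) < tol
    return np.fft.irfft(L * mask, n_fft), np.fft.irfft(R * mask, n_fft)

# Two sources at different stereo positions: a 100 Hz tone panned well
# to the left and a 300 Hz tone panned centrally.
sr = 1024
t = np.arange(sr) / sr
s1 = np.sin(2 * np.pi * 100 * t)
s2 = np.sin(2 * np.pi * 300 * t)
left, right = 0.9 * s1 + 0.5 * s2, 0.1 * s1 + 0.5 * s2
out_l, out_r = pan_select(left, right, target_ratio=0.9, n_fft=sr)
```

Selecting the balance 0.9 recovers only the left-panned tone, illustrating how the spatial coding in a stereo pair supplies separation information that a mono signal lacks.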

Remastering for enhanced creative control over audio/musical content

A wide range of digital audio processes are already available to the studio engineer, but these can only be applied to whatever master tracks are available - if the source material is already partially or completely mixed, then any chosen effect will be applied to the whole mixture, rather than just a specific selected instrument or source. The compressor/expander process is a noteworthy example of where such problems arise. By separating the material into multiple tracks, the options available to the sound engineer are greatly increased with regard to the parallel application of a wide variety of filtering and audio processes.
The fidelity requirements for such separated processing may be lower than some other separation applications, since the final recording will often be a remixed version of the individual separated and processed portions - hence minor inaccuracies or processing artefacts will be reduced or masked in the remixing stage.
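A toy illustration of why separation helps here, using a bare-bones hard-knee compressor; the arrays standing in for separated tracks are invented for the example.

```python
import numpy as np

def compress(x, threshold=0.5, ratio=4.0):
    """Hard-knee compressor: the part of each sample's magnitude above
    the threshold is divided by the compression ratio."""
    mag = np.abs(x)
    over = mag > threshold
    out = x.copy()
    out[over] = np.sign(x[over]) * (threshold + (mag[over] - threshold) / ratio)
    return out

# Toy stand-ins for two separated tracks: compress only the loud 'drum'
# track, leave the 'vocal' track untouched, then remix.
drum = np.array([0.0, 1.0, -1.0, 0.2])
vocal = np.array([0.1, 0.1, 0.1, 0.1])
remix = compress(drum) + vocal
```

Applied to the already-mixed signal, the same compressor would pump the vocal level up and down with every drum hit; applied per separated track, it affects only the intended source.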

The creation of completely novel processing and creative effects

Most existing creative audio effects are applied to a section of audio in its entirety - the ability to separate the audio into multiple tracks according to the nature of its content opens up a wide range of entirely new creative possibilities. For example, preliminary studies at York have shown that it is possible to separate the harmonic 'sustain' and 'decay' portions of an individual note event from the initial, less harmonic 'onset' and 'transient' portions - this introduces the possibility of time-stretching or pitch-shifting just part of one musical note within a mixture. Likewise, the use of parametric structured models introduces the possibility of the user modifying the basic parameters of the model rather than merely processing portions of the original signal. Initial tests have shown, for example, that the timbral structure of one separated instrument can be used to derive parameters which can drive the resynthesis of an entirely different instrument.
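One widely used way to split steady 'sustain' content from 'transient' content (not necessarily the method used at York) is median filtering of the magnitude spectrogram: smoothing along time favours sustained partials, while smoothing along frequency favours broadband transients. A minimal sketch:

```python
import numpy as np

def running_median(a, k):
    """Median over a sliding window of odd length k along the last axis."""
    pad = k // 2
    ap = np.pad(a, [(0, 0)] * (a.ndim - 1) + [(pad, pad)], mode='edge')
    return np.median(np.stack([ap[..., i:i + a.shape[-1]] for i in range(k)]), axis=0)

def sustain_transient_masks(mag, k=9):
    """Split a magnitude spectrogram (rows = frequency bins, columns =
    frames) into 'sustain' and 'transient' masks."""
    sustain = running_median(mag, k)        # smooth along time
    transient = running_median(mag.T, k).T  # smooth along frequency
    return sustain >= transient, transient > sustain

# Toy spectrogram: a horizontal line (steady partial) plus a vertical
# line (a transient hitting all frequencies in one frame).
mag = np.zeros((32, 32))
mag[10, :] = 1.0
mag[:, 5] = 1.0
sus, tra = sustain_transient_masks(mag)
```

Each mask can then gate the original spectrogram, so that (for instance) only the sustain portion of a note is time-stretched.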

The extraction of sung vocals from music

In the particular case where a fairly realistic physical model of a source can be specified, as in the source-filter model of the vocal tract, work at York has shown that it is possible to use different characteristic features of the model to assist the quality of a separation in different ways. In particular, for sung material, it has been shown that it is possible not only to use a parametric model of the source frequency content to track the overall frequency variations of the vocal output, but also to use a larger-scale formant model to impose further constraints on the acceptable values of the amplitudes of these frequency components, enhancing the ability to 'see through' regions where the vocal material has been heavily masked by other signal content.
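The formant-constraint idea can be sketched as follows, with an entirely invented Gaussian-resonance envelope standing in for a real vocal-tract model: harmonic amplitudes that deviate wildly from the scaled envelope (for example, because they are masked by other signal content) are replaced by the envelope's prediction.

```python
import numpy as np

def formant_envelope(freqs):
    """Crude spectral envelope: a sum of Gaussian resonances. The formant
    centres and bandwidths are invented for illustration only."""
    env = np.zeros_like(freqs, dtype=float)
    for centre, bw in ((700.0, 120.0), (1200.0, 150.0), (2600.0, 250.0)):
        env += np.exp(-0.5 * ((freqs - centre) / bw) ** 2)
    return env

def constrain_amplitudes(raw_amps, harmonic_freqs, max_dev=2.0):
    """Replace measured harmonic amplitudes that deviate wildly from the
    scaled formant envelope with the envelope's prediction."""
    env = formant_envelope(harmonic_freqs)
    scale = np.median(raw_amps / np.maximum(env, 1e-9))  # fit to reliable harmonics
    expected = scale * env
    bad = np.abs(raw_amps - expected) > max_dev * np.maximum(expected, 1e-9)
    out = raw_amps.copy()
    out[bad] = expected[bad]
    return out

# Ten harmonics of a 220 Hz voice; harmonic 4 is corrupted by masking.
harm_freqs = 220.0 * np.arange(1, 11)
true_amps = formant_envelope(harm_freqs)
measured = true_amps.copy()
measured[3] = true_amps[3] * 10 + 5.0
repaired = constrain_amplitudes(measured, harm_freqs)
```

This is the 'seeing through' behaviour in miniature: the large-scale model supplies plausible values where the local measurements cannot be trusted.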

The extraction of a single voice/source/instrument from a complicated mixture


Enhanced onset detection techniques


Automatic transcription of musical material

Much international research is under way into the automatic transcription of musical content. Broadly, the general aim of the algorithms being developed is to establish the number of instruments being played, to identify and classify the individual instruments, to isolate individual note events, and to establish the amplitude, pitch, timing and duration of each such event - the information can then be used as the input to a second-stage process which combines this derived data with musicological information to produce a 'score' for the piece.

In short, any such algorithm is going to struggle with complex mixtures of multiple instruments - the use of model-based separation approaches (to produce multiple separated channels which can be processed individually) is likely to greatly enhance the transcription quality. Furthermore, the result will be more than just a traditional score: as in a traditional score it will contain information about the individual instruments, note timing, note pitch, etc. - but it will also act as an index to the sounds associated with each note event on the various separated tracks, allowing the user complete access to manipulate any selected portion of any of the separated channels.
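As a sketch of the pitch-estimation step that such a transcription front end requires, the following estimates the fundamental of a note by locating the autocorrelation peak; this is a deliberately simple method, and real systems must also cope with polyphony, vibrato and octave errors.

```python
import numpy as np

def estimate_pitch(x, sr, fmin=80.0, fmax=1000.0):
    """Estimate the fundamental frequency by locating the largest
    autocorrelation peak within the plausible lag range."""
    corr = np.correlate(x, x, mode='full')[len(x) - 1:]
    lo, hi = int(sr / fmax), int(sr / fmin)
    lag = lo + int(np.argmax(corr[lo:hi]))
    return sr / lag

# A 200 Hz note with a strong second harmonic.
sr = 8000
t = np.arange(2048) / sr
note = np.sin(2 * np.pi * 200 * t) + 0.5 * np.sin(2 * np.pi * 400 * t)
f0 = estimate_pitch(note, sr)
```

Running such an estimator on each separated channel, rather than on the full mixture, is exactly where model-based separation can improve transcription quality.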

The automatic classification of musical material for archiving and internet delivery

Automatic music classification algorithms have been attracting increasing interest, especially in view of the commercial opportunities of the online selling of music. The internet already offers an established route for the delivery of musical material to purchasers, but significant difficulties arise in allowing the user to navigate and browse any large archive of audio material. One major developing force is the MPEG-7 standard, which embodies the idea of a set of audio descriptors as part of the metadata of the media file. These descriptors raise the possibility of being able to populate large databases with information that allows users to query and gain access to music in novel fashions.

Unfortunately, there are no automatic techniques to generate descriptor information from the base audio file. Even gross classifications such as 'music', 'pop' or 'classical' have proved challenging, and work is ongoing towards the more subtle recognition of genre, style, tempo, timbre, artist, or even simply the sex of a vocal artist.

Current work typically involves the use of pattern classifiers (such as neural networks), preceded by the mapping of certain extracted features of the audio onto a feature or 'anchor' space, defined by a multidimensional set of axes which is intended to provide the classifier with the information necessary to provide some degree of discrimination. A wide range of feature extraction measures (statistics, cepstral coefficients, temporal measures, frequency content, sparseness, etc.) are being investigated by different research groups.
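Two of the simplest such feature-extraction measures can be sketched as follows; this is illustrative only, and real systems use much richer feature sets (cepstral coefficients, temporal statistics, and so on).

```python
import numpy as np

def spectral_centroid(x, sr):
    """Amplitude-weighted mean frequency - a crude 'brightness' feature."""
    mag = np.abs(np.fft.rfft(x))
    freqs = np.fft.rfftfreq(len(x), 1.0 / sr)
    return float(np.sum(freqs * mag) / np.sum(mag))

def zero_crossing_rate(x):
    """Fraction of adjacent samples with a sign change - high for noise-like audio."""
    return float(np.mean(np.signbit(x[:-1]) != np.signbit(x[1:])))

sr = 8000
t = np.arange(sr) / sr
tone = np.sin(2 * np.pi * 1000 * t)
```

A feature vector built from measures like these defines the 'anchor' space on which the pattern classifier then operates.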

The segmentation of an audio file (from a mono or stereo original) into multiple tracks, where single instruments (or groups of similar instruments) have been isolated on a track-by-track basis, has considerable potential as a pre-processing stage for any pattern classification technique. In short, the ability to analyse the frequency and temporal characteristics of individual instruments (or groups) provides a great deal more feature-specific information, and also introduces the possibility of measures which are not available when the audio is treated as a single data stream (e.g. the relative amplitudes of different instruments or groups of instruments over time, and the balance between harmonic and non-harmonic content over time).

Enhanced restoration of local damage to monophonic audio/musical material

A wide range of audio restoration techniques exist for handling 'localized' degradation of audio material, such as clicks, low-frequency noise transients or, for digital signals, the complete loss of a number of samples due to clipping or transmission error. The restoration methods are most commonly statistical or interpolative techniques that attempt to predict a suitable waveform to 'patch' the gap in the audio.

The current restoration methods are usually applied to the signal as a whole, however - if the signal is first segmented into multiple channels, each representing a different sound source or type of audio, then each channel can be restored separately, and the results then recombined to produce the overall restored section. This allows the use of multiple models or statistics, and the approach will be much more robust than trying to parameterize the complexities of a hybrid signal in a single process.

In essence, when the full signal has been segmented, it can be argued that although the waveforms themselves are changing fairly rapidly, the underlying parameters that characterize the signal due to each sound source (e.g. pitch, amplitude, frequency structure, etc.) are varying at a much slower rate - the values of, and variations in, these parameters will be mixed, masked or cancelled in the original signal, but can be extracted from at least some of the separated channels. This will lead to enhanced estimates of any missing material.
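The slowly-varying-parameters argument is the basis of autoregressive (AR) interpolation, a standard statistical restoration technique: fit a linear predictor to the material before the gap and extrapolate across it. A minimal sketch (least-squares AR fit, one-sided extrapolation; practical restorers interpolate from both sides of the gap):

```python
import numpy as np

def ar_fit(x, order):
    """Least-squares AR coefficients a, so that x[n] ~ sum_k a[k] * x[n-1-k]."""
    rows = np.stack([x[order - 1 - k: len(x) - 1 - k] for k in range(order)], axis=1)
    coeffs, *_ = np.linalg.lstsq(rows, x[order:], rcond=None)
    return coeffs

def patch_gap(x, start, length, order=32):
    """Fill x[start:start+length] by AR extrapolation from the samples before it."""
    out = x.copy()
    a = ar_fit(x[:start], order)
    for n in range(start, start + length):
        out[n] = float(a @ out[n - order:n][::-1])  # most recent sample first
    return out

# A sinusoid is an AR(2) process, so a tiny model suffices for this demo;
# real audio needs a much higher order (hence the default of 32).
sr = 8000
t = np.arange(2000) / sr
clean = np.sin(2 * np.pi * 200 * t)
damaged = clean.copy()
damaged[1000:1050] = 0.0  # 50 lost samples
restored = patch_gap(damaged, 1000, 50, order=2)
```

On a separated channel the underlying parameters vary slowly, so a low-order model of this kind can predict across a gap far more reliably than it could on the full hybrid mixture.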

Enhanced broadband noise removal from complicated wideband audio mixtures

Assorted methods exist for the removal of broad-band noise (hiss) from audio recordings, and these are fairly effective in situations where the signal-to-noise ratio is relatively good. But for high noise levels most methods tend to produce audible artefacts or limit the overall fidelity of the restored sound.

Model-based segmentation of the audio into multiple structured streams offers two approaches to general broadband noise reduction. One option is to separate the audio into a number of channels and resynthesize only those portions that are consistent with the a priori user-specified models - we have investigated this method and found it to produce results that are 'thin' in the sense that they lack some of the complexity of the original sound - but they are remarkably free of broadband noise. An alternative approach is to use the derived parameters not just to drive a resynthesis engine, but instead to establish a set of structured time-varying filters which can partition the original sound into two or more separate channels. Having gathered all the structured (model-based) information into these channels, a 'residual' signal is automatically generated as well - this residual contains broadband noise and any signal content arising from instrumental features (such as breath noise or bow noise) that are inconsistent with the prior models. We have found this to be an effective approach for separating 'noise-like' signals from 'structured' content - the user is then free to process each channel separately to achieve noise removal, with the option to completely remove the residual if appropriate.
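The second approach can be sketched as follows; because the toy source here is stationary, a single fixed spectral mask stands in for the structured time-varying filters, and the known harmonic series of the source plays the role of the a priori model.

```python
import numpy as np

rng = np.random.default_rng(0)
sr = 8000
t = np.arange(sr) / sr
# 'Structured' content: two harmonics of a known 250 Hz source, plus hiss.
structured_true = np.sin(2 * np.pi * 250 * t) + 0.4 * np.sin(2 * np.pi * 500 * t)
noise = 0.05 * rng.standard_normal(sr)
x = structured_true + noise

# Keep narrow bands around the known harmonics; everything the prior model
# cannot account for falls automatically into the residual channel.
spectrum = np.fft.rfft(x)
freqs = np.fft.rfftfreq(len(x), 1.0 / sr)
mask = np.zeros_like(freqs, dtype=bool)
for k in range(1, 9):
    mask |= np.abs(freqs - 250.0 * k) < 10.0
structured = np.fft.irfft(np.where(mask, spectrum, 0.0), n=len(x))
residual = x - structured  # broadband noise + off-model content
```

The user can then process the structured and residual channels separately, including discarding the residual entirely where appropriate.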

Forensic audio


Content-based analysis of audio-visual material

Content-based video analysis, representation, indexing and recovery is an area of current research interest - the aim is to enable the automatic identification of events (i.e. particular actions within a scene) and cuts (i.e. changes in continuity or scene). The identification and classification of the audio content is an important part of the process, with the aim being to relate particular acoustic events to the corresponding visual content. Since the audio portion may generally be a mix of vocal material, assorted acoustic events, background 'noise' and a musical soundtrack, it will be beneficial to segment the audio prior to attempting to extract and identify sound events.
