Audio Signal Processing Research - Results and Demos #3

NOTE: There are still compatibility issues with playing media files in some browsers and on some operating systems. Sample sounds are embedded here using a dedicated player, but if the player bar doesn't appear or fails to run for any reason, direct links to the mp3 files are also provided.

Results page #3 - Improved Source Separation

Recent work with Giorgos Siamantas is addressing a number of issues highlighted by previous separation work at York. In particular, for practical applications the earlier work requires a great deal of user interaction, particularly in the form of a user-produced MIDI score providing approximate information about instrument types, pitches and timings, which assists the algorithm when separating complicated polyphonic melodies. Such user input is not only costly and slow, but also restricts the use of the algorithm and introduces a range of uncertainties: the separation algorithm can be misled by inaccurate information supplied by a user, or can produce different results depending on the information supplied by different users. Current work therefore includes developing a new multipitch 'front-end' to the algorithm, which estimates the fundamental frequencies of the various instruments and removes the need for this user input.
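As a small illustration of the kind of front-end processing involved, the snippet below runs a standard monophonic fundamental-frequency tracker (librosa's pYIN) over an isolated recording. This is only the single-voice building block, not the multipitch estimator described above, and the file name is hypothetical.

```python
# Monophonic F0 estimation with librosa's pYIN tracker - an illustration of
# the single-voice case only; the multipitch front-end described above must
# handle several simultaneous instruments. "solo.wav" is a hypothetical
# isolated recording.
import numpy as np
import librosa

y, sr = librosa.load("solo.wav", sr=None)
f0, voiced_flag, voiced_prob = librosa.pyin(
    y, sr=sr,
    fmin=librosa.note_to_hz("C2"),
    fmax=librosa.note_to_hz("C7"))
print("median estimated F0 (Hz):", float(np.nanmedian(f0)))
```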

Further research is also under way into exploiting the information contained within the residual channel. This channel contains all the data that is not consistent with the a priori restrictions/information in terms of instrument models and characteristics. For example, if we choose to identify, extract and separate energy that corresponds to broadly harmonic structures of partials, then the residual will contain not only true 'noise' sources, but also all energy associated with strongly inharmonic partials and the broadband energy associated with transient events such as the attacks of individual notes. This means that the individual output channels contain separated information about not only the frequency and amplitude variations of the notes produced by multiple instruments, but also, via the content of the residual, the onset times and the strength/duration of the note attack. This opens up a number of new possibilities:

  • Using the information in all of the separate 'demixed' output tracks and the residual to provide revised estimates of the information provided initially by the front-end multipitch estimator, establishing an iterative procedure for improving the overall separation quality;
  • Concentrating on the content of the residual track, using the separation process as a means of removing harmonic energy from the signal, enhancing the relative strength of the individual note attacks and hence providing an enhanced onset detection process;
  • Linking portions of the nonharmonic attack energy within the residual with the associated harmonic decay/sustain/release portion of individual note events within each separated track. This allows new creative control over note characteristics - for example, time stretching or pitch shifting of the decay/sustain/release portion of a note without distortion of the attack portion;
  • Separating the individual instruments within the two channels of a stereo input, and then using the relative timing and amplitude information from all of the output (harmonic and residual) tracks to calculate the relative delays and attenuations of the individual instruments. This provides enhanced estimation of their positions within the stereo image, allowing not only manipulation of the positions of the individual sources within the stereo image, but also an enhanced conversion to a surround sound format (a simple delay/attenuation estimate is sketched below this list).
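As a rough sketch of the delay/attenuation estimate mentioned in the last point above, the relative timing and level of one separated source across the two channels can be obtained by cross-correlation and an RMS ratio. The file names 'left.wav' and 'right.wav' are hypothetical per-channel separated tracks, not files from this project.

```python
# Estimate the relative delay and attenuation of one separated source between
# the two channels of a stereo mix. "left.wav" / "right.wav" are hypothetical
# per-channel separated tracks for a single instrument.
import numpy as np
import soundfile as sf
from scipy.signal import correlate

left, sr = sf.read("left.wav")
right, _ = sf.read("right.wav")
n = min(len(left), len(right))
left, right = left[:n], right[:n]

xcorr = correlate(left, right, mode="full")
lag = int(np.argmax(xcorr)) - (n - 1)   # positive: source arrives later in the left channel
delay_ms = 1000.0 * lag / sr
level_db = 20.0 * np.log10(np.sqrt(np.mean(left**2)) / np.sqrt(np.mean(right**2)))
print(f"relative delay {delay_ms:.2f} ms, level difference {level_db:.1f} dB")
```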

Iterative improvement of separation quality

For example, the graphs below show plots of one particular performance measure (the Signal-to-Distortion Ratio - SDR) for the two individual sources extracted from mixtures of a flute (D6) and a bassoon (A4), for test cases covering a wide range of relative volumes.
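For reference, the sketch below shows the simple energy-ratio form of an SDR measure, computed between a clean reference source and its separated estimate. The BSS_EVAL family of measures additionally decomposes the error into interference and artefact terms, so this snippet should be read as illustrative only.

```python
# A minimal energy-ratio SDR, assuming the clean reference source is available
# and time-aligned with the separated estimate.
import numpy as np

def sdr_db(reference, estimate):
    """Signal-to-distortion ratio (dB) between a reference and its estimate."""
    distortion = estimate - reference
    return 10.0 * np.log10(np.sum(reference**2) / (np.sum(distortion**2) + 1e-12))
```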

The left-hand plot shows the results for a single application of the new separation system, where the initial multipitch detection stage has failed to detect the flute sound within the original mix below a certain critical relative volume level - below this threshold, the bassoon sound simply swamps the flute.

The right-hand plot shows the effect of iteratively repeating the process using the information now available in the output channels. Where the initial step has failed to detect the flute due to its small relative energy, that energy ends up within the residual channel. Although small relative to the initial mix, this energy may now be significant relative to the other content within just the residual, and hence this channel can usefully be fed back to the multipitch detector for further processing. In this particular example, the overall effect is to extend the effective range of relative volumes over which the two individual instruments can be recognised by about 12dB.
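The feedback loop can be summarised as in the pseudocode-style sketch below; multipitch_estimate() and separate() are placeholders standing in for the front-end and separation stages described on this page, not real library calls.

```python
# Iteratively re-analyse the residual and add any newly detected sources.
# multipitch_estimate(signal, sr) should return a list of pitch tracks and
# separate(signal, sr, pitch_tracks) should return (new_sources, residual);
# both are hypothetical placeholders for the stages described above.
import numpy as np

def iterative_separation(mix, sr, multipitch_estimate, separate,
                         max_passes=3, energy_floor=1e-4):
    sources = []                                  # separated tracks found so far
    residual = np.asarray(mix, dtype=float).copy()
    for _ in range(max_passes):
        pitch_tracks = multipitch_estimate(residual, sr)
        if not pitch_tracks:                      # nothing new found in the residual
            break
        new_sources, residual = separate(residual, sr, pitch_tracks)
        sources.extend(new_sources)
        # stop once the residual energy is negligible relative to the mix
        if np.sum(residual**2) < energy_floor * np.sum(np.asarray(mix)**2):
            break
    return sources, residual
```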


Signal-to-distortion ratio for a flute/bassoon mix of varying volume

SDR measures for the two extracted sources at different relative volumes using a single pass of the combined multipitch/separation algorithm.

Improved signal-to-distortion ratio for a flute/bassoon mix of varying volume

SDR measures for the two extracted sources at different relative volumes, showing a significant improvement due to using the information within the residual channel to inform a second pass of the multipitch/separation algorithm.


Similarly, the graphs below illustrate a more difficult case where, unlike the above example, the initial multipitch detector fails below a certain threshold energy ratio for not just one but both instruments in a cello (B3) and saxophone (A3) mixture.

Here, the use of the residual is even more effective, confirming that the pitch detection and signal separation stages are strongly connected processes - it is, of course, easier to separate signals where some pitch information is available, and estimating pitch information is much easier for isolated sources. An iterative implementation where the two processes assist each other is an effective overall approach.


Signal-to-distortion ratio for a cello/saxophone mix of varying volume

SDR measures for the two extracted sources at different relative volumes using a single pass of the combined multipitch/separation algorithm.

Improved signal-to-distortion ratio for a cello/saxophone mix of varying volume

SDR measures for the two extracted sources at different relative volumes, showing a significant improvement due to using the information within the residual channel to inform a second pass of the multipitch/separation algorithm.


In practice, the use of an initial pitch detection stage, together with an iterative improvement process, means that good quality results can be obtained that would previously have required significant user input. Below, for example, the extract from 'African Breeze', performed by Hugh Masekela with Jonathan Butler (previous results page), is processed using the new automated method to isolate just the flugelhorn, which is then remixed at a different volume (all output files are normalised to an RMS average level of -20 dB to ease comparison). This confirms that excellent remixed versions of mono originals can be produced without the need for user interaction in the demixing process.
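The normalisation and remixing steps themselves are straightforward; a minimal sketch is given below, assuming the separation stage has already produced a horn track and a track containing the remaining content (the file names are hypothetical).

```python
# Remix a separated track at a different level and normalise the result to an
# RMS level of -20 dB. "horn.wav" and "rest.wav" are hypothetical outputs of
# the separation stage.
import numpy as np
import soundfile as sf

def normalise_rms(y, target_db=-20.0):
    """Scale a signal so that its RMS level sits at target_db."""
    rms = np.sqrt(np.mean(y**2))
    return y * (10.0 ** (target_db / 20.0)) / max(rms, 1e-12)

horn, sr = sf.read("horn.wav")
rest, _ = sf.read("rest.wav")
n = min(len(horn), len(rest))
remix = 2.0 * horn[:n] + rest[:n]                 # e.g. horn strength doubled
sf.write("remix_horn_doubled.wav", normalise_rms(remix), sr)
```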


Original music sample

Original excerpt from 'African Breeze' (sound file).

Remix with horn strength doubled

Previous separation approach using a MIDI guidance file - remixed version after separation, doubling the strength of the flugelhorn, and remixing (sound file).

New, automated approach - remixed version after separation, doubling the strength of the flugelhorn, and remixing (sound file).


Remix with horn strength halved

Previous separation approach using a MIDI guidance file - remixed version after separation, halving the strength of the flugelhorn, and remixing (sound file).

New, automated approach - remixed version after separation, halving the strength of the flugelhorn, and remixing (sound file).


Enhanced onset detection

The graphs below show the results of the separation process for three isolated notes - a cello, a violin and a saxophone. The left-hand plots show the residual after identification and removal of all (near) harmonic partials, and the right-hand ones show the attack portion as part of the original note waveform. Several points are clear:

  • The energy in the residual is relatively small except for well-defined periods of time associated with the non-harmonic processes during the early stages of the notes;
  • The amplitudes, durations and envelopes of the attacks are quite different for the three instruments;
  • In all cases the duration of the separated attack energy does not correlate well with the parameters estimated via the commonly used attack-decay-sustain-release (ADSR) envelope;
  • In each overall waveform there is no clear boundary between the non-harmonic attack energy and the more harmonic decay/sustain/release portion - instead there are smooth transitions between the two behaviours as the standing waves are established and the partials evolve.
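A crude stand-in for this kind of harmonic/residual split can be obtained with librosa's median-filtering harmonic/percussive separation, as sketched below. This is not the partial-tracking separation used on this page, but it similarly leaves note attacks and other non-harmonic energy in the second output; the file names are hypothetical.

```python
# Illustrative harmonic/"residual" split using librosa's HPSS, used here as a
# rough analogue of the partial-based separation described on this page.
import librosa
import soundfile as sf

y, sr = librosa.load("cello_note.wav", sr=None)   # hypothetical isolated note
harmonic, percussive = librosa.effects.hpss(y)    # treat the percussive part as the residual
sf.write("cello_harmonic.wav", harmonic, sr)
sf.write("cello_residual.wav", percussive, sr)
```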

Residual of a cello note
The residual of an isolated cello note.

Residual of a cello note in the context of the whole note
The residual in the context of the original cello note (note the amplitude axis scale change).


Residual of a violin note
The residual of an isolated violin note.

Residual of a violin note in the context of the whole note
The residual in the context of the original violin note (note the amplitude axis scale change).


Residual of a saxophone note
The residual of an isolated saxophone note.

Residual of a saxophone note in the context of the whole note
The residual in the context of the original saxophone note (note the amplitude axis scale change).


For general onset detection, and for specific applications such as tempo/beat analysis and the parametrisation of sounds for music information retrieval (MIR), a major challenge is detecting the attacks of notes in the presence of other sounds - again, the separation approach provides a powerful way to address this problem.

The graphs below show two different test mixtures of the three instruments above, with different start times and in different orders. For the first (saxophone, violin and cello) mixture the separation process leads to a residual which provides considerably enhanced access to the individual note attacks and allows improved estimation of the onset time and the total attack energy.


Note attacks isolated within the residual channel
The residual signal for a mixture of three different instruments (saxophone, violin and cello) starting at different times (0.5s intervals). The individual note attacks and the corresponding onset times are clearly defined.

Note attacks in the context of the original notes
The residual in the context of the original mixture waveform, showing the extent to which the separation process has isolated the individual note attacks (note the amplitude axis scale change).


The second example below shows the residual from the separation process for the same instruments in reverse order (cello, violin and saxophone). This is a harder case, since the violin and (especially) the saxophone attacks are relatively weak. Nevertheless, the residual channel still provides considerably improved information.


Note attacks isolated within the residual channel

The residual signal (sound file) for a mixture of three different instruments (cello, violin and saxophone) starting at different times (0.5s intervals).

Note attacks in the context of the original notes

The residual in the context of the original mixture (sound file) waveform (note the amplitude axis scale change).


In practice, the availability of the residual and the increased clarity of the note attacks mean that, unlike conventional onset detectors which have to work on the whole signal, the detection process can be applied just to the residual signal, as in the example below.
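A minimal version of this idea is sketched below: a standard spectral-flux onset detector (librosa's, used here purely as an illustration in place of the custom detector described on this page) is run on the residual track alone. The file name 'residual.wav' is hypothetical.

```python
# Run an off-the-shelf onset detector on the residual only. "residual.wav" is
# a hypothetical separated residual track; the page itself uses a custom
# detector rather than librosa's.
import librosa

residual, sr = librosa.load("residual.wav", sr=None)
onset_times = librosa.onset.onset_detect(y=residual, sr=sr, units="time")
print("estimated onset times (s):", onset_times)
```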


Onsets detected from the original signal and from the residual signal


Original signal (black - sound file):

Residual signal (red - sound file):


The original signal and the residual signal for a mixture of three different instruments (cello, saxophone and violin) starting at different times (0.5s intervals). Here, a conventional onset detector applied to the original signal (triangles) fails to find the violin onset at 1.5s and produces some spurious results. In contrast, a custom onset detector applied just to the residual signal correctly locates exactly three onsets, at about 0.5s, 1.0s and 1.5s.


Creative and processing effects using separation

Even within a single note event, being able to separate the attack from the remainder of the note opens up the opportunity for enhanced effects such as time stretching and pitch shifting. For example, the cello note below has been subjected to both such processes - not only in the normal fashion, but also by first identifying, separating and protecting the attack from the processing. The conventional approaches 'soften' or 'draw out' the attack, but the separation results show that it is quite possible to modify only the body of the note, retaining the sharpness and clarity of the attack.
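A simple sketch of the idea is given below; here the attack is cut at a fixed, assumed split point rather than by the energy-based separation used on this page, and the file name is hypothetical.

```python
# Pitch-shift (or time-stretch) the body of a note while leaving its attack
# untouched. The split point "attack_end" is an assumed value standing in for
# the separated attack energy; "cello_F4.wav" is a hypothetical file name.
import numpy as np
import librosa
import soundfile as sf

y, sr = librosa.load("cello_F4.wav", sr=None)
attack_end = 0.08                                  # assumed attack length (s)
split = int(attack_end * sr)
attack, body = y[:split], y[split:]

shifted_body = librosa.effects.pitch_shift(body, sr=sr, n_steps=-7)   # F4 -> Bb3
stretched_body = librosa.effects.time_stretch(body, rate=0.5)         # twice as long

sf.write("cello_Bb3_attack_preserved.wav", np.concatenate([attack, shifted_body]), sr)
sf.write("cello_stretched_attack_preserved.wav", np.concatenate([attack, stretched_body]), sr)
```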


Original F4 cello note (sound file).

Isolated F4 cello note

Cello note with a conventional pitch shift of 7 semitones down to Bb3 applied to the whole note (sound file).

Cello note conventionally pitch-shifted down 7 semitones

Cello note after a pitch shift down of 7 semitones applied to all content except the previously separated attack (sound file).

Cello note pitch-shifted down 7 semitones but with attack unchanged

Cello note with a conventional time stretch of a factor of 2 applied to the whole note (sound file).

Cello note conventionally time-stretched by a factor of 2

Cello note with a time stretch of a factor of 2 applied to all content except the previously separated attack (sound file).

Cello note time-stretched by a factor of 2 but with attack unchanged

Beyond enhanced effects for isolated notes, the ability to extract individual sources from within a melody opens up the opportunity for processing mono and stereo sources in ways that were previously impossible. For example, taking a simple flute/cello mix, it is quite possible to apply a pitch shift to just one instrument - or to apply totally *different* pitch shifts to both instruments.
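A sketch of such a remix is shown below for the D4/D2 case, assuming the flute and cello have already been demixed into separate files (the file names are hypothetical): G4 to D4 is a shift of -5 semitones and C2 to D2 is +2.

```python
# Apply independent pitch shifts to two separated tracks and remix them.
# "flute.wav" and "cello.wav" are hypothetical demixed G4 flute and C2 cello tracks.
import librosa
import soundfile as sf

flute, sr = librosa.load("flute.wav", sr=None)
cello, _ = librosa.load("cello.wav", sr=sr)

flute_d4 = librosa.effects.pitch_shift(flute, sr=sr, n_steps=-5)   # G4 -> D4
cello_d2 = librosa.effects.pitch_shift(cello, sr=sr, n_steps=2)    # C2 -> D2

n = min(len(flute_d4), len(cello_d2))
sf.write("remix_D4_D2.wav", flute_d4[:n] + cello_d2[:n], sr)
```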


Original mixture of G4 flute and C2 cello (sound file)

Original mixture of G4 flute and C2 cello

Remix with the flute pitch-shifted to C4 but the cello unchanged (sound file)

Remix with the flute pitch-shifted to C4 but the cello unchanged

Remix with the flute pitch-shifted to C5 but the cello unchanged (sound file)

Remix with the flute pitch-shifted to C5 but the cello unchanged

Remix with the flute pitch-shifted to D4 and the cello pitch-shifted to D2 (sound file)

Remix with the flute pitch-shifted to D4 and the cello to D2

Remix with the flute pitch-shifted to B3 and the cello pitch-shifted to Bb1 (sound file)

Remix with the flute pitch-shifted to B3 and the cello to Bb1

Remix with the flute pitch-shifted to B3 and the cello pitch-shifted to E2 (sound file)

Remix with the flute pitch-shifted to B3 and the cello to E2

Similarly, in the cello/guitar mix below, it is quite possible to modify just the guitar and leave the cello unchanged. In fact, as above, the key idea is that not only is the cello content not modified by the time-stretching and pitch-shifting processes, but the guitar attack also remains unchanged - only the energy identified as being associated with the broadly harmonic content of the chosen instrument is changed.

Original mixture of B3 cello and Ab3 guitar (sound file)

Original mixture of B3 cello and Ab3 guitar

Remix with two new guitars - one speeded up by 10% and pitch-shifted up to C4, and the second slowed by 10% and pitch-shifted down to E3 (sound file)

Remix with two time-stretched and pitch-shifted guitars

Remix with just the harmonic content of the guitar speeded up by a factor of two (sound file)

Remix with just the harmonic content of the guitar speeded up by a factor of two

More realistically, in the example below the 'African Breeze' sample used earlier has undergone a separation process, identifying and isolating the horn solo from the rest of the music. Then the horn has been subjected to two different pitch shifts, producing four tracks in total, which have then been remixed together. Such a procedure would not normally have been possible without access to multitrack masters of the recording.


Original music sample

Original mono source signal (sound file).


Remix with original horn and two new pitch-shifted versions

New version after extraction of the flugelhorn, pitch-shifting to create two new tracks (plus and minus 5 semitones) and combining all four tracks to produce a completely new mix (sound file).


Similarly, in the example below, the horn has been demixed, pitch-shifted down by an octave, and remixed with the original horn and the remaining content to produce a more harmonious mixture.


Remix with original horn and a new, octave-lower, version

New version after extraction of the flugelhorn, pitch-shifting to create a new track (one octave lower) and combining all three tracks to produce a completely new mix (sound file).


The final (surreal!) example is 'The Deflating Trumpeter', produced by applying a full octave slide down in frequency to the horn before combining it with the remaining (unchanged) content.


Remix with the horn pitch-bending over a full octave range

New version after extraction of the flugelhorn, pitch-bending it over a range of one octave and combining it with the residual to produce a completely new mix (sound file).
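A crude approximation of such a pitch bend is sketched below: short overlapping frames of the separated horn are each pitch-shifted by a progressively larger downward amount and recombined by windowed overlap-add before being mixed back with the remaining content. This is only an illustration of the idea (a phase vocoder with a continuous pitch trajectory would be smoother), and the file names are hypothetical.

```python
# Block-wise "deflating" pitch bend: each overlapping frame of the horn is
# shifted by a linearly increasing amount from 0 down to -12 semitones, then
# the frames are recombined by windowed overlap-add. "horn.wav" and
# "backing.wav" are hypothetical separated tracks.
import numpy as np
import librosa
import soundfile as sf

horn, sr = librosa.load("horn.wav", sr=None)
frame, hop = 8192, 4096
window = np.hanning(frame)
out = np.zeros(len(horn))
norm = np.zeros(len(horn))

starts = list(range(0, len(horn) - frame, hop))
for i, start in enumerate(starts):
    bend = -12.0 * i / max(len(starts) - 1, 1)        # 0 .. -12 semitones
    shifted = librosa.effects.pitch_shift(horn[start:start + frame], sr=sr, n_steps=bend)
    out[start:start + frame] += window * shifted
    norm[start:start + frame] += window

out /= np.maximum(norm, 1e-8)

backing, _ = librosa.load("backing.wav", sr=sr)
n = min(len(out), len(backing))
sf.write("deflating_trumpeter.wav", out[:n] + backing[:n], sr)
```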


A further publication (PDF format) is available, as below...


G. Siamantas, M.R. Every and J.E. Szymanski,
'Separating Sources From Single-Channel Musical Material: A Review And Future Directions'
Proceedings of the Digital Music Research Network Summer Conference 2006, Goldsmiths College, University of London, U.K. (22-23 July 2006).

 
