New Frontiers in Music Information Processing (MIP-Frontiers)

This was a temporary web page. The permanent page is:

MIP-Frontiers is a European Training Network funded by the European Commission from 1 April 2018 to 31 March 2022. We recruited 15 PhD students across a range of exciting projects in collaboration with our industry and cultural partners (listed below), to study at one of the four institutions: Queen Mary University of London, Universitat Pompeu Fabra, Telecom ParisTech, and Johannes Kepler University of Linz.

To give an overview of the whole project, the abstract from the proposal is included below, followed by the list of members of the MIP-Frontiers consortium, and a table of the PhD projects (including the host organisation, where most of the work is done, and the secondment organisation).


Music Information Processing (also known as Music Information Research; MIR) involves the use of information processing methodologies to understand and model music, and to develop products and services for creation, distribution and interaction with music and music-related information. MIR has reached a state of maturity where there are standard methods for most music information processing tasks, but as these have been developed and tested on small datasets, the methods tend to be neither robust to different musical styles or use contexts, nor scalable to industrial scale datasets. To address this need, and to train a new generation of researchers who are aware of, and can tackle, these challenges, we bring together leading MIR groups and a wide range of industrial and cultural stakeholders to create a multidisciplinary, transnational and cross-sectoral European Training Network for MIR researchers, in order to contribute to Europe's leading role in this field of scientific innovation, and accelerate the impact of innovation on European products and industry. The researchers will develop breadth in the fields that make up MIR and in transferable skills, whilst gaining deep knowledge and skills in their own area of speciality. They will learn to perform collaborative research, and to think entrepreneurially and exploit their research in new ways that benefit European industry and society. The proposed work is structured along three research frontiers identified as requiring intensive attention and integration (data-driven, knowledge-driven, and user-driven approaches), and will be guided by and grounded in real application needs by a unique set of industrial and cultural stakeholders in the consortium, which range from consumer electronics companies and big players in media entertainment to innovative SMEs, cultural institutions, and even a famous opera house, thus encompassing a very wide spectrum of the digital music world.



PhD Projects

PhD Topic | Supervisor | Host | Secondment
Representations and models for singing voice transcription | Simon Dixon | QMUL | DRM
Instrument modelling to aid polyphonic transcription | Simon Dixon | DRM | QMUL
Leveraging user interaction to learn performance tracking | Simon Dixon | QMUL | TIDO
Fine grain time resolution audio features for MIR | Mark Sandler | ROLI | QMUL
Note level audio features for understanding and visualising musical performance | Mark Sandler | QMUL | ROLI
Tag propagation from structured to unstructured audio collections | Xavier Serra | UPF | JAM
Extending audio collections by combining audio descriptions and audio transformations | Xavier Serra | UPF | NI
Audio content description of broadcast recordings | Emilia Gómez | UPF | BMAT
Behavioural music data analytics | Gael Richard | TPT | DZ
Voice models for lead vocal extraction and lyrics alignment | Gael Richard | TPT | AN
Multimodal movie music track remastering | Gael Richard | TPT | TC
Context-driven music transformation | Gael Richard | TPT | TC
Defining, extracting and recreating studio production style from audio recordings | Gael Richard | SONY | TPT
Large-scale multi-modal music search and retrieval without symbolic representations | Gerhard Widmer | JKU | KI
Live tracking and synchronisation of complex musical works via multi-modal analysis | Gerhard Widmer | JKU | VSO

Representations and models for singing voice transcription

Supervisor: Simon Dixon (Host: QMUL; Secondment: DRM)

Singing is the most universal form of music-making, yet no suitable notation exists for performed singing. Western notation conceptualises music as sequences of unchanging pitches being maintained for regular durations, and has little scope for representing expressive use of microtonality and microtiming, nor for detailed recording of timbre and dynamics. Research on automatic transcription has followed this narrow view, describing notes in terms of discrete pitches plus onset and offset times. Cross-cultural studies, building on Lomax's Cantometrics, focus more on the context than the musical content. This project goes beyond the state of the art by devising a detailed representation scheme for singing, and developing algorithms using this representation for automatic transcription and for assessment of various characteristics such as similarity to a reference recording, accuracy of pitch and timing, quality of vocal sound, and the scale and intonation which are used in the singing. By modelling musical knowledge and stylistic conventions, an algorithm for producing a reduction of a singing performance to Western score notation will be developed. In the final stage of the project, these algorithms will be applied in a cross-cultural study of singing, to evaluate the robustness of the representations and correctness of the assumptions. For this project, DRM will provide its dataset of over 2.4M singing recordings.
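As a small illustration of what "accuracy of pitch" can mean once microtonal detail is kept, the sketch below measures how far each frame of a sung pitch track lies from the nearest equal-tempered note, in cents. It is only an illustration, not the project's representation scheme: the A4 = 440 Hz reference, the function names and the example track are all assumptions.

```python
import math

A4_HZ = 440.0  # assumed reference tuning; not specified in the project text


def hz_to_midi(f0_hz):
    """Map a frequency in Hz to a fractional MIDI note number."""
    return 69.0 + 12.0 * math.log2(f0_hz / A4_HZ)


def cents_from_nearest_note(f0_hz):
    """Return (nearest MIDI note, signed deviation from it in cents)."""
    midi = hz_to_midi(f0_hz)
    nearest = round(midi)
    return nearest, 100.0 * (midi - nearest)


# Hypothetical sung pitch track (Hz), drifting slightly sharp of A4.
track = [440.0, 442.0, 445.0, 443.0]
deviations = [cents_from_nearest_note(f)[1] for f in track]
mean_deviation = sum(deviations) / len(deviations)
```

A score-level transcription would keep only the nearest note, whereas the signed deviations retain the intonation detail that the paragraph above argues is lost in Western notation.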

Instrument modelling to aid polyphonic transcription

Supervisor: Simon Dixon (Host: DRM; Secondment: QMUL)

A leading current approach to transcription is based on the factorisation of a time-frequency representation into a dictionary of instrumental sounds and a matrix of instrument activities over time, where the dictionary typically contains one or very few templates per pitch and instrument. We have extended such models to capture aspects of the temporal evolution of tones, but these approaches fall short of modelling the full range of sounds produced by an instrument. Research in music acoustics provides detailed models of the mechanics of instruments and the resulting range of sounds that they produce, and the goal of this project is to apply this knowledge to the analysis of polyphonic mixtures of instruments, in order to achieve more accurate decompositions. The final objective is to develop a fully automatic system, which performs instrument recognition in an initial analysis stage, and then adapts the transcription algorithm on the basis of the instrument models which are relevant to the recording.
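The factorisation described above can be sketched with non-negative matrix factorisation (NMF) using multiplicative updates. This is a minimal pure-Python illustration on a toy "spectrogram", not the models developed in the project; the Euclidean cost, the update rule, the matrix sizes and all names are assumptions made for the example.

```python
import random


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(m):
    return [list(col) for col in zip(*m)]


def frob_error(v, w, h):
    """Squared Frobenius norm of V - WH."""
    wh = matmul(w, h)
    return sum((v[i][j] - wh[i][j]) ** 2
               for i in range(len(v)) for j in range(len(v[0])))


def nmf(v, rank, iters=200, seed=0, eps=1e-9):
    """Factorise V (bins x frames) into a dictionary W and activations H."""
    rng = random.Random(seed)
    n_bins, n_frames = len(v), len(v[0])
    w = [[rng.random() + 0.1 for _ in range(rank)] for _ in range(n_bins)]
    h = [[rng.random() + 0.1 for _ in range(n_frames)] for _ in range(rank)]
    for _ in range(iters):
        # Update activations: H <- H * (W^T V) / (W^T W H)
        wt = transpose(w)
        num, den = matmul(wt, v), matmul(matmul(wt, w), h)
        h = [[h[k][t] * num[k][t] / (den[k][t] + eps) for t in range(n_frames)]
             for k in range(rank)]
        # Update dictionary: W <- W * (V H^T) / (W H H^T)
        ht = transpose(h)
        num, den = matmul(v, ht), matmul(w, matmul(h, ht))
        w = [[w[f][k] * num[f][k] / (den[f][k] + eps) for k in range(rank)]
             for f in range(n_bins)]
    return w, h


# Toy example: two spectral templates (columns) active at different times.
w_true = [[1.0, 0.0], [0.5, 0.0], [0.0, 1.0], [0.0, 0.7]]
h_true = [[1.0, 0.0, 1.0, 0.5, 0.0], [0.0, 1.0, 1.0, 0.0, 0.8]]
v = matmul(w_true, h_true)
w, h = nmf(v, rank=2)
error = frob_error(v, w, h)
```

Each column of `w` plays the role of one template in the dictionary and each row of `h` is its activity over time. With only one template per source, as here, the variation within an instrument's sound cannot be represented, which is the limitation the project targets.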

Leveraging user interaction to learn performance tracking

Supervisor: Simon Dixon (Host: QMUL; Secondment: TIDO)

Music signal processing algorithms traditionally rely on hand-crafted features, which often fail to generalise to different instruments, acoustic environments and recording conditions. This limitation is observed when current music alignment techniques are taken from laboratory settings and applied to real-world situations. Deep learning is central to state of the art approaches for processing of multi-dimensional and time-based media such as speech, images and video, and more recently for music, which shares characteristics such as complex temporal dependencies and the “feature engineering bottleneck” with these other domains. In principle, deep networks can capture both low-level features of relevance to the alignment task and higher-level mappings between feature sequences of corresponding performances. The use of deep learning is often hindered by insufficient training data; in this case a lack of ground truth alignments. In this project we leverage the Tido platform to obtain training data for alignment via intuitive user interactions with the system. This will in turn improve the user experience, creating a virtuous cycle in which the Tido system learns to adapt to different acoustic and musical conditions. The central research question is how to design interactions between user and system which simultaneously optimise both the user experience and the gathering of training data.
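For readers unfamiliar with alignment, a classical baseline is dynamic time warping (DTW) over hand-crafted feature sequences. The text does not name DTW, so treat this as an assumed illustration of the kind of technique that learned features would feed into or replace; the scalar "features" and variable names are hypothetical.

```python
def dtw(a, b):
    """Align two feature sequences; return (total cost, list of index pairs)."""
    n, m = len(a), len(b)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])  # local distance between frames
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    # Backtrack from the end to recover the warping path.
    path = []
    i, j = n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        steps = [(cost[i - 1][j - 1], i - 1, j - 1),
                 (cost[i - 1][j], i - 1, j),
                 (cost[i][j - 1], i, j - 1)]
        _, i, j = min(steps)
    path.reverse()
    return cost[n][m], path


# Hypothetical one-dimensional features: the performance lingers on the second note.
score_features = [1.0, 2.0, 3.0]
performance_features = [1.0, 2.0, 2.0, 3.0]
total_cost, path = dtw(score_features, performance_features)
```

Each pair in `path` links a score position to a performance frame. Ground-truth pairs of this kind are the training data that the paragraph above says are scarce, and which user interactions could supply.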

Fine grain time resolution audio features for MIR

Supervisor: Mark Sandler (Host: ROLI; Secondment: QMUL)

MIR researchers tend to choose from a well established set of possible signal processing features to help them extract meaning from audio. Typically, these features are computed over time windows of tens of milliseconds to several seconds. This project expands that toolset by undertaking a thorough investigation of two less used but highly appropriate signal features, both of which are capable of finer resolution in the time domain without sacrificing frequency domain resolution. In order to understand how human hearing physiology can inspire better features, the work will use computational models of human hearing, the best known of which are the so-called gammatone filterbanks. These can provide information on both timbral and harmonic aspects of the music. To exploit the acoustic properties of many musical instruments, another aspect of this research is to use Linear Predictive Analysis and Coding (LPC) that separately captures resonance and divergence from linearity as, respectively, filter parameters and residual, and can operate over shorter time windows than Fourier analysis. Here we expect that both the analysis parameters and the residual will capture useful information. Likely outcomes include the development of new chromagram-like features (used to represent harmonic content in music) and their use in structural segmentation of musical pieces, or new metric features for more accurate rhythmic structure.
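The split into filter parameters and residual can be illustrated with the autocorrelation method of linear prediction, solved by the Levinson-Durbin recursion. This is a sketch under assumptions: the model order, the synthetic test signal and the function names are chosen for the example and are not taken from the project.

```python
import random


def autocorrelation(x, max_lag):
    """Autocorrelation of x for lags 0..max_lag."""
    return [sum(x[n] * x[n - lag] for n in range(lag, len(x)))
            for lag in range(max_lag + 1)]


def lpc(x, order):
    """Levinson-Durbin recursion; returns error-filter coefficients [1, a1, ..., ap]."""
    r = autocorrelation(x, order)
    a = [1.0] + [0.0] * order
    err = r[0]
    for i in range(1, order + 1):
        acc = r[i] + sum(a[j] * r[i - j] for j in range(1, i))
        k = -acc / err  # reflection coefficient
        prev = a[:]
        for j in range(1, i):
            a[j] = prev[j] + k * prev[i - j]
        a[i] = k
        err *= 1.0 - k * k
    return a


def residual(x, a):
    """Filter x with the error filter; samples before the start count as zero."""
    return [sum(a[j] * x[n - j] for j in range(len(a)) if n - j >= 0)
            for n in range(len(x))]


# Synthetic resonant signal (assumed): x[n] = 1.6 x[n-1] - 0.8 x[n-2] + noise.
rng = random.Random(0)
x = [0.0, 0.0]
for _ in range(2000):
    x.append(1.6 * x[-1] - 0.8 * x[-2] + rng.gauss(0.0, 1.0))

a = lpc(x, order=2)  # filter parameters (the resonance)
e = residual(x, a)   # what the linear model does not explain
signal_energy = sum(s * s for s in x)
residual_energy = sum(s * s for s in e)
```

Here `a` approximately recovers the resonance used to generate the signal, while `e` carries the excitation. In a real analysis both would be computed per short frame and could be used as features.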

Note level audio features for understanding and visualising musical performance

Supervisor: Mark Sandler (Host: QMUL; Secondment: ROLI)

We all know that musical instruments can be played with different levels of skill (or virtuosity). However, current MIR research tools rarely address this, largely because the signal processing is not straightforward. This PhD project will investigate performance at the note level so that fine detail can be captured – for example so that vibrato can be measured, parameterised, and used as a retrieval feature. The project will build on prior work from C4DM on high resolution sinusoidal models, and will derive and investigate new ways to visualise and understand musical performance. Once these virtuosic performance aspects are parameterised (and turned into metadata) they can be edited to enhance the performance in subtle ways, such as morphing one performance into another. Imagine pasting a violinist’s vibrato onto a Death Metal vocalist’s singing! A further aspect of this project will be to see how much benefit can be found by using the full stereo music signal to increase the discriminatory powers of the features.
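To make "measured, parameterised" concrete, the sketch below estimates two common vibrato parameters, rate and extent, from a note-level pitch track. It is a deliberately simple illustration (zero-crossing counting rather than the sinusoidal models mentioned above), and the frame rate, synthetic track and function name are assumptions.

```python
import math


def vibrato_parameters(cents, frame_rate_hz):
    """Estimate (rate in Hz, extent in cents) from a pitch track given in cents."""
    mean = sum(cents) / len(cents)
    dev = [c - mean for c in cents]
    # Each vibrato cycle crosses the mean pitch twice.
    crossings = sum(1 for p, q in zip(dev, dev[1:]) if p * q < 0)
    duration_s = (len(dev) - 1) / frame_rate_hz
    rate_hz = crossings / (2.0 * duration_s)
    extent_cents = (max(dev) - min(dev)) / 2.0
    return rate_hz, extent_cents


# Synthetic note: 6 Hz vibrato, +/- 50 cents, 2 seconds at 100 frames per second.
frame_rate = 100.0
track = [50.0 * math.sin(2 * math.pi * 6.0 * n / frame_rate + 0.3) for n in range(200)]
rate, extent = vibrato_parameters(track, frame_rate)
```

Once a note is reduced to parameters like these, they can be stored as metadata, searched on, or swapped between performances as the paragraph above imagines.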

Tag propagation from structured to unstructured audio collections

Supervisor: Xavier Serra (Host: UPF; Secondment: JAM)

An important problem in MIR is the automatic labelling of music, for categories such as genre, instrumentation or mood. The main bottleneck in this problem is the availability of “good” data from which to learn models of particular categories. This project will use Jamendo and AcousticBrainz as sources for data collections with which to train the models. Jamendo is a music distribution website and an open community of independent artists and music lovers, and is the world's largest digital service for free music. AcousticBrainz is a platform used to crowd-source acoustic information from commercial music recordings, supported through a collaboration between the MetaBrainz Foundation and the UPF. These two platforms offer a very good starting point from which to develop better music classification methodologies and machine learning models that can be used to automatically label other music collections. The idea is to create appropriate datasets for specific music categories and then improve the content and context analysis in order to develop music classification systems.

Extending audio collections by combining audio descriptions and audio transformations

Supervisor: Xavier Serra (Host: UPF; Secondment: NI)

No matter how large a sound collection is, it will never cover the complete range of possible sounds in a given domain. Some of these missing sounds can be obtained by transforming available sounds. The goal of the project is to be able to do this automatically. Starting from a sound repository like Freesound, the idea is to develop and integrate a method into it that, given a sound query, identifies sounds that can be transformed to better suit the query (e.g. via time and frequency scaling transformations). Then the system should be able to retrieve the transformed sound, instead of the original one, in response to the user’s query. Currently Freesound has a content-based query tool with improved content features giving better audio content similarity measures. In this project the goal is to transform the sounds resulting from the current search to obtain sounds that are more similar to the target query sound.
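A minimal sketch of the retrieval idea: for each candidate sound, compute the time and frequency scaling needed to reach the query, and prefer the candidate that needs the least transformation. The descriptor fields, the cost function and its equal weighting are hypothetical choices for this example, not part of any existing query tool.

```python
import math


def transformation_needed(source, target):
    """Return (time-stretch factor, pitch shift in semitones) mapping source onto target."""
    stretch = target["duration_s"] / source["duration_s"]
    shift = 12.0 * math.log2(target["pitch_hz"] / source["pitch_hz"])
    return stretch, shift


def transformation_cost(source, target):
    """Crude cost: octaves of time scaling plus octaves of pitch scaling (equal weights assumed)."""
    stretch, shift = transformation_needed(source, target)
    return abs(math.log2(stretch)) + abs(shift) / 12.0


# Hypothetical collection entries and query; the field names are invented for this sketch.
collection = [
    {"name": "low_long", "pitch_hz": 220.0, "duration_s": 2.0},
    {"name": "near_match", "pitch_hz": 415.3, "duration_s": 1.1},
]
query = {"pitch_hz": 440.0, "duration_s": 1.0}

best = min(collection, key=lambda s: transformation_cost(s, query))
stretch, shift = transformation_needed(best, query)
```

Smaller transformations tend to introduce fewer audible artefacts, which is why this sketch ranks candidates by how far they must be scaled. A real system would compare richer content descriptors than pitch and duration.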

Audio content description of broadcast recordings

Supervisor: Emilia Gómez (Host: UPF; Secondment: BMAT)

The goal of this project is to improve state-of-the-art methods for the automatic description of audio material in the context of broadcast recordings. In particular, we will address the identification of music material (fingerprinting), the automatic classification of music vs speech, the characterization of prototypical sound effects, and the detection of music covers or versions. Our project addresses some challenges of state-of-the-art techniques in terms of robustness, scalability and limitations to laboratory settings (overfitting to small datasets). We will follow a user-centred approach for data annotation and a combination of a data-driven and a knowledge-driven methodology that incorporates musical knowledge and contextual information available as metadata. In terms of algorithmic approaches, we will consider current advances in classification approaches based on convolutional neural networks trained on frequency-domain (Pons and Serra, 2017) and time-domain signal representations.

Behavioural music data analytics

Supervisor: Gael Richard (Host: TPT; Secondment: DZ)

The goal of this project is to address the general problem of industrial-scale music recommendation with a novel approach, simultaneously exploiting user behavioural and contextual data in addition to audio, thus bridging the gap between content-based and usage-based recommendation. The idea is to build on machine learning techniques that have proven successful in the music information retrieval field (e.g. for instrument identification and music similarity) but to go beyond the state of the art with the joint analysis of large-scale behavioural and contextual data such as, for instance, user localisation, user environment or user activity (calm, running, etc.). The project will focus on exploring how to train models efficiently using such heterogeneous data. Deep learning network architectures with multiple layers and heterogeneous entry layers will be considered, but methods of reinforcement learning (to allow for adaptive recommendation) or ensemble methods are also of interest for the task. A specific focus will be given to semi-supervised learning strategies to cope with the fact that the majority of the data is unlabelled.

Voice models for lead vocal extraction and lyrics alignment

Supervisor: Gael Richard (Host: TPT; Secondment: AN)

A common approach for increasing the performance of audio source separation systems consists in informing the source models with additional information about the music content such as a score. In this context, this PhD project proposes to make use of lyrics information, such content being widely available on the internet, for informing the extraction of lead vocals from polyphonic music while performing the alignment of the text on the audio. This topic is thus at the crossroads between the fields of informed source separation and voice recognition and the goal will be to propose new models and algorithms that should help in performing both tasks in parallel. Iterative and joint approaches will be investigated. It is expected that both tasks will take advantage of each other, as currently most systems for lyrics alignment rely on an independent raw vocal pre-extraction step and source separation systems using text information assume the availability of a roughly aligned text. This approach will raise several issues that will be addressed during the project. Among these is the learning of a voice recognition model adapted to singing voice through transfer learning, as phonemic transcription data are currently almost exclusively available for speech signals. An evaluation of the degree of segmentation of the vocal signal (from a complete set of phonemes to a simpler vowel/consonant approach) which is necessary to improve both tasks will be performed. The separation task will also explore different strategies to best exploit the information of the class of vocal sounds in the source model, and how it can be combined with other types of more traditional music-related features such as the melody line and the rhythmic structure. Targeted applications of this work are the automatic production of karaoke content and the automatic generation of lyrics from musical pieces.

Multimodal movie music track remastering

Supervisor: Gael Richard (Host: TPT; Secondment: TC)

The goal of this PhD, which is focused on movie tracks, will be to exploit information from multiple modalities to perform sophisticated audio track remastering by replacing the background music, potentially by a different recording which shares some similarity with the original music (for example rhythm or timbral similarity). This will be achieved by either including an explicit speech/background music separation stage or by exploiting a unique global remastering framework where speech separation is not explicit. For example, essential information can be extracted from the movie script, including information about the acoustic scene such as the presence/absence of given speakers and music, exterior vs interior scenes, the potential number of sound sources apart from music, and localisation of the different instances of the same or similar music excerpts. The script information can also help automatic clustering tools that localise similar sound excerpts in the movie track, which will enable interesting new separation and remixing paradigms. Besides presence/absence of given speakers, the movie (image) can bring more detailed information on the speakers’ activity such as rough phonetic information (from lip analysis) which will further help the automatic separation. Finally, the new target music background will help the separation and remixing in several ways. In particular, by exploiting psychoacoustic properties of the ear (e.g. masking properties of a sound by another sound), radically new model-constrained separation approaches will be built where the separation artefacts could be masked by the new target music source. To that aim, it appears essential to select the new target music based on high-level information such as rhythmic or timbral similarity to maximise masking capabilities.

Context-driven music transformation

Supervisor: Gael Richard (Host: TPT; Secondment: TC)

Adapting, transforming or repurposing existing music has numerous applications especially in the movie and video-game industries. Precursor works proposed to retrieve music from a database which is well adapted to the video content (in terms of correlation between the musical rhythm and movement dynamics in the video scene) or to transform the rhythmic expressiveness of audio recordings. More recent studies aim at retargeting music based on the video content by exploiting automatic music segmentation and rhythm analysis. The aim of this project is to extend the previous concepts to “context-” and “content-aware” repurposing of real music (e.g. without explicit MIDI information). This will allow us to adapt the music not only to video content when available but also to geographic or local societal constraints (e.g. adapting the music to regional hits), or to a specific rhythm, genre or mood. The musical knowledge for the transformation, obtained by means of sophisticated MIR models, will come either from a class of “similar” music pieces or from a single music example using supervised (e.g. directly exploiting knowledge from dedicated MIR models) or unsupervised (e.g. by mimicking the musical concepts without explicit labelling) approaches. A specific focus will be given to the development of style transfer learning by advanced Deep Neural Network architectures grounded on e.g. RNN or LSTM, which are well adapted for processing audio signals. Although radically new for music signals, very promising results were recently obtained with Convolutional Neural Networks for style transfer in image processing.

Defining, extracting and recreating studio production style from audio recordings

Supervisor: Gael Richard (Host: SONY; Secondment: TPT)

Studio production in music emerged between the 1950s and the 1970s. Musicians became producers by using the recording medium for its own creative potential. Many pieces of popular music are barely recognisable when studio production is stripped off. The purpose of MIR is to extract information from music, yet no specific descriptors for the effect of studio production exist. A study based on the million-song dataset sees no evolution in popular music for the last fifty years, whereas observations based on the recorded content indicate significant and regular evolution in studio production between the 1940s and the 2010s. The purpose of this project is to expand the field of MIR to at least one aspect of studio production, such as the dynamic spatial setups of vocals. The ESR will develop a definition of this musical feature, means to extract its evolution from the audio signal, and a framework to generate music using this feature as a parameter. The resulting framework will be integrated into SONY’s automatic generation environment. The ESR will be helped by the expertise of SONY’s staff in studio production, and will be able to rely on multiple databases made by SONY, such as lead-sheets, manually transcribed popular music, and audio multi-tracks.

Large-scale multi-modal music search and retrieval without symbolic representations

Supervisor: Gerhard Widmer (Host: JKU; Secondment: KI)

This project is motivated by the Karajan Archive, curated by KI, which contains a large number of studio and live audio recordings, alternate takes and rehearsals, annotated scores, films, meta-data, etc. Making such repositories searchable and intuitively explorable in a content-based way requires powerful technologies for multi-modal cross-linking - between different recordings, but also between recordings and scores - and music identification (e.g., for recognising a piece or specific section based on some recording snippet and then retrieving the appropriate meta-information). The big challenges in this are the large amount of data; the fact that in general, there are no symbolic (machine-readable) scores available for the overwhelming majority of pieces; and possibly substantial structural differences between different recordings. The goal of this project is to develop methods for the automatic structuring and cross-linking of such heterogeneous collections without the need for symbolic score representations, supporting tasks such as the retrieval of score images based on audio queries (and vice versa), the alignment of multiple performances to sheet music for purposes of score-based listening and comparison, or the identification of the piece in unknown recordings (again, without a machine-readable score) for automatic meta-data provision. There are two starting points for this. The first is multi-modal learning (with "deep learning" methods), where latent semantic representation spaces are learned that maximise the correlation between corresponding items in different modalities. This was first demonstrated for music at JKU, where systems learn to align audio recordings to (simple) scores without any symbolic representation (the scores are given only as pixel images). An alternative path involves the use of (incomplete and partly incorrect) symbolic representations based on automatic audio transcription. This was first demonstrated by Arzt et al. for the task of near-instantaneous identification of piano pieces from snippets of live recordings; symbolic fingerprints are extracted from the audio in real time, and used to index into a database of scores. We also showed how transcription-based note patterns can be matched in the symbolic domain to reveal musically relevant similarities. These and related approaches will be carried further and combined in this project, aiming at robust and efficient algorithms that support search and retrieval in large collections of heterogeneous music-related material.

Live tracking and synchronisation of complex musical works via multi-modal analysis

Supervisor: Gerhard Widmer (Host: JKU; Secondment: VSO)

The feasibility of live tracking of complex symphonic works and, based on this, the synchronisation of various external events with the live music, has recently been demonstrated by consortium member JKU in a regular concert of the Royal Concertgebouw Orchestra Amsterdam, in Feb. 2015. However, with more complex stage works such as opera, where the soundscape may contain much more than just music (e.g., speech, shouting, shooting) and the course of events does not always strictly follow a score (consider, e.g., freely spoken passages, battle scenes, interruptions by applause, repeated arias due to audience cheering), fully automatic tracking becomes even more difficult. At the same time, such technologies would be highly useful in live streaming tasks (e.g., for the automatic synchronisation of sub-titles or other meta-information like the score, or synchronisation to the same performance by different devices), or in the concert hall (automatic control of lighting, camera timing, live video cutting). This project aims at developing robust, multi-modal technologies for fully autonomous performance tracking in such scenarios, exploiting various complementary information sources and techniques (real-time audio-to-audio and audio-to-score matching, acoustic event detection and classification, motion tracking in video streams). Efforts will be made to improve robustness by using multiple reference (and perhaps also rehearsal) recordings to improve the system's ability to predict imminent, and interpret unexpected, musical and extra-musical events. This project will benefit from resources provided by VSO (audio recordings, mastered opera videos, multi-camera recordings, libretti, scores), and also from their experience with the problems involved in live production and streaming in a real-world context.

How to Apply

See the application web site.


Email the relevant supervisor or project coordinator Simon Dixon


This project has received funding from the European Union's Horizon 2020 research and innovation programme under the Marie Skłodowska-Curie grant agreement No. 765068.