Automatic Description Of Music For Analyzing Music Productions: A Case Study In Detecting Mellotron Sounds In Recordings


The invention and expansion of sound recording technologies, together with the development of computers and the subsequent digital revolution, radically transformed the way music is conceived, created, produced, distributed and experienced in different cultures around the world. Nowadays, music is almost completely dependent on technological processes and its access is frequently mediated by digital technologies. In the studio, for instance, technological infrastructures have always shaped the way different recording, mixing or mastering methods are consolidated, thereby influencing the final musical outcome. Furthermore, digital music collections have become available via global networks in constantly increasing amounts, prompted by recent developments in audio technology and the creation of innovative online distribution platforms. This represents a new opportunity, not only for developing media technology for the access and distribution of music, but also for the automatic description of music recordings. This paper addresses the problem of the automatic description of music within the context of music production. First, Music Information Research (MIR) concepts and techniques are introduced, followed by some of their possible musicological and practical applications in record production. Secondly, the detection of musical instruments in polyphonic audio is addressed, with a focus on the analysis of Mellotron sounds. Finally, a specific methodology is proposed, along with the experiments conducted and the results obtained.

Music Information Research

Music Information Research (MIR) is a growing and active interdisciplinary field of research. For the last 15 years it has addressed the problem of describing, organizing and categorizing large music collections by means of computational methods and digital signal processing. MIR’s object of study is music information, i.e. any kind of information extracted from, or related directly or indirectly to, music in any of its representational forms: scores, audio, text, video, etc. MIR addresses the way humans interact with this information in different contexts, and this calls for knowledge coming from different disciplines such as psychoacoustics, cognitive science, information sciences, signal processing and human-computer interaction, among others. As this broad scope requires, MIR has become a multidisciplinary (but rarely interdisciplinary) field, where many disciplines have addressed MIR problems from their own perspectives by employing previously established and well-consolidated methods and techniques, but where there is seldom a dialogue between those different fields. Therefore, multiple strategies and methods have been implemented in MIR, usually involving automatic acoustic or text analysis, computer experiments (matching algorithm outputs to human-agreed annotations or score information), score analysis, user satisfaction surveys, etc. (Herrera & Gouyon, 2013).

The automatic description of audio music content shows its greatest promise when applied to large, ever-growing digital music collections (sometimes containing music for which no notation system, e.g. a score, is available), which require the development of specialized functionalities for storage, indexing, search, classification, curation, analysis, recommendation, visualization, playlist generation or even music creation.

It is relevant then to find out what kind of information is being used to describe music and where it can be extracted from, considering also how much music information is carried by musical audio itself and how much belongs to contextual or cultural sources. Due to the size of these collections, manual annotation is not feasible and, in many cases, metadata describing the music is not available either.

In order to categorize the automatic description of music audio content, different abstraction levels are employed:

  • Low-level, which refers to basic acoustic features such as frequency (related to pitch in perception), intensity (related to loudness), duration, etc.
  • Mid-level, comprising higher-level musical features, such as intervals, dynamics, melody, harmony, rhythm, tempo, key, tonality, instruments, etc.
  • High-level, related to more complex concepts such as mood, expression, emotion, performance, genre, similarity, etc.

Some common tasks found in MIR literature [1] are, among others: beat and BPM detection, melodic description, chord recognition, transcription from audio to score, instrument recognition, genre or mood classification, music similarity and the detection of versions, variations or quotes. MIR research usually employs a bottom-up approach, that is, starting from the extraction of low-level features and then piecing together more complex systems in order to reach higher levels of description. It is important to note the presence in MIR of a problem inherent to the information sciences, commonly known as the semantic gap: the difference between the computational representation and the semantic attributes of an object or category (Celma, Herrera & Serra: 2006). Bridging low-level features with more abstract high-level features in music information still remains a difficult task.
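The bottom-up approach can be sketched with a toy example in Python: a low-level feature (frame energy) is extracted first, then pieced together into a mid-level attribute (tempo in BPM). This is only an illustrative NumPy sketch, not a production beat tracker; the frame sizes and the 80-240 BPM search window are arbitrary assumptions, not values from any cited system.

```python
import numpy as np

def estimate_bpm(signal, sr=22050, frame=1024, hop=512):
    """Toy bottom-up tempo estimator: low-level frame energies ->
    onset strength -> periodicity via autocorrelation -> BPM."""
    # Low level: short-time energy for each frame
    n = 1 + (len(signal) - frame) // hop
    energy = np.array([np.sum(signal[i*hop:i*hop+frame]**2) for i in range(n)])
    # Onset strength: half-wave rectified energy difference
    onset = np.maximum(0.0, np.diff(energy))
    # Mid level: strongest periodicity within an 80-240 BPM window
    ac = np.correlate(onset, onset, mode="full")[len(onset) - 1:]
    lag_min = int(60.0 / 240 * sr / hop)
    lag_max = int(60.0 / 80 * sr / hop)
    lag = lag_min + int(np.argmax(ac[lag_min:lag_max]))
    return 60.0 * sr / (hop * lag)

# A synthetic click track at 120 BPM (one click every 0.5 s)
sr = 22050
sig = np.zeros(sr * 8)
sig[::sr // 2] = 1.0
print(estimate_bpm(sig, sr))  # roughly 120 BPM (quantized by the frame hop)
```

Real beat trackers add spectral onset detection and tempo smoothing, but the low-to-mid-level layering is the same.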

MIR and record production

MIR’s wide scope of applications has not been fully explored. In the context of record production, for instance, there are two possible modalities of use: MIR may support studies of the products of record production (i.e. as tools for musicologists), or MIR technologies could be used directly in a record production environment to enhance, facilitate or complement the production processes (i.e. as tools for sound engineers and music producers). MIR is not yet robust enough for music production environments, but it can be useful or even necessary for academic research on the art of record production. That is, MIR technologies provide musicological tools for analyzing music recordings beyond the limitations of classical techniques such as sonograms, or beyond the constraints of collection sizes (i.e. making possible analyses, annotations and comparisons at scales beyond manual human management), for instance, identifying patterns and metrics on a large corpus of music and correlating them with the way they evolve over a long time span (Serrà et al: 2012). Accordingly, an audio recording can be characterized automatically (with a non-negligible amount of errors that may require human supervision) in terms of some of its music-theoretical features through methods such as pitch estimation and note tracking (e.g. Bay et al, 2012), chord and key recognition (e.g. Cho & Bello: 2011; Peeters: 2006), rhythm and beat tracking (e.g. Ellis: 2007; Zapata: 2013), genre classification (e.g. Tzanetakis & Cook: 2002), similarity to other recordings (e.g. Aucouturier & Pachet: 2002; Panagakis & Kotropoulos: 2013), production techniques (e.g. Tzanetakis et al: 2007; McFee et al: 2012) or musical instruments (e.g. Herrera et al: 2003; Fuhrmann: 2012). This could also help to expand the verbal descriptions and semantic categories commonly used in musicological research, as it makes it possible to quantify aesthetic and perceptual judgments (i.e. ‘heaviness’ in guitars can be quantified and specified, differences in snare sounds according to genres or historical periods can be measured, etc.).

Current musicological approaches acknowledge the relevance of studying recorded music as such, including the analysis by empirical means of all processes involved in music production (Cook: 2010) and applying frameworks for the description of real-world recording scenarios (Fazekas & Sandler: 2011).

It is also possible to measure and quantify specific sound features for a given recording or mixing configuration and relate them to the perception and cognition of recorded music. The above-mentioned computable features might prove useful for record producers too, as these technologies would help to make decisions or automate complex tasks. Automated signal processing techniques have been applied to reduce unwanted recording artifacts (Clifford & Reiss: 2011), to accelerate the mixing phase (Montecchio & Cont: 2011) or to aid the tuning of percussion instruments (Richardson & Toulson: 2011). Research on similar applications, such as the automation of mixing and panning for different inputs into a constrained number of channels (e.g. Gonzalez & Reiss: 2007), employing score information for remixing stereo music (Woodruff et al: 2006) or pursuing a semantic approach for automatic mixing (De Man: 2013), has also increased in recent years. There are already commercial applications available for some of these techniques, such as automatic equalization according to real-time pitch tracking [2] or even music information exchange between the plug-in and the digital audio workstation for better software performance [3].

Automatic detection of instruments

One of the most frequent problems in audio content description is precisely that of classification according to different criteria (Herrera et al: 2002). These classification systems can relate to specific sound and musical features, or to more abstract and culturally subjective semantic descriptions (e.g. danceability, energeticness, grooviness, genre, etc.). Manual classification of large collections of music is impractical, which calls for the development of automatic classification systems. One of the main problems in developing an automatic classifier is finding how specific low-level encodings of the waveform can be related to higher-level descriptions. When analyzing audio in order to get information from the musical content, instruments and their timbre prove to be one of the most relevant and objective criteria for description. The difficulty of defining timbre from a strictly scientific and objective point of view has been pointed out several times (Sethares: 1999; O’Callaghan: 2007). There is no single and direct association between physical, acoustically measurable features and specific timbres, which means that in order to describe a timbral sensation accurately, a multidimensional approach must be adopted. Timbre as a human sensation thus cannot be placed on a one-dimensional scale where all possible timbres could be ordered. In the MIR context, human perception of timbre translates to the recognition of specific musical instruments when searching and analyzing audio files in large databases. Timbre description and analysis depend on perceptual features which can be extracted and computed, by means of signal processing, from audio recordings. These features are not available or explicit in other representation forms, such as the score. In that way, this approach to music information (based on the sound features of the instrument instead of melodic, harmonic or rhythmical models) could be used to create automatically computed classes, labels or tags.

The detection of musical instruments in a specific piece of music might be highly relevant in the analysis of music recordings, as instruments define most of the timbral qualities in any piece of music. The automatic description of a piece of music by finding a particular musical instrument or group of instruments involves analyzing the direct source of the physical sound, and the way it is categorized or grouped linguistically. When creating a computational model for identifying and classifying musical instruments, the equivalent human performance should also be taken into account. Some studies show that even subjects with musical training rarely achieve recognition rates greater than 90% in this task, depending on the number of categories used, and in the most difficult cases identification rates drop to 40% (Herrera et al: 2006). For instance, families of instruments are more easily identifiable than individual instruments. It is also common to confuse an instrument with another one having a very similar timbre. Subjects can improve their discrimination performance by training on comparisons of paired instruments, or by listening to instruments within a musical context, instead of listening to isolated or sustained musical notes (Herrera et al: 2006).

Perceptually, instruments are determinant for specific textures, atmospheres, moods, contrasts and distinctiveness in a piece of music. Additionally, instruments give information on the genre and the historical and geographical origin of music. It could be of interest for several fields (musicology, psychoacoustics, commercial applications, etc.) to retrieve from a large database, and automatically classify, pieces of music which make use of a certain musical instrument, and to do so regardless of musical style, genre, historical period or geographic location, and without relying on any additional metadata. Some applications and motivations for using computational models for the automatic labeling and classification of musical instruments are:

  • Finding the acoustic features that make the sound of an instrument identifiable or remarkable within a specific musical context. Thus, timbre can be used as an acoustic fingerprint (keeping in mind the full range of sounds that a single instrument can produce).
  • Improving the performance of a genre classifier. Culturally, there are instruments associated with a particular musical genre or style. Research on genre classification usually employs global timbre description as one of the main relevant attributes. However, individual instruments are rarely taken into account for this task. Their inclusion should increase the accuracy of an automatic genre detector.
  • Improving the performance of a geographical classifier. There are musical instruments associated with specific regions of the world, so specific pieces of music are related to their geographic location. Gómez et al. (2009) showed that including timbre features improves the geographical classification of pieces of music, helping to complement other musical features such as tonal profiles or tuning.
  • Improving the performance of a historical classifier. In a similar way, musical instruments can be associated with specific historical periods. In both academic and popular music, the specific time of invention and development of an instrument confines its use to a well-defined time span. It could also be important to study the usage of a specific instrument through time, finding its relative recurrence or historical preferences.
  • Perceptually, instruments and their timbres are relevant for shaping subjective features in our reception of music. The presence of a single instrument or combination of instruments could define the overall texture or atmosphere in a piece of music. Similarly, the inclusion of an instrument in a specific section of a piece could create a contrast or distinctiveness that increases interest or surprise.

Several of these applications could be combined to achieve different classification systems. For example, a virginals classifier could also help to classify music containing that instrument by genre (classical, renaissance, early baroque), by historical period (16th-17th century) or by geographic area (Northern Europe, Italy). A conga classifier could help to classify music belonging to the Latin genre (as well as subgenres such as salsa, merengue or reggaeton) or coming from specific countries (Cuba, Puerto Rico, the Dominican Republic), and so on. All of these require a musicological/organological approach: getting to know the history, development and context of the instrument or class, and its most important physical features.

Sound descriptors

In order to detect a musical instrument in a recording, the acoustic features that make the sound of an instrument identifiable or remarkable must be found. To accomplish this, audio descriptors numerically capturing different timbre dimensions are extracted, quantified and coded from raw digital audio signals. In order to discriminate the sound source, several temporal and spectral features are decoded by humans on the way from the cochlea to the primary auditory cortex, which is where sound labeling happens in the brain (Herrera et al: 2006). Timbre descriptors can be obtained from the time-domain signal, from its spectrum in the frequency domain, or from a combination of both (spectro-temporal descriptors). Although not intrinsically related to timbre, the description of the energy of a signal can be used in combination with other descriptors for specific instrument identification, if required (especially the temporal evolution of the energy of a note, known as the ‘amplitude envelope’). The goal is therefore to know the most relevant acoustic and perceptual features of the musical instrument itself, and to identify a set of descriptors that can be associated with a particular sound. It could be the case that some descriptors included in this rich set are not relevant to the study and analysis of a specific instrument and, furthermore, their computed values could be misleading for a generic classifier. By selecting a small set of pertinent descriptors, redundancy is avoided, computation time is decreased and, ideally, generalization in detection should be more accurate (a particular case of the so-called Ockham’s Razor). As it is difficult to know beforehand which descriptors describe a specific musical instrument most accurately, feature selection techniques must be applied before or during the process of building a classification model for the sounds.
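As a concrete illustration, two of the descriptor families mentioned above, a spectral descriptor (the spectral centroid, often associated with perceived brightness) and the amplitude envelope, can be computed from a raw signal in a few lines of NumPy. This is a simplified sketch on a synthetic note, not the exact feature set used in the study.

```python
import numpy as np

def spectral_centroid(frame, sr):
    """Brightness cue: magnitude-weighted mean of the spectrum's frequencies."""
    mag = np.abs(np.fft.rfft(frame * np.hanning(len(frame))))
    freqs = np.fft.rfftfreq(len(frame), 1.0 / sr)
    return float(np.sum(freqs * mag) / (np.sum(mag) + 1e-12))

def amplitude_envelope(signal, frame=1024, hop=512):
    """Temporal evolution of energy: the peak amplitude of each frame."""
    n = 1 + (len(signal) - frame) // hop
    return np.array([np.max(np.abs(signal[i*hop:i*hop+frame])) for i in range(n)])

sr = 22050
t = np.arange(sr) / sr
# A decaying 440 Hz tone with a weaker octave partial
note = np.exp(-3 * t) * (np.sin(2*np.pi*440*t) + 0.3*np.sin(2*np.pi*880*t))
print(spectral_centroid(note[:2048], sr))  # pulled above 440 Hz by the 880 Hz partial
print(amplitude_envelope(note).max())
```

Stacking many such values (centroid, rolloff, flux, MFCCs, envelope statistics, etc.) is what yields the multidimensional timbre representation discussed above.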

Case study: detection of Mellotron sounds in recordings

In this paper we restrict the detection of musical instruments to the Mellotron, one of the first sample-playback instruments in history, which has been widely employed in several forms of popular music since the sixties. The Mellotron has been used in music genres as diverse as art/progressive rock, psychedelia, alternative, electronica or ambient, and continues to be used to this day. Mellotron sounds present interesting technical and perceptual qualities, which make them ideal for the study of timbre descriptors in the context of automatic classification in polyphonic audio. The Mellotron is a peculiar instrument in the history of 20th century popular music. Modeled after the Chamberlin [4], it is recognized as one of the first sample-playback instruments in history. Originally, the idea behind the Mellotron was to emulate the sound of a full orchestra by recording individual instrument notes on tape strips, which were activated through playback. For instance, instead of recording a whole string section performance for accompaniment in a song, the Mellotron features individually recorded notes of this string section which can then be played by the performer in a particular musical arrangement. The instrument can also be used in live settings, which makes it a very adequate option whenever it is difficult to get the original instrument or instruments for the performance. However, the Mellotron is not employed in recordings as commonly as other keyboard-controlled instruments, and this uniqueness makes it relevant for performing some specific classification tasks. During the second half of the 1960s, several psychedelic and progressive rock groups started using the Mellotron, prompted amongst others by the seminal piece Strawberry Fields Forever by The Beatles, which employed a flute Mellotron throughout the song.
Some bands such as King Crimson, Genesis or The Moody Blues turned the Mellotron into a regular instrument in their compositions, and it became a trademark sound of a large portion of seventies progressive rock. Mellotron usage declined during the eighties, probably due to the wide diffusion and success of cheaper digital synthesizers which emulated the sound of traditional Western instruments by means of synthesis techniques. However, the early years of the 21st century saw a revival of the Mellotron, not only as a vintage artifact or ‘retro’ curiosity, but as a main instrument and compositional tool, as new models went into production (bands such as Oasis and Air, and artists such as Aimee Mann, included it prominently in their music). More recently, libraries containing digital samples from the Mellotron and even software emulators have been made available by different companies.

The electro-mechanical nature of the Mellotron (i.e. having characteristics both of electrically-enhanced and of mechanically-powered musical instruments) makes it difficult to classify within a well-defined organological taxonomy. According to the Hornbostel-Sachs instrument classification system, for instance, the Mellotron would belong to its fifth category, electrophones. Unfortunately, when trying to classify it within any of the subcategories of this system, there is the disjunctive issue of considering either the multi-timbral nature of the recorded sounds from real instruments, or the fact that it is activated and amplified electrically. However, the electro-mechanical tape mechanism imprints a unified sound on the Mellotron, regardless of the instrument being sampled. We now refer to some of the technical features of the Mellotron which contribute to this distinctive sound. The Mellotron’s main working mechanism lies in a bank of linear magnetic tape strips, on which sounds of different acoustic instruments are recorded. It uses a regular Western keyboard as a way to control the pitch of the sounds. Each key triggers a different tape strip, where individual notes belonging to a specific instrument have been recorded. Below every key, there is a tape and a magnetic head (for instance, the M400 model has 35 keys, with 35 magnetic heads and 35 tapes, while the Mark II model doubles that). Furthermore, some Mellotron models had up to three tracks on every tape, meaning that three different instruments or sounds could be recorded, and with a selector function a combination of two of them could be played simultaneously. When the instrument is switched on, a capstan (a metallic rotating spindle) is activated and rotates constantly. Whenever a key is pressed, the strip makes contact with the magnetic head (the reader) and the tape is played.
There is an eight-second limit for playing a steady note in the instrument, due to the physical limitations of this mechanism, that is, the length of the tape strips (Vail: 2000). One of the main innovations in the Mellotron is its working tape mechanism: instead of having two reels and playing a sound until the tape length is over (as in a regular tape player system), the tapes are looped and attached to springs that allow the strips to go back to the starting position, once a pressed key is released, or after the eight-second limit.

By using tapes, the Mellotron can reproduce the attack of the instrument, a fact that could be used as a temporal cue when obtaining the values of the descriptors. However, its timbre is perceived as having an additional sound to that of its acoustic counterpart, i.e. sounds from Mellotron strings and a real string orchestra are perceived differently. One of the most frequent sound deviations found in tape mechanisms is the so-called wow and flutter effect, which corresponds to rapid variations in frequency due to irregular tape motion. In analog magnetic tape it is also frequent to have tape hiss, a high-frequency noise produced by the physical properties of the magnetic material. In some recordings, the characteristic sound of the spring coming back to the default position can be heard as well. Different models of the Mellotron (such as the M300, the MKII, the M400, etc.) produce different sounds, due to using different sets of recordings or having slight variations in the working mechanism, but these distinctions were not addressed in our research. The possibility of playing any instrument that has been previously recorded on a magnetic strip makes the Mellotron unique in its timbral diversity. However, all these different instruments are mediated by the same physical mechanism, which could lead to some common timbral features.
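The wow and flutter deviation can be simulated directly: modulating the instantaneous frequency of a tone with a slow component (wow) and a faster one (flutter) approximates irregular tape speed. The rates and depths below are illustrative values chosen for the sketch, not measurements from an actual Mellotron.

```python
import numpy as np

def apply_wow_flutter(sr=22050, seconds=2.0, f0=440.0,
                      wow_rate=1.5, wow_depth=0.005,
                      flutter_rate=12.0, flutter_depth=0.001):
    """Simulate tape-speed irregularity as frequency modulation:
    'wow' is a slow pitch drift, 'flutter' a faster jitter."""
    t = np.arange(int(sr * seconds)) / sr
    # Instantaneous tape speed as a fraction of nominal speed
    speed = (1.0 + wow_depth * np.sin(2*np.pi*wow_rate*t)
                 + flutter_depth * np.sin(2*np.pi*flutter_rate*t))
    # Integrate speed to obtain phase, so frequency follows the speed curve
    phase = 2 * np.pi * f0 * np.cumsum(speed) / sr
    return np.sin(phase)

tone = apply_wow_flutter()
print(len(tone))
```

Such frequency-modulation traces are one physical reason why descriptors capturing fine pitch instability can help separate Mellotron sounds from their acoustic counterparts.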

Thus, we framed our research by the following questions: could there be a set of audio descriptors that can be employed to group all sounds coming from the Mellotron, disregarding the kind of instrument being sampled? Can a machine be taught to detect the sound of this instrument by identifying these features? In general terms, do these kind of ‘rare’ or specialized musical instruments have distinctive sound features that can be recognized, described and characterized using low-level attributes? There are also some specific challenges in the detection task for this instrument. Firstly, the Mellotron constitutes one instrument with several timbres. Secondly, the Mellotron sound is not very prominent in most of the recordings it has been played. It was commonly used as a background musical accompaniment, which means that sometimes several other louder instruments appear in the recordings. Also, in most of the recordings the Mellotron does not play long continuous musical phrases, appearing only for a short time. Solo sections are hard to find as well. Additionally, recognition of this instrument proves to be difficult, even for human listeners. Although, to the best of our knowledge, there have not been scientific studies on this specific task, there is a lot of online information on this matter. For instance, the Planet Mellotron [5] website lists at least 100 albums containing allegedly sound from the Mellotron, some of them wrongly classified or very difficult to verify due to lack of sonic evidence (the supposed sound of the Mellotron could be deeply buried in the mix, so it is difficult to be perceptually discriminated), lack of meta-information (for instance, impossibility of confirmation, by musicians or producers, on the usage of the instrument in a specific piece of music), or misattributed samples.


It is possible to train classifiers with audio descriptors (temporally integrated from the raw feature values extracted from polyphonic audio data) using extensive datasets (Fuhrmann & Herrera: 2010; Essid et al: 2006). The following is a general description of the proposed approach (a flow diagram can be seen in Fig. 1):

  1. Building a well-suited database for the instrument with adequate instrument-name annotation (i.e. flute, strings, etc.), as well as a database for the counterpart (i.e. a collection of samples not containing the instrument). This constitutes the so-called ground truth, which is the basis for all subsequent steps.
  2. Extracting frame-based audio features (descriptors), computed over time (by means of statistical analysis) from the datasets. It is important to remark that no pre-processing (e.g. source separation) is required in this process; feature extraction is done directly on all pieces belonging to a particular collection.
  3. Using specific feature selection techniques to select the most relevant attributes or those that could be more accurate for describing the specific timbre of the instrument, and help to improve the performance of the discrimination model.
  4. Training, testing and classifying the data according to the selected descriptor set, using machine learning techniques. Here, supervised learning techniques will be used, whereby annotated data is used to train a model that will generate an instrument label for each presented sound excerpt.
  5. Comparing, analyzing and evaluating descriptors, models, techniques and classification results.
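Steps 3 to 5 can be sketched with a standard machine-learning toolkit. The snippet below uses Python's scikit-learn as a stand-in and simulated descriptor vectors, so the numbers are not the study's results; it only shows how feature selection, supervised training and precision/recall evaluation chain together.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.metrics import precision_score, recall_score

rng = np.random.default_rng(0)
# Step 1 (ground truth, simulated here): 200 excerpts x 40 descriptors,
# labelled 1 = Mellotron, 0 = non-Mellotron; only 5 descriptors are informative
X = rng.normal(size=(200, 40))
y = rng.integers(0, 2, size=200)
X[:, :5] += y[:, None] * 1.5

# Steps 3-4: feature selection + supervised classifier in one pipeline
model = make_pipeline(StandardScaler(), SelectKBest(f_classif, k=5), SVC())
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)
model.fit(X_tr, y_tr)

# Step 5: evaluation by precision and recall on held-out data
pred = model.predict(X_te)
print(precision_score(y_te, pred), recall_score(y_te, pred))
```

Keeping selection and classification inside one pipeline ensures the feature ranking is learned only from training folds, avoiding an optimistic bias in the evaluation.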

Fig. 1 Automatic instrument detection and classification flow diagram for polyphonic audio (from Fuhrmann, Haro and Herrera, 2009)

Two main tasks were defined for building the ground truth: first, making a representative collection of recordings that include the Mellotron; second, building collections that include the ‘real’ acoustic instruments that are sampled by the Mellotron. The purpose here is to discriminate the Mellotron from what it is not, e.g. learning to differentiate a Mellotron choir sound from a real choir. In that way, it is possible to find the features that make the Mellotron sound physically and perceptually distinctive. Ideally, the selected excerpts featuring the instrument must correspond to recordings from different songs, albums, artists, periods and musical genres, in order to cover a wide range of sonic possibilities and allow the system to generalize what a Mellotron sound involves in terms of acoustic features. Also, in addition to excerpts featuring the solo instrument, there must be a wide diversity of instrument combinations, taking into account the predominance level of the Mellotron. Selecting several excerpts belonging to the same song was discouraged, as was selecting excerpts belonging to the same album (trying to avoid the so-called album effect where, due to a unity of production techniques, sound similarity increases). Samples where the Mellotron was deeply buried in the mix were not selected, because they would have confused the classifiers, adding difficulty to the task. Fragments of at least 30 seconds in which the Mellotron is permanently featured were selected. WAV format was used, either converted from 192 kbps (or better) MP3 files, or directly ripped from audio compact discs. The samples were fragmented and converted from stereo to mono by mixing both channels while avoiding any clipping. A total of 973 files were collected, segmented, annotated, classified and reviewed for the different experiments, as described in Table 1.
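The channel-mixing and segmentation steps can be sketched as follows; this is a minimal NumPy illustration (WAV decoding and annotation are omitted, and the random signal merely stands in for decoded audio):

```python
import numpy as np

def stereo_to_mono(stereo):
    """Average the two channels, rescaling only if the mix would clip."""
    mono = stereo.mean(axis=1)
    peak = np.max(np.abs(mono))
    return mono / peak if peak > 1.0 else mono

def segment(signal, sr, seconds=30):
    """Cut a recording into non-overlapping excerpts of fixed length."""
    size = sr * seconds
    return [signal[i:i+size] for i in range(0, len(signal) - size + 1, size)]

sr = 22050
# Stand-in for a decoded 95-second stereo recording
stereo = np.random.default_rng(1).uniform(-1, 1, size=(sr * 95, 2))
mono = stereo_to_mono(stereo)
parts = segment(mono, sr, 30)
print(len(parts))  # 3 full 30-second excerpts; the 5-second remainder is dropped
```

Averaging (rather than summing) the channels keeps the mix within the original amplitude range, which is one simple way to honor the no-clipping constraint.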

Table 1. Groundtruth details and classification for the different collections for the ‘Mellotron’ and ‘Non-Mellotron’ classes

Once the ground-truth collections were double-checked, feature extraction was performed using Essentia, a C++/Python library for audio analysis (a collection of algorithms) that includes standard signal processing as well as temporal, spectral and statistical descriptors (Bogdanov et al: 2013). Here, the signal is cut into 2048-sample frames (about 46 ms at 44.1 kHz) with a hop size of 1024 samples; for each frame the short-time spectrum is computed, and several temporal and spectral descriptors are obtained and aggregated into a pool. The default Essentia extractor was used, which extracts many features shown to be useful for audio similarity and classification purposes. Every descriptor is then a value, or a vector of values, capturing the descriptor's average computed over all frames within a sample.
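The frame-based extraction-and-pooling scheme can be mimicked in a few lines, shown here with plain NumPy and a single descriptor rather than Essentia's full extractor (the frame and hop sizes match those quoted above):

```python
import numpy as np

def frame_descriptors(signal, sr, frame=2048, hop=1024):
    """Per-frame spectral centroid, then temporal aggregation (mean, var),
    mimicking a frame-based extractor with pooled statistics."""
    window = np.hanning(frame)
    centroids = []
    for start in range(0, len(signal) - frame + 1, hop):
        spec = np.abs(np.fft.rfft(signal[start:start+frame] * window))
        freqs = np.fft.rfftfreq(frame, 1.0 / sr)
        centroids.append(np.sum(freqs * spec) / (np.sum(spec) + 1e-12))
    centroids = np.array(centroids)
    # The pooled statistics are what the classifier actually sees
    return {"centroid.mean": centroids.mean(), "centroid.var": centroids.var()}

sr = 44100
t = np.arange(sr * 2) / sr
pool = frame_descriptors(np.sin(2 * np.pi * 440 * t), sr)
print(pool["centroid.mean"])  # close to 440 Hz for a steady pure tone
```

A real extractor pools dozens of descriptors this way (means, variances, derivatives), producing one fixed-length vector per excerpt regardless of its duration.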

For the automatic classification task, several machine learning algorithms were compared. Machine learning evolved as a branch of artificial intelligence, developing algorithms that find behaviors and complex patterns in real-world data. Machine learning's main purpose is to find useful approximations for modeling and predicting processes that follow some hidden regularities, but that are hard to detect manually due to the huge amount of information describing them (Alpaydin: 2004). It is crucial that these automatic systems are capable of learning and adapting (in order to have high predictive accuracy) by means of efficient algorithms that are able to process massive amounts of data and find optimal solutions to specific problems. In this particular case, our intention was to build descriptive models gaining knowledge from data, leading eventually to predictive systems that anticipate future events. Thus, supervised classification was used, where the learning algorithm maps features to predefined classes. For the purpose of this project, the open-source free software Weka [6] was employed. Weka allows users to pre-process data, select features, and classify or cluster data, creating predictive models by means of different machine learning techniques. Three machine learning algorithms were selected for the experiments: decision trees, k-nearest neighbors and support vector machines. For the evaluation, we focused on the effectiveness of the system (Serrà: 2007), that is, performance based on the exact match between the predicted and the annotated labels. This is specifically done by measuring the percentage of correctly classified instances, the recall of the system (the proportion of relevant material actually retrieved in answer to a search request) and the precision of the model (the proportion of retrieved material that is actually relevant).
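As an illustration of this comparison, the same three algorithm families can be trained and scored with scikit-learn (a stand-in for the Weka workflow actually used); the descriptor data below is synthetic, so the scores do not reflect the study's results:

```python
import numpy as np
from sklearn.model_selection import cross_validate
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

rng = np.random.default_rng(42)
# Simulated pooled descriptors: 150 excerpts x 20 features, two classes
X = rng.normal(size=(150, 20))
y = rng.integers(0, 2, size=150)
X[:, :4] += y[:, None] * 2.0   # four informative "timbre" dimensions

for name, clf in [("decision tree", DecisionTreeClassifier(random_state=0)),
                  ("k-NN", KNeighborsClassifier(n_neighbors=5)),
                  ("SVM", SVC())]:
    scores = cross_validate(clf, X, y, cv=5, scoring=("precision", "recall"))
    print(f"{name}: precision={scores['test_precision'].mean():.2f} "
          f"recall={scores['test_recall'].mean():.2f}")
```

Cross-validated precision and recall correspond directly to the effectiveness measures named above, averaged over held-out folds rather than a single split.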

Experiments and results

A series of experiments was run sequentially in order to gather information about the specific descriptors that could help accomplish the proposed tasks. Two classes were created for each experiment: one for samples featuring the Mellotron and one for samples not featuring it. The number of instances per class was the same in every experiment. First, an experiment was conducted comparing specific pieces of music played with the Mellotron or, alternatively, with acoustic instruments. This was intended to provide a guideline for the subsequent experiments by making a direct timbral comparison between the Mellotron and several instrument combinations for equivalent musical phrases. For this experiment, a special collection was built from music for Mellotron arranged by Mike Dickson on his album Mellotronworks. In these recordings, classical music pieces are performed exclusively with Mellotron sounds, by recording the individual instrument parts on it and mixing them afterwards. By comparing classical pieces with their Mellotron versions, we ensure that the harmonic and melodic content is roughly the same, so we can focus directly on the timbral differences. These pieces present several timbral combinations, belonging to different instruments of a typical Western orchestra. Furthermore, most of the Mellotron versions employ the original instrumentation of the orchestral pieces, so the comparison directly isolates the features the Mellotron adds to the timbre of those instruments. Then, a series of experiments comparing three specific instrument settings (strings, flute, choir) was conducted, all of them on polyphonic music pieces. The intention was to evaluate the overall performance of the classifiers when different numbers of relevant attributes were employed.
In this way, it could be determined whether a small set of descriptors is sufficient to describe the Mellotron sound or whether, on the contrary, the uniqueness of its timbre and the complexity of the problem make larger sets of attributes necessary. On the other hand, having too many parameters increases the possibility of classifying sounds according to random or suboptimal features, an undesirable circumstance commonly known as overfitting. From the first experiments, a set of sound descriptors was detected and proven useful for the classification tasks, suggesting that they might be related to the particular sound of the Mellotron. Some of these descriptors were:

  • Dissonance: describes sensory dissonance (not musical dissonance), based on the roughness of the spectral peaks.
  • Spectral crest: the ratio between the maximum value and the arithmetic mean of the spectrum; it indicates how peaky, as opposed to flat, the spectrum is.
  • Spectral skewness: describes the asymmetry around the average of the spectrum.
  • MFCC 4: measures the fourth component of the mel-frequency cepstrum vector representation. The mel-frequency cepstrum represents the short-term power spectrum shape in a very compact and abstract way that is frequently useful for timbre discrimination.
  • Spectral flux: measures how quickly the power spectrum changes.
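For illustration, three of the descriptors above can be approximated in a few lines of NumPy operating on magnitude spectra. These follow common textbook formulations; Essentia's exact implementations (normalisation, bin weighting) may differ, and the dissonance and MFCC computations are omitted here as they are considerably more involved.

```python
import numpy as np

def spectral_crest(mag):
    """Ratio of the spectrum's maximum to its arithmetic mean (peaky vs flat)."""
    return float(mag.max() / (mag.mean() + 1e-12))

def spectral_skewness(mag):
    """Third standardized moment: asymmetry of the magnitude values around their mean."""
    mu, sigma = mag.mean(), mag.std()
    return float(np.mean(((mag - mu) / (sigma + 1e-12)) ** 3))

def spectral_flux(mag_prev, mag_cur):
    """Euclidean distance between consecutive (unit-normalised) spectra:
    how quickly the power spectrum changes from frame to frame."""
    a = mag_prev / (np.linalg.norm(mag_prev) + 1e-12)
    b = mag_cur / (np.linalg.norm(mag_cur) + 1e-12)
    return float(np.linalg.norm(b - a))
```

A flat spectrum yields a crest near 1 and zero flux against itself, while a spectrum dominated by a single peak yields a high crest and strong positive skewness.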

The final experiments (fifth to seventh), dealing with the totality of the collections and a larger number of instances, were intended to evaluate the findings of the previous experiments. The results of this last series showed the difficulty of modeling the sound of a specific instrument in a polyphonic mixture: a larger number of descriptors is required to reach high accuracy values. Fig. 2 shows the best classification results for every experiment.

Fig. 2. Best performances in every experiment for every machine learning technique. Bar series #1 refers to Mellotron versus non-Mellotron with the same phrases; series #2 to #4 depict results for the Mellotron versus acoustic flute, strings and choir, respectively. Bar series #5 combines the collections from the previous four experiments. Series #6 compares all Mellotron sounds with a generic Rock/Pop collection that had not been used before, and series #7 shows the results for the Mellotron present/absent discrimination when all databases from the previous six experiments were combined.

From these results, it can be seen that performance changes from experiment to experiment, and that different discrimination types have different degrees of difficulty. In general, the SMO support vector machine algorithm proved to be the most effective (as is usual in the current state-of-the-art literature), achieving the highest rates of correctly classified instances in five experiments. The best overall results were obtained in the third experiment (strings). The worst results were obtained in the last experiment, which illustrates the difficulty of discriminating specific sounds within a complex polyphonic mixture, and the fact that our techniques are still very far from reflecting the way human perception and cognition work.


Detecting the presence of a particular instrument in a recording could give valuable information on the context and content of music recording collections. It can be hypothesized that the inclusion of an instrument in a recording may change some perceived classification features of that piece of music (such as genre or mood), making the proposed methodology a necessary step in determining the connection between instruments and related high-level concepts. The reported experiments show that it is possible to automatically identify the Mellotron in a polyphonic setting, provided we accept some errors. We also detected audio descriptors that could plausibly be related to the physical mechanism that gives the Mellotron its distinctive sound. It is important to note that some Mellotron features, such as the slight variations in frequency due to irregular tape motion, can indeed be distinctive of this instrument, but they are not always present in the recordings. This means that, depending on variables such as recording and production characteristics, Mellotron model, or even the date of the recording, some sound characteristics of the instrument can change notably, or may not be featured in the audio samples at all. The descriptors obtained by the different models that help differentiate and classify the Mellotron are broadly consistent with the physical properties of the instrument. Indeed, one could hypothesize that irregularities in the tape motion mechanism are related to the dissonance descriptor, or that tape hiss is reflected in attributes such as the spectral crest.

The methodology used here has a series of advantages worth mentioning. First, it can be applied to music in real scenarios, that is, polyphonic signals comprising a diversity of sound sources in a multi-timbral mixture, rather than the monophonic approach in which instruments are isolated beforehand. Second, it can be extrapolated to several categories, including solo instruments or combinations of instruments, which could help classify the data according to predefined taxonomies. Some previous approaches to this problem imply building a model that sometimes fits only one specific instrument; as we are dealing with polyphonic music, our approach can be extended to any kind of instrument from any musical culture in the world, making it pertinent for multicultural studies. No prior processing is needed, which drastically reduces computation time compared with methods involving source separation, which are still at an incipient stage (Fuhrmann, Haro & Herrera, 2009); it thus constitutes a rather simple and cost-effective methodology. Finally, this methodology can be robust against sounds that have not been identified previously: once a model for a specific instrument is established, no prior information about the audio file is required, i.e. the computation can be applied to raw data without any kind of high-level tags associated with it.

We have presented an illustration of automatic music description for instrument detection in polyphonic audio. MIR applications in a music production environment are manifold, as illustrated above. Nonetheless, current MIR technologies and methods are not yet mature enough to enable error-free tools in the recording studio or in the musicologist's scriptorium, and their outcomes still require human supervision. This is not a bad prospect, considering their existing features and their potential to become a powerful aid for different facets of the art of record production.


[1]    More MIR topics can be found in the proceedings of well-established international conferences such as:

-ISMIR (International Symposium on Music Information Retrieval)

-SMC (Sound and Music Computing Conference).

[2]    Surfer EQ by SoundRadix

[3]    ARA: Audio Random Access by Celemony


[5] Planet Mellotron is a website where a comprehensive and extensive database of music recordings that include this instrument is annotated and updated regularly. (last visited in August 2013)

[6]    Weka 3 – The University of Waikato


Alpaydin, E. (2004). Introduction to machine learning. Cambridge: MIT Press.

Aucouturier, J. J., & Pachet, F. (2002). ‘Music similarity measures: What’s the use?’. In: ISMIR.

Bay, M., Ehmann, A. F., Beauchamp, J. W., Smaragdis, P., & Downie, J. S. (2012). ‘Second Fiddle is Important Too: Pitch Tracking Individual Voices in Polyphonic Music’. In: ISMIR, pp. 319-324.

Bogdanov, D., Wack, N., Gómez, E., Gulati, S., Herrera, P., Mayor, O., Roma, G., Salamon, J., Zapata, J. & Serra, X. (2013). ‘Essentia: an audio analysis library for music information retrieval’. In: Proc. of ISMIR.

Celma, Ò., Herrera, P., & Serra, X. (2006). ‘Bridging the music semantic gap’. In: ESWC 2006 Workshop on Mastering the Gap: from Information Extraction to Semantic Representation.

Cho, T., & Bello, J. P. (2011). ‘A Feature Smoothing Method for Chord Recognition Using Recurrence Plots’. In: ISMIR, pp. 651-656.

Clifford, A. & Reiss, J. (2011). ‘Reducing comb filtering on different musical instruments using time delay estimation’. In: Journal of Art of Record Production. Issue 5.

Cook, N. (2010). ‘The ghost in the machine: towards a musicology of recordings’. In: Musicae Scientiae, 14(2), pp. 3-21.

De Man, B. (2013). ‘A semantic approach to autonomous mixing’. In: 8th Art of Record Production Conference.

Ellis, D. P. (2007). ‘Beat tracking by dynamic programming’. In: Journal of New Music Research, 36(1), pp. 51-60.

Essid, S., Richard, G. & David, B. (2006). ‘Instrument recognition in polyphonic music based on automatic taxonomies’. In: IEEE Trans. Audio Speech Lang. Process., vol. 14, pp. 68–80.

Fazekas, G. & Sandler, M. (2011). ‘The Studio Ontology Framework.’ In: ISMIR.

Fuhrmann, F., Haro, M., & Herrera, P. (2009). ‘Scalability, generality and temporal aspects in automatic recognition of predominant musical instruments in polyphonic music’. In: Proc. of ISMIR.

Fuhrmann, F. & Herrera, P. (2010). ‘Polyphonic instrument recognition for exploring semantic similarities in music’. In: Proc. of DAFx-10.

Fuhrmann, F. (2012). Automatic musical instrument recognition from polyphonic music audio signals. PhD dissertation, Universitat Pompeu Fabra, Barcelona, Spain.

Gómez, E., Haro, M. & Herrera, P. (2009). ‘Music and geography: Content description of musical audio from different parts of the world’. In: Proc. of ISMIR, Kobe, Japan.

Gonzalez, E. P., & Reiss, J. (2007). ‘Automatic mixing: live downmixing stereo panner’. In: Proc. of the 10th int. conference on digital audio effects (DAFx-07), Bordeaux, France.

Herrera, P., Yeterian, A. & Gouyon, F. (2002). ‘Automatic classification of drum sounds: a comparison of feature selection methods and classification techniques’. In: Lecture Notes in Computer Science, Volume 2445/2002, pp. 69-80.

Herrera, P., Peeters, G., and Dubnov, S. (2003). ‘Automatic classification of musical instrument sounds’. In: Journal of New Music Research, 32(1), pp. 3–22.

Herrera, P., Klapuri, A. & Davy, M. (2006). ‘Automatic Classification of Pitched Musical Instrument Sounds’. In: Klapuri, A., Davy, M. (Eds.) Signal Processing Methods for Music Transcription. New York: Springer.

Herrera, P. & Gouyon, F. (2013). ‘MIRrors: Music Information Research reflects on its future’. In: Journal of Intelligent Information Systems, 41(3), pp. 339-343.

McFee, B., Barrington, L., & Lanckriet, G. (2012). ‘Learning Content Similarity for Music Recommendation’. In: IEEE Transactions on Audio, Speech & Language Processing, 20(8), pp. 2207-2218.

Montecchio, N., & Cont, A. (2011). ‘Accelerating the Mixing Phase in Studio Recording Productions by Automatic Audio Alignment’. In: ISMIR.

O’Callaghan, C. (2007). Sounds. New York: Oxford University Press.

Panagakis, Y. & Kotropoulos, C. (2013). ‘Music classification by low-rank semantic mappings’. In: EURASIP Journal on Audio, Speech, and Music Processing, (1), p. 13.

Peeters, G. (2006). ‘Chroma-based estimation of musical key from audio-signal analysis’. In: ISMIR, pp. 115-120.

Richardson, P., & Toulson, R. (2011). ‘Fine Tuning Percussion–A New Educational Approach.’ In: Journal of Art of Record Production. Issue 5.

Serrà, J. (2007). Music similarity based on sequences of descriptors: tonal features applied to audio cover song identification. Master thesis in Information, Communication and Audiovisual Media Technologies. Universitat Pompeu Fabra, Barcelona.

Serrà, J., Corral, A., Boguñá, M., Haro, M. & Arcos, J. Ll.  (2012). ‘Measuring the Evolution of Contemporary Western Popular Music’. In: Sci Rep, 2, 521. Published online 2012 July 26. doi: 10.1038/srep00521

Sethares, W. A. (1997). Tuning, Timbre, Spectrum, Scale. New York: Springer.

Tzanetakis, G., & Cook, P. (2002). ‘Musical genre classification of audio signals’. In: IEEE Transactions on Speech and Audio Processing, 10(5), pp. 293-302.

Tzanetakis, G., Jones, R., & McNally, K. (2007). ‘Stereo Panning Features for Classifying Recording Production Style’. In: ISMIR, pp. 441-444.

Vail, M. (2000). Vintage Synthesizers. San Francisco: Miller Freeman.

Woodruff, J. F., Pardo, B., & Dannenberg, R. B. (2006). ‘Remixing Stereo Music with Score-Informed Source Separation’. In: ISMIR, pp. 314-319.

Zapata, J. (2013). Comparative evaluation and combination of automatic rhythm description systems. PhD dissertation, Universitat Pompeu Fabra, Barcelona, Spain.