STRUCTURED-AUDIO ENCODING
Today, high-bandwidth music distribution channels are widely available: high-capacity removable media such as Blu-ray, PDD and even conventional DVDs, high-speed Internet, and the large built-in storage of personal music players. Musical recordings could therefore be distributed in structured formats (as described in Vercoe et al., 1998), which preserve the isolation of individual sounds until the time of playback but require much more storage space than pre-mixed music.
Structured-media formats make automatic multimedia annotation easier. In addition, they give the end user more control over the media playback. For example, an audio enthusiast could take better advantage of a seven-speaker playback setup if the audio material was not pre-mixed for stereo playback. Musicians could “mute” a particular part of a recording and play along.
Although structured formats provide immense advantages over their nonstructured counterparts (such as the current generation of compact discs and DVDs), there is currently no automatic way of adding structure to unstructured recordings.
At the recording stage, the different musical instruments are often recorded and stored using separate channels of a multi-track recording system. Performing realtime Solo AMIR while the instruments are recorded separately, or on archived multi-track master tapes, as described in Section 12.4 and (Livshin and Rodet 2004b), makes it possible to compute instrument metadata automatically and preserve it throughout the production process.
In the future, by combining robust tools from AMIR, music transcription, and speech recognition, it may be possible to build fully or partly automated tools for unstructured-to-structured encoding. See for example (Livshin et al. 2005) for a fully automatic Wave-to-MIDI conversion system.
MUSIC INFORMATION RETRIEVAL (MIR)
Successful Musical Instrument Recognition can benefit other MIR research fields, such as f0-estimation and score alignment, by allowing their algorithms to assume spectral, temporal or other qualities of the specific instruments participating in the analyzed musical piece.
Automatic Transcription
The process of listening to a piece of music and reconstructing the notated score is known as transcription. More generally, transcription is the process of determining which musical notes were played when (and by which instrument) in a musical recording or performance. In the general case of music played by multiple instruments (or a single polyphonic instrument such as the guitar or piano), the task is one of polyphonic pitch tracking. This is difficult: humans require extensive training in order to transcribe music reliably. Nevertheless, transcription is an important tool for music theorists, music psychologists, and musicologists, not to mention music lovers who want to figure out what their favorite artists are playing in rapid passages, so it would be valuable to have tools that could aid the transcription process or automate it entirely. Polyphonic pitch tracking research demonstrates that the task may be made simpler if good, explicit models of the sound sources (the musical instruments) are available (Kashino and Murase, 1998). Integrating sound source recognition with a transcription engine can therefore improve the end result.
In addition to the possibility of improving the transcription engine by providing it with the instrument modules, AMIR is indispensable for segmenting the notes by instrument. In an unpublished report (Livshin et al. 2005), we integrated an instrument recognition module with a multiple-f0 estimation module to create a system which automatically creates partituras (scores) out of recorded music, by first finding the different notes in a musical piece and then arranging them into staves by their recognized musical instruments.
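The final step described above, arranging detected notes into staves by recognized instrument, can be illustrated with a minimal sketch. It assumes that note detection and instrument recognition have already run; the `Note` record and the `arrange_into_staves` function are hypothetical names introduced here for illustration and are not taken from the cited system.

```python
from collections import defaultdict
from typing import Dict, List, NamedTuple


class Note(NamedTuple):
    onset: float      # onset time in seconds (assumed output of the f0 stage)
    midi_pitch: int   # pitch as a MIDI note number
    instrument: str   # label assigned by the instrument recognizer


def arrange_into_staves(notes: List[Note]) -> Dict[str, List[Note]]:
    """Group notes into one stave per instrument, ordered by onset time."""
    staves = defaultdict(list)
    for note in sorted(notes, key=lambda n: n.onset):
        staves[note.instrument].append(note)
    return dict(staves)


notes = [
    Note(1.0, 64, "violin"),
    Note(0.0, 48, "cello"),
    Note(0.5, 67, "violin"),
]
staves = arrange_into_staves(notes)
```

Each key of `staves` then corresponds to one stave of the resulting score.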
A TOOL FOR COMPOSERS AND SOUND EDITORS
An AMIR option in sound editing applications can allow quick searches inside the music for parts where a certain instrument is playing.
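As a rough illustration of such a search, suppose a recognizer has already labelled fixed-length analysis frames with an instrument name. The sketch below merges consecutive matching frames into time intervals. The frame labels, the `frame_seconds` parameter and the function name `find_instrument_segments` are assumptions made for this example and do not describe any particular editing application.

```python
from typing import List, Tuple


def find_instrument_segments(
    frame_labels: List[str], target: str, frame_seconds: float
) -> List[Tuple[float, float]]:
    """Return (start, end) times in seconds of runs of frames labelled `target`."""
    segments = []
    run_start = None  # index of the first frame of the current run, if any
    for i, label in enumerate(frame_labels):
        if label == target and run_start is None:
            run_start = i  # a new run begins
        elif label != target and run_start is not None:
            segments.append((run_start * frame_seconds, i * frame_seconds))
            run_start = None  # the run ended on the previous frame
    if run_start is not None:  # close a run that reaches the end of the file
        segments.append((run_start * frame_seconds, len(frame_labels) * frame_seconds))
    return segments


labels = ["piano", "violin", "violin", "piano", "violin"]
violin_segments = find_instrument_segments(labels, "violin", frame_seconds=0.5)
```

With half-second frames, the violin is found in two intervals, from 0.5 s to 1.5 s and from 2.0 s to 2.5 s.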
CHALLENGES
AMIR poses many challenges and questions:
ACCURACY
How can we distinguish between similar instruments – is it possible, for example, to distinguish among seemingly identical sounds coming from different instruments, such as equally pitched sounds of viola and violin?
GENERALITY
The hypothetical goal of building an ideal classifier which could recognize all the sound variations of a musical instrument is a wholly different task than successfully classifying a specific sound database. What are the distinguishing qualities of a “Concept Instrument” which define, bound and fundamentally separate it from all other instrument classes, regardless of the recording conditions or the specific instrument (e.g., a specific Stradivarius violin) being used?
TAXONOMY
What should be the instrument classes, and which instruments should be classified in the same class when categorizing into instrument families? Should sounds recorded in different settings and playing techniques be classified in the same class? Should recordings of a string ensemble, for example, and a pizzicato sound of a single violin be considered the same instrument class? Acoustic and electric guitars?
DATA VALIDITY
Does the instrument recognition algorithm learn enough sound variations of an instrument (“encapsulation”)? Are all the learned sound samples really beneficial for the recognition or maybe some actually sabotage it (“consistency”)? Are there “bad” samples or misclassified sounds (“database errors”)?
POLYPHONICITY
Being able to distinguish Solo recordings of one instrument from another does not suffice in the multi-instrumental case, where several instruments play concurrently. The sounds are intermixed and even influence each other. How do we handle instrument recognition in multi-instrumental, polyphonic (MIP) music?
PATTERN RECOGNITION ISSUES
Misclassification may occur due to different pattern recognition issues:
Classification Process
• Badly defined sound classes, or different classes with virtually identical sounds
• Inappropriate or weak classification algorithms
• Feature descriptors which do not cover enough distinguishing qualities or which are misleading
Sound sets
• Misrepresenting or insufficient Learning set
• Misrepresenting classified sounds
During the last 30 years, different automatic musical instrument recognition (AMIR) systems have been constructed, using different approaches, scopes, and levels of performance. Most of these systems have dealt with AMIR of single, isolated tones (either synthesized or natural). More recent works, from the last 10 years, have employed authentic recordings of musical phrases (Solos), and since 2003, research on AMIR in multi-instrumental, polyphonic (MIP) music began to gain popularity.
ISOLATED TONES
In (Cosi et al., 1994; De Poli and Prandoni, 1997), a series of Kohonen Self-Organizing-Map (SOM) neural networks were constructed, using as inputs feature descriptor vectors, most often MFCC, computed on isolated tones of a specific pitch. One tone per instrument was used, with up to 40 instruments in a single experiment. The dimension of the feature vectors was sometimes reduced using Principal Component Analysis (PCA) (Pearson 1901). Unfortunately, the presented recognition rates are unreliable, as the Test set was not independent of the Learning set (see Chapter 9 for the importance of independent evaluation).
The instrument classification abilities of a feedforward neural network and a K-Nearest Neighbor classifier (KNN) were compared in (Kaminskyj and Materka 1995). The classifiers were trained on feature descriptors based on the temporal envelopes of isolated tones. The two classifiers achieved recognition rates of about 98% classifying tones of four instruments (guitar, piano, marimba, and accordion), over a one-octave pitch range. Again, as with the papers discussed in the previous paragraph, while the recognition rate seems high, both the Learning and Test data were recorded in the same recording conditions: same instruments, same players and the same acoustic environment. Moreover, the four recognized instruments sound very different from each other. It is therefore doubtful whether such high recognition rates would be obtained with additional instruments, or even with independent Learning and Test sets.
Traditional pattern-recognition methods were applied by different authors for performing AMIR on isolated tones. In (Bourne 1972), perceptually motivated feature descriptors, including the overall spectrum and the relative onset times of different harmonics, were extracted from 60 clarinet, French horn, and trumpet tones and used as a training set for a Bayesian classifier. The Test set included 15 tones, eight of which did not appear in the Learning set. All tones except one were correctly classified (around 93% recognition rate).
In an unpublished report, Casey (1996) describes a novel recognition framework based on a “distal learning” technique. Using a commercial waveguide synthesizer to produce isolated tones, he trained a neural network to distinguish between two synthesized instruments (brass and single-reed) and to recover their synthesizer control parameters. Although “recognition” results were not quantified as such, the low “outcome error” reported by Casey demonstrates the success of the approach in the limited tests.
In (Fujinaga 1998) a KNN classifier was used with a Learning set consisting of features extracted from 1338 spectral slices representing 23 instruments playing a range of pitches. Using leave-one-out crossvalidation with a genetic algorithm to identify good feature combinations, the system reached a recognition rate of approximately 50%.
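The evaluation scheme named above, a KNN classifier scored with leave-one-out cross-validation, can be sketched in a few lines. This is a generic illustration on made-up two-dimensional feature vectors, not a reconstruction of the cited system: the genetic-algorithm feature selection is omitted, and the function names and toy data are assumptions.

```python
import math
from collections import Counter
from typing import List, Sequence


def knn_predict(
    train_x: List[Sequence[float]], train_y: List[str], query: Sequence[float], k: int
) -> str:
    """Label `query` by majority vote among its k nearest training vectors."""
    order = sorted(range(len(train_x)), key=lambda i: math.dist(train_x[i], query))
    votes = Counter(train_y[i] for i in order[:k])
    return votes.most_common(1)[0][0]


def leave_one_out_rate(
    features: List[Sequence[float]], labels: List[str], k: int = 1
) -> float:
    """Classify each sample using all the other samples; return the fraction correct."""
    correct = 0
    for i in range(len(features)):
        train_x = features[:i] + features[i + 1:]  # everything except sample i
        train_y = labels[:i] + labels[i + 1:]
        if knn_predict(train_x, train_y, features[i], k) == labels[i]:
            correct += 1
    return correct / len(features)


# Toy data: two well-separated clusters standing in for two instruments.
features = [[0.0, 0.1], [0.1, 0.0], [0.2, 0.1], [1.0, 1.1], [1.1, 1.0], [0.9, 1.0]]
labels = ["flute", "flute", "flute", "cello", "cello", "cello"]
rate = leave_one_out_rate(features, labels, k=1)
```

Note that when all samples come from one database, leave-one-out still tests on material recorded under the same conditions as the Learning set.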
(Martin and Kim 1998) exemplified the idea of testing very long lists of features and then selecting only those shown to be most relevant for performing classifications. Martin and Kim worked with log-lag correlograms to better approximate the way our hearing system processes sonic information. They examined 31 features to classify a corpus of 14 orchestral wind and string instruments. They found the following features to be the most useful: vibrato and tremolo strength and frequency, onset harmonic skew (i.e., the differences in the times at which the harmonics arise in the attack portion), centroid-related measures (e.g., average, variance, ratio along note segments, modulation), onset duration, and selected pitch-related measures (e.g., value, variance). The authors noted that the features they studied exhibited non-uniform influences; that is, some features were better at classifying some instruments and instrument families than others. In other words, features could be either relevant or non-relevant depending on the context. The influence of non-relevant features degraded the classification success rates by between 7% and 14%.
Brown (1999) used cepstral coefficients from constant-Q transforms (instead of computing them using FFT-transforms); she also clustered feature vectors in a way that the resulting clusters seemed to be coding some temporal dynamics.
Eronen and Klapuri (2000) used non-Mel scaled Cepstral Coefficients, combining these features with a long list (up to 43) of complementary descriptors; their list included, among other features, the centroid, rise and decay time, frequency/amplitude modulation (FM/AM) rate and width, fundamental frequency and fundamental-variation-related features for onset and for the remainder of the note. In a more recent study, using a large set of features (Eronen, 2001), the features which turned out to be most important were the MFCCs, their standard deviations and their deltas (differences between contiguous frames), the spectral centroid and related features, onset duration, and the crest factor (especially for instrument family discrimination).
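Three of the descriptors mentioned above are simple enough to sketch directly. The code below uses common textbook definitions, which are assumptions here and may differ in detail from those used in the cited studies: the spectral centroid as the magnitude-weighted mean frequency, the crest factor as peak amplitude divided by RMS amplitude, and deltas as differences between contiguous frames. All function names are hypothetical.

```python
import math
from typing import List, Sequence


def spectral_centroid(magnitudes: Sequence[float], bin_hz: float) -> float:
    """Magnitude-weighted mean frequency; bin k is assumed to sit at k * bin_hz."""
    total = sum(magnitudes)
    if total == 0:
        return 0.0  # silent frame: the centroid is undefined, report 0 by convention
    return sum(k * bin_hz * m for k, m in enumerate(magnitudes)) / total


def crest_factor(samples: Sequence[float]) -> float:
    """Ratio of peak absolute amplitude to RMS amplitude of a time-domain frame."""
    rms = math.sqrt(sum(s * s for s in samples) / len(samples))
    return max(abs(s) for s in samples) / rms if rms > 0 else 0.0


def deltas(frames: List[Sequence[float]]) -> List[List[float]]:
    """Element-wise differences between each frame and the one before it."""
    return [[b - a for a, b in zip(prev, cur)] for prev, cur in zip(frames, frames[1:])]


centroid = spectral_centroid([0.0, 1.0, 1.0, 0.0], bin_hz=100.0)
crest = crest_factor([1.0, -1.0, 1.0, -1.0])
frame_deltas = deltas([[1.0, 2.0], [1.5, 1.0]])
```

In this toy input the energy is split equally between the 100 Hz and 200 Hz bins, so the centroid falls midway between them, and the square-wave-like frame has a crest factor of 1 because its peak equals its RMS value.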
In (Kaminskyj 2001) the main author used the RMS envelope, the Constant-Q frequency spectrum, and a set of spectral features derived from Principal Component Analysis (PCA).
Note: Most of the papers on AMIR of isolated tones, including Martin and Kim 1998, Fraser and Fujinaga 1999, Kaminskyj 2001, Agostini et al. 2001, Peeters and Rodet 2002 and others, have used tones randomly selected from a single sound database in both the Learning and Test sets for evaluation of their AMIR techniques. In (Livshin and Rodet 2003) we have shown that results of such evaluation techniques do not necessarily indicate the general ability or performance of an AMIR classification technique. Read further on this issue in Chapter 9.
In (Livshin, Peeters and Rodet 2003) we describe a classification process which produces a high recognition rate with Self-Classification evaluation (see Section 9.4.1). A recognition rate of over 95% was achieved classifying 18 instruments, and the results of three classification algorithms were compared: Multidimensional Gaussian, K-Nearest Neighbors (KNN) and Learning Vector Quantization neural network (LVQ). Lower results were achieved with the Minus-1-DB evaluation method. This paper deals with many aspects of instrument recognition, including feature selection, removing outliers, evaluation techniques and more.
In (Livshin and Rodet 2006b) we have taken a similar approach to (Martin and Kim 1998), (Peeters and Rodet 2002) and (Livshin, Peeters and Rodet 2003), and used a large collection of feature descriptors, creating a weighted list of the most relevant features using Linear Discriminant Analysis on the Learning set. A KNN classifier was used to classify the sounds in a flat “all vs. all” classification of the instrument classes, as unlike in (Peeters and Rodet 2003), preliminary tests using hierarchies of instrument families did not show improvements in classification results over the flat model.
A large and diverse collection of sounds was used: tones of 10 different instruments taken from 13 sound databases. The recognition results were high: 94.84% using the Minus-1-Instance evaluation method (see Section 9.7.1). Lower results were obtained in (Livshin and Rodet 2003a; Livshin and Rodet 2003b) using the same classification method, a very similar feature set and the Minus-1-DB evaluation method, which also uses independent Test and Learning sets. The comparison suggests that these classification techniques suffice for achieving high recognition rates when the Learning database is large and diverse enough. At the time of publishing (Livshin and Rodet 2006b), this was the state of the art for AMIR in separate tones. Read Chapter 11 for full details.
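The idea of keeping the Test set independent of the Learning set can be sketched as a database-level split. The exact definition of Minus-1-DB is given in Chapter 9 and is not reproduced in this excerpt, so the code below rests on an assumption: each sound database in turn is held out as the Test set while all remaining databases form the Learning set. The names `Sample` and `minus_one_db_splits` and the toy data are hypothetical.

```python
from typing import Dict, Iterator, List, Tuple

Sample = Tuple[List[float], str]  # (feature vector, instrument label)


def minus_one_db_splits(
    databases: Dict[str, List[Sample]]
) -> Iterator[Tuple[str, List[Sample], List[Sample]]]:
    """Yield (held_out_name, learning_set, test_set), holding out one database at a time."""
    for held_out in databases:
        # The Learning set pools every sample from all the other databases.
        learning = [
            sample
            for name, samples in databases.items()
            if name != held_out
            for sample in samples
        ]
        yield held_out, learning, databases[held_out]


toy = {
    "db_a": [([0.1], "flute")],
    "db_b": [([0.2], "flute"), ([0.9], "cello")],
    "db_c": [([0.8], "cello")],
}
splits = list(minus_one_db_splits(toy))
```

Under this scheme no recording condition present in the Test set is seen during learning, which is the property that a random split within a single database cannot guarantee.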
For several other historical topics related to AMIR in separate tones, see (Herrera, Peeters and Dubnov 2003).
SOLO PERFORMANCES
Until 1998, there were practically no published reports of musical instrument recognition systems that could operate on realistic musical recordings.
A vector-quantizer based on MFCC features was used in (Dubnov and Rodet 1998) as a front-end to a statistical clustering algorithm. The system was trained with 18 short excerpts from as many instruments. Although the classification results were not reported, it seems that the vector-quantizer has captured something about the “space” of instrument sounds.
A classifier that distinguishes oboe from saxophone recordings was described in (Brown 1999). For each instrument, a Gaussian mixture model (GMM) was trained on constant-Q cepstral coefficients, using one minute of music from each instrument. The recognition rate of the system was 94% for independent, noisy samples from commercial recordings. This work was extended later on, in (Brown Houix and McAdams 2001), where four wind instruments (flute, sax, oboe and clarinet) were classified using combinations of four feature types, reaching a recognition rate of 82% with the best combination of parameters and training material.
In (Marques and Moreno 1999), eight fairly different instruments (bagpipes, clarinet, flute, harpsichord, organ, piano, trombone and violin) were classified using one CD per instrument for learning and one for classification; they compared three feature types using two different classification algorithms, achieving 70% recognition rate. The best classifiers used MFCC features, correctly classifying approximately 72% of the test data. Performance dropped to approximately 45% when the system was tested with “non-professional” recordings, suggesting that the classifier has not generalized in the same way as humans do. The “non-professional” recordings were a subset of the student recordings.
(Martin 1999) classified sets of six, seven and eight instruments, reaching 82.3% (with violin, viola, cello, trumpet, clarinet, and flute), 77.9% and 73% recognition rates respectively. Martin used up to three different recordings from each instrument; in each experiment one recording was classified while the others were learned. The feature set was relatively large for the time and consisted of 31 one-dimensional features.
In our paper (Livshin and Rodet 2004a), we presented a process for recognition of Solos of seven instruments (including two highly polyphonic ones, guitar and piano), using independent test data (unlike some other papers). It yielded a rather high recognition rate (88%) and could also operate in realtime with just a small compromise on the recognition score (85%). At the time of publishing, this was the state of the art for AMIR in Solos, both offline and realtime. See Chapter 12 for full details.
MULTI-INSTRUMENTAL MUSIC
Only a few studies have attempted instrument recognition for polyphonic music; these systems were mostly tested on limited and artificial examples.
A template-based time domain approach was used in (Kashino and Murase 1999). Using three different instruments (flute, violin and piano) and specially arranged ensemble recordings they achieved 68% recognition rate with both the true fundamental frequencies (f0s) and the onsets supplied to the algorithm. With the inclusion of higher level musical knowledge, most importantly voice leading rules, recognition accuracy improved to 88%.
A frequency domain approach was proposed in (Kinoshita et al. 1999), using features related to the sharpness of onsets and the spectral distribution of partials. The f0s were extracted prior to the instrument classification process to determine where partials of more than one f0 would coincide. Using random two-tone combinations from three different instruments (clarinet, violin, piano), they obtained recognition accuracies between 66% and 75% (73% to 81% if the correct f0s were provided), depending on the interval between the two notes.
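To make the notion of coinciding partials concrete, the following sketch lists which partials of two simultaneous notes fall within a small frequency tolerance of each other. It is an illustration only, not the cited authors' procedure: it assumes ideal harmonic series (partials at exact integer multiples of f0, which real instruments such as the piano only approximate), and the function name and the `tolerance_hz` parameter are hypothetical.

```python
from typing import List, Tuple


def coinciding_partials(
    f0_a: float, f0_b: float, n_partials: int = 10, tolerance_hz: float = 5.0
) -> List[Tuple[int, int]]:
    """Return (i, j) pairs where partial i of note A lies within tolerance of partial j of note B."""
    hits = []
    for i in range(1, n_partials + 1):
        for j in range(1, n_partials + 1):
            if abs(i * f0_a - j * f0_b) <= tolerance_hz:
                hits.append((i, j))
    return hits


# A perfect fifth (frequency ratio 3:2): every third partial of the lower note
# meets every second partial of the upper one.
fifth = coinciding_partials(220.0, 330.0, n_partials=6)
```

For this interval the overlapping pairs are (3, 2) at 660 Hz and (6, 4) at 1320 Hz, which illustrates why the amount of overlap, and with it the difficulty of the task, depends on the interval between the two notes.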
Table of contents:
Chapter 1 OVERVIEW
1.1 THE AMIR PROCESS
1.2 THESIS CHAPTERS
Chapter 2 INTRODUCTION
2.1 MOTIVATION
2.1.1 Intelligent search of music
2.1.2 Structured-Audio Encoding
2.1.3 Music Information Retrieval (MIR)
2.1.4 A tool for Composers and sound editors
2.2 CHALLENGES
2.2.1 Accuracy
2.2.2 Generality
2.2.3 Taxonomy
2.2.4 Data Validity
2.2.5 Polyphonicity
2.2.6 Pattern Recognition issues
Chapter 3 HISTORY
3.1 ISOLATED TONES
3.2 SOLO PERFORMANCES
3.3 MULTI-INSTRUMENTAL MUSIC
Chapter 4 TAXONOMIES
Chapter 5 DATA SETS
5.1 SEPARATE TONES
5.2 SOLO PERFORMANCES
5.3 AUTHENTIC DUOS – REAL PERFORMANCES
5.4 MULTI-INSTRUMENTAL SOLO MIXES
Chapter 6 FEATURE DESCRIPTORS
6.1 FEATURE TYPES
6.1.1 Temporal Features
6.1.2 Energy Features
6.1.3 Spectral Features
6.1.4 Harmonic Features
6.1.5 Perceptual Features
6.2 FEATURE LIST
Chapter 7 FEATURE WEIGHTING AND SELECTION
7.1 LINEAR DISCRIMINANT ANALYSIS
7.2 GRADUAL DESCRIPTOR ELIMINATION (GDE) USING DISCRIMINANT ANALYSIS
7.2.1 The GDE Algorithm
7.2.2 Example Evaluation
7.3 CORRELATION-BASED FEATURE SELECTION (CFS)
Chapter 8 CLASSIFICATION ALGORITHMS
8.1 NEURAL NETWORKS
8.1.1 Backpropagation (BP)
8.2 K-NEAREST NEIGHBORS (KNN)
8.2.1 Selection of “K”
8.3 CHOSEN CLASSIFICATION METHOD – « LDA+KNN »
Chapter 9 DIFFERENT EVALUATION TECHNIQUES AND THE IMPORTANCE OF CROSS DATABASE EVALUATION
9.1 INTRODUCTION
9.2 THE TESTING SET
9.2.1 The Sounds
9.2.2 Feature Descriptors
9.3 CLASSIFICATION ALGORITHMS
9.3.1 « LDA+KNN »
9.3.2 « BP80 »
9.4 EVALUATION METHODS
9.4.1 Self-Classification evaluation method
9.4.2 Mutual-Classification evaluation method
9.4.3 Minus-1-DB evaluation method
9.5 DISADVANTAGES OF SELF-CLASSIFICATION
9.6 CONCLUSIONS
9.7 MORE EVALUATION ALGORITHMS
9.7.1 Minus-1 Instrument Instance evaluation method
9.7.2 Minus-1-Solo Evaluation method
9.7.3 Leave-One-Out Cross validation Method
Chapter 10 IMPROVING THE CONSISTENCY OF SOUND DATABASES
10.1 ALGORITHMS FOR REMOVING OUTLIERS
10.1.1 Interquantile Range (IQR)
10.1.2 Modified IQR (MIQR)
10.1.3 Self-Classification Outlier removal (SCO)
10.2 CONTAMINATED DATABASE
10.3 EXPERIMENT
10.4 RESULTS
10.5 CONCLUSIONS
Chapter 11 AMIR OF SEPARATE TONES AND THE SIGNIFICANCE OF NON-HARMONIC “NOISE” VS. THE HARMONIC SERIES
11.1 INTRODUCTION
11.2 ORIGINAL SOUND SET
11.3 NOISE REMOVAL
11.4 HARMONIC SOUNDS AND RESIDUALS
11.5 FEATURE DESCRIPTORS
11.6 FEATURE SELECTION
11.7 CLASSIFICATION AND EVALUATION
11.8 RESULTS
11.8.1 Instrument Recognition
11.8.2 Best 10 Feature Descriptors
11.9 CONCLUSIONS
11.10 FUTURE WORK
Chapter 12 AMIR IN SOLOS
12.1 MOTIVATION
12.2 DATA SET
12.3 CLASSIFICATION
12.4 REALTIME SOLO RECOGNITION
12.5 MINUS-1-SOLO RESULTS
12.5.1 Realtime Feature Set
Chapter 13 AMIR IN MULTI-INSTRUMENTAL, POLYPHONIC MUSIC
13.1 AMIR METHODS FOR MIP MUSIC
13.1.1 “naïve” Solo Classifier
13.1.2 Source-Reduction (SR)
13.1.3 Harmonic-Resynthesis (HR)
13.2 EVALUATION RESULTS
13.2.1 Authentic Duo Recordings
13.3 SOLO MIXTURES
13.3.1 Independent Evaluation
13.3.2 Grading
13.3.3 Results
13.4 CONCLUSIONS
Chapter 14 SUMMARY
Chapter 15 FUTURE WORK
15.1 USING COMPOSITION RULES
15.2 FEATURE DESCRIPTORS
15.2.1 Utilizing information in the non-harmonic Residuals
15.2.2 Heuristic descriptors
15.2.3 Modelling Signal Evolution
15.3 PRACTICAL APPLICATIONS
15.3.1 Increasing the number of Instruments
15.3.2 Speed Improvement
15.4 PRECISE EVALUATION
15.5 HUMAN INTEGRATION