The Usefulness of Dynamics 


The Prototypical algorithm

We describe here the timbre similarity algorithm on which we base our experiments. As can be seen in Figure 3.2, it follows the same paradigm as the many contributions on HLD extraction systems described in Chapter 2.5, namely modelling polyphonic timbre as the long-term distribution of local spectral features. However, the rationale behind our experiments is to model timbre explicitly, instead of using it implicitly to extract higher-level, correlated concepts. Therefore, we do not accumulate the features of all the sets of songs to be discriminated in a common classifier, but define a metric to compare the distributions of individual songs to one another.
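The paradigm above (one distribution of local features per song, plus a metric between distributions) can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation used in the experiments: it assumes a single full-covariance Gaussian per song, a symmetrised Kullback-Leibler divergence as the metric, and random vectors standing in for real spectral feature frames. All function names are hypothetical.

```python
import numpy as np

def fit_song_model(frames):
    """Summarise one song by the mean and covariance of its feature frames.

    frames: array of shape (n_frames, n_dims). A single Gaussian is an
    assumption made for this sketch only.
    """
    mean = frames.mean(axis=0)
    # Small diagonal term keeps the covariance invertible.
    cov = np.cov(frames, rowvar=False) + 1e-6 * np.eye(frames.shape[1])
    return mean, cov

def kl_gaussian(p, q):
    """Closed-form KL divergence KL(p || q) between two Gaussians."""
    mean_p, cov_p = p
    mean_q, cov_q = q
    dims = mean_p.shape[0]
    cov_q_inv = np.linalg.inv(cov_q)
    diff = mean_q - mean_p
    _, logdet_p = np.linalg.slogdet(cov_p)
    _, logdet_q = np.linalg.slogdet(cov_q)
    return 0.5 * (np.trace(cov_q_inv @ cov_p) + diff @ cov_q_inv @ diff
                  - dims + logdet_q - logdet_p)

def timbre_distance(p, q):
    """Symmetrised KL divergence, used here as the song-to-song metric."""
    return kl_gaussian(p, q) + kl_gaussian(q, p)

# Synthetic stand-ins for per-frame features of three songs:
# a and b share a distribution, c is drawn from a different one.
rng = np.random.default_rng(0)
song_a = rng.normal(0.0, 1.0, size=(500, 8))
song_b = rng.normal(0.0, 1.0, size=(500, 8))
song_c = rng.normal(3.0, 2.0, size=(500, 8))

model_a, model_b, model_c = (fit_song_model(s) for s in (song_a, song_b, song_c))
d_same = timbre_distance(model_a, model_b)
d_diff = timbre_distance(model_a, model_c)
print(d_same < d_diff)
```

Because each song is modelled separately, adding a song only requires fitting one new model and computing its distances to the existing ones.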

MIR Design Patterns and Heuristics

The algorithm described above is designed as an attempt to model polyphonic timbre similarity explicitly, in order to test the underlying assumptions of a large class of automatic music description systems. Only a few previous attempts at building audio similarity functions can be found in the literature. Foote (1997) presents a system that also uses cepstral coefficients as a front-end, but relies on a supervised algorithm (a tree-based vector quantizer) that learns the most distinctive dimensions in a given corpus. Adding one song to this corpus requires relearning the tree, which is expensive. In contrast, our system is completely scalable, since it models each song separately.
Welsh et al. (1999) propose a query-by-similarity system that is also able to match songs according to their timbre. They use a large set of features (1248 floating-point values per song), which are compared with the Euclidean distance. However, their system does not address timbre similarity explicitly: its features model the pitch/tonal content of a song (“returning songs in the same key”), the noise level (“whether it is pure classical music or noisy, saturated hard rock”) and the rhythm. The timbral similarity observed by the authors in some results (“a pop, male vocal song produces results where every song in the top 10 is a male vocal with guitar and drum accompaniment”) therefore appears as a side-effect of the features above, notably those describing the tonal content of the pieces. Our system is both more restrictive and more precise: notably, the features that we use are meant to be independent of pitch. We do not try to model music similarity at large, but only timbral similarity, which is only one similarity relationship among many others (rhythm, melody, style, structure, etc.), some of which are addressed by Welsh et al. We have argued in Aucouturier and Pachet (2002b) that the interestingness of a music retrieval system probably lies in the confrontation between several such similarity relationships.
Finally, Logan and Salomon (2001) propose an approach similar to ours, which also uses cepstrum coefficients, only with a different modelling and a more complex matching algorithm. Only since this contribution and our original formulation of the above-mentioned algorithm in Aucouturier and Pachet (2002b) has “timbre similarity” attracted growing interest in the Music Information Retrieval community (Baumann (2003); Baumann and Pohle (2003); Berenzweig et al. (2003); Herre et al. (2003); Kulesh et al. (2003); Pampalk et al. (2003); Pampalk (2004); Flexer et al. (2005); Stenzel and Kamps (2005); Vignoli and Pauws (2005)).

Pattern: Feature Composition

Modifying a standard feature algorithm chain by inserting an additional mathematical operation. Authors very commonly create a local variant of a standard feature by adding:
• pre-processing, such as low-pass filtering the signal or normalizing its energy (e.g. MFCC(Normalize(x)))
• post-processing, such as taking the derivative (so-called delta-coefficients) or rescaling
• intra-composition, i.e. changing or adding an operation block in the middle of the chain
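The pre- and post-processing variants listed above amount to function composition, which the following sketch illustrates. It is an illustration only: a per-frame log energy stands in for a standard feature such as MFCC, and the helper names (`frame_log_energy`, `normalize_energy`, `delta`, `compose`) are hypothetical.

```python
import numpy as np

def frame_log_energy(x, frame_len=256):
    """Stand-in 'standard feature': log energy of non-overlapping frames."""
    n_frames = len(x) // frame_len
    frames = x[: n_frames * frame_len].reshape(n_frames, frame_len)
    return np.log(np.sum(frames ** 2, axis=1) + 1e-12)

def normalize_energy(x):
    """Pre-processing: rescale the signal to unit RMS energy."""
    return x / (np.sqrt(np.mean(x ** 2)) + 1e-12)

def delta(features):
    """Post-processing: first-order difference between successive frames."""
    return np.diff(features, axis=0)

def compose(*stages):
    """Chain processing stages, applied left to right."""
    def chain(x):
        for stage in stages:
            x = stage(x)
        return x
    return chain

# Feature(Normalize(x)): pre-processing variant.
pre_variant = compose(normalize_energy, frame_log_energy)
# Delta(Feature(x)): post-processing variant.
post_variant = compose(frame_log_energy, delta)

rng = np.random.default_rng(0)
signal = rng.normal(size=2048)
quiet_signal = 0.01 * signal  # same content, much lower recording level

# The plain feature depends on the recording level; the normalised variant does not.
print(np.allclose(frame_log_energy(signal), frame_log_energy(quiet_signal)))
print(np.allclose(pre_variant(signal), pre_variant(quiet_signal)))
# Delta coefficients have one frame fewer than the feature they derive from.
print(post_variant(signal).shape)
```

An intra-composition variant would instead replace or insert a block inside the feature function itself, which in this sketch would mean editing `frame_log_energy` rather than wrapping it.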

Pattern: Cross-Fertilisation

Borrowing a technique, often a feature, which was developed and proved successful in pattern recognition in domains other than music.
Music audio pattern recognition developed out of a large corpus of work done in the context of speech signals, for which a scientific and commercial interest was recognized earlier, notably under the impetus of the 1971 DARPA call on Speech Understanding Research (SUR) (Kurzweil, 1996). The research effort of the five-year SUR project, which targeted a non-real-time recognition system with 90% sentence accuracy for continuous-speech sentences using thousand-word vocabularies, led to such great advances as dynamic programming (Itakura, 1975) and Markov modelling (Itakura, 1976). Cepstrum (see 3.2.3) has been the dominant feature for speech recognition, notably since the classic formulation of Mel Frequency Cepstrum Coefficients (MFCCs) by Rabiner (1989). However, it was originally invented for characterizing the seismic echoes resulting from earthquakes and bomb explosions (Tukey and Healy, 1963). The success of MFCCs in the speech community logically led to their application to music signals, which, to the best of our knowledge, can be traced back to Foote (1997), inspired by previous work on Speaker Recognition (Foote and Silverman, 1994). Logan and Salomon (2001), also coming from Speech Recognition, give a detailed account of how well the assumptions of MFCCs hold for musical signals. Since then, it has been a common pattern to borrow features and techniques for timbre and music modelling from other domains: not only speech recognition, but also seismic data or image processing.
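To make the echo-characterisation origin of the cepstrum concrete, the sketch below computes a real cepstrum (inverse Fourier transform of the log magnitude spectrum) and uses it to recover the delay of a synthetic echo. The signal, the delay and the gain are made-up values chosen for illustration, the echo is circular for simplicity, and `real_cepstrum` is a hypothetical helper name; this is not the MFCC front-end used in the experiments.

```python
import numpy as np

def real_cepstrum(x):
    """Real cepstrum: inverse FFT of the log magnitude spectrum."""
    spectrum = np.fft.fft(x)
    log_magnitude = np.log(np.abs(spectrum) + 1e-12)  # guard against log(0)
    return np.fft.ifft(log_magnitude).real

# Synthetic example: white noise plus a delayed, attenuated copy of itself.
rng = np.random.default_rng(0)
n_samples, delay, gain = 4096, 100, 0.6
source = rng.normal(size=n_samples)
echoed = source + gain * np.roll(source, delay)  # circular echo, for simplicity

cepstrum = real_cepstrum(echoed)
# An echo shows up as a peak at the quefrency equal to its delay.
# Skip index 0 (overall level) and the mirrored second half.
search_region = cepstrum[1 : n_samples // 2]
estimated_delay = int(np.argmax(search_region)) + 1
print(estimated_delay == delay)
```

The same log-spectrum-then-transform idea is what separates slowly varying spectral envelope from fine structure, which is why it transferred from echo analysis to speech and then to music.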


Table of contents :

1 Introduction 
I Epistemology 
2 Dimensions of Timbre 
2.1 Everything but pitch and loudness
2.2 Psychophysical studies
2.3 Automatic recognition of monophonic timbres
2.4 Towards polyphonic textures
2.4.1 The demand of Electronic Music Distribution
2.4.2 The lack of perceptive models
2.5 Implicit modelling
2.6 Thesis Overview: Ten Experiments
3 Dimensions of Timbre Models 
3.1 The Prototypical algorithm
3.1.1 Feature Extraction
3.1.2 Feature Distribution Modelling
3.1.3 Distance Measure
3.2 MIR Design Patterns and Heuristics
3.2.1 Pattern: Tuning feature parameters
3.2.2 Pattern: Tuning model parameters
3.2.3 Pattern: Feature Equivalence
3.2.4 Pattern: Model Equivalence
3.2.5 Pattern: Feature Composition
3.2.6 Pattern: Cross-Fertilisation
3.2.7 Pattern: Modelling Dynamics
3.2.8 Pattern: Higher-level knowledge
3.3 Conclusion
II Experiments 
4 Experiment 1: The Glass Ceiling 
4.1 Experiment
4.2 Method
4.2.1 Explicit modelling
4.2.2 Ground Truth
4.2.3 Evaluation Metric
4.3 Tools
4.3.1 Architecture
4.3.2 Implementations
4.3.3 Algorithms
4.4 Results
4.4.1 Best Results
4.4.2 Significance
4.4.3 Dynamics don’t improve
4.4.4 “Everything performs the same”
4.4.5 Existence of a glass ceiling
4.4.6 False Positives are very bad matches
4.4.7 Existence of hubs
5 Experiment 2: The Usefulness of Dynamics 
5.1 The paradox of Dynamics
5.2 Hypothesis
5.3 Method
5.3.1 Databases
5.3.2 Algorithms
5.3.3 Evaluation Procedure
5.4 Results
6 Experiments 3-8: Understanding Hubs
6.1 Definition
6.2 Why this may be an important problem
6.3 Measures of hubness
6.3.1 Number of occurrences
6.3.2 Neighbor angle
6.3.3 Correlation between measures
6.4 Power-law Distribution
6.5 Experiment 3: Features or Model ?
6.5.1 Hypothesis
6.5.2 Experiment
6.5.3 Results
6.6 Experiment 4: Influence of modelling
6.6.1 Hypothesis
6.6.2 Experiment
6.6.3 Results
6.7 Experiment 5: Intrinsic or extrinsic to songs ?
6.7.1 Hypothesis
6.7.2 Experiment
6.7.3 Results
6.8 Experiment 6: The seductive, but probably wrong, hypothesis of equivalence classes
6.8.1 Hypothesis
6.8.2 Experiment
6.8.3 Results
6.9 Experiment 7: On homogeneity
6.9.1 Hypothesis
6.9.2 Experiment
6.9.3 Results
6.10 Experiment 8: Are hubs a structural property of the algorithms ?
6.10.1 Hypothesis
6.10.2 Experiment
6.10.3 Results
7 Experiments 9 & 10: Grounding 
7.1 Experiment 9: Inferring high-level descriptions with timbre similarity
7.1.1 Material
7.1.2 Methods
7.1.3 Results
7.2 Experiment 10: The use of context
7.2.1 Method
7.2.2 Results
7.2.3 Exploiting correlations with decision trees
7.3 An operational model for grounding high-level descriptions
7.3.1 Algorithm
7.3.2 Preliminary results
8 Conclusion: Toward Cognitive Models 
III Synthèse en français – Digest in French 
IV Appendices 
A Composition of the test database 
B Experiment 1 – Details 
B.1 Tuning feature and model parameters (patterns 3.2.1 and 3.2.2)
B.1.1 influence of SR
B.1.2 influence of DSR
B.1.3 influence of N,M
B.1.4 influence of Windows Size
B.2 Alternative distance measure (pattern 3.2.4)
B.3 Feature Composition (pattern 3.2.5)
B.3.1 Processing commonly used in Speech Recognition
B.3.2 Spectral Contrast
B.4 Feature Equivalence (pattern 3.2.3)
B.5 Modelling dynamics (pattern 3.2.7)
B.5.1 Delta and Acceleration Coefficients
B.5.2 Texture windows
B.5.3 Dynamic modeling with hidden Markov models
B.6 Building in knowledge about note structure (pattern 3.2.8)
B.6.1 Removing noisy frames
B.6.2 Note Segmentation
B.6.3 Comparison of the 2 approaches
B.7 Model Equivalence (pattern 3.2.4)
B.7.1 Pampalk’s Spectrum histograms
B.7.2 MFCCs Histograms
B.8 Borrowing from Image Texture Analysis (pattern 3.2.6)
B.8.1 Image Texture Features
B.8.2 Application to Audio
B.8.3 Vector Quantization
B.8.4 Conclusions of texture analysis
C Comparison of implementation performance 
C.1 Feature Extraction
C.2 Distribution Modelling
D Nearest Neighbor Algorithm 
D.1 Tradeoff between Precision and CPU-time
D.2 Algorithm formulation
D.2.1 Definitions and Assumptions
D.2.2 Efficiency
D.2.3 Implementation
D.3 Application to Timbre Similarity
D.3.1 The Precision-Cputime Tradeoff
D.3.2 Formulation of the Problem
D.3.3 Practical Implementation
D.3.4 Results
E Multiscale segmentation 
F Measures of Hubs 
F.1 Rank-based metrics
F.2 Distance-based metrics
F.3 Correlation between measures
F.3.1 Number of N-occurrences
F.3.2 Number of N-occurrences and Neighbor difference and angle
F.3.3 Number of N-occurrences and TI violations
G Pearson’s χ²-test of independence 
