Birdsong audio analysis - finding how well two clips match

I have ~100 WAV audio files, sampled at 48000 Hz, of birds of the same species, and I'd like to measure the similarity between them. I'm starting with wave files, but I know (very slightly) more about working with images, so I assume my analysis will be on the spectrogram images. I have several samples of some birds from different days.

Here are some examples of the data (apologies for unlabeled axes; x is sample, y is linear frequency times something like 10,000 Hz):

[Image: example spectrograms]

These birdsongs apparently occur in "words", distinct segments of song, which is probably the level at which I ought to be comparing: both differences between similar words and the frequency and order of various words.

I want to try to take out cicada noise - cicadas chirp with pretty consistent frequency, and tend to phase-match, so this shouldn't be too hard.

It seems like some thresholding might be useful.

I'm told that most of the existing literature uses manual classification based on song characteristics, like Pandora's Music Genome Project. I want to be like Echo Nest, using automatic classification. Update: A lot of people do study this.

My question is what tools should I use for this analysis? I need to:

Filter/threshold out general noise and keep the music
Filter out specific noises like those of cicadas
Split and classify phrases, syllables, and/or notes in birdsongs
Create measures of difference/similarity between parts; something which will pick up differences between birds, minimizing differences between different calls of the same bird

My weapon of choice is numpy/scipy, but might something like OpenCV be useful here?

Edit: updated my terminology and reworded approach after some research and Steve's helpful answer.

Had to make this an answer as it's simply too long for a comment.

I'm basically working in this field right now so I feel I have some knowledge. Obviously from my standpoint I'd recommend working with audio rather than images. I also recommend using MFCCs as your feature extraction (which you can think of as coefficients which summarise/characterise specific sub-bands of audio frequency [because they are]).

GMMs are the go.

To perform this task you must have some (preferably a lot of) labelled/known data, otherwise there is no basis for the machine learning to take place.

A technicality which you may find useful:

'Then, during testing, you submit a query MFCC vector to the GMM, and it will tell you which species it thinks it is.'

More accurately, you submit a query to each GMM (if you're using them correctly, each gives you a likelihood score, i.e. how probable it is that this particular feature vector was emitted by that probability distribution). Then you compare the likelihood scores from all the GMMs and classify based on the highest one.
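To make the "one GMM per class, pick the highest likelihood" step concrete, here is a minimal sketch. It is not taken from any paper or library: `DiagGMM` is a hypothetical, bare-bones diagonal-covariance GMM written in plain numpy, the class labels are made up, and the "MFCC" frames are synthetic random vectors standing in for real features. In practice you would probably reach for an existing GMM implementation instead.

```python
import numpy as np

def logsumexp(a, axis):
    # Numerically stable log(sum(exp(a))) along one axis
    m = a.max(axis=axis, keepdims=True)
    return (m + np.log(np.exp(a - m).sum(axis=axis, keepdims=True))).squeeze(axis)

class DiagGMM:
    """Bare-bones diagonal-covariance GMM fitted with EM (hypothetical helper)."""

    def __init__(self, n_components=2, n_iter=30, seed=0):
        self.k, self.n_iter, self.seed = n_components, n_iter, seed

    def _log_joint(self, X):
        # log weight_k + log N(x | mean_k, diag(var_k)), shape (n_frames, k)
        diff = X[:, None, :] - self.means[None, :, :]
        log_pdf = -0.5 * (np.log(2 * np.pi * self.vars)[None]
                          + diff ** 2 / self.vars[None]).sum(axis=2)
        return np.log(self.weights)[None, :] + log_pdf

    def fit(self, X):
        rng = np.random.default_rng(self.seed)
        n = len(X)
        self.means = X[rng.choice(n, self.k, replace=False)].astype(float)
        self.vars = np.tile(X.var(axis=0) + 1e-6, (self.k, 1))
        self.weights = np.full(self.k, 1.0 / self.k)
        for _ in range(self.n_iter):
            lj = self._log_joint(X)
            resp = np.exp(lj - logsumexp(lj, axis=1)[:, None])  # E-step
            nk = resp.sum(axis=0) + 1e-10                        # M-step
            self.weights = nk / nk.sum()
            self.means = (resp.T @ X) / nk[:, None]
            second_moment = (resp.T @ X ** 2) / nk[:, None]
            self.vars = np.maximum(second_moment - self.means ** 2, 1e-6)
        return self

    def score(self, X):
        # Mean log-likelihood per frame under this model
        return float(logsumexp(self._log_joint(X), axis=1).mean())

def classify(models, frames):
    # Score the query frames under every class model and keep the best one
    scores = {label: gmm.score(frames) for label, gmm in models.items()}
    return max(scores, key=scores.get), scores

# Synthetic stand-ins for per-frame MFCC vectors of two labelled classes
rng = np.random.default_rng(0)
train = {
    "species_a": rng.normal(0.0, 1.0, (300, 13)),
    "species_b": rng.normal(3.0, 1.0, (300, 13)),
}
models = {label: DiagGMM().fit(X) for label, X in train.items()}

query = rng.normal(0.0, 1.0, (50, 13))  # unseen frames drawn like species_a
best, scores = classify(models, query)
print(best, scores)
```

Scoring a whole clip by the mean per-frame log-likelihood is one common choice here; summing the frame scores would rank the classes the same way for a fixed clip.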

UBMs

Rather than "filtering out" noise, you can simply model all background noise/channel distortion with a UBM (Universal Background Model). This model consists of a GMM trained using all the training data available to you (that is, all the training data you used for each class). You can use this to get a 'likelihood ratio' (Pr[x would be emitted by specific model] / Pr[x would be emitted by background model (UBM)]) to help remove any biasing that can be explained by the background model itself.
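A rough sketch of that likelihood ratio, in the log domain, might look like the following. To keep it short, every model here is a single diagonal Gaussian (effectively a one-component GMM), which is a simplification of mine and not how a real UBM would usually be built. The function names and the synthetic data are also just illustrative assumptions.

```python
import numpy as np

def fit_diag_gaussian(X):
    # One-component "GMM": per-dimension mean and variance
    return X.mean(axis=0), X.var(axis=0) + 1e-6

def mean_log_likelihood(X, model):
    # Average per-frame log-density under a diagonal Gaussian
    mean, var = model
    ll = -0.5 * (np.log(2 * np.pi * var) + (X - mean) ** 2 / var).sum(axis=1)
    return float(ll.mean())

def log_likelihood_ratio(X, class_model, ubm):
    # log( p(x | class model) / p(x | background model) )
    return mean_log_likelihood(X, class_model) - mean_log_likelihood(X, ubm)

# Synthetic stand-ins for MFCC frames of two labelled classes
rng = np.random.default_rng(0)
train_a = rng.normal(0.0, 1.0, (300, 13))
train_b = rng.normal(3.0, 1.0, (300, 13))

model_a = fit_diag_gaussian(train_a)
model_b = fit_diag_gaussian(train_b)
# The UBM is trained on all of the training data pooled together
ubm = fit_diag_gaussian(np.vstack([train_a, train_b]))

query = rng.normal(0.0, 1.0, (50, 13))  # frames drawn like class a
llr_a = log_likelihood_ratio(query, model_a, ubm)
llr_b = log_likelihood_ratio(query, model_b, ubm)
print(llr_a, llr_b)
```

A positive ratio means the class model explains the frames better than the background model does; a negative one means the background model is the better explanation.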

Interesting question, but quite broad. I do recommend looking at some existing literature on automatic bird song identification. (Yup, there are a bunch of people working on it.)

This paper (edit: sorry, dead link, but this chapter by Dufour et al. 2014 might be even clearer) uses a basic two-stage pattern recognition method that I would recommend trying first: feature extraction (the paper uses MFCCs), then classification (the paper uses a GMM). For each frame in the input signal, you get a vector of MFCCs (typically between 10 and 30 coefficients). These MFCC vectors are used to train a GMM (or SVM) along with the corresponding bird species labels. Then, during testing, you submit a query MFCC vector to the GMM, and it will tell you which species it thinks it is.
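For a feel of what "a vector of MFCCs per frame" means, here is a minimal numpy-only sketch of the usual recipe: frame the signal, take the power spectrum, apply a mel filterbank, take logs, then a DCT. This is not the paper's implementation; the frame size, hop, number of filters, and the unnormalised DCT are all assumptions of mine, and a dedicated audio library would normally do this for you.

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_filters, n_fft, sr):
    # Triangular filters with centres equally spaced on the mel scale
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fb = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(n_filters):
        lo, mid, hi = bins[i], bins[i + 1], bins[i + 2]
        if mid > lo:  # rising edge of the triangle
            fb[i, lo:mid] = (np.arange(lo, mid) - lo) / (mid - lo)
        if hi > mid:  # falling edge of the triangle
            fb[i, mid:hi] = (hi - np.arange(mid, hi)) / (hi - mid)
    return fb

def mfcc(signal, sr, n_fft=1024, hop=512, n_filters=26, n_coeffs=13):
    # Slice the signal into overlapping, Hamming-windowed frames
    n_frames = 1 + (len(signal) - n_fft) // hop
    idx = np.arange(n_fft)[None, :] + hop * np.arange(n_frames)[:, None]
    frames = signal[idx] * np.hamming(n_fft)
    # Power spectrum -> mel band energies -> log
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2 / n_fft
    log_mel = np.log(power @ mel_filterbank(n_filters, n_fft, sr).T + 1e-10)
    # Unnormalised DCT-II across the mel bands; keep the first n_coeffs
    n = np.arange(n_filters)
    basis = np.cos(np.pi / n_filters * (n[None, :] + 0.5) * np.arange(n_coeffs)[:, None])
    return log_mel @ basis.T

# One second of a synthetic 3 kHz tone at the question's 48 kHz sample rate
sr = 48000
t = np.arange(sr) / sr
feats = mfcc(np.sin(2 * np.pi * 3000.0 * t), sr)
print(feats.shape)  # one row of 13 coefficients per frame
```

Each row of `feats` is the per-frame feature vector that would be fed to the classifier.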

Although some have applied image processing techniques to audio classification/fingerprinting problems (e.g., this paper by Google Research), I hesitate to recommend these techniques for your problem or ones like it because of the annoying temporal variations.

"What tools should I use for this analysis?" Among many others:

feature extraction: MFCCs, onset detection
classification: GMM, SVM
Google
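As an illustration of the onset detection item, a very simple energy-based detector can be sketched in numpy. This is only one of many ways to find onsets and is my own simplification: the function name, frame size, hop, and threshold are assumptions, and real field recordings with background noise would need something more robust.

```python
import numpy as np

def detect_onsets(signal, sr, frame=1024, hop=512, threshold_ratio=0.1):
    # Short-time energy of overlapping frames
    n_frames = 1 + (len(signal) - frame) // hop
    idx = np.arange(frame)[None, :] + hop * np.arange(n_frames)[:, None]
    energy = (signal[idx] ** 2).mean(axis=1)
    # A frame is "active" if its energy exceeds a fraction of the peak energy
    active = energy > threshold_ratio * energy.max()
    # An onset is an active frame whose predecessor was inactive
    rising = np.flatnonzero(active[1:] & ~active[:-1]) + 1
    if active[0]:
        rising = np.insert(rising, 0, 0)
    return rising * hop / sr  # onset times in seconds (frame start)

# Synthetic clip: silence with two 3 kHz bursts starting at 0.2 s and 0.6 s
sr = 48000
t = np.arange(sr) / sr
tone = np.sin(2 * np.pi * 3000.0 * t)
in_burst = ((t >= 0.2) & (t < 0.3)) | ((t >= 0.6) & (t < 0.7))
clip = np.where(in_burst, tone, 0.0)

onsets = detect_onsets(clip, sr)
print(onsets)
```

The reported times are frame start times, so they land slightly before the true onset, by up to roughly one frame length.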

Sorry for the incomplete answer, but it's a broad question, and there is more to this problem than can be answered here briefly.

You are apparently already performing STFT or something similar to construct those images, so I suggest constructing useful summaries of these mixed time/frequency structures. I remember a system built for a slightly different purpose which was able to make good use of audio waveform data by breaking it into a small number (< 30) of bins by time and amplitude and simply counting the number of samples which fell in each bin. You might be able to do something similar, either in the time/amplitude domain or the time/frequency domain.
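The bin-counting idea can be sketched roughly as below. I don't know how the system I'm remembering actually chose its bins, so the 6 x 4 grid, the normalisation, and the function name are assumptions for illustration only.

```python
import numpy as np

def time_amplitude_histogram(signal, n_time_bins=6, n_amp_bins=4):
    # Relative position of every sample in the clip, in [0, 1)
    t = np.arange(len(signal)) / len(signal)
    # Absolute amplitude, scaled to [0, 1] by the clip's peak
    amp = np.abs(signal) / (np.abs(signal).max() + 1e-12)
    counts, _, _ = np.histogram2d(
        t, amp, bins=[n_time_bins, n_amp_bins], range=[[0, 1], [0, 1]]
    )
    # Normalise so clips of different lengths are comparable
    return counts / counts.sum()

# Synthetic clip: a decaying 3 kHz tone at 48 kHz
sr = 48000
t = np.arange(sr) / sr
clip = np.exp(-3.0 * t) * np.sin(2 * np.pi * 3000.0 * t)

summary = time_amplitude_histogram(clip)
print(summary.shape, summary.size)
```

The flattened `summary` (24 numbers here) can then serve as a compact fixed-length feature vector for a clip; the same counting could be applied to spectrogram cells instead of raw samples.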

Depending on how you want to define your application, you may need either a supervised or an unsupervised approach. In the first case you will need some annotation process in order to provide the training phase with a set of mappings from samples (audio files) to classes (bird IDs or whatever your class is). With an unsupervised approach, you need to cluster your data so that similar sounds are mapped to the same cluster.
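For the unsupervised case, a minimal k-means sketch in plain numpy gives the idea. It assumes each clip has already been reduced to one fixed-length feature vector (how you do that is up to you), and it uses a deterministic farthest-point initialisation that I picked for simplicity. The data is synthetic, and none of this reflects what any particular library does internally.

```python
import numpy as np

def kmeans(X, k, n_iter=20):
    # Deterministic init: first point, then repeatedly the point
    # farthest from all centres chosen so far
    centers = [X[0]]
    for _ in range(1, k):
        d = np.min([((X - c) ** 2).sum(axis=1) for c in centers], axis=0)
        centers.append(X[d.argmax()])
    centers = np.array(centers, dtype=float)

    for _ in range(n_iter):
        # Assign every clip to its nearest centre
        dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        # Move each centre to the mean of the clips assigned to it
        for j in range(k):
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(axis=0)
    return labels, centers

# Synthetic clip-level feature vectors from two well-separated groups
rng = np.random.default_rng(0)
clips = np.vstack([
    rng.normal(0.0, 0.5, (20, 5)),
    rng.normal(10.0, 0.5, (20, 5)),
])
labels, centers = kmeans(clips, k=2)
print(labels)
```

Clips that end up with the same label are the ones the algorithm considers similar; whether those clusters correspond to individual birds depends entirely on the features.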

You could try my library, pyAudioAnalysis, which provides high-level wrappers for both sound classification and sound clustering.