The approach to identifying musical instruments relies on a set of features collectively known as "timbre".
There are two broad types of features of a music signal specifying its properties:
- The temporal features (time-domain features), which are simple to extract and have an easy physical interpretation, such as the energy of the signal, the zero-crossing rate, the maximum amplitude, and the minimum energy.
- The spectral features (frequency-based features), which are obtained by converting the time-based signal into the frequency domain using the Fourier transform, such as the fundamental frequency, the frequency components, and the spectral centroid.
Timbre depends primarily upon the spectral features, although it also depends upon the sound pressure and the temporal characteristics of the sound. Some of the prominent features are described below:
[1]
Amplitude Envelope:-
It refers to the changes in the amplitude of a sound over time and is an influential property, as it affects our perception of timbre.
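As a rough illustration, the envelope can be traced frame by frame as the maximum absolute amplitude in each frame. The sketch below is a minimal version of this idea, assuming a mono signal stored as a NumPy array; the frame and hop sizes are arbitrary choices and the function name is ours.

```python
import numpy as np

def amplitude_envelope(signal, frame_size=1024, hop_length=512):
    """Maximum absolute amplitude in each frame of a mono signal."""
    return np.array([
        np.max(np.abs(signal[i:i + frame_size]))
        for i in range(0, len(signal), hop_length)
    ])
```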
[2]
Zero-crossing rate:-
The zero-crossing rate is the rate of sign changes along a signal, i.e., the rate at which the signal changes from positive to negative or back. It is a good measure of the pitch as well as the noisiness of a signal. This feature has been used heavily in both speech recognition and music information retrieval, and is a key feature for classifying percussive sounds. It is defined as
$$\mathrm{zcr} = \frac{1}{T-1} \sum_{t=1}^{T-1} \mathbb{1}\{s_t s_{t-1} < 0\}$$
where $s$ is a signal of length $T$ and $\mathbb{1}\{\cdot\}$ is an indicator function, equal to 1 when consecutive samples differ in sign and 0 otherwise.
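The formula translates almost directly into code. A minimal sketch, assuming a mono NumPy signal (the function name is ours):

```python
import numpy as np

def zero_crossing_rate(s):
    """Fraction of consecutive sample pairs whose product is negative,
    i.e. the rate of sign changes along the signal."""
    s = np.asarray(s, dtype=float)
    return np.mean(s[:-1] * s[1:] < 0)
```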
[3]
Pitch:-
It is a perceptual property of sounds that allows their ordering on a frequency-related scale. Pitch may be quantified by frequency; "high" pitch means very rapid oscillation, and "low" pitch corresponds to slower oscillation.
[4]
Spectral Envelope:-
The spectral envelope is a curve in the frequency-amplitude plane, derived from a Fourier magnitude spectrum. It wraps tightly and smoothly around the magnitude spectrum, linking the peaks.
[5]
Spectral Centroid:-
Spectral Centroid is simply the centroid of the spectral envelope, and is one of the most important attributes governing timbre.
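Concretely, the centroid is usually computed as the magnitude-weighted mean frequency of a frame's spectrum. A minimal sketch of that computation (some implementations weight by power rather than magnitude; the function name is ours):

```python
import numpy as np

def spectral_centroid(frame, sample_rate):
    """Magnitude-weighted mean frequency of one frame's spectrum."""
    magnitudes = np.abs(np.fft.rfft(frame))
    freqs = np.fft.rfftfreq(len(frame), d=1.0 / sample_rate)
    return np.sum(freqs * magnitudes) / np.sum(magnitudes)
```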
[6]
Spectral Rolloff Frequency :-
This is a measure of the right-skewness of the energy spectrum. The spectral rolloff point is the frequency in the spectrum below which 85% of the total energy is contained; this ratio is fixed to 85% by default (Tzanetakis and Cook, 2002).
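A minimal sketch of the rolloff computation on a single frame, using the 85% default described above (the function name and frame conventions are our own choices):

```python
import numpy as np

def spectral_rolloff(frame, sample_rate, roll_percent=0.85):
    """Frequency below which roll_percent of the spectral energy lies."""
    power = np.abs(np.fft.rfft(frame)) ** 2
    freqs = np.fft.rfftfreq(len(frame), d=1.0 / sample_rate)
    cumulative = np.cumsum(power)
    # First bin whose cumulative energy reaches the threshold.
    idx = np.searchsorted(cumulative, roll_percent * cumulative[-1])
    return freqs[idx]
```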
[7]
Intensity:-
The sum of the energy in the spectral envelope approximates the instantaneous loudness of the signal. Tracking this over time leads to simple measures of amplitude modulation, which can reveal tremolo (an important feature for brass instruments). Intensity is the basis of many other spectral features, such as the spectral rolloff frequency.
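One way to track this, sketched below under the same assumptions as before (mono NumPy signal, arbitrary frame and hop sizes), is to sum the spectral energy per frame; the fluctuation of the resulting curve over time is what reveals tremolo:

```python
import numpy as np

def intensity_track(signal, frame_size=1024, hop_length=512):
    """Per-frame spectral energy, a simple proxy for instantaneous loudness."""
    return np.array([
        np.sum(np.abs(np.fft.rfft(signal[i:i + frame_size])) ** 2)
        for i in range(0, len(signal) - frame_size, hop_length)
    ])
```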
[8]
Spectral Crest:-
It is defined as the ratio of the peak to the RMS value of the spectrum. It is usually expressed in dB, so it is alternatively defined as the level difference between the peak and the RMS value of the waveform. Most ambient noise has a crest factor of around 10 dB, while impulsive sounds such as gunshots can have crest factors of over 30 dB.
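A minimal sketch of the spectral version of this ratio, expressed in dB (the function name is ours):

```python
import numpy as np

def spectral_crest_db(frame):
    """Level difference in dB between the spectral peak and the spectral RMS."""
    magnitudes = np.abs(np.fft.rfft(frame))
    rms = np.sqrt(np.mean(magnitudes ** 2))
    return 20.0 * np.log10(np.max(magnitudes) / rms)
```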
[9]
Spectral flatness or tonality coefficient:-
(also popularly known as Wiener entropy) is a measure used in digital signal processing to quantify how noise-like a sound is, as opposed to being tone-like. "Tonal" in this context refers to the amount of peaks or resonant structure in a power spectrum, as opposed to the flat spectrum of white noise. A high spectral flatness (approaching 1.0 for white noise) indicates that the spectrum has a similar amount of power in all spectral bands; such a signal would sound similar to white noise, and the graph of the spectrum would appear relatively flat and smooth. A low spectral flatness (approaching 0.0 for a pure tone) indicates that the spectral power is concentrated in a relatively small number of bands; this would typically sound like a mixture of sine waves, and the spectrum would appear "spiky".
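The standard definition of spectral flatness is the ratio of the geometric mean to the arithmetic mean of the power spectrum, which yields the 0-to-1 behaviour described above. A minimal per-frame sketch (the small epsilon guards the logarithm against zero bins; the function name is ours):

```python
import numpy as np

def spectral_flatness(frame, eps=1e-12):
    """Geometric mean over arithmetic mean of the power spectrum (0..1)."""
    power = np.abs(np.fft.rfft(frame)) ** 2 + eps
    geometric_mean = np.exp(np.mean(np.log(power)))
    return geometric_mean / np.mean(power)
```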
[10]
The skewness of a spectrum:-
The skewness of a spectrum is the third central moment of this spectrum, divided by the 1.5 power of the second central moment.
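Written out, with $\mu_2$ and $\mu_3$ denoting the second and third central moments of the spectrum (treated as a distribution over frequency):

$$\text{skewness} = \frac{\mu_3}{\mu_2^{3/2}}$$

A positive value indicates that the energy is concentrated below the centroid, with a tail extending toward the higher frequencies.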
[11]
Spectral Slope:-
The spectral slope is, like the spectral decrease, a measure of the slope of the spectral shape. It is calculated using a linear approximation of the magnitude spectrum; more specifically, a linear regression approach is used. In the form presented here, the linear function is modeled from the magnitude spectrum.
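A minimal sketch of this regression on one frame, fitting a straight line to the magnitude spectrum as a function of frequency and returning its slope (the function name is ours):

```python
import numpy as np

def spectral_slope(frame, sample_rate):
    """Slope of a least-squares line fitted to the magnitude spectrum."""
    magnitudes = np.abs(np.fft.rfft(frame))
    freqs = np.fft.rfftfreq(len(frame), d=1.0 / sample_rate)
    slope, _intercept = np.polyfit(freqs, magnitudes, 1)
    return slope
```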
[12]
Spectral Decrease:-
The spectral decrease estimates the steepness of the decrease of the spectral envelope over frequency. The result is a value vSD(n) <= 1; low results indicate a concentration of the spectral energy at bin 0.
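A common textbook formulation (e.g. Lerch's, whose vSD(n) notation the description above uses) weights the drop of each bin relative to bin 0 by 1/k and normalises by the total magnitude above bin 0. A minimal per-frame sketch of that formulation (the function name is ours):

```python
import numpy as np

def spectral_decrease(frame):
    """1/k-weighted average decrease of the magnitude spectrum from bin 0."""
    x = np.abs(np.fft.rfft(frame))
    k = np.arange(1, len(x))
    return np.sum((x[1:] - x[0]) / k) / np.sum(x[1:])
```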
[13]
Variance of Spectral Centroid:-
It is the variance of the spectral centroid computed frame by frame over the signal. This is a useful feature, as it indicates how the spectral energy distribution shifts over the frequency range during the course of the signal.
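A minimal sketch combining the per-frame centroid from above with a variance over all frames (same assumptions: mono NumPy signal, arbitrary frame and hop sizes):

```python
import numpy as np

def centroid_variance(signal, sample_rate, frame_size=1024, hop_length=512):
    """Variance of the per-frame spectral centroid over the whole signal."""
    freqs = np.fft.rfftfreq(frame_size, d=1.0 / sample_rate)
    centroids = []
    for i in range(0, len(signal) - frame_size, hop_length):
        magnitudes = np.abs(np.fft.rfft(signal[i:i + frame_size]))
        centroids.append(np.sum(freqs * magnitudes) / np.sum(magnitudes))
    return np.var(centroids)
```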
[14]
The mel-frequency cepstrum (MFC):-
It is a representation of the short-term power spectrum of a sound, based on a linear cosine transform of a log power spectrum on a nonlinear mel scale of frequency. The difference between the normal cepstrum and the mel-frequency cepstrum is that in the MFC the frequency bands are equally spaced on the mel scale, which approximates the human auditory system's response more closely than the linearly spaced frequency bands used in the normal cepstrum. This frequency warping can allow for a better representation of sound.
Step-by-Step Explanation of MFCC Feature Extraction (a code sketch follows these steps):-
1. Frame the signal into short frames.
2. For each frame calculate the periodogram estimate of the power spectrum.
3. Apply the mel filterbank to the power spectra, sum the energy in each filter.
4. Take the logarithm of all filterbank energies.
5. Take the DCT of the log filterbank energies.
6. Keep DCT coefficients 2-13, discard the rest.
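A minimal NumPy/SciPy sketch of these six steps, with windowing and pre-emphasis omitted for brevity; the triangular filterbank construction follows the common textbook recipe, and all names and default sizes are our own choices:

```python
import numpy as np
from scipy.fftpack import dct

def hz_to_mel(f):
    """Convert frequency in Hz to the mel scale."""
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    """Convert mel-scale values back to Hz."""
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mfcc(signal, sample_rate, frame_size=1024, hop_length=512,
         n_filters=26, n_coeffs=12):
    # 1. Frame the signal into short, overlapping frames.
    frames = np.array([signal[i:i + frame_size]
                       for i in range(0, len(signal) - frame_size, hop_length)])
    # 2. Periodogram estimate of the power spectrum of each frame.
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2 / frame_size
    # 3. Triangular filters equally spaced on the mel scale.
    mel_points = np.linspace(hz_to_mel(0.0), hz_to_mel(sample_rate / 2.0),
                             n_filters + 2)
    bins = np.floor((frame_size + 1) * mel_to_hz(mel_points)
                    / sample_rate).astype(int)
    fbank = np.zeros((n_filters, frame_size // 2 + 1))
    for m in range(1, n_filters + 1):
        left, center, right = bins[m - 1], bins[m], bins[m + 1]
        fbank[m - 1, left:center] = (np.arange(left, center) - left) / max(center - left, 1)
        fbank[m - 1, center:right] = (right - np.arange(center, right)) / max(right - center, 1)
    energies = power @ fbank.T          # sum the energy in each filter
    # 4. Logarithm of all filterbank energies.
    log_energies = np.log(energies + 1e-12)
    # 5. + 6. DCT of the log energies; keep coefficients 2-13 (indices 1..12).
    return dct(log_energies, type=2, axis=1, norm='ortho')[:, 1:n_coeffs + 1]
```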
Intuitive Understanding of Each Step in MFCC Feature Extraction:-
An audio signal is constantly changing, so to simplify things we assume that on short time scales the audio signal doesn't change much (when we say it doesn't change, we mean it is statistically stationary; obviously the samples are constantly changing even on short time scales). This is why we frame the signal into 20-40 ms frames. If the frame is much shorter we don't have enough samples to get a reliable spectral estimate; if it is longer, the signal changes too much throughout the frame.
The next step is to calculate the power spectrum of each frame. This is motivated by the human cochlea (an organ in the ear) which vibrates at different spots depending on the frequency of the incoming sounds. Depending on the location in the cochlea that vibrates (which wobbles small hairs), different nerves fire informing the brain that certain frequencies are present. Our periodogram estimate performs a similar job for us, identifying which frequencies are present in the frame.
The periodogram spectral estimate still contains a lot of information not required for musical instrument recognition. In particular, the cochlea cannot discern the difference between two closely spaced frequencies, and this effect becomes more pronounced as the frequencies increase. For this reason we take clumps of periodogram bins and sum them up to get an idea of how much energy exists in various frequency regions. This is performed by our mel filterbank: the first filter is very narrow and gives an indication of how much energy exists near 0 Hertz. As the frequencies get higher, our filters get wider, as we become less concerned about variations; we are only interested in roughly how much energy occurs at each spot. The mel scale tells us exactly how to space our filterbanks and how wide to make them.
Once we have the filterbank energies, we take the logarithm of them. This is also motivated by human hearing: we don't hear loudness on a linear scale. Generally, to double the perceived volume of a sound we need to put 8 times as much energy into it. This means that large variations in energy may not sound all that different if the sound is loud to begin with. This compression operation makes our features match more closely what humans actually hear. Why the logarithm and not a cube root? The logarithm allows us to use cepstral mean subtraction, which is a channel normalisation technique.
The final step is to compute the DCT of the log filterbank energies. There are two main reasons this is performed. Because our filterbanks are all overlapping, the filterbank energies are quite correlated with each other. The DCT decorrelates the energies, which means diagonal covariance matrices can be used to model the features in e.g. an HMM classifier. But notice that only 12 of the 26 DCT coefficients are kept. This is because the higher DCT coefficients represent fast changes in the filterbank energies, and it turns out that these fast changes actually degrade recognition performance, so we get a small improvement by dropping them.