Tuesday, June 14, 2016

Howto: Design and Code a Music Visualizer

Just here for code? Look no further. 
What is a Music Visualizer?
  • A generation of visuals based on the music. demo

How to Implement a Music Visualizer? 
  • Process the audio file and run a Fourier transform on the audio data to get information about the original sound wave (amplitude and frequency)
  • Store this data 
  • Output a visual based on the stored data when music is played
Things to Think About Before Coding
  • How to play the sound?
  • How to implement Fourier transformation?
  • How to interpret information from the Fourier transformation?
  • How to sync visual with music?
  • What does the data in an audio file represent? 

How I Implemented my Music Visualization Software

I wrote my visualization software in C and used the SDL2 sound API to play a WAV audio file.  To compute the Fourier transform I used FFTW, a C library known for efficiently computing Discrete Fourier Transforms (DFTs).  My visuals (a power spectrum of selected frequencies) are printed to the Linux terminal.



Using DFT Results to Calculate Sound Frequency

Calculating the frequencies from the DFT is a bit tricky.  Each DFT result comes from multiplying the signal by a wave that completes k cycles over the sample window and adding up the products. k runs from 0 to N-1, where N is the number of samples; note that k is an index (a cycle count), not yet a frequency in Hz. The summation acts as a filter (read up on constructive and destructive interference of waves). The DFT returns the amount of frequency k in the signal (amplitude and phase), represented in complex form, i.e. as real and imaginary values.

Now to calculate the sound frequency from DFT we need to use the sampling rate value:

freq = i * Fs / N;      (1)
where,
freq = frequency in Hertz,
i = index (position of DFT output or can also think of it as representing the number of cycles)
Fs = sampling rate of audio,
N = size of FFT buffer or array.
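
Equation (1) is a one-liner in C. The sketch below uses a hypothetical helper name, bin_to_freq, and does the arithmetic in double, because with plain ints the division would truncate (1 * 44100 / 2048 gives 21 rather than 21.53).

```c
/* Frequency in Hz of DFT bin i, for sample rate fs (Hz) and FFT size n. */
double bin_to_freq(int i, double fs, int n)
{
    return (double)i * fs / (double)n;
}
```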

To explain further, let's say that:

N = 2048          //a buffer that holds 2048 audio data samples
Fs = 44100       //a common sample rate [frames per sec] for audio signals: 44.1 kHz

The spectral bin numbers aka frequency bins using equation (1) from above would be:

    bin:      i      Fs         N            freq
     0  :     0  *  44100 /  2048  =        0.0 Hz
     1  :     1  *  44100 /  2048  =       21.5 Hz
     2  :     2  *  44100 /  2048  =       43.1 Hz
     3  :     3  *  44100 /  2048  =       64.6 Hz
     4  :     ...
     5  :     ...

   1024 :  1024  *  44100 /  2048  =       22.05 kHz

Note that the useful index range for frequencies is 1 to N/2. The 0th bin represents "DC" (the average level of the signal) and the N/2-th bin represents the "Nyquist" frequency. For real input such as audio samples, the bins above the Nyquist frequency are redundant: they mirror the bins below it.

Also note that the magnitude is needed to create the power spectrum.

Finding Peak Magnitude and Using it to Find the Peak Frequency

For our visual we need to determine which frequency (out of the N/2 useful frequency bins) has the strongest power (peak magnitude).   So we'll need to find the position of this peak magnitude and, from it, the peak frequency.

To find the magnitude we use the results from the DFT.  The DFT gives us the real (re) and imaginary (im) values, so we can treat them as coordinates and use the Pythagorean theorem to find the magnitude (mag):

re^2 + im^2 = mag^2;       so,
mag = sqrt(re*re + im*im)

To find the peak frequency of all 2048 frame samples we will need to find the index where the magnitude is the largest. Then substitute that index for "i" in the frequency equation (1).   The pseudo code algorithm would look like:


// copy real input data to complex FFT buffer
for i = 0 to N - 1
    fft[2*i] = data[i]
    fft[2*i+1] = 0
perform in-place complex-to-complex FFT on fft[] buffer

// calculate power spectrum (magnitude) values from fft[]
for i = 0 to N / 2 - 1
    re = fft[2*i]
    im = fft[2*i+1]
    magnitude[i] = sqrt(re*re+im*im)

// find largest peak in power spectrum
max_magnitude = -INF
max_index = -1
for i = 0 to N / 2 - 1
    if magnitude[i] > max_magnitude
        max_magnitude = magnitude[i]
        max_index = i

// convert index of largest peak to frequency
freq = max_index * Fs / N

In addition to a single peak frequency based on the peak magnitude over N (2048) sample frames, I also calculated peak frequencies and peak magnitudes for the following frequency ranges:

  • 20 to 140 Hz:  Bass range
  • 140 to 400 Hz:  Mid-Bass range
  • 400 to 2600 Hz:  Midrange
  • 2600 to 5200 Hz:  Upper Midrange
  • 5200 Hz to Nyquist:  High end

 
The C implementation would look like:


        /* Peak magnitude per frequency range. Magnitudes are never
           negative, so 0.0 is a safe starting value. */
        double max[5] = { 0.0, 0.0, 0.0, 0.0, 0.0 };

        double re, im;
        double peakmax = 0.0;   /* largest magnitude over all bins */
        int max_index = -1;     /* bin index of that magnitude */

        /* F is the FFT size; only the first F/2 bins are useful */
        for (int m = 0; m < F/2; m++) {
            re = fftw.out[m][0];
            im = fftw.out[m][1];

            fftw.magnitude[m] = sqrt(re*re + im*im);

            /* frequency of bin m, equation (1) */
            float freq = m * (float)wavSpec.freq / F;

            if (freq > 19 && freq <= 140) {             /* bass */
                if (fftw.magnitude[m] > max[0]) {
                    max[0] = fftw.magnitude[m];
                }
            }
            else if (freq > 140 && freq <= 400) {       /* mid-bass */
                if (fftw.magnitude[m] > max[1]) {
                    max[1] = fftw.magnitude[m];
                }
            }
            else if (freq > 400 && freq <= 2600) {      /* midrange */
                if (fftw.magnitude[m] > max[2]) {
                    max[2] = fftw.magnitude[m];
                }
            }
            else if (freq > 2600 && freq <= 5200) {     /* upper midrange */
                if (fftw.magnitude[m] > max[3]) {
                    max[3] = fftw.magnitude[m];
                }
            }
            else if (freq > 5200 && freq <= audio.SamplesFrequency/2) {   /* high end */
                if (fftw.magnitude[m] > max[4]) {
                    max[4] = fftw.magnitude[m];
                }
            }

            /* overall peak across all ranges */
            if (fftw.magnitude[m] > peakmax) {
                peakmax = fftw.magnitude[m];
                max_index = m;
            }
        }//end for


To simplify the code, we can store the frequency ranges into an array and just process that array:


        double freq_bin[] = {19.0, 140.0, 400.0, 2600.0, 5200.0, nyquist };

        for (int j = 0; j < frames/2; ++j) {

            re = fftw.out[j][0];
            im = fftw.out[j][1];

            magnitude = sqrt(re*re + im*im);

            double freq = j * (double)wavSpec.freq / frames;

            for (int i = 0; i < BUCKETS; ++i) {
                if ((freq > freq_bin[i]) && (freq <= freq_bin[i+1])) {
                    if (magnitude > peakmaxArray[i]) {
                        peakmaxArray[i] = magnitude;
                    }
                }
            }

            if (magnitude > peakmax) {
                peakmax = magnitude;
                max_index = j;
            }

        }
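
The loop above relies on declarations that are not shown. As a self-contained sketch of the same idea, the function below takes an array of magnitudes that has already been computed and fills in one peak per range. The name band_peaks and its parameters are hypothetical, and BUCKETS is assumed here to be 5, matching the five ranges listed earlier.

```c
#define BUCKETS 5

/* Fill peaks[b] with the largest magnitude whose bin frequency falls in
 * range b, i.e. in (edges[b], edges[b+1]]. magnitude holds n/2 values for
 * an FFT of size n; fs is the sample rate in Hz. */
void band_peaks(const double *magnitude, int n, double fs, double peaks[BUCKETS])
{
    const double edges[BUCKETS + 1] = {19.0, 140.0, 400.0, 2600.0, 5200.0, fs / 2.0};

    for (int b = 0; b < BUCKETS; ++b)
        peaks[b] = 0.0;   /* magnitudes are never negative */

    for (int j = 0; j < n / 2; ++j) {
        double freq = (double)j * fs / (double)n;
        for (int b = 0; b < BUCKETS; ++b) {
            if (freq > edges[b] && freq <= edges[b + 1]) {
                if (magnitude[j] > peaks[b])
                    peaks[b] = magnitude[j];
                break;   /* the ranges do not overlap */
            }
        }
    }
}
```

Bins at or below 19 Hz, including the DC bin, fall outside every range and are ignored.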




We now have frequency and power information of the original sound wave and can store this data into another array which will later be accessed to create our visual.

This algorithm analyzes at most 2048 sample frames at a time in this example. Run it "n" times until all the waveform data in the audio file has been processed. I'll leave it up to you to find the value of "n". Hint: it requires knowing the size of the audio data and other information about the sound stored in the WAV file, so read up on WAV audio files.

Lastly, we can create a visual in the form of a magnitude vs. frequency 2D graph, or any 3D representation such as a sphere, while the music is playing!
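
For terminal output, one simple option is a row of characters per frequency range whose length follows the magnitude. The sketch below is not the exact rendering my visualizer uses; render_bar is a hypothetical helper that scales a magnitude against a maximum chosen by the caller.

```c
#include <string.h>

/* Write a horizontal bar into out, which must hold width + 1 chars.
 * The bar length is magnitude / max_magnitude scaled to width characters,
 * rounded to the nearest character and clamped to the full width. */
void render_bar(double magnitude, double max_magnitude, int width, char *out)
{
    int filled = 0;

    if (max_magnitude > 0.0 && magnitude > 0.0) {
        double ratio = magnitude / max_magnitude;
        if (ratio > 1.0)
            ratio = 1.0;
        filled = (int)(ratio * width + 0.5);
    }
    memset(out, '#', (size_t)filled);
    memset(out + filled, ' ', (size_t)(width - filled));
    out[width] = '\0';
}
```

Printing one such bar per range, and redrawing them for every analyzed block of samples, gives a basic spectrum display.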

But how do we sync the visuals with the music?

Well, that's easy: we can use the features of the sound API, in my case SDL2.  SDL uses a callback function to refill the buffer with audio data whenever it's about to be empty.  The buffer has to be refilled for the music to keep playing. So whenever the callback function is called, just output the matching visual.
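
Here is a rough sketch of that idea without any SDL dependency. The function has the same shape as SDL2's audio callback (userdata, stream, len), but the struct and all of its field names are made up for illustration. It also assumes a signed sample format, where zero bytes mean silence.

```c
#include <string.h>

/* Hypothetical playback state shared by the callback and the renderer. */
struct playback {
    const unsigned char *audio;   /* raw PCM bytes of the whole track */
    size_t length;                /* total number of bytes in audio */
    size_t position;              /* bytes already handed to the device */
    size_t bytes_per_window;      /* bytes covered by one analyzed block */
    size_t visual_index;          /* which stored result to draw now */
};

/* Called whenever the audio device needs len more bytes in stream. */
void fill_audio(void *userdata, unsigned char *stream, int len)
{
    struct playback *p = userdata;
    size_t want = (size_t)len;
    size_t left = p->length - p->position;
    size_t n = want < left ? want : left;

    /* The matching visual is the analyzed block that contains the first
     * byte about to be played. */
    p->visual_index = p->position / p->bytes_per_window;

    memcpy(stream, p->audio + p->position, n);
    memset(stream + n, 0, want - n);   /* pad the tail with silence */
    p->position += n;
}
```

The drawing code can then read visual_index and look up the stored data for that block, after checking the index against the number of stored results, since it can run one past the last block once the track has finished. If the device buffer is set to the same number of sample frames as the analysis block (2048 here), each callback lines up with exactly one stored result.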

And that's it!
You should now be capable of implementing a music visualizer.

Happy coding

