Decoding Shazam: Unraveling Music Recognition Technology
This post delves into Moustapha AGACK’s Devoxx FR 2023 presentation, “Jay-Z, Maths and Signals! How to clone Shazam 🎧,” exploring the technology behind the popular song identification application, Shazam. AGACK shares his journey to understand and replicate Shazam’s functionality, explaining the core concepts of sound, signals, and frequency analysis.
Understanding Shazam’s Core Functionality
Moustapha AGACK begins by captivating the audience with a demonstration of Shazam’s seemingly magical ability to identify songs from brief audio snippets, often recorded in noisy and challenging acoustic environments. He emphasizes the robustness of Shazam’s identification process, noting its ability to function even with background conversations, ambient noise, or variations in recording quality. This remarkable capability sparked Moustapha’s curiosity as a developer, prompting him to embark on a quest to investigate the inner workings of the application.
Moustapha mentions that his exploration started with the seminal paper authored by Avery Wang, a co-founder of Shazam, which meticulously details the design and implementation of the Shazam algorithm. This paper, a cornerstone of music information retrieval, provides deep insights into the signal processing techniques, data structures, and search strategies employed by Shazam. However, Moustapha humorously admits to experiencing initial difficulty in fully grasping the paper’s complex mathematical formalisms and dense signal processing jargon. He acknowledges the steep learning curve associated with the field of digital signal processing, which requires a solid foundation in mathematics, physics, and computer science. Despite the initial challenges, Moustapha emphasizes the importance of visual aids within the paper, such as insightful graphs and illustrative spectrograms, which greatly aided his conceptual understanding and provided valuable intuition.
The Physics of Sound: A Deep Dive
Moustapha explains that sound, at its most fundamental level, is a mechanical wave phenomenon. It originates from the vibration of objects, which disturbs the surrounding air molecules. These molecules collide with their neighbors, transferring the energy of the vibration and causing a chain reaction that propagates the disturbance through the air as a wave. This wave travels through the air at a finite speed (approximately 343 meters per second at room temperature) and eventually reaches our ears, where it is converted into electrical signals that our brains interpret as sound.
These sound waves are typically represented mathematically as sinusoidal signals, also known as sine waves. A sine wave is a smooth, continuous, and periodic curve that oscillates between a maximum and a minimum value. Two key properties characterize these signals, frequency and amplitude, both of which are illustrated in the short code sketch after the list below.
- Frequency is defined as the number of complete cycles of the wave that occur in one second, measured in Hertz (Hz). One Hertz is equivalent to one cycle per second. Frequency is the primary determinant of the perceived pitch of the sound. High-frequency waves correspond to high-pitched sounds (treble), while low-frequency waves correspond to low-pitched sounds (bass). For example, a sound wave oscillating at 440 Hz is perceived as the musical note A above middle C. The higher the frequency, the more rapidly the air molecules are vibrating, and the higher the perceived pitch.
- Amplitude refers to the maximum displacement of the wave from its equilibrium position. It is a measure of the wave’s intensity or strength and directly correlates with the perceived volume or loudness of the sound. A large amplitude corresponds to a loud sound, meaning the air molecules are vibrating with greater force, while a small amplitude corresponds to a quiet sound, indicating gentler vibrations.
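To make these two properties concrete, here is a minimal sketch of how a pure tone could be sampled in code. It assumes NumPy and the standard 44.1 kHz audio sample rate; the function name `sine_wave` is illustrative, not something from the talk.

```python
import numpy as np

SAMPLE_RATE = 44_100  # samples per second (standard CD-quality audio rate)

def sine_wave(frequency_hz: float, amplitude: float, duration_s: float) -> np.ndarray:
    """Sample a pure sine wave: amplitude * sin(2 * pi * frequency * time)."""
    t = np.linspace(0.0, duration_s, int(SAMPLE_RATE * duration_s), endpoint=False)
    return amplitude * np.sin(2 * np.pi * frequency_hz * t)

# Frequency sets the pitch, amplitude sets the loudness:
loud_a4 = sine_wave(440.0, amplitude=1.0, duration_s=1.0)   # A above middle C, loud
quiet_a3 = sine_wave(220.0, amplitude=0.2, duration_s=1.0)  # one octave lower, quiet
```

Played back at the same sample rate, the first array sounds higher-pitched and louder than the second.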
Moustapha notes that the human auditory system perceives a limited range of frequencies, typically spanning from 20 Hz to 20 kHz. In other words, humans can generally hear sounds with frequencies as low as 20 cycles per second and as high as 20,000 cycles per second. This range varies between individuals and tends to narrow with age, particularly at the high-frequency end. Moustapha also points out that high frequencies (above roughly 2,000 Hz, a range to which the ear is especially sensitive) can be perceived as piercing or even painful at sufficient volume.
Connecting Musical Notes and Frequencies
Moustapha draws a direct and precise relationship between musical notes and specific frequencies, a fundamental concept in music theory and acoustics. He uses the A440 standard as a prime example. The A440 standard designates the A note above middle C (also known as concert pitch) as having a frequency of exactly 440 Hz. This standard is crucial in music, as it provides a universal reference for tuning musical instruments, ensuring that musicians playing together are in harmony.
Moustapha then elaborates on the concept of octaves. An octave represents a doubling or halving of frequency: doubling a note’s frequency yields the same note one octave higher, while halving it yields the same note one octave lower. This logarithmic relationship between pitch and frequency is essential for understanding musical scales, chords, and harmonies.
For instance:
- The A note in the octave below A440 has a frequency of 220 Hz (440 Hz / 2).
- The A note in the octave above A440 has a frequency of 880 Hz (440 Hz * 2).
This consistent doubling or halving of frequency for each octave creates a predictable and harmonious relationship between notes, which is exploited by Shazam’s algorithms to identify musical patterns and structures.
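In code, this octave relationship reduces to multiplying or dividing by powers of two. A tiny illustrative helper (the name `octave_of` is ours, not from the talk):

```python
def octave_of(frequency_hz: float, octaves: int) -> float:
    """Shift a frequency up (positive) or down (negative) by whole octaves."""
    return frequency_hz * 2.0 ** octaves

A4 = 440.0                # concert pitch
print(octave_of(A4, -1))  # 220.0 Hz -> the A one octave below
print(octave_of(A4, +1))  # 880.0 Hz -> the A one octave above
```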
The Complexity of Real-World Sound Signals
Moustapha emphasizes that real-world sound is significantly more complex than the idealized pure sine waves used for basic explanations. Real-world sound signals are typically a superposition, or sum, of numerous sine waves, each with its own frequency, amplitude, and phase. These constituent sine waves interact through interference, producing complex and intricate waveforms.
Furthermore, real-world sounds often contain harmonics: additional frequencies that accompany the fundamental frequency of a sound. The fundamental frequency is the lowest frequency component of a complex sound and is typically perceived as the primary pitch. The harmonics above it, also known as overtones, occur at integer multiples of the fundamental frequency. For example, if the fundamental frequency is 440 Hz, the overtones fall at 880 Hz (2 × 440 Hz), 1320 Hz (3 × 440 Hz), and so on.
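This superposition is easy to sketch in code. The helper below sums sine waves at integer multiples of a fundamental; the amplitudes passed in at the end are invented for illustration and do not correspond to any real instrument’s spectrum.

```python
import numpy as np

SAMPLE_RATE = 44_100  # samples per second

def harmonic_tone(fundamental_hz: float, amplitudes: list[float],
                  duration_s: float = 1.0) -> np.ndarray:
    """Sum sine waves at integer multiples of the fundamental.

    amplitudes[0] scales the fundamental itself, amplitudes[1] the
    component at twice the fundamental, and so on.
    """
    t = np.linspace(0.0, duration_s, int(SAMPLE_RATE * duration_s), endpoint=False)
    signal = np.zeros_like(t)
    for k, amp in enumerate(amplitudes, start=1):
        signal += amp * np.sin(2 * np.pi * k * fundamental_hz * t)
    return signal

# A 440 Hz fundamental with progressively weaker overtones at 880, 1320, 1760 Hz.
complex_tone = harmonic_tone(440.0, [1.0, 0.5, 0.25, 0.125])
```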
Moustapha illustrates this complexity with the example of a piano playing the A440 note. While the piano produces a strong fundamental at 440 Hz, it simultaneously generates a series of weaker harmonic frequencies. These harmonics are not “noise” or unwanted interference in the context of music; they are integral to the rich and distinctive sound of the instrument. The specific set of harmonics and their relative amplitudes, or strengths, are what give a piano its characteristic timbre, allowing us to distinguish it from a guitar, a flute, or any other instrument playing the same fundamental note.
Moustapha further explains that the physical characteristics of musical instruments, such as the materials from which they are constructed (e.g., wood, metal), their shape and size, the way they produce sound (e.g., strings vibrating, air resonating in a tube), and the presence of resonance chambers, all significantly influence the production and relative intensities of these harmonics. For instance, a violin’s hollow body amplifies certain harmonics, creating its characteristic warm and resonant tone, while a trumpet’s brass construction and flared bell shape emphasize different harmonics, resulting in its bright and piercing sound. This is why a violin and a piano, or a trumpet and a flute, sound so different, even when playing the same fundamental pitch.
He also points out that the human voice is an exceptionally complex sound source. The vocal cords, resonance chambers in the throat and mouth, the shape of the oral cavity, and the position of the tongue and lips all contribute to the unique harmonic content and timbre of each individual’s voice. These intricate interactions make voice recognition and speech analysis challenging tasks, as the acoustic characteristics of speech can vary significantly between speakers and even within the same speaker depending on emotional state and context.
To further emphasize the difference between idealized sine waves and real-world sound, Moustapha contrasts the pure sine wave produced by a tuning fork (an instrument specifically designed to produce a nearly pure tone with minimal harmonics) with the complex waveforms generated by various musical instruments playing the same note. The tuning fork’s waveform is a smooth, regular sine wave, devoid of significant overtones, while the instruments’ waveforms are jagged, irregular, and rich in harmonic content, reflecting the unique timbral characteristics of each instrument.
Harnessing the Power of Fourier Transform
To effectively analyze these complex sound signals and extract the individual frequencies and their amplitudes, Moustapha introduces the Fourier Transform. He credits Joseph Fourier, the renowned French mathematician and physicist (1768–1830), as the “father of signal theory” for his groundbreaking work in this area. Fourier’s mathematical insights revolutionized signal processing and have found applications in fields far beyond audio analysis, including image compression (e.g., JPEG), telecommunications, medical imaging (e.g., MRI), seismology, and even quantum mechanics.
The Fourier Transform is presented as a powerful mathematical tool that decomposes any complex, time-domain signal into a sum of simpler sine waves, each with its own unique frequency, amplitude, and phase. In essence, it performs a transformation of the signal from the time domain, where the signal is represented as a function of time (i.e., amplitude versus time), to the frequency domain, where the signal is represented as a function of frequency (i.e., amplitude versus frequency). This transformation allows us to see the frequency content of the signal, revealing which frequencies are present and how strong they are.
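Concretely, digital audio is sampled, so what actually gets computed is the discrete Fourier transform (DFT). In standard textbook notation (not specific to the talk), the DFT of $N$ samples $x_0, \dots, x_{N-1}$ is

$$X_k = \sum_{n=0}^{N-1} x_n \, e^{-2\pi i k n / N}, \qquad k = 0, 1, \dots, N-1,$$

where the magnitude $|X_k|$ measures how strongly the frequency $k \cdot f_s / N$ Hz (for sample rate $f_s$) is present in the signal.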
Moustapha provides a simplified explanation of how the Fourier Transform works conceptually. He first illustrates how it would analyze pure sine waves. If the input signal is a single sine wave, the Fourier Transform will precisely identify the frequency of that sine wave and its amplitude. The output in the frequency domain will be a spike or peak at that specific frequency, with the height of the spike corresponding to the amplitude (strength) of the sine wave.
He then emphasizes that the true power and utility of the Fourier Transform become apparent when analyzing complex signals that are the sum of multiple sine waves. In this case, the Fourier Transform will decompose the complex signal into its individual sine wave components, revealing the presence, amplitude, and phase of each frequency. This is precisely the nature of real-world sound, which, as previously discussed, is a mixture of many frequencies and harmonics. By applying the Fourier Transform to an audio signal, it becomes possible to determine the constituent frequencies and their relative strengths, providing valuable information for music analysis, audio processing, and, crucially, song identification as used by Shazam.
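As a minimal end-to-end sketch of that idea, the snippet below builds a signal from two known sine waves and recovers their frequencies with NumPy’s FFT (a fast implementation of the discrete transform above). The 10% peak threshold is an arbitrary choice that works for this clean, noise-free example.

```python
import numpy as np

SAMPLE_RATE = 44_100

# One second of audio: a 440 Hz tone mixed with a weaker 880 Hz overtone.
t = np.linspace(0.0, 1.0, SAMPLE_RATE, endpoint=False)
signal = 1.0 * np.sin(2 * np.pi * 440 * t) + 0.5 * np.sin(2 * np.pi * 880 * t)

# rfft gives the spectrum of a real-valued signal; rfftfreq maps bins to Hz.
spectrum = np.abs(np.fft.rfft(signal))
freqs = np.fft.rfftfreq(len(signal), d=1.0 / SAMPLE_RATE)

# The peaks of the spectrum reveal the constituent frequencies.
threshold = 0.1 * spectrum.max()
print(freqs[spectrum > threshold])  # [440. 880.]
```

The time-domain waveform looks like a single tangled curve, but the frequency domain cleanly separates it back into its two components, exactly the kind of decomposition Shazam relies on.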