Ringing, Pinging, and Dinging: A Theory of Audio Icons

Illustrated by Fanyi Pan

How can we better understand the audio interfaces that surround us?

[Illustration: abstract geometric shapes in different colors]

I vividly remember the momentous occasion when I received my first iPhone. I was in 7th grade, and it was a gift from my parents for my 12th birthday. Prior to my integration of the smartphone into a cyborg-level relationship of dependency, I used a flip phone, and eventually one of the highly coveted cellphones with a slide-out keyboard (these held a degree of cultural currency when I was in middle school). The mobile phones I had before my iPhone gave me sonic feedback with the push of every key; the beeps and vibrations became so tacit I barely noticed them. I took them as a thing-at-hand, unnoticed, uninterrogated. Until I got my iPhone.

My iPhone was noisy, too, even noisier than the phones I had before. It made clicking sounds when I typed, emitted that “bubbly” noise when I sent a message, and sent out the familiar tones of an analog phone when I pressed the flat keys to make a call. Then, suddenly, it stopped making sounds. My phone went silent, and it freaked me out. I assumed it was broken, until I consulted Google. There, I found out that there was actually a switch on the side of my phone that turned the entire OS sound on and off, and I must have flipped it by accident. I switched my phone back, and the sound I was so reliant on for direction through uncharted digital territory resumed. I had never given much consideration to audio cues before this instance, when I realized those cues actually played a significant role in how I participated and performed within an enclosed digital architecture.

Of course, I didn’t have the vernacular to describe interface metaphors at age 12. Examining the event retrospectively, though, the implementation of audio icons had a profound influence on my empirical assessment of digital materials. They directed my cognitive associations like a non-verbal language of sorts, communicating without enunciating(1), guiding without directly telling me this or that. One might argue that my concern over the absence of a sonic interface was simply the result of a more complete integration of aural information with my tacit understanding of how things work, that I responded in such a way because I was accustomed to hearing and seeing at the same time. Saying this, though, could be interpreted as grossly technologically deterministic, and, if further extended, might assume the dangerous behaviorist approach—an opinion that ignores the complex imbrication of human intention and material agency within sociotechnical systems, and questions the existence of free will. A good example of this is B.F. Skinner’s “superstition experiment” with a trained pigeon. The bird pushes a button not because it makes relational semiotic judgments about what will happen, but because it is hungry and accustomed to receiving food when it presses the button. It’s a pure stimulus response.(2)

[Illustration: abstract headphones]

In current academic conversations around sound and sonic materialism, or the physical and spatial qualities of sound, it has been widely posited that sound evades representation. Sound doesn’t abide by the signified/signifier relationship that prosaic language does, but follows a more poetic ontological(3) mode. Its expression is tightly coupled to the content itself. In other words, sound does not dichotomize between what is experienced subjectively as a felt duration and what that experience means, linguistically. My experience with my iPhone, though, has led me to believe otherwise. The digital sounds which have become deeply embedded in our quotidian lives signify a specific referent, and depend on that referent for coherence. Referents, for those who aren’t familiar with semiotics, are the objects or phenomena that a sign or symbol stands for. The relationship between the two is usually arbitrary.

The signal/referent relationship between a digital sound and its signified is often experienced simultaneously, as a shared impression amongst disparate bodies. Semiotics permits us to comprehend a panoply of heterogeneous phenomena the same way. What should be an experience of the singular, varying from individual to individual, coagulates into qualia(4), which Lily Chumley and Nicholas Harkness define as an experience of “things in the world,” a fundamental unit of aesthetic perception. Sound is part of that qualic system of valuation. The beauty (or horror, depending on your perspective) of Graphical User Interfaces (GUIs) is that they are semiotic. They exist and will never confront users with an “in-itself” ontology(5). In fact, the very material composition of digital sounds, in their discreteness, presupposes a simulative nature. These noises are deliberate and adhere to strict models based on an existing analog reality. In efforts to reproduce the same psychical response, we have to feel sound as an interior temporality, or the way we hear sound resonate inside of us as subjects, despite its exterior constitution.

The history of digital sounds began at Bell Labs in the late 1950s. Computers had been making noise as an epiphenomenon since the genesis of the ENIAC(6). But it was computer scientist Max Mathews who wrote the first actual music program, called MUSIC. MUSIC initially played only simple tones, but as Mathews continued to develop the software, he released different iterations, each built off of the previous version’s capacities. MUSIC II introduced four-part polyphony and ran on the largest mainframe computer available. MUSIC III made use of a new “unit generator” subroutine that required minimal input from the composer, making the whole process more accessible. MUSIC IV and MUSIC V further refined the program, and the technology for computer composition, both hardware and software, began to proliferate across various research centers and universities, including MIT, UC San Diego, and Stanford(7).

Despite these advances, the process of computer composition remained tedious and oftentimes grueling. From one account:

“A typical music research center of the day was built around a large mainframe computer. These were shared systems, with up to a dozen composers and researchers logged on at a time. (Small operations would share a computer with other departments, and the composers probably had to work in the middle of the night to get adequate machine time.) Creation of a piece was a lot of work. After days of text entry, the composer would ask the machine to compile the sounds onto computer tape or a hard drive. This might take hours. Then the file would be read off the computer tape through a digital to analog converter to recording tape. Only then would the composer hear the fruits of his labor.”(8)

It wasn’t until Mathews devised the hybrid system, allowing a computer to manipulate an analog synthesizer, that digital music became fathomable for a more mainstream public, as the analog synth was the most commonly sought-after mechanism for music production at the time. Coincidentally, Mathews discovered that the elemental circuitry of a sequencer was actually digital(9). A sequencer is the apparatus that allows one to record audio using a keyboard and play it back. But the essential difference between a sequencer and a basic audio recording is that a sequencer records sound as data. “Think of it as a word processor for music,” Phil Huston explains(10). What this means is that the way sequencers operate is necessarily fragmented and discrete. A sequencer represents a sound in various informational forms, breaking down a continuous sound wave into a sequence of digits. What has been posited as a signifying or a subjective (autonomous) quality of sonic materialism becomes questionable, because digital sound is built off of different types of acoustical and physical models.
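
To make the sequencer’s logic concrete, here is a minimal sketch, in Python, of how a continuous wave becomes a sequence of digits. The tone and sample rate are arbitrary choices of my own, not details from Huston’s account:

```python
import math

# A sequencer's digital core does not store the sound wave itself, but a
# discrete sequence of numbers sampled from it. Here a continuous 440 Hz
# sine tone is reduced to 16-bit integers at 8,000 samples per second.
SAMPLE_RATE = 8000   # samples per second (an arbitrary choice)
FREQUENCY = 440.0    # Hz, the continuous "analog" tone being measured
DURATION = 0.01      # seconds of sound to capture

samples = []
for n in range(int(SAMPLE_RATE * DURATION)):
    t = n / SAMPLE_RATE                      # the instant being measured
    amplitude = math.sin(2 * math.pi * FREQUENCY * t)
    samples.append(int(amplitude * 32767))   # quantize to a 16-bit integer

print(samples[:10])  # the wave, now nothing but digits
```

None of these numbers is sound; each is only a measurement that refers back to the analog wave it was taken from.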

Acoustical models are based on the sensation of hearing itself, on a person receiving information and integrating it into their subjective experience of the sound’s texture. Chris Chafe aptly calls this method “psychoacoustical.” Physical models focus on the “objective” materiality of sound (timbre, vibration, frequency, etc.), the very literal composition of a sound source. According to Chafe, this modeling technique is based on analyzing the “principles,” a set of patterned dynamics, and building from the granular up, adding new variations that seem significant(11). Though these models and diagrams seem to make no distinction between content and expression, we need to consider that they exist to signify or replicate an analog referent. In this case, the numbers themselves (which follow sets of parameters) are bound to something else. They are made from the particularities of a certain input, and they refer to this input through their very existence.
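
To illustrate what “building from the granular up” can look like, the sketch below implements the Karplus–Strong algorithm, a canonical physical model of a plucked string. It is my own example, not one of the models Chafe describes: a burst of noise stands in for the pluck, and a recirculating delay line stands in for the string.

```python
import random

SAMPLE_RATE = 44100
PITCH = 220.0                     # Hz of the simulated string
delay = int(SAMPLE_RATE / PITCH)  # the delay line's length sets the pitch

# The "pluck": the string starts as a burst of random noise.
buffer = [random.uniform(-1.0, 1.0) for _ in range(delay)]

output = []
for _ in range(SAMPLE_RATE):      # one second of sound
    first = buffer.pop(0)
    # Averaging adjacent samples acts as a lowpass filter, modeling how a
    # real string loses high-frequency energy as it rings out.
    buffer.append(0.996 * 0.5 * (first + buffer[0]))
    output.append(first)

# `output` now holds a decaying plucked-string tone, derived entirely from
# modeled physical principles rather than from a recording.
```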

We have made great advancements since the advent of digital music. But if we deconstruct the fundamental continuities behind the whole process of digital sound creation, the semiotics remain the same. The digital music of Mathews’ time existed for the purpose of replicating the experience of sound through informational representation. We think of OS sounds differently today, and although analog synth composition is still very fashionable, it is not what most people would imagine when the words “digital sound” are coupled together. One of the most ubiquitous tone series of the past three decades is ambient musician Brian Eno’s composition for the Windows 95 startup sound. This particular melody can be thought of as a “music logo” of sorts; it has come to signify something about the essence of Windows as a brand and communicates a distinct feeling meant to direct us to participate with technologies in a particular way.

[Illustration: abstract musical notes]

Since I began studying skeuomorphism for a thesis project, I have been fascinated with auditory icons and “earcons,” those noisy cues which communicate information about an OS object, event, or function. When we consider interfaces in the broader sociotechnical sense, signifying sounds are mostly limited to practical applications in the fields of UX design and human-computer interaction, as well as integration into vehicular atmospheres and medical equipment. The purpose of audio cues is to direct bodies and command a subject to behave in a certain way. We encounter these sounds every day in moments of interface, both digital and otherwise (i.e., when driving a car or riding the subway). They are commonly applied to enhance user reaction time and accuracy, learnability, and overall performance(12). While these aural messages are valuable tools for engagement with an operating system, they also illuminate much about analog materials themselves, adding a new dimension to “sonic materialism” once timbre, frequency, and resonance are coupled to a referent and derive their purpose and meaning from these parent forms.

It is important to distinguish between auditory icons and earcons, as their origins, both psychoacoustical and practical, are not the same. The term “earcon” was coined by computer scientist Meera Blattner and is used to describe manufactured sounds(13), new additions to the sonic lexicon, such as simple tones and beeping. Earcons are indirect audio cues used to communicate information about a certain interface object or event. Aesthetically, they play a role in reticulating the interface that is equivalent in significance to the visual designs themselves. Auditory icons, developed in 1986 by Bill Gaver(14), serve the same purpose but instead emulate familiar sounds we already have established associations with.

Most published work on auditory icons is found in the form of patents or review articles, so the field of aural interface interaction is largely undertheorized. Auditory icons were originally used in the field of informatics to supplement the desktop metaphor with all of its skeuomorphic buttons. The idea was to ensure digital technologies were as intuitive as possible by using discrete features to represent an analog reality(15), much like the application of a sequencer in analog synth composition. In the instance of UI, and digital sound in general, representation is indispensable. Interfaces cannot persist without signification, and they rely on interplay between aural and visual cues to foment universal accessibility(16). Usability/learnability is the most coveted and evaluated attribute of an audio icon. As a metric, user testing indicates how, when, and where certain icons may be properly implemented, and how successful they may be in providing direction through reference. The necessity of user testing itself proves that this genre of sound mandates reception by a listening subject, who synthesizes it individually and makes their own sense out of a material reality.

[Illustration: an abstract cello]

Gaver proposed two metrics for evaluating auditory icons. One surveys the icon’s “proximal stimulus qualities.” These are the material attributes of the sound, like frequency, duration, and intensity. The other judges “distal stimulus qualities,” which concern how accurately the icon’s materials represent the intended referent(17). Mapping is also essential for the production of auditory icons. And as a practice, it is exactly what it sounds like. Think of that Borges quote from “On Exactitude in Science”: “…In that Empire, the Art of Cartography attained such Perfection that the map of a single Province occupied the entirety of a City, and the map of the Empire, the entirety of a Province. In time, those Unconscionable Maps no longer satisfied, and the Cartographers Guilds struck a Map of the Empire whose size was that of the Empire, and which coincided point for point with it.”(18)

The idea is for mapping to appear natural: the sound an icon makes when you click on it should seem intuitive and reflect the physical qualities of the digital materials at hand. The more believable the mapping, the more closely coupled the sound to the analog referent. For example, a “heavier” object will emit a lower tone, while a lighter object corresponds to a higher frequency. Mapping accounts for a more total relationship between two qualitatively unrelated substances (and treats sound more as a substance than an event); it “maps” our existing concept structures onto a digital interface through simulated likeness. Gibson’s theory of affordances, which brought the ecological attitude to perception, espouses the idea that we cannot experience the world directly and thus grasp phenomena through signifying chains of representation(19). Mapping, a practice taken from visual studies, ensures that UI sounds are scaled to the size of the empire; they are composed according to the qualities of a referent and function solely as a synthetic surrogate.
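
As a hedged sketch of what such a mapping might look like in code, the function below maps a file’s size (a stand-in for “weight”) onto pitch. The bounds and the linear interpolation are illustrative guesses of my own, not a mapping drawn from Gaver:

```python
def map_weight_to_frequency(size_kb: float) -> float:
    """Heavier referents get lower frequencies, echoing how large
    physical objects resonate at lower pitches."""
    lightest, heaviest = 1.0, 10_000.0   # assumed bounds, in kilobytes
    high_hz, low_hz = 880.0, 110.0       # three octaves of tonal range
    size_kb = min(max(size_kb, lightest), heaviest)
    # As size grows toward `heaviest`, pitch falls toward `low_hz`.
    fraction = (size_kb - lightest) / (heaviest - lightest)
    return high_hz - fraction * (high_hz - low_hz)

print(map_weight_to_frequency(50))     # a small file: a high chirp
print(map_weight_to_frequency(9000))   # a large file: a low thud
```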

Gaver offers three different categories of auditory icons. Symbolic icons are the most abstract of the triad. They have a completely arbitrary relationship with their intended referent, and they hold only by repeated and routinized association with that referent. Metaphorical icons are not entirely arbitrary, but they simply gesture. They provide direction to the referent without offering a description of it. And they usually coexist with the referent in our actual sonic environment. Iconic/nomic icons physically and materially resemble the source of the referent sound, and they are cited as the strongest mappings, meaning they derive positive results from user testing. The choice of which audio icon to use in a specific instance is entirely dependent on the way the digital environment is arranged, and how closely that environment can be configured to activate analog associations.

[Illustration: abstract forms on a black background]

The goal in applying auditory cues to computing is to collapse the boundary between sight and sound, synthesizing a more total UI experience. Auditory icons function as an aural syntagma. Their purpose is to direct. Their materials do not operate independently of reception and cannot be excised from the signified/signifier algorithm. UI sonic environments are necessarily signifying. And material intensive quantities, or the attributes of the sound that can be represented numerically, serve a semiotic purpose. From Cabral and Remijn’s article, “Physical features of the sound(s) used in auditory icons, such as pitch, reverberation, volume, and ramping, can be manipulated to convey, for example, the perceived location, distance, and size of the referent.”(20) 
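
A speculative sketch of how those features might be bundled together; the class and the mappings below are hypothetical illustrations of the kind of manipulation Cabral and Remijn describe, not their method:

```python
from dataclasses import dataclass

@dataclass
class AuditoryIcon:
    pitch_hz: float    # lower pitch suggests a larger referent
    volume: float      # quieter suggests a referent farther away
    ramp_ms: float     # a slower onset suggests approach, not suddenness
    reverb_mix: float  # more reverberation suggests a larger, distant space

def icon_for(size: float, distance: float) -> AuditoryIcon:
    """Derive an icon's features from referent properties in [0.0, 1.0]."""
    return AuditoryIcon(
        pitch_hz=880.0 - 660.0 * size,    # big things sound low
        volume=1.0 - 0.8 * distance,      # far things sound quiet
        ramp_ms=5.0 + 200.0 * distance,   # far things fade in slowly
        reverb_mix=distance,              # far things sound roomy
    )

print(icon_for(size=0.9, distance=0.7))
```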

The sounds of our digital lives, and particularly the aforementioned processes of aural discretization, complicate sonic materialism as a content-blind approach. The mythology of sound as pure expression, without any underlying meaning, cannot hold for digital noises. Within the confines of a UI, and even in the case of ostensibly pure information (the sequencer’s role within a synthesizer), content and intention are essential. Conversion from the continuous to the discrete does not ignore the continuous just because the formal qualities of the information have changed. There is still a coherent sense across different means of expression, and a residual meaning structure. Numbers, in this instance, are still qualic, because the prior phenomenon is communicated through sequences of numbers. My reliance on the aural reassurance of my iPhone’s functionality came out of a need for sound to make sense of my digital surroundings, to align my experience with some recognizable mode of participation, ultimately enabled by the ways in which sound is configured materially.

Footnotes

1. I do not mean “enunciation” in the Deleuzean sense following assemblage theory, but rather the physical act of enunciating, as the voice does.

2. B.F. Skinner, “‘Superstition’ in the Pigeon,” Journal of Experimental Psychology 38, no. 2 (1948): 168–172.

3. Ontology is the structure of being, how entities exist phenomenally.

4. Lily Chumley and Nicholas Harkness, “Introduction: QUALIA,” Anthropological Theory 13, no. 1/2 (2013): 3–11.

5. “In-itself” meaning ontologically coherent without subjective synthesis or reception by a listener. GUIs are dependent on our perception to make sense.

6. The ballistics calculator used during the Second World War set a precedent for the gender dynamics behind computation (as well as the master/slave dialectic).

7. “Computer Music (So Far)—Short History of Computer Music,” University of California Santa Cruz, accessed March 11, 2020.

8. Ibid.

9. Ibid.

10. Phil Huston, “What does that sequencer actually do?,” accessed March 12, 2020.

11. Chris Chafe, “A Short History of Digital Sound Synthesis by Composers in the USA” (PhD diss., CCRMA, Stanford University, 2009), 1–9.

12. John Paulo Cabral and Gerard Bastiaan Remijn, “Auditory Icons: Design and Physical Characteristics,” Applied Ergonomics 78 (2019): 224–239.

13. Steve Draper, “HCI Lecture 12: Sound and User Interfaces,” accessed January 30, 2020.

14. Cabral and Remijn, “Auditory Icons,” 224.

15. Ibid., 224.

16. A nice fantasy, but part of the analytic tradition is bringing culture and social life back into the picture, so that nothing exists non-discursively.

17. Ibid., 225.

18. Jorge Luis Borges, Collected Fictions, trans. Andrew Hurley (New York: Penguin Random House, 1999), 213.

19. Cabral and Remijn, “Auditory Icons,” 225.

20. Ibid., 226.

Cecilia McLaren

Cecilia McLaren finished her undergraduate career at NYU Steinhardt’s Department of Media, Culture, and Communication this past year. She is writing and thinking about post-structuralism, interface semiotics, and affect theory, specifically the ways in which mediated environments prompt us to search for new relational modes, both interpersonal and intrapersonal.