Issue 2
What You See May Not Be There: The State of the Subject Amid Pervasive Computer Vision
Photography has, since its dawn, maintained a strong degree of truth-authority over other representational mediums. As philosopher Vilem Flusser1 puts it, the contents of an image appear “to be on the same level of reality as their significance,” and photographs are more readily accepted as an objective window. By their very existence, pictures claim to offer a trustworthy record of a subject being somewhere, doing something. The tenacity of this ‘truthfulness’ illusion is still on display throughout our transition into the digital age. But the ways in which it manifests are changing.
Today’s computer vision algorithms are complicating the nature of fidelity in the still and moving image. This is well demonstrated by a line of computer vision research referred to as predictive vision, concerned in part with “predicting plausible futures of static images.” Carl Vondrick and his colleagues, at the forefront of this discipline, created an AI software program called Tiny Video, which analyzes a still frame of video to predict what might happen next in the scene and turns its predictions into algorithmically generated (i.e., not ‘real’) video content. Trained on over 5,000 hours of footage downloaded from Flickr of newborn babies, people playing golf, and other moving images of relatively banal subject matter, the program generates short GIFs of pixel blobs that move like wiggling babies or walking figures against a grassy backdrop. The machine-hallucinated animations don’t yet look exactly realistic—in fact they are slightly disturbing to look at—but as the researchers state, “the motions are fairly reasonable” for these types of human gestures. Tiny Video points to the trajectory of computer vision advancements: programs increasingly able not only to process an image or video but to generate a realistic copy of its subject’s looks and movements—essentially a 2D or 3D avatar—based on speculation about their actions deduced from patchy data.
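To make the mechanism of such a system concrete, the sketch below (in PyTorch) shows how single-frame video prediction is typically framed: one still image is compressed into a latent description of the scene, a noise vector stands in for the uncertainty about which future will unfold, and a decoder hallucinates a short clip of frames. Every class name, layer size, and the frame count here is an invented assumption for illustration; this is not the published Tiny Video architecture, which its researchers trained adversarially.

```python
# Minimal sketch of single-frame video prediction: encode one still frame,
# then decode a short clip of "future" frames. Illustration only; all sizes
# and layers are assumptions, not the actual Tiny Video model.
import torch
import torch.nn as nn

class FutureClipGenerator(nn.Module):
    def __init__(self, frames=16, channels=3, size=64, latent=256):
        super().__init__()
        self.frames, self.channels, self.size = frames, channels, size
        # Encode the conditioning frame into a compact latent vector.
        self.encoder = nn.Sequential(
            nn.Conv2d(channels, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 4, stride=2, padding=1), nn.ReLU(),
            nn.Flatten(),
            nn.Linear(64 * (size // 4) ** 2, latent),
        )
        # Decode the latent (plus noise) into all future frames at once.
        self.decoder = nn.Linear(latent * 2, frames * channels * size * size)

    def forward(self, still_frame):
        z_img = self.encoder(still_frame)      # what the scene looks like
        z_noise = torch.randn_like(z_img)      # which plausible future to pick
        clip = self.decoder(torch.cat([z_img, z_noise], dim=1))
        return clip.view(-1, self.frames, self.channels, self.size, self.size)

# One 64x64 RGB frame in, sixteen hallucinated frames out.
frame = torch.randn(1, 3, 64, 64)
future = FutureClipGenerator()(frame)
print(future.shape)  # torch.Size([1, 16, 3, 64, 64])
```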
The discordance between Flusser’s observation and Tiny Video’s outputs captures the current state of the digital image. Our visual culture assumes the veracity of the photos and videos circulated within it. Meanwhile, the image today undergoes increasing layers of abstraction via computer vision algorithmic operations, with outputs based more on conjecture than on fact. While Tiny Video is an academic research experiment, when technologies of this sort reach a certain threshold of believability they are quickly adopted by the market and their use cases proliferate. The implications of this evolution of both technical prowess and market adoption rest predominantly with the subject—all of us featured in photos shared online or captured via any number of pervasive body or security cameras ‘in the wild’—whose physical and behavioral identities get co-opted and interpreted by these algorithmic methods.
These increased layers of abstraction create a distorted portrait of the subject that, because of its computational genesis, gets treated as ontological fact. Algorithms have a good chance of getting the subject wrong outright by mismatching a face to an identity. And they will invariably introduce their own guesses and biases when drawing more subjective inferences about one’s behavior, character, or future actions. The impact on the subject is inevitable—the job they get or don’t get, their insurance rates, or whether they are implicated in a crime will depend more and more on data collected via computer vision software and hardware. Not only are most of these practices exempt from privacy laws, but we don’t currently have a standard mode of accountability for the conclusions they draw.
Photography has long had the authority to appropriate a subject’s identity and affect their lives for years to come. In the case of a notorious 1957 photograph of desegregation at Little Rock Central High School, taken by Will Counts, Elizabeth Eckford, the African American student, and Hazel Bryan, the white antagonist with the hateful expression behind her, had their lives changed by the image for the half century that followed. Theorist Roland Barthes described photography as a lamination procedure, in which the subject and their 2D likeness get permanently attached to one another.2 Today, this lamination continues, but under more persistent and chaotic conditions. Anyone who has ever Google Image-searched themselves knows how surreptitiously digital images can travel across the internet to unexpected websites and servers without their knowledge or consent.
A picture’s susceptibility to being tampered with is also not a condition unique to AI-assisted automation. The analog photograph has a long history of being edited to leave things out, and of course frames of a video can easily be dropped to tell a different story. From the notorious 1922 film Nanook of the North—an early pioneer of the documentary form—to recent police body cam videos, that which is accepted as documentation may very well have been staged in the first place. However, computer vision goes a step, or multiple steps, further in the scope of its augmentations.
From facial recognition, to emotion detection, to predictive vision, the original capture is just a jumping-off point from which the subject’s identity gets appropriated, operated upon, and generated anew. Very often these algorithms are proprietary, in the service of industries that have a special interest in creating a colored-in, fully fleshed-out picture of the average citizen for the purposes of targeted advertising or law enforcement.
While in many cases computer vision applications operate in isolation from one another, it’s conceivable that they will be, if they aren’t already, combined in any number of ways that increase their guesswork and exacerbate misrepresentations of a subject. It’s as if your likeness had a clandestine Second Life, getting rendered and puppeted around via dubious corporate transactions.
Facial Recognition and Identity Verification
The first layer of abstraction involves matching a subject in an image to an identity. Facial recognition software is becoming almost as ubiquitous as cameras, which means a subject will be recognizable to systems’ proprietors nearly any time they leave their house, or even in their own bedrooms with their own devices. The process is also fraught with complex, overconfident mathematical models and biased or incomplete datasets.
Originating in the 1960s, facial recognition programs have evolved from passive scanners of front-facing headshots to involved algorithms that perform computational gymnastics. From attempting to match two photos taken from wildly different angles to literally filling in the missing pixels of occluded parts of a subject’s face, today’s operations call for significant guesswork on the part of the computer.
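Under the hood, most contemporary systems reduce this task to comparing fixed-length “embedding” vectors and thresholding the distance between them. The sketch below is a minimal illustration of that logic, assuming a hypothetical embed_face function in place of a trained network and an invented threshold value; the guesswork lives precisely in where that threshold gets set.

```python
# Illustrative sketch of embedding-based face matching. `embed_face` is a
# hypothetical stand-in for a trained face-recognition network that maps an
# aligned face crop to a fixed-length vector.
import numpy as np

def embed_face(image: np.ndarray) -> np.ndarray:
    """Placeholder: a real system would run a deep network here."""
    rng = np.random.default_rng(abs(hash(image.tobytes())) % (2**32))
    v = rng.normal(size=128)
    return v / np.linalg.norm(v)

def is_same_person(img_a, img_b, threshold=0.6):
    # Cosine distance between embeddings; the threshold is a tunable guess.
    # Too loose and strangers "match" (false positives); too strict and the
    # same face photographed from a new angle fails to match.
    a, b = embed_face(img_a), embed_face(img_b)
    distance = 1.0 - float(np.dot(a, b))
    return distance < threshold, distance

probe = np.random.rand(160, 160, 3)      # e.g., a frame of security footage
reference = np.random.rand(160, 160, 3)  # e.g., a driver's-license photo
match, dist = is_same_person(probe, reference)
print(match, round(dist, 3))
```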
A potential method for comprehensive facial recognition is the creation of a 3D model of a person’s bust to serve as a reference for identity verification. In 2014, Facebook revealed it employed such a procedure with its DeepFace algorithm. Though the company hasn’t disclosed use of this specific approach since, it recently acknowledged that it knows when you are present in a photo whether you’re tagged or not. But what about software that has access to far less photographic data, or even none at all? A recent study and online demo released by the University of Nottingham introduces an algorithm that extrapolates a 3D model from just a single 2D photograph. And as we see with Heather Dewey-Hagborg’s speculative project Stranger Visions, such a rendering can be envisioned from just a hair or other biomatter from which DNA can be extracted, requiring no original photo whatsoever. Regardless of the mode by which it was conceived, a 3D model of one’s head is often the result of hypothetical geometric information filled in to piece together a photorealistic avatar. While researchers are at work devising ways to identify subjects with limited information, industry is rife with efforts to capture more facial data.
As our daily lives increasingly take place in the presence of cameras, image tagging opportunities are plentiful. More and more, facial recognition capabilities are seamlessly enmeshed with security and body camera systems. Movidius, an artificial intelligence company specializing in machine vision, recently had its technology integrated into Hikvision security cameras to make video analytics possible in real time, locally on the device. Suppliers of police body cameras promise live facial recognition “to tell almost immediately if someone has an outstanding warrant against them.” And it’s only a matter of time before facial recognition systems are part and parcel of every retail experience. Walmart was recently granted a patent to integrate a custom facial tracking system at its registers. And Amazon opened its first Amazon Go store this January, where a tapestry of in-store cameras and real-time facial recognition technology takes the place of the checkout experience altogether. The opportunities for these systems to fail us escalate in sync with their deployment.
Far from a binary, objective practice, identity verification via facial recognition involves significant room for error. A 2016 story about a man from Denver who was twice wrongly implicated because of mismatched security footage reveals the extent to which systems that involve computer modeling can elicit false positives. Identifying suspects in photos is a practice known to be extremely difficult for humans, and it is not necessarily easier for computers.
The data sets used both to train an algorithm and to serve as a reference for an identity match are often incomplete and include disproportionate representations of a given population. As a reflection of the uneven makeup of their training sets, algorithms are known to misidentify African Americans at a higher rate than Caucasians. The FBI’s facial recognition database, which was compiled via questionable if not illegal means to include nearly half of unwitting American adults, was found to follow this trend of misidentification. But even when these systems are not outright amiss in naming the correct subject, facial recognition has nuanced and insidious overtones.
When we go online, we expect that our browsing behavior will be recorded, archived, and shared. When we travel with our phones, we generally expect our location data to be tracked. But we don’t expect our faces to be identified by countless inconspicuous imaging devices scattered throughout public spaces. And even if sophisticated facial recognition apparatuses were to correctly identify us at every capture point, they would still amass an incomplete, erroneous portrait of the individual, much as analysis of one’s browsing history inevitably does.
Emotion Detection and Analytics
Not only will these cameras recognize where and when a person is present, but they will also log the person’s emotions and supposed intentions at each capture. Emotion detection involves dramatic inferences drawn from a photo, with the aim of unveiling the inner life of a subject.
Like many other computer vision technologies, emotion detection relies on a model that is reductive, yet celebrated and normalized. The accepted standard for this software is the Facial Action Coding System, devised by psychologist Paul Ekman in 1965 to read human emotion via facial anatomy. It claims the entire human emotional range can be schematized into seven discrete expressions—happiness, sadness, surprise, fear, anger, disgust, contempt—that can be unveiled via examination of frontal snapshots. Ekman was commissioned by the US Department of Defense to develop this research and went on to apply it as a means of detecting deception, an application widely embraced by the law enforcement community.3 Despite its dark history and extreme simplification of complicated human qualities, the Facial Action Coding System forms the basis for most services out there.
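In practice, software built on this model reduces to a scoring function over facial measurements followed by a forced choice among the seven labels. The sketch below illustrates that reduction; the score_expressions function is a hypothetical placeholder for a trained model, not any vendor’s actual implementation.

```python
# Sketch of the reductive step at the heart of most emotion-detection APIs:
# whatever the upstream model computes from facial geometry, the output is
# forced into one of Ekman's seven categories.
import numpy as np

EKMAN_LABELS = ["happiness", "sadness", "surprise", "fear",
                "anger", "disgust", "contempt"]

def score_expressions(face_landmarks: np.ndarray) -> np.ndarray:
    """Placeholder for a model mapping facial measurements to 7 scores."""
    rng = np.random.default_rng(0)
    weights = rng.normal(size=(face_landmarks.size, len(EKMAN_LABELS)))
    logits = face_landmarks.ravel() @ weights   # stand-in for a trained model
    exp = np.exp(logits - logits.max())
    return exp / exp.sum()                      # softmax over the 7 emotions

def detect_emotion(face_landmarks):
    probs = score_expressions(face_landmarks)
    # A lifetime of interior experience, collapsed to an argmax.
    return EKMAN_LABELS[int(np.argmax(probs))], dict(zip(EKMAN_LABELS, probs.round(2)))

landmarks = np.random.rand(68, 2)  # e.g., 68 detected facial keypoints (x, y)
label, breakdown = detect_emotion(landmarks)
print(label, breakdown)
```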
Unlike facial identification, there is no definitive right or wrong when it comes to emotion detection. As Alessandro Vinciarelli, a researcher in the related field of social signal processing, explains, “nonverbal behavior is the physical, machine-detectable evidence of social and psychological phenomena that cannot be observed directly” but instead only inferred. Vinciarelli makes his inferences from physical measurements of people’s faces on camera and via other sensors. Mapping between anatomical features and subjective emotional or personality traits harkens back to the centuries-old practice of physiognomy, in which measurement instruments were used to assess one’s character. Physiognomy was long ago debunked as pseudoscience and criticized for its propagation of institutional racism. Long before the aid of computational devices, the practice involved bridging a significant logic gap between the physical and the psychological. The fact that a computer can scan photographic details at the pixel level and perform statistical operations does not mean it can shrink the logic gap inherent in this model. Yet the algorithms’ assumptions are upheld as credible data, and the scope of their applications is now distending.
With ubiquitous image capture come innumerable opportunities for emotion analytics to be put to work on snapshots of citizens. The facial recognition system Walmart intends to implement will scan shoppers’ faces for signs of unhappiness. The startup Faception targets law enforcement agencies, among other industries, claiming its personality-reading computer vision product can be used to “apprehend potential terrorists or criminals before they have the opportunity to do harm.” Unlike a fingerprint used to match evidence to an identity, this software looks to one’s imagined personality, as revealed by anatomical measurements in a photo, to predict someone’s likelihood of being a terrorist. A similar claim was made by researchers at Shanghai Jiao Tong University upon release of their study titled Automated Inference on Criminality using Face Images. The conjecture embraced by Faception and these researchers, and the discrimination that will inevitably result, is cause for great concern, especially because the arbitrary nature of the data provides no basis for recourse.
Over time, it’s likely that any one of the companies performing these analyses could begin to formulate an individualized psychometric signature based solely on a subject’s expressions in a set of photos. The burgeoning trend toward emotion and personality analytics can be seen in adjacent fields such as speech recognition and natural language processing, with tools like IBM Watson’s Personality Insights API. A subject’s supposed emotional data could be traded or combined in any number of ways. And because emotion analysis is based on “soft” data, such profiles will become increasingly speculative with each added data point, all the while being exempt from judgment as to their accuracy or equity.
Predictive Vision
Finally, as we see with the nascent stages of Tiny Video, any snap of a subject can eventually be used to speculate about their future actions. An algorithm capable of generating video scenes of things that never happened invites exploitation by hackers and purveyors of misinformation at the subject’s expense.
Carl Vondrick of Tiny Video also worked on training algorithms to anticipate human gestures. His system watched hundreds of hours of television shows like The Office and looked for innocuous, easy-to-spot gestures like a handshake, hug, or high five. It was eventually given a single frame from which to predict which gesture would occur within the next five seconds, doing so with an accuracy rate about half that of a human. In isolation, these computer-generated forecasts indicate the increasing ability of AI systems to read and understand human behavior. They also come to bear at a time when predictive vision could very well become the mantra of police forces and data scientists already captivated by the promise of prediction. Algorithms similar to Vondrick’s could be integrated into police body cameras to analyze gestures in real time and claim to prevent crimes before they happen. The room for error in these predictions is vast because of the speculative premises on which they are founded—the future is difficult to prove.
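Stripped of its training machinery, the anticipation task amounts to a classifier that scores a single frame against a handful of possible upcoming gestures. The sketch below, with invented layers and an invented label set, shows the general shape of such a predictor and why its output is a plausibility score rather than a record of fact.

```python
# Sketch of anticipating an action from one frame: a classifier over "what
# happens in the next five seconds." Architecture and labels are illustrative
# assumptions, not Vondrick's published anticipation model.
import torch
import torch.nn as nn

ACTIONS = ["handshake", "hug", "high_five", "none"]

class ActionAnticipator(nn.Module):
    def __init__(self, num_actions=len(ACTIONS)):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        self.classifier = nn.Linear(32, num_actions)

    def forward(self, frame):
        # The model never sees the future; it only scores how plausible
        # each upcoming gesture looks given the present moment.
        return self.classifier(self.features(frame)).softmax(dim=1)

frame = torch.randn(1, 3, 224, 224)  # one frame, e.g., from a sitcom episode
probs = ActionAnticipator()(frame)[0]
for action, p in zip(ACTIONS, probs.tolist()):
    print(f"{action:>10}: {p:.2f}")
```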
A New Model
Regardless of where they stand on the spectrum of abstraction, computer vision algorithms, because of the way they are wielded, often work against the subject in the frame. Facial recognition and identity verification programs, while operating with a binary relationship to accuracy—a subject is either a certain person or someone else—often get the wrong answer, and the stakes can include prison time for a misidentified subject. And even for subjects who won’t encounter such extreme circumstances, the notion that one’s face is captured and identified perpetually while moving through space is a new, uncomfortable normal.
Meanwhile, algorithms that make more subtle inferences, like extracting emotion from a photo or anticipating what a subject will do next, are exempt from judgment over accuracy because their models and inputs are arbitrary. As Carl Vondrick describes Tiny Video’s results: they are plausible.
Computer vision products and services are part of a market-driven system, what the Economist calls the Facial-Industrial Complex. The computer vision market is expected to grow almost 650% between 2015 and 2022. The outputs of these omnipresent systems—their presumptions about a subject’s most intimate proclivities—are the means of their transactions. An industry hungry to grow has no incentive to pause and consider when data stops being synonymous with truth.
Since computer vision’s dawn over a half century ago, we’ve seen a significant shift in the function of a computer—from a machine once employed as a superfast calculator to an interpreter of nuanced, subjective information about the way humans behave in different situations. Once on the hook for numeric precision and accuracy, computer programs are entering the realm of the humanities, their metric moving from binary certainty to soft plausibility. And without criteria to assess them for accuracy or fairness, and in the absence of any agency or even awareness on the part of the subject, the algorithms’ deductions can be wielded to tell any story their proprietors want, with no consequence. Such a scenario calls for more thoughtful discourse and regulation to protect the subject’s rights to their own representation.