A Case for Clarity In the Age of Algorithmic Injustice

By Jen Kagan

Illustrated by Hye Ryeong Shin

Issue 2

On October 16, 2017, the New York City Council heard testimony on Bill 1696, which would “require agencies that use algorithms or other automated processing methods that target services, impose penalties, or police persons to publish the source code used for such processing.”1

Both advocates and detractors of the bill knew the stakes were high: as the first city council in the country to confront the existence of algorithms and their impact on public life, New York City was setting a precedent that could have far-reaching implications. Public defenders, civil liberties and open data advocates, and privacy researchers filled the docket to push for the bill’s passage. Meanwhile, the city’s Department of Information Technology and Telecommunications (DoITT) urged caution, and the trade group Tech:NYC (whose paying members include Google and Facebook) suggested that “there could be better ways to address concerns underlying the proposed bill.”2 So many people wished to weigh in that the hearing had to be relocated to a larger space, and even then, it was standing room only. “This is the largest attendance a Technology Committee meeting has ever had,”3 council member James Vacca, who introduced the bill, gleefully noted.

So what were the implications of Bill 1696 exactly? If passed, 71 words would be appended to New York City’s existing open data law that would, in addition to requiring agencies to publish the source code of their automated processing methods, also require agencies to “permit a user to (i) submit data into such system for self-testing and (ii) receive the results of having such data processed by such system.” Agencies would have four months from the bill’s signing to comply.

Before They Called It ‘Algorithmic’

The road to algorithmic decision-making was paved by actuarial risk assessment and big data. In his 2007 book Against Prediction, attorney and Columbia University professor Bernard Harcourt examines the use of statistical risk assessments in the field of criminology. He notes that by the 1920s in the United States there was already “a thirst for prediction—a strong desire to place the study of social and legal behavior on scientific footing.” In order to satisfy this desire, the “prognostic score” emerged as a means to predict whether criminals would reoffend. By assigning risk values based on offenders’ mental and physical characteristics, criminologists developed a score they felt to be meaningful and true.5

“A certain euphoria surrounded the prediction project, reflecting a shared sense of progress, of science, of modernity,” Harcourt writes of criminal profiling in Massachusetts a hundred years ago.

More recently, the promise of Big Data seemed like it might serve the same purpose. Having more data, Big Data advocates claim, improves predictive accuracy. And that data-backed ‘accuracy’ grants us license to apply data-based scoring to new domains.

Today, algorithms dominate the predictive technology scene. Councilman Vacca wants to know how those algorithms, many of which are used to aid city agencies in their decision-making processes, work. His motivation comes in part from frustration with the shortcomings of the current system. “How does somebody get an apartment in public housing?” Vacca asked at the City Council hearing. “I’m told that it’s strictly done by computer…On what basis?”6 If the algorithms were working correctly, he implies, people who applied for public housing would be assigned to apartments near their families and doctors, high school students would not be assigned to their sixth-choice school, and fire department resources would be allocated more fairly. At the very least, Vacca hopes that people might have recourse when “some inhuman computer is spitting them out and telling them where to go.”7 The councilman has a name for this: transparency.

The Problem With ‘Transparency’

What kind of transparency might we get by asking government agencies to publish the source code of their automated processing methods?

In 2008, the Free Software Foundation, steward of the GNU8 General Public License (GPL), sued the maker of Linksys routers. Linksys used GPL-licensed software in its routers but hadn’t published the improvements it made to that software, thereby violating the terms of the license. Rather than fight the suit in court, Linksys’ parent company agreed to publish its source code and hire a free software9 compliance officer.10

I went to the Linksys website to see if I could find the source code they’d agreed to publish. Buried within Linksys’ support documentation is the GPL Code Center, a table of hardware model numbers with corresponding links to software files.11 For no reason in particular, I chose to download the 277 MB file bundle for model CM3024. The bundle contained a README (“Hitron GPL Compiling Guide”) with instructions like “How to build both firmware and toolchain,” “How to build & install toolchain alone,” “How to make image alone.” Aside from the README, the other words in the file bundle were meant for computers and didn’t give me, a human, a better understanding of how the router worked.

Publishing source code the way the GPL mandates is itself a type of transparency, but it’s not the kind that’s meaningful to the general public. The GPL primarily exists to enforce its own very specific philosophy of freedom,12 which holds that users have the freedom to run, copy, distribute, study, change, and improve the software. “Free software,” in other words, is a matter of liberty, not price. The license certainly doesn’t exist to convey the meaning of the software it covers.

In September of 2017, ProPublica filed a motion to unseal the source code of a DNA analysis algorithm used in thousands of court cases across the United States.13 One could presume the publication was similarly motivated: to reveal the significance of the code to humans. ProPublica argued that the design of the algorithm might have resulted in sending innocent people to prison. A month later, a federal judge granted ProPublica’s motion and the source code was released. As with the Linksys code, the release of the source code to the general public, though technically transparent, was still meaningless in its indecipherability. In practice, the implications of the DNA algorithm were conveyed in the form of a 60-page code review by a student pursuing a master’s degree in computer science.

Addressing the issue of algorithmic transparency recently on a panel, data scientist Cathy O’Neil noted, “I don’t think transparency per se is a very meaningful concept in this realm because, speaking as a data scientist, I can make something that’s technically transparent but totally meaningless.”14

The transparency that open source code provides is only meaningful when there are translators who can explain what the code does. Vacca’s bill, while a step in the right direction, remains incomplete in that it fails to propose a solution for demystifying the meaning of the code itself. In effect, the burden of deciphering the algorithm would still fall on the individuals it affects. Similarly, while ProPublica was able to get the source code published, it fell to the publication’s legal team to find experts who could decipher what they had gotten the court to unseal.

So how do we get corporations and government agencies to foot the bill for this work, as opposed to ‘outsourcing’ it to private individuals and investigative journalists?

Overlapping Triangles (Image by Hye Ryeong Shin)

Data Controllers

The EU’s General Data Protection Regulation (GDPR), which replaces a 1995 data protection initiative, was adopted in 2016 and will go into effect in May of this year. Written over the course of four years, its 11 chapters contain 99 articles that map out in great detail the digital rights of European Union citizens.15

In its final version, the GDPR states that large companies whose core activities include processing and monitoring personal data are required to hire a “data protection officer.”16 Data subjects (or those whose data is run through an algorithm) “should have the right not to be subject to a decision” arrived at by the automated processing of their data, and in the event that they are, they have “the right to obtain human intervention, to express his or her point of view, to obtain an explanation of the decision reached after such assessment and to challenge the decision.”17

The GDPR has also expanded its jurisdiction and increased fines for noncompliance. The GDPR now “applies to all companies processing the personal data of data subjects residing in the Union, regardless of the company’s location.”18 Those that don’t comply risk being fined €20,000,000 or 4% of their annual revenue, whichever is higher.
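The fine rule described above is a simple maximum of two quantities. As a minimal sketch (illustrative only; the regulation’s actual text refers to worldwide annual turnover, and caps vary by the type of infringement):

```python
def max_gdpr_fine(annual_revenue_eur: float) -> float:
    """Return the larger of the EUR 20 million floor and 4% of revenue."""
    return max(20_000_000, 0.04 * annual_revenue_eur)

# A company with EUR 10 billion in revenue faces up to EUR 400 million,
# while a small firm is still exposed to the EUR 20 million floor.
print(max_gdpr_fine(10_000_000_000))  # 400000000.0
print(max_gdpr_fine(1_000_000))       # 20000000
```

The “whichever is higher” clause means the penalty scales with a company’s size rather than topping out at a fixed amount, which is what gives the rule teeth against the largest data controllers.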

The language of the GDPR reflects a depth of understanding and proactive engagement with data and power dynamics that our City Council has not yet reached. To be fair, New York is “the first city and the first legislative body in [the United States] to take on this issue,” as Vacca points out. But in scaffolding legal frameworks here, the City Council might benefit from borrowing some of the language the authors of the GDPR developed over many years.

A New Bill

In the end, Councilmember Vacca’s tiny but mighty 71-word bill was not passed. Instead, the City Council passed a revised version of Bill 1696 that is more detailed, and also more measured. The revision calls for a task force appointed by the mayor. The task force will spend 18 months producing a report. The report will recommend procedures for determining whether algorithms disproportionately impact people based on their identities and, similar to the GDPR, will come up with ways for people affected by an automated decision system to access an explanation of the system’s inner workings.19

The requirement that agencies publish source code is notably missing from the amended bill, replaced with more meaningful measures of transparency that shift the burden of technical understanding and explanation from the general public back to the writers and wielders of algorithms themselves, or as the GDPR calls them, “data controllers.”

The proposed scope of the report is daunting. “I mean, if you use a computer program, you’re using an algorithm,” Craig Campbell of the city’s Department of Information Technology and Telecommunications sighed under questioning by Vacca.20 It remains to be seen how members of the task force will differentiate between the decision-making part of a computer program and the rest of its functions.

Perhaps most daunting is the prospect of eliciting participation from the city agencies most responsible for disproportionately harming New Yorkers, with and without the help of algorithms. The revised bill notes that compliance with the task force’s recommendations is not required, specifically when that compliance might interfere with a law enforcement investigation or “result in the disclosure of proprietary information.”21

Outcomes. And Then Algorithms

Maybe the way the term ‘algorithm’ is used in this conversation is a contemporary manifestation of the same fantasy about scientific progress that Harcourt describes in his book. When data controllers use the term, it creates opacity: it allows them to wriggle out of questions about responsibility by pointing to data and science and the objectivity of statistics, without ever having to acknowledge that they might not actually know what they’ve built. When data subjects use the term, it’s to speak in the terms by which they’ve been harmed.

Underlying this revised New York City Council bill is the belief that knowing what goes into the decision-making machine will make redress for harmful decisions more possible. That’s hopefully true, at least in part, and it’s why data subjects should insist on multiple transparencies, and push back when data controllers argue that algorithms are too complex for laypeople to understand or untangle, that it’s a security risk to have more eyes on them, or that, conveniently, algorithms are private property and therefore not available for public scrutiny.

It’s also important for programmers to understand and reckon with the fact that far from the well-lit offices where they write code, their same code might be used in a program that determines whether “New Yorkers go home to their families and communities or, instead, sit for days, weeks, or months on Rikers Island.”22

But it’s also true that developing language to regulate algorithms, algorithmic decision-making, data controllers, and data subjects is just one of many strategies for addressing what lies at the heart of the matter: inequality. Those facing injustice, algorithmic or otherwise, will need, as they have always needed, infrastructure for finding each other, sharing their stories, and developing their own demands on their own terms. We need strategies for fighting injustice that have nothing to do with the technical details of algorithms. Maybe that’s why organizer and filmmaker Astra Taylor said about organizing around algorithmic injustice: “I don’t think you should lead with the algorithms,” contesting the premise of this entire article. “Outcomes. And then algorithms.”23

Jen Kagan (NYU ITP ‘17) is a programmer, writer, and teacher who’s written code for Mozilla and p5.js. She is a current technology fellow at Coworker.org and a research fellow at NYU’s Interactive Telecommunications Program. Twitter: @jen_kajan | Website: jennnkagan.com/ | Email: kaganjd@gmail.com