Racial and Gender Bias in Commercial AI Products

Weston Montgomery
Mar 5, 2021 · 15 min read

As technology continues to advance, questions of ethics become increasingly prevalent. Over the past few years, researchers have analyzed the social biases found in commercial AI products. Buolamwini (2018) found that facial recognition software from companies like Microsoft and IBM had gender classification error rates that were 20.8 to 30.4% greater for darker-skinned women than for lighter-skinned men. Gender stereotypes are also apparent in publicly available knowledge graphs: researchers from Amazon Science found that the Wikidata and Freebase knowledge graphs include biases which perpetuate gender-profession stereotypes. Concerns about the racial biases in commercial AI products have led to temporary and permanent bans on the sale of facial recognition software to police by major tech companies. In addition, many state and local governments are moving away from gang databases and predictive policing software.

This paper will discuss the social biases in AI and knowledge graph software, the many improvements companies have made, and the debate over whether law enforcement should be using facial recognition and predictive policing in the first place.

Buolamwini’s research at MIT

The first major paper in this field was published in 2018 by MIT researcher Joy Buolamwini (Figure 1) and Microsoft researcher Timnit Gebru. During her undergraduate studies, Buolamwini noticed that facial recognition software struggled to detect her face. It did, however, detect her lighter-skinned colleagues more easily, and it was only able to detect her face when she wore a white mask (Figure 1). After making this observation, she wrote her master's thesis on accuracy disparities in commercial facial recognition software across genders and skin types.

Figure 1: MIT Researcher Joy Buolamwini holding a white mask [1]

In the 2018 paper, "Gender Shades", the researchers created a benchmark dataset called the Pilot Parliaments Benchmark (PPB) (Figure 2) to measure this accuracy disparity. The PPB comprises 1,270 images of parliamentary members from three African and three European countries. These images were manually labeled by binary gender (male/female) and binary phenotype (darker/lighter skin). The binary gender labels come from publicly available government information, while the binary phenotype was determined using the dermatologist-approved Fitzpatrick skin-type scale: Fitzpatrick types I, II, and III were classified as "lighter skin", while types IV, V, and VI were classified as "darker skin".

Figure 2. Pilot Parliaments Benchmark (PPB) [2]
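
As a rough illustration of the labeling scheme described above, here is a minimal sketch that bins Fitzpatrick skin types into the PPB's binary phenotype. The field names and the example record are hypothetical; the actual PPB labels were assigned manually by the researchers.

```python
# Hypothetical sketch of the PPB-style binary labeling described above.
# Field names and the example record are illustrative, not the actual dataset.

LIGHTER_TYPES = {"I", "II", "III"}   # Fitzpatrick types binned as "lighter"
DARKER_TYPES = {"IV", "V", "VI"}     # Fitzpatrick types binned as "darker"

def binary_skin_type(fitzpatrick_type: str) -> str:
    """Map a Fitzpatrick skin type (I-VI) to the binary phenotype label."""
    if fitzpatrick_type in LIGHTER_TYPES:
        return "lighter"
    if fitzpatrick_type in DARKER_TYPES:
        return "darker"
    raise ValueError(f"Unknown Fitzpatrick type: {fitzpatrick_type}")

# Example: a parliamentarian record labeled with binary gender and phenotype
record = {"country": "Senegal", "gender": "female", "fitzpatrick": "V"}
record["skin_type"] = binary_skin_type(record["fitzpatrick"])
print(record)
```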

The PPB was used to perform an external audit of the gender classification feature in commercial software from Microsoft, Face++, and IBM. Overall, the three classifiers were quite accurate (Figure 3), with overall accuracies between 87.9% and 93.7%. After blocking the PPB into four groups (darker males, darker females, lighter males, and lighter females), however, the researchers found a clear disparity in the classifiers' error rates (Figure 4). The error rate for lighter males was as low as 0% (Microsoft), while the error rate for darker females was as high as 34.7% (IBM). The researchers attribute this discrepancy largely to the two most popular datasets previously used by tech companies to benchmark facial recognition algorithms, IJB-A and Adience, which are composed mostly of lighter-skinned faces: 79.6% for IJB-A and 86.2% for Adience (Buolamwini, 2018).

Figure 3. Overall gender classifier accuracy [2]

Figure 4. Blocked gender classifier accuracy [2]
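
To make the "blocking" concrete, the sketch below computes an overall error rate and the four intersectional error rates from a handful of made-up classifier results; the records are placeholders, not the actual audit data.

```python
# Minimal sketch: overall vs. intersectional error rates for a gender classifier.
# The records below are made-up placeholders, not the actual PPB audit results.
from collections import defaultdict

# Each record: (true_gender, predicted_gender, skin_type)
results = [
    ("female", "male", "darker"),
    ("female", "female", "darker"),
    ("male", "male", "darker"),
    ("female", "female", "lighter"),
    ("male", "male", "lighter"),
    ("male", "male", "lighter"),
]

def error_rate(rows):
    return sum(true != pred for true, pred, _ in rows) / len(rows)

print(f"Overall error rate: {error_rate(results):.1%}")

# Block the results into the four intersectional groups used in Gender Shades
groups = defaultdict(list)
for true, pred, skin in results:
    groups[(skin, true)].append((true, pred, skin))

for (skin, gender), rows in sorted(groups.items()):
    print(f"{skin} {gender}s: error rate {error_rate(rows):.1%}")
```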

Fifteen months after "Gender Shades" was published, Buolamwini re-audited these error rates. Because the PPB dataset is publicly available, the three companies were able to improve their software quickly. In just over a year, Face++ improved its gender classification accuracy for darker females by 30.4 percentage points, Microsoft by 19.28, and IBM by 17.73 (Figure 5).

Figure 5. PPB error rate comparison between May 2017 and August 2018 [3]

Buolamwini noted that Face++ is headquartered in China and may have used a primarily Asian training dataset for its software prior to 2017. Diversifying training datasets and benchmarks was a key factor in the improvements made by all three companies (Buolamwini, 2019).

NIST audit

In 2019, the National Institute of Standards and Technology (NIST) published an audit of facial recognition software used by government agencies. The audit referenced the two papers previously published by Buolamwini, but NIST took a different approach to testing vendors' software. NIST used four large datasets: domestic mugshots, application photographs, visa photographs, and border-crossing photographs. These datasets were used "to process a total of 18.27 million images of 8.49 million people through 189 commercial algorithms from 99 developers". NIST investigated demographic effects based on sex, age, and race or country of birth.

Figure 6. NIST identification applications [4]

Before discussing NIST's findings, it is important to distinguish between Type I and Type II facial recognition errors. A Type I error (a false positive) occurs when the facial recognition system incorrectly matches a person's face to an entry in the database. A Type II error (a false negative) occurs when the system fails to find someone in the database even though they are in it. NIST found that Type I error rates vary by factors of 10 to over 100 across demographics, while Type II error rates vary only by factors below 3. American Indians had the highest false positive rates on domestic law enforcement images, and African American and Asian populations also had elevated Type I error rates. Women tended to have higher false positive rates than men, as did elderly people and children (Grother, 2019).
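
The sketch below tallies both error types per demographic group from similarity scores and a fixed decision threshold, under simplified one-to-one matching assumptions; the scores, threshold, and group names are invented for illustration and are not NIST data.

```python
# Illustrative sketch: Type I (false positive) and Type II (false negative)
# rates for a face matcher at a fixed similarity threshold.
# Scores, threshold, and demographic groups here are invented, not NIST data.

THRESHOLD = 0.80  # hypothetical decision threshold

# Each trial: (similarity_score, same_person, demographic_group)
trials = [
    (0.91, True, "group_a"), (0.62, True, "group_a"), (0.85, False, "group_a"),
    (0.95, True, "group_b"), (0.40, False, "group_b"), (0.77, True, "group_b"),
]

def rates(rows, threshold=THRESHOLD):
    impostors = [s for s, same, _ in rows if not same]  # different people
    genuine = [s for s, same, _ in rows if same]        # same person
    fpr = sum(s >= threshold for s in impostors) / len(impostors)  # Type I
    fnr = sum(s < threshold for s in genuine) / len(genuine)       # Type II
    return fpr, fnr

for group in ("group_a", "group_b"):
    rows = [t for t in trials if t[2] == group]
    fpr, fnr = rates(rows)
    print(f"{group}: false positive rate {fpr:.0%}, false negative rate {fnr:.0%}")
```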

Gender and racial stereotypes found in knowledge graphs

Research on racial and gender bias in facial recognition software has recently made headlines in mainstream news and media. National Public Radio (NPR) released an episode of its science podcast, Short Wave, on February 17, 2021, that heavily referenced Buolamwini's and NIST's research (Kwong, Sofia, & Hanson, 2021). The Public Broadcasting Service (PBS) is also premiering a documentary titled Coded Bias on March 22nd that follows Buolamwini's work (Coded Bias, 2021). In comparison, research on gender and racial stereotypes in knowledge graph embeddings is newer and far less mainstream. In November 2020, Amazon Science published the first study on this topic.

Figure 7. Visual representation of knowledge graph “triple” [5]

Amazon Science conducted its study on the Wikidata and Freebase knowledge graphs. Figure 7 shows how knowledge graphs store information in "triples" consisting of a left entity, a relation, and a right entity. Figure 8 represents the left entities with the blue dots at the bottom, the relations with the orange dots, and the right entities with the blue dots at the top. The researchers found that as the gender relation shifts from female to male, the profession relation shifts from nurse to doctor.

Figure 8. 2D representation of Amazon’s method for measuring bias [5]
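
To make the triple structure in Figure 7 concrete, the sketch below stores two triples and scores them with a TransE-style embedding, where a triple's plausibility corresponds to the distance between (left entity + relation) and the right entity. The embeddings are random placeholders, and TransE is just one common scoring function, not necessarily the model used in the Amazon study.

```python
# Sketch of knowledge-graph triples and a TransE-style score.
# Embeddings are random placeholders; TransE is one common scoring function,
# not necessarily the exact embedding model used in the Amazon study.
import numpy as np

triples = [
    ("marie_curie", "has_profession", "physicist"),
    ("marie_curie", "has_gender", "female"),
]

rng = np.random.default_rng(0)
dim = 16
entities = {"marie_curie", "physicist", "female"}
relations = {"has_profession", "has_gender"}
ent_emb = {e: rng.normal(size=dim) for e in entities}
rel_emb = {r: rng.normal(size=dim) for r in relations}

def transe_score(left, relation, right):
    """Lower distance means the triple is considered more plausible."""
    return np.linalg.norm(ent_emb[left] + rel_emb[relation] - ent_emb[right])

for left, relation, right in triples:
    print(left, relation, right, "->", round(transe_score(left, relation, right), 3))
```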

Amazon's researchers propose two approaches for "debiasing" knowledge graph embeddings. The first is to modify the embedding of person1 so that gender predictions become impossible. They achieve this by using Kullback-Leibler (KL) divergence to measure how far the embedding's predictions are from target distributions (Figure 9); a toy sketch of this idea appears after Figure 10 below. The researchers note this method on its own is flawed because it prevents the model from predicting "noncontroversial" triples; for example, it might be beneficial for the embedding to reflect that nuns are more likely to be female. To factor these "noncontroversial" relations back into the model, the researchers strategically add gender attributes back into the left entity (Figure 10). After debiasing their model with these two methods, Amazon Science measured gender bias by comparing accuracy across male and female entities. They found that applying their debiasing method reduced the model's gender bias score (lower is better) from 2.79 to 0.19 (Fisher, 2020).

Figure 9. Debiasing approach using Kullback-Leibler (KL) divergence [5]

Figure 10. Reintroduction of non-controversial gender attribute after debiasing [5]
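
For intuition about the first step, here is a minimal sketch, assuming a toy two-class gender distribution and a uniform target, of how a KL-divergence penalty can flag an embedding whose predictions reveal gender. The numbers and the uniform target are illustrative assumptions, not Amazon's actual training objective.

```python
# Toy illustration of the KL-divergence debiasing idea described above:
# penalize the model when its predicted gender distribution for a person
# entity drifts away from a uniform (uninformative) target distribution.
# Numbers are invented; this is not Amazon's actual training objective.
import numpy as np

def kl_divergence(p, q):
    """KL(p || q) for discrete distributions given as arrays of probabilities."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return float(np.sum(p * np.log(p / q)))

# Model's predicted P(gender | person1 embedding) before debiasing
predicted = [0.9, 0.1]   # heavily skewed toward one gender
target = [0.5, 0.5]      # uniform target: the embedding should not reveal gender

penalty = kl_divergence(predicted, target)
print(f"Debiasing penalty (KL to uniform): {penalty:.3f}")  # larger = more biased

# After debiasing, the prediction approaches the target and the penalty approaches 0
print(f"Penalty for a debiased prediction: {kl_divergence([0.52, 0.48], target):.3f}")
```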

Racial bias in police technology

At first glance, the debiasing of facial recognition software and knowledge graphs may give the impression that society's use of this technology is moving in the right direction. When it comes to law enforcement's adoption of these tools, however, activists argue that simply debiasing the algorithms is not enough. Kade Crockford, an MIT fellow and director of the Technology for Liberty Program at the ACLU of Massachusetts, argues that "even if the algorithms are equally accurate across race, and even if the government uses driver's license databases instead of mugshot systems, government use of face surveillance technology will still be racist. That's because the entire system is racist." Crockford compares today's facial recognition software to the 18th-century "lantern laws" of New York City, which required Black, Indigenous, and mixed-race people to carry lanterns when walking the streets after sunset without White company (Crockford, 2020). Government surveillance has historically disenfranchised Black, Indigenous, and people of color (BIPOC), and this trend continues with facial recognition software.

In June of 2020, major tech companies began to pull back from selling facial recognition APIs to law enforcement. IBM stopped building and selling facial recognition software (O'Brien, 2020). Amazon paused police use of its facial recognition software (Allyn, 2020). Microsoft announced it would not sell facial recognition software to police (Greene, 2020). Critics claim these companies are stepping back from facial recognition sales so that they can aid Congress in drafting legislation to better regulate the technology; it would be a bad financial move to keep investing in technology that may need to change drastically once new laws are implemented.

Other major points of concern are the use of state gang databases and predictive policing software. California has a statewide database called CalGang which provides law enforcement with gang-related intelligence. In 2016, the California State Auditor released a report titled "The CalGang Criminal Intelligence System: As the Result of Its Weak Oversight Structure, It Contains Questionable Information That May Violate Individuals' Privacy Rights". The report found the database lacked the oversight needed to ensure the accuracy of its data. Police departments in Los Angeles and Santa Ana "failed to provide proper notification for more than 70 percent of the 129 juvenile records [they] reviewed". CalGang also failed to adhere to federal regulations aimed at protecting criminal intelligence information. One of the most shocking findings was that 42 individuals in the CalGang system were less than one year old; 28 of them were entered for "admitting to being gang members". When the audit was published, CalGang's entries were 65.9% Latinx, 20.5% Black, 93.1% male, and 57.3% aged 18-30 years. Being entered into the system can affect employment and other opportunities. For years, people feared that young Black and Brown men were having their freedom and opportunities taken away by "the system"; the state's audit only confirmed what many had long assumed (Howle, 2016).

Understandably, this news angered many Californians. In early 2019, the famous Los Angeles rapper Nipsey Hussle requested a meeting with Los Angeles Police Department (LAPD) leadership to discuss CalGang. The LAPD delayed the meeting to April 1st due to concerns about Nipsey's gang affiliation. Tragically, Nipsey was shot and killed the day before the scheduled meeting. Kerry Lathan, who was visiting Nipsey at the time of the shooting, was caught in the crossfire and injured. Kerry was out on parole after a 26-year sentence, and he was interviewed twice at the hospital after the shooting. The California Department of Corrections and Rehabilitation considered his visit with Nipsey a parole violation. After media attention and a petition with 20,000 signatures, Kerry was released from custody after 12 days. "In 2017, more than a third of parolees locked up in California were there because of a technical parole violation, not for committing a crime" (Madden & Carmichael, 2020). Kerry would have added to this unfortunate statistic had he not been in the presence of a local hero. This story is further evidence of the repercussions that can follow from statewide gang databases.

In July 2020, California's attorney general restricted the use of LAPD-generated CalGang records statewide. CalGang contains nearly 80,000 suspected gang members, and LAPD data accounted for about 25% of CalGang records (Becerra, 2020). In 2010, a UCLA professor and the LAPD worked together to develop PredPol, one of the most widely used predictive policing tools in the country, which relies on a machine learning algorithm. In April 2020, the LAPD terminated its use of PredPol (Miller, 2020). This termination is historic because the LAPD helped develop PredPol a decade earlier.

Figure 11. “Heat map” generated by PredPol showing areas with a high predicted probability of crime [6]

In June 2020, Santa Cruz became the first U.S. city to ban predictive policing (Asher-Schapiro, 2020). This is a watershed moment because Santa Cruz is where PredPol Inc. has its headquarters.

Future Work and Conclusions

Buolamwini's and NIST's research makes it clear that using training data and benchmarks representative of the user base is critical for any type of software. Although the PPB is a great first step, NIST's findings show that other groups are still being overlooked by facial recognition software. A more robust benchmark with a diverse set of faces spanning race, ethnicity, nationality, gender, disability, weight, and age should be used to test the accuracy of commercial products.

Even if we reach the ultimate goal of eliminating bias in facial recognition software, we should continue to prohibit its sale to and use by law enforcement. These tools have historically been abused by police to surveil BIPOC communities in the search for criminals and illegal immigrants. This form of targeted surveillance is a detriment to the freedom promised by our nation. We must also limit the sale of this technology to other countries that seek to use it in their militaries or police forces.

In 2018, researchers from Harvard, the University of Southern California (USC), and the University of California, Los Angeles (UCLA) presented their work at the AAAI/ACM Conference on Artificial Intelligence, Ethics, and Society. The presenters described partially generative neural networks they had developed to aid law enforcement. Their prototype used four inputs: primary weapon, number of suspects, neighborhood, and location. From these inputs, the model classified crimes as gang-related or non-gang-related (Seo et al., 2018). Like PredPol and other commercially available predictive policing software, this technology does not explicitly factor in race as an input. Regardless, it could easily be misused by law enforcement; to quote Kade Crockford again, "the entire system is racist" (Crockford, 2020).

Attendees at the conference seemed to agree. After the researchers presented their findings, audience members questioned whether the team knew if the training data were biased and what would happen if individuals were mislabeled as gang members. As discussed earlier, Buolamwini's research found that biased training data can lead to software that misgenders minorities; when it comes to policing software, biased training data can be a matter of life and death for BIPOC people. The partially generative neural network was trained on LAPD data from 2014 to 2016 covering over 50,000 gang-related and non-gang-related crimes. This is the same LAPD data that Attorney General Becerra restricted in July 2020 due to its heavy bias.

Sadly, the researchers were unable to address the audience's ethical concerns during the Q&A portion of their presentation. One attendee rhetorically asked whether the researchers were "developing algorithms that would help heavily patrolled communities predict police raids". A researcher replied that these are the "sort of ethical questions that I don't know how to answer appropriately" as just a "researcher". The audience member who asked the question then quoted a song lyric, "Once the rockets are up, who cares where they come down?", and angrily walked out of the room (Hutson, 2018).

A senior author of the paper was a USC computer scientist who is now the Director of AI for Social Good at Google Research India and the Director of the Center for Research in Computation and Society at Harvard University. These accolades suggest the researchers likely had good intentions. Even so, the state audit that found CalGang's data to be highly inaccurate was published in 2016, yet the researchers still trained their model on this biased data, published the results, and presented the findings at an AI ethics conference. In addition, a co-author of the paper is the UCLA professor who helped create PredPol, the controversial predictive policing software that many police departments have since abandoned or banned. This anecdote should serve as a warning: even with the best intentions, we as scientists and engineers must always question the data we are given and consider the potential repercussions of the technology we are developing.

References

Allyn, B. (2020, June 10). Amazon Halts Police Use Of Its Facial Recognition Technology. Retrieved January 13, 2021, from https://www.npr.org/2020/06/10/874418013/amazon-halts-police-use-of-its-facial-recognition-technology​

Asher-Schapiro, A. (2020, June 17). In a U.S. first, California city set to ban predictive policing. Retrieved January 13, 2021, from https://www.reuters.com/article/us-usa-police-tech-trfn/in-a-u-s-first-california-city-set-to-ban-predictive-policing-idUSKBN23O31A​

Buolamwini, J., & Gebru, T. (2018). Gender Shades: Intersectional Accuracy Disparities in Commercial Gender Classification (S. A. Friedler & C. Wilson, Eds.) [Scholarly project]. In Gender Shades. Retrieved January 13, 2021, from http://gendershades.org/overview.html

Buolamwini, J., & Raji, I. D. (2019). Actionable Auditing: Investigating the Impact of Publicly Naming Biased Performance Results of Commercial AI Products [Scholarly project]. In Algorithmic Justice League. Retrieved January 13, 2021, from https://www.ajl.org/library/research​

California State Auditor, & Howle, E. M. (2016). The CalGang Criminal Intelligence System: As the Result of Its Weak Oversight Structure, It Contains Questionable Information That May Violate Individuals' Privacy Rights. Retrieved from https://auditor.ca.gov/reports/2015-130/index.html

Coded bias. (n.d.). Retrieved February 19, 2021, from https://www.pbs.org/independentlens/films/coded-bias/

Crockford, K. (2020, June 19). How is face recognition surveillance technology racist? Retrieved February 18, 2021, from https://aclu-or.org/en/news/how-face-recognition-surveillance-technology-racist

Electronic Frontier Foundation. (2020, August 25). Face Recognition. Retrieved January 13, 2021, from https://www.eff.org/pages/face-recognition

Fisher, J., Palfrey, D., Christodoulopoulos, C., & Mittal, A. (2020). Measuring social bias in knowledge graph embeddings. Retrieved January 13, 2021, from https://www.amazon.science/publications/measuring-social-bias-in-knowledge-graph-embeddings​

Fisher, J. (2020, November 25). Mitigating social bias in knowledge graph embeddings. Retrieved February 18, 2021, from https://www.amazon.science/blog/mitigating-social-bias-in-knowledge-graph-embeddings

Greene, J. (2020, June 11). Microsoft won’t sell police its facial-recognition technology, following similar moves by Amazon and IBM. Retrieved January 13, 2021, from https://www.washingtonpost.com/technology/2020/06/11/microsoft-facial-recognition/​

Grother, P., Ngan, M., & Hanaoka, K. (2019, December). Face Recognition Vendor Test (FRVT) Part 3: Demographic Effects. Retrieved January 13, 2021, from https://doi.org/10.6028/NIST.IR.8280​

Hutson, M. (2018, February 28). Artificial intelligence could identify gang crimes-and ignite an ethical firestorm. Retrieved January 13, 2021, from https://www.sciencemag.org/news/2018/02/artificial-intelligence-could-identify-gang-crimes-and-ignite-ethical-firestorm​

Kwong, E., Sofia, M., & Hanson, B. (Eds.). (2021, February 18). Why tech companies are limiting police use of facial recognition. Retrieved February 19, 2021, from https://www.npr.org/2021/02/17/968710172/why-tech-companies-are-limiting-police-use-of-facial-recognition

Madden, S., & Carmichael, R. (2020, December 12). Caught In The System. NPR. https://www.npr.org/2020/12/12/945454343/caught-in-the-system-nipsey-hussle-lapd-affiliation.

Miller, L. (2020, April 21). LAPD will end controversial program that aimed to predict where crimes would occur. Retrieved January 13, 2021, from https://www.latimes.com/california/story/2020-04-21/lapd-ends-predictive-policing-program​

O'Brien, M. (2020, June 09). IBM quits facial recognition, joins call for police reforms. Retrieved January 13, 2021, from https://apnews.com/article/5ee4450df46d2d96bf85d7db683bb0a6

Rieland, R. (2018, March 05). Artificial intelligence is now used to predict crime. but is it biased? Retrieved February 04, 2021, from https://www.smithsonianmag.com/innovation/artificial-intelligence-is-now-used-predict-crime-is-it-biased-180968337/

Seo, S., Chan, H., Brantingham, J., Leap, J., Vayanos, P., Tambe, M., & Liu, Y. (2018, December). Partially Generative Neural Networks for Gang Crime Classification with Partial Information [Scholarly project]. Retrieved January 13, 2021, from https://www.researchgate.net/publication/330297395_Partially_Generative_Neural_Networks_for_Gang_Crime_Classification_with_Partial_Information​

State of California Department of Justice, Attorney General Xavier Becerra. (2020, July 14). Attorney General Becerra Restricts Access to LAPD-Generated CalGang Records, Issues Cautionary Bulletin to All Law Enforcement, and Encourages Legislature to Reexamine CalGang Program [Press release]. Retrieved January 13, 2021, from https://oag.ca.gov/news/press-releases/attorney-general-becerra-restricts-access-lapd-generated-calgang-records-issues

Referenced Media

[1] Senne, S. (2019). Photograph, Cambridge. AP News. https://apnews.com/article/24fd8e9bc6bf485c8aff1e46ebde9ec1

[2] Buolamwini, J., & Gebru, T. Gender Shades. http://gendershades.org/overview.html.

[3] Buolamwini, J., & Raji, I. D. (2019). Actionable Auditing: Investigating the Impact of Publicly Naming Biased Performance Results of Commercial AI Products [Scholarly project]. In Algorithmic Justice League. Retrieved January 13, 2021, from https://www.ajl.org/library/research​

[4] Grother, P., Ngan, M., & Hanaoka, K. (2019, December). Face Recognition Vendor Test (FRVT) Part 3: Demographic Effects. Retrieved January 13, 2021, from https://doi.org/10.6028/NIST.IR.8280​

[5] Fisher, J. (2020, November 25). Mitigating social bias in knowledge graph embeddings. Retrieved February 18, 2021, from https://www.amazon.science/blog/mitigating-social-bias-in-knowledge-graph-embeddings

[6] Rieland, R. (2018, March 05). Artificial intelligence is now used to predict crime. but is it biased? Retrieved February 04, 2021, from https://www.smithsonianmag.com/innovation/artificial-intelligence-is-now-used-predict-crime-is-it-biased-180968337/
