{\huge Gender Bias: It Is Worse Than You Think}

Science meets bias and diversity \vspace{.5in}

Deborah Belle is a psychology professor at Boston University (BU) who is interested in gender differences in social behavior. She has reported a shocking result about bias.

Today I thought I would discuss gender bias and the related question of the advantages of diversity.

Lately at Tech we have had a long email discussion on implicit bias and how we might do a better job of avoiding it in the future. My usual inclination is to think about such issues and see if there is some science behind our assumptions. One colleague stated:

The importance of diversity is beyond reasonable doubt, isn't it?

I agree. But I am always looking for ``proof.''

Do not get me wrong. I have always been for diversity. I helped hire the first female assistant professor in engineering at Princeton decades ago. And I have always felt that it is important to have more diversity in all aspects of computer science. But is there some science behind this belief? Or is it just axiomatic---something that we believe and that needs no argument---that it is ``beyond reasonable doubt?''

This is how I found Deborah Belle, while looking on the web for ``proof.'' I will just quote the BU Today article on her:

Here's an old riddle. If you haven't heard it, give yourself time to answer before reading past this paragraph: a father and son are in a horrible car crash that kills the dad. The son is rushed to the hospital; just as he's about to go under the knife, the surgeon says, ``I can't operate---that boy is my son!'' Explain. (Cue the ``Final Jeopardy!'' music.)

If you guessed that the surgeon is the boy's gay, second father, you get a point for enlightenment, at least outside the Bible Belt. But did you also guess the surgeon could be the boy's mother? If not, you're part of a surprising majority.

In research conducted by Mikaela Wapman (CAS'14) and Deborah Belle, a College of Arts & Sciences psychology professor, even young people and self-described feminists tended to overlook the possibility that the surgeon in the riddle was a she. The researchers ran the riddle by two groups: 197 BU psychology students and 103 children, ages 7 to 17, from Brookline summer camps.

In both groups, only a small minority of subjects---15 percent of the children and 14 percent of the BU students---came up with the mom's-the-surgeon answer. Curiously, life experiences that might suggest the `mom' answer ``had no association with how one performed on the riddle,'' Wapman says. For example, the BU student cohort, where women outnumbered men two-to-one, typically had mothers who were employed or were doctors---``and yet they had so much difficulty with this riddle,'' says Belle. Self-described feminists did better, she says, but even so, 78 percent did not say the surgeon was the mother.

This shocked me. I have known this riddle forever, it seems, but I was surprised to see that it is still an issue. I think this demonstrates in a pretty stark manner how important it is to be aware of implicit bias. My word, things are worse than I ever thought.

Bias In Diversity Studies

I looked some more and discovered that there was, I believe, bias even in studies of bias. This may be even more shocking: top researchers into the importance of diversity have made implicit bias errors of their own. At least that is how I view their research.

Again I will quote an article, this time from Stanford:

In 2006 Margaret Neale of Stanford University, Gregory Northcraft of the University of Illinois at Urbana-Champaign and I set out to examine the impact of racial diversity on small decision-making groups in an experiment where sharing information was a requirement for success. Our subjects were undergraduate students taking business courses at the University of Illinois. We put together three-person groups---some consisting of all white members, others with two whites and one nonwhite member---and had them perform a murder mystery exercise. We made sure that all group members shared a common set of information, but we also gave each member important clues that only he or she knew. To find out who committed the murder, the group members would have to share all the information they collectively possessed during discussion. The groups with racial diversity significantly outperformed the groups with no racial diversity. Being with similar others leads us to think we all hold the same information and share the same perspective. This perspective, which stopped the all-white groups from effectively processing the information, is what hinders creativity and innovation.

Nice study. But why choose to study only all-white groups and groups of two whites and one nonwhite? What about the other two possibilities: all nonwhite, and two nonwhites and one white? Did this not even occur to the researchers? I could imagine that the all-nonwhite groups do the best, or that the two-nonwhites-and-one-white groups do the worst. Who knows. The sin here seems to be not even considering all four combinations.

It Is Even Worse

Tolga Bolukbasi, Kai-Wei Chang, James Zou, Venkatesh Saligrama, and Adam Kalai have a recent paper in NIPS with the wonderful title, ``Man is to Computer Programmer as Woman is to Homemaker? Debiasing Word Embeddings.''

Again we will simply quote the paper:

The blind application of machine learning runs the risk of amplifying biases present in data. Such a danger is facing us with word embedding, a popular framework to represent text data as vectors, which has been used in many machine learning and natural language processing tasks. We show that even word embeddings trained on Google News articles exhibit female/male gender stereotypes to a disturbing extent. This raises concerns because their widespread use, as we describe, often tends to amplify these biases. Geometrically, gender bias is first shown to be captured by a direction in the word embedding. Second, gender neutral words are shown to be linearly separable from gender definition words in the word embedding. Using these properties, we provide a methodology for modifying an embedding to remove gender stereotypes, such as the association between the words receptionist and female, while maintaining desired associations such as between the words queen and female. Using crowd-worker evaluation as well as standard benchmarks, we empirically demonstrate that our algorithms significantly reduce gender bias in embeddings while preserving its useful properties such as the ability to cluster related concepts and to solve analogy tasks. The resulting embeddings can be used in applications without amplifying gender bias.
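To make the geometry concrete, here is a minimal sketch in Python of the ``neutralize'' step of hard debiasing: estimate a gender direction from definitional pairs, then remove each gender-neutral word's component along that direction. The tiny vocabulary, random toy vectors, and names like neutralize are mine for illustration, not the paper's code, and the paper's full algorithm also equalizes definitional pairs, which this sketch omits.

\begin{verbatim}
import numpy as np

rng = np.random.default_rng(0)

def unit(v):
    """Normalize a vector to unit length."""
    return v / np.linalg.norm(v)

# Hypothetical toy embedding; in practice these vectors would come
# from word2vec or GloVe trained on a large corpus.
emb = {w: unit(rng.normal(size=50)) for w in
       ["he", "she", "man", "woman", "receptionist", "queen"]}

# Step 1: estimate a gender direction g from definitional pairs.
pairs = [("she", "he"), ("woman", "man")]
g = unit(sum(emb[a] - emb[b] for a, b in pairs))

# Step 2: "neutralize" a gender-neutral word by subtracting its
# component along g and re-normalizing.
def neutralize(v, g):
    return unit(v - np.dot(v, g) * g)

emb["receptionist"] = neutralize(emb["receptionist"], g)

# After neutralizing, the word has essentially no component along g.
print(np.dot(emb["receptionist"], g))  # ~0.0
\end{verbatim}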

Here is one of their examples. Suppose we want to fill X in the analogy, ``he is to doctor as she is to X.'' A typical embedding prior to their algorithm may return X = nurse. The hard-debiasing algorithm finds X = physician. However, it recognizes cases where gender distinctions should be preserved, e.g., given ``she is to ovarian cancer as he is to Y,'' it fills in Y = prostate cancer. Their results show that their hard-debiasing algorithm performs significantly better than a ``soft-debiasing'' approach and performs as well or nearly as well on benchmarks apart from gender bias.
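For readers who have not seen how such analogies are computed, here is a sketch of the classic vector-arithmetic recipe (the ``b - a + c'' method), which the paper builds on; their actual analogy generation uses a more constrained formulation, so this is a simplification, and the function name and example outputs are illustrative.

\begin{verbatim}
import numpy as np

def analogy(emb, a, b, c):
    """Fill X in 'a is to b as c is to X': return the vocabulary
    word whose vector is most similar to b - a + c."""
    target = emb[b] - emb[a] + emb[c]
    target = target / np.linalg.norm(target)
    best, best_sim = None, -1.0
    for w, v in emb.items():
        if w in (a, b, c):
            continue  # exclude the query words themselves
        sim = float(np.dot(target, v / np.linalg.norm(v)))
        if sim > best_sim:
            best, best_sim = w, sim
    return best

# With a real embedding, before debiasing one might observe
#   analogy(emb, "he", "doctor", "she")  ->  "nurse"
# while after hard debiasing the same query might yield
#   analogy(emb, "he", "doctor", "she")  ->  "physician"
\end{verbatim}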

Overall, however, many have noted that machine learning algorithms are inhaling the bias that exists in the lexical sources they data-mine. ProPublica has a whole series on this, including the article, ``Breaking the Black Box: How Machines Learn to Be Racist.'' And sexist, we can add. The examples are not just linguistic---they are real policy decisions and actions that are biased.

How to Balance Bias?

Ken wonders whether aiming for parity in language will ever be effective in offsetting bias. Putting more weight in the center doesn't achieve balance when all the other weight is on one side.

The e-mail thread among my colleagues centered on the recent magazine cover story in The Atlantic, ``Why Is Silicon Valley So Awful to Women?'' The story includes this anecdote:

When [Tracy] Chou discovered a significant flaw in [her] company’s code and pointed it out, her engineering team dismissed her concerns, saying that they had been using the code for long enough that any problems would have been discovered. Chou persisted, saying she could demonstrate the conditions under which the bug was triggered. Finally, a male co-worker saw that she was right and raised the alarm, whereupon people in the office began to listen. Chou told her team that she knew how to fix the flaw; skeptical, they told her to have two other engineers review the changes and sign off on them, an unusual precaution.

One colleague went on to ascribe the ``horribleness'' of many computer systems in everyday use to the male lopsidedness of their creation. This leads me to wonder: can we find the ``proof'' I want by making a study of the possibility that ``men are buggier''---or more solidly put, that gender diversity improves software development?

Recall Ken wrote a post on themes connected to his department's Distinguished Speaker series for attracting women into computing. The series includes our own Ellen Zegura on April 22. The post includes Margaret Hamilton and her work for NASA's Apollo missions, including the iconic photo of the stack of code she wrote being taller than she was. Arguments over the extent of her role can perhaps be resolved from sources listed here and here, but there is primary confirmation of her strong hand in code that had to be bug-free before deployment.

We recently posted our amazement at the large-scale consequences of bugs in code written at the underclass college level, such as overflowing a buffer. Perhaps one could do a study of gender and project bugs from college or business applications where large data could be made available. The closest large-scale study we've found analyzed acceptance rates of coding suggestions (``pull requests'') from over 1.4 million users of GitHub (news summary), but this is not the same thing as analyzing design thoroughness and bug rates. Nor is anything like this getting at the benefits of having men and women teamed together on projects, or at least in a mutual consulting capacity.

It is easy to find sources from a year ago hailing that study with headlines like ``Women Are Better Coders Than Men...'' Ordinarily that kind of ``hype'' repulses Ken and me, but Ken says maybe this lever has a rock to stand on. What if we ``think different'' and embrace gender bias by positing that women approach software in significantly different ways---where having the difference is demonstrably helpful?

Open Problems

What would constitute ``proof'' that gender diversity is concretely helpful?