How practitioners and academics think (and then forget) about fairness when building AI systems.
One of the biggest challenges, as machine learning and AI are increasingly used to make decisions about everything from credit risk to employee recruitment, is how to evaluate their fairness. Do algorithms make judgements that are sexist, racist or otherwise discriminatory? And how do those using them mitigate bias?
When asked what they think, many AI practitioners will say the right thing. But what drives their decisions when they actually start building a system?
This was the question that motivated Johanna Fyrvald, who wrote her Master’s thesis with me as supervisor last term. Through three interviews with academics and four interviews with practitioners (CTOs and Heads of AI) at Swedish companies, she aimed to find out how they think about fairness in Artificial Intelligence.
Johanna conducted her interviews in two stages. First, she asked the interviewees about their knowledge of and attitudes towards fairness; then, in the second part of the interview, she asked them to imagine themselves in a particular AI-design scenario. It is this latter part of her study that I concentrate on here. Much of what I write here is a summary of her thesis, which can be downloaded in full here.
Some of you may already have tried Johanna’s questionnaire, because, to complement her qualitative interviews, she also ran a quantitative online questionnaire in which 303 respondents (prompted to go to the survey through Johanna and me spreading it on social media) answered the same questions.
Here is the scenario. Read it carefully.
Here is the first question she posed in her interviews.
Taken purely as a mathematical task, this question gives the false positive rate of the system for men (10/400=2.5%) and women (10/100=10%). The majority of the AI-expert interviewees successfully calculated these percentages, and most of them (although not all) thought that it was unfair to women.
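For anyone who wants to check the arithmetic, here is a minimal sketch of the calculation in Python. The counts (10 errors out of 400 for men, 10 out of 100 for women) are the ones quoted above; the function name, and my reading of what the denominators count, are illustrative assumptions rather than part of the study itself.

```python
# A minimal sketch of the false positive rate calculation.
# Reading of the scenario assumed here: a false positive is the system
# letting someone in when it should not have, and the denominator is the
# number of entry attempts in that group that should have been refused.
def rate(errors, group_size):
    return errors / group_size

fpr_men = rate(10, 400)    # 0.025
fpr_women = rate(10, 100)  # 0.10

print(f"False positive rate, men:   {fpr_men:.1%}")   # 2.5%
print(f"False positive rate, women: {fpr_women:.1%}")  # 10.0%
```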
The argument of unfairness can be made both ways with regard to gender. On the one hand, we could argue the system is biased against men, since women have less trouble getting into the building than men do. On the other hand, the system ‘fails’ more often for women than for men, making it unfair to women.
Of the 303 online participants in the survey, the majority of those who thought the system was biased also thought that it was women who were being unfairly treated.
When explaining their decisions, the online participants used reasonable arguments for discrimination against both men (e.g. “men have less opportunity to sneak into the building”) and women (e.g. “women are unfairly treated since the system makes more mistakes on them”).
In Johanna’s interviews with the seven AI experts, only one participant mentioned that there was no way to find out the false negative rates, which would be needed to decide whether the system is biased and which gender (if any) is being unfairly treated. This information is given in the second question, which is:
Purely mathematically, we now know that the false negative rate is 5% for women and 2.5% for men. The question is deliberately worded to make it slightly difficult to work this out. Now, I think, unfairness against women is the only correct answer: female employees can’t get into the building they work in!
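To put the two error rates side by side, here is the same kind of sketch; since the underlying counts behind the false negatives are not repeated in this summary, it simply uses the rates quoted above.

```python
# Error rates per gender, as quoted above.
error_rates = {
    "men":   {"false positive": 0.025, "false negative": 0.025},
    "women": {"false positive": 0.10,  "false negative": 0.05},
}

# A false negative here means a legitimate employee being refused entry,
# which is why the higher rate for women reads as unfairness against them.
for gender, rates in error_rates.items():
    print(f"{gender}: wrongly admitted {rates['false positive']:.1%} of attempts, "
          f"wrongly refused {rates['false negative']:.1%} of attempts")
```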
The proportion of online participants who identified unfairness against women increased after the second question.
The third question provides no new information but spells out the maths behind the problem.
Now nearly everyone in the online survey, and all of the experts in Johanna’s interviews, agreed that there was a bias and thought that it was against women.
For the fourth question, things get tricky again.
Most people in the online survey preferred the original system, despite the new system eliminating the biases.
Those who preferred the new system simply referred to it being fairer, while those who liked the old one gave several arguments in its defence. One respondent who preferred the older system argued that “Errors should be eliminated, not forced to be equal across gender lines.” Another said that, “Degrading system functionality with respect to men hides the fundamental issue of system design for correctly identifying women.”
In Johanna’s interviews, some of the experts said that, as designers, it wasn’t up to them to decide which system was fairest, referring instead to some form of external authority who should make the decision. They also suggested a number of technical solutions to the problem, such as increasing the size of the training set for women.
And finally, we come to the trickiest question of all.
This was a question which Johanna and I spent a lot of time talking about when designing the study. We discussed alternatives, such as using race instead of gender. What is interesting about gender is that, historically, men do commit more illegal entry crimes than women do. But, in many countries, there are also laws which mean that we can’t build systems that discriminate on the basis of gender, in which case the crime rate for men is not a relevant factor in the design of the system.
None of the academics or practitioners interviewed by Johanna mentioned the legal dilemma involved in using this information. One academic said, “If you know for sure that men are more likely to enter illegally, that information should be used when building the system and men should be treated stricter than women.”
One practitioner said that, “If this information is known, then the focus should be on reducing the number of false positives for men.”
Another of the practitioners deferred to authority, saying it would be up to someone else to make the decision: “it comes back to what you are optimizing for.”
Not all of the AI experts thought that the information should be used. The expert who objected most strongly did not mention legal issues, but instead talked about ways to improve performance for women, so as to make it both accurate and fair.
Half of the 303 participants in the online study thought that the offending-rate information was useful and should be used in the design.
The online respondents who answered ‘no’ did refer to legal issues with using data like this. One of them wrote that it would be “illegal and unethical to do this”; others pointed out that race/gender/disability profiling is illegal. And one respondent pointed out that we were now considering intentional gender bias, rather than the unintentional bias in the earlier questions.
This last question was, for me, the most interesting. It revealed that, even within the short timespan of thinking about a design scenario, Johanna’s interviewees were setting aside legal issues and using their own judgment to guide them. In the first part of the interview, before the scenario was presented, these same AI experts had shown awareness of legal issues and talked about how important it was to follow the rules set up to deal with these situations.
I may well have made the same mistakes. It’s easy to start thinking about practical solutions when a practical situation is described. And easy to forget the principles you should be following.
I recommend that you read Johanna’s thesis, of which this article gives only the highlights. I think she has done an excellent job of combining qualitative and quantitative approaches to investigate a difficult question. There is little doubt that more research is needed into how practitioners in AI think and act around issues of bias and fairness.