Bias, Fairness, Diversity and Novelty

When dealing with bias in IR systems, we are often faced with the question of how bias, fairness, diversity, and novelty differ from and relate to one another. We briefly talked about the relationship between bias and fairness in the previous article. Let us now look at diversity and novelty.

DIVERSITY aims to increase the topical coverage of the search results. Because the information need varies by individual user, predicting the intention from the query alone is a difficult and sometimes impossible task for the IR system. Even with additional information such as a personal profile, social identity and network, geolocation at the time of the search request, and browsing history, it is still hard for the system to precisely accommodate each individual’s information need. One perspective is the ambiguity associated with the search query. For example, “apple” may mean the fruit, or the company named Apple. Apart from the ambiguity inherent in the language, a query may also be ambiguous due to the search intent. For instance, a user searching for “rutgers” may be looking for the official website of Rutgers University, or the location, description, ranking, or recent news about the university. The IR system must consider all the possibilities of user search intent and return the most relevant results. Another perspective is the subtopics or topic aspects of the searched topic. For instance, different opinion polarities about “gun control” should be included in the search results so that the information presented provides a comprehensive view of the topic. Increasing diversity means including as many topical subgroups as possible. As a result, diversity can alleviate some bias by enriching the results with more perspectives, avoiding results that all come from the same group. Meanwhile, diversity can increase fairness because it accounts for all subgroups/aspects of the topic.
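
To make this concrete, here is a minimal sketch of one common approach: a greedy re-ranking that rewards results covering subtopics not yet represented. The candidate documents, subtopic labels, and relevance scores below are all made up for illustration; a real system would infer subtopics from query logs or topic models.

    # Minimal sketch of coverage-greedy diversification (illustrative data only).
    # Each candidate is (doc_id, relevance, set_of_subtopics), echoing the
    # ambiguous "apple" query above.
    candidates = [
        ("d1", 0.9, {"fruit"}),
        ("d2", 0.8, {"company"}),
        ("d3", 0.7, {"fruit"}),
        ("d4", 0.6, {"company", "stock"}),
    ]

    def diversify(candidates, k):
        """Greedily build a ranking that rewards covering new subtopics."""
        covered, ranking = set(), []
        pool = list(candidates)
        while pool and len(ranking) < k:
            # Score = relevance + a bonus for each subtopic not covered yet.
            best = max(pool, key=lambda c: c[1] + len(c[2] - covered))
            pool.remove(best)
            ranking.append(best[0])
            covered |= best[2]
        return ranking

    print(diversify(candidates, 3))  # ['d4', 'd1', 'd2'] -- all subtopics covered early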

NOVELTY aims to reduce the redundancy in the retrieved information. For instance, given the search query “gun control”, if two consecutive results are from the same website, or one is a forwarded or cited version of the other, then users may find the second one redundant. In other words, novelty tries to bring as much “new” information as possible into the set of retrieved results. From this perspective, we can see that diversity and novelty can benefit each other to some extent, but neither guarantees the other. On the one hand, diversity brings new information by introducing different subtopics/aspects. But it does not address in-group redundancy, i.e., it does not care how many results fall into the same topical group, as long as the results cover as many groups as possible. So diversity does not guarantee novelty. Novelty, on the other hand, can surface different subtopics/aspects by reducing redundant information. But novelty does not care about the topical groups, as long as each result introduces new information compared to the previous results. In terms of bias, a skewed view may be avoided by increasing novelty, but since there is no guarantee on the “group” distribution, there is no guarantee of removing the bias.
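
One classic way to operationalize novelty is Maximal Marginal Relevance (MMR), which scores each candidate by its relevance minus a penalty for similarity to results already selected. The sketch below uses a hard-coded toy similarity table in place of a real text-similarity function; all the scores are invented for illustration.

    # Rough sketch of MMR-style novelty re-ranking (illustrative data only).
    # sim(a, b) would normally be computed from text; here it is a toy table.
    relevance = {"a": 0.9, "b": 0.85, "c": 0.5}
    similarity = {("a", "b"): 0.95, ("a", "c"): 0.1, ("b", "c"): 0.2}

    def sim(x, y):
        return 1.0 if x == y else similarity.get((x, y), similarity.get((y, x), 0.0))

    def mmr(docs, k, lam=0.7):
        """Pick k docs, trading off relevance against redundancy with picks so far."""
        selected = []
        pool = list(docs)
        while pool and len(selected) < k:
            def score(d):
                redundancy = max((sim(d, s) for s in selected), default=0.0)
                return lam * relevance[d] - (1 - lam) * redundancy
            best = max(pool, key=score)
            pool.remove(best)
            selected.append(best)
        return selected

    print(mmr(["a", "b", "c"], 2))  # ['a', 'c']: 'b' is a near-duplicate of 'a'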

FAIRNESS aims to bring balance to the retrieved results according to a subjective design need. If the goal is to enforce topical fairness, then fairness requires all subtopics/aspects to be covered, hence maximum diversity. But fairness does not necessarily concern topical groups: it can be imposed on other groups defined by, for example, gender, race, or religion. So achieving fairness and achieving diversity can be different goals. In addition, the key point of topical fairness is to balance the number of results from each topical group, while diversity only aims to maximize the total number of groups covered. For example, if the search results for a given query contain two subtopic groups, then diversity can be achieved by including one result from one group and taking the rest of the results from the other. But fairness may require the same number of results from each group, depending on the notion of fairness chosen in the system design.
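
The contrast can be made concrete with a small sketch: a coverage check (all that diversity asks for) versus a round-robin re-ranking that balances the number of results per group (one possible notion of fairness). The result list and the “pro”/“con” group labels are invented, echoing the “gun control” example above.

    from collections import defaultdict, deque

    # Minimal sketch contrasting coverage (diversity) with balance (one notion
    # of topical fairness). Results and group labels are made up.
    results = [("r1", "pro"), ("r2", "pro"), ("r3", "pro"),
               ("r4", "con"), ("r5", "con")]

    def covered_groups(ranking):
        """Diversity view: which distinct groups appear at all."""
        return {group for _, group in ranking}

    def balanced_top_k(results, k):
        """Fairness view: round-robin over groups for (roughly) equal counts."""
        buckets = defaultdict(deque)
        for doc, group in results:
            buckets[group].append((doc, group))
        ranking = []
        while len(ranking) < k and any(buckets.values()):
            for group in list(buckets):
                if buckets[group] and len(ranking) < k:
                    ranking.append(buckets[group].popleft())
        return ranking

    top4 = balanced_top_k(results, 4)
    print(top4)                  # alternates pro/con: equal exposure per group
    print(covered_groups(top4))  # {'pro', 'con'} -- diversity asks only for this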

To sum up, while diversity and novelty can potentially reduce bias and improve fairness, their goals are essentially different from the goals of unbiasedness and fairness.

Bias and Fairness in Data Science and Machine Learning

Where does bias come from?

Bias in data science and machine learning may come from the source data, from the algorithm or system, or from human cognition. Imagine that you are analyzing criminal records for two districts. The records include 10,000 residents from district A and 1,000 residents from district B. 100 district A residents and 50 district B residents have committed crimes in the past year. Would you conclude that people from district A are more likely to be criminals than people from district B? If you simply compare the numbers of criminals in the past year, you are very likely to reach this conclusion. But if you look at the crime rate, you will find that district A’s rate is 1%, which is lower than district B’s 5%. Based on this analysis, the previous conclusion is biased against district A residents. This type of bias is generated by the analysis method, so we call it algorithmic bias or system bias. Does the rate-based analysis guarantee an unbiased conclusion? The answer is no. It could be that both districts have a population of 10,000, which would mean the records contain complete statistics for district A but only partial statistics for district B. Depending on how the records were collected, 5% may or may not be the true crime rate for district B. As a consequence, we may still arrive at a biased conclusion. This type of bias is inherent in the data we are examining, so we call it data bias. The third type is cognitive bias, which arises from our perception of the presented data. For example, given the conclusions of two criminal-analysis agencies, you may tend to believe one over the other because the former has a higher reputation, even though its conclusion may be the biased one. Read about a real-world case of a machine learning algorithm being racially biased on recidivism here: https://www.nytimes.com/2017/10/26/opinion/algorithm-compas-sentencing-bias.html.
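
The arithmetic of the district example is worth spelling out. In the sketch below, the record counts come from the text, while district B’s true population of 10,000 is the hypothetical scenario described above.

    # The district example, worked out explicitly. Numbers are from the text;
    # district B's "true" population of 10,000 is the hypothetical scenario.
    records = {"A": {"residents_in_records": 10_000, "criminals": 100},
               "B": {"residents_in_records": 1_000,  "criminals": 50}}

    # Count-based comparison (the algorithmically biased reading):
    # 100 > 50, so district A "looks" worse.
    print(records["A"]["criminals"] > records["B"]["criminals"])   # True

    # Rate-based comparison: A = 1%, B = 5%, so B's recorded rate is higher.
    for district, r in records.items():
        print(district, r["criminals"] / r["residents_in_records"])  # A 0.01, B 0.05

    # Data bias: if B actually has 10,000 residents but only 1,000 appear in
    # the records, the true rate could be as low as 50 / 10,000 = 0.5%.
    print(50 / 10_000)                                             # 0.005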

Bias is everywhere

With the explosion of data and technologies, we are immersed in all kinds of data applications. Think of the news you read every day on the internet, the music you listen to through streaming services, the ads displayed while you browse webpages, the products recommended to you when you shop online, the information you find through search engines, and so on: bias can be present everywhere without people’s awareness. Like “you are what you eat”, the data you consume is so powerful that it can shape your views, preferences, judgements, and even decisions in many aspects of your life. Say you want to know whether some food is good or bad for your health. A search engine returns 10 pages of results. The first result, and most of the results on the first page, state that the food is healthy. To what extent do you believe the search results? After glancing at the first page, will you conclude that the food is beneficial, or at least that its benefits outweigh its harms? How likely are you to continue to the second page? Are you aware that the second page may contain results about the harms of the food, so that the results on the first page are biased? As a data scientist, it is important to be careful to avoid biased outcomes. But as a human being living in a world of data, it is even more important to be aware of the bias that may exist in your daily data consumption.

Bias vs. Fairness

It is possible for bias to lead to unfairness, but can something be biased and also fair? The answer is yes. Think of bias as a skewed view of the protected groups, while fairness is a subjective measurement of the data or of the way the data is handled. In other words, bias and fairness are not necessarily contradictory. Consider employee diversity in a US company where all but one employee are US citizens. Is the employment structure biased toward US citizens? Yes, if it is the result of US citizens being favored during the hiring process. Is it a fair structure? Yes and no. According to the Rooney Rule, it is fair, since the company hired at least one minority; according to statistical parity, it is unfair, since the numbers of US citizens and noncitizens are not equal. In general, bias is easy and direct to measure, yet fairness is subtler due to the various subjective concerns involved. There are many different fairness definitions to choose from, and some of them even contradict each other. Check out this tutorial https://www.youtube.com/watch?v=jIXIuYdnyyk for examples and helpful insights on fairness definitions from the perspective of a computer scientist.
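
To make the two verdicts concrete, here is a toy check of the hiring example against both notions, using deliberately simplified readings of each rule (the Rooney Rule as stated in the text above, and statistical parity as exactly equal counts). The company size of 20 is an assumption; the text only says all but one employee are US citizens.

    # Toy check of the hiring example against two fairness notions (simplified).
    # Company size of 20 is assumed; "all but one" employee is a US citizen.
    employees = ["citizen"] * 19 + ["noncitizen"] * 1

    def rooney_rule(employees):
        """Simplified reading from the text: at least one minority hire."""
        return employees.count("noncitizen") >= 1

    def statistical_parity(employees):
        """Simplified reading: exactly equal counts per group."""
        return employees.count("citizen") == employees.count("noncitizen")

    print(rooney_rule(employees))         # True  -- fair under this notion
    print(statistical_parity(employees))  # False -- unfair under this notion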