Where does bias come from?
Bias in data science and machine learning can come from three sources: the data itself, the algorithm or system that processes it, and human cognition. Imagine that you are analyzing criminal records for two districts. The records cover 10,000 residents of district A and 1,000 residents of district B. In the past year, 100 district A residents and 50 district B residents committed crimes. Would you conclude that people from district A are more likely to be criminals than people from district B? If you simply compare the number of offenders, you are very likely to reach this conclusion. But if you look at the crime rate, you will find that district A’s rate is 1%, which is lower than district B’s 5%. The first conclusion is therefore biased against district A residents. This type of bias comes from the method of analysis, so we call it algorithmic bias or system bias.
Does the rate-based analysis guarantee an unbiased conclusion? The answer is no. Suppose both districts actually have a population of 10,000. Then the records are complete for district A but cover only part of district B. Depending on how the records were collected, 5% may or may not be the true crime rate for district B, so we may still arrive at a biased conclusion. This type of bias is inherent in the data we are examining, so we call it data bias.
The third type is cognitive bias, which arises from how we perceive the data presented to us. For example, suppose you are given conclusions from two crime analysis agencies. You tend to believe one over the other because it has a better reputation, even though its conclusion may be the biased one.
For a real-world case of machine learning algorithms being racially biased in predicting recidivism, see: https://www.nytimes.com/2017/10/26/opinion/algorithm-compas-sentencing-bias.html.
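The arithmetic in the two-district example can be checked with a few lines of Python. This is only a minimal sketch: the variable and function names are made up for illustration, and the bounds at the end assume, as in the hypothetical above, that district B really has 10,000 residents of whom only 1,000 were recorded.

```python
# Recorded figures from the two-district example.
records = {
    "A": {"residents": 10_000, "offenders": 100},
    "B": {"residents": 1_000, "offenders": 50},
}


def crime_rate(district):
    """Recorded offenders divided by recorded residents."""
    return district["offenders"] / district["residents"]


# Comparing raw counts singles out district A ...
higher_by_count = max(records, key=lambda name: records[name]["offenders"])
# ... while comparing rates singles out district B.
higher_by_rate = max(records, key=lambda name: crime_rate(records[name]))

# Data bias (assumption): district B truly has 10,000 residents,
# so 9,000 of them are missing from the records.
true_population_b = 10_000
unrecorded_b = true_population_b - records["B"]["residents"]

# Without knowing how the 1,000 records were collected, the true rate is
# only bounded: none, or all, of the unrecorded residents may have offended.
lowest_rate_b = records["B"]["offenders"] / true_population_b
highest_rate_b = (records["B"]["offenders"] + unrecorded_b) / true_population_b

print(f"Higher by count: district {higher_by_count}")
print(f"Higher by rate:  district {higher_by_rate}")
print(f"District B true rate lies between {lowest_rate_b:.1%} and {highest_rate_b:.1%}")
```

The wide interval in the last line is the point: the recorded 5% is a reliable estimate only if the 1,000 recorded residents are representative of the whole district.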
Bias is everywhere
With the explosion of data and technology, we are immersed in all kinds of data applications. Think of the news you read every day on the internet, the music you stream, the ads displayed while you browse the web, the products recommended to you when you shop online, and the information you find through search engines. Bias can be present in all of them without people being aware of it. As with “you are what you eat”, the data you consume is powerful enough to shape your views, preferences, judgements, and even decisions in many aspects of your life. Say you want to know whether some food is good or bad for your health. A search engine returns 10 pages of results. The first result, and most of the results on the first page, state that the food is healthy. To what extent do you believe the search results? After glancing at the first page, will you conclude that the food is beneficial, or at least that its benefits outweigh its harms? How likely are you to go on to check the second page? Are you aware that the second page may contain results about the food’s harms, which would mean the first page gives a biased picture? As a data scientist, it is important to take care to avoid biased outcomes. But as a human being living in a world of data, it is even more important to be aware of the bias that may exist in your daily data consumption.
Bias vs. Fairness
Bias can lead to unfairness, but can something be biased and still fair? The answer is yes. Think of bias as a skewed view of the protected groups, and of fairness as a subjective measure of the data or of the way the data is handled. In other words, bias and fairness do not necessarily contradict each other. Consider employee diversity in a US company where all but one of the employees are US citizens. Is the employment structure biased toward US citizens? Yes, if it results from US citizens being favored during the hiring process. Is it a fair structure? Yes and no. Under a criterion in the spirit of the Rooney Rule, which strictly speaking requires interviewing at least one minority candidate, it is fair, since the company hired at least one minority employee. Under statistical parity, it is unfair, since US citizens and noncitizens are far from equally represented. In general, bias is relatively easy and direct to measure, whereas fairness is subtler because of the various subjective concerns involved. There are many different fairness definitions to choose from, and some of them contradict each other. See this tutorial, https://www.youtube.com/watch?v=jIXIuYdnyyk, for examples and helpful insights into fairness definitions from the perspective of a computer scientist.
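To see how two fairness criteria can disagree on the same company, here is a small Python sketch. Several things in it are assumptions rather than part of the example above: the applicant counts are invented, statistical parity is formalised in one common way (equal selection rates across groups, up to a tolerance), the tolerance value is arbitrary, and the "at least one" check is a simplified stand-in for the Rooney-Rule-style criterion.

```python
# Hypothetical hiring data. The hires match the example (all but one of the
# 20 employees are citizens); the applicant counts are invented.
hiring = {
    "citizen": {"applicants": 100, "hired": 19},
    "noncitizen": {"applicants": 10, "hired": 1},
}


def selection_rate(group):
    """Fraction of a group's applicants who were hired."""
    return group["hired"] / group["applicants"]


def at_least_one_rule(data, protected):
    """Simplified 'at least one' criterion: someone from the group was hired."""
    return data[protected]["hired"] >= 1


def statistical_parity_gap(data, group_a, group_b):
    """Difference in selection rates between two groups."""
    return selection_rate(data[group_a]) - selection_rate(data[group_b])


def satisfies_statistical_parity(data, group_a, group_b, tolerance=0.01):
    """Parity holds if the selection rates differ by at most `tolerance`
    (the default tolerance is an arbitrary choice for this sketch)."""
    return abs(statistical_parity_gap(data, group_a, group_b)) <= tolerance


print("At-least-one rule satisfied:",
      at_least_one_rule(hiring, "noncitizen"))
print("Statistical parity satisfied:",
      satisfies_statistical_parity(hiring, "citizen", "noncitizen"))
print("Selection-rate gap:",
      round(statistical_parity_gap(hiring, "citizen", "noncitizen"), 2))
```

With these invented numbers the same hiring outcome passes one check and fails the other, which is the sense in which the answer to "is it fair?" depends on the definition chosen.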