Beyond “social” biases in data:
Historicity, scientific opportunities and societal risks
11 January 2022 - 12h15-14h00
B101, Bastions - Online, Zoom
Registration mandatory - via this link
The Data Science Competence Center (CCSD) of the University of Geneva is pleased to invite you to the sixth edition of the Data Science Seminars, exploring “social” biases in data.
Biases are inherent to any process of data collection and analysis, and depend on the objective of a given research project. This problem is of course well known in science, and more particularly in statistics, which has developed sophisticated tools and methods to limit biases or to better understand their consequences on results. Tackling "social" biases deliberately thus appears as a way to ensure quality research and to open up new and unexpected avenues. However, with the rise of Big Data - and the aggregation of large, disparate data sets - this critical awareness is in many cases relegated, voluntarily or involuntarily, to the background of the analysis. Coupled with the sometimes blind trust placed in certain emerging data analytics methods, this tendency is not without societal risk, as these data sometimes support political, economic and social decision-making processes, reinforcing the importance of an in-depth understanding of "social" biases.
Through concrete examples drawn from their own work, the speakers at this seminar will present how “social” biases appear in their research and how they tackle them. These presentations will notably highlight that, in the general sense of the term, almost all data sets are biased; the study of this problem has a long-standing history in the sciences as well as in official statistics, and a rich toolbox of methods exists to handle what is today often called ‘social bias’. They will also underline that the distinction between noise and information can never be taken for granted, and that the bias of one research project can become the very data of another. In this respect, separating noise from information is always a political choice, which should be thoroughly thought out, explicitly discussed, and painstakingly implemented. Finally, the speakers will show how the great uniformity of the population of developers and managers (white men from the middle or upper classes) reinforces the gender bias of artificial intelligence, and how this situation calls for a better understanding of the stakes of data through better education.
Program
Bias – Sample Bias – Selection Bias – Social Bias
Stefan Sperlich, Professor of Statistics and Econometrics, Geneva School of Economics and Management.
In order to focus the discussion and understand each other, it is helpful to first clarify notation and objectives. Typically, the objective is to estimate or test a specific parameter or function for a specific population of interest. The meaning of “bias” is relatively clear in statistics, and for any discussion in data analytics it is recommended to rely on such a clear definition. This clarity seems to get blurred when turning from a statistic (descriptive or inferential) to the data set itself, e.g. from estimation bias to sample bias. While a distinction is often drawn between experimental and observational data, it might be more insightful to distinguish between a targeted survey (or census, series, etc.) and alternative data sources. Such a distinction helps us understand that, in the general linguistic sense, almost all data sets are biased, rendering the notion somewhat meaningless. This, however, is not a failure of statistics or data analytics; it just means that the analysis needs to be adapted to the objective. This requires understanding which population is actually represented by the data set, and/or being provided with external information about the population of interest (in order to link the data set to it), or having a clear idea about the data-generating process (see the notion of ‘social’). It is a misconception that this is intrinsic to, or specific to, data (analytics) in the social sciences; the study of this problem has a long-standing history in the sciences, especially biometrics, but also in official statistics. Consequently, there exists a rich toolbox of methods to handle what today is often called ‘social bias’.
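As an illustration of this point (not part of the talk itself), the following minimal Python sketch simulates the situation the abstract describes: an estimate computed naively on a selectively drawn sample is biased for the population parameter, while external information about the population - here, assumed known group shares - allows the data set to be linked back to the population of interest through reweighting. All numbers and the selection mechanism are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical population: two groups with different mean outcomes.
# Group 0 is 70% of the population, group 1 is 30% (assumed external knowledge).
N = 100_000
group = rng.choice([0, 1], size=N, p=[0.7, 0.3])
outcome = rng.normal(loc=np.where(group == 0, 50.0, 80.0), scale=10.0)

true_mean = outcome.mean()  # the population parameter of interest

# Selection bias: group 1 is five times more likely to end up in the sample
# (e.g. an online panel that over-represents one subpopulation).
p_select = np.where(group == 0, 0.01, 0.05)
sampled = rng.random(N) < p_select

naive_mean = outcome[sampled].mean()  # biased: ignores how the data arose

# Correction: reweight the sample using the known population shares
# (0.7 / 0.3), i.e. external information linking the data to the population.
sample_group = group[sampled]
sample_shares = np.array([(sample_group == g).mean() for g in (0, 1)])
weights = np.array([0.7, 0.3]) / sample_shares
w = weights[sample_group]
weighted_mean = np.average(outcome[sampled], weights=w)

print(f"true mean:     {true_mean:.2f}")
print(f"naive mean:    {naive_mean:.2f}  (selection-biased)")
print(f"weighted mean: {weighted_mean:.2f}  (post-stratified)")
```

Run as-is, the naive mean is pulled toward the over-sampled group (roughly 70 instead of the true value of about 59), while the reweighted estimate recovers the population mean. The correction is possible only because the population shares are known from outside the sample, which is precisely the role of the external information discussed in the abstract above.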
Our bias about bias
Tommaso Venturini, Associate Professor, Geneva School of Social Sciences.
When thinking about the bias that comes with data (any kind of data, really), it is important to keep in mind that, in research, the distinction between noise and information can never be taken for granted. Quite the contrary, such a distinction always depends on the questions and objectives of the researchers and on the situation in which they operate. The bias of one research project can become the very data of another. Keeping this in mind, it is crucial to avoid believing that the problem of data biases can be solved through technical or statistical fixes alone. Separating noise from information is always a political choice, which should be thoroughly thought through, explicitly discussed, and painstakingly implemented. This highlights, among other things, the importance of always being able to run a qualitative check on one’s quantitative corpus - computational analysis should never be trusted without the guarantee of a meticulous close-reading check.
Inequality by design? How to think together about gender bias in data
Isabelle Collet, Associate Professor, Faculty of Psychology and Educational Sciences.
Today, women represent less than 15% of computer science students. This near-total absence of women from the field has consequences not only for gender equality in employment, but also for the inclusiveness and performance of digital applications. The great uniformity of the population of developers and managers (white men from the middle or upper classes) tends to make the needs and characteristics of other populations, especially women, disappear. The purpose of this talk is to expose the gender biases of artificial intelligence, and then to consider the educational solutions to be put in place to allow society as a whole to understand the stakes of data.