Making it Real – Developing Socially, Politically, and Ethically Aware Data Scientists

by Nitin Kohli, School of Information

Teaching Effectiveness Award Essay, 2019

Behind the Data: Humans and Values (Info 188) deals with the social, political, and ethical considerations of data science. Data science solutions are actively being deployed in diverse settings, implicating values such as privacy, fairness, and freedom of expression. Nevertheless, concerns related to values and technology are not entirely novel. Scholars in the social sciences, the humanities, and the law have written extensively on the relationship between technology and society for centuries. However, students with STEM backgrounds tend to find these works inaccessible, especially early on in their undergraduate years. As such, there is a gap between the takeaways of these readings and the functional toolkit that many data science classes offer.

To address this disconnect, I decided to teach the values implications of data science through the data science pipeline itself. That is, in order to connect social and ethical readings to the practice of data science, I created activities that would simulate the work data scientists are expected to do. The hope was that by creating seemingly apolitical technical tasks, I could organically surface the issues raised in readings, forcing students to recognize and grapple with the realities of data science in practice, while ingraining the lessons of these readings into their toolkit.

One of the activities I created involved addressing privacy risks in “anonymized” datasets. A common misconception is that removing personal attributes (such as an individual’s name or date of birth) is sufficient to prevent anyone from learning who is in a dataset. This is not true, yet certain aspects of law unfortunately treat such removal as acceptable. While academic case studies exist on this topic, I didn’t want my students to passively read about failure cases of anonymization; I wanted them to actively internalize this concept and understand the harms that follow. To this end, I created an anonymized dataset of taxi rides in a fictional city (including time of day, tip percentage, and location info) and gave them an external dataset of celebrity spottings. The goal was to reidentify as many celebrities as possible in the anonymized dataset and to learn which were poor tippers.
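
To make the mechanics of such an exercise concrete, here is a minimal sketch of a linkage attack of the kind described above. It is an illustration only, not the actual course material: the records, the field names (`pickup_zone`, `tip_pct`, `seen_at`), the 10-minute matching window, and the 10% "poor tipper" threshold are all assumptions invented for this example.

```python
from datetime import datetime, timedelta

# Hypothetical toy data: every name, field, and value below is invented.
# "Anonymized" rides carry no names, but time, place, and tip remain.
rides = [
    {"ride_id": 1, "pickup_time": datetime(2019, 3, 1, 21, 5),
     "pickup_zone": "Harbor", "tip_pct": 5.0},
    {"ride_id": 2, "pickup_time": datetime(2019, 3, 1, 21, 40),
     "pickup_zone": "Midtown", "tip_pct": 22.0},
    {"ride_id": 3, "pickup_time": datetime(2019, 3, 2, 9, 15),
     "pickup_zone": "Harbor", "tip_pct": 18.0},
]

# External, public dataset of celebrity spottings (also invented).
sightings = [
    {"name": "Celebrity A", "seen_at": datetime(2019, 3, 1, 21, 0), "zone": "Harbor"},
    {"name": "Celebrity B", "seen_at": datetime(2019, 3, 1, 21, 45), "zone": "Midtown"},
    {"name": "Celebrity C", "seen_at": datetime(2019, 3, 3, 12, 0), "zone": "Uptown"},
]


def link_records(rides, sightings, window=timedelta(minutes=10)):
    """Match each sighting to a ride in the same zone within the time window.

    Only unambiguous matches (exactly one candidate ride) are kept.
    """
    matches = {}
    for sighting in sightings:
        candidates = [
            ride for ride in rides
            if ride["pickup_zone"] == sighting["zone"]
            and abs(ride["pickup_time"] - sighting["seen_at"]) <= window
        ]
        if len(candidates) == 1:
            matches[sighting["name"]] = candidates[0]
    return matches


def poor_tippers(matches, threshold_pct=10.0):
    """Return the reidentified names whose tip falls below the threshold."""
    return sorted(
        name for name, ride in matches.items()
        if ride["tip_pct"] < threshold_pct
    )


matches = link_records(rides, sightings)
print(sorted(matches))        # names the adversary managed to reidentify
print(poor_tippers(matches))  # sensitive attribute learned about them
```

No name ever appears in the ride data, yet joining on time and location is enough to attach a name, and with it a tip percentage, to an individual ride. Keeping only unambiguous matches is one possible design choice; a real adversary might also rank multiple candidates by likelihood.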

This exercise served multiple purposes. First, it allowed students to gain insight into privacy and security by thinking from the perspective of an adversary. This change in persona showed students that their work doesn’t exist in a vacuum and that, if they are not careful, sensitive information can be leaked. Additionally, students gained hands-on experience working with threat models and specific data attacks. These served as fundamental building blocks, which we further scaffolded into an exploration of mathematically robust privacy techniques. Furthermore, this gave us a segue into modern privacy breaches, which fostered discussions of the limitations of redaction-based methods that are often codified in laws.

It is difficult, if not impossible, to measure how much more socially, politically, or ethically aware a student has become. But what I can say is that this strategy of teaching values implications through work practice fostered engagement within the class. Material that had once been inaccessible started to become real. Our classroom interactions were livelier, as my students developed a newfound level of confidence when reading and discussing work outside their domain. This spilled over to their written assignments as well. Their analyses of values implications grew more rigorous and thoughtful over the semester, as they continually situated their analysis in work practice. By teaching values implications through work practice, students were able to see first-hand the social, political, and ethical implications of their work.