The Problem of Zombie Datasets: A Framework for Deprecating Datasets
What happens when a machine learning dataset is deprecated for legal,
ethical, or technical reasons, but continues to be widely used? In this paper,
we examine the public afterlives of several prominent deprecated or redacted
datasets, including ImageNet, 80 Million Tiny Images, MS-Celeb-1M, Duke MTMC,
Brainwash, and HRT Transgender, in order to inform a framework for more
consistent, ethical, and accountable dataset deprecation. Building on prior
research, we find a lack of consistency, transparency, and centralized
sourcing of information on dataset deprecation; as a result, these datasets
and their derivatives continue to be cited in papers and to circulate online.
These datasets that never die -- which we term "zombie
datasets" -- continue to inform the design of production-level systems, causing
technical, legal, and ethical challenges; in so doing, they risk perpetuating
the harms that prompted their supposed withdrawal, including concerns around
bias, discrimination, and privacy. Based on this analysis, we propose a Dataset
Deprecation Framework that includes considerations of risk, mitigation of
impact, appeal mechanisms, timeline, post-deprecation protocol, and publication
checks, and that can be adapted and implemented by the machine learning community.
Drawing on work on datasheets and checklists, we further offer two sample
dataset deprecation sheets and propose a centralized repository that tracks
which datasets have been deprecated and that could be incorporated into the
publication protocols of venues such as NeurIPS.
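
To make the proposal concrete, the sketch below illustrates what a machine-readable deprecation sheet entry and a publication-time check against a centralized deprecation registry might look like. This is a hypothetical illustration, not part of the framework as published: the field names, the DEPRECATION_REGISTRY contents, and the check_dataset helper are all assumptions introduced here for clarity.

```python
# Hypothetical sketch: a machine-readable dataset deprecation record and a
# publication-time lookup against a centralized registry. Field names and the
# example entry are illustrative assumptions, not taken from the paper.
from dataclasses import dataclass, field


@dataclass
class DeprecationRecord:
    dataset: str                      # canonical dataset name
    reason: str                       # e.g., "privacy", "bias", "licensing"
    date: str                         # ISO-8601 deprecation date
    mitigation: str                   # guidance for existing users
    successor: str | None = None      # recommended replacement, if any
    derivatives: list[str] = field(default_factory=list)  # known derived datasets


# A toy in-memory registry; in practice this would be a centrally maintained,
# versioned resource that publication venues could query automatically.
DEPRECATION_REGISTRY = {
    "ms-celeb-1m": DeprecationRecord(
        dataset="MS-Celeb-1M",
        reason="privacy",
        date="2019-06-01",
        mitigation="Delete local copies; do not train or evaluate on this data.",
        derivatives=["ms1m-arcface"],
    ),
}


def check_dataset(name: str) -> DeprecationRecord | None:
    """Return the deprecation record for a dataset, or None if it is not deprecated."""
    return DEPRECATION_REGISTRY.get(name.strip().lower())


if __name__ == "__main__":
    record = check_dataset("MS-Celeb-1M")
    if record is not None:
        print(f"WARNING: {record.dataset} was deprecated ({record.reason}, {record.date}).")
        print(f"Mitigation: {record.mitigation}")
```

A venue could run such a check over the datasets declared in a submission and flag any match for reviewer attention, which is one way the proposed publication checks and centralized repository could work together.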