The community lacks theory-informed guidelines for building good data sets.
We analyse theoretical directions concerning which aspects of the data matter
and conclude that the intuitions derived from the existing literature are
incorrect and misleading. Using empirical counter-examples, we show that 1)
data dimensionality should not necessarily be minimised and 2) when
manipulating data, preserving the distribution is not essential. This calls
for a more data-aware theoretical understanding.
Although we do not explore it in this work, we propose the study of how data
modification affects learned representations as a promising research
direction.