A causal view on compositional data
Elisabeth Ailer, Christian L. Müller, Niki Kilbertus
Many scientific datasets are compositional in nature. Important examples
include species abundances in ecology, rock compositions in geology, topic
compositions in large-scale text corpora, and sequencing count data in
molecular biology. Here, we provide a causal view on compositional data in an
instrumental variable setting where the composition acts as the cause.
Throughout, we pay particular attention to the interpretation of compositional
causes from the viewpoint of interventions and crisply articulate potential
pitfalls for practitioners. Focusing on modern high-dimensional microbiome
sequencing data as a timely illustrative use case, our analysis first reveals
that popular one-dimensional information-theoretic summary statistics, such as
diversity and richness, may be insufficient for drawing causal conclusions from
ecological data. Instead, we advocate for multivariate alternatives using
statistical data transformations and regression techniques that take the
special structure of the compositional sample space into account. In a
comparative analysis on synthetic and semi-synthetic data, we show the
advantages and limitations of our proposal. We posit that our framework may
provide a useful starting point for cause-effect estimation in the context of
compositional data.
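
As a minimal illustration of why one-dimensional summaries can discard causally relevant information, the following sketch compares two toy compositions that share the same Shannon diversity and richness yet differ as multivariate objects. It is not the estimation procedure of the paper. The helper names (`closure`, `shannon_diversity`, `richness`, `clr`), the toy count vectors, and the choice of the centered log-ratio (clr) transform as a representative "statistical data transformation" are all assumptions made for illustration.

```python
import numpy as np


def closure(x):
    """Normalize non-negative counts so that each composition sums to one."""
    x = np.asarray(x, dtype=float)
    return x / x.sum(axis=-1, keepdims=True)


def shannon_diversity(p):
    """Shannon entropy (natural log) of one composition; zero parts are skipped."""
    p = np.asarray(p, dtype=float)
    nonzero = p[p > 0]
    return float(-(nonzero * np.log(nonzero)).sum())


def richness(p):
    """Number of parts with non-zero abundance."""
    return int(np.count_nonzero(p))


def clr(p):
    """Centered log-ratio transform; assumes strictly positive parts."""
    log_p = np.log(closure(p))
    return log_p - log_p.mean(axis=-1, keepdims=True)


# Two hypothetical samples over the same three taxa: the abundances of
# the first and third taxon are swapped.
a = closure([60, 30, 10])
b = closure([10, 30, 60])

same_diversity = np.isclose(shannon_diversity(a), shannon_diversity(b))
same_richness = richness(a) == richness(b)
same_clr = np.allclose(clr(a), clr(b))

print("same diversity:", same_diversity)
print("same richness:", same_richness)
print("same clr coordinates:", same_clr)
```

Because diversity and richness are invariant to relabeling the taxa, any effect that depends on which taxon is abundant is invisible to them, whereas the clr coordinates keep that information. Real sequencing data usually contain zeros, which the clr transform as written here does not handle; a pseudocount or another zero-replacement strategy would be needed.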