A Bayesian Nonparametric Approach to Species Sampling Problems with Ordering
Cecilia Balocchi, Federico Camerlenghi, Stefano Favaro
Species-sampling problems (SSPs) refer to a vast class of statistical
problems that, given an observable sample from an unknown population of
individuals belonging to some species, call for estimating (discrete)
functionals of the unknown species composition of additional unobservable
samples. A common feature of SSPs is invariance with respect to species
labelling: species' labels are immaterial in defining the functional of
interest. This invariance is at the core of the development of the Bayesian
nonparametric (BNP) approach to SSPs under the popular Pitman-Yor process (PYP) prior. In
this paper, we consider SSPs that are not invariant to species labelling, in
the sense that an ordering or ranking is assigned to species' labels, and we
develop a BNP approach to such problems. In particular, inspired by the
population genetics literature on age-ordered alleles' compositions, where the
frequency of the oldest allele is of well-known interest, we study the following
SSP with ordering: given an observable sample from an unknown population of
individuals belonging to some species (alleles), with species' labels being
ordered according to weights (ages), estimate the frequencies of the first r
order species' labels in an enlarged sample obtained by including additional
unobservable samples. Our BNP approach relies on an ordered version of the PYP
prior, which leads to an explicit posterior distribution of the first r order
frequencies, with corresponding estimates being simple and computationally
efficient. We apply our approach to the analysis of genetic variation, showing
its effectiveness in the estimation of the frequency of the oldest allele, and
then discuss other applications in the contexts of citations to academic
articles and online purchases of items.
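
To make the setting concrete, the following is a minimal sketch of the standard (unordered) PYP predictive rule, not of the ordered PYP prior developed in the paper. It assumes a discount parameter alpha in [0, 1) and a strength parameter theta > 0, and, purely for illustration, it uses the order of first appearance in the sample as a stand-in for the ordering of species' labels. The function names `pyp_extend` and `first_r_frequencies` are hypothetical.

```python
import random


def pyp_extend(counts, m, alpha, theta, rng):
    """Extend species counts by m draws from the PYP predictive rule.

    counts lists the species sizes in order of first appearance
    (an illustrative ordering, assumed here for simplicity).
    """
    counts = list(counts)
    n = sum(counts)
    for _ in range(m):
        k = len(counts)
        # New species with probability (theta + k * alpha) / (theta + n).
        u = rng.random() * (theta + n)
        if u < theta + k * alpha:
            counts.append(1)
        else:
            # Existing species j with probability (n_j - alpha) / (theta + n).
            u -= theta + k * alpha
            for j, c in enumerate(counts):
                u -= c - alpha
                if u < 0:
                    counts[j] += 1
                    break
            else:
                counts[-1] += 1  # guard against floating-point round-off
        n += 1
    return counts


def first_r_frequencies(counts, r):
    """Relative frequencies of the first r species in the given ordering."""
    total = sum(counts)
    return [c / total for c in counts[:r]]


rng = random.Random(0)
observed = pyp_extend([], 50, alpha=0.5, theta=1.0, rng=rng)         # observable sample
enlarged = pyp_extend(observed, 100, alpha=0.5, theta=1.0, rng=rng)  # plus unobservable samples
print(first_r_frequencies(enlarged, r=3))
```

Repeating the second call many times from the same observed counts gives a Monte Carlo approximation, under these assumptions, of the distribution of the first r frequencies in the enlarged sample.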