Manipulating spectra
Frequency spectra are stored in dadi.Spectrum
objects. Computationally, these are a subclass of numpy.masked_array
, so most of the standard array manipulation techniques can be used. (In the examples here, I will typically be considering two-dimensional spectra, although all these features apply to higher-dimensional spectra as well.)
You can do arithmetic with Spectrum
objects:
fs3 = fs1 + fs2
fs2 = fs1 * 2
Note that most operations involving two Spectrum
objects only make sense if they correspond to data with the same sample sizes.
Standard indexing and slicing operations work as well. For example, to access the counts corresponding to 3 observations in population 1 and 5 observations in population 2, simpliy
counts = fs[3, 5]
More complicated slices are also possible. The slice notation :
indicates taking all corresponding entries. For example, to access the slice of the Spectrum
corresponding to entries with 2 derived allele observations in population 2, take
slice = fs[:, 2]
Summary statistics
The frequency spectrum encompasses many common summary statistics, and dadi provides methods to calculate them from Spectrum
objects.
Single-population statistics
Watterson's theta can be calculated as
thetaW = fs.Watterson_theta()
The expected heterozygosity π assuming random mating is
pi = fs.pi()
Tajima's D is
D = fs.Tajima_D()
Multi-population statistics
The number of segregating sites S is simply the sum of all entries in the FS (except for the absent-in-all and derived-in-all entries). This can be calculated as
S = fs.S()
Wright's FST can be calculated as
Fst = fs.Fst()
This estimate of Fst assumes random mating, because the FS does not store heterozygote. Calculation is by the method of Weir and Cockerham2. For a single SNP, the relevant formula is at the top of page 1363. To combine results between SNPs, we use the weighted average indicated by equation 10.
Folding
By default, dadi considers the data in the Spectrum
to be polarized, i.e. that the ancestral state of each variant is known. In some cases, however, this may not be possible, and the FS must be folded, indicating that only the minor allele frequency is known. To fold a Spectrum
object, simply
folded = fs.fold()
The Spectrum
object will record the fact that it has been folded, so that the likelihood and optimization machinery can automatically fold model spectra when the data are folded.
Masking
Finally, Spectrum
arrays are masked, i.e. certain entries can be set to be ignored. Most typically, the ignored entries are the two corners: [0, 0]
and [n1, n2]
, corresponding to variants observed in zero samples or in all samples. More sophisticated masking is possible, however. For example, if your calling algorithm is such that singletons in population 1 cannot be confidently called, you may want to ignore those entries of the FS in your analysis. To do so, simply
fs.mask[1, :] = True
Note that care must be taken when doing arithmetic with Spectrum
objects that are masked in different ways.
Marginalizing
If one has a multidimensional Spectrum
it may be useful to examine the marginalized Spectrum
corresponding to a subset of populations. To do so, use the marginalize
method. For example, consider a three-dimensional Spectrum
consisting of data from populations A, B, and C. To consider the marginal two dimensional spectrum for populations A and C, we need marginalize over population B.
fsAC = fsABC.marginalize([1])
And to consider the marginal one-dimensional FS for population B, we marginalize over populations A and C.
fsB = fsABC.marginalize([0, 2])
Note that the argument to marginalize
is a list of dimensions to marginalize over, indexed from \(\theta\).
Projection
One can also project an FS down from a larger sample size to a smaller sample size. Implicitly, this involves averaging over all possible re-samplings of the larger sample size data. This is very often done in the case of missing data: if some sites could not be called in all individuals, one can set a lower bound on the number of successfull calls necessary to include a SNP in the analysis; SNPs with more successful calls can then be projected down to that number of calls.
In dadi, this is implemented with the project
method. For example, to project a two-dimensional FS down to sample sizes of 14 and 26, use
proj = fs.project([14, 26])
Sampling
One can simulate Poisson sampling from an FS using the sample
method.
sample = fs.sample()
Each entry in the sample
output FS will have a Poisson number of counts, with mean given by the corresponding entry in fs
. If all sites are completely unlinked, this is a proper parametric bootstrap from your FS.
Scrambling
Occasionally, one may wish to ask whether the FS really represents samples from two populations or rather subsamples from a single population. A rough check of this is to consider what the FS would look like if the population identifiers were scrambled amongst the individuals for whom you have data. The scramble
method will do this.
scrambled = fs.scramble()
As an example, one could consider whether the FS for JPT and CHB shows evidence of differentiation between the two populations. Note that this is an informal test, and we have not developed the theory to assign statistical significance to the results. It is, nevertheless, a useful guide.