Frequency spectra are stored in
dadi.Spectrum objects. Computationally, these are a subclass of
numpy.masked_array, so most of the standard array manipulation techniques can be used. (In the examples here, I will typically be considering two-dimensional spectra, although all these features apply to higher-dimensional spectra as well.)
You can do arithmetic with
fs3 = fs1 + fs2 fs2 = fs1 * 2
Note that most operations involving two
Spectrum objects only make sense if they correspond to data with the same sample sizes.
Standard indexing and slicing operations work as well. For example, to access the counts corresponding to 3 observations in population 1 and 5 observations in population 2, simpliy
counts = fs[3, 5]
More complicated slices are also possible. The slice notation
: indicates taking all corresponding entries. For example, to access the slice of the
Spectrum corresponding to entries with 2 derived allele observations in population 2, take
slice = fs[:, 2]
The frequency spectrum encompasses many common summary statistics, and dadi provides methods to calculate them from
Watterson's theta can be calculated as
thetaW = fs.Watterson_theta()
The expected heterozygosity π assuming random mating is
pi = fs.pi()
Tajima's D is
D = fs.Tajima_D()
The number of segregating sites S is simply the sum of all entries in the FS (except for the absent-in-all and derived-in-all entries). This can be calculated as
S = fs.S()
Wright's FST can be calculated as
Fst = fs.Fst()
This estimate of Fst assumes random mating, because the FS does not store heterozygote. Calculation is by the method of Weir and Cockerham2. For a single SNP, the relevant formula is at the top of page 1363. To combine results between SNPs, we use the weighted average indicated by equation 10.
By default, dadi considers the data in the
Spectrum to be polarized, i.e. that the ancestral state of each variant is known. In some cases, however, this may not be possible, and the FS must be folded, indicating that only the minor allele frequency is known. To fold a
Spectrum object, simply
folded = fs.fold()
Spectrum object will record the fact that it has been folded, so that the likelihood and optimization machinery can automatically fold model spectra when the data are folded.
Spectrum arrays are masked, i.e. certain entries can be set to be ignored. Most typically, the ignored entries are the two corners:
[0, 0] and
[n1, n2], corresponding to variants observed in zero samples or in all samples. More sophisticated masking is possible, however. For example, if your calling algorithm is such that singletons in population 1 cannot be confidently called, you may want to ignore those entries of the FS in your analysis. To do so, simply
fs.mask[1, :] = True
Note that care must be taken when doing arithmetic with
Spectrum objects that are masked in different ways.
If one has a multidimensional
Spectrum it may be useful to examine the marginalized
Spectrum corresponding to a subset of populations. To do so, use the
marginalize method. For example, consider a three-dimensional
Spectrum consisting of data from populations A, B, and C. To consider the marginal two dimensional spectrum for populations A and C, we need marginalize over population B.
fsAC = fsABC.marginalize()
And to consider the marginal one-dimensional FS for population B, we marginalize over populations A and C.
fsB = fsABC.marginalize([0, 2])
Note that the argument to
marginalize is a list of dimensions to marginalize over, indexed from \(\theta\).
One can also project an FS down from a larger sample size to a smaller sample size. Implicitly, this involves averaging over all possible re-samplings of the larger sample size data. This is very often done in the case of missing data: if some sites could not be called in all individuals, one can set a lower bound on the number of successfull calls necessary to include a SNP in the analysis; SNPs with more successful calls can then be projected down to that number of calls.
In dadi, this is implemented with the
project method. For example, to project a two-dimensional FS down to sample sizes of 14 and 26, use
proj = fs.project([14, 26])
One can simulate Poisson sampling from an FS using the
sample = fs.sample()
Each entry in the
sample output FS will have a Poisson number of counts, with mean given by the corresponding entry in
fs. If all sites are completely unlinked, this is a proper parametric bootstrap from your FS.
Occasionally, one may wish to ask whether the FS really represents samples from two populations or rather subsamples from a single population. A rough check of this is to consider what the FS would look like if the population identifiers were scrambled amongst the individuals for whom you have data. The
scramble method will do this.
scrambled = fs.scramble()
As an example, one could consider whether the FS for JPT and CHB shows evidence of differentiation between the two populations. Note that this is an informal test, and we have not developed the theory to assign statistical significance to the results. It is, nevertheless, a useful guide.