Importing data
dadi represents frequency spectra using dadi.Spectrum
objects. As desccribed in the Manipulating spectra section, Spectrum
objects are subclassed from numpy.masked_array
and thus can be constructed similarly. The most basic way to create a Spectrum
is manually:
fs = dadi.Spectrum([0, 100, 20, 10, 1, 0])
This creates a Spectrum
object representing the FS from a single population, from which we have 5 samples (The first and last entries in fs
correspond to mutations observed in zero or all samples. These are thus not polymorphisms, and by default dadi masks out those entries so they are ignored.)
For nontrivial data sets, entering the FS manually is infeasible, so we will focus on automatic methods of generating a Spectrum
object. The most direct way is to load a pre-generated FS from a file, using
fs = dadi.Spectrum.from_file(filename)
The appropriate file format is detailed in the next section. We have also added fuction to generate the FS from a VCF file.
Frequency spectrum file format
dadi uses a simple fie format for storing the FS. Each file begins with any number of comment lines beginning with #
. the first non-comment line contain P integers giving the dimensions of the FS array, where P is the number of populations represented. For a FS representing data from 4×4×2 samples, this would be 5 5 3
. (Each dimension is one larger than the number of samples, because the number of observations can range, for example, from 0 to 4 if there are 4 samples, for a total of 5 possibilities.) On the same line, the string folded
or unfolded
denoting whether or not the stored FS is folded.
The actual data is stored in a single line listing all the FS elements separated by spaces, in the order fs[0,0,0] fs[0,0,1] fs[0,0,2] ... fs[0,1,0] fs[0,1,1] ...
This is followed by a single line giving the elements of the mask in the same order as the data, with 1
indicating masked and 0
indicating unmasked.
The file corresponding to the Spectrum fs
can be written using the ccommand fs.to_file(filename)
Frequency spectra from VCF files
Data can be loaded from VCF files. The main function for accomplishing this is make_data_dict_vcf
in the dadi.Misc
submodule. The function has two required arguments:
- The name of the VCF (can be gzipped [*.vcf.ga])
- The name of a file describing how individuals map to populations.
This second file is a plain-text, two-column file containing the individual names in column one and their respective populations in column two:
i0 pop0
i1 pop0
i2 pop0
...
iN pop2
Examples of these files can be found in the examples/fs_from_data/
folder in the dadi source distribution. Generating a frequency spectrum with these files can then be achieved through the creation of a data dictionary with the following two lines of code:
dd = dadi.Misc.make_data_dict_vcf("example.vcf.gz", "popfile.txt")
fs = dadi.Spectrum.from_data_dict(dd, ['pop0', 'pop1'], projections = [20, 30], polarized = False)
Projection is often used to deal with missing data. But projection cannot be used when modeling inbreeding. It also turns the dadi's likelihood function into a composite likelihood, even for unlinked data. We have thus included an option to take a smaller subsample of individuals from a population at each site, so that variants with less missing data than the specified subsampling size are not completely ignored. To specify how many individuals should be subsampled, the function takes an additional dictionary as an argument, where the dictionary simply maps the population names to the desired number of individuals to subsample.
# create the subsample dictionary
ss = {'pop0':5, 'pop1':10}
# pass it as an additional argument
dd = dadi.Misc.make_data_dict_vcf("example.vcf.gz", "popfile.txt", subsample = ss)
fs = dadi.Spectrum.from_data_dict(dd, ['pop0', 'pop1'], projections = [10, 20], polarized = False)
Subsampling offers an alternative to down projecting your data that preserves individual genotypes. Projecting will consider sampled chromosomes/alleles as exchangeable across all individuals, which is usually OK for a randomly mating population. However, if you want to include inbreeding in your model (see the Inbreeding section), then projecting will erase the signal of excess homozygosity that inbreeding creates by sampling chromosomes instead of individuals. When generating a frequency spectrum from a subsampled data dictionary, be sure to set the projections argument to 2 times the subsample size you specified for each population so that no down projecting is done.
SNP data methods
From a data dictionary, the method Spectrum.from_data_dict
can be used to create a Spectrum
.
fs = Spectrum.from_data_dict(dd, pop_id = ['YRI', 'CEU'], projections = [10, 12], polarized = True)
The pop_ids
argument specifies which populations to use to create the FS, and their order. projections
denotes the population sample sizes for resulting FS. (Recall that for a diploid organism, assuming random mating, we get two samples from each individual.) Note that the total number of calls to Allele1
and Allele2
in a given population need not be the same for each SNP. When constructing the Spectrum, each SNP will be projected down to the requested number of samples in each population. (Note that SNPs cannot be projected up, so SNPs without enough calls in any population will be ignored.) polarized
specifies whether dadi should use outgroup information to polarize the SNPs. If polarized = True
, SNPs withouth outgroup information, or with that information —— will be ingored. If polarized = False
, outgroup information will be ignored and the resulting Spectrum
will be folded.
If your data have missing calls for some individuals, projecting down to a smaller sample size will increase the number of SNPs you can use for analysis. On the other hand, some fraction of the SNPs will now project down to frequency 0, and thus be uninformative. As a rule of thumb, we often choose our projection to maximize the number of segregating sites in our final FS (assessed via fs.S()
), although we have not formally tested whether this maximizes statistical power.
SNP data format
As a convenience, dadi includes several methods for generating frequency spectra directly from SNP data. That relevant SNP file format is described here. An large example can be found in the examples/fs_from_data/data.txt
file included with the dadi source distribution, and a small example is shown in Listing 1.
Human Chimp Allele1 YRI CEU Allele2 YRI CEU Gene Position
ACG ATG C 29 24 T 1 0 abcb1 289
CCT CCT C 29 23 G 3 2 abcb1 345
Listing 1: Example of SNP file format
The data file begins with any number of comment lines that being with #
. The first parsed line is a column header line. Whitespace is used to separate entries within the table, so no spaces are allowed within any entry. Individual rows make be commented out using #
.
The first column contains the in-group reference sequence at that SNP, including the flanking bases. If the flanking bases are unknown, they can be denoted by -
. The header label is arbitrary.
The second column contains the aligned outgroup reference sequency at that SNP, including the flanking bases. Unknown entries can be denoted by `-. The header label is arbitrary.
The third column gives the first segregating allele. The column header must be exactly Allele1
.
Then follows an arbitrary number of columns, one for each population, each giving the number of times Allele1
was observed in that population. The header for each column should be the population identifier.
The next column gives the second segregating allele. The column header must be exactly Allele2
.
Then follows one column for each population, each giving the number of times Allele2
was observed in that population. The header form each column should be the population identifier, and the columns should be in the same order as for the Allele1
entries.
Then follows an arbitrary number of columns which will be concatenated with _
to assign a label for each SNP.
The Allele1
and Allele2
headers must be exactly those values because the number of columns between those two is used to infer the number of population in the file.
The method Misc.make_data_dict
reads the above SNP file format to generate a Python data dictionary describing the data:
dd = Misc.make_data_dict(filename)