Skip to content

Spectrum_mod

Contains Spectrum object, which represents frequency spectra.

Spectrum

Bases: masked_array

Represents a frequency spectrum.

Spectra are represented by masked arrays. The masking allows us to ignore specific entries in the spectrum. Most often, these are the absent and fixed categories.

The constructor has the format

fs = dadi.Spectrum(data, mask, mask_corners, data_folded, check_folding, pop_ids, extrap_x) data: The frequency spectrum data mask: An optional array of the same size as data. 'True' entires in this array are masked in the Spectrum. These represent missing data categories. (For example, you may not trust your singleton SNP calling.) mask_corners: If True (default), the 'observed in none' and 'observed in all' entries of the FS will be masked. Typically these entries are unobservable, and dadi cannot reliably calculate them, so you will almost always want mask_corners=True. data_folded: If True, it is assumed that the input data is folded. An error will be raised if the input data and mask are not consistent with a folded Spectrum. check_folding: If True and data_folded=True, the data and mask will be checked to ensure they are consistent with a folded Spectrum. If they are not, a warning will be printed. pop_ids: Optional list of strings containing the population labels. extrap_x: Optional floating point value specifying x value to use for extrapolation.

Fst()

Wright's Fst between the populations represented in the fs.

This estimate of Fst assumes random mating, because we don't have heterozygote frequencies in the fs.

Calculation is by the method of Weir and Cockerham Evolution 38:1358 (1984). For a single SNP, the relevant formula is at the top of page 1363. To combine results between SNPs, we use the weighted average indicated by equation 10.

Source code in dadi/Spectrum_mod.py
1099
1100
1101
1102
1103
1104
1105
1106
1107
1108
1109
1110
1111
1112
1113
1114
1115
1116
1117
1118
1119
1120
1121
1122
1123
1124
1125
1126
1127
1128
1129
1130
1131
1132
1133
1134
1135
1136
1137
1138
1139
1140
1141
1142
1143
1144
1145
1146
1147
1148
def Fst(self):
    """
    Wright's Fst between the populations represented in the fs.

    This estimate of Fst assumes random mating, because we don't have
    heterozygote frequencies in the fs.

    Calculation is by the method of Weir and Cockerham _Evolution_ 38:1358
    (1984).  For a single SNP, the relevant formula is at the top of page
    1363. To combine results between SNPs, we use the weighted average
    indicated by equation 10.
    """
    # This gets a little obscure because we want to be able to work with
    # spectra of arbitrary dimension.

    # First quantities from page 1360
    r = self.Npop
    ns = self.sample_sizes
    nbar = numpy.mean(ns)
    nsum = numpy.sum(ns)
    nc = (nsum - numpy.sum(ns**2)/nsum)/(r-1)

    # counts_per_pop is an r+1 dimensional array, where the last axis simply
    # records the indices of the entry. 
    # For example, counts_per_pop[4,19,8] = [4,19,8]
    counts_per_pop = numpy.indices(self.shape)
    counts_per_pop = numpy.transpose(counts_per_pop, axes=list(range(1,r+1))+[0])

    # The last axis of ptwiddle is now the relative frequency of SNPs in
    # that bin in each of the populations.
    ptwiddle = 1.*counts_per_pop/ns

    # Note that pbar is of the same shape as fs...
    pbar = numpy.sum(ns*ptwiddle, axis=-1)/nsum

    # We need to use 'this_slice' to get the proper aligment between
    # ptwiddle and pbar.
    this_slice = [slice(None)]*r + [numpy.newaxis]
    s2 = numpy.sum(ns * (ptwiddle - pbar[tuple(this_slice)])**2, axis=-1)/((r-1)*nbar)

    # Note that this 'a' differs from equation 2, because we've used
    # equation 3 and b = 0 to solve for hbar.
    a = nbar/nc * (s2 - 1/(2*nbar-1) * (pbar*(1-pbar) - (r-1)/r*s2))
    d = 2*nbar/(2*nbar-1) * (pbar*(1-pbar) - (r-1)/r*s2)

    # The weighted sum over loci.
    asum = (self * a).sum()
    dsum = (self * d).sum()

    return asum/(asum+dsum)

S()

Segregating sites.

Source code in dadi/Spectrum_mod.py
1150
1151
1152
1153
1154
1155
1156
1157
1158
def S(self):
    """
    Segregating sites.
    """
    oldmask = self.mask.copy()
    self.mask_corners()
    S = self.sum()
    self.mask = oldmask
    return S

Tajima_D()

Tajima's D.

Following Gillespie "Population Genetics: A Concise Guide" pg. 45

Source code in dadi/Spectrum_mod.py
1236
1237
1238
1239
1240
1241
1242
1243
1244
1245
1246
1247
1248
1249
1250
1251
1252
1253
1254
1255
1256
1257
1258
1259
1260
def Tajima_D(self):
    """
    Tajima's D.

    Following Gillespie "Population Genetics: A Concise Guide" pg. 45
    """
    if not self.Npop == 1:
        raise ValueError("Only defined on a one-dimensional SFS.")

    S = self.S()

    n = 1.*self.sample_sizes[0]
    pihat = self.pi()
    theta = self.Watterson_theta()

    a1 = numpy.sum(1./numpy.arange(1,n))
    a2 = numpy.sum(1./numpy.arange(1,n)**2)
    b1 = (n+1)/(3*(n-1))
    b2 = 2*(n**2 + n + 3)/(9*n * (n-1))
    c1 = b1 - 1./a1
    c2 = b2 - (n+2)/(a1*n) + a2/a1**2

    C = numpy.sqrt((c1/a1)*S + c2/(a1**2 + a2) * S*(S-1))

    return (pihat - theta)/C

Watterson_theta()

Watterson's estimator of theta.

Note that is only sensible for 1-dimensional spectra.

Source code in dadi/Spectrum_mod.py
1160
1161
1162
1163
1164
1165
1166
1167
1168
1169
1170
1171
1172
1173
def Watterson_theta(self):
    """
    Watterson's estimator of theta.

    Note that is only sensible for 1-dimensional spectra.
    """
    if self.Npop != 1:
        raise ValueError("Only defined on a one-dimensional fs.")

    n = self.sample_sizes[0]
    S = self.S()
    an = numpy.sum(1./numpy.arange(1,n))

    return S/an

Zengs_E()

Zeng et al.'s E statistic.

Citation

Zeng et al. "Statistical Tests for Detecting Positive Selection by Utilizing High-Frequency Variants" (2006) Genetics

Source code in dadi/Spectrum_mod.py
1189
1190
1191
1192
1193
1194
1195
1196
1197
1198
1199
1200
1201
1202
1203
1204
1205
1206
1207
1208
1209
1210
1211
1212
1213
1214
1215
1216
def Zengs_E(self):
    """
    Zeng et al.'s E statistic.

    Citation:
        Zeng et al. "Statistical Tests for Detecting Positive Selection by
        Utilizing High-Frequency Variants" (2006) Genetics
    """
    num = self.theta_L() - self.Watterson_theta()

    n = self.sample_sizes[0]

    # See after Eq. 3
    an = numpy.sum(1./numpy.arange(1,n))
    # See after Eq. 9
    bn = numpy.sum(1./numpy.arange(1,n)**2)
    s = self.S()

    # See immediately after Eq. 12
    theta = self.Watterson_theta()
    theta_sq = s*(s-1.)/(an**2 + bn)

    # Eq. 14
    var = (n/(2.*(n-1.)) - 1./an) * theta\
            + (bn/an**2 + 2.*(n/(n-1.))**2 * bn - 2*(n*bn-n+1.)/((n-1.)*an)
               - (3.*n+1.)/(n-1.)) * theta_sq

    return num/numpy.sqrt(var)

combine_pops(tocombine)

Combine two or more populations in the fs, treating them as a single pop

Parameters:

Name Type Description Default
tocombine list

Unordered set of population numbers to combine, numbering from 1.

required
Note
  • The populations will alwasy be combined into the slot of the population with the smallest index. For example, if the sample sizes of the spectrum are (1,2,3,4,5) and tocombine=[4,2,1], then the output spectrum will have sample_sizes (7,3,5) when populations 1, 2, and 4 are combined.

  • The pop_ids of the new population will be the pop_ids of the combined populations with a '+' in between them.

Source code in dadi/Spectrum_mod.py
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
def combine_pops(self, tocombine):
    """
    Combine two or more populations in the fs, treating them as a single pop

    Args:
        tocombine (list): Unordered set of population numbers to combine,
            numbering from 1.

    Note:
        - The populations will alwasy be combined into the slot of the 
        population with the smallest index. For example, if the sample sizes of 
        the spectrum are (1,2,3,4,5) and tocombine=[4,2,1], then the output spectrum 
        will have sample_sizes (7,3,5) when populations 1, 2, and 4 are combined.

        - The pop_ids of the new population will be the pop_ids of the combined 
        populations with a '+' in between them.
    """
    tocombine = sorted(tocombine)
    result = self
    # Need to combine from highest to lowest index
    for right_pop in tocombine[1:][::-1]: 
        result = result.combine_two_pops([tocombine[0], right_pop])
    # Need to fix pop_id of combined pop
    if self.pop_ids:
        result.pop_ids[tocombine[0]-1] = '+'.join(self.pop_ids[_-1] for _ in tocombine)
    return result

combine_two_pops(tocombine)

Combine two populations in the fs, treating them as a single pop

Parameters:

Name Type Description Default
tocombine list

Indices for populations being combined (starting from 1)

required
Notes
  • The two populations will alwasy be combined into the slot of the population with the smallest index. For example, if the sample sizes of the spectrum are (2,3,4,5) and tocombine=[4,2], then the output spectrum will have sample_sizes (2,8,4) when populations 2 and 4 are combined.

  • The pop_ids of the new population will be the pop_ids of the two combined populations with a '+' in between them.

Source code in dadi/Spectrum_mod.py
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
def combine_two_pops(self, tocombine):
    """
    Combine two populations in the fs, treating them as a single pop

    Args:
        tocombine (list): Indices for populations being combined (starting from 1)

    Notes:
        - The two populations will alwasy be combined into the slot of the 
        population with the smallest index. For example, if the sample sizes of
        the spectrum are (2,3,4,5) and tocombine=[4,2], then the output spectrum
        will have sample_sizes (2,8,4) when populations 2 and 4 are combined.

        - The pop_ids of the new population will be the pop_ids of the two combined
        populations with a '+' in between them.
    """
    # Calculate new sample sizes
    tocombine = sorted([_-1 for _ in tocombine]) # Account for indexing from 1
    new_ns = list(self.sample_sizes)
    new_ns[tocombine[0]] = self.sample_sizes[tocombine[0]] + self.sample_sizes[tocombine[1]]
    del new_ns[tocombine[1]] # Remove pop being combined away

    # Create new pop ids
    new_pop_ids = None
    if self.pop_ids:
        new_pop_ids = list(self.pop_ids)
        new_pop_ids[tocombine[0]] = '{0}+{1}'.format(self.pop_ids[tocombine[0]], self.pop_ids[tocombine[1]])
        del new_pop_ids[tocombine[1]]

    # Create new spectrum
    new_data = np.zeros(shape=[n+1 for n in new_ns])
    new_fs = Spectrum(new_data, pop_ids=new_pop_ids)
    # Copy over extrapolation info
    new_fs.extrap_x = self.extrap_x

    # Fill new spectrum
    for index in np.ndindex(self.shape):
        new_index = list(index)
        new_index[tocombine[0]] = index[tocombine[0]] + index[tocombine[1]]
        del new_index[tocombine[1]]
        new_index = tuple(new_index)
        new_fs[new_index] += self[index]
        # Mask entry if any of the contributing entries are masked
        new_fs.mask[new_index] = (new_fs.mask[new_index] or self.mask[index])

    return new_fs

filter_pops(tokeep, mask_corners=True)

Filter Spectrum to keep only certain populations.

Returns new Spectrum with len(tokeep) populations.

Parameters:

Name Type Description Default
tokeep int

Unordered set of population numbers to keep, numbering from 1.

required
mask_corners bool

If True, the typical corners of the resulting fs will be masked

True
Note

This is similar in practice to the marginalize operation. But here populations are numbered from 1, as in the majority of dadi.

Source code in dadi/Spectrum_mod.py
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
def filter_pops(self, tokeep, mask_corners=True):
    """
    Filter Spectrum to keep only certain populations.

    Returns new Spectrum with len(tokeep) populations.

    Args:
        tokeep (int): Unordered set of population numbers to keep, numbering from 1.
        mask_corners (bool, optional): If True, the typical corners of the resulting fs will be masked

    Note: 
        This is similar in practice to the marginalize operation. But here
        populations are numbered from 1, as in the majority of dadi.
    """
    toremove = list(range(0, self.ndim))
    for pop_ii in tokeep:
        # Apply -1 factor to account for indexing in marginalize
        toremove.remove(pop_ii-1)
    return self.marginalize(toremove)

fixed_size_sample(nsamples, only_nonmasked=False)

Generate a resampled fs from the current one.

Parameters:

Name Type Description Default
nsamples int

Number of samples to include in the new FS.

required
only_nonmasked bool

If True, only SNPs from non-masked will be resampled. Otherwise, all SNPs will be used.

False
Source code in dadi/Spectrum_mod.py
717
718
719
720
721
722
723
724
725
726
727
728
729
730
731
732
733
734
735
736
def fixed_size_sample(self, nsamples, only_nonmasked=False):
    """
    Generate a resampled fs from the current one.

    Args:
        nsamples (int): Number of samples to include in the new FS.
        only_nonmasked (bool, optional): If True, only SNPs from non-masked will be resampled. 
                        Otherwise, all SNPs will be used.
    """
    flat = self.flatten()
    if only_nonmasked:
        pvals = flat.data/flat.sum()
        pvals[flat.mask] = 0
    else:
        pvals = flat.data/flat.data.sum()

    sample = numpy.random.multinomial(int(nsamples), pvals)
    sample = sample.reshape(self.shape)

    return dadi.Spectrum(sample, mask=self.mask, pop_ids=self.pop_ids)

fold()

Folded frequency spectrum

The folded fs assumes that information on which allele is ancestral or derived is unavailable. Thus the fs is in terms of minor allele frequency. Note that this makes the fs into a "triangular" array.

Note that if a masked cell is folded into non-masked cell, the destination cell is masked as well.

Note also that folding is not done in-place. The return value is a new Spectrum object.

Source code in dadi/Spectrum_mod.py
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
def fold(self):
    """
    Folded frequency spectrum

    The folded fs assumes that information on which allele is ancestral or
    derived is unavailable. Thus the fs is in terms of minor allele 
    frequency.  Note that this makes the fs into a "triangular" array.

    Note that if a masked cell is folded into non-masked cell, the
    destination cell is masked as well.

    Note also that folding is not done in-place. The return value is a new
    Spectrum object.
    """
    if self.folded:
        raise ValueError('Input Spectrum is already folded.')

    # How many samples total do we have? The folded fs can only contain
    # entries up to total_samples/2 (rounded down).
    total_samples = numpy.sum(self.sample_sizes)

    total_per_entry = self._total_per_entry()

    # Here's where we calculate which entries are nonsense in the folded fs.
    where_folded_out = total_per_entry > int(total_samples/2)

    original_mask = self.mask
    # Here we create a mask that masks any values that were masked in
    # the original fs (or folded onto by a masked value).
    final_mask = numpy.logical_or(original_mask, 
                                  reverse_array(original_mask))

    # To do the actual folding, we take those entries that would be folded
    # out, reverse the array along all axes, and add them back to the
    # original fs.
    reversed = reverse_array(numpy.where(where_folded_out, self, 0))
    folded = numpy.ma.masked_array(self.data + reversed)
    folded.data[where_folded_out] = 0

    # Deal with those entries where assignment of the minor allele is
    # ambiguous.
    where_ambiguous = (total_per_entry == total_samples/2.)
    ambiguous = numpy.where(where_ambiguous, self, 0)
    folded += -0.5*ambiguous + 0.5*reverse_array(ambiguous)

    # Mask out the remains of the folding operation.
    final_mask = numpy.logical_or(final_mask, where_folded_out)

    outfs = Spectrum(folded, mask=final_mask, data_folded=True,
                     pop_ids=self.pop_ids)
    outfs.extrap_x = self.extrap_x
    return outfs

from_data_dict(data_dict, pop_ids, projections, mask_corners=True, polarized=True) staticmethod

Spectrum from a dictionary of polymorphisms.

Parameters:

Name Type Description Default
pop_ids list[str]

list of which populations to make fs for.

required
projections list[int]

list of sample sizes to project down to for each population.

required
mask_corners bool

If True (default), the 'observed in none' and 'observed in all' entries of the FS will be masked.

True
polarized bool

If True, the data are assumed to be correctly polarized by `outgroup_allele'. SNPs in which the 'outgroup_allele' information is missing or '-' or not concordant with the segregating alleles will be ignored. If False, any 'outgroup_allele' info present is ignored, and the returned spectrum is folded.

True

The data dictionary should be organized as:

{snp_id:{'segregating': ['A','T'],
         'calls': {'YRI': (23,3),
                    'CEU': (7,3)
                    },
         'outgroup_allele': 'T'
        }
}

The 'calls' entry gives the successful calls in each population, in the order that the alleles are specified in 'segregating'. Non-diallelic polymorphisms are skipped.

Source code in dadi/Spectrum_mod.py
2209
2210
2211
2212
2213
2214
2215
2216
2217
2218
2219
2220
2221
2222
2223
2224
2225
2226
2227
2228
2229
2230
2231
2232
2233
2234
2235
2236
2237
2238
2239
2240
2241
2242
2243
2244
2245
@staticmethod
def from_data_dict(data_dict, pop_ids, projections, mask_corners=True,
                   polarized=True):
    """
    Spectrum from a dictionary of polymorphisms.

    Args:
        pop_ids (list[str]): list of which populations to make fs for.
        projections (list[int]): list of sample sizes to project down to for each
                    population.
        mask_corners (bool, optional): If True (default), the 'observed in none' and 'observed 
                    in all' entries of the FS will be masked.
        polarized (bool, optional): If True, the data are assumed to be correctly polarized by 
                `outgroup_allele'. SNPs in which the 'outgroup_allele'
                information is missing or '-' or not concordant with the
                segregating alleles will be ignored.
                If False, any 'outgroup_allele' info present is ignored,
                and the returned spectrum is folded.

    The data dictionary should be organized as:

        {snp_id:{'segregating': ['A','T'],
                 'calls': {'YRI': (23,3),
                            'CEU': (7,3)
                            },
                 'outgroup_allele': 'T'
                }
        }
    The 'calls' entry gives the successful calls in each population, in the
    order that the alleles are specified in 'segregating'.
    Non-diallelic polymorphisms are skipped.
    """
    import dadi.Misc
    cd = dadi.Misc.count_data_dict(data_dict, pop_ids)
    fs = Spectrum._from_count_dict(cd, projections, polarized, pop_ids, 
            mask_corners=mask_corners)
    return fs

from_data_dict_corrected(data_dict, pop_ids, projections, fux_filename, force_pos=True, mask_corners=True) staticmethod

Spectrum from a dictionary of polymorphisms, corrected for ancestral misidentification.

The correction is based upon

Hernandez, Williamson & Bustamante Mol_Biol_Evol 24:1792 (2007)

Parameters:

Name Type Description Default
force_pos bool

If the correction is too agressive, it may leave some small entries in the fs less than zero. If force_pos is true, these entries will be set to zero, in such a way that the total number of segregating SNPs is conserved.

True
fux_filename str

The name of the file containing the misidentification probabilities.

required

The file (fux_filename) is of the form:

# Any number of comments lines beginning with #
AAA T 0.001
AAA G 0.02
...

Where every combination of three + one bases is considered (order is not important). The triplet is the context and putatively derived allele (x) in the reference species. The single base is the base (u) in the outgroup. The numerical value is 1-f_{ux} in the notation of the paper.

The data dictionary should be organized as:

{snp_id:{'segregating': ['A','T'],
         'calls': {'YRI': (23,3),
                    'CEU': (7,3)
                    },
         'outgroup_allele': 'T',
         'context': 'CAT',
         'outgroup_context': 'CAT'
        }
}

The additional entries are 'context', which includes the two flanking bases in the species of interest, and 'outgroup_context', which includes the aligned bases in the outgroup.

This method skips entries for which the correction cannot be applied. Most commonly this is because of missing or non-constant context.

Source code in dadi/Spectrum_mod.py
2345
2346
2347
2348
2349
2350
2351
2352
2353
2354
2355
2356
2357
2358
2359
2360
2361
2362
2363
2364
2365
2366
2367
2368
2369
2370
2371
2372
2373
2374
2375
2376
2377
2378
2379
2380
2381
2382
2383
2384
2385
2386
2387
2388
2389
2390
2391
2392
2393
2394
2395
2396
2397
2398
2399
2400
2401
2402
2403
2404
2405
2406
2407
2408
2409
2410
2411
2412
2413
2414
2415
2416
2417
2418
2419
2420
2421
2422
2423
2424
2425
2426
2427
2428
2429
2430
2431
2432
2433
2434
2435
2436
2437
2438
2439
2440
2441
2442
2443
2444
2445
@staticmethod
def from_data_dict_corrected(data_dict, pop_ids, projections,
                             fux_filename, force_pos=True,
                             mask_corners=True):
    """
    Spectrum from a dictionary of polymorphisms, corrected for ancestral
    misidentification.

    The correction is based upon:
        Hernandez, Williamson & Bustamante _Mol_Biol_Evol_ 24:1792 (2007)

    Args:
        force_pos (bool): If the correction is too agressive, it may leave some small
                entries in the fs less than zero. If force_pos is true,
                these entries will be set to zero, in such a way that the
                total number of segregating SNPs is conserved.
        fux_filename (str): The name of the file containing the 
                misidentification probabilities.

    The file (fux_filename) is of the form:

        # Any number of comments lines beginning with #
        AAA T 0.001
        AAA G 0.02
        ...

    Where every combination of three + one bases is considered
    (order is not important).  The triplet is the context and
    putatively derived allele (x) in the reference species. The
    single base is the base (u) in the outgroup. The numerical
    value is 1-f_{ux} in the notation of the paper.

    The data dictionary should be organized as:

        {snp_id:{'segregating': ['A','T'],
                 'calls': {'YRI': (23,3),
                            'CEU': (7,3)
                            },
                 'outgroup_allele': 'T',
                 'context': 'CAT',
                 'outgroup_context': 'CAT'
                }
        }

    The additional entries are 'context', which includes the two flanking
    bases in the species of interest, and 'outgroup_context', which
    includes the aligned bases in the outgroup.

    This method skips entries for which the correction cannot be applied.
    Most commonly this is because of missing or non-constant context.
    """
    # Read the fux file into a dictionary.
    fux_dict = {}
    f = open(fux_filename)
    for line in f.readlines():
        if line.startswith('#'):
            continue
        sp = line.split()
        fux_dict[(sp[0], sp[1])] = 1-float(sp[2])
    f.close()

    # Divide the data into classes based on ('context', 'outgroup_allele')
    by_context = Spectrum._data_by_tri(data_dict)

    fs = numpy.zeros(numpy.asarray(projections)+1)
    while by_context:
        # Each time through this loop, we eliminate two entries from the 
        # data dictionary. These correspond to one class and its
        # corresponding misidentified class.
        (derived_tri, out_base), nomis_data = by_context.popitem()

        # The corresponding bases if the ancestral state had been
        # misidentifed.
        mis_out_base = derived_tri[1]
        mis_derived_tri = derived_tri[0] + out_base + derived_tri[2]
        # Get the data for that case. Note that we default to an empty
        # dictionary if we don't have data for that class.
        mis_data = by_context.pop((mis_derived_tri, mis_out_base), {})

        fux = fux_dict[(derived_tri, out_base)]
        fxu = fux_dict[(mis_derived_tri, mis_out_base)]

        # Get the spectra for these two cases
        Nux = Spectrum.from_data_dict(nomis_data, pop_ids, projections)
        Nxu = Spectrum.from_data_dict(mis_data, pop_ids, projections)

        # Equations 5 & 6 from the paper.
        Nxu_rev = reverse_array(Nxu)
        Rux = (fxu*Nux - (1-fxu)*Nxu_rev)/(fux+fxu-1)
        Rxu = reverse_array((fux*Nxu_rev - (1-fux)*Nux)/(fux+fxu-1))

        fs += Rux + Rxu

    # Here we take the negative entries, and flip them back, so they end up
    # zero and the total number of SNPs is conserved.
    if force_pos:
        negative_entries = numpy.minimum(0, fs)
        fs -= negative_entries
        fs += reverse_array(negative_entries)

    return Spectrum(fs, mask_corners=mask_corners, pop_ids=pop_ids)

from_demes(g, sampled_demes, sample_sizes, pts, log_extrap=False, sample_times=None, Ne=None, ancestral_misid=False) staticmethod

Takes a deme graph and computes the SFS. demes is a package for specifying demographic models in a user-friendly, human-readable YAML format. This function automatically parses the demographic description and returns a SFS for the specified populations and sample sizes.

This function is new in version 1.1.0. Future developments will allow for inference using demes-based demographic descriptions.

Parameters:

Name Type Description Default
g str or DemeGraph

A demes DemeGraph from which to compute the SFS. The DemeGraph can either be specified as a YAML file, in which case g is a string, or as a DemeGraph object.

required
sampled_demes list[str]

A list of deme IDs to take samples from. We can repeat demes, as long as the sampling of repeated deme IDs occurs at distinct times.

required
sample_sizes list[int]

A list of the same length as sampled_demes, giving the sample sizes for each sampled deme.

required
sample_times list[floats]

If None, assumes all sampling occurs at the end of the existence of the sampled deme. If there are ancient samples, sample_times must be a list of same length as sampled_demes, giving the sampling times for each sampled deme. Sampling times are given in time units of the original deme graph, so might not necessarily be generations (e.g. if g.time_units is years)

None
Ne float

reference population size. If none is given, we use the initial size of the root deme.

None

Returns:

Name Type Description
fs Spectrum

A dadi site frequency spectrum, with dimension equal to the length of sampled_demes, and shape equal to sample_sizes plus one in each dimension, indexing the allele frequency in each deme from 0 to n[i], where i is the deme index.

Source code in dadi/Spectrum_mod.py
2447
2448
2449
2450
2451
2452
2453
2454
2455
2456
2457
2458
2459
2460
2461
2462
2463
2464
2465
2466
2467
2468
2469
2470
2471
2472
2473
2474
2475
2476
2477
2478
2479
2480
2481
2482
2483
2484
2485
2486
2487
2488
2489
2490
2491
2492
2493
2494
2495
2496
2497
2498
2499
2500
2501
2502
2503
2504
2505
2506
2507
2508
2509
2510
2511
2512
2513
2514
@staticmethod
def from_demes(
    g, sampled_demes, sample_sizes, pts, log_extrap=False, sample_times=None, Ne=None, ancestral_misid=False,):
    """
    Takes a deme graph and computes the SFS. ``demes`` is a package for
    specifying demographic models in a user-friendly, human-readable YAML
    format. This function automatically parses the demographic description
    and returns a SFS for the specified populations and sample sizes.

    This function is new in version 1.1.0. Future developments will allow for
    inference using ``demes``-based demographic descriptions.

    Args:
        g (str or demes.DemeGraph): A ``demes`` DemeGraph from which to compute the SFS. The DemeGraph
            can either be specified as a YAML file, in which case `g` is a string,
            or as a ``DemeGraph`` object.
        sampled_demes (list[str]): A list of deme IDs to take samples from. We can repeat
            demes, as long as the sampling of repeated deme IDs occurs at distinct
            times.
        sample_sizes (list[int]): A list of the same length as ``sampled_demes``,
            giving the sample sizes for each sampled deme.
        sample_times (list[floats], optional): If None, assumes all sampling occurs at the end of the
            existence of the sampled deme. If there are
            ancient samples, ``sample_times`` must be a list of same length as
            ``sampled_demes``, giving the sampling times for each sampled
            deme. Sampling times are given in time units of the original deme graph,
            so might not necessarily be generations (e.g. if ``g.time_units`` is years)
        Ne (float, optional): reference population size. If none is given, we use the initial
            size of the root deme.

    Returns:
        fs (Spectrum): A ``dadi`` site frequency spectrum, with dimension equal to the
            length of ``sampled_demes``, and shape equal to ``sample_sizes`` plus one
            in each dimension, indexing the allele frequency in each deme from 0
            to n[i], where i is the deme index.
    """
    global _imported_demes
    if not _imported_demes:
        try:
            global demes
            global Demes
            import demes
            import dadi.Demes as Demes

            _imported_demes = True
        except ImportError:
            raise ImportError("demes is not installed, need to `pip install demes`")

    if isinstance(g, str):
        dg = demes.load(g)
    else:
        dg = g
    if ancestral_misid==False:
        func_ex = dadi.Numerics.make_extrap_func(Demes.SFS)
    if ancestral_misid==True:
        misid_func = dadi.Numerics.make_anc_state_misid_func(Demes.SFS)
        func_ex = dadi.Numerics.make_extrap_func(misid_func)


    fs = func_ex(
        dg,
        sampled_demes,
        sample_sizes,
        pts=pts,
        sample_times=sample_times,
        Ne=Ne,
    )
    return fs

from_file(fname, mask_corners=True, return_comments=False) staticmethod

Read frequency spectrum from file.

Parameters:

Name Type Description Default
fname str

String with file name to read from. If it ends in .gz, gzip compression is assumed.

required
mask_corners bool

If True, mask the 'absent in all samples' and 'fixed in all samples' entries.

True
return_comments bool

If true, the return value is (fs, comments), where comments is a list of strings containing the comments from the file (without #'s).

False
Note

See to_file method for details on the file format.

Source code in dadi/Spectrum_mod.py
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
@staticmethod
def from_file(fname, mask_corners=True, return_comments=False):
    """
    Read frequency spectrum from file.

    Args:
        fname (str): String with file name to read from. If it ends in .gz, gzip
            compression is assumed.
        mask_corners (bool, optional): If True, mask the 'absent in all samples' and 'fixed in
                    all samples' entries.
        return_comments (bool, optional): If true, the return value is (fs, comments), where
                        comments is a list of strings containing the comments
                        from the file (without #'s).

    Note:
        See [to_file][dadi.Spectrum_mod.Spectrum.to_file] method for details on the file format.
    """
    if fname.endswith('.gz'):
        fid = gzip.open(fname, 'rb')
    else:
        fid = open(fname, 'r')

    line = fid.readline()
    # Strip out the comments
    comments = []
    while line.startswith('#'):
        comments.append(line[1:].strip())
        line = fid.readline()

    # Read the shape of the data
    shape_spl = line.split()
    if 'folded' not in shape_spl and 'unfolded' not in shape_spl:
        # This case handles the old file format
        shape = tuple([int(d) for d in shape_spl])
        folded = False
        pop_ids = None
    else:
        # This case handles the new file format
        shape,next_ii = [int(shape_spl[0])], 1
        while shape_spl[next_ii] not in ['folded', 'unfolded']:
            shape.append(int(shape_spl[next_ii]))
            next_ii += 1
        folded = (shape_spl[next_ii] == 'folded')
        # Are there population labels in the file?
        if len(shape_spl) > next_ii + 1:
            pop_ids = line.split('"')[1::2]
        else:
            pop_ids = None

    data = numpy.fromstring(fid.readline().strip(), 
                            count=numpy.prod(shape), sep=' ')
    # fromfile returns a 1-d array. Reshape it to the proper form.
    data = data.reshape(*shape)

    maskline = fid.readline().strip()
    if not maskline:
        # The old file format didn't have a line for the mask
        mask = None
    else:
        # This case handles the new file format
        mask = numpy.fromstring(maskline, 
                                count=numpy.prod(shape), sep=' ')
        mask = mask.reshape(*shape)

    fs = Spectrum(data, mask, mask_corners, data_folded=folded,
                  pop_ids=pop_ids)

    fid.close()
    if not return_comments:
        return fs
    else:
        return fs,comments

from_ms_file(fid, average=True, mask_corners=True, return_header=False, pop_assignments=None, pop_ids=None, bootstrap_segments=1) staticmethod

Read frequency spectrum from file of ms output.

Parameters:

Name Type Description Default
fid str

string with file name to read from or an open file object.

required
average bool

If True, the returned fs is the average over the runs in the ms file. If False, the returned fs is the sum.

True
mask_corners bool

If True, mask the 'absent in all samples' and 'fixed in all samples' entries.

True
return_header bool

If True, the return value is (fs, (command,seeds), where command and seeds are strings containing the ms commandline and the seeds used.

False
pop_assignments list[int]

If None, the assignments of samples to populations is done automatically, using the assignment in the ms command line. To manually assign populations, pass a list of the from [6,8]. This example places the first 6 samples into population 1, and the next 8 into population 2.

None
pop_ids list[str]

Optional list of strings containing the population labels. If pop_ids is None, labels will be "pop0", "pop1", ...

None
bootstrap_segments int

If bootstrap_segments is an integer greater than 1, the data will be broken up into that many segments based on SNP position. Instead of single FS, a list of spectra will be returned, one for each segment.

1
Source code in dadi/Spectrum_mod.py
764
765
766
767
768
769
770
771
772
773
774
775
776
777
778
779
780
781
782
783
784
785
786
787
788
789
790
791
792
793
794
795
796
797
798
799
800
801
802
803
804
805
806
807
808
809
810
811
812
813
814
815
816
817
818
819
820
821
822
823
824
825
826
827
828
829
830
831
832
833
834
835
836
837
838
839
840
841
842
843
844
845
846
847
848
849
850
851
852
853
854
855
856
857
858
859
860
861
862
863
864
865
866
867
868
869
870
871
872
873
874
875
876
877
878
879
880
881
882
883
884
885
886
887
888
889
890
891
892
893
894
895
896
897
898
899
900
901
902
903
904
905
906
907
908
909
910
911
912
913
914
915
916
917
918
919
920
921
922
923
924
925
926
927
928
929
930
931
932
933
934
935
936
937
938
939
940
941
942
943
944
@staticmethod
def from_ms_file(fid, average=True, mask_corners=True, return_header=False,
                 pop_assignments=None, pop_ids=None, bootstrap_segments=1):
    """
    Read frequency spectrum from file of ms output.

    Args:
        fid (str): string with file name to read from or an open file object.
        average (bool, optional): If True, the returned fs is the average over the runs in the ms
                file. If False, the returned fs is the sum.
        mask_corners (bool, optional): If True, mask the 'absent in all samples' and 'fixed in
                    all samples' entries.
        return_header (bool, optional): If True, the return value is (fs, (command,seeds), where
                    command and seeds are strings containing the ms
                    commandline and the seeds used.
        pop_assignments (list[int], optional): If None, the assignments of samples to populations is
                        done automatically, using the assignment in the ms
                        command line. To manually assign populations, pass a
                        list of the from [6,8]. This example places
                        the first 6 samples into population 1, and the next 8
                        into population 2.
        pop_ids (list[str], optional): Optional list of strings containing the population labels.
                If pop_ids is None, labels will be "pop0", "pop1", ...
        bootstrap_segments (int, optional): If bootstrap_segments is an integer greater than 1,
                            the data will be broken up into that many segments
                            based on SNP position. Instead of single FS, a list
                            of spectra will be returned, one for each segment.
    """
    newfile = False
    # Try to read from fid. If we can't, assume it's something that we can
    # use to open a file.
    if not hasattr(fid, 'read'):
        newfile = True
        fid = open(fid, 'r')

    # Parse the commandline
    command = line = fid.readline()
    command_terms = line.split()

    if command_terms[0].count('ms'):
        runs = int(command_terms[2])
        try:
            pop_flag = command_terms.index('-I')
            num_pops = int(command_terms[pop_flag+1])
            pop_samples = [int(command_terms[pop_flag+ii])
                           for ii in range(2, 2+num_pops)]
        except ValueError:
            num_pops = 1
            pop_samples = [int(command_terms[1])]
    else:
        raise ValueError('Unrecognized command string: %s.' % command)

    total_samples = numpy.sum(pop_samples)
    if pop_assignments:
        num_pops = len(pop_assignments)
        pop_samples = pop_assignments

    sample_indices = numpy.cumsum([0] + pop_samples)
    bottom_l = sample_indices[:-1]
    top_l = sample_indices[1:]

    seeds = line = fid.readline()
    while not line.startswith('//'):
        line = fid.readline()

    counts = numpy.zeros(len(pop_samples), numpy.int_)
    fs_shape = numpy.asarray(pop_samples) + 1
    dimension = len(counts)

    if dimension > 1:
        bottom0 = bottom_l[0]
        top0 = top_l[0]
        bottom1 = bottom_l[1]
        top1 = top_l[1]
    if dimension > 2:
        bottom2 = bottom_l[2]
        top2 = top_l[2]
    if dimension > 3:
        bottom3 = bottom_l[3]
        top3 = top_l[3]
    if dimension > 4:
        bottom4 = bottom_l[4]
        top4 = top_l[4]
    if dimension > 5:
        bottom5 = bottom_l[5]
        top5 = top_l[5]

    all_data = [numpy.zeros(fs_shape, numpy.int_)
                for boot_ii in range(bootstrap_segments)]
    for run_ii in range(runs):
        line = fid.readline()
        segsites = int(line.split()[-1])

        if segsites == 0:
            # Special case, need to read 3 lines to stay synced.
            for _ in range(3):
                line = fid.readline()
            continue
        line = fid.readline()
        while not line.startswith('positions'):
            line = fid.readline()

        # Read SNP positions for creating bootstrap segments
        positions = [float(_) for _ in line.split()[1:]]
        # Where we should break our interval to create our bootstraps
        breakpts = numpy.linspace(0, 1, bootstrap_segments+1)
        # The indices that correspond to those breakpoints
        break_iis = numpy.searchsorted(positions, breakpts)
        # Correct for searchsorted behavior if last position is 1,
        # to ensure all SNPs are captured
        break_iis[-1] = len(positions)

        # Read the chromosomes in
        chromos = fid.read((segsites+1)*total_samples)

        # For each bootstrap segment, relevant SNPs run from start_ii:end_ii
        for boot_ii, (start_ii, end_ii) \
                in enumerate(zip(break_iis[:-1], break_iis[1:])):
            # Use the data array corresponding to this bootstrap segment
            data = all_data[boot_ii]
            for snp in range(start_ii, end_ii):
                # Slice to get all the entries that refer to a given SNP
                this_snp = chromos[snp::segsites+1]
                # Count SNPs per population, and record them.
                if dimension == 1:
                    data[this_snp.count('1')] += 1
                elif dimension == 2:
                    data[this_snp[bottom0:top0].count('1'), 
                         this_snp[bottom1:top1].count('1')] += 1
                elif dimension == 3:
                    data[this_snp[bottom0:top0].count('1'), 
                         this_snp[bottom1:top1].count('1'),
                         this_snp[bottom2:top2].count('1')] += 1
                elif dimension == 4:
                    data[this_snp[bottom0:top0].count('1'), 
                         this_snp[bottom1:top1].count('1'),
                         this_snp[bottom2:top2].count('1'),
                         this_snp[bottom3:top3].count('1')] += 1
                elif dimension == 5:
                    data[this_snp[bottom0:top0].count('1'), 
                         this_snp[bottom1:top1].count('1'),
                         this_snp[bottom2:top2].count('1'),
                         this_snp[bottom3:top3].count('1'),
                         this_snp[bottom4:top4].count('1')] += 1
                elif dimension == 6:
                    data[this_snp[bottom0:top0].count('1'), 
                         this_snp[bottom1:top1].count('1'),
                         this_snp[bottom2:top2].count('1'),
                         this_snp[bottom3:top3].count('1'),
                         this_snp[bottom4:top4].count('1'),
                         this_snp[bottom5:top5].count('1')] += 1
                else:
                    # This is noticably slower, so we special case the cases
                    # above.
                    for dim_ii in range(dimension):
                        bottom = bottom_l[dim_ii]
                        top = top_l[dim_ii]
                        counts[dim_ii] = this_snp[bottom:top].count('1')
                    data[tuple(counts)] += 1

        # Read to the next iteration
        line = fid.readline()
        line = fid.readline()

    if newfile:
        fid.close()

    all_fs = [Spectrum(data, mask_corners=mask_corners, pop_ids=pop_ids)
              for data in all_data]
    if average:
        all_fs = [fs/runs for fs in all_fs]

    # If we aren't setting up for bootstrapping, return fs, rather than a
    # list of length 1. (This ensures backward compatibility.)
    if bootstrap_segments == 1:
        all_fs = all_fs[0]

    if not return_header:
        return all_fs
    else:
        return all_fs, (command,seeds)

from_phi(phi, ns, xxs, mask_corners=True, pop_ids=None, admix_props=None, het_ascertained=None, force_direct=False) staticmethod

Compute sample Spectrum from population frequency distribution phi.

Parameters:

Name Type Description Default
phi array - like

P-dimensional population frequency distribution.

required
ns list[int]

Sequence of P sample sizes for each population.

required
xxs tuple[int]

Sequence of P one-dimesional grids on which phi is defined.

required
mask_corners bool

If True, resulting FS is masked in 'absent' and 'fixed' entries.

True
pop_ids list[str]

Optional list of strings containing the population labels. If pop_ids is None, labels will be "pop0", "pop1", ...

None
admix_props tuple[tuple]

Admixture proportions for sampled individuals. For example, if there are two populations, and individuals from the first pop are admixed with fraction f from the second population, then admix_props=((1-f,f),(0,1)). For three populations, the no-admixture setting is admix_props=((1,0,0),(0,1,0),(0,0,1)). (Note that this option also forces direct integration, which may be less accurate than the semi-analytic method.)

None
het_ascertained str['xx', 'yy', or 'zz']

If 'xx', then FS is calculated assuming that SNPs have been ascertained by being heterozygous in one individual from population 1. (This individual is not in the current sample.) If 'yy' or 'zz', it assumed that the ascertainment individual came from population 2 or 3, respectively. (Note that this option also forces direct integration, which may be less accurate than the semi-analytic method. This could be fixed if there is interest. Note also that this option cannot be used simultaneously with admix_props.)

None
force_direct bool

Forces integration to use older direct integration method, rather than using analytic integration of sampling formula.

False
Source code in dadi/Spectrum_mod.py
1937
1938
1939
1940
1941
1942
1943
1944
1945
1946
1947
1948
1949
1950
1951
1952
1953
1954
1955
1956
1957
1958
1959
1960
1961
1962
1963
1964
1965
1966
1967
1968
1969
1970
1971
1972
1973
1974
1975
1976
1977
1978
1979
1980
1981
1982
1983
1984
1985
1986
1987
1988
1989
1990
1991
1992
1993
1994
1995
1996
1997
1998
1999
2000
2001
2002
2003
2004
2005
2006
2007
2008
2009
2010
2011
2012
2013
2014
2015
2016
2017
2018
2019
2020
2021
2022
2023
2024
2025
2026
2027
2028
2029
2030
2031
2032
2033
2034
2035
2036
2037
2038
2039
2040
2041
2042
2043
2044
2045
2046
2047
2048
2049
2050
2051
2052
2053
2054
2055
2056
2057
2058
2059
2060
2061
@staticmethod
def from_phi(phi, ns, xxs, mask_corners=True, 
             pop_ids=None, admix_props=None, het_ascertained=None, 
             force_direct=False):
    """
    Compute sample Spectrum from population frequency distribution phi.

    Args:
        phi (array-like): P-dimensional population frequency distribution.
        ns (list[int]): Sequence of P sample sizes for each population.
        xxs (tuple[int]): Sequence of P one-dimesional grids on which phi is defined.
        mask_corners (bool, optional): If True, resulting FS is masked in 'absent' and 'fixed'
                    entries.
        pop_ids (list[str], optional): Optional list of strings containing the population labels.
                If pop_ids is None, labels will be "pop0", "pop1", ...
        admix_props (tuple[tuple], optional): Admixture proportions for sampled individuals. For example,
                    if there are two populations, and individuals from the
                    first pop are admixed with fraction f from the second
                    population, then admix_props=((1-f,f),(0,1)). For three
                    populations, the no-admixture setting is
                    admix_props=((1,0,0),(0,1,0),(0,0,1)). 
                    (Note that this option also forces direct integration,
                    which may be less accurate than the semi-analytic
                    method.)
        het_ascertained (str['xx', 'yy', or 'zz'], optional): If 'xx', then FS is calculated assuming that SNPs have
                        been ascertained by being heterozygous in one
                        individual from population 1. (This individual is
                        *not* in the current sample.) If 'yy' or 'zz', it
                        assumed that the ascertainment individual came from
                        population 2 or 3, respectively.
                        (Note that this option also forces direct integration,
                        which may be less accurate than the semi-analytic
                        method. This could be fixed if there is interest. Note
                        also that this option cannot be used simultaneously
                        with admix_props.)
        force_direct (bool, optional): Forces integration to use older direct integration method,
                    rather than using analytic integration of sampling 
                    formula.
    """
    if admix_props and not numpy.allclose(numpy.sum(admix_props, axis=1),1):
        raise ValueError('Admixture proportions {0} must sum to 1 for all '
                         'populations.' .format(str(admix_props)))
    if not phi.ndim == len(ns) == len(xxs):
        raise ValueError('Dimensionality of phi and lengths of ns and xxs '
                         'do not all agree.')
    if het_ascertained and not het_ascertained in ['xx','yy','zz']:
        raise ValueError("If used, het_ascertained must be 'xx', 'yy', or "
                         "'zz'.")

    if admix_props and het_ascertained:
        error = """admix_props and het_ascertained options cannot be used 
        simultaneously. Instead, please use the PhiManip methods to 
        implement admixture. If this proves inappropriate for your use, 
        contact the the dadi developers, as it may be possible to support
        both options simultaneously in the future."""
        raise NotImplementedError(error)

    if phi.ndim == 1:
        if not het_ascertained and not force_direct:
            fs = Spectrum._from_phi_1D_analytic(ns[0], xxs[0], phi,
                                                mask_corners)
        else:
            fs = Spectrum._from_phi_1D_direct(ns[0], xxs[0], phi, 
                                              mask_corners, het_ascertained)
    elif phi.ndim == 2:
        if not het_ascertained and not admix_props and not force_direct:
            fs = Spectrum._from_phi_2D_linalg(ns[0], ns[1], 
                                              xxs[0], xxs[1], phi,
                                              mask_corners)
        elif not admix_props:
            fs = Spectrum._from_phi_2D_direct(ns[0], ns[1], xxs[0], xxs[1], 
                                              phi, mask_corners, 
                                              het_ascertained)
        else:
            fs = Spectrum._from_phi_2D_admix_props(ns[0], ns[1], 
                                                  xxs[0], xxs[1], 
                                                  phi, mask_corners, 
                                                  admix_props)
    elif phi.ndim == 3:
        if not het_ascertained and not admix_props and not force_direct:
            fs = Spectrum._from_phi_3D_linalg(ns[0], ns[1], ns[2], 
                                              xxs[0], xxs[1], xxs[2],
                                              phi, mask_corners)
        elif not admix_props:
            fs = Spectrum._from_phi_3D_direct(ns[0], ns[1], ns[2], 
                                              xxs[0], xxs[1], xxs[2], 
                                              phi, mask_corners, 
                                              het_ascertained)
        else:
            fs = Spectrum._from_phi_3D_admix_props(ns[0], ns[1], ns[2], 
                                                   xxs[0], xxs[1], xxs[2], 
                                                   phi, mask_corners, 
                                                   admix_props)
    elif phi.ndim == 4:
        if not het_ascertained and not admix_props and not force_direct:
            fs = Spectrum._from_phi_4D_linalg(ns[0], ns[1], ns[2], ns[3],
                                              xxs[0], xxs[1], xxs[2], xxs[3],
                                              phi, mask_corners)
        elif not admix_props:
            fs = Spectrum._from_phi_4D_direct(ns[0], ns[1], ns[2], ns[3],
                                              xxs[0], xxs[1], xxs[2], xxs[3],
                                              phi, mask_corners, 
                                              het_ascertained)
        else:
            fs = Spectrum._from_phi_4D_admix_props(ns[0], ns[1], ns[2], ns[3],
                                                   xxs[0], xxs[1], xxs[2], xxs[3],
                                                   phi, mask_corners, 
                                                   admix_props)
    elif phi.ndim == 5:
        if not het_ascertained and not admix_props and not force_direct:
            fs = Spectrum._from_phi_5D_linalg(ns[0], ns[1], ns[2], ns[3], ns[4],
                                              xxs[0], xxs[1], xxs[2], xxs[3], xxs[4],
                                              phi, mask_corners)
    else:
        raise ValueError('Only implemented for dimensions 1-5.')
    fs.pop_ids = pop_ids
    # Record value to use for extrapolation. This is the first grid point,
    # which is where new mutations are introduced. Note that extrapolation
    # will likely fail if grids differ between dimensions.
    fs.extrap_x = xxs[0][1]
    for xx in xxs[1:]:
        if not xx[1] == fs.extrap_x:
            logger.warning('Spectrum calculated from phi different grids for '
                        'different dimensions. Extrapolation may fail.')
    return fs

from_phi_inbreeding(phi, ns, xxs, Fs, ploidys, mask_corners=True, pop_ids=None, admix_props=None, het_ascertained=None, force_direct=True) staticmethod

Compute a sample Spectrum from a population frequency distribution (phi) with inbreeding.

This function calculates the frequency spectrum while accounting for inbreeding coefficients and ploidy levels for each population.

Parameters:

Name Type Description Default
phi array - like

P-dimensional population frequency distribution.

required
ns list[int]

Sequence of P sample sizes for each population.

required
xxs list[array - like]

Sequence of P one-dimensional grids on which phi is defined.

required
Fs list[float]

Sequence of P inbreeding coefficients for each population.

required
ploidys list[int]

Sequence of P ploidy levels for each population.

required
mask_corners bool

If True, resulting FS is masked in 'absent' and 'fixed' entries.

True
pop_ids list[str]

Optional list of strings containing the population labels. If None, labels will be "pop0", "pop1", etc.

None
admix_props list[tuple]

Admixture proportions for sampled individuals. For example, if there are two populations and individuals from the first population are admixed with fraction f from the second population, then admix_props=((1-f, f), (0, 1)). For three populations, the no-admixture setting is admix_props=((1, 0, 0), (0, 1, 0), (0, 0, 1)). Note: This option forces direct integration, which may be less accurate than the semi-analytic method.

None
het_ascertained str

If 'xx', the FS is calculated assuming that SNPs have been ascertained by being heterozygous in one individual from population 1. This individual is not in the current sample. If 'yy' or 'zz', it is assumed that the ascertainment individual came from population 2 or 3, respectively. Note: This option forces direct integration, which may be less accurate than the semi-analytic method. This option cannot be used simultaneously with admix_props.

None
force_direct bool

Forces integration to use the older direct integration method, rather than using analytic integration of the sampling formula.

True

Returns:

Name Type Description
fs Spectrum

The resulting frequency spectrum.

Raises:

Type Description
ValueError

If the dimensionality of phi and lengths of ns, xxs, ploidys, and Fs do not all agree, or if invalid values are provided for het_ascertained.

NotImplementedError

If admix_props and het_ascertained are used simultaneously.

Source code in dadi/Spectrum_mod.py
2063
2064
2065
2066
2067
2068
2069
2070
2071
2072
2073
2074
2075
2076
2077
2078
2079
2080
2081
2082
2083
2084
2085
2086
2087
2088
2089
2090
2091
2092
2093
2094
2095
2096
2097
2098
2099
2100
2101
2102
2103
2104
2105
2106
2107
2108
2109
2110
2111
2112
2113
2114
2115
2116
2117
2118
2119
2120
2121
2122
2123
2124
2125
2126
2127
2128
2129
2130
2131
2132
2133
2134
2135
2136
2137
2138
2139
2140
2141
2142
2143
2144
2145
2146
2147
2148
2149
2150
2151
2152
2153
2154
2155
2156
2157
2158
@staticmethod
def from_phi_inbreeding(phi, ns, xxs, Fs, ploidys, mask_corners=True, 
                        pop_ids=None, admix_props=None,
                        het_ascertained=None, force_direct=True):
    """
    Compute a sample Spectrum from a population frequency distribution (phi) with inbreeding.

    This function calculates the frequency spectrum while accounting for inbreeding
    coefficients and ploidy levels for each population.

    Args:
        phi (array-like): P-dimensional population frequency distribution.
        ns (list[int]): Sequence of P sample sizes for each population.
        xxs (list[array-like]): Sequence of P one-dimensional grids on which phi is defined.
        Fs (list[float]): Sequence of P inbreeding coefficients for each population.
        ploidys (list[int]): Sequence of P ploidy levels for each population.
        mask_corners (bool, optional): If True, resulting FS is masked in 'absent' and 'fixed' entries.
        pop_ids (list[str], optional): Optional list of strings containing the population labels.
            If None, labels will be "pop0", "pop1", etc.
        admix_props (list[tuple], optional): Admixture proportions for sampled individuals.
            For example, if there are two populations and individuals from the first
            population are admixed with fraction f from the second population, then
            admix_props=((1-f, f), (0, 1)). For three populations, the no-admixture
            setting is admix_props=((1, 0, 0), (0, 1, 0), (0, 0, 1)).
            Note: This option forces direct integration, which may be less accurate
            than the semi-analytic method.
        het_ascertained (str, optional): If 'xx', the FS is calculated assuming that SNPs have
            been ascertained by being heterozygous in one individual from population 1.
            This individual is not in the current sample. If 'yy' or 'zz', it is assumed
            that the ascertainment individual came from population 2 or 3, respectively.
            Note: This option forces direct integration, which may be less accurate
            than the semi-analytic method. This option cannot be used simultaneously
            with admix_props.
        force_direct (bool, optional): Forces integration to use the older direct integration
            method, rather than using analytic integration of the sampling formula.

    Returns:
        fs (Spectrum): The resulting frequency spectrum.

    Raises:
        ValueError: If the dimensionality of phi and lengths of ns, xxs, ploidys, and Fs
            do not all agree, or if invalid values are provided for het_ascertained.
        NotImplementedError: If admix_props and het_ascertained are used simultaneously.
    """
    if np.all(np.asarray(Fs) == 0):
        return Spectrum.from_phi(phi, ns, xxs, mask_corners=mask_corners, 
                        pop_ids=pop_ids, admix_props=admix_props,
                        het_ascertained=het_ascertained, force_direct=force_direct)
    if admix_props and not numpy.allclose(numpy.sum(admix_props, axis=1),1):
        raise ValueError('Admixture proportions {0} must sum to 1 for all '
                         'populations.' .format(str(admix_props)))
    if not phi.ndim == len(ns) == len(xxs) == len(Fs) == len(ploidys):
        raise ValueError('Dimensionality of phi and lengths of ns, xxs, ploidys, and Fs '
                         'do not all agree.')
    if het_ascertained and not het_ascertained in ['xx','yy','zz']:
        raise ValueError("If used, het_ascertained must be 'xx', 'yy', or "
                         "'zz'.")

    Fs = np.minimum(Fs, 1 - 1e-10)

    if admix_props and het_ascertained:
        error = """admix_props and het_ascertained options cannot be used 
        simultaneously. Instead, please use the PhiManip methods to 
        implement admixture. If this proves inappropriate for your use, 
        contact the the dadi developers, as it may be possible to support
        both options simultaneously in the future."""
        raise NotImplementedError(error)

    if phi.ndim == 1:
        fs = Spectrum._from_phi_1D_direct_inbreeding(ns[0], xxs[0], phi, Fs[0],
                                                     mask_corners, ploidys[0],
                                                     het_ascertained)
    elif phi.ndim == 2:
        fs = Spectrum._from_phi_2D_direct_inbreeding(ns[0], ns[1], xxs[0], xxs[1],
                                                     phi, Fs[0], Fs[1], mask_corners,
                                                     ploidys[0], ploidys[1],
                                                     het_ascertained)
    elif phi.ndim == 3:
        fs = Spectrum._from_phi_3D_direct_inbreeding(ns[0], ns[1], ns[2], 
                                                     xxs[0], xxs[1], xxs[2], 
                                                     phi, Fs[0], Fs[1], Fs[2],
                                                     mask_corners, ploidys[0],
                                                     ploidys[1], ploidys[2],
                                                     het_ascertained)
    else:
        raise ValueError('Only implemented for dimensions 1,2 or 3.')
    fs.pop_ids = pop_ids
    # Record value to use for extrapolation. This is the first grid point,
    # which is where new mutations are introduced. Note that extrapolation
    # will likely fail if grids differ between dimensions.
    fs.extrap_x = xxs[0][1]
    for xx in xxs[1:]:
        if not xx[1] == fs.extrap_x:
            logger.warning('Spectrum calculated from phi different grids for '
                        'different dimensions. Extrapolation may fail.')
    return fs

from_sfscode_file(fid, sites='all', average=True, mask_corners=True, return_header=False, pop_ids=None) staticmethod

Read a frequency spectrum from an sfs_code output file.

This function parses the output of sfs_code and returns a frequency spectrum based on the specified filtering and formatting options.

Parameters:

Name Type Description Default
fid str or file object

Path to the file or an open file object containing sfs_code output.

required
sites str

Site type to include in the spectrum. Options are:

  • 'all': Include all sites.

  • 'syn': Include only synonymous mutations.

  • 'nonsyn': Include only non-synonymous mutations.

'all'
average bool

If True, return the average spectrum across runs; if False, return the sum.

True
mask_corners bool

If True, mask entries corresponding to alleles absent or fixed in all samples.

True
return_header bool

If True, return a tuple (fs, (command, seeds)), where command and seeds are strings containing the ms command line and seed values used.

False
pop_ids list of str

Population labels. If None, defaults to ["pop0", "pop1", ...].

None

Returns:

Name Type Description
fs Spectrum

The frequency spectrum, optionally with metadata if return_header is True.

Source code in dadi/Spectrum_mod.py
 946
 947
 948
 949
 950
 951
 952
 953
 954
 955
 956
 957
 958
 959
 960
 961
 962
 963
 964
 965
 966
 967
 968
 969
 970
 971
 972
 973
 974
 975
 976
 977
 978
 979
 980
 981
 982
 983
 984
 985
 986
 987
 988
 989
 990
 991
 992
 993
 994
 995
 996
 997
 998
 999
1000
1001
1002
1003
1004
1005
1006
1007
1008
1009
1010
1011
1012
1013
1014
1015
1016
1017
1018
1019
1020
1021
1022
1023
1024
1025
1026
1027
1028
1029
1030
1031
1032
1033
1034
1035
1036
1037
1038
1039
1040
1041
1042
1043
1044
1045
1046
1047
1048
1049
1050
1051
1052
1053
1054
1055
1056
1057
1058
1059
1060
1061
1062
1063
1064
1065
1066
1067
1068
1069
1070
1071
1072
1073
1074
1075
1076
1077
1078
1079
1080
1081
1082
1083
1084
1085
1086
1087
1088
1089
1090
1091
1092
1093
1094
1095
1096
1097
@staticmethod
def from_sfscode_file(fid, sites='all', average=True, mask_corners=True, 
                      return_header=False, pop_ids=None):
    """
    Read a frequency spectrum from an sfs_code output file.

    This function parses the output of sfs_code and returns a frequency spectrum
    based on the specified filtering and formatting options.

    Args:
        fid (str or file object): Path to the file or an open file object containing sfs_code output.
        sites (str): Site type to include in the spectrum. Options are:

            - 'all': Include all sites.

            - 'syn': Include only synonymous mutations.

            - 'nonsyn': Include only non-synonymous mutations.

        average (bool): If True, return the average spectrum across runs; if False, return the sum.
        mask_corners (bool): If True, mask entries corresponding to alleles absent or fixed in all samples.
        return_header (bool): If True, return a tuple (fs, (command, seeds)), where `command` and `seeds`
            are strings containing the ms command line and seed values used.
        pop_ids (list of str, optional): Population labels. If None, defaults to ["pop0", "pop1", ...].

    Returns:
        fs (Spectrum): The frequency spectrum, optionally with metadata if `return_header` is True.
    """

    newfile = False
    # Try to read from fid. If we can't, assume it's something that we can
    # use to open a file.
    if not hasattr(fid, 'read'):
        newfile = True
        fid = open(fid, 'r')

    if sites == 'all':
        only_nonsyn, only_syn = False, False
    elif sites == 'syn':
        only_nonsyn, only_syn = False, True
    elif sites == 'nonsyn':
        only_nonsyn, only_syn = True, False
    else:
        raise ValueError("'sites' argument must be one of ('all', 'syn', "
                         "'nonsyn').")

    command = fid.readline()
    command_terms = command.split()

    runs = int(command_terms[2])
    num_pops = int(command_terms[1])

    # sfs_code default is 6 individuals, and I assume diploid pop
    pop_samples = [12] *  num_pops
    if '--sampSize' in command_terms or '-n' in command_terms:
        try:
            pop_flag = command_terms.index('--sampSize')
            pop_flag = command_terms.index('-n')
        except ValueError:
            pass
        pop_samples = [2*int(command_terms[pop_flag+ii])
                       for ii in range(1, 1+num_pops)]

    pop_samples = numpy.asarray(pop_samples)
    pop_fixed_str = [',%s.-1' % i for i in range(num_pops)]
    pop_count_str = [',%s.' % i for i in range(num_pops)]

    seeds = fid.readline()
    line = fid.readline()

    data = numpy.zeros(numpy.asarray(pop_samples)+1, numpy.int_)

    # line = //iteration...
    line = fid.readline()
    for iter_ii in range(runs):
        for ii in range(5):
            line = fid.readline()

        # It is possible for a mutation to be listed several times in the
        # output.  To accomodate this, I keep a dictionary of identities
        # for those mutations, and hold off processing them until I've seen
        # all mutations listed for the iteration.
        mut_dict = {}

        # Loop until this iteration ends.
        while not line.startswith('//') and line != '':
            split_line = line.split(';')
            if split_line[-1] == '\n':
                split_line = split_line[:-1]

            # Loop over mutations on this line.
            for mut_ii, mutation in enumerate(split_line):
                counts_this_mut = numpy.zeros(num_pops, numpy.int_)

                split_mut = mutation.split(',')

                # Exclude synonymous mutations
                if only_nonsyn and split_mut[7] == '0':
                    continue
                # Exclude nonsynonymous mutations
                if only_syn and split_mut[7] == '1':
                    continue

                ind_start = len(','.join(split_mut[:12]))
                by_individual = mutation[ind_start:]

                mut_id = ','.join(split_mut[:4] + split_mut[5:11])

                # Count mutations in each population
                for pop_ii,fixed_str,count_str\
                        in zip(range(num_pops), pop_fixed_str, 
                               pop_count_str):
                    if fixed_str in by_individual:
                        counts_this_mut[pop_ii] = pop_samples[pop_ii]
                    else:
                        counts_this_mut[pop_ii] =\
                                by_individual.count(count_str)

                # Initialize the list that will track the counts for this
                # mutation. Using setdefault means that it won't overwrite
                # if there's already a list stored there.
                mut_dict.setdefault(mut_id, [0]*num_pops)
                for ii in range(num_pops):
                    if counts_this_mut[ii] > 0 and mut_dict[mut_id][ii] > 0:
                        sys.stderr.write('Contradicting counts between '
                                         'listings for mutation %s in '
                                         'population %i.' 
                                         % (mut_id, ii))
                    mut_dict[mut_id][ii] = max(counts_this_mut[ii], 
                                               mut_dict[mut_id][ii])

            line = fid.readline()

        # Now apply all the mutations with fixations that we deffered.
        for mut_id, counts in mut_dict.items():
            if numpy.any(numpy.asarray(counts) > pop_samples):
                sys.stderr.write('counts_this_mut > pop_samples: %s > '
                                 '%s\n%s\n' % (counts, pop_samples, mut_id))
                counts = numpy.minimum(counts, pop_samples)
            data[tuple(counts)] += 1

    if newfile:
        fid.close()

    fs = Spectrum(data, mask_corners=mask_corners, pop_ids=pop_ids)
    if average:
        fs /= runs

    if not return_header:
        return fs
    else:
        return fs, (command,seeds)

log()

Return the natural logarithm of the entries of the frequency spectrum.

Only necessary because numpy.ma.log now fails to propagate extra attributes after numpy 1.10.

Source code in dadi/Spectrum_mod.py
598
599
600
601
602
603
604
605
606
607
608
609
def log(self):
    """
    Return the natural logarithm of the entries of the frequency spectrum.

    Only necessary because numpy.ma.log now fails to propagate extra
    attributes after numpy 1.10.
    """
    logfs = numpy.ma.log(self)
    logfs.folded = self.folded
    logfs.pop_ids = self.pop_ids
    logfs.extrap_x = self.extrap_x
    return logfs

marginalize(over, mask_corners=True)

Reduced dimensionality spectrum summing over some populations.

Parameters:

Name Type Description Default
over list of int

sequence of axes to sum over. For example (0,2) will sum over populations 0 and 2.

required
mask_corners bool

If True, the typical corners of the resulting fs will be masked

True
Source code in dadi/Spectrum_mod.py
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
def marginalize(self, over, mask_corners=True):
    """
    Reduced dimensionality spectrum summing over some populations.

    Args:
        over (list of int): sequence of axes to sum over. For example (0,2) will sum over
            populations 0 and 2.
        mask_corners (bool, optional): If True, the typical corners of the resulting fs will be
                    masked
    """
    original_folded = self.folded
    # If we started with an folded Spectrum, we need to unfold before
    # marginalizing.
    if original_folded:
        output = self.unfold()
    else:
        output = self.copy()

    orig_mask = output.mask.copy()
    orig_mask.flat[0] = orig_mask.flat[-1] = False
    if numpy.any(orig_mask):
        logger.warning('Marginalizing a Spectrum with internal masked values. '
                    'This may not be a well-defined operation.')

    # Do the marginalization
    for axis in sorted(over)[::-1]:
        output = output.sum(axis=axis)
    pop_ids = None
    if self.pop_ids is not None:
        pop_ids = list(self.pop_ids)
        for axis in sorted(over)[::-1]:
            del pop_ids[axis]
    output.folded = False
    output.pop_ids = pop_ids
    output.extrap_x = self.extrap_x

    if mask_corners:
        output.mask_corners()

    # Return folded or unfolded as original.
    if original_folded:
        return output.fold()
    else:
        return output

mask_corners()

Mask the 'seen in 0 samples' and 'seen in all samples' entries.

Source code in dadi/Spectrum_mod.py
178
179
180
181
182
def mask_corners(self):
    """
    Mask the 'seen in 0 samples' and 'seen in all samples' entries.
    """
    self.mask.flat[0] = self.mask.flat[-1] = True

pi()

Estimated expected number of pairwise differences between two chromosomes in the population.

Note that this estimate includes a factor of sample_size / (sample_size - 1) to make E( π̂ ) = θ.

Source code in dadi/Spectrum_mod.py
1218
1219
1220
1221
1222
1223
1224
1225
1226
1227
1228
1229
1230
1231
1232
1233
1234
def pi(self):
    """
    Estimated expected number of pairwise differences between two
    chromosomes in the population.

    Note that this estimate includes a factor of sample_size / (sample_size - 1)
    to make E( π̂ ) = θ.
    """
    if self.ndim != 1:
        raise ValueError("Only defined for a one-dimensional SFS.")

    n = self.sample_sizes[0]
    # sample frequencies p 
    p = numpy.arange(0,n+1,dtype=float)/n
    # This expression derives from Gillespie's _Population_Genetics:_A
    # _Concise_Guide_, 2nd edition, section 2.6.
    return n/(n-1.) * 2*numpy.ma.sum(self*p*(1-p))

project(ns)

Project to smaller sample size.

Parameters:

Name Type Description Default
ns int

Sample sizes for new spectrum.

required

Returns:

Name Type Description
fs Spectrum

Spectrum object with sample sizes ns.

Source code in dadi/Spectrum_mod.py
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
def project(self, ns):
    """
    Project to smaller sample size.

    Args:
        ns (int): Sample sizes for new spectrum.

    Returns:
        fs (Spectrum): Spectrum object with sample sizes ns.
    """
    if len(ns) != self.Npop:
        raise ValueError('Requested sample sizes not of same dimension '
                         'as spectrum. Perhaps you need to marginalize '
                         'over some populations first?')
    if numpy.any(numpy.asarray(ns) > numpy.asarray(self.sample_sizes)):
        raise ValueError('Cannot project to a sample size greater than '
                         'original. Original size is %s and requested size '
                         'is %s.' % (self.sample_sizes, ns))

    original_folded = self.folded
    # If we started with an folded Spectrum, we need to unfold before
    # projecting.
    if original_folded:
        output = self.unfold()
    else:
        output = self.copy()

    # Iterate over each axis, applying the projection.
    for axis,proj in enumerate(ns):
        if proj != self.sample_sizes[axis]:
            output = output._project_one_axis(proj, axis)

    output.pop_ids = self.pop_ids
    output.extrap_x = self.extrap_x

    # Return folded or unfolded as original.
    if original_folded:
        return output.fold()
    else:
        return output

reorder_pops(neworder)

Get Spectrum with populations in new order

Parameters:

Name Type Description Default
neworder list[int]

Integer list defining new order of populations, indexing the orginal populations from 1. Must contain all integers from 1 to number of pops.

required

Returns:

Name Type Description
fs Spectrum

New Spectrum with same number of populations, but in a different order

Source code in dadi/Spectrum_mod.py
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
def reorder_pops(self, neworder):
    """
    Get Spectrum with populations in new order

    Args:
        neworder (list[int]): Integer list defining new order of populations, indexing the orginal
                populations from 1. Must contain all integers from 1 to number of pops.

    Returns:
        fs (Spectrum): New Spectrum with same number of populations, but in a different order
    """
    if sorted(neworder) != [_+1 for _ in range(self.ndim)]:
        raise(ValueError("neworder argument misspecified"))
    newaxes = [_-1 for _ in neworder]
    fs = self.transpose(newaxes)
    if self.pop_ids:
        fs.pop_ids = [self.pop_ids[_] for _ in newaxes]

    return fs

sample()

Generate a Poisson-sampled fs from the current one.

Note

Entries where the current fs is masked will be masked in the output sampled fs.

Source code in dadi/Spectrum_mod.py
738
739
740
741
742
743
744
745
746
747
748
749
750
751
752
753
754
755
756
757
758
759
760
761
762
def sample(self):
    """
    Generate a Poisson-sampled fs from the current one.

    Note:
        Entries where the current fs is masked will be masked in the
          output sampled fs.
    """
    import scipy.stats
    # These are entries where the sampling has no meaning, b/c fs is masked.
    bad_entries = self.mask
    # We convert to a 1-d array for passing into the sampler
    means = self.ravel().copy()
    # Filter out those bad entries.
    means[bad_entries.ravel()] = 1
    # Sample
    samp = scipy.stats.distributions.poisson.rvs(means, size=len(means))
    # Replace bad entries with zero
    samp[bad_entries.ravel()] = 0
    # Convert back to a properly shaped array
    samp = samp.reshape(self.shape)
    # Convert to a fs and mask the bad entries
    samp = Spectrum(samp, mask=self.mask, data_folded=self.folded,
                    pop_ids = self.pop_ids)
    return samp

scramble_pop_ids(mask_corners=True)

Spectrum corresponding to scrambling individuals among populations.

This is useful for assessing how diverged populations are. Essentially, it pools all the individuals represented in the fs and generates new populations of random individuals (without replacement) from that pool. If this fs is significantly different from the original, that implies population structure.

Source code in dadi/Spectrum_mod.py
2160
2161
2162
2163
2164
2165
2166
2167
2168
2169
2170
2171
2172
2173
2174
2175
2176
2177
2178
2179
2180
2181
2182
2183
2184
2185
2186
2187
2188
2189
2190
2191
2192
2193
2194
2195
2196
2197
2198
2199
2200
2201
2202
2203
2204
2205
2206
2207
def scramble_pop_ids(self, mask_corners=True):
    """
    Spectrum corresponding to scrambling individuals among populations.

    This is useful for assessing how diverged populations are.
    Essentially, it pools all the individuals represented in the fs and
    generates new populations of random individuals (without replacement)
    from that pool. If this fs is significantly different from the
    original, that implies population structure.
    """
    original_folded = self.folded
    # If we started with an folded Spectrum, we need to unfold before
    # projecting.
    if original_folded:
        self = self.unfold()

    total_samp = numpy.sum(self.sample_sizes)

    # First generate a 1d sfs for the pooled population.
    combined = numpy.zeros(total_samp+1)
    # For each entry in the fs, this is the total number of derived alleles
    total_per_entry = self._total_per_entry()
    # Sum up to generate the equivalent 1-d spectrum.
    for derived,counts in zip(total_per_entry.ravel(), self.ravel()):
        combined[derived] += counts

    # Now resample back into a n-d spectrum
    # For each entry, this is the counts per popuation. 
    #  e.g. counts_per_entry[3,4,5] = [3,4,5]
    counts_per_entry = self._counts_per_entry()
    # Reshape it to be 1-d, so we can iterate over it easily.
    counts_per_entry = counts_per_entry.reshape(numpy.prod(self.shape), 
                                                self.ndim)
    resamp = numpy.zeros(self.shape)
    for counts, derived in zip(counts_per_entry, total_per_entry.ravel()):
        # The probability here is 
        # (t1 choose d1)*(t2 choose d2)/(ntot choose derived)
        lnprob = sum(_lncomb(t,d) for t,d in zip(self.sample_sizes,counts))
        lnprob -= _lncomb(total_samp, derived)
        prob = numpy.exp(lnprob)
        # Assign result using the appropriate weighting
        resamp[tuple(counts)] += prob*combined[derived]

    resamp = Spectrum(resamp, mask_corners=mask_corners)
    if not original_folded:
        return resamp
    else:
        return resamp.fold()

theta_L()

theta_L as defined by Zeng et al. "Statistical Tests for Detecting Positive Selection by Utilizing High-Frequency Variants" (2006) Genetics

Note that is only sensible for 1-dimensional spectra.

Source code in dadi/Spectrum_mod.py
1175
1176
1177
1178
1179
1180
1181
1182
1183
1184
1185
1186
1187
def theta_L(self):
    """
    theta_L as defined by Zeng et al. "Statistical Tests for Detecting
    Positive Selection by Utilizing High-Frequency Variants" (2006)
    Genetics

    Note that is only sensible for 1-dimensional spectra.
    """
    if self.Npop != 1:
        raise ValueError("Only defined on a one-dimensional fs.")

    n = self.sample_sizes[0]
    return numpy.sum(numpy.arange(1,n)*self[1:n])/(n-1)

to_file(fname, precision=16, comment_lines=[], foldmaskinfo=True)

Write frequency spectrum to file.

Parameters:

Name Type Description Default
fname str

File name to write to. If string ends in .gz, file will be saved with gzip compression.

required
precision int

precision with which to write out entries of the SFS. (They are formated via %.<p>g, where <p> is the precision.)

16
comment_lines list[str]

list of strings to be used as comment lines in the header of the output file.

[]
foldmaskinfo bool

If False, folding and mask and population label information will not be saved. This conforms to the file format for dadi versions prior to 1.3.0.

True

The file format is:

- # Any number of comment lines beginning with a '#'

- A single line containing N integers giving the dimensions of the fs
  array. So this line would be '5 5 3' for an SFS that was 5x5x3.
  (That would be 4x4x2 *samples*.)

- On the *same line*, the string 'folded' or 'unfolded' denoting the
  folding status of the array

- On the *same line*, optional strings each containing the population
  labels in quotes separated by spaces, e.g. "pop 1" "pop 2"

- A single line giving the array elements. The order of elements is 
  e.g.: fs[0,0,0] fs[0,0,1] fs[0,0,2] ... fs[0,1,0] fs[0,1,1] ...

- A single line giving the elements of the mask in the same order as
  the data line. '1' indicates masked, '0' indicates unmasked.
Source code in dadi/Spectrum_mod.py
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
def to_file(self, fname, precision=16, comment_lines = [], 
            foldmaskinfo=True):
    """
    Write frequency spectrum to file.

    Args:
        fname (str): File name to write to.  If string ends in .gz, file will be saved
            with gzip compression.
        precision (int, optional): precision with which to write out entries of the SFS. (They
                are formated via `%.<p>g`, where `<p>` is the precision.)
        comment_lines (list[str], optional): list of strings to be used as comment lines in the header
                    of the output file.
        foldmaskinfo (bool, optional): If False, folding and mask and population label
                    information will not be saved. This conforms to the file
                    format for dadi versions prior to 1.3.0.

    The file format is:

        - # Any number of comment lines beginning with a '#'

        - A single line containing N integers giving the dimensions of the fs
          array. So this line would be '5 5 3' for an SFS that was 5x5x3.
          (That would be 4x4x2 *samples*.)

        - On the *same line*, the string 'folded' or 'unfolded' denoting the
          folding status of the array

        - On the *same line*, optional strings each containing the population
          labels in quotes separated by spaces, e.g. "pop 1" "pop 2"

        - A single line giving the array elements. The order of elements is 
          e.g.: fs[0,0,0] fs[0,0,1] fs[0,0,2] ... fs[0,1,0] fs[0,1,1] ...

        - A single line giving the elements of the mask in the same order as
          the data line. '1' indicates masked, '0' indicates unmasked.
    """
    # Open the file object.
    if fname.endswith('.gz'):
        fid = gzip.open(fname, 'wb')
    else:
        fid = open(fname, 'w')

    # Write comments
    for line in comment_lines:
        fid.write('# ')
        fid.write(line.strip())
        fid.write('\n')

    # Write out the shape of the fs
    for elem in self.data.shape:
        fid.write('%i ' % elem)

    if foldmaskinfo:
        if not self.folded:
            fid.write('unfolded')
        else:
            fid.write('folded')
        if self.pop_ids is not None:
            for label in self.pop_ids:
                fid.write(' "%s"' % label)

    fid.write('\n')

    # Write the data to the file. The obnoxious ravel call is to
    # ensure compatibility with old version that used self.data.tofile.
    numpy.savetxt(fid, [self.data.ravel()], delimiter=' ',
                  fmt='%%.%ig' % precision)

    if foldmaskinfo:
        # Write the mask to the file
        numpy.savetxt(fid, [numpy.asarray(self.mask, int).ravel()],
                      delimiter=' ', fmt='%d')

    fid.close()

unfold()

Unfolded frequency spectrum

It is assumed that each state of a SNP is equally likely to be ancestral.

Note also that unfolding is not done in-place. The return value is a new Spectrum object.

Source code in dadi/Spectrum_mod.py
684
685
686
687
688
689
690
691
692
693
694
695
696
697
698
699
700
701
702
703
704
705
706
707
708
709
710
711
712
713
714
715
def unfold(self):
    """
    Unfolded frequency spectrum

    It is assumed that each state of a SNP is equally likely to be
    ancestral.

    Note also that unfolding is not done in-place. The return value is a new
    Spectrum object.
    """
    if not self.folded:
        raise ValueError('Input Spectrum is not folded.')

    # Unfolding the data is easy.
    reversed_data = reverse_array(self.data)
    newdata = (self.data + reversed_data)/2.

    # Unfolding the mask is trickier. We want to preserve masking of entries
    # that were masked in the original Spectrum.
    # Which entries in the original Spectrum were masked solely because
    # they are incompatible with a folded Spectrum?
    total_samples = numpy.sum(self.sample_sizes)
    total_per_entry = self._total_per_entry()
    where_folded_out = total_per_entry > int(total_samples/2)

    newmask = numpy.logical_xor(self.mask, where_folded_out)
    newmask = numpy.logical_or(newmask, reverse_array(newmask))

    outfs = Spectrum(newdata, mask=newmask, data_folded=False, 
                     pop_ids=self.pop_ids)
    outfs.extrap_x = self.extrap_x
    return outfs

unmask_all()

Unmask all values.

Source code in dadi/Spectrum_mod.py
184
185
186
187
188
def unmask_all(self):
    """
    Unmask all values.
    """
    self.mask[[slice(None)]*self.Npop] = False