
· 5 min read
Pavel Dimens

A bit about FST

If you have spent any time being exposed to population genetics, you've likely encountered the term FST, our field's most beloved/maligned one-size-fits-all coefficient. That description is a gross oversimplification, but the purpose of this post isn't to dig deep into the world of differentiation statistics, but instead to shed light on the general concept and the logic behind its significance testing.

If you're like me, you've used and reported some kind of pairwise XST statistic in your work, where that value was generated using some fantastic software like adegenet or GenoDive, and you trusted that the methods were honest and appropriate (they are!). Part of the motivation behind PopGen.jl was to invest time into learning and understanding some of these methods I use, despite not coming from a mathematics or computer science background. Well, as I actually learn stuff and decode methods (no pun intended), I feel like it might be worthwhile to share some of my basic insights with others, in a way that I wish I could have been taught. So, this time it's FST. The purpose of this post is to cover the larger concepts without getting into the overwhelming math. Let's go!

What is FST?

To understand FST, we need to start with F, the inbreeding coefficient devised by brilliant-but-ethically-questionable Sewall Wright. The F coefficient was a mathematically simple way to understand the level of inbreeding in a group, originally described for cattle breeds (the 🐄 part isn't relevant, but it's a fun fact). There were three circumstantial versions of F, denoted by very straightforward subscripted letters: FIT, FIS, FST. The I stands for Individual, S for Subpopulation, T for Total. The combination of letters explains what relationship you are measuring: FIT was the comparison of F for an Individual vs the Total, FIS was the comparison of F for an Individual vs a Subpopulation.

Logically, FST would then mean you were looking at F for Subpopulations vs the Total, which is where population geneticists get all excited. The values range from 0 to 1, with 0 being panmixia (fully mixed) and 1 being complete isolation. Although it seems like it would be linear, it's not: Wright considered 0.125 to be the cutoff between panmictic and divergent. If one were to calculate FST for one large group of individuals and get a value >0.125, this would suggest there is population subdivision happening (exciting!!).

Pairwise FST

Going one step further, you can narrow how you calculate FST to identify different trends. A "pairwise" FST is when you calculate FST for only a pair of populations at a time. With a pairwise FST, you are now testing if two populations are panmictic or divergent with each other. Depending on your study system and the questions you're trying to answer, this can be extremely valuable information.

From a genetic standpoint, it looks kinda like this:

[figure: fst_diagram]

This is an over-simplified system (diploid, single locus, biallelic, ignoring Hardy-Weinberg stuff), but it demonstrates the point. When you are investigating two groups, if they are both completely heterozygous for the locus (for the same alleles), you would assume the groups are fully mixed (FST = 0). And that makes sense, right? Given the information required to calculate FST, it's a reasonable conclusion to say those alleles are constantly shared between the groups. And the opposite then should be true too: if both groups are completely fixed for different alleles (FST = 1), then they clearly aren't sharing alleles with each other.
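For the curious, the simplest textbook way of writing this (hedging here: there are several estimators, and this plain heterozygosity-based form is not necessarily the one your software uses) is

FST = (HT - HS) / HT

where HS is the average expected heterozygosity within the subpopulations and HT is the expected heterozygosity of everyone pooled together. In the fully mixed case above, both groups have allele frequencies of 0.5, so HS = HT = 0.5 and FST = 0; when each group is fixed for a different allele, HS = 0 while the pooled frequencies are still 0.5, so HT = 0.5 and FST = 1.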

Pitfalls

Like I alluded to in the opening paragraph, FST (and its derivatives) isn't a golden rule. There are natural phenomena that FST cannot account for, which may lead to misleading results or incorrect interpretations. For one, there are Hardy-Weinberg assumptions, so the presence of kin, overlapping generations, etc. kinda messes things up. Another consideration is that the genetic data we collect now is only a snapshot in time. If two populations are completely isolated from one another and have been for generations, but are long lived and slow to evolve, then FST may mislead you into believing they are panmictic (that happened in my shark study). These are just a few examples and there are more.

Significance testing

One way of being more rigorous with FST values is significance testing (you know, generating those p-values everyone loves so much). So the question is, how do we do that? One common (frequentist) solution is permutation testing. The rationale is this: we get some kind of observed FST with our samples arranged in their natural populations, so would that FST look the same if we shuffled the samples up? Honestly, that's the gist of it. We calculate our observed FST value, then we randomly shuffle the samples into two new populations and recalculate the pairwise FST for the new population pair. Then we just keep shuffling and recalculating FST a few thousand times. After so many iterations, we compare how many times our observed (original) FST value was greater than the permuted FST values. This is known as a one-tailed test, since we're only interested in knowing if our observed FST is greater than randomness, meaning we're making the case that FST (and therefore divergence) is highest with our samples in this non-random population configuration.
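If it helps to see that logic laid out, here is a bare-bones sketch in Julia. compute_fst is a hypothetical stand-in for whichever estimator you prefer (it is not a real PopGen.jl function), and pop1/pop2 are simply vectors of samples.

using Random: shuffle

# Sketch of a one-tailed permutation test for pairwise FST.
# `compute_fst` is a placeholder you supply; it is not a PopGen.jl function.
function permutation_pvalue(pop1, pop2; compute_fst, n_perm = 10_000)
    observed = compute_fst(pop1, pop2)      # FST with the natural groupings
    pooled   = vcat(pop1, pop2)             # throw every sample into one pool
    n1       = length(pop1)
    hits     = 0                            # count of permuted FST >= observed
    for _ in 1:n_perm
        shuffled = shuffle(pooled)          # randomly reassign samples to two groups
        hits    += compute_fst(shuffled[1:n1], shuffled[n1+1:end]) >= observed
    end
    return (hits + 1) / (n_perm + 1)        # one-tailed p-value
end

A small p-value means very few random shufflings produced an FST as large as the observed one, which is exactly the evidence we want for real subdivision.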

Other means

The field of population genetics has expanded tremendously since the Fisher and Wright era, and there are now all sorts of interesting ways to identify population subdivision. Each method has its strengths and weaknesses, which is why it's good practice to try multiple things and find agreement between methods. At the time of this writing, PopGen.jl can calculate pairwise FST and perform significance testing using the Nei 1987, Weir & Cockerham 1984, and Hudson et al. 1992 methods. If you're interested in contributing to PopGen.jl, we'd love to have you!
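If you want to try it on your own data, the call is roughly along the lines below. I'm writing this from memory, so treat the exact function and method names as assumptions and check the PopGen.jl docs for the real API.

julia> pairwisefst(data, method = WeirCockerham, iterations = 1000)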

· 11 min read
Pavel Dimens
PopGen.jl <0.9.0

The kinship interface has changed a bit between versions 0.7 and 0.9. This post has not yet been updated for versions 0.9.0+. To follow along, use versions 0.8.0 or lower.

Getting Started

In a population genetics study, you often need to identify whether there are kin in your data. This may be necessary because you are trying to remove kin from your data (because of Hardy-Weinberg assumptions), or maybe kinship is a central interest in your study. Either way, the goal of this post is to walk you through a basic relatedness analysis, sometimes called a kinship analysis, using PopGen.jl. To follow along, you'll need to have Julia installed, along with the packages PopGen.jl, PopGenSims.jl, and StatsBase.jl. We'll be using the nancycats data because it's smaller than gulfsharks, so things should be a lot quicker.

using PopGen, PopGenSims, StatsBase

julia> cats = @nancycats
PopData{Diploid, 9 Microsatellite Loci}
Samples: 237
Populations: 17

Estimators

Like Coancestry and the R packages that wrap it (i.e. relate, related), PopGen.jl provides a whole bunch of relatedness estimators that you can choose from for your data. Unfortunately, there is no single right answer and you will need to use your discretion. Some people choose an estimator based on the heterozygosity of their data, others choose one with more liberal or conservative values, and there are yet more criteria one can consider when choosing an estimator. To keep things simple, we're going to use LynchLi. Why? Because I'm the one writing this tutorial, and I said so.

The Steps

1. Calculate pairwise relatedness

This seems pretty obvious, but it needs to be said. In order to do the analysis, you need to get the pairwise relatedness values for the individuals of interest in your data. To keep things simple, we're going to do an all-by-all comparison. But we also want to bootstrap the pairs to create confidence intervals ("CI") for each pair, so let's talk about that.

2. Bootstrap to calculate CI

It would behoove us to bootstrap the loci in a pairwise comparison n number of times so we can create a confidence interval for the relatedness estimate of each pair. This inflates the runtime of the analysis substantially, but it's a very useful method for making sense of our estimates. If one were to be conservative (and we generally are), then we would reject an estimate for a pair whose CI includes zero. In PopGen.jl, the estimates and bootstrapping are done all at once to minimize processing time, so the command for that would be

julia> rel_out  = kinship(cats, method = LynchLi, iterations = 1000)

By default, the kinship function uses a 95% CI (interval = (0.025, 0.975)), but you can change that with interval = (low,high) where low and high are decimals of your quantiles. The kinship() function returns a NamedTuple of dataframes whenever you are bootstrapping, where each element shares its name with the method used, so in this case, we can access our results with rel_out.LynchLi.
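If you wanted, say, a 90% interval instead, the same call would look like this:

julia> rel_out90 = kinship(cats, method = LynchLi, iterations = 1000, interval = (0.05, 0.95))

Back to the default 95% results: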

julia> rel_out.LynchLi
27966×8 DataFrame
   Row │ sample_1  sample_2  n_loci  LynchLi     LynchLi_mean  LynchLi_median  LynchLi_S ⋯
       │ String    String    Int64   Float64?    Float64?      Float64?        Float64?  ⋯
───────┼───────────────────────────────────────────────────────────────────────────────────
     1 │ N215      N216           8   0.743535      0.747288       0.748042      0.75344 ⋯
     2 │ N215      N217           8   0.230605      0.233593       0.240085      0.34187
     3 │ N215      N218           8   0.230605      0.230507       0.221861      0.32161
     4 │ N215      N219           8   0.230605      0.23601        0.23567       0.32782
     5 │ N215      N220           8   0.333191      0.33798        0.350492      0.39898 ⋯
     6 │ N215      N221           8   0.589656      0.594223       0.601308      0.61945
     7 │ N215      N222           8   0.0254328     0.0347216      0.0262021     0.21408
     8 │ N215      N223           8   0.333191      0.329983       0.331411      0.38402
     9 │ N215      N224           8  -0.0258602    -0.021062      -0.0301579     0.21112 ⋯
    10 │ N215      N7             8  -0.282325     -0.27967       -0.288337      0.33611
    11 │ N215      N141           8  -0.0771532    -0.0796867     -0.083113      0.21261
    12 │ N215      N142           8   0.0254328     0.0302549      0.0330718     0.23957
   ⋮   │    ⋮         ⋮        ⋮         ⋮            ⋮              ⋮             ⋮     ⋱
 27955 │ N295      N289           7   0.322731      0.347021       0.34118       0.41168 ⋯
 27956 │ N295      N290           7   0.153414      0.160102       0.164866      0.22862
 27957 │ N296      N297           7  -0.0159038    -0.0182747     -0.0187108     0.16981
 27958 │ N296      N281           7   0.0405353     0.037025       0.0422647     0.15294
 27959 │ N296      N289           7   0.322731      0.328379       0.337317      0.35578 ⋯
 27960 │ N296      N290           7   0.153414      0.152384       0.16194       0.19131
 27961 │ N297      N281           7  -0.0159038    -0.0128349     -0.0303449     0.21030
 27962 │ N297      N289           7   0.37917       0.392517       0.392818      0.45139
 27963 │ N297      N290           7   0.435609      0.437829       0.450044      0.47027 ⋯
 27964 │ N281      N289           8   0.20428       0.21279        0.207425      0.29611
 27965 │ N281      N290           7   0.37917       0.386583       0.390585      0.45471
 27966 │ N289      N290           7   0.209853      0.217811       0.222894      0.28649
2 columns and 27942 rows omitted

3. Create CIs for the sibships

There's more???

The next set of steps seems like a lot more work, so allow me to explain. The estimators all generally give you some value between 0-1 (or 0-0.5, same idea) and you can intuit that certain values mean certain things, like 0 being "unrelated", 0.25 being "half-sib", and 0.5 being "full-sib". However, those are fixed values, so how do we know how far we can deviate from 0.25 (for example) and still call our pair half-siblings? Instead of hand-waving, we can create confidence intervals from simulated data to act as sibship ranges for our data. If this doesn't make sense yet, it will below. Promise!

i. simulate known sibship pairs

Next, we need to further contextualize what our estimates actually mean, and we need to devise a rigorous and defensible method to define the sibling-ship ("sibship") of a pair of samples as unrelated, half-sibs, or full-sibs. To do this, we're going to use PopGenSims.jl to simulate data based on the allele frequencies in our data. What simulatekin does is simulate individuals based on the allele frequencies in your PopData, then simulate offspring of a particular sibship resulting from the "mating" of those individuals. For example, when you specify "fullsib", it generates two offspring resulting from two parents, n number of times. We want to do this for each of the three kinds of sibships.

julia> kin_sims = simulatekin(cats, fullsib = 500, halfsib = 500, unrelated = 500)
PopData{Diploid, 9 Microsatellite Loci}
Samples: 3000
Populations: 3

We can keep all three simulated relationships together, but it will be easier to explain things (for instructional purposes) if we don't, so let's split out each into its own PopData.

julia> fullsib = kin_sims[genodata(kin_sims).population .== "fullsib"] ;
julia> halfsib = kin_sims[genodata(kin_sims).population .== "halfsib"] ;
julia> unrelated = kin_sims[genodata(kin_sims).population .== "unrelated"] ;

ii. get relatedness estimates for the simulated data

Next, we want to get the relatedness estimate for each simulated pair of "known" sibship. We are only interested in the values for the simulated pairs and not samples across pairs. If you aren't sure why that is, think of it this way: we're trying to create a range of values where we can confidently say unknown things are full-sibs (or half-sib, etc.), so we want to know what range of values we get from a bunch of known fullsib pairs, not the unknown relationships of samples between pairs.

It would be a nightmare to manually specify only 2 individuals at a time in kinship(). Instead, the function has a shortcut built into it that recognizes the population names generated by simulatekin and only gives you relatedness estimates for those pairs. So, we just need to run it once for each of our simulations, this time without bootstrapping because we are only interested in the estimates. Make sure to use the same estimator! The run will be a lot faster this time because it only needs to calculate estimates for 500 pairs each time (our n from above) and without bootstrapping. When not bootstrapping, kinship() returns a dataframe (versus a NamedTuple of dataframes).

julia> un_sims_rel = kinship(unrelated, method = LynchLi)
500×4 DataFrame
 Row │ sample_1            sample_2            n_loci  LynchLi
     │ String              String              Int64   Float64?
─────┼──────────────────────────────────────────────────────────
   1 │ sim001_unrelated_1  sim001_unrelated_2       9  -0.11419
   2 │ sim002_unrelated_1  sim002_unrelated_2       9  -0.337028
   3 │ sim003_unrelated_1  sim003_unrelated_2       9  -0.0696222
  ⋮  │         ⋮                   ⋮              ⋮        ⋮
 498 │ sim498_unrelated_1  sim498_unrelated_2       9   0.019513
 499 │ sim499_unrelated_1  sim499_unrelated_2       9   0.019513
 500 │ sim500_unrelated_1  sim500_unrelated_2       9   0.019513
494 rows omitted
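
We then do the same for the simulated half-sib and full-sib pairs (same estimator, still no bootstrapping), which gives us the half_sims_rel and full_sims_rel dataframes used in the plot below:

julia> half_sims_rel = kinship(halfsib, method = LynchLi)
julia> full_sims_rel = kinship(fullsib, method = LynchLi)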

And if we wanted to plot what that looks like (optional):

using Plots, StatsPlots

julia> density(rel_out.LynchLi.LynchLi, label = "real data", color = :grey, fill = (0, :grey))
julia> density!(un_sims_rel.LynchLi, label = "unrelated")
julia> density!(half_sims_rel.LynchLi, label = "halfsib", color = :blue)
julia> density!(full_sims_rel.LynchLi, label = "fullsib", color = :black)
julia> title!("relatedness estimates on simulated and real data")

[figure: relatedness_histo]

Hopefully by now you are starting to contextualize why we're doing all of this. The distributions generated from our simulated data are giving us a better indication of what "unrelated", "halfsib", and "fullsib" estimates look like in our data.

iii. sibship intervals

What we just did is create null distributions for each sibship relationship, so now all that's left is to get a confidence interval from each. Keep in mind that your values will be a bit different due to the randomization involved with many of these steps.

julia> unrelated_ci = quantile(un_sims_rel.LynchLi, (0.025, 0.975))
(-0.3380513384854698, 0.33097433075726507)

julia> halfsibs_ci = quantile(half_sims_rel.LynchLi, (0.025, 0.975))
(-0.06652262584414452, 0.5556155725649398)

julia> fullsibs_ci = quantile(full_sims_rel.LynchLi, (0.025, 0.975))
(0.1989727186688347, 0.8219939374819633)

So, given our data and the simulations we made, we can now make a reasonable assumption regarding the ranges for each sibship relationship:

Relationship     Lower Bound    Upper Bound
unrelated        -0.338          0.331
half sibling     -0.066          0.555
full sibling      0.199          0.82

4. Finally, the data assessment

Now that we have our relatedness estimates and the acceptable sibship ranges given our data, we can see where our data falls. We can say that a particular sample pair is unrelated/halfsib/fullsib if:

  1. that pair's confidence interval does not include zero, and
  2. that pair's estimate falls within any of the three calculated ranges (a minimal sketch of this check follows the list)
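
To make that concrete, here is a minimal sketch of applying both criteria to the bootstrapped results. The CI column names (LynchLi_CI_lower, LynchLi_CI_upper) are placeholders I invented for illustration; run names(rel_out.LynchLi) to see what your kinship() output actually calls them.

using DataFrames

df = rel_out.LynchLi   # the bootstrapped estimates from step 1

# `sib_range` is one of the (lower, upper) tuples from step 3, e.g. fullsibs_ci
function is_sibship(est, ci_lo, ci_hi, sib_range)
    any(ismissing, (est, ci_lo, ci_hi)) && return false
    !(ci_lo <= 0 <= ci_hi) &&                         # 1. the pair's CI excludes zero
        first(sib_range) <= est <= last(sib_range)    # 2. the estimate falls in the sibship range
end

# hypothetical CI column names -- substitute the real ones from your output
fullsib_pairs = filter(row -> is_sibship(row.LynchLi, row.LynchLi_CI_lower,
                                         row.LynchLi_CI_upper, fullsibs_ci), df)

Repeating this with halfsibs_ci and unrelated_ci gives you candidate assignments for the other two categories.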

Unfortunately, the ranges overlap quite a bit, which makes interpretation difficult, so it may be suitable to use a different estimator.

Closing remarks

There is more that can be done with relatedness, like network analysis. However, this tutorial covers what we consider a reasonable way to assess kinship in population genetic studies. If removing kin is your ultimate goal, consider the effects doing so may have on your downstream analyses. Additionally, consider which individual of a pair you would remove and why. If you were curious how many identical loci a pair of samples share, you can check that using pairwiseidentical() (quick example below). Good luck!
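
A minimal example of that check on the whole dataset (the no-argument, all-by-all form is my assumption from memory, so double-check the PopGen.jl docs):

julia> pairwiseidentical(cats)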

· One min read
Pavel Dimens

If you haven't guessed already, it's the beloved Punnett Square! Nothing screams genetics like everyone's first entry-level genetic diagram!

The Logo Graveyard

While this is completely unrelated to anything important about population genetics or Julia, I (Pavel) want you all to understand the logo-development process that led us to our logo so you can feel my struggle. The process is as follows:

  1. Jason and I spitball ideas

  2. I procrastinate real work and open up Inkscape to draft some ideas

  3. I compose 1-3 versions of an idea and send it to Jason

  4. if Jason != veto
       for i in 1:5
         Jason critiques and suggests changes
         I make the changes
       end
     end
  5. Jason and I finalize the idea!

  6. Weeks pass and I'm not quite satisfied and we start at 1 again

I'm happy to say that we love the Punnett Square and it's for keeps, but have a look at the scrapped ideas as a little walk down memory lane:

[figure: logo graveyard]