Last updated: 2023-05-15

Checks: 7 0

Knit directory: fitnessGWAS/

This reproducible R Markdown analysis was created with workflowr (version 1.7.0.4). The Checks tab describes the reproducibility checks that were applied when the results were created. The Past versions tab lists the development history.


Great! Since the R Markdown file has been committed to the Git repository, you know the exact version of the code that produced these results.

Great job! The global environment was empty. Objects defined in the global environment can affect the analysis in your R Markdown file in unknown ways. For reproduciblity it’s best to always run the code in an empty environment.

The command set.seed(20180914) was run prior to running the code in the R Markdown file. Setting a seed ensures that any results that rely on randomness, e.g. subsampling or permutations, are reproducible.

Great job! Recording the operating system, R version, and package versions is critical for reproducibility.

Nice! There were no cached chunks for this analysis, so you can be confident that you successfully produced the results during this run.

Great job! Using relative paths to the files within your workflowr project makes it easier to run your code on other machines.

Great! You are using Git for version control. Tracking code development and connecting the code version to the results is critical for reproducibility.

The results in this page were generated with repository version 9abaadc. See the Past versions tab to see a history of the changes made to the R Markdown and HTML files.

Note that you need to be careful to ensure that all relevant files for the analysis have been committed to Git prior to generating the results (you can use wflow_publish or wflow_git_commit). workflowr only checks the R Markdown file, but you know if there are other scripts or data files that it depends on. Below is the status of the Git repository when the results were generated:


Ignored files:
    Ignored:    .DS_Store
    Ignored:    .Rapp.history
    Ignored:    .Rhistory
    Ignored:    .Rproj.user/
    Ignored:    .httr-oauth
    Ignored:    .pversion
    Ignored:    analysis/.DS_Store
    Ignored:    code/.DS_Store
    Ignored:    code/Drosophila_GWAS.Rmd
    Ignored:    data/.DS_Store
    Ignored:    data/derived/
    Ignored:    data/input/.DS_Store
    Ignored:    data/input/.pversion
    Ignored:    data/input/dgrp.fb557.annot.txt
    Ignored:    data/input/dgrp2.bed
    Ignored:    data/input/dgrp2.bim
    Ignored:    data/input/dgrp2.fam
    Ignored:    data/input/huang_transcriptome/
    Ignored:    figures/.DS_Store
    Ignored:    old_analyses/.DS_Store

Untracked files:
    Untracked:  figures/Figure 2 edited.pptx
    Untracked:  figures/fig1.pdf
    Untracked:  figures/fig1_font.pdf
    Untracked:  figures/fig2_SNPs_manhattan_plot_edited.png
    Untracked:  old_analyses/Data for old analyses/
    Untracked:  old_analyses/eQTL_analysis.Rmd
    Untracked:  old_analyses/fitness_data.csv
    Untracked:  old_analyses/gcta_quant_genetics_OLD.Rmd
    Untracked:  old_analyses/quantitative_genetics_OLD_brms.Rmd

Unstaged changes:
    Modified:   code/main_paper_figures.Rmd
    Modified:   code/main_paper_figures.docx
    Modified:   code/pdf_supp_material.Rmd
    Modified:   code/pdf_supp_material.pdf
    Modified:   figures/fig2_SNPs_manhattan_plot.png
    Modified:   figures/fig3_boyle_plot.pdf
    Modified:   figures/fig4_mutation_load.pdf
    Modified:   figures/fig5_quartiles_plot.pdf
    Modified:   figures/fig6_antagonism_ratios.pdf
    Modified:   figures/fig7_models.pdf
    Deleted:    output/README.md

Note that any generated files, e.g. HTML, png, CSS, etc., are not included in this status report because it is ok for generated content to have uncommitted changes.


These are the previous versions of the repository in which changes were made to the R Markdown (analysis/perform_gwas.Rmd) and HTML (docs/perform_gwas.html) files. If you’ve configured a remote Git repository (see ?wflow_git_remote), click on the hyperlinks in the table below to view the files as they were in that past version.

File Version Author Date Message
Rmd 9abaadc lukeholman 2023-05-15 wflow_publish("analysis/perform_gwas.Rmd")
html 0ba3c20 lukeholman 2023-05-15 Build site.
Rmd 358afb4 lukeholman 2023-03-21 Tidying up and small edits
html a6ace40 lukeholman 2023-03-16 Build site.
Rmd de878af lukeholman 2023-03-16 wflow_publish("analysis/perform_gwas.Rmd")
html 9de9db7 lukeholman 2023-03-15 Build site.
Rmd 9f5b6a9 lukeholman 2023-03-15 wflow_publish("analysis/perform_gwas.Rmd")
html 17cf6e8 lukeholman 2023-03-15 Build site.
Rmd 284fdf0 lukeholman 2023-03-15 wflow_publish("analysis/perform_gwas.Rmd")
html 8830ad4 lukeholman 2023-03-15 Build site.
Rmd 2080e0c lukeholman 2023-03-15 wflow_publish("analysis/perform_gwas.Rmd")
Rmd d94e22d lukeholman 2023-03-14 New quant gen and other changes
html 0a415e3 lukeholman 2021-11-12 Build site.
html 7449a90 lukeholman 2021-10-01 Build site.
html 9f76189 lukeholman 2021-09-27 Build site.
Rmd 1f52aab lukeholman 2021-09-27 Commit Sept 2021
html 8d14298 lukeholman 2021-09-26 Build site.
Rmd af15dd6 lukeholman 2021-09-26 Commit Sept 2021
html 410fb73 lukeholman 2021-03-04 Build site.
html 871ae81 lukeholman 2021-03-04 Build site.
html e112260 lukeholman 2021-03-04 Build site.
html 836a780 lukeholman 2021-03-04 Build site.
html 359ff37 lukeholman 2021-03-04 Build site.
html 5506c4b lukeholman 2021-03-04 Build site.
html f64a3e9 lukeholman 2021-03-04 Build site.
Rmd fa1b9b0 lukeholman 2021-03-04 big first commit 2021
Rmd 8d54ea5 Luke Holman 2018-12-23 Initial commit
html 8d54ea5 Luke Holman 2018-12-23 Initial commit

library(dplyr)
library(ggplot2)
# library(ggdendro)
library(glue)
library(bigsnpr) # to install:   devtools::install_github("privefl/bigsnpr")
library(readr)
library(pander)
library(purrr)
library(future.apply) # for parallel code

# Load the predicted line means, as calculated by get_predicted_line_means
predicted_line_means <- read_csv("data/derived/predicted_line_means.csv")

# Note: you may need to re-download plink to get this to run on non-Mac systems
# I used bigsnpr::download_plink()
plink <- paste(getwd(), "code/plink", sep = "/")

# # Load the wolbachia infection status data from the DGRP website
# wolbachia <- read_csv("data/input/wolbachia.csv") %>% 
#   mutate(line = paste("line", line, sep = ""))
# 
# # Load the chomosomal inversion data from the DGRP website
# # these are the 5 inversions that Huang et al. PNAS corrected for
# inversions <- read_csv("data/input/inversion genotypes.csv") %>%
#     mutate(line = paste("line", line, sep = "")) %>%
#     select(line, `In(2L)t`, `In(2R)NS`, `In(3R)P`, `In(3R)K`, `In(3R)Mo`) 

# helper function to pass commands to the terminal
# Note that we set `intern = TRUE`, and pass the result of `system()` to `cat()`, 
# ensuring that the Terminal output will be printed in this knitr report. 
run_command <- function(shell_command, wd = getwd(), path = ""){
  cat(system(glue("cd ", wd, path, "\n",shell_command), intern = TRUE), sep = '\n')
}

Perform SNP quality control and imputation

We cleaned up the DGRP’s .bed/.bim/.fam files (available from the Mackay lab website) as follows:

  1. Remove any SNPs for which genotypes are missing for >10% of the DGRP lines. We then use the software Beagle to impute the remaining missing genotypes.
  2. Remove SNPs with a minor allele frequency of less than 5%

Note that in the PLINK-formatted genotype files, lines fixed for the major allele are coded as 2, and lines fixed for the minor allele as 0. This means that in the association tests we calculate, negative effect sizes mean that the minor allele is associated with lower fitness, while positive effect sizes means that the minor allele is associated with higher fitness.

beagle <- bigsnpr::download_beagle()

perform_SNP_QC_and_imputation <- function(phenotypes){
  
  if("block" %in% names(phenotypes)) phenotypes <- phenotypes %>% select(-block)
  
  # Make a list of the lines in our sample and save as a text file for passing to PLINK
  lines_to_keep <- gsub("_", "", phenotypes$line) %>% cbind(.,.)
  write.table(lines_to_keep, row.names = FALSE, col.names = FALSE, file = "data/derived/lines_to_keep.txt", quote = FALSE)
  
  # Define a function to add our phenotype data to a .fam file, which is needed for GWAS analysis and to make sure PLINK includes these samples
  # The 'phenotypes' data frame needs to have a column called 'line'
  add_phenotypes_to_fam <- function(filepath, phenotypes){
    read_delim(filepath, col_names = FALSE, delim = " ") %>% 
      select(X1, X2, X3, X4, X5) %>% # Get all the non-phenotype columns
      left_join(phenotypes %>% 
                  mutate(line = gsub("_", "", line)), 
                by = c("X1" = "line")) %>%
      write.table(file = filepath, 
                  col.names = FALSE, row.names = FALSE, 
                  quote = FALSE, sep = " ")
  }
  
  # Use Plink to clean and subset the DGRP's SNP data as follows:
  # Only keep SNPs for which at least 90% of DGRP lines were successfully genotyped (--geno 0.1)
  # Only keep SNPs with a minor allele frequency of 0.05 or higher (--maf 0.05)
  # Finally, write the processed BIM/BED/FAM files to the data/derived directory
  run_command(glue("{plink} --bfile dgrp2",
                   " --geno 0.1 --maf 0.05 --allow-no-sex", 
                   " --make-bed --out ../derived/dgrp2_QC_all_lines"), path = "/data/input/")
  
  # Use the shell command 'sed' to remove underscores from the DGRP line names in the .fam file (e.g. 'line_120' becomes 'line120')
  # Otherwise, these underscores cause trouble when we need to convert from PLINK to vcf format (vcf format uses underscore as a separator)
  for(i in 1:2) run_command("sed -i '' 's/_//' dgrp2_QC_all_lines.fam", path = "/data/derived/")
  
  # Now impute the missing genotypes using Beagle
  # This part uses the data for the full DGRP panel of >200 lines, to infer missing genotypes as accurately as possible. 
  # This step uses a lot of memory (I set to 28MB max, and it used 26.5GB), but maybe it can also run on a less powerful computer?
  # The bigsnpr package provides a helpful wrapper for Beagle called snp_beagleImpute(): it translates to a VCF file and back again using PLINK
  snp_beagleImpute(beagle, plink, 
                   bedfile.in = "data/derived/dgrp2_QC_all_lines.bed", 
                   bedfile.out = "data/derived/dgrp2_QC_all_lines_imputed.bed",
                   ncores = 7, 
                   memory.max = 20)
  
  # assign a sex of 'female' to all the DGRP lines (Beagle removes the sex, and it seems PLINK needs individuals to have a sex, 
  # despite PLINK having a command called --allow-no-sex)
  run_command("sed -i '' 's/    0   0   0/  0   0   2/' dgrp2_QC_all_lines_imputed.fam", path = "/data/derived/")
  
  # Re-write the .bed file, to make sure the MAF threshold and minor/major allele designations are correctly assigned post-Beagle
  run_command(glue("{plink} --bfile dgrp2_QC_all_lines_imputed",
                   " --geno 0.1 --maf 0.05 --allow-no-sex", 
                   " --make-bed --out dgrp2_QC_all_lines_imputed_correct"), path = "/data/derived/")
  #unlink(list.files("data/derived", pattern = "~", full.names = TRUE)) # delete the original files, which were given a ~ file name by PLINK
  
  # Use PLINK to get the allele IDs and calculate the MAFs across the whole DGRP, for all SNPs that survived QC
  # The file created is called data/derived/plink.frq
  run_command("{plink} --bfile dgrp2_QC_all_lines_imputed_correct --freq", path = "/data/derived")

  # Now cull the PLINK files to just the lines that we measured, and re-apply the 
  # MAF cut-off of 0.05 for the new smaller sample of DGRP lines
  run_command(glue("{plink} --bfile dgrp2_QC_all_lines_imputed_correct",
                   " --keep-allele-order", # force PLINK to retain the major/minor allele designations that apply to the DGRP as a whole
                   " --keep lines_to_keep.txt --geno 0.1 --maf 0.05", 
                   " --make-bed --out dgrp2_QC_focal_lines"), path = "/data/derived/")
  
  # edit the new FAM file to add the phenotype data from 'phenotypes'
  add_phenotypes_to_fam("data/derived/dgrp2_QC_focal_lines.fam", phenotypes)
  
  # Clean up:
  unlink(c("data/derived/lines_to_keep.txt", 
           "data/derived/plink.log",
           "data/derived/dgrp2_QC_all_lines_imputed.bed",
           "data/derived/dgrp2_QC_all_lines_imputed.bim",
           "data/derived/dgrp2_QC_all_lines_imputed.fam",
           "data/derived/dgrp2_QC_all_lines_imputed.log",
           "data/derived/dgrp2_QC_all_lines_imputed_correct.bed",
           "data/derived/dgrp2_QC_all_lines_imputed_correct.bim",
           "data/derived/dgrp2_QC_all_lines_imputed_correct.fam",
           "data/derived/dgrp2_QC_all_lines_imputed_correct.log"))
}

perform_SNP_QC_and_imputation(phenotypes = predicted_line_means)

Run GWAS using GEMMA

Decompose the genomic relatedness matrix

This step is needed as a set up for the mixed model association tests, which use the eigenvectors and values of the GRM to adjust for ‘population structure’ among our samples.

# Any command with {bed} uses this particular PLINK file, double-checking that all the SNPs have MAF > 0.05
bed <- "dgrp2_QC_focal_lines -maf 0.05" 
#cov <- "-c covariate_file_for_GEMMA.txt" # not used at present

# Calculate the *centered* GRM for the focal lines (option -gk 1)
run_command("gemma -bfile {bed} -r2 0.99 -gk 1 -o GRM", path = "/data/derived") 

# Perform decomposition of the GRM, and save its eigenvalues and eigenvectors
run_command("gemma -bfile {bed} -r2 0.99 -k ./output/GRM.cXX.txt -eigen -o eigen_decomp", path = "/data/derived") 

# Remove eigen vectors + values for which the eigenvalue is close to zero, and save the modified file
values <- read_tsv("data/derived/output/eigen_decomp.eigenD.txt", col_names = "eigenvalue")$eigenvalue
vectors <- read_tsv("data/derived/output/eigen_decomp.eigenU.txt", col_names = FALSE)
missing_row <- which(is.na(read_delim("data/derived/dgrp2_QC_focal_lines.fam", delim = " ", col_names = FALSE)$X8))

write_tsv(vectors, path = "data/derived/output/eigen_decomp.eigenU_female.txt", col_names = FALSE)
write_tsv(data.frame(values), path = "data/derived/output/eigen_decomp.eigenD_female.txt", col_names = FALSE)

write_tsv(vectors[-missing_row, -missing_row], path = "data/derived/output/eigen_decomp.eigenU_male.txt", col_names = FALSE)
write_tsv(data.frame(values[-missing_row]), path = "data/derived/output/eigen_decomp.eigenD_male.txt", col_names = FALSE)
unlink("data/derived/output/eigen_decomp.eigenD.txt")
unlink("data/derived/output/eigen_decomp.eigenU.txt")

Run univariate association tests

The following code chunk runs 4 linear mixed models, implemented in GEMMA. Each linear mixed models uses the decomposed genomic relatedness matrix from above, and has one of our four fitness traits as the response variable.

# The option "-lmm 1" runs a linear mixed model using Wald test to get the p-values
GRM_female <- "-d ./output/eigen_decomp.eigenD_female.txt -u ./output/eigen_decomp.eigenU_female.txt"
GRM_male <- "-d ./output/eigen_decomp.eigenD_male.txt -u ./output/eigen_decomp.eigenU_male.txt"

c("gemma -bfile {bed} {GRM_female} -lmm 1 -n 1 -o female_early_lmm",
  "gemma -bfile {bed} {GRM_female} -lmm 1 -n 2 -o female_late_lmm",
  "gemma -bfile {bed} {GRM_male} -lmm 1 -n 3 -o male_early_lmm",
  "gemma -bfile {bed} {GRM_male} -lmm 1 -n 4 -o male_late_lmm") %>%
  lapply(run_command, path = "/data/derived")

Add the allele IDs and minor allele frequencies (MAFs) to variant annotation database

This part uses the full DGRP panel, since our aim is to determine how common these alleles are in the original wild population from which the DGRP was captured. These MAFs are the ones used in our statistical analyses and plots.

# Extract and save the MAFs, SNP positions, and major/minor alleles
MAFs <- read.table("data/derived/plink.frq", 
                   header = TRUE, stringsAsFactors = FALSE) %>% 
  mutate(position = map_chr(
    strsplit(SNP, split = "_"), 
    function(x) x[2])) %>%
  select(SNP, position, MAF, A1, A2) %>% 
  rename(minor_allele = A1,
         major_allele = A2) 

db <- DBI::dbConnect(RSQLite::SQLite(), 
                     "data/derived/annotations.sqlite3", create = FALSE)

MAFs <- tbl(db, "variants") %>% 
  select(SNP, FBID, site.class, distance.to.gene, chr) %>% collect() %>%
  full_join(MAFs, by = "SNP") %>%
  filter(!is.na(MAF)) %>%
  mutate(site.class = replace(site.class, is.na(site.class), "INTERGENIC"))

# Delete the original variant annotation table from the db, and add the new one back in
db %>% db_drop_table(table = "variants") 
db %>% copy_to(MAFs, 
               "variants", temporary = FALSE, 
               indexes = list("SNP", "FBID", "chr", "site.class")) 
rm(MAFs)

Clean up the GEMMA results files

# The 4 univariate files:
c("data/derived/output/female_early_lmm.assoc.txt",
  "data/derived/output/female_late_lmm.assoc.txt",
  "data/derived/output/male_early_lmm.assoc.txt",
  "data/derived/output/male_late_lmm.assoc.txt") %>%
  lapply(function(filename){
    df <- read_tsv(filename)
    names(df)[names(df) == "rs"] <- "SNP"
    df %>% select(SNP, beta, se, p_wald) %>% 
      write_tsv(filename)
  })

Concatenate the results of the four GWAS

Load up the GWAS results, which were produced by using linear mixed models implemented in GEMMA, and arrange the data in a ‘wide’ format, where each row is one SNP, and the columns hold the estimates of \(\beta\), the standard error of \(\beta\), and the p-value for the GEMMA association tests on each of the four fitness traits.

load_gwas_results <- function(fitness.component){
  new_names <- c(x = "beta", y = "se", z = "p_wald")
  names(new_names) <- c(paste("beta_", fitness.component, sep = ""), 
                        paste("SE_", fitness.component, sep = ""), 
                        paste("pvalue_", fitness.component, sep = ""))
  
  paste("data/derived/output/", fitness.component, "_lmm.assoc.txt", sep = "") %>%
    read_tsv() %>%
    select(SNP, beta, se, p_wald) %>%
    rename(!!new_names)
}
all_univariate_lmm_results <- load_gwas_results("female_early") %>%
  full_join(load_gwas_results("female_late"), by = "SNP") %>%
  full_join(load_gwas_results("male_early"), by = "SNP") %>%
  full_join(load_gwas_results("male_late"), by = "SNP") 

Group together SNPs that are in 100% linkage disequilibrium

Here, we take advantage of the fact that SNPs in 100% LD with one another other have identical estimates of \(|\beta|\) and its standard error, for all four fitness traits. The following code groups together SNPs that have identical association test statistics, producing a data frame in which the number of rows is equal to the number of individual SNPs and groups of SNPs in 100% LD.

# Group loci with identical GWAS results (these are in 100% LD with each other)
all_univariate_lmm_results <- all_univariate_lmm_results %>%
  mutate(pasted = paste(beta_female_early, beta_female_late, beta_male_early, beta_male_late, 
                        SE_female_early, SE_female_late, SE_male_early, SE_male_late)) %>%
  group_by(pasted) %>%
  summarise(SNPs = paste0(SNP, collapse = ", "),    # If there are multiple SNPs, concatenate their names
            beta_female_early = beta_female_early[1], 
            beta_female_late = beta_female_late[1], 
            beta_male_early = beta_male_early[1], 
            beta_male_late = beta_male_late[1],
            SE_female_early = SE_female_early[1], 
            SE_female_late = SE_female_late[1], 
            SE_male_early = SE_male_early[1], 
            SE_male_late = SE_male_late[1],
            pvalue_female_early = pvalue_female_early[1], 
            pvalue_female_late = pvalue_female_late[1], 
            pvalue_male_early = pvalue_male_early[1], 
            pvalue_male_late = pvalue_male_late[1]) %>%
  ungroup() %>% select(-pasted) %>% 
  arrange(SNPs)


# A handful of SNPs with low MAF could not be estimated and they have 'NA' for beta and SE, so remove them now
# These are SNPs where removing line 354 (where we measured female but not male fitness) pushes the MAF below 0.05 for males
SNPs_to_remove <- all_univariate_lmm_results$SNPs[!complete.cases(all_univariate_lmm_results)] 

all_univariate_lmm_results <- all_univariate_lmm_results %>% 
  filter(!(SNPs %in% SNPs_to_remove))

write_csv(all_univariate_lmm_results, file = "data/derived/all_univariate_GEMMA_results.csv")
unlink("data/derived/output/female_early_lmm.assoc.txt")
unlink("data/derived/output/female_late_lmm.assoc.txt")
unlink("data/derived/output/male_early_lmm.assoc.txt")
unlink("data/derived/output/male_late_lmm.assoc.txt")