Last updated: 2023-05-15
Checks: 7 passed, 0 failed
Knit directory: fitnessGWAS/
This reproducible R Markdown analysis was created with workflowr (version 1.7.0.4). The Checks tab describes the reproducibility checks that were applied when the results were created. The Past versions tab lists the development history.
Great! Since the R Markdown file has been committed to the Git repository, you know the exact version of the code that produced these results.
Great job! The global environment was empty. Objects defined in the global environment can affect the analysis in your R Markdown file in unknown ways. For reproducibility it’s best to always run the code in an empty environment.
The command set.seed(20180914) was run prior to running the code in the R Markdown file. Setting a seed ensures that any results that rely on randomness, e.g. subsampling or permutations, are reproducible.
Great job! Recording the operating system, R version, and package versions is critical for reproducibility.
Nice! There were no cached chunks for this analysis, so you can be confident that you successfully produced the results during this run.
Great job! Using relative paths to the files within your workflowr project makes it easier to run your code on other machines.
Great! You are using Git for version control. Tracking code development and connecting the code version to the results is critical for reproducibility.
The results in this page were generated with repository version 9abaadc. See the Past versions tab to see a history of the changes made to the R Markdown and HTML files.
Note that you need to be careful to ensure that all relevant files for the analysis have been committed to Git prior to generating the results (you can use wflow_publish or wflow_git_commit). workflowr only checks the R Markdown file, but you know if there are other scripts or data files that it depends on. Below is the status of the Git repository when the results were generated:
Ignored files:
Ignored: .DS_Store
Ignored: .Rapp.history
Ignored: .Rhistory
Ignored: .Rproj.user/
Ignored: .httr-oauth
Ignored: .pversion
Ignored: analysis/.DS_Store
Ignored: code/.DS_Store
Ignored: code/Drosophila_GWAS.Rmd
Ignored: data/.DS_Store
Ignored: data/derived/
Ignored: data/input/.DS_Store
Ignored: data/input/.pversion
Ignored: data/input/dgrp.fb557.annot.txt
Ignored: data/input/dgrp2.bed
Ignored: data/input/dgrp2.bim
Ignored: data/input/dgrp2.fam
Ignored: data/input/huang_transcriptome/
Ignored: figures/.DS_Store
Ignored: old_analyses/.DS_Store
Untracked files:
Untracked: figures/Figure 2 edited.pptx
Untracked: figures/fig1.pdf
Untracked: figures/fig1_font.pdf
Untracked: figures/fig2_SNPs_manhattan_plot_edited.png
Untracked: old_analyses/Data for old analyses/
Untracked: old_analyses/eQTL_analysis.Rmd
Untracked: old_analyses/fitness_data.csv
Untracked: old_analyses/gcta_quant_genetics_OLD.Rmd
Untracked: old_analyses/quantitative_genetics_OLD_brms.Rmd
Unstaged changes:
Modified: code/main_paper_figures.Rmd
Modified: code/main_paper_figures.docx
Modified: code/pdf_supp_material.Rmd
Modified: code/pdf_supp_material.pdf
Modified: figures/fig2_SNPs_manhattan_plot.png
Modified: figures/fig3_boyle_plot.pdf
Modified: figures/fig4_mutation_load.pdf
Modified: figures/fig5_quartiles_plot.pdf
Modified: figures/fig6_antagonism_ratios.pdf
Modified: figures/fig7_models.pdf
Deleted: output/README.md
Note that any generated files, e.g. HTML, png, CSS, etc., are not included in this status report because it is ok for generated content to have uncommitted changes.
These are the previous versions of the repository in which changes were made to the R Markdown (analysis/perform_gwas.Rmd) and HTML (docs/perform_gwas.html) files. If you’ve configured a remote Git repository (see ?wflow_git_remote), click on the hyperlinks in the table below to view the files as they were in that past version.
File | Version | Author | Date | Message |
---|---|---|---|---|
Rmd | 9abaadc | lukeholman | 2023-05-15 | wflow_publish("analysis/perform_gwas.Rmd") |
html | 0ba3c20 | lukeholman | 2023-05-15 | Build site. |
Rmd | 358afb4 | lukeholman | 2023-03-21 | Tidying up and small edits |
html | a6ace40 | lukeholman | 2023-03-16 | Build site. |
Rmd | de878af | lukeholman | 2023-03-16 | wflow_publish("analysis/perform_gwas.Rmd") |
html | 9de9db7 | lukeholman | 2023-03-15 | Build site. |
Rmd | 9f5b6a9 | lukeholman | 2023-03-15 | wflow_publish("analysis/perform_gwas.Rmd") |
html | 17cf6e8 | lukeholman | 2023-03-15 | Build site. |
Rmd | 284fdf0 | lukeholman | 2023-03-15 | wflow_publish("analysis/perform_gwas.Rmd") |
html | 8830ad4 | lukeholman | 2023-03-15 | Build site. |
Rmd | 2080e0c | lukeholman | 2023-03-15 | wflow_publish("analysis/perform_gwas.Rmd") |
Rmd | d94e22d | lukeholman | 2023-03-14 | New quant gen and other changes |
html | 0a415e3 | lukeholman | 2021-11-12 | Build site. |
html | 7449a90 | lukeholman | 2021-10-01 | Build site. |
html | 9f76189 | lukeholman | 2021-09-27 | Build site. |
Rmd | 1f52aab | lukeholman | 2021-09-27 | Commit Sept 2021 |
html | 8d14298 | lukeholman | 2021-09-26 | Build site. |
Rmd | af15dd6 | lukeholman | 2021-09-26 | Commit Sept 2021 |
html | 410fb73 | lukeholman | 2021-03-04 | Build site. |
html | 871ae81 | lukeholman | 2021-03-04 | Build site. |
html | e112260 | lukeholman | 2021-03-04 | Build site. |
html | 836a780 | lukeholman | 2021-03-04 | Build site. |
html | 359ff37 | lukeholman | 2021-03-04 | Build site. |
html | 5506c4b | lukeholman | 2021-03-04 | Build site. |
html | f64a3e9 | lukeholman | 2021-03-04 | Build site. |
Rmd | fa1b9b0 | lukeholman | 2021-03-04 | big first commit 2021 |
Rmd | 8d54ea5 | Luke Holman | 2018-12-23 | Initial commit |
html | 8d54ea5 | Luke Holman | 2018-12-23 | Initial commit |
library(dplyr)
library(ggplot2)
# library(ggdendro)
library(glue)
library(bigsnpr) # to install: devtools::install_github("privefl/bigsnpr")
library(readr)
library(pander)
library(purrr)
library(future.apply) # for parallel code
# Load the predicted line means, as calculated by get_predicted_line_means
predicted_line_means <- read_csv("data/derived/predicted_line_means.csv")
# Note: you may need to re-download plink to get this to run on non-Mac systems
# I used bigsnpr::download_plink()
plink <- paste(getwd(), "code/plink", sep = "/")
# # Load the wolbachia infection status data from the DGRP website
# wolbachia <- read_csv("data/input/wolbachia.csv") %>%
# mutate(line = paste("line", line, sep = ""))
#
# # Load the chromosomal inversion data from the DGRP website
# # these are the 5 inversions that Huang et al. PNAS corrected for
# inversions <- read_csv("data/input/inversion genotypes.csv") %>%
# mutate(line = paste("line", line, sep = "")) %>%
# select(line, `In(2L)t`, `In(2R)NS`, `In(3R)P`, `In(3R)K`, `In(3R)Mo`)
# helper function to pass commands to the terminal
# Note that we set `intern = TRUE`, and pass the result of `system()` to `cat()`,
# ensuring that the Terminal output will be printed in this knitr report.
run_command <- function(shell_command, wd = getwd(), path = ""){
cat(system(glue("cd ", wd, path, "\n",shell_command), intern = TRUE), sep = '\n')
}
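For illustration, a hypothetical call to this helper might look like the following (purely illustrative; the real calls below pass PLINK, sed, and GEMMA commands):
# Hypothetical example: print the contents of data/derived/ in the knitted report
run_command("ls -lh", path = "/data/derived")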
We cleaned up the DGRP’s .bed/.bim/.fam files (available from the Mackay lab website) as follows:
Note that in the PLINK-formatted genotype files, lines fixed for the major allele are coded as 2, and lines fixed for the minor allele as 0. This means that in the association tests we calculate, negative effect sizes mean that the minor allele is associated with lower fitness, while positive effect sizes mean that the minor allele is associated with higher fitness.
beagle <- bigsnpr::download_beagle()
perform_SNP_QC_and_imputation <- function(phenotypes){
if("block" %in% names(phenotypes)) phenotypes <- phenotypes %>% select(-block)
# Make a list of the lines in our sample and save as a text file for passing to PLINK
lines_to_keep <- gsub("_", "", phenotypes$line) %>% cbind(.,.)
write.table(lines_to_keep, row.names = FALSE, col.names = FALSE, file = "data/derived/lines_to_keep.txt", quote = FALSE)
# Define a function to add our phenotype data to a .fam file, which is needed for GWAS analysis and to make sure PLINK includes these samples
# The 'phenotypes' data frame needs to have a column called 'line'
add_phenotypes_to_fam <- function(filepath, phenotypes){
read_delim(filepath, col_names = FALSE, delim = " ") %>%
select(X1, X2, X3, X4, X5) %>% # Get all the non-phenotype columns
left_join(phenotypes %>%
mutate(line = gsub("_", "", line)),
by = c("X1" = "line")) %>%
write.table(file = filepath,
col.names = FALSE, row.names = FALSE,
quote = FALSE, sep = " ")
}
# Use Plink to clean and subset the DGRP's SNP data as follows:
# Only keep SNPs for which at least 90% of DGRP lines were successfully genotyped (--geno 0.1)
# Only keep SNPs with a minor allele frequency of 0.05 or higher (--maf 0.05)
# Finally, write the processed BIM/BED/FAM files to the data/derived directory
run_command(glue("{plink} --bfile dgrp2",
" --geno 0.1 --maf 0.05 --allow-no-sex",
" --make-bed --out ../derived/dgrp2_QC_all_lines"), path = "/data/input/")
# Use the shell command 'sed' to remove underscores from the DGRP line names in the .fam file (e.g. 'line_120' becomes 'line120')
# Otherwise, these underscores cause trouble when we need to convert from PLINK to vcf format (vcf format uses underscore as a separator)
for(i in 1:2) run_command("sed -i '' 's/_//' dgrp2_QC_all_lines.fam", path = "/data/derived/")
# Now impute the missing genotypes using Beagle
# This part uses the data for the full DGRP panel of >200 lines, to infer missing genotypes as accurately as possible.
# This step uses a lot of memory (I set the maximum to 28 GB, and it used 26.5 GB), but maybe it can also run on a less powerful computer?
# The bigsnpr package provides a helpful wrapper for Beagle called snp_beagleImpute(): it translates to a VCF file and back again using PLINK
snp_beagleImpute(beagle, plink,
bedfile.in = "data/derived/dgrp2_QC_all_lines.bed",
bedfile.out = "data/derived/dgrp2_QC_all_lines_imputed.bed",
ncores = 7,
memory.max = 20)
# assign a sex of 'female' to all the DGRP lines (Beagle removes the sex, and it seems PLINK needs individuals to have a sex,
# despite PLINK having a command called --allow-no-sex)
run_command("sed -i '' 's/ 0 0 0/ 0 0 2/' dgrp2_QC_all_lines_imputed.fam", path = "/data/derived/")
# Re-write the .bed file, to make sure the MAF threshold and minor/major allele designations are correctly assigned post-Beagle
run_command(glue("{plink} --bfile dgrp2_QC_all_lines_imputed",
" --geno 0.1 --maf 0.05 --allow-no-sex",
" --make-bed --out dgrp2_QC_all_lines_imputed_correct"), path = "/data/derived/")
#unlink(list.files("data/derived", pattern = "~", full.names = TRUE)) # delete the original files, which were given a ~ file name by PLINK
# Use PLINK to get the allele IDs and calculate the MAFs across the whole DGRP, for all SNPs that survived QC
# The file created is called data/derived/plink.frq
run_command("{plink} --bfile dgrp2_QC_all_lines_imputed_correct --freq", path = "/data/derived")
# Now cull the PLINK files to just the lines that we measured, and re-apply the
# MAF cut-off of 0.05 for the new smaller sample of DGRP lines
run_command(glue("{plink} --bfile dgrp2_QC_all_lines_imputed_correct",
" --keep-allele-order", # force PLINK to retain the major/minor allele designations that apply to the DGRP as a whole
" --keep lines_to_keep.txt --geno 0.1 --maf 0.05",
" --make-bed --out dgrp2_QC_focal_lines"), path = "/data/derived/")
# edit the new FAM file to add the phenotype data from 'phenotypes'
add_phenotypes_to_fam("data/derived/dgrp2_QC_focal_lines.fam", phenotypes)
# Clean up:
unlink(c("data/derived/lines_to_keep.txt",
"data/derived/plink.log",
"data/derived/dgrp2_QC_all_lines_imputed.bed",
"data/derived/dgrp2_QC_all_lines_imputed.bim",
"data/derived/dgrp2_QC_all_lines_imputed.fam",
"data/derived/dgrp2_QC_all_lines_imputed.log",
"data/derived/dgrp2_QC_all_lines_imputed_correct.bed",
"data/derived/dgrp2_QC_all_lines_imputed_correct.bim",
"data/derived/dgrp2_QC_all_lines_imputed_correct.fam",
"data/derived/dgrp2_QC_all_lines_imputed_correct.log"))
}
perform_SNP_QC_and_imputation(phenotypes = predicted_line_means)
The following code chunk runs 4 linear mixed models, implemented in GEMMA. Each linear mixed model uses the decomposed genomic relatedness matrix from above, and has one of our four fitness traits as the response variable.
# The option "-lmm 1" runs a linear mixed model using Wald test to get the p-values
GRM_female <- "-d ./output/eigen_decomp.eigenD_female.txt -u ./output/eigen_decomp.eigenU_female.txt"
GRM_male <- "-d ./output/eigen_decomp.eigenD_male.txt -u ./output/eigen_decomp.eigenU_male.txt"
c("gemma -bfile {bed} {GRM_female} -lmm 1 -n 1 -o female_early_lmm",
"gemma -bfile {bed} {GRM_female} -lmm 1 -n 2 -o female_late_lmm",
"gemma -bfile {bed} {GRM_male} -lmm 1 -n 3 -o male_early_lmm",
"gemma -bfile {bed} {GRM_male} -lmm 1 -n 4 -o male_late_lmm") %>%
lapply(run_command, path = "/data/derived")
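For context, the eigen-decomposed relatedness matrices referenced above (the .eigenD/.eigenU files) were produced earlier in this analysis. A rough sketch of the kind of GEMMA commands that generate such files is shown below; the prefixes and file names here are hypothetical, not the exact ones used in this project.
# Hedged sketch (not the exact commands used): compute a centred genomic relatedness matrix
# with GEMMA (-gk 1 writes ./output/GRM_sketch.cXX.txt), then eigen-decompose it
# (-eigen writes ./output/eigen_sketch.eigenD.txt and ./output/eigen_sketch.eigenU.txt)
# run_command("gemma -bfile dgrp2_QC_focal_lines -gk 1 -o GRM_sketch", path = "/data/derived")
# run_command("gemma -bfile dgrp2_QC_focal_lines -k ./output/GRM_sketch.cXX.txt -eigen -o eigen_sketch", path = "/data/derived")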
This part uses the full DGRP panel, since our aim is to determine how common these alleles are in the original wild population from which the DGRP was captured. These MAFs are the ones used in our statistical analyses and plots.
# Extract and save the MAFs, SNP positions, and major/minor alleles
MAFs <- read.table("data/derived/plink.frq",
header = TRUE, stringsAsFactors = FALSE) %>%
mutate(position = map_chr(
strsplit(SNP, split = "_"),
function(x) x[2])) %>%
select(SNP, position, MAF, A1, A2) %>%
rename(minor_allele = A1,
major_allele = A2)
db <- DBI::dbConnect(RSQLite::SQLite(),
"data/derived/annotations.sqlite3", create = FALSE)
MAFs <- tbl(db, "variants") %>%
select(SNP, FBID, site.class, distance.to.gene, chr) %>% collect() %>%
full_join(MAFs, by = "SNP") %>%
filter(!is.na(MAF)) %>%
mutate(site.class = replace(site.class, is.na(site.class), "INTERGENIC"))
# Delete the original variant annotation table from the db, and add the new one back in
db %>% db_drop_table(table = "variants")
db %>% copy_to(MAFs,
"variants", temporary = FALSE,
indexes = list("SNP", "FBID", "chr", "site.class"))
rm(MAFs)
# The 4 univariate files:
c("data/derived/output/female_early_lmm.assoc.txt",
"data/derived/output/female_late_lmm.assoc.txt",
"data/derived/output/male_early_lmm.assoc.txt",
"data/derived/output/male_late_lmm.assoc.txt") %>%
lapply(function(filename){
df <- read_tsv(filename)
names(df)[names(df) == "rs"] <- "SNP"
df %>% select(SNP, beta, se, p_wald) %>%
write_tsv(filename)
})
Load up the GWAS results, which were produced using linear mixed models implemented in GEMMA, and arrange the data in a ‘wide’ format, where each row is one SNP, and the columns hold the estimates of \(\beta\), the standard error of \(\beta\), and the p-value for the GEMMA association tests on each of the four fitness traits.
load_gwas_results <- function(fitness.component){
new_names <- c(x = "beta", y = "se", z = "p_wald")
names(new_names) <- c(paste("beta_", fitness.component, sep = ""),
paste("SE_", fitness.component, sep = ""),
paste("pvalue_", fitness.component, sep = ""))
paste("data/derived/output/", fitness.component, "_lmm.assoc.txt", sep = "") %>%
read_tsv() %>%
select(SNP, beta, se, p_wald) %>%
rename(!!new_names)
}
all_univariate_lmm_results <- load_gwas_results("female_early") %>%
full_join(load_gwas_results("female_late"), by = "SNP") %>%
full_join(load_gwas_results("male_early"), by = "SNP") %>%
full_join(load_gwas_results("male_late"), by = "SNP")
Here, we take advantage of the fact that SNPs in 100% LD with one another have identical estimates of \(|\beta|\) and its standard error, for all four fitness traits. The following code groups together SNPs that have identical association test statistics, producing a data frame in which the number of rows equals the number of individual SNPs plus groups of SNPs in 100% LD.
# Group loci with identical GWAS results (these are in 100% LD with each other)
all_univariate_lmm_results <- all_univariate_lmm_results %>%
mutate(pasted = paste(beta_female_early, beta_female_late, beta_male_early, beta_male_late,
SE_female_early, SE_female_late, SE_male_early, SE_male_late)) %>%
group_by(pasted) %>%
summarise(SNPs = paste0(SNP, collapse = ", "), # If there are multiple SNPs, concatenate their names
beta_female_early = beta_female_early[1],
beta_female_late = beta_female_late[1],
beta_male_early = beta_male_early[1],
beta_male_late = beta_male_late[1],
SE_female_early = SE_female_early[1],
SE_female_late = SE_female_late[1],
SE_male_early = SE_male_early[1],
SE_male_late = SE_male_late[1],
pvalue_female_early = pvalue_female_early[1],
pvalue_female_late = pvalue_female_late[1],
pvalue_male_early = pvalue_male_early[1],
pvalue_male_late = pvalue_male_late[1]) %>%
ungroup() %>% select(-pasted) %>%
arrange(SNPs)
# A handful of SNPs with low MAF could not be estimated and they have 'NA' for beta and SE, so remove them now
# These are SNPs where removing line 354 (where we measured female but not male fitness) pushes the MAF below 0.05 for males
SNPs_to_remove <- all_univariate_lmm_results$SNPs[!complete.cases(all_univariate_lmm_results)]
all_univariate_lmm_results <- all_univariate_lmm_results %>%
filter(!(SNPs %in% SNPs_to_remove))
write_csv(all_univariate_lmm_results, file = "data/derived/all_univariate_GEMMA_results.csv")
unlink("data/derived/output/female_early_lmm.assoc.txt")
unlink("data/derived/output/female_late_lmm.assoc.txt")
unlink("data/derived/output/male_early_lmm.assoc.txt")
unlink("data/derived/output/male_late_lmm.assoc.txt")
To keep the computation time and memory usage for mashr manageable, we did not analyse every SNP analysed in the GWAS (i.e. 1,207,357 SNPs), but rather a subset of them that were approximately in linkage equilibrium with one another. We identified this LD-pruned set of SNPs using the PLINK arguments --indep-pairwise 100 10 0.2, i.e. pruning within sliding windows of 100 variants, shifting the window along by 10 variants at each step, and allowing a maximum pairwise \(r^2\) of 0.2 between loci. With these parameters, 1,420,071 SNPs were removed, leaving 226,581 for downstream analysis.
# indep-pairwise arguments are:
# window size of 100 variants,
# shift the window along by 10 variants at the end of each step,
# pairwise r^2 threshold of 0.2
run_command(glue("{plink} --bfile dgrp2_QC_focal_lines",
" --indep-pairwise 100 10 0.2"), path = "/data/derived/")
file.rename("data/derived/plink.prune.in", "data/derived/SNPs_in_LD")
unlink("gwas_data/derived/plink.prune.out")
sessionInfo()
R version 4.2.2 (2022-10-31)
Platform: x86_64-apple-darwin17.0 (64-bit)
Running under: macOS Big Sur ... 10.16
Matrix products: default
BLAS: /Library/Frameworks/R.framework/Versions/4.2/Resources/lib/libRblas.0.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/4.2/Resources/lib/libRlapack.dylib
locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] future.apply_1.10.0 future_1.30.0 purrr_1.0.1
[4] pander_0.6.5 readr_2.1.4 bigsnpr_1.11.6
[7] bigstatsr_1.5.12 glue_1.6.2 ggplot2_3.4.1
[10] dplyr_1.1.0 workflowr_1.7.0.4
loaded via a namespace (and not attached):
[1] Rcpp_1.0.10 lattice_0.20-45 listenv_0.9.0 getPass_0.2-2
[5] ps_1.7.2 rprojroot_2.0.3 digest_0.6.31 foreach_1.5.2
[9] utf8_1.2.2 parallelly_1.34.0 R6_2.5.1 bigsparser_0.6.1
[13] evaluate_0.20 httr_1.4.5 pillar_1.8.1 flock_0.7
[17] rlang_1.0.6 rstudioapi_0.14 data.table_1.14.6 whisker_0.4.1
[21] callr_3.7.3 jquerylib_0.1.4 Matrix_1.5-1 rmarkdown_2.20
[25] bigparallelr_0.3.2 stringr_1.5.0 bit_4.0.5 munsell_0.5.0
[29] compiler_4.2.2 httpuv_1.6.9 xfun_0.37 pkgconfig_2.0.3
[33] globals_0.16.2 htmltools_0.5.4 tidyselect_1.2.0 tibble_3.1.8
[37] codetools_0.2-18 fansi_1.0.4 crayon_1.5.2 tzdb_0.3.0
[41] withr_2.5.0 later_1.3.0 grid_4.2.2 jsonlite_1.8.4
[45] gtable_0.3.1 lifecycle_1.0.3 git2r_0.31.0 magrittr_2.0.3
[49] scales_1.2.1 vroom_1.6.1 cli_3.6.0 stringi_1.7.12
[53] cachem_1.0.7 fs_1.6.1 promises_1.2.0.1 doRNG_1.8.6
[57] doParallel_1.0.17 bslib_0.4.2 ellipsis_0.3.2 generics_0.1.3
[61] vctrs_0.5.2 cowplot_1.1.1 iterators_1.0.14 tools_4.2.2
[65] bit64_4.0.5 hms_1.1.2 rngtools_1.5.2 processx_3.8.0
[69] parallel_4.2.2 fastmap_1.1.1 yaml_2.3.7 colorspace_2.1-0
[73] bigassertr_0.1.6 knitr_1.42 sass_0.4.5