Last updated: 2019-03-07
Checks: 6 0
Knit directory: queen-pheromone-RNAseq/
This reproducible R Markdown analysis was created with workflowr (version 1.2.0). The Report tab describes the reproducibility checks that were applied when the results were created. The Past versions tab lists the development history.
Great! Since the R Markdown file has been committed to the Git repository, you know the exact version of the code that produced these results.
Great job! The global environment was empty. Objects defined in the global environment can affect the analysis in your R Markdown file in unknown ways. For reproduciblity it’s best to always run the code in an empty environment.
The command set.seed(20190307)
was run prior to running the code in the R Markdown file. Setting a seed ensures that any results that rely on randomness, e.g. subsampling or permutations, are reproducible.
Great job! Recording the operating system, R version, and package versions is critical for reproducibility.
Nice! There were no cached chunks for this analysis, so you can be confident that you successfully produced the results during this run.
Great! You are using Git for version control. Tracking code development and connecting the code version to the results is critical for reproducibility. The version displayed above was the version of the Git repository at the time these results were generated.
Note that you need to be careful to ensure that all relevant files for the analysis have been committed to Git prior to generating the results (you can use wflow_publish
or wflow_git_commit
). workflowr only checks the R Markdown file, but you know if there are other scripts or data files that it depends on. Below is the status of the Git repository when the results were generated:
Ignored files:
Ignored: .DS_Store
Ignored: .Rhistory
Ignored: .Rproj.user/
Ignored: analysis/.DS_Store
Ignored: bioinformatics_scripts/data/
Ignored: bioinformatics_scripts/ref/
Ignored: data/.DS_Store
Ignored: data/apis_gene_comparisons/.DS_Store
Ignored: data/apis_gene_comparisons/wojciechowski_histone_data/.DS_Store
Ignored: data/apis_gene_comparisons/wojciechowski_histone_data/GSE110640_RAW/.DS_Store
Ignored: data/apis_gene_comparisons/wojciechowski_histone_data/GSE110641_RAW/.DS_Store
Ignored: data/component spreadsheets of queen_pheromone.db/.DS_Store
Ignored: docs/figure/pdf_supplementary_material.Rmd/
Ignored: figures/
Ignored: manuscript/
Ignored: supplement/
Unstaged changes:
Deleted: bioinformatics_scripts/data
Deleted: bioinformatics_scripts/ref
Note that any generated files, e.g. HTML, png, CSS, etc., are not included in this status report because it is ok for generated content to have uncommitted changes.
These are the previous versions of the R Markdown and HTML files. If you’ve configured a remote Git repository (see ?wflow_git_remote
), click on the hyperlinks in the table below to view them.
File | Version | Author | Date | Message |
---|---|---|---|---|
html | 7d083c0 | lukeholman | 2019-03-07 | new theme |
html | f260a62 | Luke Holman | 2019-03-07 | Build site. |
Rmd | 934cda8 | Luke Holman | 2019-03-07 | First commit of new website structure |
There are already sequenced genomes of A. mellifera and B. terrestris and L. niger in NCBI. We’ll have to make something for L. flavus from scratch.
We also need to determine possible isoforms for the genes in the data set, which we will do using tophat for the genomes that have references.
This specis has no isoform data in NCBI, so we create our own. First we create bowtie2 indexed for the reference genomes using bowtie2-build
, and then tophat and cufflinks are executed using tophat.sh
using reference-based mapping to identify alternative splicing.
trinity.sh
assemble lf from raw reads. We are not using a genome guided assembly, since the ln genome is fragmented, as per recommendation of the trinity authors transdecoder.sh
predict proteins from the transcripts.
kallisto We use kallisto for gene expression analysis of transcripts from the NCBI data bases for A. mellifera and B. terrestis, for the TopHat assembly of L. niger and for the Trinity assembly of L. flavus.
rsem and ebseq This is the more traditional approach using the same data sources. We prepare references as ngvector files from the predicted transcripts, as per instructions.
We handle this by reciprocal blastp hit on the proteins vs the honey bee genome (the best annotated of the bunch) e.g., blastp -num_threads 12 -query lf.fa -db amel -outfmt 6 -evalue 1e-4 -max_target_seqs 1
We here present the individual scripts in code chunks, for convenient viewing.
ebseq.sh
#!/bin/bash
#SBATCH --job-name=ebseq
#SBATCH --partition=compute
#SBATCH --mem=500M
#SBATCH --time=1:00:00
#SBATCH --cpus-per-task=1
##SBATCH --mail-user=%u@oist.jp
##SBATCH --mail-type=BEGIN,FAIL,END
#SBATCH --input=none
#SBATCH --output=%j.out
##SBATCH --error=job_%j.err
. $HOME/.bashrc
rsem=/apps/unit/MikheyevU/sasha/RSEM
refdir=./data/rsem
a=($(ls -1 ./data/raw_reads/trimmed/*_1.fastq.gz |awk -F/ '{print $NF}' |cut -c-2 |uniq)) #4
species=${a["SLURM_ARRAY_TASK_ID"]}
echo $species
# contrasts are relative to control treatments, i.e., control goes first
if [ "$species" == "bt" ]; then
samples=`for i in 2 4 5 8 10 1 3 6 7 9; do echo -ne $refdir/$species$i".isoforms.results ";done`
$rsem/rsem-generate-data-matrix $samples > $refdir/$species.matrix
conditions="5,5"
elif [ "$species" == "am" ]; then
samples=`for i in 2 4 6 1 3; do echo -ne $refdir/$species$i".isoforms.results ";done`
$rsem/rsem-generate-data-matrix $samples > $refdir/$species.matrix
conditions="3,2"
elif [ "$species" == "lf" ]; then
samples=`for i in 4 6 8 10 12 14 16 1 3 5 7 11 13; do echo -ne $refdir/$species$i".isoforms.results " ;done`
echo $samples
$rsem/rsem-generate-data-matrix $samples > $refdir/$species.matrix
conditions="7,6"
elif [ "$species" == "ln" ]; then
samples=`for i in 1 3 5 7 11 2 4 6 8 12; do echo -ne $refdir/$species$i".isoforms.results ";done`
$rsem/rsem-generate-data-matrix $samples > $refdir/$species.matrix
conditions="5,5"
fi
$rsem/rsem-generate-data-matrix `echo $samples | sed 's/isoforms/genes/g'` > $refdir/$species.genes.matrix
$rsem/rsem-run-ebseq $refdir/$species.genes.matrix $conditions $refdir/$species".genes.ebseq"
$rsem/rsem-control-fdr $refdir/$species".genes.ebseq" .05 $refdir/$species".genes.padj.ebseq"
$rsem/rsem-run-ebseq --ngvector ./ref/$species.ngvec $refdir/$species.matrix $conditions $refdir/$species".ebseq"
$rsem/rsem-control-fdr $refdir/$species".ebseq" .05 $refdir/$species".padj.ebseq"
kallisto.sh
#!/bin/bash
#SBATCH --job-name=kalisto
#SBATCH --partition=compute
#SBATCH --mem=3G
#SBATCH --time=5:00:00
#SBATCH --cpus-per-task=1
##SBATCH --mail-user=%u@oist.jp
##SBATCH --mail-type=BEGIN,FAIL,END
#SBATCH --input=none
#SBATCH --output=%j.out
##SBATCH --error=job_%j.err
. $HOME/.bashrc
module load kallisto
##species=lf #am, bt, ln
##SLURM_ARRAY_TASK_ID=0
#for sbatch: --export=species=am
#10 for bt, 6 for am, 12 for ln, 15 lf
refdir=./data/raw_reads/trimmed
a=($refdir/$species*_1.fastq.gz)
b=($refdir/$species*_2.fastq.gz)
f=${a["SLURM_ARRAY_TASK_ID"]}
r=${b["SLURM_ARRAY_TASK_ID"]}
base=`basename $f _1.fastq.gz`
if [ "$species" == "bt" ]; then
ref=./ref/GCF_000214255.1_Bter_1.0_rna
elif [ "$species" == "am" ]; then
ref=./ref/GCF_000002195.4_Amel_4.5_rna
elif [ "$species" == "lf" ]; then
ref=./ref/lf
elif [ "$species" == "ln" ]; then
ref=./ref/ln
fi
kallisto quant -i $ref -o data/kallisto/$base -b 100 $f $r
orthodb.py
# takes a list of protein sequences and determines their orthogroup ids
#with python/2.7.8
#takes two files, source of fasta and output
import urllib, json, time, sys
from Bio import SeqIO
urlbase = "http://www.orthodb.org/blast?"
outfile = open(sys.argv[2], "w", 0)
level = "level=7434" #Aculeata
count = 1
start = time.time()
for rec in SeqIO.parse(sys.argv[1], "fasta"):
while True:
try:
response = urllib.urlopen(urlbase + level + "&seq=" + str(rec.seq).replace("*",""))
data = response.read()
parsedData = json.loads(data)
except:
print "http error"
time.sleep(2)
else:
break
if parsedData['data']:
outfile.write(rec.id + "\t" + ",".join(parsedData['data']) + "\n")
count += 1
if count % 100 == 0:
print("%i %.2f" % (count, time.time() - start))
start = time.time()
rsem-generate-data-matrix
#!/usr/bin/perl
use strict;
if (scalar(@ARGV) == 0) {
print "Usage: rsem-generate-data-matrix sampleA.[genes/isoforms].results sampleB.[genes/isoforms].results ... > output_name.matrix\n";
print "Results files should be either all .genes.results or all .isoforms.results.\n";
exit(-1);
}
my $offsite = 5; # ###### change to TPM
my $line;
my $n = scalar(@ARGV);
my $M = -1;
my @matrix = ();
# 0, file_name; 1, reference of expected count array; 2, reference of transcript_id/gene_id array
sub loadData {
open(INPUT, $_[0]);
my $line = <INPUT>; # The first line contains only column names
while ($line = <INPUT>) {
chomp($line);
my @fields = split(/\t/, $line);
push(@{$_[2]}, "\"$fields[0]\"");
push(@{$_[1]}, $fields[$offsite]);
}
close(INPUT);
if (scalar(@{$_[1]}) == 0) {
print STDERR "Nothing is detected! $_[0] may not exist or is empty.\n";
exit(-1);
}
}
#0, M; 1, reference of @ids_arr; 2, reference of @ids
sub check {
my $size = $_[0];
for (my $i = 0; $i < $size; $i++) {
if ($_[1]->[$i] ne $_[2]->[$i]) {
return 0;
}
}
return 1;
}
my @ids_arr = ();
for (my $i = 0; $i < $n; $i++) {
my (@ids, @ecs) = ();
&loadData($ARGV[$i], \@ecs, \@ids);
if ($M < 0) {
$M = scalar(@ids);
@ids_arr = @ids;
}
elsif (!&check($M, \@ids_arr, \@ids)) {
print STDERR "Number of lines among samples are not equal!\n";
exit(-1);
}
my $colname;
if (substr($ARGV[$i], 0, 2) eq "./") { $colname = substr($ARGV[$i], 2); }
else { $colname = $ARGV[$i]; }
$colname = "\"$colname\"";
@ecs = ($colname, @ecs);
push(@matrix, \@ecs);
}
@ids_arr = ("", @ids_arr);
@matrix = (\@ids_arr, @matrix);
for (my $i = 0; $i <= $M; $i++) {
for (my $j = 0; $j < $n; $j++) { print "$matrix[$j][$i]\t"; }
print "$matrix[$n][$i]\n";
}
rsem.sh
#!/bin/bash
#SBATCH --job-name=rsem
#SBATCH --partition=compute
#SBATCH --mem=4G
#SBATCH --time=2-00:00:00
#SBATCH --cpus-per-task=10
##SBATCH --mail-user=%u@oist.jp
##SBATCH --mail-type=BEGIN,FAIL,END
#SBATCH --input=none
#SBATCH --output=%j.out
##SBATCH --error=job_%j.err
. $HOME/.bashrc
module load rsem bowtie2
cd data/rsem
refdir=../raw_reads/trimmed
a=($refdir/*_1.fastq.gz) #43
b=($refdir/*_2.fastq.gz)
f=${a["SLURM_ARRAY_TASK_ID"]}
r=${b["SLURM_ARRAY_TASK_ID"]}
species=`basename $f | cut -c-2`
base=`basename $f _1.fastq.gz`
if [ "$species" == "bt" ]; then
ref=../../ref/bt
elif [ "$species" == "am" ]; then
ref=../../ref/am
elif [ "$species" == "lf" ]; then
ref=../../ref/lf
elif [ "$species" == "ln" ]; then
ref=../../ref/ln
fi
echo rsem-calculate-expression -p 8 --paired-end --bowtie2 $f $r $ref $base
/apps/unit/MikheyevU/sasha/RSEM/rsem-calculate-expression -p 10 --bowtie2 --paired-end $f $r $ref $base
tophat.sh
#!/bin/bash
#SBATCH --job-name=tophat
#SBATCH --partition=compute
##SBATCH --mem=1G
#SBATCH --time=1-00:00:00
##SBATCH --cpus-per-task=1
##SBATCH --mail-user=%u@oist.jp
##SBATCH --mail-type=BEGIN,FAIL,END
#SBATCH --input=none
#SBATCH --output=%j.out
##SBATCH --error=job_%j.err
. $HOME/.bashrc
#species=ln #am, bt, ln
#for sbatch: --export=species=am
refdir=./data/raw_reads/trimmed
a=($refdir/$species*_1.fastq.gz) #10 for bt, 6 for am, 12 for ln
b=($refdir/$species*_2.fastq.gz)
f=${a["SLURM_ARRAY_TASK_ID"]}
r=${b["SLURM_ARRAY_TASK_ID"]}
base=`basename $f _1.fastq.gz`
module load bowtie2/2.2.6 tophat/2.1.1 cufflinks/2.2.1
if [ "$species" == "bt" ]; then
gff=./ref/GCF_000214255.1_Bter_1.0_genomic.gff
ref=./ref/bt
elif [ "$species" == "am" ]; then
gff=./ref/GCF_000002195.4_Amel_4.5_genomic.gff
ref=./ref/am
else
gff=./ref/GCA_001045655.1_ASM104565v1_genomic.gff
ref=./ref/ln
fi
echo $SLURM_ARRAY_TASK_ID $species $gff $ref
echo $species $ref $base $gff $f $r
#tophat2 -p 1 -G $gff -o ./data/assembly/tophat/$base $ref $f $r
cufflinks -p 1 -g $gff ./data/assembly/tophat/$base/accepted_hits.bam -o ./data/assembly/tophat/$base
transdecoder.sh
#!/bin/bash
#SBATCH --job-name=transdecoder
#SBATCH --partition=compute
#SBATCH --mem=10G
#SBATCH --cpus-per-task=10
#SBATCH --time=1-00:00:00
#SBATCH --ntasks=1
##SBATCH --mail-user=%u@oist.jp
##SBATCH --mail-type=BEGIN,FAIL,END
#SBATCH --input=none
#SBATCH --output=%j.out
##SBATCH --error=job_%j.err
##SLURM_ARRAY_TASK_ID=2
module load Trinity/2.1.1
infile=./data/assembly/trinity_lf/Trinity.fasta
base=lf
TransDecoder -t $infile --reuse --workdir ./output_"$base" --search_pfam /apps/unit/MikheyevU/sasha/TransDecoder_r20140704/pfam/Pfam-AB.hmm.bin --CPU 10 -v
trimmomatic.sh
#!/bin/bash
#SBATCH --job-name=trim
#SBATCH --partition=compute
#SBATCH --mem=20G
#SBATCH --cpus-per-task=8
##SBATCH --mail-user=%u@oist.jp
##SBATCH --mail-type=BEGIN,FAIL,END
#SBATCH --input=none
#SBATCH --output=%j.out
##SBATCH --error=job_%j.err
trimmomatic=/apps/unit/MikheyevU/sasha/trimmomatic
##SLURM_ARRAY_TASK_ID=1
a=(*/*_1.fastq.gz) # 43
b=(*/*_2.fastq.gz) # 43
f=${a["SLURM_ARRAY_TASK_ID"]}
r=${b["SLURM_ARRAY_TASK_ID"]}
base=$(basename $f _1.fastq.gz)
java -jar $trimmomatic/trimmomatic-0.32.jar PE -threads 8 -phred33 $f $r trimmed/$base"_1.fastq.gz" trimmed/$base"_unpaired1.fastq.gz" trimmed/$base"_2.fastq.gz" trimmed/$base"_unpaired2.fastq.gz" ILLUMINACLIP:$trimmomatic/adapters/TruSeq3-PE.fa:2:30:10 SLIDINGWINDOW:4:15 MINLEN:25
trinity.sh
#!/bin/bash
#SBATCH --job-name=trinity
#SBATCH --partition=compute
#SBATCH --mem=80G
#SBATCH --time=3-00:00:00
#SBATCH --cpus-per-task=12
##SBATCH --mail-user=%u@oist.jp
##SBATCH --mail-type=BEGIN,FAIL,END
#SBATCH --input=none
#SBATCH --output=%j.out
##SBATCH --error=job_%j.err
. $HOME/.bashrc
species=lf
refdir=../data/raw_reads/merged
module load Trinity/2.1.1 bowtie/1.1.0
Trinity --seqType fq --max_memory 75G --left $refdir/"$species"_1.fastq.gz \
--right $refdir/"$species"_2.fastq.gz \
--CPU 12 --output ../data/assembly/trinity_"$species"
sessionInfo()
R version 3.5.1 (2018-07-02)
Platform: x86_64-apple-darwin15.6.0 (64-bit)
Running under: macOS High Sierra 10.13.6
Matrix products: default
BLAS: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRblas.0.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRlapack.dylib
locale:
[1] en_AU.UTF-8/en_AU.UTF-8/en_AU.UTF-8/C/en_AU.UTF-8/en_AU.UTF-8
attached base packages:
[1] stats graphics grDevices utils datasets methods base
loaded via a namespace (and not attached):
[1] workflowr_1.2.0 Rcpp_1.0.0 lattice_0.20-35 digest_0.6.18
[5] rprojroot_1.3-2 grid_3.5.1 jsonlite_1.6 backports_1.1.2
[9] git2r_0.23.0 magrittr_1.5 evaluate_0.11 stringi_1.3.1
[13] fs_1.2.6 whisker_0.3-2 Matrix_1.2-14 reticulate_1.10
[17] rmarkdown_1.10 tools_3.5.1 stringr_1.3.1 glue_1.3.0.9000
[21] yaml_2.2.0 compiler_3.5.1 htmltools_0.3.6 knitr_1.20