Last updated: 2019-03-07

Checks: 6 0

Knit directory: queen-pheromone-RNAseq/

This reproducible R Markdown analysis was created with workflowr (version 1.2.0). The Report tab describes the reproducibility checks that were applied when the results were created. The Past versions tab lists the development history.

R Markdown file: up-to-date

Great! Since the R Markdown file has been committed to the Git repository, you know the exact version of the code that produced these results.

Environment: empty

Great job! The global environment was empty. Objects defined in the global environment can affect the analysis in your R Markdown file in unknown ways. For reproduciblity it’s best to always run the code in an empty environment.

Seed: set.seed(20190307)

The command set.seed(20190307) was run prior to running the code in the R Markdown file. Setting a seed ensures that any results that rely on randomness, e.g. subsampling or permutations, are reproducible.

Session information: recorded

Great job! Recording the operating system, R version, and package versions is critical for reproducibility.

Cache: none

Nice! There were no cached chunks for this analysis, so you can be confident that you successfully produced the results during this run.

Repository version: 7d083c0

Great! You are using Git for version control. Tracking code development and connecting the code version to the results is critical for reproducibility. The version displayed above was the version of the Git repository at the time these results were generated.

Note that you need to be careful to ensure that all relevant files for the analysis have been committed to Git prior to generating the results (you can use wflow_publish or wflow_git_commit). workflowr only checks the R Markdown file, but you know if there are other scripts or data files that it depends on. Below is the status of the Git repository when the results were generated:


Ignored files:
    Ignored:    .DS_Store
    Ignored:    .Rhistory
    Ignored:    .Rproj.user/
    Ignored:    analysis/.DS_Store
    Ignored:    bioinformatics_scripts/data/
    Ignored:    bioinformatics_scripts/ref/
    Ignored:    data/.DS_Store
    Ignored:    data/apis_gene_comparisons/.DS_Store
    Ignored:    data/apis_gene_comparisons/wojciechowski_histone_data/.DS_Store
    Ignored:    data/apis_gene_comparisons/wojciechowski_histone_data/GSE110640_RAW/.DS_Store
    Ignored:    data/apis_gene_comparisons/wojciechowski_histone_data/GSE110641_RAW/.DS_Store
    Ignored:    data/component spreadsheets of queen_pheromone.db/.DS_Store
    Ignored:    docs/figure/pdf_supplementary_material.Rmd/
    Ignored:    figures/
    Ignored:    manuscript/
    Ignored:    supplement/

Unstaged changes:
    Deleted:    bioinformatics_scripts/data
    Deleted:    bioinformatics_scripts/ref

Note that any generated files, e.g. HTML, png, CSS, etc., are not included in this status report because it is ok for generated content to have uncommitted changes.

These are the previous versions of the R Markdown and HTML files. If you’ve configured a remote Git repository (see ?wflow_git_remote), click on the hyperlinks in the table below to view them.

File	Version	Author	Date	Message
html	7d083c0	lukeholman	2019-03-07	new theme
html	f260a62	Luke Holman	2019-03-07	Build site.
Rmd	934cda8	Luke Holman	2019-03-07	First commit of new website structure

Workflow

There are already sequenced genomes of A. mellifera and B. terrestris and L. niger in NCBI. We’ll have to make something for L. flavus from scratch.

We also need to determine possible isoforms for the genes in the data set, which we will do using tophat for the genomes that have references.

L. niger workflow

This specis has no isoform data in NCBI, so we create our own. First we create bowtie2 indexed for the reference genomes using bowtie2-build, and then tophat and cufflinks are executed using tophat.sh using reference-based mapping to identify alternative splicing.

L flavus workflow

trinity.sh assemble lf from raw reads. We are not using a genome guided assembly, since the ln genome is fragmented, as per recommendation of the trinity authors transdecoder.sh predict proteins from the transcripts.

Gene expression analysis

kallisto We use kallisto for gene expression analysis of transcripts from the NCBI data bases for A. mellifera and B. terrestis, for the TopHat assembly of L. niger and for the Trinity assembly of L. flavus.

rsem and ebseq This is the more traditional approach using the same data sources. We prepare references as ngvector files from the predicted transcripts, as per instructions.

Orthology

We handle this by reciprocal blastp hit on the proteins vs the honey bee genome (the best annotated of the bunch) e.g., blastp -num_threads 12 -query lf.fa -db amel -outfmt 6 -evalue 1e-4 -max_target_seqs 1

Individual scripts

We here present the individual scripts in code chunks, for convenient viewing.

`ebseq.sh`

#!/bin/bash
#SBATCH --job-name=ebseq
#SBATCH --partition=compute
#SBATCH --mem=500M
#SBATCH --time=1:00:00
#SBATCH --cpus-per-task=1
##SBATCH --mail-user=%u@oist.jp
##SBATCH --mail-type=BEGIN,FAIL,END
#SBATCH --input=none
#SBATCH --output=%j.out
##SBATCH --error=job_%j.err
. $HOME/.bashrc


rsem=/apps/unit/MikheyevU/sasha/RSEM
refdir=./data/rsem
a=($(ls -1  ./data/raw_reads/trimmed/*_1.fastq.gz |awk -F/ '{print $NF}' |cut -c-2 |uniq)) #4
species=${a["SLURM_ARRAY_TASK_ID"]}
echo $species

# contrasts are relative to control treatments, i.e., control goes first

if [ "$species" == "bt" ]; then
    samples=`for i in 2 4 5 8 10 1 3 6 7 9; do echo -ne $refdir/$species$i".isoforms.results ";done`
    $rsem/rsem-generate-data-matrix $samples > $refdir/$species.matrix
    conditions="5,5"
elif [ "$species" == "am" ]; then    
    samples=`for i in 2 4 6 1 3; do echo -ne $refdir/$species$i".isoforms.results ";done`
    $rsem/rsem-generate-data-matrix $samples > $refdir/$species.matrix
    conditions="3,2"
elif [ "$species" == "lf" ]; then
    samples=`for i in 4 6 8 10 12 14 16 1 3 5 7 11 13; do echo -ne $refdir/$species$i".isoforms.results " ;done`
    echo $samples
    $rsem/rsem-generate-data-matrix $samples > $refdir/$species.matrix
    conditions="7,6"
elif [ "$species" == "ln" ]; then
    samples=`for i in 1 3 5 7 11 2 4 6 8  12; do echo -ne $refdir/$species$i".isoforms.results ";done`
    $rsem/rsem-generate-data-matrix $samples > $refdir/$species.matrix
    conditions="5,5"
fi

$rsem/rsem-generate-data-matrix `echo $samples | sed 's/isoforms/genes/g'` > $refdir/$species.genes.matrix
$rsem/rsem-run-ebseq $refdir/$species.genes.matrix $conditions $refdir/$species".genes.ebseq"
$rsem/rsem-control-fdr $refdir/$species".genes.ebseq" .05 $refdir/$species".genes.padj.ebseq"

$rsem/rsem-run-ebseq --ngvector ./ref/$species.ngvec $refdir/$species.matrix $conditions $refdir/$species".ebseq"
$rsem/rsem-control-fdr $refdir/$species".ebseq" .05 $refdir/$species".padj.ebseq"

`kallisto.sh`

#!/bin/bash
#SBATCH --job-name=kalisto
#SBATCH --partition=compute
#SBATCH --mem=3G
#SBATCH --time=5:00:00
#SBATCH --cpus-per-task=1
##SBATCH --mail-user=%u@oist.jp
##SBATCH --mail-type=BEGIN,FAIL,END
#SBATCH --input=none
#SBATCH --output=%j.out
##SBATCH --error=job_%j.err
. $HOME/.bashrc


module load kallisto

##species=lf #am, bt, ln
##SLURM_ARRAY_TASK_ID=0
#for sbatch: --export=species=am
#10 for bt, 6 for am, 12 for ln, 15 lf
refdir=./data/raw_reads/trimmed
a=($refdir/$species*_1.fastq.gz) 
b=($refdir/$species*_2.fastq.gz)
f=${a["SLURM_ARRAY_TASK_ID"]}
r=${b["SLURM_ARRAY_TASK_ID"]}
base=`basename $f _1.fastq.gz`

if [ "$species" == "bt" ]; then
    ref=./ref/GCF_000214255.1_Bter_1.0_rna
elif [ "$species" == "am" ]; then
    ref=./ref/GCF_000002195.4_Amel_4.5_rna
elif [ "$species" == "lf" ]; then
    ref=./ref/lf
elif [ "$species" == "ln" ]; then
    ref=./ref/ln
fi

kallisto quant -i $ref -o data/kallisto/$base -b 100 $f $r

`orthodb.py`

# takes a list of protein sequences and determines their orthogroup ids
#with python/2.7.8  
#takes two files, source of fasta and output
import urllib, json, time, sys
from Bio import SeqIO
urlbase = "http://www.orthodb.org/blast?"
outfile = open(sys.argv[2], "w", 0)
level = "level=7434" #Aculeata
count = 1
start = time.time()
for rec in SeqIO.parse(sys.argv[1], "fasta"):       
    while True:             
        try:
            response = urllib.urlopen(urlbase + level  + "&seq=" + str(rec.seq).replace("*",""))
            data = response.read()
            parsedData = json.loads(data)
        except:
            print "http error"
            time.sleep(2)
        else:
            break
    if parsedData['data']:
        outfile.write(rec.id + "\t" + ",".join(parsedData['data']) + "\n")
    count += 1
    if count % 100 == 0:
        print("%i %.2f" % (count, time.time() - start))
        start = time.time()

`rsem-generate-data-matrix`

#!/usr/bin/perl

use strict;

if (scalar(@ARGV) == 0) {
    print "Usage: rsem-generate-data-matrix sampleA.[genes/isoforms].results sampleB.[genes/isoforms].results ... > output_name.matrix\n";
    print "Results files should be either all .genes.results or all .isoforms.results.\n";
    exit(-1);
}

my $offsite = 5; # ###### change to TPM

my $line;
my $n = scalar(@ARGV);
my $M = -1;
my @matrix = ();

# 0, file_name; 1, reference of expected count array; 2, reference of transcript_id/gene_id array
sub loadData {
    open(INPUT, $_[0]);
    my $line = <INPUT>; # The first line contains only column names
    while ($line = <INPUT>) {
    chomp($line); 
    my @fields = split(/\t/, $line);
    push(@{$_[2]}, "\"$fields[0]\"");
    push(@{$_[1]}, $fields[$offsite]);
    }
    close(INPUT);

    if (scalar(@{$_[1]}) == 0) {
    print STDERR "Nothing is detected! $_[0] may not exist or is empty.\n";
    exit(-1);
    }
}

#0, M; 1, reference of @ids_arr; 2, reference of @ids
sub check {
    my $size = $_[0];
    for (my $i = 0; $i < $size; $i++) { 
    if ($_[1]->[$i] ne $_[2]->[$i]) {
        return 0;
    }
    }
    return 1;
}

my @ids_arr = ();

for (my $i = 0; $i < $n; $i++) {
    my (@ids, @ecs) = ();
    &loadData($ARGV[$i], \@ecs, \@ids);

    if ($M < 0) { 
    $M = scalar(@ids); 
    @ids_arr = @ids;
    }
    elsif (!&check($M, \@ids_arr, \@ids)) { 
    print STDERR "Number of lines among samples are not equal!\n"; 
    exit(-1); 
    }

    my $colname;
    if (substr($ARGV[$i], 0, 2) eq "./") { $colname = substr($ARGV[$i], 2); }
    else { $colname = $ARGV[$i]; }
    $colname = "\"$colname\"";
    @ecs = ($colname, @ecs);
    push(@matrix, \@ecs);
}

@ids_arr = ("", @ids_arr);
@matrix = (\@ids_arr, @matrix);

for (my $i = 0; $i <= $M; $i++) {
    for (my $j = 0; $j < $n; $j++) { print "$matrix[$j][$i]\t"; }
    print "$matrix[$n][$i]\n";
}

`rsem.sh`

#!/bin/bash
#SBATCH --job-name=rsem
#SBATCH --partition=compute
#SBATCH --mem=4G
#SBATCH --time=2-00:00:00
#SBATCH --cpus-per-task=10
##SBATCH --mail-user=%u@oist.jp
##SBATCH --mail-type=BEGIN,FAIL,END
#SBATCH --input=none
#SBATCH --output=%j.out
##SBATCH --error=job_%j.err
. $HOME/.bashrc


module load rsem bowtie2

cd data/rsem
refdir=../raw_reads/trimmed
a=($refdir/*_1.fastq.gz) #43
b=($refdir/*_2.fastq.gz)
f=${a["SLURM_ARRAY_TASK_ID"]}
r=${b["SLURM_ARRAY_TASK_ID"]}
species=`basename $f | cut -c-2`
base=`basename $f _1.fastq.gz`

if [ "$species" == "bt" ]; then
    ref=../../ref/bt
elif [ "$species" == "am" ]; then
    ref=../../ref/am
elif [ "$species" == "lf" ]; then
    ref=../../ref/lf
elif [ "$species" == "ln" ]; then
    ref=../../ref/ln
fi

echo rsem-calculate-expression  -p 8 --paired-end --bowtie2 $f $r $ref $base
/apps/unit/MikheyevU/sasha/RSEM/rsem-calculate-expression -p 10 --bowtie2  --paired-end $f $r $ref $base

`tophat.sh`

#!/bin/bash
#SBATCH --job-name=tophat
#SBATCH --partition=compute
##SBATCH --mem=1G
#SBATCH --time=1-00:00:00
##SBATCH --cpus-per-task=1
##SBATCH --mail-user=%u@oist.jp
##SBATCH --mail-type=BEGIN,FAIL,END
#SBATCH --input=none
#SBATCH --output=%j.out
##SBATCH --error=job_%j.err
. $HOME/.bashrc

#species=ln  #am, bt, ln
#for sbatch: --export=species=am
refdir=./data/raw_reads/trimmed
a=($refdir/$species*_1.fastq.gz) #10 for bt, 6 for am, 12 for ln
b=($refdir/$species*_2.fastq.gz)
f=${a["SLURM_ARRAY_TASK_ID"]}
r=${b["SLURM_ARRAY_TASK_ID"]}
base=`basename $f _1.fastq.gz`

module load bowtie2/2.2.6 tophat/2.1.1 cufflinks/2.2.1

if [ "$species" == "bt" ]; then
    gff=./ref/GCF_000214255.1_Bter_1.0_genomic.gff
    ref=./ref/bt
elif [ "$species" == "am" ]; then
    gff=./ref/GCF_000002195.4_Amel_4.5_genomic.gff
    ref=./ref/am
else
    gff=./ref/GCA_001045655.1_ASM104565v1_genomic.gff 
    ref=./ref/ln
fi

echo $SLURM_ARRAY_TASK_ID $species $gff $ref

echo  $species $ref  $base $gff $f $r 
#tophat2 -p 1 -G $gff -o ./data/assembly/tophat/$base $ref $f $r
cufflinks -p 1 -g $gff ./data/assembly/tophat/$base/accepted_hits.bam -o ./data/assembly/tophat/$base

`transdecoder.sh`

#!/bin/bash
#SBATCH --job-name=transdecoder
#SBATCH --partition=compute
#SBATCH --mem=10G
#SBATCH --cpus-per-task=10
#SBATCH --time=1-00:00:00
#SBATCH --ntasks=1
##SBATCH --mail-user=%u@oist.jp
##SBATCH --mail-type=BEGIN,FAIL,END
#SBATCH --input=none
#SBATCH --output=%j.out
##SBATCH --error=job_%j.err

##SLURM_ARRAY_TASK_ID=2

module load Trinity/2.1.1 

infile=./data/assembly/trinity_lf/Trinity.fasta

base=lf

TransDecoder -t $infile --reuse --workdir ./output_"$base" --search_pfam /apps/unit/MikheyevU/sasha/TransDecoder_r20140704/pfam/Pfam-AB.hmm.bin --CPU 10 -v

`trimmomatic.sh`

#!/bin/bash
#SBATCH --job-name=trim
#SBATCH --partition=compute
#SBATCH --mem=20G
#SBATCH --cpus-per-task=8
##SBATCH --mail-user=%u@oist.jp
##SBATCH --mail-type=BEGIN,FAIL,END
#SBATCH --input=none
#SBATCH --output=%j.out
##SBATCH --error=job_%j.err

trimmomatic=/apps/unit/MikheyevU/sasha/trimmomatic
##SLURM_ARRAY_TASK_ID=1
a=(*/*_1.fastq.gz) # 43
b=(*/*_2.fastq.gz) # 43
f=${a["SLURM_ARRAY_TASK_ID"]}   
r=${b["SLURM_ARRAY_TASK_ID"]}   
base=$(basename $f _1.fastq.gz)
java -jar $trimmomatic/trimmomatic-0.32.jar PE -threads 8 -phred33 $f $r trimmed/$base"_1.fastq.gz" trimmed/$base"_unpaired1.fastq.gz" trimmed/$base"_2.fastq.gz" trimmed/$base"_unpaired2.fastq.gz" ILLUMINACLIP:$trimmomatic/adapters/TruSeq3-PE.fa:2:30:10 SLIDINGWINDOW:4:15 MINLEN:25

`trinity.sh`

#!/bin/bash
#SBATCH --job-name=trinity
#SBATCH --partition=compute
#SBATCH --mem=80G
#SBATCH --time=3-00:00:00
#SBATCH --cpus-per-task=12
##SBATCH --mail-user=%u@oist.jp
##SBATCH --mail-type=BEGIN,FAIL,END
#SBATCH --input=none
#SBATCH --output=%j.out
##SBATCH --error=job_%j.err
. $HOME/.bashrc

species=lf
refdir=../data/raw_reads/merged
module load Trinity/2.1.1  bowtie/1.1.0
Trinity --seqType fq --max_memory 75G --left $refdir/"$species"_1.fastq.gz \
 --right  $refdir/"$species"_2.fastq.gz  \
    --CPU 12 --output ../data/assembly/trinity_"$species"

sessionInfo()

R version 3.5.1 (2018-07-02)
Platform: x86_64-apple-darwin15.6.0 (64-bit)
Running under: macOS High Sierra 10.13.6

Matrix products: default
BLAS: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRblas.0.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRlapack.dylib

locale:
[1] en_AU.UTF-8/en_AU.UTF-8/en_AU.UTF-8/C/en_AU.UTF-8/en_AU.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

loaded via a namespace (and not attached):
 [1] workflowr_1.2.0 Rcpp_1.0.0      lattice_0.20-35 digest_0.6.18  
 [5] rprojroot_1.3-2 grid_3.5.1      jsonlite_1.6    backports_1.1.2
 [9] git2r_0.23.0    magrittr_1.5    evaluate_0.11   stringi_1.3.1  
[13] fs_1.2.6        whisker_0.3-2   Matrix_1.2-14   reticulate_1.10
[17] rmarkdown_1.10  tools_3.5.1     stringr_1.3.1   glue_1.3.0.9000
[21] yaml_2.2.0      compiler_3.5.1  htmltools_0.3.6 knitr_1.20

Bioinformatic analysis

From: Comparative transcriptomics of social insect queen pheromones

2019-03-07