Aniseed - Protocol

Protocol of RNA-Seq Pipeline

Possible Species

Molgula oculata; Molgula occidentalis; Botryllus schlosseri; Halocynthia roretzi; Halocynthia aurantium; Ciona intestinalis; Ciona robusta; Ciona savignyi; Phallusia mammillata; Phallusia fumigata

Author

Justine Dardaillon

RNA-Seq Pipeline Description

Raw data (FASTQ format) were obtained directly from labs or downloaded from the GEO/SRA databases with the NCBI SRA Toolkit (fastq-dump v2.8.2). The pipeline, written in Python (compatible with versions 2 and 3), starts by checking the quality of the sequences with FastQC v0.11.5 (Andrews S. 2010). The files are then sorted and, for paired-end data, repaired. Next, the pipeline calls Cutadapt v1.13 to trim low-quality ends and adapters from the reads (Martin M. 2011).
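The "sorted and repaired" step for paired-end data can be illustrated with a minimal Python 3 sketch. It is not the pipeline's actual code. It assumes that "repairing" means keeping only the reads present in both mate files and writing them in the same order, that each record spans exactly four lines, and that mates share an ID up to an optional /1 or /2 suffix. The function names `read_fastq` and `repair_pairs` are hypothetical.

```python
import io


def read_fastq(handle):
    """Yield (read_id, record_text) for each 4-line FASTQ record."""
    while True:
        header = handle.readline()
        if not header:
            return  # end of file
        if not header.strip():
            continue  # skip stray blank lines
        lines = [header] + [handle.readline() for _ in range(3)]
        # Normalise line endings so records can be concatenated safely.
        record = "".join(line.rstrip("\n") + "\n" for line in lines)
        # Assumption: the read ID is the first token of the header,
        # without '@' and without a trailing /1 or /2 mate suffix.
        read_id = header[1:].split()[0]
        if read_id.endswith(("/1", "/2")):
            read_id = read_id[:-2]
        yield read_id, record


def repair_pairs(handle_r1, handle_r2):
    """Keep only reads found in both files, sorted by read ID.

    Both files are loaded into memory, which is fine for a sketch
    but not for full-size sequencing runs.
    """
    r1 = dict(read_fastq(handle_r1))
    r2 = dict(read_fastq(handle_r2))
    shared = sorted(set(r1) & set(r2))
    return "".join(r1[i] for i in shared), "".join(r2[i] for i in shared)


# Usage: read "c" has no mate in the second file and is dropped.
mates_1 = io.StringIO("@b/1\nAAAA\n+\nIIII\n@a/1\nCCCC\n+\nIIII\n@c/1\nGGGG\n+\nIIII\n")
mates_2 = io.StringIO("@a/2\nTTTT\n+\nIIII\n@b/2\nGGGG\n+\nIIII\n")
fixed_1, fixed_2 = repair_pairs(mates_1, mates_2)
```

After this step both outputs contain the same reads in the same order, which paired-end trimming and alignment tools generally expect.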

STAR is used to align the reads to the genome.

For paired-end data, an additional step creates a random subsample of 1,000,000 reads with Seqtk v1.0-r31. This subsample is used to estimate the maximum mate gap, so that the corresponding parameter can be set correctly on the STAR command line (Dobin A. 2013).
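The subsampling and alignment calls can be sketched as argument lists built in Python. This is an illustration under assumptions, not the pipeline's code: the file names, seed, thread count, and helper names are hypothetical, the "corresponding parameter" is assumed to be STAR's `--alignMatesGapMax`, and the way the gap is estimated from the subsample is not described in the protocol, so it is left as an input.

```python
def seqtk_sample_cmd(fastq, n_reads=1000000, seed=42):
    """Build a `seqtk sample` call. Using the same seed for both mate
    files keeps the pairs in sync. Seqtk writes to standard output,
    so the caller has to redirect it to a file."""
    return ["seqtk", "sample", "-s", str(seed), fastq, str(n_reads)]


def star_cmd(genome_dir, reads, out_prefix, mates_gap_max=None, threads=4):
    """Build a STAR alignment call for one or two FASTQ files."""
    cmd = [
        "STAR",
        "--runThreadN", str(threads),
        "--genomeDir", genome_dir,
        "--readFilesIn",
    ] + list(reads) + [
        "--outSAMtype", "BAM", "Unsorted",
        "--outFileNamePrefix", out_prefix,
    ]
    if mates_gap_max is not None:
        # Assumption: the gap estimated from the subsample is passed here.
        cmd += ["--alignMatesGapMax", str(mates_gap_max)]
    return cmd


# Hypothetical file names and gap value, for illustration only.
sample_cmd = seqtk_sample_cmd("sample_1.fastq")
align_cmd = star_cmd("genome_index", ["sample_1.fastq", "sample_2.fastq"],
                     "sample_", mates_gap_max=20000)
```

Each list can be passed to `subprocess.run`; for the Seqtk call, open the output file and pass it as `stdout`.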

The files produced by STAR (BAM format) are sorted with SAMtools v1.4.1 (Li et al. 2009); technical replicates, if any, are merged at this step with SAMtools. These BAM alignment files are then processed with htseq-count v0.7.2 (Anders et al. 2014) to count the reads per gene. Finally, the data are normalized with R v3.4.0 using the DESeq2 package.
The final output files give RPKM (single-end data) or FPKM (paired-end data) and RLE (Relative Log Expression) normalized values for each gene. RLE values are computed by calculating, for each probe-set, the ratio between the expression of that probe-set and its median expression across all arrays of the experiment.
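As a worked illustration of the counting and normalization outputs, the Python 3 sketch below parses htseq-count style output and computes RPKM and an RLE value as defined in the paragraph above. It is not the pipeline's R/DESeq2 code. The gene names, gene lengths, and library size are made up, the RPKM denominator is assumed to be the total number of mapped reads, and the RLE is taken as the log2 of the ratio to the per-gene median. For FPKM the same formula applies with fragments instead of reads.

```python
import math
from statistics import median


def parse_htseq_counts(lines):
    """Split htseq-count output into per-gene counts and the special
    summary counters, whose names start with '__' (e.g. __no_feature)."""
    counts, summary = {}, {}
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        name, value = line.split("\t")
        target = summary if name.startswith("__") else counts
        target[name] = int(value)
    return counts, summary


def rpkm(count, gene_length_bp, total_mapped_reads):
    """Reads per kilobase of gene per million mapped reads."""
    return count * 1e9 / (gene_length_bp * total_mapped_reads)


def rle(values):
    """log2 ratio of each sample's value to the median across samples,
    for one gene. Zero values would need special handling, omitted here."""
    med = median(values)
    return [math.log2(v / med) for v in values]


# Made-up example data.
raw = ["geneA\t500\n", "geneB\t0\n", "__no_feature\t1200\n"]
counts, summary = parse_htseq_counts(raw)
gene_lengths = {"geneA": 2000, "geneB": 1500}  # in base pairs
library_size = 10000000                        # assumed mapped reads
rpkm_values = {g: rpkm(c, gene_lengths[g], library_size)
               for g, c in counts.items()}
rle_values = rle([10.0, 20.0, 40.0])           # one gene, three samples
```

With these numbers, geneA has 500 reads over 2 kb in a library of 10 million reads, giving an RPKM of 25; the three-sample RLE example yields values centered on the median sample.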

References

Andrews S. (2010). FastQC: a quality control tool for high throughput sequence data. Available online.

Martin M. (2011). Cutadapt removes adapter sequences from high-throughput sequencing reads. EMBnet.journal [Online] 17(1).

Dobin A. (2013). STAR: ultrafast universal RNA-seq aligner. Bioinformatics 29(1):15-21.

Li H. et al. (2009). The Sequence Alignment/Map (SAM) format and SAMtools. Bioinformatics 25:2078-9.

Anders S. et al. (2014). HTSeq - A Python framework to work with high-throughput sequencing data. Bioinformatics 31(2):166-9.