Protocol of RNA-Seq Pipeline
Author
Justine Dardaillon
Raw data (Fastq format) were obtained directly from labs or downloaded
from GEO/SRA databases with NCBI SRA-toolkit software (fastq-dump v2.8.2).
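As a minimal sketch, the download step could be wrapped as shown here. The accession `SRR0000000` and the directory `raw_fastq` are placeholders, and the use of `--gzip` and `--split-files` is an assumption: the protocol does not state which fastq-dump options the pipeline passes.

```python
def fastq_dump_command(accession, outdir, paired=False):
    """Return a fastq-dump (SRA-toolkit) command as an argument list."""
    cmd = ["fastq-dump", "--gzip", "--outdir", outdir]
    if paired:
        # Write the two mates of each spot to separate _1/_2 files.
        cmd.append("--split-files")
    cmd.append(accession)
    return cmd


# Placeholder accession; pass the list to subprocess.check_call() to run it.
print(" ".join(fastq_dump_command("SRR0000000", "raw_fastq", paired=True)))
```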
The analysis pipeline, written in Python (compatible with versions 2 and 3), begins by
checking the quality of the sequences with FastQC v0.11.5 (Andrews S. 2010).
Once this step is done, the files are sorted and repaired (in case of paired-end data).
Then, the pipeline calls Cutadapt v1.13 to trim low-quality ends and adapters from
the reads (Martin M. 2011).
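The quality-control and trimming calls described above might look like the following sketch. The function names are hypothetical, and the adapter sequence and the quality cutoff of 20 are assumptions, since the protocol does not give the values actually used.

```python
def fastqc_command(fastq_files, outdir):
    """Return a FastQC command that writes its reports to outdir."""
    return ["fastqc", "--outdir", outdir] + list(fastq_files)


def cutadapt_command(reads, trimmed, adapter, min_quality=20):
    """Return a Cutadapt command for single-end (1 file) or paired-end (2 files) reads."""
    if len(reads) != len(trimmed) or len(reads) not in (1, 2):
        raise ValueError("expected one or two input files and as many outputs")
    # -q trims low-quality 3' ends, -a removes a 3' adapter from the first read.
    cmd = ["cutadapt", "-q", str(min_quality), "-a", adapter, "-o", trimmed[0]]
    if len(reads) == 2:
        # -A gives the adapter for the second read, -p its output file.
        cmd += ["-A", adapter, "-p", trimmed[1]]
    return cmd + list(reads)


# Assumed adapter (start of a common Illumina adapter); replace with the real one.
ADAPTER = "AGATCGGAAGAGC"
print(" ".join(fastqc_command(["s_1.fastq", "s_2.fastq"], "qc")))
print(" ".join(cutadapt_command(["s_1.fastq", "s_2.fastq"],
                                ["t_1.fastq", "t_2.fastq"], ADAPTER)))
```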
STAR is then used to align the reads to the genome.
For paired-end data, an additional step creates a subsample of
1,000,000 randomly selected reads with Seqtk v1.0-r31; this subsample is used to estimate the maximum mate gap,
so that the corresponding parameter can be set correctly on the STAR command line (Dobin A. 2013).
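The subsampling and alignment calls could be sketched as follows. This is an illustration under assumptions: the protocol does not name the STAR parameter, so `--alignMatesGapMax` is a guess at "the corresponding parameter", and the seed, thread count, and BAM output options are placeholders. How the gap is measured from the subsample is not described in the source and is left out.

```python
import subprocess


def seqtk_sample_command(fastq, n_reads=1000000, seed=100):
    """Return a seqtk command that randomly samples n_reads reads (written to stdout)."""
    return ["seqtk", "sample", "-s", str(seed), fastq, str(n_reads)]


def subsample_pair(read1, read2, out1, out2, n_reads=1000000, seed=100):
    """Subsample both mate files with the same seed so that pairs stay matched."""
    for src, dst in ((read1, out1), (read2, out2)):
        with open(dst, "wb") as handle:
            subprocess.check_call(seqtk_sample_command(src, n_reads, seed),
                                  stdout=handle)


def star_command(genome_dir, reads, prefix, mates_gap_max=None, threads=4):
    """Return a STAR alignment command; reads holds one or two FASTQ paths."""
    cmd = ["STAR", "--runThreadN", str(threads), "--genomeDir", genome_dir,
           "--readFilesIn"] + list(reads)
    cmd += ["--outSAMtype", "BAM", "Unsorted", "--outFileNamePrefix", prefix]
    if mates_gap_max is not None:
        # Assumed parameter: maximum allowed gap between the two mates.
        cmd += ["--alignMatesGapMax", str(mates_gap_max)]
    return cmd


print(" ".join(seqtk_sample_command("t_1.fastq")))
print(" ".join(star_command("genome_index", ["t_1.fastq", "t_2.fastq"],
                            "sample_", mates_gap_max=200000)))
```

Using the same seed for both mate files is what keeps read pairs together in the subsample.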
The files produced by STAR (BAM format) are sorted with SAMtools v1.4.1 (Li et al. 2009); when technical replicates exist, their BAM files are merged at this step, also with SAMtools. These BAM alignment files are then processed with htseq-count v0.7.2 (Anders et al. 2014) to obtain the number of occurrences per gene. Finally, the data are normalized with R v3.4.0 using the DESeq2 package.
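A rough sketch of the sorting, merging, and counting calls, with a small parser for the htseq-count output, is given here. The helper names are hypothetical, and the sort order and strandedness settings are assumptions; the protocol does not say which values the pipeline uses.

```python
def samtools_sort_command(in_bam, out_bam, by_name=False):
    """Return a samtools sort command (by coordinate unless by_name is set)."""
    cmd = ["samtools", "sort"]
    if by_name:
        cmd.append("-n")
    return cmd + ["-o", out_bam, in_bam]


def samtools_merge_command(out_bam, replicate_bams):
    """Return a samtools merge command combining technical replicates."""
    return ["samtools", "merge", out_bam] + list(replicate_bams)


def htseq_count_command(bam, gtf, order="pos", stranded="no"):
    """Return an htseq-count command; the counts are written to stdout."""
    return ["htseq-count", "-f", "bam", "-r", order, "-s", stranded, bam, gtf]


def parse_htseq_counts(lines):
    """Parse 'gene<TAB>count' lines, skipping summary rows such as __no_feature."""
    counts = {}
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        gene, value = line.split("\t")
        if gene.startswith("__"):
            continue
        counts[gene] = int(value)
    return counts


print(" ".join(samtools_merge_command("merged.bam", ["rep1.bam", "rep2.bam"])))
print(parse_htseq_counts(["geneA\t12\n", "geneB\t0\n", "__no_feature\t7\n"]))
```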
The final output files give, for each gene, the RPKM (single-end data) or FPKM (paired-end data) value and the RLE (Relative Log Expression) normalized value. The RLE values are computed by calculating, for each probe-set (here, each gene), the ratio between the expression of the probe-set and its median expression across all arrays (here, samples) of the experiment.
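To make the two measures concrete, here is a small sketch. The RPKM function uses the standard definition (reads per kilobase of gene per million mapped reads), and the second function follows the ratio-to-median description above literally. It is not the DESeq2 implementation, and the function names and example numbers are illustrative assumptions.

```python
def rpkm(count, gene_length_bp, total_mapped_reads):
    """Reads per kilobase of gene per million mapped reads."""
    return count * 1e9 / (gene_length_bp * total_mapped_reads)


def median(values):
    """Median of a non-empty sequence of numbers."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2.0


def ratio_to_median(expression_across_samples):
    """Ratio of one gene's expression in each sample to its median across samples."""
    med = float(median(expression_across_samples))
    if med == 0:
        raise ValueError("median expression is zero; the ratio is undefined")
    return [value / med for value in expression_across_samples]


# A 2,000 bp gene with 500 reads in a library of 10 million mapped reads.
print(rpkm(500, 2000, 10000000))
# One gene measured in three samples.
print(ratio_to_median([10, 20, 40]))
```

For FPKM the same formula applies with fragments (read pairs) counted instead of individual reads.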
Andrews S. (2010). FastQC: a quality control tool for high throughput sequence data. Available online.
Martin M. (2011). Cutadapt removes adapter sequences from high-throughput sequencing reads. EMBnet.journal [Online] 17(1).
Dobin A. et al. (2013). STAR: ultrafast universal RNA-seq aligner. Bioinformatics 29(1):15-21.
Li H. et al. (2009). The Sequence Alignment/Map (SAM) format and SAMtools. Bioinformatics 25:2078-9.
Anders S. et al. (2014). HTSeq — A Python framework to work with high-throughput sequencing data. Bioinformatics 31(2):166-9.