0. Conda
0-1. To make an environment ('salmon') with Conda and install all reuquired tools inside the env 'salmon'
# install an environment named "salmon" and install package "salmon"
conda create -n salmon salmon # -n < environment > < package name >
# activate the "salmon" environment
conda activate salmon
# install star, cutadapt, trim-galore in the the "salmon" environment
conda install -c bioconda star
conda install -c bioconda cutadapt
conda install -c bioconda trim-galore
# check if star, salmon, cutadapt and trim_galore work
STAR
# STAR version=2.7.9a
salmon
# salmon v1.5.0
cutadapt --version
# 1.18
trim_galore
source deactivate
1. QC in shell script
cd $(pwd)
source /your_anaconda_path/anaconda3/bin/activate salmon # call salmon environmentmkdir fastqc
for id in RNA_SEQ_R RNA_SEQII ; do
fq1=fastqs/${id}_1.fastq.gz
fq2=fastqs/${id}_2.fastq.gz
fastqc -t 2 --extract -o fastqc $fq1 $fq2
done
2. Read trimming
2-1. trimgalore link
# install
(salmon)$ conda install -c bioconda trim-galore
# run
(salmon)$ trim_galore --paired RNA_SEQ_R1.fastq.gz RNA_SEQ_R2.fastq.gz
3. STAR in shell script
3-1. Download referecnes ( if salmon, GENECODE should be used )
curl -O http://ftp.ensembl.org/pub/release-104/fasta/homo_sapiens/dna/Homo_sapiens.GRCh38.dna.primary_assembly.fa.gzcurl -O http://ftp.ensembl.org/pub/release-104/gtf/homo_sapiens/Homo_sapiens.GRCh38.104.gtf.gz. curl -O http://ftp.ensembl.org/pub/release-104/fasta/homo_sapiens/cdna/Homo_sapiens.GRCh38.cdna.all.fa.gz
curl -O http://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_38/gencode.v38.annotation.gtf.gz
curl -O http://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_38/gencode.v38.transcripts.fa.gz
curl -O http://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_38/GRCh38.primary_assembly.genome.fa.gz
# "genome.fa" should be "primary_assembly" (see below)
2.2 Advanced options.
(taken from https://physiology.med.cornell.edu/faculty/skrabanek/lab/angsd/lecture_notes/STARmanual.pdf)
2.2.1 Which chromosomes/scaffolds/patches to include?
It is strongly recommended to include major chromosomes (e.g., for human chr1-22,chrX,chrY,chrM,) as well as un-placed and un-localized scaffolds. Typically, un-placed/un-localized scaffolds add just a few MegaBases to the genome length, however, a substantial number of reads may map to ribosomal RNA (rRNA) repeats on these scaffolds. These reads would be reported as unmapped if the scaffolds are not included in the genome, or, even worse, may be aligned to wrong loci on the chromosomes. Generally, patches and alternative haplotypes should not be included in the genome.
Examples of acceptable genome sequence files:
• ENSEMBL: files marked with .dna.primary.assembly, such as: ftp://ftp.ensembl. org/pub/release-77/fasta/homo_sapiens/dna/Homo_sapiens.GRCh38.dna.primary_ assembly.fa.gz
• GENCODE: files marked with PRI (primary). Strongly recommended for mouse and human:http://www.gencodegenes.org/.
3-2. Run STAR
cd $(pwd)
STAR --genomeDir star \
source /your_anaconda_path/anaconda3/bin/activate salmon # call salmon environment
--readFilesIn RNA_SEQ_R1.fq.gz RNA_SEQ_R2.fq.gz \
--runThreadN 4 \
--outFileNamePrefix RNA_SEQ. \
--sjdbGTFfile Homo_sapiens.GRCh38.96.gtf \
--outSAMattrRGline ID:RNA_SEQ SM:RNA_SEQ \
--quantMode TranscriptomeSAM \ # to make Aligned.toTranscriptome.out.bam, required for Salmon
--twopassMode Basic \
--outSAMtype BAM Unsorted \ # And then, samtools sort -@... -m... Aligned.out bam Aligned.sorted
--readFilesCommand zcat \
--runRNGseed 0 \
--outFilterMultimapNmax 20 \
--alignSJDBoverhangMin 1 \
--outSAMattributes NH HI AS NM MD \ # NH HI NM MD: standard format in the SAM, AS required for paired-end reads
--quantTranscriptomeBan Singleend # required for Salmon
3-1. STAR in shell script with for-loop
cd $(pwd)
source /your_anaconda_path/anaconda3/bin/activate salmon # call salmon environment
mkdir star
for id in RNA_SEQ_R; do
trim1=trimmed/${id}_1.fastq.gz
trim2=trimmed/${id}_2.fastq.gz
out=star/${id}_
STAR --genomeDir /home/USER/db/refanno/gencode.v33_star-2.7.3a \
--readFilesIn $trim1 $trim2 \
--outFileNamePrefix $out \
--readFilesCommand zcat \
--sjdbGTFfile Homo_sapiens.GRCh38.96.gtf \
--outSAMattrRGline ID:${id}_R1 SM:${id}_R1 \
--quantMode TranscriptomeSAM \ # to make Aligned.toTranscriptome.out.bam, required for Salmon
--twopassMode Basic \
--outSAMtype BAM Unsorted \ # And then, samtools sort -@... -m... Aligned.out bam Aligned.sorted
--readFilesCommand zcat \
--runRNGseed 0 \
--outFilterMultimapNmax 20 \
--alignSJDBoverhangMin 1 \
--outSAMattributes NH HI AS NM MD \ # NH HI NM MD: standard format of SAM, AS for paired-end reads
--quantTranscriptomeBan Singleend # required for Salmon
done
(salmon)$rm -R star/*_STARgenome star/*_STARpass1
3-2. sort BAM files
samtools sort myfile.bam -o myfile_sorted.bam
4. STAR-Salmon Quantification (alignment-based mode)
Alignment with STAR to the target genome, followed by quantification using Salmon with 6 threads
cd $(pwd)
source /your_anaconda_path/anaconda3/bin/activate salmon # call salmon environment
for id in RNA_SEQ_R; do
bam=star/${id}_Aligned.toTranscriptome.out.sorted.bam
salmon quant --threads 6 \
--targets /home/USER/db/refanno/gencode.v33.transcripts.fa \
--gencode \
--libType ISR \
--output salmon_star/$id \
--alignments $bam
done
References
Alignment-based method
ycl6.gitbook.io
FelixKrueger/TrimGalore
A wrapper around Cutadapt and FastQC to consistently apply adapter and quality trimming to FastQ files, with extra functionality for RRBS data - FelixKrueger/TrimGalore
github.com
https://physiology.med.cornell.edu/faculty/skrabanek/lab/angsd/lecture_notes/STARmanual.pdf
Getting Started
Salmon is an easy-to-use and ultrafast program for quantification from RNA-seq data
combine-lab.github.io
Quantification of transcript abundance using Salmon
Contributors: Mary Piper and Meeta Mistry Approximate time: 1.25 hours Learning Objectives Explore using lightweight algorithms to quantify reads to abundance estimates Understand how Salmon performs quasi-mapping and transcript abundance estimation Lightw
hbctraining.github.io
GENCODE - Human Release 38
Basic gene annotation ALL It contains the basic gene annotation on the reference chromosomes, scaffolds, assembly patches and alternate loci (haplotypes) This is a subset of the corresponding comprehensive annotation, including only those transcripts tagge
www.gencodegenes.org
'Learning > Statistics & Data analysis' 카테고리의 다른 글
[ ML / 머신러닝 ] Contents / 목차 (0) | 2021.11.20 |
---|---|
[ Summary part .1 ] Understanding Representation Learning With Autoencoder: Everything You Need to Know About Representation and Feature Learning (0) | 2021.10.08 |
[ technique review / RNA-seq. data analysis / Bulk RNA-seq / advanced ] nf-core/rnaseq (0) | 2021.05.31 |
[ Paper review ] ATAC-seq analysis / pipeline (0) | 2021.05.26 |
t-test (0) | 2020.05.29 |