[ Paper review ] ATAC-seq analysis / pipeline

A review of the paper "From reads to insight: a hitchhiker’s guide to ATAC-seq data analysis", published in Genome Biology in 2020.

 

1. Pre-analysis: quality control and alignment

1-1. Read trimming 

* tools :

cutadapt, AdapterRemoval v2, Skewer, trimmomatic

* Goal : 

to remove the Nextera sequencing adapters, which are often overrepresented in the reads

* Notes:

(1) These tools require the adapter sequences to be supplied for trimming

(2) trimmomatic trims both the overrepresented adapters and low-quality bases
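
To make the trimming step concrete, here is a minimal sketch of 3' adapter removal by exact matching. It is not how cutadapt or trimmomatic work internally (real trimmers tolerate mismatches and also trim by base quality). The adapter string is the commonly cited Nextera transposase sequence and is an assumption here; `trim_adapter` and `min_overlap` are hypothetical names for illustration.

```python
# Assumed Nextera adapter sequence; confirm it against your own library prep.
NEXTERA_ADAPTER = "CTGTCTCTTATACACATCT"


def trim_adapter(read: str, adapter: str = NEXTERA_ADAPTER, min_overlap: int = 5) -> str:
    """Remove a 3' adapter (complete, or partial at the read end) by exact matching."""
    # Case 1: the complete adapter occurs inside the read.
    idx = read.find(adapter)
    if idx != -1:
        return read[:idx]
    # Case 2: only a prefix of the adapter is present at the 3' end of the read.
    longest = min(len(adapter), len(read)) - 1
    for overlap in range(longest, min_overlap - 1, -1):
        if read.endswith(adapter[:overlap]):
            return read[: len(read) - overlap]
    # No adapter found: return the read unchanged.
    return read


if __name__ == "__main__":
    print(trim_adapter("ACGTACGTAC" + NEXTERA_ADAPTER + "GGG"))
    print(trim_adapter("ACGTACGTAC" + "CTGTCTC"))
```

The `min_overlap` cutoff avoids trimming very short chance matches at the read end.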

 

1-2. Read alignment to reference genome 

* tools :

BWA-MEM and Bowtie2

* Good-quality data sets have :

(1) a unique mapping rate > 80 %

(2) a minimum of 50 million mapped reads for open chromatin detection and differential analysis

(3) a minimum of 200 million mapped reads for TF footprinting
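
The thresholds above can be turned into a small check. This is only a sketch: the function name `check_alignment_qc` and the analysis labels are hypothetical, and the numbers are the ones listed in these notes.

```python
# Minimum mapped reads per analysis type, as listed in the notes above.
MIN_MAPPED_READS = {
    "open_chromatin": 50_000_000,
    "differential": 50_000_000,
    "footprinting": 200_000_000,
}


def check_alignment_qc(unique_mapping_rate: float, mapped_reads: int,
                       analysis: str = "open_chromatin") -> dict:
    """Compare alignment statistics with the suggested thresholds.

    unique_mapping_rate is a fraction between 0 and 1.
    """
    return {
        "mapping_rate_ok": unique_mapping_rate > 0.80,
        "depth_ok": mapped_reads > MIN_MAPPED_READS[analysis],
    }


if __name__ == "__main__":
    print(check_alignment_qc(0.85, 60_000_000, "footprinting"))
```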

 

1-3. Post alignment quality control

* tools :

Picard and SAMtools

* Goal :

(1) to remove improperly paired reads and reads of low mapping quality

(2) to remove reads mapped to the mitochondrial genome or to the ENCODE blacklisted regions [53, 54] (reference numbers taken directly from the paper)

(3) to remove duplicated reads (which arise as PCR artifacts)

 

* tools :

ATACseqQC, MultiQC ( comprehensive quality control tool )

* Goal : 

(1) to check whether the data produce a fragment size distribution plot

with periodic peaks corresponding to the nucleosome-free regions (NFR)

and to mono-, di-, and trinucleosomes ( <100 bp, ~200 bp, ~400 bp, ~600 bp, respectively ),

where fragments are enriched around 100 bp (NFR) and 200 bp (mono-nucleosome)

(2) to check whether the data produce a TSS enrichment plot

where nucleosome-free fragments are enriched around the transcription start site (TSS)

and mono-nucleosome fragments are depleted at the TSS

(3) to shift reads +4 bp and −5 bp for the positive and negative strand, respectively,

to account for the 9-bp duplication created by DNA repair of the nick made by the Tn5 transposase
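
Two of the ideas above are easy to sketch: assigning a fragment to a size class, and shifting the 5' end of a read to the Tn5 cut site. The class boundaries are not given in these notes, so the "nearest of 200/400/600 bp" rule below is an assumption, and both function names are hypothetical. ATACseqQC has its own implementations.

```python
# Approximate fragment lengths for each nucleosome class (from the notes above).
NUCLEOSOME_CENTERS = {
    "mono-nucleosome": 200,
    "di-nucleosome": 400,
    "tri-nucleosome": 600,
}


def classify_fragment(length: int) -> str:
    """Assign a fragment to a size class.

    Fragments under 100 bp are nucleosome-free; longer fragments go to the
    nearest class center (this nearest-center rule is an assumption).
    """
    if length < 100:
        return "nucleosome-free"
    return min(NUCLEOSOME_CENTERS,
               key=lambda name: abs(NUCLEOSOME_CENTERS[name] - length))


def shift_tn5(five_prime_pos: int, strand: str) -> int:
    """Shift the 5' end of a read: +4 bp on '+', -5 bp on '-'."""
    if strand == "+":
        return five_prime_pos + 4
    if strand == "-":
        return five_prime_pos - 5
    raise ValueError("strand must be '+' or '-'")


if __name__ == "__main__":
    print(classify_fragment(80), classify_fragment(190))
    print(shift_tn5(1000, "+"), shift_tn5(1200, "-"))
```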

 

1-4. This paper recommends

FastQC ➔ trimmomatic ➔ BWA-MEM ➔ ATACseqQC

 


 

2. Analysis 

2-1. Peak calling

* Goal :

to find the accessible regions ("peaks") by piling up the paired-end fragments

* Approach :

count-based peak caller by profiling fragment distribution

* tools :

(1) MACS2, HOMER, SICER/epic2, all of which use a Poisson distribution

(2) ZINBA, which uses a zero-inflated negative binomial distribution

(3) F-seq, PeakDEck, both of which use kernel density estimation

(4) JAMM, which uses mixture models and performs better when biological replicates are available

(5) HMMRATAC, designed specifically for ATAC-seq; it uses a three-state semi-supervised hidden Markov model and is reported to perform better than MACS2 and F-seq

* Notes

(1) Count-based tools behave similarly

(2) Peak tracks generated by these tools can be visualized
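
As a rough illustration of the count-based idea, the sketch below flags windows whose fragment count is unlikely under a Poisson background. This is a simplification and not the algorithm of MACS2 or HOMER: the fixed `background_rate`, the cutoff `alpha`, and the function names are all assumptions.

```python
import math


def poisson_sf(k: int, lam: float) -> float:
    """P(X >= k) for X ~ Poisson(lam), computed as 1 - P(X <= k - 1)."""
    cdf = sum(math.exp(-lam) * lam ** i / math.factorial(i) for i in range(k))
    return max(0.0, 1.0 - cdf)


def call_enriched_windows(counts, background_rate, alpha=1e-5):
    """Return indices of windows with a significantly high count."""
    return [i for i, c in enumerate(counts)
            if poisson_sf(c, background_rate) < alpha]


if __name__ == "__main__":
    window_counts = [2, 3, 1, 25, 30, 2]  # hypothetical fragment counts per window
    print(call_enriched_windows(window_counts, background_rate=2.5))
```

Adjacent significant windows would then be merged into peaks. Real peak callers estimate the background more carefully than a single genome-wide rate.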

 

2-2. This paper recommends

MACS2, HOMER, and HMMRATAC if computational resources are sufficient

 


 

3. Advanced analysis 

3-1. Peak differential analysis

* steps: 

(1) to find the candidate regions (consensus peaks or binned genome)

(2) to normalize

(3) to count the fragments in these regions

(4) to compare with other conditions statistically

(4-1) manually

(4-2) using consensus peaks

(4-3) using sliding window-based tools
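
The steps above can be sketched in a few lines. Here counting comes before normalization simply because counts per million (CPM) are computed from the counts; CPM and the log2 fold change are assumptions chosen for illustration, and no statistical test is included. All function names are hypothetical.

```python
import math


def count_fragments(fragments, regions):
    """Count fragments whose midpoint falls inside each (chrom, start, end) region."""
    counts = [0] * len(regions)
    for chrom, start, end in fragments:
        mid = (start + end) // 2
        for i, (r_chrom, r_start, r_end) in enumerate(regions):
            if chrom == r_chrom and r_start <= mid < r_end:
                counts[i] += 1
    return counts


def counts_per_million(counts, library_size):
    """Scale raw counts by library size."""
    return [c * 1e6 / library_size for c in counts]


def log2_fold_change(cpm_a, cpm_b, pseudocount=1.0):
    """log2(B / A) per region, with a pseudocount to avoid division by zero."""
    return [math.log2((b + pseudocount) / (a + pseudocount))
            for a, b in zip(cpm_a, cpm_b)]


if __name__ == "__main__":
    regions = [("chr1", 100, 200), ("chr1", 500, 600)]  # hypothetical candidate regions
    frags = [("chr1", 120, 160), ("chr1", 130, 190), ("chr1", 520, 560)]
    print(count_fragments(frags, regions))
```

The nested loop is fine for a toy example; real data would need an interval index.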

 

* tools for 4-2 (consensus peaks)

HOMER, DBChIP and DiffBind

(1) They all assume a negative binomial (NB) distribution

(2) They require biological replicates to estimate dispersion.

(3) HOMER : calls consensus peaks by pooling all samples, to reduce false positive differential peaks

(4) DBChIP and DiffBind : generate consensus peaks by intersection or union operations (intersection ignores sample- or condition-specific peaks)
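
The difference between union and intersection consensus peaks can be seen in a small sketch. It works on (start, end) intervals from a single chromosome and is not the DiffBind or DBChIP implementation; the function names are hypothetical.

```python
def merge_intervals(intervals):
    """Merge overlapping or touching (start, end) intervals."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def union_peaks(peaks_a, peaks_b):
    """Consensus by union: keep every region covered by either sample."""
    return merge_intervals(list(peaks_a) + list(peaks_b))


def intersect_peaks(peaks_a, peaks_b):
    """Consensus by intersection: keep only regions covered by both samples."""
    out = []
    for a_start, a_end in peaks_a:
        for b_start, b_end in peaks_b:
            start, end = max(a_start, b_start), min(a_end, b_end)
            if start < end:
                out.append((start, end))
    return sorted(out)


if __name__ == "__main__":
    sample_a = [(100, 200), (500, 600)]   # hypothetical peaks
    sample_b = [(150, 250), (900, 1000)]
    print(union_peaks(sample_a, sample_b))
    print(intersect_peaks(sample_a, sample_b))
```

In this example the peaks at 500-600 and 900-1000 each appear in only one sample, so they survive the union but are lost in the intersection.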

 

* tools for 4-3 (sliding window)

PePr, DiffReps and ChIPDiff

(1) ChIPDiff : uses an HMM to account for correlation between adjacent windows

(2) PePr, DiffReps : NB test, G-test, or chi-square test, depending on the availability of replicates
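
For the no-replicate case, a chi-square test on a single window can be sketched as a 2x2 table (reads inside vs. outside the window, condition A vs. condition B). This is a textbook Pearson test with 1 degree of freedom, not the PePr or DiffReps code; the function name and the example numbers are hypothetical.

```python
import math


def chi_square_window(count_a, count_b, total_a, total_b):
    """Pearson chi-square test (1 df) for one window between two conditions.

    Returns (statistic, p_value). For 1 degree of freedom the p-value is
    erfc(sqrt(statistic / 2)).
    """
    a, b = count_a, count_b                        # reads inside the window
    c, d = total_a - count_a, total_b - count_b    # reads outside the window
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        return 0.0, 1.0
    stat = n * (a * d - b * c) ** 2 / denom
    return stat, math.erfc(math.sqrt(stat / 2))


if __name__ == "__main__":
    stat, p = chi_square_window(100, 20, 1000, 1000)
    print(stat, p)
```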

 

 

3-2. Peak annotation 

* Goal:

After obtaining peak sets, peaks are annotated by the nearest genes or regulatory elements.

* Tools:

HOMER, ChIPseeker, and ChIPpeakAnno 

* Notes: 

 

After getting the genes, functional enrichment analysis is possible. 
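
Nearest-gene annotation can be sketched with a sorted list of TSS positions and a binary search. The sketch handles a single chromosome and ignores strand and genomic feature types, which ChIPseeker and the other tools do take into account; the gene names and the function name are hypothetical.

```python
import bisect


def annotate_nearest_tss(peaks, tss_list):
    """Annotate each (start, end) peak with the gene whose TSS is nearest.

    tss_list is a non-empty list of (position, gene_name) on the same chromosome.
    Returns (start, end, gene_name, tss_position - peak_center) per peak.
    """
    tss_sorted = sorted(tss_list)
    positions = [pos for pos, _ in tss_sorted]
    annotated = []
    for start, end in peaks:
        center = (start + end) // 2
        i = bisect.bisect_left(positions, center)
        # Only the TSSs just left and right of the center can be the nearest.
        candidates = tss_sorted[max(i - 1, 0): i + 1]
        pos, gene = min(candidates, key=lambda t: abs(t[0] - center))
        annotated.append((start, end, gene, pos - center))
    return annotated


if __name__ == "__main__":
    tss = [(1000, "GENE_A"), (5000, "GENE_B")]  # hypothetical genes
    print(annotate_nearest_tss([(900, 1100), (4000, 4200)], tss))
```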

 

3-3. Motif 

Transcription factors (TFs) recognize and bind to specific sequences (motifs) on DNA, and the binding positions are called TF binding sites (TFBSs).

* Studies with motif

(3-3-1) sequence-based prediction for motif frequency or activity

(3-3-2) footprinting for TF occupancy

* Motif databases

* Motifs are mainly stored in text format, as position weight matrices (PWMs)

(1) HOMER

(2) Bioconductor packages TFBSTools

(3) motifmatchr 

(4) "MEME suite" and "PWMScan" are more accessible owing to their web application interfaces
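
To show what a PWM is used for, the sketch below builds a log-odds matrix from per-position base counts and scans a sequence with it. The motif, the pseudocount, the uniform background of 0.25, and the score threshold are all assumptions; only the forward strand is scanned and the sequence is assumed to contain only A, C, G, T. It does not reproduce the scoring of HOMER, TFBSTools, or motifmatchr.

```python
import math


def pwm_from_counts(counts, pseudocount=1.0, background=0.25):
    """Convert per-position base counts into a log2-odds PWM."""
    pwm = []
    for column in counts:
        total = sum(column.get(base, 0) for base in "ACGT") + 4 * pseudocount
        pwm.append({
            base: math.log2(((column.get(base, 0) + pseudocount) / total) / background)
            for base in "ACGT"
        })
    return pwm


def scan_sequence(seq, pwm, threshold):
    """Return (position, score) for every window scoring at or above threshold."""
    width = len(pwm)
    hits = []
    for i in range(len(seq) - width + 1):
        score = sum(pwm[j][base] for j, base in enumerate(seq[i:i + width]))
        if score >= threshold:
            hits.append((i, score))
    return hits


if __name__ == "__main__":
    # Hypothetical 4-bp motif "TGAC" observed in 10 sites.
    motif_counts = [{"T": 10}, {"G": 10}, {"A": 10}, {"C": 10}]
    pwm = pwm_from_counts(motif_counts)
    print(scan_sequence("AATGACTT", pwm, threshold=4.0))
```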

 

 

3-3-1. Motif enrichment and activity analysis

* Goal :

to compare the position and frequency of motifs in each peak region against a random background or another condition.

* Final Goal :

to predict putative TFBSs indirectly from the sequences found within peak regions

* Tools :

"MEME-CentriMo"

(1) identifies motifs enriched near peak centers

(2) a widely used web application that produces a visual report
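
The comparison of motif frequency against a background can be sketched with a hypergeometric test: how surprising is the number of motif-containing peaks, given how often the motif occurs in peaks and background sequences combined? This is a generic enrichment test and not the CentriMo algorithm, which looks at where motifs sit relative to peak centers; the function name and the example numbers are hypothetical.

```python
from math import comb


def motif_enrichment_pvalue(peaks_with_motif, n_peaks,
                            background_with_motif, n_background):
    """One-sided hypergeometric p-value, P(X >= peaks_with_motif).

    The population is peaks plus background sequences; the "successes" are
    all sequences that contain the motif.
    """
    population = n_peaks + n_background
    successes = peaks_with_motif + background_with_motif
    upper = min(successes, n_peaks)
    tail = sum(comb(successes, i) * comb(population - successes, n_peaks - i)
               for i in range(peaks_with_motif, upper + 1))
    return tail / comb(population, n_peaks)


if __name__ == "__main__":
    # Hypothetical counts: 30 of 100 peaks vs 10 of 100 background sequences.
    print(motif_enrichment_pvalue(30, 100, 10, 100))
```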

 

 

3-3-2. Footprinting for TF occupancy

... 

 

 

References

 

From reads to insight: a hitchhiker’s guide to ATAC-seq data analysis

Assay of Transposase Accessible Chromatin sequencing (ATAC-seq) is widely used in studying chromatin biology, but a comprehensive review of the analysis tools has not been completed yet. Here, we discuss the major steps in ATAC-seq data analysis, including

genomebiology.biomedcentral.com