Paper review: "From reads to insight: a hitchhiker’s guide to ATAC-seq data analysis", published in Genome Biology, 2020
1. Pre-analysis: quality control and alignment
1-1. Read trimming
* tools :
cutadapt, AdapterRemoval v2, Skewer, Trimmomatic
* Goal :
to remove the Nextera sequencing adapters, which are often overrepresented in the reads
* Notes:
(1) These tools require the adapter sequences to be supplied as input
(2) Trimmomatic trims both the overrepresented adapters and low-quality bases
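As an illustration of what adapter trimming does, here is a minimal sketch in Python. It is not how any of the tools above is implemented: it only handles exact matches, whereas real trimmers tolerate mismatches and use base qualities. The adapter string is the sequence commonly given for Nextera libraries and should be checked against the kit documentation; the function name `trim_adapter` and the `min_overlap` parameter are made up for this example.

```python
# Nextera transposase adapter as commonly supplied to trimming tools
# (assumption: verify against your library-prep kit documentation).
NEXTERA_ADAPTER = "CTGTCTCTTATACACATCT"


def trim_adapter(read: str, adapter: str = NEXTERA_ADAPTER, min_overlap: int = 3) -> str:
    """Remove an exact 3' adapter match, including a partial adapter at the read end."""
    # Case 1: the full adapter occurs somewhere inside the read.
    pos = read.find(adapter)
    if pos != -1:
        return read[:pos]
    # Case 2: the read ends with a prefix of the adapter (longest prefix first).
    for overlap in range(min(len(adapter) - 1, len(read)), min_overlap - 1, -1):
        if read.endswith(adapter[:overlap]):
            return read[: len(read) - overlap]
    # No adapter found: return the read unchanged.
    return read
```

For example, `trim_adapter("ACGTACGT" + NEXTERA_ADAPTER + "GGG")` keeps only the insert `"ACGTACGT"`.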
1-2. Read alignment to reference genome
* tools :
BWA-MEM and Bowtie2
* Good-quality data sets have :
(1) a unique mapping rate > 80 %
(2) min. number of mapped reads > 50 million for open chromatin detection and differential analysis
(3) min. number of mapped reads > 200 million for TF footprinting
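The thresholds above can be turned into a quick check. This is only a sketch: the function name and the dictionary keys are hypothetical, and the two input numbers are assumed to come from the aligner's summary or from SAMtools statistics.

```python
def alignment_qc(unique_mapping_rate: float, mapped_reads: int) -> dict:
    """Compare alignment statistics with the thresholds listed in the notes.

    unique_mapping_rate is a fraction between 0 and 1 (0.85 means 85 %).
    """
    return {
        "unique_mapping_rate_ok": unique_mapping_rate > 0.80,
        # > 50 million mapped reads: enough for open chromatin detection
        "enough_for_open_chromatin": mapped_reads > 50_000_000,
        # > 200 million mapped reads: enough for TF footprinting
        "enough_for_footprinting": mapped_reads > 200_000_000,
    }
```

A sample with 85 % unique mapping and 60 million mapped reads would pass the first two checks but not the footprinting one.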
1-3. Post alignment quality control
* tools :
Picard and SAMtools
* Goal :
(1) to remove improperly paired reads and reads with low mapping quality
(2) to remove reads mapping to the mitochondrial genome or to the ENCODE blacklisted regions [53, 54] (reference numbers taken directly from the paper)
(3) to remove duplicated reads (which arise as PCR artifacts)
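The three filters above can be sketched on plain Python records. The flag bits are the standard SAM values (0x2 properly paired, 0x4 unmapped, 0x400 duplicate). Everything else is an assumption made for illustration: the `Read` record, the MAPQ cutoff of 30, the mitochondrial contig names, and the blacklist given as a list of `(chrom, start, end)` tuples. In practice Picard and SAMtools do this work on BAM files.

```python
from typing import NamedTuple

# Standard SAM flag bits.
FLAG_PROPER_PAIR = 0x2
FLAG_UNMAPPED = 0x4
FLAG_DUPLICATE = 0x400


class Read(NamedTuple):
    chrom: str
    start: int  # 0-based, inclusive
    end: int    # exclusive
    flag: int
    mapq: int


def overlaps_any(read: Read, regions) -> bool:
    """True if the read overlaps any (chrom, start, end) region."""
    return any(c == read.chrom and read.start < e and s < read.end
               for c, s, e in regions)


def keep_read(read: Read, blacklist, min_mapq: int = 30,
              mito_names=("chrM", "MT")) -> bool:
    """Apply the post-alignment filters; min_mapq and mito_names are assumptions."""
    if read.flag & FLAG_UNMAPPED:
        return False
    if not read.flag & FLAG_PROPER_PAIR:   # improperly paired
        return False
    if read.flag & FLAG_DUPLICATE:         # marked as a PCR/optical duplicate
        return False
    if read.mapq < min_mapq:               # low mapping quality
        return False
    if read.chrom in mito_names:           # mitochondrial genome
        return False
    if overlaps_any(read, blacklist):      # blacklisted region
        return False
    return True
```

The duplicate filter relies on duplicates having been marked beforehand (the 0x400 bit), which is what duplicate-marking tools do.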
* tools :
ATACseqQC, MultiQC (a comprehensive quality control tool)
* Goal :
(1) to check that the fragment size distribution plot
shows periodic peaks corresponding to the nucleosome-free regions (NFR) and
mono-, di-, and trinucleosomes (<100 bp, ~200 bp, ~400 bp, ~600 bp, respectively),
with enrichment around 100 bp (NFR) and 200 bp (mononucleosome)
(2) to check that the TSS enrichment plot shows
nucleosome-free fragments enriched around the transcription start site (TSS) and
mono-nucleosome fragments depleted at the TSS
(3) to shift reads +4 bp and −5 bp for the positive and negative strand, respectively,
to account for the 9-bp duplication created by DNA repair of the nick left by Tn5 transposase
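Two of the points above are easy to illustrate in code: binning fragment lengths into nucleosome classes, and the Tn5 offset. This is a sketch under stated assumptions, not the ATACseqQC implementation. Only the <100 bp NFR cutoff comes from the notes; the ±50 bp windows around 200, 400 and 600 bp are arbitrary illustrative choices. The shift is applied to both ends of a 0-based, half-open read interval, which is one common convention.

```python
def classify_fragment(length: int) -> str:
    """Assign a fragment length (bp) to a nucleosome class.

    The +/-50 bp windows around 200, 400 and 600 bp are illustrative assumptions.
    """
    if length < 100:
        return "NFR"
    for label, center in (("mono", 200), ("di", 400), ("tri", 600)):
        if abs(length - center) <= 50:
            return label
    return "other"


def tn5_shift(start: int, end: int, strand: str) -> tuple:
    """Shift a read interval by +4 bp (plus strand) or -5 bp (minus strand)."""
    if strand == "+":
        return start + 4, end + 4
    if strand == "-":
        return start - 5, end - 5
    raise ValueError(f"unknown strand: {strand!r}")
```

Counting the classes over all fragments gives the numbers behind the fragment size distribution plot; the shifted coordinates matter mainly for footprinting and motif-related analyses.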
1-4. The paper recommends
FastQC ➔ trimmomatic ➔ BWA-MEM ➔ ATACseqQC
2. Analysis
2-1. Peak calling
* Goal :
to find accessible regions ("peaks") by piling up the paired-end fragments
* Approach :
count-based peak callers that profile the fragment distribution
* tools :
(1) MACS2, HOMER, SICER/epic2, all of which use a Poisson distribution
(2) ZINBA, which uses a zero-inflated negative binomial distribution
(3) F-seq and PeaKDEck, both of which use kernel density estimation
(4) JAMM, which performs better by applying mixture models that make use of biological replicates
(5) HMMRATAC, designed specifically for ATAC-seq; it uses a three-state semi-supervised hidden Markov model and is reported to outperform MACS2 and F-seq
* Notes
(1) Count-based tools behave similarly
(2) Peak tracks generated by these tools can be visualized
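To make the count-based idea concrete, here is a toy Poisson test on per-window fragment counts. It is a simplified sketch and not the algorithm of MACS2, HOMER or any other tool listed above: the background rate is a single fixed number supplied by the caller, and the function names and the `alpha` cutoff are hypothetical.

```python
import math


def poisson_sf(k: int, lam: float) -> float:
    """Upper tail P(X >= k) for X ~ Poisson(lam)."""
    term = math.exp(-lam)  # P(X = 0)
    cdf = 0.0
    for i in range(k):     # accumulate P(X = 0) ... P(X = k - 1)
        cdf += term
        term *= lam / (i + 1)
    # Note: 1 - cdf loses precision for extremely small tail probabilities.
    return max(0.0, 1.0 - cdf)


def call_enriched_windows(counts, background_rate: float, alpha: float = 1e-5):
    """Return indices of windows whose count is unlikely under the background."""
    return [i for i, c in enumerate(counts)
            if poisson_sf(c, background_rate) < alpha]
```

With a background of 2 fragments per window, a window holding 40 fragments is called while windows with 1 to 3 fragments are not.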
2-2. The paper recommends
MACS2, HOMER, and HMMRATAC if computational resources are sufficient
3. Advanced analysis
3-1. Peak differential analysis
* steps:
(1) find the candidate regions (consensus peaks or binned genome)
(2) normalize
(3) count the fragments in these regions
(4) compare with other conditions statistically, using one of:
(4-1) a manual workflow
(4-2) consensus peak-based tools
(4-3) sliding window-based tools
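A manual version of these steps might look like the sketch below. It counts fragments by midpoint, scales counts to counts per million (CPM) and reports a log2 fold change. The midpoint rule, CPM as the normalization, the pseudocount and all function names are assumptions for illustration; a fold change is not a statistical test, so a real analysis would still need replicates and a proper model for step (4).

```python
import math


def count_fragments(fragments, regions):
    """Count fragments whose midpoint falls inside each (chrom, start, end) region."""
    counts = [0] * len(regions)
    for chrom, start, end in fragments:
        mid = (start + end) // 2
        for i, (r_chrom, r_start, r_end) in enumerate(regions):
            if chrom == r_chrom and r_start <= mid < r_end:
                counts[i] += 1
                break  # assumes candidate regions do not overlap
    return counts


def cpm(counts, library_size: int):
    """Scale raw counts to counts per million fragments."""
    return [c * 1e6 / library_size for c in counts]


def log2_fold_change(a, b, pseudocount: float = 1.0):
    """Per-region log2 ratio of condition a over condition b."""
    return [math.log2((x + pseudocount) / (y + pseudocount)) for x, y in zip(a, b)]
```

Here `library_size` is assumed to be the total number of fragments in the sample.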
* tools for (4-2)
HOMER, DBChIP and DiffBind
(1) They all assume a negative binomial (NB) distribution
(2) They require biological replicates to estimate dispersion
(3) HOMER : calls consensus peaks by pooling all samples, to reduce false-positive differential peaks
(4) DBChIP and DiffBind : generate consensus peaks by intersection or union operations (intersection ignores sample- or condition-specific peaks)
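The difference between union and intersection consensus peaks can be seen on a toy example. The sketch below works on `(start, end)` intervals from a single chromosome and is not the code used by DBChIP or DiffBind; the function names are hypothetical.

```python
def merge_intervals(intervals):
    """Merge overlapping or touching (start, end) intervals."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def union_peaks(a, b):
    """Consensus by union: every peak from either sample is kept."""
    return merge_intervals(list(a) + list(b))


def intersect_peaks(a, b):
    """Consensus by intersection: only regions covered in both samples are kept."""
    out = []
    for s1, e1 in merge_intervals(a):
        for s2, e2 in merge_intervals(b):
            start, end = max(s1, s2), min(e1, e2)
            if start < end:
                out.append((start, end))
    return sorted(out)
```

With sample A = `[(100, 200), (500, 600)]` and sample B = `[(150, 250), (900, 1000)]`, the union keeps three regions, while the intersection keeps only `(150, 200)` and drops the peaks specific to one sample.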
* tools for (4-3)
PePr, DiffReps and ChIPDiff
(1) ChIPDiff : uses an HMM to account for correlation between adjacent windows
(2) PePr, DiffReps : use an NB test, G-test, or chi-square test, depending on the availability of replicates
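When no replicates are available, a window can be compared between two conditions with a test on a 2x2 table. The sketch below is a generic Pearson chi-square test with one degree of freedom, not the implementation found in any of the tools above. The table layout (fragments inside the window versus the rest of the library) and the function name are assumptions.

```python
import math


def chi_square_2x2(a_in: int, a_total: int, b_in: int, b_total: int):
    """Pearson chi-square test for one window in two conditions.

    a_in / b_in: fragments inside the window; a_total / b_total: library sizes.
    Returns (chi2, p_value) with 1 degree of freedom.
    """
    table = [[a_in, a_total - a_in], [b_in, b_total - b_in]]
    n = a_total + b_total
    rows = [a_total, b_total]
    cols = [a_in + b_in, n - a_in - b_in]
    if 0 in cols or 0 in rows:
        return 0.0, 1.0  # test is undefined; report no evidence of a difference
    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            expected = rows[i] * cols[j] / n
            chi2 += (table[i][j] - expected) ** 2 / expected
    # Survival function of the chi-square distribution with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value
```

Equal proportions in both conditions give a statistic of 0 and a p-value of 1; a window with 50 versus 10 fragments in libraries of equal size gives a very small p-value.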
3-2. Peak annotation
* Goal:
to annotate each peak in the resulting peak set with its nearest gene or regulatory element
* Tools:
HOMER, ChIPseeker, and ChIPpeakAnno
* Notes:
Once peaks are linked to genes, functional enrichment analysis can be run on those genes.
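Nearest-gene annotation can be sketched as a lookup of the closest TSS to each peak center. This is a simplified illustration, not what HOMER, ChIPseeker or ChIPpeakAnno do internally: strand and gene body are ignored, and the input layout and function name are assumptions.

```python
import bisect


def annotate_nearest_tss(peaks, tss_list):
    """Annotate each peak with the gene whose TSS is closest to the peak center.

    peaks: (chrom, start, end); tss_list: (chrom, position, gene_name).
    Returns (chrom, start, end, gene_name, tss_position - peak_center).
    """
    by_chrom = {}
    for chrom, pos, gene in tss_list:
        by_chrom.setdefault(chrom, []).append((pos, gene))
    for entries in by_chrom.values():
        entries.sort()

    annotated = []
    for chrom, start, end in peaks:
        entries = by_chrom.get(chrom, [])
        if not entries:
            annotated.append((chrom, start, end, None, None))
            continue
        center = (start + end) // 2
        # The closest TSS is either just left or just right of the center.
        i = bisect.bisect_left(entries, (center,))
        candidates = entries[max(0, i - 1): i + 1]
        pos, gene = min(candidates, key=lambda e: abs(e[0] - center))
        annotated.append((chrom, start, end, gene, pos - center))
    return annotated
```

The resulting gene list is what would then be passed to a functional enrichment analysis.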
3-3. Motif
Transcription factors (TFs) recognize and bind to specific DNA sequences (motifs); the binding positions are called TF binding sites (TFBSs).
* Studies with motifs
(3-3-1) sequence-based prediction of motif frequency or activity
(3-3-2) footprinting for TF occupancy
* Motif databases
* Motifs are mainly stored in text format as position weight matrices (PWMs)
(1) HOMER
(2) Bioconductor package TFBSTools
(3) Bioconductor package motifmatchr
(4) "MEME suite" and "PWMScan" are more accessible owing to their web application interfaces
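To show what a PWM holds and how it is used to scan a sequence, here is a small self-contained sketch. It builds log-odds scores from a handful of aligned sites and finds the best-scoring window. The example sites, the pseudocount of 1 and the uniform background of 0.25 are assumptions for illustration; the tools listed above read PWMs from motif databases and use their own scoring thresholds.

```python
import math

BASES = "ACGT"


def build_pwm(sites, pseudocount: float = 1.0, background: float = 0.25):
    """Build a log-odds PWM (one dict per position) from equal-length sites."""
    total = len(sites) + 4 * pseudocount
    pwm = []
    for pos in range(len(sites[0])):
        column = [site[pos] for site in sites]
        pwm.append({
            base: math.log2(((column.count(base) + pseudocount) / total) / background)
            for base in BASES
        })
    return pwm


def score(pwm, seq: str) -> float:
    """Sum of per-position log-odds scores for a sequence of the PWM's width."""
    return sum(col[base] for col, base in zip(pwm, seq))


def best_match(pwm, sequence: str):
    """Return (score, offset) of the highest-scoring window in the sequence."""
    width = len(pwm)
    return max((score(pwm, sequence[i:i + width]), i)
               for i in range(len(sequence) - width + 1))
```

Usage, with made-up binding sites:

```python
sites = ["TGACTCA", "TGAGTCA", "TGACTCA", "TGAGTCA"]  # hypothetical example sites
pwm = build_pwm(sites)
best_score, offset = best_match(pwm, "CCCCTGACTCACCCC")  # offset is 4
```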
3-3-1. Motif enrichment and activity analysis
* Goal :
to compare the position and frequency of motifs in each peak region against a random background or another condition
* Final Goal :
to indirectly predict putative TFBSs from the sequences found within peak regions
* Tools :
"MEME-CentriMo"
(1) identifies motifs enriched near peak centers
(2) a widely used web application that produces a visual report
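The idea of comparing motif frequency in peaks against a background can be sketched with a hypergeometric test. This is a generic enrichment test and not the positional test that MEME-CentriMo performs. Matching the motif as an exact string is a strong simplification made for brevity, and the function names are hypothetical.

```python
import math


def hypergeom_sf(k: int, N: int, K: int, n: int) -> float:
    """P(X >= k) when drawing n items from N that contain K successes."""
    total = math.comb(N, n)
    return sum(math.comb(K, i) * math.comb(N - K, n - i)
               for i in range(k, min(K, n) + 1)) / total


def motif_enrichment(peak_seqs, background_seqs, motif: str):
    """Count sequences containing the motif and test for enrichment in peaks.

    Returns (peak_hits, background_hits, p_value).
    """
    peak_hits = sum(motif in seq for seq in peak_seqs)
    bg_hits = sum(motif in seq for seq in background_seqs)
    N = len(peak_seqs) + len(background_seqs)  # all sequences
    K = peak_hits + bg_hits                    # all sequences with the motif
    p_value = hypergeom_sf(peak_hits, N, K, len(peak_seqs))
    return peak_hits, bg_hits, p_value
```

A small p-value suggests the motif occurs in the peak set more often than expected from the background, which is the indirect evidence for putative TFBSs mentioned above.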
3-3-2. Footprinting for TF occupancy
...
References
From reads to insight: a hitchhiker’s guide to ATAC-seq data analysis
genomebiology.biomedcentral.com