[ Paper review ] ATAC-seq analysis / pipeline

A review of the paper "From reads to insight: a hitchhiker’s guide to ATAC-seq data analysis", published in Genome Biology in 2020.

1. Pre-analysis: quality control and alignment

1-1. Read trimming

* tools :

cutadapt, AdapterRemoval v2, Skewer, trimmomatic

* Goal :

to remove Nextera sequencing adapters, which are often overrepresented in the reads

* Notes:

(1) These tools require the adapter sequences to trim

(2) trimmomatic trims both the overrepresented adapters and low-quality bases
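To make the idea of adapter trimming concrete, here is a minimal Python sketch. It is not how cutadapt or trimmomatic work internally (they tolerate mismatches and use quality scores); it only removes an exact adapter match at the 3' end. The function name `trim_adapter` and the `min_overlap` parameter are made up for this illustration, and the adapter string is the commonly used Nextera transposase sequence, which should be confirmed for your own library prep.

```python
# Commonly used Nextera transposase adapter (assumption: confirm for your kit).
NEXTERA_ADAPTER = "CTGTCTCTTATACACATCT"


def trim_adapter(read: str, adapter: str = NEXTERA_ADAPTER, min_overlap: int = 3) -> str:
    """Remove an exact 3' adapter match (full or partial) from a read.

    Toy version: no mismatches, no quality trimming.
    """
    # Case 1: the full adapter occurs in the read; cut from where it starts.
    pos = read.find(adapter)
    if pos != -1:
        return read[:pos]

    # Case 2: only the beginning of the adapter is present at the 3' end,
    # i.e. a suffix of the read equals a prefix of the adapter.
    longest = min(len(adapter), len(read)) - 1
    for overlap in range(longest, max(min_overlap, 1) - 1, -1):
        if read.endswith(adapter[:overlap]):
            return read[:-overlap]

    # No adapter found: return the read unchanged.
    return read


if __name__ == "__main__":
    print(trim_adapter("ACGTACGT" + NEXTERA_ADAPTER + "GGG"))
    print(trim_adapter("ACGTACGTCTGTC"))
```

This also shows why note (1) matters: the tool has to be told which adapter sequence to look for.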

1-2. Read alignment to reference genome

* tools :

BWA-MEM and Bowtie2

* Good quality data sets have :

(1) a unique mapping rate > 80%

(2) at least 50 million mapped reads for open chromatin detection and differential analysis

(3) at least 200 million mapped reads for TF footprinting
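The thresholds above can be written down as a small check. This is only a sketch: the function name `passes_alignment_qc` and the `purpose` labels are hypothetical, and treating the read counts as inclusive minimums is my own reading of "min. number".

```python
# Thresholds taken from the notes above.
MIN_UNIQUE_MAPPING_RATE = 0.80
MIN_MAPPED_READS = {
    "open_chromatin": 50_000_000,   # open chromatin detection / differential analysis
    "footprinting": 200_000_000,    # TF footprinting
}


def passes_alignment_qc(unique_mapping_rate: float,
                        mapped_reads: int,
                        purpose: str = "open_chromatin") -> bool:
    """Return True if a data set meets the mapping-rate and depth thresholds."""
    if purpose not in MIN_MAPPED_READS:
        raise ValueError(f"unknown purpose: {purpose!r}")
    rate_ok = unique_mapping_rate > MIN_UNIQUE_MAPPING_RATE
    # Assumption: the read count is treated as an inclusive minimum.
    depth_ok = mapped_reads >= MIN_MAPPED_READS[purpose]
    return rate_ok and depth_ok


if __name__ == "__main__":
    print(passes_alignment_qc(0.85, 60_000_000))
    print(passes_alignment_qc(0.85, 60_000_000, purpose="footprinting"))
```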

1-3. Post alignment quality control

* tools :

Picard and SAMtools

* Goal :

(1) to remove improperly paired reads and reads of low mapping quality

(2) to remove reads from the mitochondrial genome and from the ENCODE blacklisted regions [53, 54] (reference numbers taken directly from the paper)

(3) to remove duplicated reads (which arise as PCR artifacts)
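The three filtering goals above can be sketched in plain Python on top of the SAM FLAG bits. In practice this is done with SAMtools and Picard on BAM files; the `Alignment` record, the `keep_alignment` function, the MAPQ cutoff of 30, and the mitochondrial contig names are all assumptions for illustration. The duplicate bit is only meaningful after a duplicate-marking step has been run.

```python
from typing import NamedTuple, Sequence, Tuple

# SAM FLAG bits (from the SAM specification).
FLAG_PROPER_PAIR = 0x2
FLAG_DUPLICATE = 0x400


class Alignment(NamedTuple):
    """Minimal stand-in for one aligned read (hypothetical record type)."""
    name: str
    flag: int
    chrom: str
    pos: int   # 0-based leftmost position
    mapq: int


def keep_alignment(aln: Alignment,
                   min_mapq: int = 30,                      # assumed cutoff
                   mito_names: Sequence[str] = ("chrM", "MT"),
                   blacklist: Sequence[Tuple[str, int, int]] = ()) -> bool:
    """Return True if the read survives all post-alignment filters."""
    if not aln.flag & FLAG_PROPER_PAIR:   # (1) improperly paired
        return False
    if aln.mapq < min_mapq:               # (1) low mapping quality
        return False
    if aln.chrom in mito_names:           # (2) mitochondrial genome
        return False
    for chrom, start, end in blacklist:   # (2) blacklisted regions
        # Simplification: only the leftmost position is tested, not the full read.
        if aln.chrom == chrom and start <= aln.pos < end:
            return False
    if aln.flag & FLAG_DUPLICATE:         # (3) marked duplicates
        return False
    return True


if __name__ == "__main__":
    reads = [
        Alignment("r1", 99, "chr1", 5000, 42),
        Alignment("r2", 99 | FLAG_DUPLICATE, "chr1", 5000, 42),
        Alignment("r3", 99, "chrM", 100, 42),
    ]
    print([r.name for r in reads if keep_alignment(r)])
```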

* tools :

ATACseqQC, MultiQC ( comprehensive quality control tool )

* Goal :

(1) to check that the data produce a fragment size distribution plot

with periodic peaks matching the nucleosome-free regions (NFR) and

mono-, di- and tri-nucleosomes ( <100 bp, ~200 bp, ~400 bp, ~600 bp, respectively ),

where the enrichment is around 100 bp (NFR) and 200 bp (mono-nucleosome)

(2) to check that the data produce a TSS enrichment plot

where nucleosome-free fragments are enriched around the transcription start site (TSS)

and mono-nucleosome fragments are depleted at the TSS

(3) to shift reads by + 4 bp on the positive strand and − 5 bp on the negative strand,

to account for the 9-bp duplication created by DNA repair of the nick left by the Tn5 transposase
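Two of the checks above are simple arithmetic, so a short Python sketch may help. ATACseqQC does this on real BAM files; here fragments are just `(start, end)` pairs in 0-based half-open coordinates. The function names are hypothetical, and the size-class boundaries (100, 300, 500, 700 bp) are my own rough cutoffs around the ~200/400/600 bp peaks, not values from the paper.

```python
from collections import Counter

# Assumed size classes: (label, inclusive lower bound, exclusive upper bound) in bp.
SIZE_CLASSES = [
    ("NFR", 0, 100),
    ("mono", 100, 300),
    ("di", 300, 500),
    ("tri", 500, 700),
]


def shift_tn5(start: int, end: int, plus_shift: int = 4, minus_shift: int = 5):
    """Shift the plus-strand end by +4 bp and the minus-strand end by -5 bp."""
    return start + plus_shift, end - minus_shift


def classify_fragment(length: int) -> str:
    """Assign a fragment length to NFR / mono / di / tri / other."""
    for label, low, high in SIZE_CLASSES:
        if low <= length < high:
            return label
    return "other"


def size_profile(fragments):
    """Count fragments per size class; a crude stand-in for the size distribution plot."""
    return dict(Counter(classify_fragment(end - start) for start, end in fragments))


if __name__ == "__main__":
    print(shift_tn5(100, 300))
    print(size_profile([(0, 80), (0, 200), (0, 210), (0, 410)]))
```

A good library should put most fragments in the NFR and mono classes, with di and tri progressively smaller.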

1-4. This paper recommends

FastQC ➔ trimmomatic ➔ BWA-MEM ➔ ATACseqQC

2. Analysis

2-1. Peak calling

* Goal :

to find the accessible regions ("peaks") by piling up the paired-end fragments

* Approach :

count-based peak caller by profiling fragment distribution

* tools :

(1) MACS2, HOMER, SICER/epic2, all of which assume a Poisson distribution

(2) ZINBA, which uses a zero-inflated negative binomial distribution

(3) F-seq, PeakDEck, both of which use kernel density estimation

(4) JAMM, which applies mixture models and can take advantage of biological replicates

(5) HMMRATAC, developed specifically for ATAC-seq; it uses a three-state semi-supervised hidden Markov model and is reported to perform better than MACS2 and F-seq
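To show what "count-based with a Poisson distribution" means, here is a toy sketch. It is not the MACS2 or HOMER algorithm: real callers estimate the background locally and correct for multiple testing, and this sketch does neither. The function names, the fixed `background_rate`, and the `alpha` cutoff are assumptions.

```python
import math


def poisson_sf(k: int, lam: float) -> float:
    """P(X >= k) for X ~ Poisson(lam), computed from the cumulative sum."""
    if k <= 0:
        return 1.0
    term = math.exp(-lam)   # P(X = 0)
    cdf = term
    for i in range(1, k):
        term *= lam / i     # P(X = i) from P(X = i - 1)
        cdf += term
    return max(0.0, 1.0 - cdf)


def call_enriched_windows(counts, background_rate: float, alpha: float = 1e-5):
    """Return indices of windows whose fragment count is unlikely under the background."""
    return [i for i, c in enumerate(counts)
            if poisson_sf(c, background_rate) < alpha]


if __name__ == "__main__":
    window_counts = [2, 3, 1, 25, 2]   # fragments piled up per window (made-up numbers)
    print(call_enriched_windows(window_counts, background_rate=2.0))
```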

* Notes

(1) Count-based tools behave similarly

(2) Peak tracks generated by these tools can be visualized

2-2. This paper recommends

MACS2 and HOMER, and HMMRATAC if enough computational resources are available

3. Advanced analysis

3-1. Peak differential analysis

* steps:

(1) to find the candidate regions (consensus peaks or binned genome)

(2) to normalize

(3) to count the fragments in these regions

(4) to compare with other conditions statistically, using one of:

(4-1) a manual workflow

(4-2) consensus peak-based tools

(4-3) sliding window-based tools
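The first three steps above can be sketched in a few lines of Python. This is only an illustration: the function names are made up, CPM is one possible normalization that I chose for simplicity (so counting comes before normalizing here), and the log2 fold change at the end is not a statistical test. Step (4) needs a proper model, which is what the tools in this section provide.

```python
import math


def count_fragments(fragments, regions):
    """Count fragments whose midpoint falls inside each candidate region.

    fragments and regions are lists of (chrom, start, end), 0-based half-open.
    """
    counts = [0] * len(regions)
    for chrom, start, end in fragments:
        mid = (start + end) // 2
        for i, (r_chrom, r_start, r_end) in enumerate(regions):
            if chrom == r_chrom and r_start <= mid < r_end:
                counts[i] += 1
    return counts


def cpm(counts):
    """Counts per million: a simple library-size normalization (assumed choice)."""
    total = sum(counts)
    if total == 0:
        return [0.0] * len(counts)
    return [c * 1e6 / total for c in counts]


def log2_fold_change(cond_a, cond_b, pseudocount: float = 1.0):
    """Per-region log2(B / A) with a pseudocount to avoid division by zero."""
    return [math.log2((b + pseudocount) / (a + pseudocount))
            for a, b in zip(cond_a, cond_b)]


if __name__ == "__main__":
    regions = [("chr1", 100, 200), ("chr1", 500, 600)]
    sample = [("chr1", 110, 150), ("chr1", 120, 180), ("chr1", 510, 560)]
    counts = count_fragments(sample, regions)
    print(counts, cpm(counts))
```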

* tools for (4-2)

HOMER, DBChIP and DiffBind

(1) They all assume a negative binomial (NB) distribution

(2) They require biological replicates to estimate dispersion.

(3) HOMER : calls consensus peaks by pooling all samples, to reduce false positive differential peaks

(4) DBChIP and DiffBind : generate consensus peaks by intersection or union operations (intersection ignores sample- or condition-specific peaks)
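The difference between union and intersection consensus peaks is easy to see on a toy example. The sketch below works on one chromosome with `(start, end)` intervals; the function names are hypothetical and this is not the DiffBind or DBChIP implementation.

```python
def merge_intervals(intervals):
    """Merge overlapping or touching (start, end) intervals."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            # Overlaps the previous interval: extend it.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def union_peaks(peaks_a, peaks_b):
    """Consensus by union: every region called in either sample."""
    return merge_intervals(list(peaks_a) + list(peaks_b))


def intersect_peaks(peaks_a, peaks_b):
    """Consensus by intersection: only regions called in both samples."""
    shared = []
    for s1, e1 in peaks_a:
        for s2, e2 in peaks_b:
            start, end = max(s1, s2), min(e1, e2)
            if start < end:
                shared.append((start, end))
    return merge_intervals(shared)


if __name__ == "__main__":
    sample_a = [(100, 200), (500, 600)]
    sample_b = [(150, 250), (900, 1000)]
    print(union_peaks(sample_a, sample_b))
    print(intersect_peaks(sample_a, sample_b))
```

In this example the peaks at 500-600 and 900-1000 exist in only one sample each, so they survive the union but disappear from the intersection.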

* tools for (4-3)

PePr, DiffReps and ChIPDiff

(1) ChIPDiff : uses an HMM to account for correlation between adjacent windows

(2) PePr, DiffReps : use an NB test, G-test, or chi-square test, depending on the availability of replicates

3-2. Peak annotation

* Goal:

After obtaining peak sets, peaks are annotated by the nearest genes or regulatory elements.

* Tools:

HOMER, ChIPseeker, and ChIPpeakAnno

* Notes:

After getting the genes, functional enrichment analysis is possible.
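Nearest-gene annotation can be sketched as a lookup of the closest TSS to each peak center. This is a simplified illustration for a single chromosome that ignores strand and gene bodies; the function name `nearest_gene` and the gene names in the example are made up, and tools such as ChIPseeker do considerably more.

```python
from bisect import bisect_left


def nearest_gene(peak_start: int, peak_end: int, tss_list):
    """Return (gene, signed distance from peak center to TSS).

    tss_list: list of (position, gene_name) on one chromosome, sorted by position.
    """
    if not tss_list:
        raise ValueError("tss_list is empty")
    center = (peak_start + peak_end) // 2
    positions = [pos for pos, _ in tss_list]
    i = bisect_left(positions, center)
    # Only the TSS just left and just right of the center can be the nearest.
    candidates = tss_list[max(i - 1, 0): i + 1]
    pos, gene = min(candidates, key=lambda t: abs(t[0] - center))
    return gene, pos - center


if __name__ == "__main__":
    tss = [(1000, "geneA"), (5000, "geneB"), (9000, "geneC")]  # hypothetical genes
    print(nearest_gene(4500, 4700, tss))
```

The resulting gene list is what would then go into the functional enrichment analysis.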

3-3. Motif

Transcription factors (TFs) recognize and bind to specific sequences (motifs) on DNA, and the binding positions are called TF binding sites (TFBSs).

* Studies with motif

(3-3-1) sequence-based prediction for motif frequency or activity

(3-3-2) footprinting for TF occupancy

* Motif databases

* Motifs are mainly stored in text format as position weight matrices (PWMs)

(1) HOMER

(2) Bioconductor package TFBSTools

(3) motifmatchr

(4) "MEME suite" and "PWMScan" are more accessible owing to their web application interfaces
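For readers who have not seen a PWM before, here is a minimal sketch of how one is built from a count matrix and used to scan a sequence. It is not the scoring used by any specific tool listed above: the pseudocount, the uniform background of 0.25, the forward-strand-only scan, and the toy "TGA" motif are all assumptions.

```python
import math

BASES = "ACGT"


def pfm_to_pwm(pfm, pseudocount: float = 0.5, background: float = 0.25):
    """Convert per-position base counts into log2-odds scores.

    pfm: dict mapping each base to a list of counts, one per motif position.
    Returns a list (one entry per position) of dicts base -> score.
    """
    width = len(pfm["A"])
    pwm = []
    for j in range(width):
        col_total = sum(pfm[b][j] for b in BASES) + 4 * pseudocount
        pwm.append({
            b: math.log2(((pfm[b][j] + pseudocount) / col_total) / background)
            for b in BASES
        })
    return pwm


def best_hit(sequence: str, pwm):
    """Scan the forward strand and return (offset, score) of the best window.

    Assumes the sequence contains only A, C, G, T; returns None if it is too short.
    """
    width = len(pwm)
    best = None
    for i in range(len(sequence) - width + 1):
        score = sum(pwm[j][sequence[i + j]] for j in range(width))
        if best is None or score > best[1]:
            best = (i, score)
    return best


if __name__ == "__main__":
    # Toy count matrix for a made-up 3-bp motif "TGA".
    toy_pfm = {"A": [0, 0, 10], "C": [0, 0, 0], "G": [0, 10, 0], "T": [10, 0, 0]}
    print(best_hit("ACCTGACC", pfm_to_pwm(toy_pfm)))
```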

3-3-1. Motif enrichment and activity analysis

* Goal :

to compare the position and frequency of motifs in each peak region against a random background or another condition.

* Final Goal :

to predict putative TFBSs indirectly from the sequences found within peak regions
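One way to express "more frequent than in the background" is a hypergeometric test on the number of sequences that contain the motif. The sketch below is an assumed, simplified version of that idea and not the exact statistic of any specific tool; the function names and the example numbers are made up.

```python
from math import comb


def hypergeom_sf(k: int, total: int, with_motif: int, drawn: int) -> float:
    """P(X >= k) when `drawn` sequences are sampled from `total`,
    of which `with_motif` contain the motif."""
    upper = min(with_motif, drawn)
    favourable = sum(comb(with_motif, i) * comb(total - with_motif, drawn - i)
                     for i in range(k, upper + 1))
    return favourable / comb(total, drawn)


def motif_enrichment(peak_hits: int, n_peaks: int, bg_hits: int, n_bg: int):
    """Return (fold enrichment, p-value) for a motif in peaks vs background."""
    peak_freq = peak_hits / n_peaks
    bg_freq = bg_hits / n_bg
    fold = peak_freq / bg_freq if bg_freq > 0 else float("inf")
    # Treat the peaks as a draw from the pooled peak + background sequences.
    p_value = hypergeom_sf(peak_hits, n_peaks + n_bg, peak_hits + bg_hits, n_peaks)
    return fold, p_value


if __name__ == "__main__":
    # Made-up numbers: motif found in 30 of 100 peaks and 10 of 100 background regions.
    print(motif_enrichment(30, 100, 10, 100))
```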

* Tools :

"MEME-CentriMo"

(1) identifies motifs enriched near peak centers

(2) a widely used web application that produces a visual report

3-3-2. Footprinting for TF occupancy

...

References

From reads to insight: a hitchhiker’s guide to ATAC-seq data analysis

Assay of Transposase Accessible Chromatin sequencing (ATAC-seq) is widely used in studying chromatin biology, but a comprehensive review of the analysis tools has not been completed yet. Here, we discuss the major steps in ATAC-seq data analysis, including

genomebiology.biomedcentral.com
