Determination of replication timing by log2 ratio of replicated DNA
Both sub-methods shown in Fig. 1.5 use a fluorescence-activated cell sorter (FACS) to select cells according to the increase in DNA content during S phase. The left protocol achieves a higher signal-to-noise ratio by pulse labeling with BrdU (5-bromo-2′-deoxyuridine), a thymidine analog that marks newly synthesized DNA. Immunoprecipitation is then used to pull down the BrdU-labeled DNA (Fig. 1.5, left panel, BrdU-IP method). Because all BrdU-containing sequences are synthesized after the start of S phase, the FACS sort divides them into two groups, early S and late S, based on the amount of DNA within the cells. The S/G1 method (Fig. 1.5, right panel), in contrast, relies only on the DNA content measured by FACS to classify cells into G1 and S phase groups. In G1, replication has not yet started and cells carry only the two parental template strands, so G1 cells contain equal copy numbers of all genomic sequences. However, without BrdU labeling, the S/G1 method cannot distinguish newly synthesized sequences from the parental template sequences, and it may therefore generate higher background noise than the BrdU-IP method.
Whether the two groups are G1 and S or early and late S, once they are separated, a high-density whole-genome oligonucleotide microarray or next-generation sequencing (NGS) can be used to determine the sequences in each group. With NGS, for example, the reads are mapped to the genome and counted in small intervals along the genome (the interval size, i.e. the resolution, is set by the researcher). The replication timing value of each interval is then estimated as log2(early read count / late read count) or log2(S read count / G1 read count), respectively.
Figure 1.5: Schematic of the two classical sub-methods for obtaining log2 ratio timing values (Gilbert DM et al., 2010). The two double-peaked profiles show the FACS statistics for the cells. The X-axis gives the fluorescence units representing the amount of DNA within a cell, and the Y-axis gives the number (or proportion) of cells with that DNA amount. Red and green boxes mark the two sorted groups, which follow one another in the cell cycle: the red group is earlier than the green one. In the right protocol, red marks cells in G1 phase and green marks cells in S phase. In the left protocol, all selected cells are in S phase and are divided into early S and late S.
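The log2 ratio calculation can be illustrated with a minimal Python sketch. It assumes the reads have already been counted per bin; the function name, the library-size normalization, and the pseudocount (used to avoid division by zero in empty bins) are illustrative choices and are not taken from the cited protocols.

```python
import math

def log2_ratio_timing(num_counts, den_counts, pseudocount=1.0):
    """Per-bin log2(numerator / denominator), e.g. early/late or S/G1.

    Counts are rescaled by the total of each sample so that a difference in
    sequencing depth is not mistaken for a timing difference (an assumption
    of this sketch). The pseudocount keeps empty bins finite.
    """
    if len(num_counts) != len(den_counts):
        raise ValueError("both samples must use the same bins")
    n_bins = len(num_counts)
    num_total = sum(num_counts) + pseudocount * n_bins
    den_total = sum(den_counts) + pseudocount * n_bins
    values = []
    for n, d in zip(num_counts, den_counts):
        num_frac = (n + pseudocount) / num_total
        den_frac = (d + pseudocount) / den_total
        values.append(math.log2(num_frac / den_frac))
    return values

# Hypothetical read counts in three bins
early = [30, 10, 20]
late = [10, 30, 20]
timing = log2_ratio_timing(early, late)
```

A positive value marks a bin that is over-represented in the early (or S) sample, i.e. an early-replicating bin; a negative value marks a late-replicating one.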
Determination of replication timing by Repli-seq
In the Repli-seq method, as in the log2 ratio methods, the reference genome is first divided into 50 kb or 100 kb bins. However, instead of two groups, the entire S phase is split into six S-phase cell subpopulations (Fig. 1.6), and the sequencing reads falling in each bin are counted for each of the six subpopulations (Fig. 1.7). The S50 value is a replication timing estimator defined as the fraction of S phase at which 50% of a given genomic region has been replicated (Chen et al., 2010; Dellino et al., 2013). It is scaled from 0 to 1, from very early to very late replicating regions, so the smaller the S50 value, the earlier the corresponding position replicates. More recently, high-resolution Repli-seq with 16 fractions of S phase has been reported to give a more precise timing value in a similar way (Vouzas and Gilbert, 2021).
Figure 1.6 Density curve of the cell count distribution by flow cytometry (Brison et al., 2020). The X-axis shows the fluorescence units representing the amount of DNA within a cell, and the Y-axis gives the cell count at each fluorescence value. Flow cytometry sorts the cells into groups S1 to S6 (from early to late) according to the fluorescence signal, which indicates the percentage of DNA already replicated in each cell.
Figure 1.7 Read distribution of Repli-seq from S1 to S6. The IGV screenshot displays, on chromosome 9, the read densities of newly replicated DNA from cells collected at successive periods of S phase; the reads spread from the early initiation zones toward both sides. Replication initiation zones can be roughly located in the plot at the positions enriched in BrdU-labeled sequences of the G1 phase, but the 50 kb or 100 kb resolution of Repli-seq is not sufficient for precise positioning of replication origins.
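The S50 definition can be made concrete with a short Python sketch. It assumes that the S-phase fractions cover equal shares of S phase, that the per-fraction read counts of a bin are already normalized between fractions, and that replication progresses linearly within a fraction; these simplifications and the function name are illustrative, not the procedure of the cited papers.

```python
def s50(fraction_counts):
    """Estimate S50 for one genomic bin from its reads in S1..Sn.

    Returns the fraction of S phase (0 = very early, 1 = very late) at which
    half of the bin's nascent-DNA signal has accumulated, or None if the bin
    has no reads.
    """
    total = sum(fraction_counts)
    if total <= 0:
        return None
    n = len(fraction_counts)
    cumulative = 0.0
    for i, count in enumerate(fraction_counts):
        share = count / total
        if cumulative + share >= 0.5:
            # Linear interpolation inside fraction i (assumption of the sketch)
            within = (0.5 - cumulative) / share
            return (i + within) / n
        cumulative += share
    return 1.0

# Hypothetical bins: signal only in S1, only in S6, and spread evenly
early_bin = s50([6, 0, 0, 0, 0, 0])
late_bin = s50([0, 0, 0, 0, 0, 6])
flat_bin = s50([1, 1, 1, 1, 1, 1])
```

With these inputs the early bin gets a value close to 0, the late bin a value close to 1, and the evenly replicated bin a value of about 0.5.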
T-peak regions containing replication initiation zones
Figure 1.8 Schematic of T-peak regions in IGV. The first track shows the T-peak center regions as blue bars 1 kb wide. The red step curve is the timing curve, made up of the replication timing values in 50 kb non-overlapping bins. The black arrow on the right indicates the relationship between the timing value and the temporal order of replication: the higher the timing value at a position, the earlier that position replicates. All T-peak regions coincide with positions where the replication timing curve reaches a local maximum.
For any given cell line, replication timing curves can be drawn along the genome at different resolutions (i.e. different bin sizes, although the resolution of Repli-seq is limited by labeling time and/or sequencing depth), so the temporal order of replication is readily obtained for any local region. From the replication timing curves, we define replication timing peaks, called T-peaks (Fig. 1.8), as the positions that replicate earlier than their neighboring regions; these should contain replication initiation zones.
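As an illustration of this definition, the sketch below calls T-peaks as local maxima of a binned timing curve in which higher values mean earlier replication (as in Fig. 1.8). The function name, the handling of plateaus, and the absence of any smoothing or minimum-height filter are assumptions of this sketch rather than the procedure actually used.

```python
def find_t_peaks(timing, bin_size=50_000):
    """Return genomic centers (bp) of local maxima of a timing curve.

    timing: list of timing values in consecutive non-overlapping bins,
            higher value = earlier replication.
    A plateau of equal values counts as one peak, placed at its middle.
    Bins at the two ends of the curve are never reported because they lack
    a neighbor on one side.
    """
    peaks = []
    n = len(timing)
    i = 1
    while i < n - 1:
        if timing[i] > timing[i - 1]:
            # Extend over a possible plateau of identical values
            j = i
            while j + 1 < n and timing[j + 1] == timing[i]:
                j += 1
            if j + 1 < n and timing[j + 1] < timing[i]:
                center_bin = (i + j) / 2
                peaks.append(int((center_bin + 0.5) * bin_size))
            i = j + 1
        else:
            i += 1
    return peaks

# Hypothetical timing curve in 50 kb bins
curve = [0.1, 0.5, 0.2, 0.2, 0.8, 0.8, 0.3]
t_peaks = find_t_peaks(curve)
```

On real data the curve would normally be smoothed first, since a strict neighbor comparison reports every small fluctuation as a peak.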
Replication regulation in timing and origin location
The genetic and epigenetic modifications around origins
The DNA replication process is organized into licensing, Pre-IC formation, and firing steps (Fig. 1.3). Replication must occur in accessible regions where the DNA can be unwound into single strands. Early-firing origins are therefore frequently located in open chromatin regions and are strongly associated with marks of open chromatin, such as H3K4me1, H3K79me2, and phosphorylated RNA Pol II (pho-RNA Pol2). DNase I digestion is a commonly used method to map open chromatin regions. DNA replication also requires various proteins that help fire the replication origins. For example, H2A.Z was recently identified as a factor that facilitates licensing and activation of early replication origins (Long et al., 2020). Some studies have also reported that replication origins may correlate with transcription start sites (TSSs), CpG islands (CGIs), and G-quadruplex (G4) sequence motifs (Karnani et al., 2009; Masai et al., 2010; Mukhopadhyay et al., 2014).
Stochastic model of initiation-timing regulation
Several models currently attempt to explain whether and when a given origin fires, and they are still under debate. The main ones are the domino-like model (Sporbert et al., 2002), the deterministic model (Lebofsky et al., 2006), and the stochastic model (Rhind, 2006).
The domino-like model suggests that origin firing is triggered by the replication of adjacent regions in a next-in-line mode. Guilbaud et al. (2011) reported chromosomal regions in HeLa cells with sequentially activated origins that are neither clearly early nor clearly late replicating. In addition, through chromatin folding, this neighbor effect could extend in space, amplifying 1D replication clusters along the DNA sequence into 3D replication clusters in heterochromatin during late S phase (Löb et al., 2016).
Deterministic models, in contrast, suppose that different initiation sites are programmed to initiate at different, well-defined times. The stochastic model posits that different initiation sites have different initiation probabilities but can fire at any time during S phase (Rhind et al., 2010). Some papers also propose a combination of stochastic and deterministic models (Labit et al., 2008). Whether metazoan initiation timing is stochastic or deterministic, or some combination of the two, is still very much an open question (Bechhoefer and Rhind, 2012).
The current technologies used for origin identification by bulk data
SNS-seq
Two essential intermediates produced during DNA replication are short nascent strand (SNS) DNA and Okazaki fragments. Some studies aim to trace back the fired origins by locating the SNS. Both short nascent strands and Okazaki fragments carry an RNA primer at their 5′ end (Fig. 1.2), which protects them from 5′-to-3′ exonuclease hydrolysis. Therefore, DNA is extracted from asynchronous cells, which include S-phase cells, and purified by λ-exonuclease digestion to remove the contaminating short single strands (SSS) produced by DNA shearing. In the end, only SNS and Okazaki fragments remain. Short nascent strands are on average around 1 to 1.5 kb long, whereas Okazaki fragments are 150 to 200 bases long, so the two are easily separated by size selection, and the short nascent strands can then be sequenced by next-generation sequencing and mapped to the reference genome. The piled-up signal peaks of the isolated short nascent strands in population-based data give the replication origin positions (Fig. 1.9). This method is called SNS-seq.
SNS-seq has so far been applied to mouse, Drosophila (Lombraña et al., 2016), and various human cell lines, such as HeLa, IMR90, iPS, and H9 (Picard et al., 2014). More than 200,000 potential initiation sites have been found in the human genome by SNS-seq (Besnard et al., 2012). These origins are site-specific, with an average length of 760 bp, and cover 6% of the human genome (Picard et al., 2014). Some studies also show that active origin sites often correlate with transcription start sites and are located in GC-rich regions, near CpG islands and G-quadruplex (G4) sequence motifs (Besnard et al., 2012; Langley et al., 2016).
Figure 1.9 Illustration of the SNS-seq principle (Francesco De Carli, 2017). The upper left part shows the extension of a replication bubble; asynchronous data contain replication bubbles of different lengths. The long black arrows represent nascent strands, and the small red arrows on the lagging strand represent Okazaki fragments. The bottom left part shows the short nascent strands isolated from replication bubbles of different sizes, with selected lengths of 1 to 1.5 kb. The green dots are the RNA primers, which protect only the SNS and the Okazaki fragments from exonuclease digestion. After λ-exonuclease purification, the positions enriched in SNS (>1 kb) reflect the replication origin positions. The right side shows the enriched SNS signal peak around a replication origin site.
Building on SNS-seq, Mukhopadhyay and colleagues developed BrIP-SNS-seq (Mukhopadhyay et al., 2014), a variant of the technique in which BrdU-labeled DNA is sorted at increasing time points during S phase. The relative amount of SNS at each position in the different fractions also allows the genome-wide replication timing profile to be computed. In addition, the introduction of BrdU further reduces the noise from incompletely digested short single strands. However, some studies point out that SNS-seq results may contain many false positives because exonuclease activity is difficult to control. A comparable number of ORI (origin of replication initiation) peaks was obtained with or without λ-exo treatment (Valenzuela et al., 2011). It was also proposed that the strength of ORIs might have been overestimated about tenfold, given that an accumulation of small (~200 bp) duplex DNA molecules, proposed to represent abortive initiation intermediates, was detected in total genomic DNA (Gómez and Antequera, 2008). Moreover, only 56.5% of the BrdU-SNS-seq peaks agreed with 50.2% of the SNS-seq peaks (Picard et al., 2014). This poor consistency raised a debate on whether real SNS can be detected after λ-exo digestion, considering that the vast majority of short single-strand (SSS) DNA remains mixed in.
Another controversial topic is the relationship between replication origins and G4. Some studies report that the origins detected by SNS-seq are associated with G-quadruplex consensus motifs (Besnard et al., 2012) and suggest G4 as a potential regulator of origin function. Other studies propose that this association may result from fork stalling. Comoglio and colleagues found that origin-proximal G-quadruplexes tend to stall replication forks transiently in vivo (Comoglio et al., 2015). The short nascent strands behind the stalled forks then also end around the G4, so that the stalling positions are mistaken for origins located in G4 regions. If this is the case, the final SNS-seq result will include a share of such false-positive signals.
Bubble track
Bubble track uses the special structure of replication origins to find their positions. When replication fires at an origin, a bubble structure forms with two divergent replication forks. Because the read length of next-generation sequencing is only 100 to 500 bp, longer DNA molecules have to be broken into smaller fragments to be sequenced, and the complete sequence is then assembled from the overlapping parts of the reads. The shotgun method (Weber and Myers, 1997) is the technique of cutting DNA randomly into small segments. The traditional shotgun approach uses restriction endonucleases to cut the target DNA at their recognition sites, forming short fragments whose lengths follow a normal distribution within a certain range. Another common approach uses ultrasonic fragmentation to break large DNA molecules into fragments of about 350 bp. In bubble-seq, the ~2 kb DNA fragments obtained after shotgun treatment with a restriction endonuclease can take three different shapes: linear DNA, Y-shaped replication forks, and O-shaped replication bubbles. These three kinds of fragments migrate at different speeds in agarose gel electrophoresis. At the molecular level, agarose has many dendritic structures. Linear DNA passes through the polymerized agarose fibers relatively easily, followed by Y-shaped replication forks, whereas replication bubbles can get hung on the dendritic structures and are the slowest. In this way, the replication bubbles can be isolated and the origin positions detected by next-generation sequencing. The bubble traps were validated by 2D gel electrophoresis, which confirmed a purity of >80% replication bubbles (Mesner et al., 2011).
Bubble track (or Bubble-seq) has identified more than 50,000 initiation zones, formed by clustering the detected origins, along the genome of the human GM06990 cell line (Mesner et al., 2013). It has also been applied to the HeLa cell line to generate a library within the ENCODE pilot regions, which cover ~1% of the human genome (Mesner et al., 2011). In the GM06990 cell line, the average and median zone sizes are 20 kb and 16 kb, respectively. Around 32% of the initiation zones are early-firing, and these have the highest origin density. At the 1 Mb scale, late initiation zones have a 17% lower origin density than early-firing ones, and mid-S-phase initiation zones have the lowest origin density. However, only 45-46% of the SNS-seq peaks overlapped with 36-37% of the bubble-seq zones (Picard et al., 2014). This may be because bubble traps identify large initiation zones that vary between cell lines, while SNS-seq calls sharp initiation peaks that are more conserved between cell lines. In addition, bubble track is limited by relatively large Y-shaped fragments, which also migrate slowly and can produce false-positive noise.
MCM / ORC ChIP-seq
As shown in Figure 1.3, origin licensing is the first step of DNA replication initiation. The origin recognition complex (ORC) is an essential component of the Pre-RC and is required to complete the replication initiation process. However, even once an origin has been licensed, it may remain silent and not proceed to Pre-IC formation and origin firing. The genomic regions bound by ORC are therefore not necessarily the sites that finally fire. Nevertheless, ORC binding was proposed very early as a way to find all potential origin sites, which gave rise to ORC ChIP-seq (chromatin immunoprecipitation coupled with sequencing). This method captures the target protein ORC by immunoprecipitation and then detects all licensed positions bound by ORC by next-generation sequencing.
In MCM or ORC ChIP-seq, to reduce technical noise, researchers often perform several experiments, i.e. several biological replicates, in unsynchronized cells to obtain the sequences bound by MCM or ORC. Regions that overlap between replicates, or lie within 0.5 kb of each other, and that appear in all or most replicates are then selected as potential replication origins (Fig. 1.10). SNS-seq and DNase-seq data are then used as supporting evidence to classify the potential origins as firing or dormant: origins that lie close to (within 0.5 kb of) or overlap SNS-seq regions and show DNase-seq signal peaks are considered firing origins, and the remaining regions are more likely dormant origins. Around 200,000 MCM7 peaks were found across several replicates in the HeLa cell line, of which 78,257 sites are associated with SNS-seq origins (Sugimoto et al., 2018). A recent study in human cells found that the distribution of ORC and MCM depends on transcription: both are depleted from transcribed gene bodies but enriched at TSSs (transcription start sites). The genomic distribution of ORC/MCM correlates clearly with replication timing but is not related to initiation zones (Kirstein et al., 2021).
Figure 1.10 Genome-wide identification of firing and dormant origin sites (Masatoshi Fujita et al., 2018). The first track is DNase-seq, i.e. DNA footprints from DNase I digestion. The second and third tracks (MCM7_1st and MCM7_2nd) are two replicates of the MCM7 immunoprecipitation, and sMCM7 shows the potential origins retained after comparing MCM7_1st with MCM7_2nd and keeping the overlaps. The fifth track shows the reference SNS-seq regions. The last two tracks are the origins classified as firing or dormant according to whether they overlap DNase I signal peaks, marked by green columns, or SNS-seq regions.
MCM/ORC ChIP-seq has so far been applied to yeast, fruit flies, and human cells (Dellino et al., 2013; MacAlpine et al., 2010; Miotto et al., 2016; Xu et al., 2006). In S. cerevisiae, Autonomously Replicating Sequences (ARSs) contain a consensus sequence (ACS) that is bound by ORC and has been found to be essential for origin function. In addition, there is near-perfect concordance between ORC and Mcm2-7 binding peaks. In Drosophila, however, a vast excess of Mcm2-7 relative to ORC is assembled onto chromatin when cyclin E/CDK2 activity rises in late G1 (Nina Kirstein et al., 2021). These excess Mcm2-7 complexes show little co-localization with ORC or replication foci (Sara K. Powell et al., 2015). In humans, the Epstein–Barr virus (EBV), whose replication in latency depends entirely on the human licensing machinery, was used to compare ORC and MCM binding with replication initiation sites. This showed that a five- to tenfold excess of potential origins is licensed per genome relative to the 1–3 mapped initiation events, which means that human replication initiates in zones comprising multiple, individually inefficient sites (Kirstein et al., 2021). Moreover, ORC has many functions other than DNA replication, and some ORC proteins also act as transcription factors (Chesnokov, 2007). The origins detected by ORC ChIP may therefore correspond to functional genomic regions other than replication origins. For example, ORC ChIP-seq identified 13,600 ORC1 binding sites in human HeLa cells, which do not reveal any sequence consensus (Dellino et al., 2013). Only 11-30% of these peaks overlapped SNS-seq peaks, and 47% overlapped bubbles. All of these biological factors make the results of ORC ChIP-seq highly controversial.
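The replicate filtering and firing/dormant classification described above can be sketched in Python as follows. The sketch assumes two replicates on a single chromosome, peaks given as (start, end) tuples in bp, and the reading that a firing origin must both lie within 0.5 kb of an SNS-seq region and overlap a DNase-seq peak; the function names are hypothetical and the brute-force comparison is only suitable for small examples.

```python
def near(a, b, max_gap=500):
    """True if intervals a and b overlap or lie within max_gap bp."""
    return a[0] <= b[1] + max_gap and b[0] <= a[1] + max_gap

def classify_origins(rep1, rep2, sns, dnase, max_gap=500):
    """Keep replicate-1 peaks confirmed by replicate 2, then label them.

    Returns a list of (peak, label) with label "firing" or "dormant".
    The AND between SNS-seq proximity and DNase-seq overlap is an
    assumption of this sketch.
    """
    results = []
    for peak in rep1:
        if not any(near(peak, other, max_gap) for other in rep2):
            continue  # not reproducible, so discarded as technical noise
        near_sns = any(near(peak, s, max_gap) for s in sns)
        open_chromatin = any(near(peak, d, 0) for d in dnase)
        label = "firing" if near_sns and open_chromatin else "dormant"
        results.append((peak, label))
    return results

# Hypothetical peak lists
mcm7_1st = [(1000, 1500), (5000, 5400), (9000, 9300)]
mcm7_2nd = [(1800, 2100), (5100, 5500)]
sns_regions = [(1200, 1900)]
dnase_peaks = [(900, 1600)]
origins = classify_origins(mcm7_1st, mcm7_2nd, sns_regions, dnase_peaks)
```

For genome-scale peak sets, sorted interval lists or a dedicated interval library would replace the nested loops.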
EdU-seq-HU
The incorporation of halogenated nucleotides is a conventional way to monitor the regions undergoing replication during S phase. BrdU, which is frequently used in replication studies, can be detected by an anti-BrdU antibody only after the DNA becomes single-stranded due to resection (Mukherjee et al., 2015). EdU (5-ethynyl-2′-deoxyuridine) is another thymidine analog with some technical advantages over BrdU, since EdU is conjugated to a fluorescent azide by a Cu(I)-catalyzed reaction and can be detected in double-stranded DNA (Hua and Kearsey, 2011). Unfortunately, EdU is toxic to cells and activates the rad3-dependent checkpoint, which likely blocks mitosis. Toxic effects of EdU on mammalian cells have also been reported, suggesting that EdU may not be suitable for continuous labeling studies (Hua and Kearsey, 2011). As a result, confirming the positions marked by EdU over more than one cell cycle often takes several attempts (Diermeier-Daucher et al., 2009; Hua and Kearsey, 2011). In addition, HU (hydroxyurea) can arrest fork progression after origin firing. HU treatment of cells synchronized at S phase entry therefore enriches the EdU signal around the replication origins. Even with the limited DNA synthesis that occurs in hydroxyurea-treated cells, EdU incorporation is easily detected by fluorescence microscopy. The EdU-seq-HU protocol was developed on this basis to locate early-replicating origins (Macheret and Halazonetis, 2019). The drawback is that the arrest leaves the cell cycle incomplete, so the method detects only the origins fired at the beginning of S phase and cannot identify origins fired at other periods, such as mid or late S phase.
Ini-seq
Figure 1.11 Schematic of the ini-seq experimental protocol (Langley et al., 2016). Green lines represent nascent replicated DNA labeled with digoxigenin-dUTP, and black lines represent unreplicated double-stranded DNA.
Another approach for finding replication origins is ini-seq (Fig. 1.11). This method is based on an in vitro system and has been applied to the human EJ30 and HeLa cell lines. First, nuclei are extracted from human cells synchronized in G1 phase. Adding an extract of proliferating cells then starts replication, and the newly synthesized DNA is labeled with digoxigenin-dUTP. Immunoprecipitation is then used to pull down the labeled, newly synthesized DNA. Sequencing, mapping, and peak calling identify specific replication origin sites along the human genome. The median length of the origin sites is 1,184 bp.
OK-seq
Okazaki fragments are generated on the lagging strand of the two diverging replication forks, on either side of an origin (Fig. 1.12). In other words, if we can detect the distribution of piled-up Okazaki fragments, we can detect the replication origins. OK-seq introduces a new concept called Replication Fork Directionality (RFD), which reflects the proportions of rightward- (R) and leftward- (L) moving forks at each position along the genome. It is calculated with the formula shown in Fig. 1.12: RFD = (C-W)/(C+W), where C and W are the numbers of detected Okazaki fragments mapped on the Crick and Watson strands, respectively. Since the RFD value can be calculated at each position, RFD curves can be drawn along the entire genome, and origin regions can be identified as the upward shifts of the RFD curve, as indicated in Figures 1.12 and 1.13.
Figure 1.12 OK-seq principle (Petryk et al., 2016). The left panel shows the principle of OK-seq and the formula used to calculate Replication Fork Directionality (RFD) from OK-seq data. Depending on the strand they come from, all Okazaki fragments can be assigned to leftward- or rightward-moving replication forks. The blue Okazaki fragments on the lagging strand of leftward-moving forks are named Okazaki Watson, and the red Okazaki fragments on rightward-moving forks are named Okazaki Crick. After mapping the bulk Okazaki fragments to the whole genome, we can count the numbers of Okazaki Crick (C) and Okazaki Watson (W) fragments at each position and calculate RFD = (C-W)/(C+W) along the entire genome. In the ideal case where all cells fire the same origin position, the RFD curve is a vertical ascending step from RFD = -1 on the left side of the origin to RFD = 1 on the right side. When different cells select different origins within an initiation zone, the RFD curve around the origins normally takes the shape of an ascending slope. Similarly, the RFD curve shows a descending step at a fixed termination site and a descending slope across a termination zone.
Figure 1.13 OK-seq replication origins called from the RFD curve. Each dot is the RFD calculated within a 1 kb non-overlapping bin. The vertical blue lines at the two ends of each origin delimit the upward shift from negative (blue) to positive (red) RFD values.
The experimental protocol based on this principle can be briefly summarized in 5 steps:
(1) EdU/EdC labeling (1 to 2 min) to mark the newly synthesized DNA, including Okazaki fragments.
(2) Genomic DNA extraction with standard proteinase treatment.
(3) Okazaki fragment isolation and biotinylation: centrifugation isolates the <200 nt single-stranded DNA, and biotin-TEG-azide is then used to biotinylate the EdU-labeled Okazaki fragments.
(4) Ligation of two pairs of adaptors to the purified Okazaki fragments, which permit mutual authentication; all Okazaki fragments are captured with 200 mg of Dynabeads MyOne Streptavidin T1 according to the manufacturer’s protocol.
(5) Classical next-generation sequencing protocol, including library amplification by PCR, sequencing, and alignment to the genome.
OK-seq has been applied successively to yeast and humans through continuous optimization. The first OK-seq, with a different experimental design, was performed in a yeast ligase mutant (Duncan J. Smith et al., 2012). It was then carried out successfully in wild-type human cells by sequencing highly purified short (<200 nt) EdU- (or EdC-) labeled single-stranded DNA, which is highly enriched in Okazaki fragments (Petryk et al., 2016). Using OK-seq, Petryk and colleagues showed that replication initiates stochastically in human cells, primarily within broad (up to 150 kb), non-transcribed initiation zones that often abut transcribed genes, and terminates dispersively between them.
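The RFD formula can be applied bin by bin as in the following Python sketch. The per-bin Crick and Watson counts are assumed to be available already; the function names and the simple rule used to flag upward shifts (a negative bin followed by a positive one) are illustrative simplifications and do not reproduce the segmentation used in the cited work.

```python
def rfd_profile(crick_counts, watson_counts):
    """RFD = (C - W) / (C + W) in each bin; None where no fragment maps."""
    profile = []
    for c, w in zip(crick_counts, watson_counts):
        total = c + w
        profile.append((c - w) / total if total > 0 else None)
    return profile

def upward_shifts(profile):
    """Indices i where RFD goes from negative in bin i to positive in bin i+1.

    Such negative-to-positive transitions are the signature expected at an
    origin or initiation zone.
    """
    hits = []
    for i in range(len(profile) - 1):
        left, right = profile[i], profile[i + 1]
        if left is not None and right is not None and left < 0 < right:
            hits.append(i)
    return hits

# Hypothetical Okazaki fragment counts in four 1 kb bins around an origin
crick = [1, 2, 8, 9]
watson = [9, 8, 2, 1]
rfd = rfd_profile(crick, watson)
shifts = upward_shifts(rfd)
```

Here the profile rises from -0.8 to 0.8, and the single upward shift falls between the second and third bins.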
The current single-molecule technologies used for origin identification
All the methods discussed so far detect origins or potential origins from population-based data, and the agreement among the various genome-wide studies is low. Leaving the mechanism aside, the major debate is whether replication origins are located at specific sites or occur stochastically within broad initiation zones. Most methods also have technical or biological limitations of their own, so different population-based methods may identify different “types” of origins. In any case, the main reason for the conflicting results is the heterogeneity between cells in the choice of replication initiation sites. At present, the best way to address this problem is to detect replication origins at the single-molecule level. Several commonly used single-molecule detection methods are introduced below.
DNA combing
Figure 1.14 Schematic of the DNA combing experimental process (from the introduction by the Genomic Vision company). Two different fluorescent dyes label the newly synthesized DNA sequentially. The green lines represent IdU detected with a green fluorescent dye, and the red lines represent CldU detected with a red fluorescent dye.
DNA combing, invented by Bensimon (Allemand et al., 1997), was the first single-molecule method applied to the detection of replication sites. It marks newly synthesized DNA sequentially with two thymidine analogs, such as IdU and CldU (Fig. 1.14). The DNA is then extracted to isolate the intertwined molecules and brought onto the surface of a glass carrier treated with vinyl silane. At a specific pH, the ends of the DNA molecules expose polar groups that can bind to the ionizable groups coating the hydrophobic surface. The mid-segments are more weakly charged than the DNA ends, so only the ends bind to the silanized coverslip. The labeled DNA molecules are then stretched into a linear form by capillary force between two glass coverslips. Moreover, performing FISH (fluorescence in situ hybridization) on the combed molecules allows their genomic identification (Tuduri et al., 2010), although this is technically challenging.
By analyzing the alternation of fluorescence colors under a high-resolution microscope, the location of replication origins can be clearly determined on individual DNA fibers (Fig. 1.14). Compared with the methods introduced above, the biggest advantage of DNA combing is that it is a single-molecule technology. The other methods rely on peak calling from bulk data, which excludes locations with weak signals, so the origins obtained by population methods are often enriched in early replication initiation sites shared by many cells. Initiation sites that fire only in late S phase, or origins that fire only when replication forks stall, can be detected only by methods working at the single-molecule (SM) level, of which DNA combing is one. All detected origin locations are real origins. Nevertheless, the method is limited by its very low throughput and its lack of sequence-level resolution, which prevents DNA combing from being applied to genome-wide origin detection.
Nanopore sequencing
Nanopore sequencing is a technology that can detect the sequence by different resistance of bases when the DNA sequence passes through the magnetic beads with electrodes. The magnetic beads continue to discharge, and as the sequence continues to enter the magnetic beads, the base sequence passing through the magnetic beads is continuously replaced. The base sequence of different resistances will cause the current signal to change, so as to distinguish the four bases of A, T, G, C. Similarly, based on the current signal difference, in vitro, it is able to distinguish newly replicated DNA marked by BrdU (5-bromo2-deoxyuridine) with normal base dTTP in unreplicated regions like Fig. 1.15 (Müller et al., 2019). There is a nanopore electric signal feature to detect the BrdU by machine learning. In vivo, it will be barely qualified for thymidine detection from BrdU by model optimization, but there will be false positive BrdU phenomenon occurred in the BrdU-enriched sample reads (Hennion et al., 2018).
A nanopore is indeed a high-throughput detection method compared to DNA combing at the single molecular level. Moreover, compared with the non-single-molecule method, because DNA fiber does not need to be cut by the shotgun method, the ultra-long sequence is directly analyzed. Now, nanopore sequencing is successfully applied to the yeast at near-nucleotide resolution (~200 nt) and found 58,651 replication tracks (Hennion et al., 2020). The sequencing read length is tens of kb or even 100 kb (with few molecules), in yeast, the read length is between 10~140 kb for the BrdU labeling sequence and similar for normal dTTP sequence (Hennion et al., 2020). Although the matching accuracy is far beyond the next generation sequencing technology, it can be used for matching with multi repetitive sequences region with the long reads, this is not possible with second-generation sequencing. However, the high sequencing cost limits the sequencing coverage, therefore, hard to apply to study DNA replication of human cells genome-wide.
Figure 1.15: Schematic of nanopore technology (Müller et al., 2019). (a) The technical principle of nanopore sequencing. DNA passes through the nanopore one base at a time, and at each step the resistance of the DNA segment inside the pore is calculated from the applied voltage and the measured current. The sequence is determined from the different resistance values of the four bases A, T, G and C. (b) Schematic of the origin detection principle. Like several methods introduced earlier, nanopore sequencing uses BrdU or another thymidine analog to mark regions undergoing replication (red line in panel b); the blue line is normal DNA sequence. (c) Pipeline for origin detection by nanopore sequencing. (d) The current signal distribution of thymidine (blue) compared with the distributions of four different thymidine analogs.
Table of contents:
1.1 DNA REPLICATION MECHANISM AND THE CORRESPONDING KNOWLEDGE
1.1.1 CELL CYCLE
1.1.2 REPLICATION ORIGINS
1.1.3 REPLICATION UNIT
1.1.4 THE COMPLETE BIOLOGICAL REPLICATION INITIATION PROCESS
1.1.5 REPLICATION TIMING
1.2 REPLICATION REGULATION IN TIMING AND ORIGIN LOCATION
1.2.1 THE GENETIC AND EPIGENETIC MODIFICATIONS AROUND ORIGINS
1.2.2 STOCHASTIC MODEL OF INITIATION-TIMING REGULATION
1.3 THE CURRENT TECHNOLOGIES USED FOR ORIGIN IDENTIFICATION BY BULK DATA
1.3.1 SNS-SEQ
1.3.2 BUBBLE TRACK
1.3.3 MCM / ORC CHIP-SEQ
1.3.4 EDU-SEQ-HU
1.3.5 INI-SEQ
1.3.6 OK-SEQ
1.4 THE CURRENT SINGLE-MOLECULE TECHNOLOGIES USED FOR ORIGIN IDENTIFICATION
1.4.1 DNA COMBING
1.4.2 NANOPORE SEQUENCING
1.5 A NOVEL METHOD: ORM (OPTICAL REPLICATION MAPPING)
1.5.1 BIONANO HIGH-THROUGHPUT DNA FIBER MAPPING
MATERIAL AND METHODS, AND BASIC ORM SIGNAL ANALYSES
2.1 CELL LINES
2.1.1 CELL SYNCHRONIZATION
2.1.2 CELL LABELING
2.2 OPTICAL REPLICATION MAPPING
2.3 DATA FORMAT OF BIONANO
2.3.1 BNX
2.3.2 RCMAP AND QCMAP
2.3.3 XMAP
2.4 THE CALCULATION OF GENOMIC POSITIONS FOR THE RED SIGNALS
2.5 DATA INTEGRATION BY JAR PACKAGES AND OUTPUT FORMAT
2.5.1 ALLRAWDATAREFINING.JAR AND ITS OUTPUT FORMAT
2.5.2 GENERATEGTF_BYALLDATAREFINING_REFORMAT.JAR AND ITS OUTPUT FORMAT
2.6 HOT SPOTS FILTERING
2.6.1 HOT SPOTS
2.7 SEGMENTATION FOR ORM LABELING SIGNALS
2.8 THE RELIABILITY TEST FOR ORM SEGMENTATION
2.8.1 TRACK THE TRAJECTORY OF SEPARATED REPLICATION FORKS
2.8.2 THE UNEXPECTED LENGTH DISTRIBUTION IN ALL DATASETS
2.8.3 TWO HYPOTHESES FOR EXPLAINING THE UNEXPECTED LENGTH DISTRIBUTION
2.8.4 VERIFICATION OF POTENTIAL MODEL
2.8.5 REGAINING THE NEGLECTED SIGNALS
2.8.6 THE EXPLANATION FOR SPARSE LABELING
REPLICATION INITIAL ZONE CALLING
3.1 CALCULATION OF NORMALIZED ORM SIGNAL DENSITY
3.2 NORMALIZED SIGNAL DENSITY SMOOTHING
3.3 PEAK AREA RECOGNITION
3.4 CORE REGION REFINING
3.4.1 THE AGGREGATED DENSITY PERCENTAGE
3.4.2 ESTIMATE PROPER SIGNAL PERCENTAGE CUTOFF TO CALL CORE REGIONS OF INITIATION ZONES
3.5 FILTERING AND INITIAL ZONE CALLING
3.5.1 OVERLAPPED REPLICATES NUMBER FILTERING
3.5.2 THE OTHER STANDARD TO ESTIMATE THE QUALITY OF CORE REGION
3.5.3 K-MEANS CLUSTERING FOR IZ LENGTH ADJUSTMENT
FORK DIRECTIONALITY ANALYSIS
4.1 FDI: FORK DIRECTION INDEX
4.2 THE TRIALS FOR IDENTIFICATION OF FORK DIRECTION OF INDIVIDUAL TRACKS
4.2.1 THE MACHINE LEARNING CLASSIFIER
4.2.2 FAILED ATTEMPT TO INTRODUCE THE SECOND LABELING SIGNAL
4.3 GENOME-WIDE REPLICATION KINETICS IN ASYNCHRONOUS CELLS
DEEPER DERIVATIVE DATA MINING FOR ORM IZS
5.1 STOCHASTIC MODEL
5.1.1 EARLY INITIATION EVENTS IN LATE-REPLICATING DOMAINS
5.1.2 LATE-REPLICATING SIGNALS ARE NOT NOISE DATA
5.1.3 FIRING EFFICIENCY IS CORRELATED WITH REPLICATION TIMING
5.1.4 NO SPECIFIC INITIATION SITES
5.1.5 COMPUTATIONAL SIMULATION CONFIRMS THE STOCHASTIC MODEL
5.2 COMPARISON BETWEEN REPLICATION ORIGINS MAPPED BY DIFFERENT APPROACHES
5.2.1 MUTUAL AUTHENTICATION
5.2.2 DIFFERENT FIRE EFFICIENCY AND REPLICATION TIMING COMPARISON
5.3 THE EPIGENETIC MODIFICATION MARKS AROUND INITIATION ZONES
5.3.1 THE EPIGENETIC MODIFICATION MARKS ENRICHED AT ORM INITIAL ZONES
CONCLUSION AND PERSPECTIVES
6.1 MAIN CONCLUSION
6.1.1 ORM – A FUTURE TREND IN INITIATION DETECTION: SINGLE-MOLECULE, CHEAP AND HIGH-THROUGHPUT
6.1.2 DIRECT FIRE EFFICIENCY DETECTION REVEALS THAT INITIATIONS ARE NOT CLUSTERED
6.1.3 ORM DATA SUPPORT A STOCHASTIC MODEL IN REPLICATION TIMING REGULATION