Current view of gene regulatory logic
The DNA regulatory modules of transcription
The DNA sequence is a stretch of four nucleotide types, precisely ordered to form genes and regulatory regions [3, 6]. Regulatory regions encode sequences targeted by transcription factors (TFs). The binding of one or multiple transcription factors to a specific regulatory region controls the transcription of the surrounding gene(s) by promoting or repressing the recruitment of the polymerase [3, 23]. The regulatory region and the targeted genes are chiefly in close or direct proximity, on the same DNA strand. Consequently, these TF-mediated interactions are considered cis-driven, and the regulatory regions involved are defined as cis-regulatory modules (CRMs). In general, two main types of CRMs are distinguished [24, 25] (Fig. 1.3). On the one hand, promoters are characterised by their capacity to recruit the polymerase and trigger gene transcription; they are located next to the gene transcription start site (TSS). On the other hand, enhancers tend to be located further away from the TSS; they have the capacity to recruit TFs. However, evidence is accumulating on CRMs showing both promoter and enhancer characteristics [24]. It is therefore not clear whether such a classification is biologically relevant. Instead, recent studies suggest that CRMs rather range along a continuum between pure promoters and pure enhancers [26].
The enhancer-promoter regulatory dialog
In order to trigger gene transcription, enhancers physically interact with promoters with the help of other proteins, forming the so-called mediator complex (Fig. 1.3). The detailed mechanisms underlying this looping process bringing promoter and enhancer together (P-E interaction) remain to be deciphered [23, 29].
With TFs and polymerase binding, we can see that gene regulation does not solely involve the cis-regulation of CRMs, but also requires the action of trans-acting molecules (Fig. 1.3). Consequently, the regulation of gene expression does not stem from the DNA sequence alone; it is rather driven by a complex 3D structure of molecules bound together [29].
The shape of the DNA molecule is controlled by multiple factors. Firstly, the nucleotide composition of DNA itself affects the helix groove [30]. Secondly, in the nucleus, the DNA molecule is densely packed by nucleosomes, forming the chromatin [31]. Each nucleosome is formed by an octamer of four histone pairs. Nucleosomes function like molecular spools for the DNA strand, providing tight compaction. This compaction capacity is critical for the cell cycle, as it permits the formation of chromosomes during cell division. This conformation is also necessary to control CRM activity. Indeed, local chromatin compaction on an enhancer can prevent a TF from binding if the target site is not accessible [23, 31]. This additional layer of regulation formed by the molecules around the DNA represents the epigenomic landscape [16].
The regulation of chromatin compaction involves histone tail post-translational modifications. As the N-terminal tails of histones protrude from the nucleosome, some of their amino acids can be modified by biochemical reactions, such as methylation and acetylation [27, 32].
Some of these modifications have been shown to co-vary with transcriptional regulation and chromatin compaction changes [27, 33, 34] (Table 1.1). For example, the acetylation of the 27th lysine residue of histone H3 (H3K27ac) is associated with active promoter or enhancer activity. It has therefore been suggested that histone modifications define a code for transcription regulation.
A more global regulation of gene expression is imposed by higher-scale 3D organisation. In particular, topologically associating domains (TADs) are formed by clusters of long-range contacts between regions from the same DNA molecule [29, 35]; their boundaries are defined by insulators, characterised by the binding of the protein CTCF. These regions of higher physical interaction are known to favour transcription regulation by increasing the chance of promoter-enhancer contact within a TAD. In contrast, the presence of an insulator tends to limit interactions between the flanking regions [36].
In summary, the spatial-temporal tuning of transcription involves a complex interplay of different actors regulating the accessibility of regulatory regions.
The regulation of enhancer activity in space and time
During development, the activation of specific genes must be perfectly controlled in space and time to properly pattern the embryo [23]. Thus, it is crucial to tightly regulate the activation and repression of enhancers and promoters. In this respect, both long-term and short-term actions take place.
Firstly, DNA compaction can be adapted by changing nucleosome positioning on the DNA [34]. Specific enzymes can read, erase or write the histone modification code to flag a region as a target for chromatin remodelling [33, 34]. For example, methyltransferases and demethylases can respectively add and remove methylation on the histone tails. Following such histone post-translational modifications, nucleosomes can relocate and modify the accessibility of surrounding CRMs. Secondly, pioneer transcription factors have the ability to bind closed chromatin regions and promote local nucleosome release [37].
These mechanisms of chromatin remodelling take time, as they require nucleosome re-positioning. Transcription factor binding further refines the spatial-temporal resolution of transcriptional regulation. Indeed, the diffusion of TFs and their recruitment at enhancer regions are comparatively fast. They can rapidly adapt the gene expression level, while keeping a high signal sensitivity and specificity [23]. The specificity of the transcription factor signal can notably be achieved by cooperative binding, where multiple TFs need to join forces to trigger gene activation [23].
Within an embryo, the regulation of the activity of transcription factors is therefore the key mechanism for rapid regulatory changes. Additionally, slower changes in chromatin accessibility can help to maintain transcriptional control at broader scales [38].
Enhancer activity screening and annotation
Following decades of work on model organisms, extensive annotation resources on genes and regulatory regions are now available, together with the characterisation of the corresponding spatial-temporal activity patterns.
These patterns have been characterised in multiple ways. Notably, reporter assays have been widely used to study the CRMs involved in embryogenesis [25]. In this type of experiment, the DNA sequence of a candidate cis-regulatory region is combined with a minimal promoter and a reporter gene (e.g. luciferase). In 2014, the Stark laboratory used systematic reporter assays to characterise 7,793 cis-regulatory regions and their respective transcriptional activity throughout the space and developmental time of Drosophila embryogenesis [39]. The Furlong laboratory also contributed to these annotation efforts with 525 manually curated regions documented in a CRM Activity Database (CAD) [40].
Another method to study the space-time dynamics of gene expression is the use of in-situ hybridisation [41, 42]. This type of experiment consists in designing a DNA or RNA probe complementary to the sequence of a target gene. By adding a label to the probe (enzymes, antibody, fluorescent label) and injecting this construct into a fixed embryo, it becomes possible to visualise the corresponding gene expression pattern. The Davidson laboratory combined this approach with gene perturbations to infer the gene regulatory network governing the endomesoderm specification of the sea urchin Strongylocentrotus purpuratus [43]. Similarly, the Lepage laboratory produced a considerable amount of in-situ experiments to characterise the regulatory network of ectoderm specification in another sea urchin, Paracentrotus lividus [41].
To get an idea of the considerable work done by the Drosophila and sea urchin communities, one can look at the available wealth of curated data.
In the REDfly database [44], 24,415 Drosophila CRMs are curated from 1,058 publications. Importantly, with more than 60% of its protein-coding genes having one or more homologs in human [6], Drosophila knowledge is a valuable resource for studying gene regulation in both fundamental and biomedical research. For example, the FlyBase Human Disease Model index links 1,451 Drosophila disease models to specific human diseases.
In the Echinobase database [45], the gene regulatory network of the endomesoderm, initially started by Eric Davidson, now gathers the results of multiple research laboratories and offers an impressive granularity in space and time of the network structure, documenting over a hundred genes and their interactions.
The wealth of data gathered from decades of genetic screens on the fruit fly and the sea urchin opens the possibility to build novel hypotheses regarding the regulatory control of gene expression. These low-throughput approaches are now supplemented by high-throughput approaches to detect novel regulatory interactions and characterise gene expression, which are introduced in the next section.
Methods and challenges to assay genome complexity
Pan-genomic characterisation of gene expression and epigenomic status
Studying gene regulation implies tackling problems at two scales. Firstly, DNA is a molecule compacted inside the nucleus, whose size typically ranges between 2 and 10 microns [6]. Observing physical interactions occurring at such a small scale requires advanced imaging techniques. Secondly, the DNA sequence can reach several giga base pairs (Gbp) in length [6], calling for high-throughput reading/sequencing technologies.
Sequencing DNA became a reality with Sanger sequencing in 1977 [47] and reached high-throughput capacity with the technical advances of Roche and Illumina sequencing [47]. The Illumina sequencing technology sequentially incorporates fluorescently labelled nucleotides on short DNA strands (also called reads or tags), generated by the polymerase from a template strand.
To specifically study the regulatory regions, several types of experiments have been designed using sequencing methods (Table 1.2).
All these techniques bring precious information to better delineate each of the regulatory layers described in the previous section. Indeed, gDNA-seq detects genetic variants (nucleotide polymorphisms, insertions, deletions); ATAC-seq provides a direct measure of chromatin accessibility and thus highlights potential CRMs [48]; ChIP-seq targeting histone modifications depicts the activation state of these CRMs [49]; ChIP-seq targeting TFs can identify the targets of these factors for specific cell types, tissues and conditions [49]; Hi-C delineates the 3D organisation of the genome [50]; RNA-seq reveals the expression level of each individual gene [51]. As a result, high-throughput sequencing technologies help to map the 3D regulatory structure back onto the DNA sequence.
Nevertheless, these methods still have limitations. Indeed, they do not always enable the precise definition of active regions [52]. For example, it has been shown that chromatin accessibility is not a perfect proxy for enhancer activity [38, 53, 54]. Moreover, ChIP-seq methods inherently display high noise and do not detect histone or TF locations at base-pair resolution [55, 56]. Additionally, enzyme cleavage bias [57] and prolonged formaldehyde fixation [58] generate signal artifacts. Lastly, these experiments are based on pools of cells, consequently flattening cell-specific signals into an average measure. In order to overcome these limitations, new sequencing methods are rapidly emerging, such as long-read sequencing [59], native ChIP and single-cell techniques [61].
In parallel to high-throughput sequencing methods, there is also a fast development of tools based on high-resolution microscopy [62]. These techniques have the advantage of being informative regarding both the space and time dimensions. Yet, all the techniques mentioned here are just the tip of the iceberg, as we currently experience a sharp increase in the number of newly developed methods (cf. the Sequencing Method Explorer from Illumina).
Computational methods to harness genome-wide data
Although new sequencing techniques allow for a refined characterisation of the transcription regulatory landscape, the resulting data still need to be carefully processed with adapted bioinformatic tools to eliminate potential biases and extract relevant functional information. The main read-processing steps and their respective biases are listed in Table 1.3.
Sequenced reads may be subject to sequencing errors stemming from technical noise (e.g. weak fluorescence, overlapping probes). To grade the sequencing quality, sequencing machines assign to each nucleotide base call a Phred score, based on the probability of incorrect base identification. Reads with a low average Phred score are chiefly discarded using quality-check tools such as FastQC (bioinformatics.babraham.ac.uk/projects/fastqc).
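The Phred transformation itself is simple enough to sketch. Below is a minimal Python illustration (the function names are ours, not FastQC's) of the relation Q = -10·log10(P) and of the standard Phred+33 ASCII encoding used in FASTQ files:

```python
def phred_to_error_prob(q: int) -> float:
    """Convert a Phred quality score Q into the probability of an
    incorrect base call: P = 10^(-Q/10)."""
    return 10 ** (-q / 10)

def mean_quality(quality_string: str, offset: int = 33) -> float:
    """Average Phred score of a read, decoding the ASCII-encoded
    quality string (Phred+33, the standard FASTQ encoding)."""
    scores = [ord(char) - offset for char in quality_string]
    return sum(scores) / len(scores)

# Q30 corresponds to a 1-in-1000 chance of a wrong base call,
# a common threshold for 'high quality' bases.
print(phred_to_error_prob(30))
print(mean_quality("IIII"))  # 'I' encodes Q40 in Phred+33
```

A read whose mean quality falls below a chosen threshold (often Q20, i.e. a 1% error probability per base) would then be discarded or trimmed.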
Read mapping constitutes one of the first processing steps. It consists in localising, within the genome sequence, the genomic coordinates corresponding to the region of origin of each read. Although intuitive at first glance, it requires an efficient implementation and precise parameter tuning to be correctly optimised. Indeed, as a sequencing experiment produces several million reads, there is a clear need for efficient mapping algorithms. Mapping software such as Bowtie 2 [66] and STAR [67] have relatively short running times thanks to their genome indexing methods.
Mapping algorithms align the reads on a pre-existing reference genome. This method avoids the need for a de novo genome assembly for each new sequencing assay. However, the reference genome does not exactly correspond to the probed genome. Consequently, mismatches may exist between the read and its genomic source on the reference. Together with sequencing errors, these mismatches must be taken into account to avoid discarding properly mapped reads. In that respect, mapping algorithms can accept imperfectly aligned reads if they do not exceed a certain penalty score, based on the number of mismatches and gaps in the alignment.
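The penalty-based acceptance rule can be illustrated with a toy scoring function. The penalty values below are illustrative only, not the actual defaults of Bowtie 2 or STAR, and real aligners use affine gap penalties and quality-aware mismatch costs:

```python
def alignment_penalty(read: str, reference: str,
                      mismatch: int = 6, gap: int = 5) -> int:
    """Toy penalty for an ungapped, end-to-end alignment of equal
    length strings: each mismatch adds a fixed penalty, and a '-'
    in either sequence counts as a gap position."""
    penalty = 0
    for read_base, ref_base in zip(read, reference):
        if read_base == "-" or ref_base == "-":
            penalty += gap
        elif read_base != ref_base:
            penalty += mismatch
    return penalty

def accept_read(read: str, reference: str, max_penalty: int = 12) -> bool:
    """Keep the alignment only if its total penalty stays below a
    threshold, mimicking a minimum-score filter."""
    return alignment_penalty(read, reference) <= max_penalty

# One mismatch (penalty 6) is tolerated; four mismatches are not.
print(accept_read("ACGA", "ACGT"))
print(accept_read("AAAA", "TTTT"))
```

This captures why a single polymorphism between the sequenced individual and the reference does not cause the read to be lost, while a spurious alignment accumulating many mismatches is rejected.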
Reads generally do not exceed 300 bp in length with Illumina machines. A genomic sequence of the same size is usually found only once within the genome, except for regions with low complexity and/or series of short repeats. Reads originating from such regions may align equally well to multiple genomic coordinates. As the real region of origin cannot be distinguished from the others, multi-mapping reads are generally discarded.
Following read mapping, we usually aim at comparing the signal within and between samples. HTSeq [68] and STAR enable the quantification of read counts per genomic feature. However, as two samples may not have the same sequencing depth, the raw number of reads mapping on a target region may not reflect the same level of signal. Consequently, it is crucial to apply library scaling [63] to adjust for sequencing depth prior to the comparison.
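The simplest form of library scaling, counts per million (CPM), can be sketched as follows; production tools apply more robust scaling factors, but the principle is the same:

```python
def counts_per_million(counts: list) -> list:
    """Scale raw read counts by the library size so that samples
    sequenced at different depths become comparable (counts per
    million mapped reads, CPM)."""
    total = sum(counts)
    return [count / total * 1e6 for count in counts]

# Two samples with identical relative signal but a 10-fold
# difference in sequencing depth yield identical CPM profiles.
sample_a = [100, 300, 600]       # 1,000 reads in total
sample_b = [1000, 3000, 6000]    # 10,000 reads in total
print(counts_per_million(sample_a))
print(counts_per_million(sample_b))
```

Without this step, every feature in sample_b would appear ten times more "expressed" than in sample_a purely because of depth, not biology.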
For high-throughput data targeting non-coding regions, one additional step is the definition of peaks (Fig. 1.5). Indeed, the features used for read counting in RNA-seq stem from gene annotation databases, whereas for ATAC-seq and ChIP-seq data there is no predefined set of regions to assess. It is therefore necessary to define, for each sample, the enriched non-coding regions. This peak-calling step is implemented in multiple algorithms, such as MACS2 [69]. Peak-calling algorithms usually account for false-positive detection by comparing the ChIP-seq measures with signal coming from untargeted DNA fragments extracted from the same sample (the input).
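The statistical core of peak calling can be reduced to a one-sided Poisson test: given the read count expected from the input in a genomic window, how surprising is the observed ChIP count? The sketch below uses a fixed background rate, whereas MACS2 estimates a dynamic local background:

```python
import math

def poisson_pvalue(observed: int, background_rate: float) -> float:
    """Upper-tail probability P(X >= observed) under a Poisson
    background model with mean `background_rate`:
    1 - sum_{k < observed} e^(-lam) * lam^k / k!"""
    cumulative = sum(
        math.exp(-background_rate) * background_rate ** k / math.factorial(k)
        for k in range(observed)
    )
    return 1.0 - cumulative

# A window with 20 ChIP reads where the input predicts ~5 reads is
# strongly enriched; 6 reads over the same background is not.
print(poisson_pvalue(20, 5.0))
print(poisson_pvalue(6, 5.0))
```

Windows whose p-value (after multiple-testing correction) falls below a threshold are reported as peaks; the input control is what supplies the background rate per window.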
Following feature counting and library scaling, the data are still susceptible to bias stemming from very highly expressed features (genes or peaks), which monopolise a large fraction of the sequencing effort in a subset of the samples. In such cases, even though all samples are scaled to the same sequencing depth, reads will not be equally distributed across the genome [63]. Several strategies specifically normalise the signal of outlier features. The most widely used method is the trimmed mean of M values (TMM) [65], implemented in edgeR [71]; DESeq2 [70] implements a related median-of-ratios approach for genic signal, and csaw handles intergenic signal [72].
After careful scaling and normalisation, the comparison of the signal between two samples can be tested within a statistical framework. In order to cope with the large dispersion of count data, the negative binomial distribution with an estimated overdispersion parameter is chiefly used to test for differential feature expression. This statistical test is implemented in DESeq2 [70] and edgeR [71]. One key requirement of the study design to greatly improve the power of the test is the inclusion of biological replicates for each condition.
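The reason the negative binomial is preferred over the simpler Poisson model is its extra dispersion parameter: under the mean-dispersion parameterisation used by DESeq2 and edgeR, the variance is Var = μ + α·μ², so biological variability beyond pure counting noise can be captured. A minimal numerical illustration:

```python
def nb_variance(mean: float, dispersion: float) -> float:
    """Negative binomial variance under the mean-dispersion
    parameterisation: Var = mu + alpha * mu^2. With alpha = 0 this
    reduces to the Poisson variance (Var = mu)."""
    return mean + dispersion * mean ** 2

# For a gene with mean count 100, a modest dispersion of 0.1 makes
# the variance an order of magnitude larger than the Poisson model
# would allow, matching the spread seen in biological replicates.
print(nb_variance(100, 0.0))   # Poisson limit
print(nb_variance(100, 0.1))   # over-dispersed counts
```

This is also why biological replicates matter: the dispersion α must be estimated per feature, and with too few replicates the estimate, and hence the test, becomes unreliable.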
ChIP-seq data targeting transcription factors also open the possibility to search for transcription factor binding site motifs [73]. Peaks detected from ChIP-seq targeting a given transcription factor should be enriched in sequences matching its binding motif and those of its potential co-factors (Fig. 1.5). One can infer the corresponding binding motifs with computational suites, such as the RSAT suite [73–75]. Motifs are usually represented as Position Weight Matrices (PWMs), giving the likelihood of observing one of the four possible nucleotides at each base-pair position [76]. Although such analyses are extremely powerful for studying the actors of transcription regulation, they still require the consideration of large amounts of data to delineate tissue or co-binding specificities [77]. Additionally, such methods still lack detection power for assays yielding less specific and broader signal, such as ChIP-seq targeting histone marks [78].
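Scanning a sequence with a PWM amounts to summing, over positions, the log-likelihood ratio of each base under the motif model versus a background model. The matrix below is a hypothetical 4-bp motif for illustration, not a real TF motif:

```python
import math

# Hypothetical position weight matrix for a 4-bp motif:
# one dict per position, giving the probability of each nucleotide.
PWM = [
    {"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1},
    {"A": 0.1, "C": 0.7, "G": 0.1, "T": 0.1},
    {"A": 0.1, "C": 0.1, "G": 0.7, "T": 0.1},
    {"A": 0.1, "C": 0.1, "G": 0.1, "T": 0.7},
]
BACKGROUND = 0.25  # uniform nucleotide background frequency

def pwm_score(sequence: str) -> float:
    """Log2 likelihood ratio of the sequence under the motif model
    versus the background; higher scores mean a better motif match."""
    return sum(
        math.log2(PWM[position][base] / BACKGROUND)
        for position, base in enumerate(sequence)
    )

# The consensus ACGT scores highest; one mismatch lowers the score,
# and a sequence far from the consensus scores negatively.
print(pwm_score("ACGT"))
print(pwm_score("ACGA"))
print(pwm_score("TTTT"))
```

Peak sequences are typically scanned at every offset (and on both strands) with such a score, and sites above a threshold are reported as candidate binding sites.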
To conclude, there is a vast diversity of tools to process functional genomics sequencing data. Each comes with its own specificities, advantages and challenges. However, due to the diversity of possible analysis designs, there is still no clear consensus on the "best" pipeline to use. This situation can lead to differences in analysis results and hinder reproducibility when associated with poor documentation. To tackle this problem, consortia such as ENCODE [79] and ROADMAP [80] are documenting and making publicly available processing guidelines, although these might not always be completely flexible (e.g. ROADMAP chiefly targets human data).
Table of contents:
1 Introduction
1.1 Historical perspectives
1.1.1 Sea urchin embryology in the 19th century
1.1.2 Fruit fly genetics in the 20th century
1.1.3 Drafting a gene regulatory circuit
1.1.4 Emergence of computational systems biology
1.2 Current view of gene regulatory logic
1.2.1 The DNA regulatory modules of transcription
1.2.2 The enhancer-promoter regulatory dialog
1.2.3 The regulation of enhancer activity in space and time
1.2.4 Enhancer activity screening and annotation
1.3 Methods and challenges to assay genome complexity
1.3.1 Pan-genomic characterisation of gene expression and epigenomic status
1.3.2 Computational methods to harness genome-wide data
1.3.3 Using perturbation to assess functionality
1.4 A network perspective on gene regulation
1.4.1 Systems Biology concepts
1.4.2 Probabilistic network inference
1.4.3 Mechanistic network modelling
1.5 Aims of my PhD
2 Deciphering cis-regulation using genetic variation
2.1 Study summary
2.2 Methodological background
2.2.1 The Drosophila Genetic Reference Panel
2.2.2 Allelic ratio measures in F1 hybrids
2.2.3 Mapping strategies for F1 hybrids
2.2.4 Controlling for mapping bias
2.2.5 Controlling for genotyping bias
2.3 Contribution to the published work
2.4 The mechanisms and evolutionary relevance of regulatory variants in embryonic development
2.4.1 Abstract
2.4.2 Introduction
2.4.3 Results
2.4.4 Discussion
2.4.5 Methods
2.4.6 Supplementary figures
2.4.7 Supplementary methods
2.4.8 References
2.5 Complementary results
2.5.1 Construction of the mappability mask
2.5.2 Impact of the synthetic mask
2.5.3 Impact of using F1 genomic data to discard genotyping errors
2.5.4 Impact of using egg data to discard maternal transcripts
2.5.5 Delineation of genomic regions with overlapping signals
2.5.6 Probing direct interactions with partial correlation
2.5.7 Exploring allelic imbalance at the SNP level
2.5.8 Script availability
3 Depicting trans-regulation using logical modelling
3.1 Study summary
3.2 Methodological background
3.2.1 The regulatory role of TGF-β signalling
3.2.2 Draw me a TGF-β map
3.2.3 Feedback circuits
3.2.4 Logical formalism
3.2.5 Dynamical simulation
3.3 Contribution to the published work
3.4 Deciphering and modelling the TGF-β signalling interplays specifying the dorsal-ventral axis of the sea urchin embryo
3.4.1 Abstract
3.4.2 Introduction
3.4.3 Results
3.4.4 Discussion
3.4.5 Materials and methods
3.4.6 Acknowledgements
3.4.7 References
3.5 Complementary results
3.5.1 Script availability
4 Conclusions and perspectives
4.1 Biological aspects
4.1.1 Coupling between epigenetic and transcriptional regulatory mechanisms
4.1.2 The mechanisms of TGF-β cross-inhibition
4.2 Methodological aspects
4.2.1 Allele-specific measurements can contrast cis- versus trans- effects
4.2.2 DNA binding motifs from ChIP-seq targeting histone marks
4.2.3 The iterative process of network modelling
4.3 Prospects
4.3.1 Aiming at a system-wide model
4.3.2 Towards the inference of regulatory networks
4.3.3 Qualitative inference of network dynamics