Text-To-Speech (TTS)
Applications of speech synthesis
Over the last few decades, speech synthesis, or TTS, has drawn considerable attention and resources from not only researchers but also industry. The field has progressed remarkably in recent years, and it is no longer the case that state-of-the-art systems sound overtly mechanical and robotic. The concept of high-quality TTS synthesis appeared in the mid-eighties as a result of important developments in speech synthesis and natural language processing techniques, mostly due to the emergence of new technologies. In recent years, considerable advances in quality have made TTS systems more common in various domains and numerous applications.
It appears that the first real-life use of TTS systems was to support blind people by reading text from a book and converting it into speech. Although the quality of these initial systems was very robotic, they were surprisingly well adopted by blind users because of their availability compared with other options such as reading braille or having a real person do the reading (Taylor, 2009). Nowadays, a number of TTS systems help blind users interact with computers. One of the most important and longest-standing applications for people with visual impairment is the screen reader, in which TTS helps users navigate around an operating system. Blind people have also widely benefited from TTS systems in combination with a scanner and Optical Character Recognition (OCR) software, which give them access to written information (Dutoit and Stylianou, 2003). More recently, TTS systems have been commonly used by people with reading disorders (e.g. dyslexia) and other reading difficulties, as well as by preliterate children. These systems are also frequently employed to aid those with severe speech impairment, usually through a voice output communication aid (Hawley et al., 2013). TTS techniques have also widely assisted people with disabilities in mass transit.
Nowadays, there exist a large number of talking books and toys that use speech synthesis technologies. High-quality TTS synthesis can be coupled with “a computer aided learning system, and provide a helpful tool to learn a new language”. Speech synthesis techniques are also used in entertainment productions such as games and animations; for example, NEC Biglobe announced a web service that allows users to create phrases with the voices of Code Geass, a Japanese anime series. TTS systems are also valuable for other research fields, for instance as laboratory tools for linguists or for vocal monitoring. Beyond this, TTS systems have been used for reading messages, electronic mails, news, stories, weather reports, travel directions and a wide variety of other applications.
One of the main applications of TTS today is in call-center automation, where textual information can be accessed over the telephone. In such systems, a user pays an electricity bill or books some travel and conducts the entire transaction through an automatic dialogue system (Taylor, 2009; Dutoit, 1997). Another important use of TTS is in speech-based question answering systems (e.g. Yahoo!, 2009) or voice-search applications (e.g. Microsoft, 2009; Google, 2009), where speech recognition and retrieval systems are tightly coupled. In such systems, users can pose their information need in a natural input modality, i.e. spoken language, and then receive a collection of answers that potentially address that need directly. On smartphones, some typical and well-known voice interactive applications whose main component is multi-lingual TTS are Google Now, Apple Siri, AOL, Nuance Nina and Samsung S-Voice. In these applications, a virtual assistant allows users to perform a number of personalized, effortless commands and services via a human-like conversational interface, such as authenticating, navigating menus and screens, querying information, or performing transactions.
Basic architecture of TTS
The basic architecture of a TTS system, illustrated in Figure 1.1, has two main parts (Dutoit and Stylianou, 2003) with four components (Huang et al., 2001). The first three – i.e. Text Processing, Grapheme-to-Phoneme (G2P) Conversion and Prosody Modeling – belong to the high-level speech synthesis, or Natural Language Processing (NLP), part of a TTS system. The low-level speech synthesis, or Digital Signal Processing (DSP), part – the fourth component – generates the synthetic speech using information from the high-level synthesis. The input of a TTS system can be either raw or tagged text; tags can be used to assist text, phonetic, and prosodic analysis.
The Text Processing component transforms the input text into an appropriate form so that it becomes speakable. The G2P Conversion component converts orthographic lexical symbols (i.e. the output of the Text Processing component) into the corresponding phonetic sequence, i.e. a phonemic representation with possible diacritical information (e.g. position of the accent). The Prosody Modeling component attaches appropriate pitch, duration and other prosodic parameters to the phonetic sequence. Finally, the Speech Synthesis component takes the parameters from the fully tagged phonetic sequence to generate the corresponding speech waveform (Huang et al., 2001, p. 682). Depending on how much the application knows about the structure and content of the text it wishes to speak, some components can be skipped. For instance, certain broad requirements such as rate and pitch can be indicated with simple command tags appropriately located in the text. An application that can extract much information about the structure and content of the text to be spoken can considerably improve the quality of synthetic speech. If the input of the system already contains the phonetic form, the G2P Conversion module can be absent. In some cases, an application may have F0 contours pre-calculated by some other process, e.g. transplanted from a real speaker’s utterance. The quantitative prosodic controls in these cases can be treated as a “special tagged field and sent directly along with the phonetic stream to speech synthesis for voice rendition” (Huang et al., 2001, p. 6).
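To make the data flow between these components concrete, the following minimal Python sketch chains four placeholder stages in the order described above; all class names and their internal behaviour are hypothetical stand-ins, not an actual TTS implementation.

# Minimal sketch of the four-component TTS pipeline described above.
# All names are hypothetical; each stage is a placeholder for the real module.

class TextProcessor:
    def process(self, raw_text):
        # text normalization, document structure analysis, ...
        return raw_text.strip().split()          # list of orthographic words

class G2PConverter:
    def convert(self, words):
        # orthographic words -> phonemes with diacritic/tone information
        return [list(w.lower()) for w in words]  # naive letter-as-phone stand-in

class ProsodyModel:
    def annotate(self, phone_seq):
        # attach pitch (Hz) and duration (ms) to every phone
        return [{"phone": p, "f0": 120.0, "dur_ms": 80} for ps in phone_seq for p in ps]

class Synthesizer:
    def synthesize(self, tagged_phones):
        # low-level DSP part: tagged phonetic sequence -> waveform samples
        return [0.0] * sum(p["dur_ms"] for p in tagged_phones)  # silent placeholder

def tts(raw_text):
    words = TextProcessor().process(raw_text)
    phones = G2PConverter().convert(words)
    tagged = ProsodyModel().annotate(phones)
    return Synthesizer().synthesize(tagged)

waveform = tts("Hello world")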
Text Processing. This component is responsible for “indicating all knowledge about the text or message that is not specifically phonetic or prosodic in nature”. Its basic function is to convert non-orthographic items into speakable words. This process, called text normalization, turns a variety of symbols, numbers, dates, abbreviations and other non-orthographic entities of text into a “common orthographic transcription” suitable for subsequent phonetic conversion. It is also necessary to analyze white space, punctuation and other delimiters to determine document structure; this information provides context for all later processes. Moreover, some elements of document structure, e.g. sentence breaking and paragraph segmentation, may have direct implications for prosody. Sophisticated syntactic and semantic analysis can be performed, if necessary, for further processes, e.g. to obtain syntactic constituency and semantic features of words, phrases, clauses, and sentences (Huang et al., 2001, p. 682).
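As a toy illustration of text normalization, the sketch below expands a few abbreviations and spells out digits so that every token becomes speakable; the abbreviation table and digit handling are invented examples, far simpler than a real normalizer.

import re

# Hypothetical, tiny text-normalization rules: expand a few abbreviations and
# spell out integers digit by digit so that every token becomes a speakable word.
ABBREVIATIONS = {"dr.": "doctor", "st.": "street", "etc.": "et cetera"}
DIGITS = ["zero", "one", "two", "three", "four", "five", "six", "seven", "eight", "nine"]

def normalize(text):
    out = []
    for tok in text.lower().split():
        if tok in ABBREVIATIONS:
            out.append(ABBREVIATIONS[tok])
        elif re.fullmatch(r"\d+", tok):
            out.extend(DIGITS[int(d)] for d in tok)   # "42" -> "four two"
        else:
            out.append(tok)
    return " ".join(out)

print(normalize("Dr. Smith lives at 42 Main St."))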
G2P Conversion. The task of this component is to “convert lexical orthographic symbols to phonemic representation” (i.e. phonemes – basic units of sound) along with “possible diacritic information (e.g. stress placement)” or lexical tones in tonal languages. “Even though future TTS systems might be based on word sounding units with increasing storage technologies, homograph disambiguation and G2P conversion for new words (either true new words being invented over time or morphologically transformed words) are still necessary for systems to correctly utter every word. G2P conversion is trivial for languages where there is a simple relationship between orthography and phonology. Such a simple relationship can be well captured by a handful of rules. Languages such as Spanish and Finnish belong to this category and are referred to as phonetic languages. English, on the other hand, is remote from phonetic language because English words often have many distinct origins”. Letter-to-sound conversion can then be done by general letter-to-sound rules (or modules) and a dictionary lookup to produce accurate pronunciations of any arbitrary word (Huang et al., 2001, p. 683).
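The dictionary-lookup-plus-rules strategy can be sketched as follows; the lexicon entries, phone symbols and letter-to-sound rules are invented for illustration and do not correspond to any real lexicon.

# Sketch of the dictionary-plus-rules G2P strategy: look the word up in a
# pronunciation lexicon first, and fall back to naive letter-to-sound rules
# for out-of-vocabulary words. Lexicon entries and rules are invented examples.
LEXICON = {
    "speech": ["s", "p", "iy", "ch"],
    "text":   ["t", "eh", "k", "s", "t"],
}
LETTER_RULES = {"a": "ae", "e": "eh", "i": "ih", "o": "ao", "u": "ah",
                "c": "k", "q": "k", "x": "k s"}

def g2p(word):
    word = word.lower()
    if word in LEXICON:                       # dictionary lookup
        return LEXICON[word]
    phones = []                               # letter-to-sound fallback
    for letter in word:
        phones.extend(LETTER_RULES.get(letter, letter).split())
    return phones

print(g2p("speech"), g2p("tts"))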
Prosody Modeling. This component provides prosodic information (i.e. “an acoustic representation of prosody”) for the parsed text and phone string derived from the linguistic information. First, it is necessary to break a sentence into prosodic phrases, possibly separated by pauses, and to assign labels, such as emphasis, to different syllables or words within each prosodic phrase. The duration, measured in centiseconds (cs) or milliseconds (ms), is then predicted using rule-based (e.g. Klatt) or machine-learning (e.g. CART) methods. Pitch, a perceptual correlate of fundamental frequency (F0) in speech perception, expressed in Hz or fractional tones (semitones, quarter tones…), is then generated. F0, responsible for the perception of melody, is probably the most characteristic of all the prosody dimensions; hence the generation of pitch contours is a highly complicated, language-dependent problem. Intensity, expressed in decibels (dB), can also be modeled. Moreover, prosody depends not only on the linguistic content of a sentence, but also on speakers and their moods/emotions. Different speaking styles can be used for a prosody generation system, and different prosodic representations can then be obtained (Huang et al., 2001).
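As an illustration of rule-based duration prediction, the toy procedure below asks a few context questions and returns a duration in milliseconds; the questions and values are invented and are neither the Klatt rules nor a trained CART.

# Illustrative, hand-written decision procedure for phone duration (in ms),
# in the spirit of the rule-based/CART approaches mentioned above.
# The questions and numbers are invented for illustration only.
def predict_duration(phone, stressed, phrase_final):
    is_vowel = phone in {"a", "e", "i", "o", "u"}
    if is_vowel:
        if phrase_final:
            return 160 if stressed else 120   # final lengthening
        return 110 if stressed else 80
    return 70 if phrase_final else 50

print(predict_duration("a", stressed=True, phrase_final=True))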
Speech Synthesis. This final component, the only one in the low-level synthesis, takes the predicted information from the fully tagged phonetic sequence to generate the corresponding speech waveform. In general, there are currently two basic approaches to speech synthesis: (i) source/filter synthesizers, which produce “completely synthetic” voices using a source/filter model from the parametric representation of speech, and (ii) concatenative synthesizers, which concatenate pre-recorded human speech units in order to construct the utterance. The first approach has issues in generating speech parameters from the input text as well as generating good quality speech from the parametric representation. In the second approach, signal processing modification and several algorithms/strategies need to be employed to make the speech sound smooth and continuous, especially at the joins. Details on these approaches are presented in the next subsections.
Source/filter synthesizer
The main idea of this type of synthesizer is to reproduce speech from its parametric representation using a source/filter model. This approach makes use of the classical acoustic theory of speech production, based on vocal tract models. “An impulse train is used to generate voiced sounds and a noise source to generate obstruent sounds. These are then passed through the filters to produce speech” (Taylor, 2009, p. 410).
Formant synthesis and classical Linear Prediction (LP) are the basic techniques in this approach. Formant synthesis uses individually “controllable formant filters which can be set to produce accurate estimations of the vocal tract transfer function”. The parameters of the formant synthesizer are determined by a set of rules which examine the phone characteristics and phone context. Very natural speech can be generated so long as the parameters are set very accurately; unfortunately, it is extremely hard to do this automatically. The inherent difficulty and complexity of designing formant rules by hand has led to this technique largely being abandoned for engineering purposes. In general, formant synthesis produces speech that is intelligible and often “clean” sounding, but far from natural. The reasons for this are: (i) the “too simplistic” source model, and (ii) the “too simplistic” target and transition model, which misses many of the subtleties really involved in the dynamics of speech. While the shapes of the formant trajectories are measured from a spectrogram, the underlying process is one of motor control and muscle movement of the articulators (Taylor, 2009, p. 410).
Classical Linear Prediction adopts the “all-pole vocal tract model”, and is similar to formant synthesis with respect to the source and to how vowels are produced. It differs in that all sounds are generated by an all-pole filter, whereas parallel filters are common in formant synthesis. Its main strength is that the vocal tract parameters can be determined automatically from speech. Despite its ability to faithfully mimic the target and transition patterns of natural speech, standard LP synthesis has a significantly unnatural quality to it, often impressionistically described as “buzzy” or “metallic” sounding. While the vocal tract model parameters can be measured directly from real speech, an explicit impulse/noise model is still used for the source; the buzzy nature of the speech may be caused by this “overly simplistic” sound source (Taylor, 2009, p. 411).
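A minimal sketch of the source/filter idea in the classical LP style is given below: an impulse-train (voiced) or noise (obstruent) excitation is passed through an all-pole filter. The filter coefficients are arbitrary stable values chosen for illustration, not parameters estimated from real speech.

import numpy as np

# Minimal source/filter (classical LP style) sketch: an impulse-train excitation
# for a voiced sound, or white noise for an obstruent, passed through an
# all-pole filter. The filter coefficients below are invented, stable examples.
FS = 16000                                      # sample rate (Hz)

def excitation(voiced, f0, n_samples):
    if voiced:
        e = np.zeros(n_samples)
        e[::int(FS / f0)] = 1.0                 # impulse train at the pitch period
        return e
    return np.random.randn(n_samples) * 0.1     # noise source

def all_pole_filter(exc, a):
    # y[n] = exc[n] - a1*y[n-1] - a2*y[n-2] - ...
    y = np.zeros_like(exc)
    for n in range(len(exc)):
        y[n] = exc[n]
        for k, ak in enumerate(a, start=1):
            if n - k >= 0:
                y[n] -= ak * y[n - k]
    return y

a_coeffs = [-1.3, 0.8]                          # toy 2-pole "vocal tract"
voiced_frame = all_pole_filter(excitation(True, f0=120, n_samples=800), a_coeffs)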
The main limitations of these techniques concern “not so much the generation of speech from the parametric representation, but rather the generation of these parameters from the input specification which is created by the text analysis process. The mapping between the specification and the parameters is highly complex, and seems beyond what we can express in explicit human derived rules, no matter how “expert” the rule designer” (Taylor, 2009, p. 412). Furthermore, acquiring data is fundamentally difficult, and improving naturalness often necessitates a considerable increase in the complexity of the synthesizer. The classical linear prediction technique can be considered as “a partial solution to the complexities of specification to parameter mapping”, where the issue of explicitly generating the vocal tract parameters is bypassed by measuring them from data. The source parameters, however, are still “specified by an explicit model, which was identified as the main source of the unnaturalness” (Taylor, 2009, p. 412).
A new type of glottal flow model, the Causal-Anticausal Linear filter Model (CALM), was proposed by Doval et al. (2003). The main idea was to establish a link between two seemingly incompatible approaches to voice source modeling, namely the spectral modeling approach and the time-domain modeling approach. Both approaches can be envisaged in a unified framework, where time-domain models can be considered, or at least approximated, by a mixed CALM. The “source/filter” model can then be considered as an “excitation/filter” model: the non-linear part of the source model is associated with the excitation (i.e. quasi-periodic impulses), and the mixed causal-anticausal linear part of the model is associated with the filter component, without any loss of rigor.
Concatenative synthesizer
This type of synthesizer is based on the idea of concatenating pieces of pre-recorded human speech in order to construct an utterance. The approach can be viewed as an extension of the classical LP technique, with a noticeable increase in quality largely arising from the abandonment of the over-simplistic impulse/noise source model. The difference from classical linear prediction is that the source waveform is generated using templates/samples (i.e. instances of speech units). The input to the source, however, is “still controlled by an explicit model”, e.g. “an explicit F0 generation model of the type that generates an F0 value every 10ms” (Taylor, 2009, p. 412).
During database creation, each recorded utterance is segmented into individual phones, di-phones, half-syllables, syllables, morphemes, words, phrases or sentences. The choice of speech unit considerably affects the TTS system: a system that stores phones or di-phones provides the largest output range, but may lack clarity, whereas for specific (limited) domains, the storage of entire words, phrases or sentences allows high-quality output. However, di-phones are the most popular type of speech unit, so a di-phone system is the typical concatenative synthesis system.
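The basic concatenation step can be sketched as follows, assuming a toy di-phone inventory; the stored waveforms are random stand-ins, and a short cross-fade is applied at each join purely for illustration.

import numpy as np

# Sketch of the basic concatenative idea for a di-phone inventory: the unit for
# each phone-to-phone transition is cut from recorded speech, stored, and the
# utterance is built by concatenating those units (the waveforms here are noise
# stand-ins, and a short cross-fade smooths each join).
DIPHONE_DB = {("h", "e"): np.random.randn(800),
              ("e", "l"): np.random.randn(700),
              ("l", "o"): np.random.randn(900)}

def concatenate(diphones, xfade=80):
    out = DIPHONE_DB[diphones[0]].copy()
    ramp = np.linspace(0.0, 1.0, xfade)
    for d in diphones[1:]:
        unit = DIPHONE_DB[d].copy()
        out[-xfade:] = out[-xfade:] * (1 - ramp) + unit[:xfade] * ramp
        out = np.concatenate([out, unit[xfade:]])
    return out

utt = concatenate([("h", "e"), ("e", "l"), ("l", "o")])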
The synthesis specification is in the form of a list of items, each with a verbal specification, one or more pitch values, and a duration. The prosodic content is generated by explicit algorithms, while signal processing techniques are used to modify the pitch and timing of the di-phones to match that of the specification. Pitch Synchronous OverLap and Add (PSOLA), a traditional method for synthesis, operates in the time domain. It separates the original speech into “frames pitch-synchronously” and performs modification by overlapping and adding these frames onto a new set of epochs, created to match the synthesis specification. Other techniques developed to modify the pitch and timing can be found in the work of Taylor (2009).
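A heavily simplified, time-domain PSOLA-style sketch is shown below: Hann-windowed, two-period frames centred on the analysis epochs are overlap-added at a new set of synthesis epochs. Real PSOLA implementations additionally select which analysis frame to map onto each synthesis epoch and handle edge effects more carefully; all signals and epoch positions here are toy values.

import numpy as np

# Very simplified TD-PSOLA sketch: extract Hann-windowed, two-period frames
# centred on the analysis epochs (pitch marks) and overlap-add them at a new
# set of synthesis epochs. Frames are reused in order rather than selected.
def psola(signal, analysis_epochs, synthesis_epochs, period):
    out = np.zeros(synthesis_epochs[-1] + period + 1)
    win = np.hanning(2 * period + 1)
    for i, t_new in enumerate(synthesis_epochs):
        t_old = analysis_epochs[min(i, len(analysis_epochs) - 1)]
        lo, hi = t_old - period, t_old + period + 1
        if lo < 0 or hi > len(signal):
            continue                                  # skip frames near the edges
        frame = signal[lo:hi] * win
        out[t_new - period:t_new + period + 1] += frame
    return out

# Toy usage: placing synthesis epochs further apart lengthens the signal and lowers the pitch.
x = np.random.randn(2000)
ana = list(range(100, 1900, 100))                      # analysis pitch marks
syn = list(range(100, 2300, 120))                      # slower epoch rate
y = psola(x, ana, syn, period=100)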
While this is successful to a certain extent, it is not a perfect solution. We can “never collect enough data to cover all the effects we wish to synthesize, and often the coverage we have in the database is very uneven. Furthermore, the concatenative approach always limits us to recreating what we have recorded; in a sense all we are doing is reordering the original data” (Taylor, 2009, p. 435). Another obvious issue is how to successfully join sections of a waveform such that the joins cannot be heard and the final speech sounds smooth, continuous and not obviously concatenated. The quality of these techniques is considerably higher than that of classical, impulse-excited linear prediction. They all have roughly similar quality to one another, meaning that the choice of which technique to use is mostly made on other criteria, such as speed and storage.
Unit selection and statistical parametric synthesis
Building on the two basic approaches to speech synthesis, many improvements have been proposed towards high-quality TTS systems. Statistical parametric speech synthesis and unit-selection techniques are the two prominent state-of-the-art techniques, and hence have been widely discussed by researchers with differing judgments. This section describes and compares these techniques.
From concatenation to unit-selection synthesis
In a concatenative TTS system, the pitch and timing of the original waveforms are modified by a signal processing technique to match the pitch and timing of the specification. Taylor (2009, p. 474) identified two assumptions underlying a di-phone system: (i) “within one type of di-phone, all variations are accountable by pitch and timing differences” and (ii) “the signal processing algorithms are capable of performing all necessary pitch and timing modifications without incurring any unnaturalness”. These assumptions are “overly strong, and are limiting factors on the quality of the synthesis. While work still continues on developing signal processing algorithms, even an algorithm which changed the pitch and timing perfectly would still not address the problems that arise from the first assumption. The problem here is that it is simply not true that all the variation within a di-phone is accountable by pitch and timing differences”.
These observations about the weakness of concatenative synthesis led to the development of “a range of techniques collectively known as unit-selection. These use a richer variety of speech, with the aim of capturing more natural variation and relying less on signal processing”. The idea is that for each basic linguistic type, there are a number of units, which “vary in terms of prosody and other characteristics” (Taylor, 2009, p. 475).
In the unit-selection approach, new natural-sounding utterances can be synthesized by selecting appropriate sub-word units from a database of natural speech (Zen et al., 2009), according to how well a chosen unit matches a specification/target unit (the target cost) and how well two chosen units join together (the concatenation cost). During synthesis, an algorithm selects one unit from the possible choices, in an attempt to find the best overall sequence of units that matches the specification (Taylor, 2009). The specification and the units are entirely described by a feature set including both linguistic and speech features. A Viterbi-style search is performed to find the sequence of units with the lowest total cost, which is calculated from the feature set.
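The search can be sketched as a small dynamic program over candidate units, summing target and concatenation costs along each path; the cost functions and feature dictionaries below are placeholders, not the actual costs used in any published system.

# Sketch of unit selection as a Viterbi-style search over candidate units: the
# total cost of a path is the sum of target costs (how well each candidate
# matches the specification) and concatenation costs (how well consecutive
# candidates join). Costs are computed over toy feature dictionaries.
def target_cost(spec, unit):
    return abs(spec["f0"] - unit["f0"]) + abs(spec["dur"] - unit["dur"])

def concat_cost(prev_unit, unit):
    return abs(prev_unit["end_f0"] - unit["start_f0"])

def select_units(specs, candidates):
    # specs: list of target specifications; candidates: one candidate list per target
    best = [(target_cost(specs[0], u), [u]) for u in candidates[0]]
    for t in range(1, len(specs)):
        new_best = []
        for u in candidates[t]:
            cost, path = min(((c + concat_cost(p[-1], u), p) for c, p in best),
                             key=lambda cp: cp[0])
            new_best.append((cost + target_cost(specs[t], u), path + [u]))
        best = new_best
    return min(best, key=lambda cp: cp[0])[1]

# Toy usage with two targets and a handful of candidate units.
specs = [{"f0": 120, "dur": 80}, {"f0": 110, "dur": 70}]
cands = [
    [{"f0": 118, "dur": 82, "start_f0": 118, "end_f0": 119},
     {"f0": 130, "dur": 60, "start_f0": 131, "end_f0": 128}],
    [{"f0": 112, "dur": 71, "start_f0": 117, "end_f0": 110}],
]
print(select_units(specs, cands))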
According to the review of Zen et al. (2009), there are two basic techniques in unit-selection synthesis, even though they are theoretically not very different: (i) the selection model (Hunt and Black, 1996), illustrated in Figure 1.2a, and (ii) the clustering method, which allows the target cost to effectively be pre-calculated (Donovan et al., 1998), illustrated in Figure 1.2b. The difference is that, in the second approach, units of the same type are clustered into a decision tree that asks questions about features available at the time of synthesis (e.g. phonetic and prosodic contexts).
From vocoding to statistical parametric synthesis
As mentioned earlier, the main limitation of source/filter synthesizers is generating speech parameters from the input specification created by text analysis. The mapping between the specification and the parameters is highly complex, and seems beyond what we can express in explicit human-derived rules, no matter how “expert” the rule designer is (Taylor, 2009). A “complex model”, i.e. a set of rules trainable from speech itself, is hence necessary for this purpose.
The solution comes partly from the idea of vocoding, in which a speech signal is converted into a (usually more compact) representation so that it can be transmitted. In speech synthesis, the parameterized speech is stored instead of transmitted, and those speech parameters are then processed to generate the corresponding speech waveform. Statistical parametric synthesis is thus based on the idea of vocoding for extracting and generating speech parameters. Most importantly, it provides statistical, machine learning techniques to automatically train the specification-to-parameter mapping from data, thus bypassing the problems associated with hand-written rules. The extracted speech parameters are aligned with contextual features to build the trained models.
In a typical statistical parametric speech synthesis system, parametric representations of speech including spectral and excitation parameters (i.e. vocoder parameters, used as inputs of the vocoder) are extracted from a speech database and then modeled by a set of generative models. The Maximum Likelihood (ML) criterion is usually used to estimate the model parameters. For a given word sequence to be synthesized, speech parameters are then generated from the set of estimated models so as to maximize their output probabilities. Finally, a speech waveform is reconstructed from the parametric representations of speech (Zen et al., 2009).
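The following toy sketch illustrates the statistical parametric idea in its simplest form: fit one Gaussian per label to training frames (so ML estimation reduces to sample means and variances) and generate parameters for a new label sequence by taking the most likely values. The data are random stand-ins, and real systems use context-dependent HMMs rather than one Gaussian per label.

import numpy as np

# Toy illustration of the statistical parametric idea: ML-estimate a Gaussian
# per label from training frames of a vocoder parameter, then "generate"
# parameters for a new label sequence by taking the most likely value (the mean).
def train(frames_by_label):
    return {lab: (np.mean(f, axis=0), np.var(f, axis=0))
            for lab, f in frames_by_label.items()}

def generate(models, label_sequence):
    return np.vstack([models[lab][0] for lab in label_sequence])

training_data = {"a": np.random.randn(50, 3) + 1.0,   # 3-dim "spectral" vectors
                 "b": np.random.randn(40, 3) - 1.0}
models = train(training_data)
params = generate(models, ["a", "a", "b"])             # parameter trajectory to vocode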
Although any generative model can be used, HMMs have been particularly popular. In HMM-based speech synthesis (HTS) (Yoshimura et al., 1999), the speech parameters of a speech unit, such as the spectrum and excitation parameters (e.g. fundamental frequency – F0), are statistically modeled and generated by context-dependent HMMs. Training and synthesis are the two main processes in the core architecture of a typical HMM-based speech synthesis system, as illustrated in Figure 1.3 (Yoshimura, 2002).
In the training process, ML estimation is performed using the Expectation-Maximization (EM) algorithm, very much as in speech recognition. The main difference is that both spectrum (e.g. mel-cepstral coefficients and their dynamic features) and excitation (e.g. log F0 and its dynamic features) parameters are extracted from a database of natural speech and modeled by a set of multi-stream context-dependent HMMs. Another difference is that linguistic and prosodic contexts are taken into account in addition to phonetic ones (the so-called contextual features). Each HMM also has a state-duration distribution to model the temporal structure of speech; common choices are the Gaussian distribution and the Gamma distribution, estimated from statistics obtained at the last iteration of the forward-backward algorithm.
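At synthesis time, the dynamic features mentioned above are what allow smooth trajectories to be generated: the static trajectory c that maximizes the likelihood, given stacked means mu and variances U of the static and delta features, has the well-known closed form c = (W' U^-1 W)^-1 W' U^-1 mu, where W maps statics to statics-plus-deltas. The sketch below solves this for a one-dimensional toy case; the means, variances and the simple delta operator used for W are invented values, not output of a trained HMM.

import numpy as np

# Sketch of ML parameter generation with dynamic features: solve
# c = (W' U^-1 W)^-1 W' U^-1 mu for a toy 1-dimensional stream of T frames.
T = 5
mu_static = np.array([1.0, 1.2, 2.0, 2.1, 2.0])
mu_delta  = np.array([0.0, 0.5, 0.4, 0.0, -0.1])
var_static = np.full(T, 0.1)
var_delta  = np.full(T, 0.05)

# Build W (2T x T): the first T rows copy the statics, the next T rows compute
# delta[t] = 0.5 * (c[t+1] - c[t-1]) with the edges clamped.
W = np.zeros((2 * T, T))
W[:T, :] = np.eye(T)
for t in range(T):
    W[T + t, max(t - 1, 0)] -= 0.5
    W[T + t, min(t + 1, T - 1)] += 0.5

mu = np.concatenate([mu_static, mu_delta])
U_inv = np.diag(1.0 / np.concatenate([var_static, var_delta]))

WtU = W.T @ U_inv
c = np.linalg.solve(WtU @ W, WtU @ mu)      # smooth static trajectory
print(c)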
Table of contents:
1 Vietnamese Text-To-Speech: Current state and Issues
1.1 Introduction
1.2 Text-To-Speech (TTS)
1.2.1 Applications of speech synthesis
1.2.2 Basic architecture of TTS
1.2.3 Source/filter synthesizer
1.2.4 Concatenative synthesizer
1.3 Unit selection and statistical parametric synthesis
1.3.1 From concatenation to unit-selection synthesis
1.3.2 From vocoding to statistical parametric synthesis
1.3.3 Pros and cons
1.4 Vietnamese language
1.5 Current state of Vietnamese TTS
1.5.1 Unit selection Vietnamese TTS
1.5.2 HMM-based Vietnamese TTS
1.6 Main issues on Vietnamese TTS
1.6.1 Building phone and feature sets
1.6.2 Corpus availability and design
1.6.3 Building a complete TTS system
1.6.4 Prosodic phrasing modeling
1.6.5 Perceptual evaluations with respect to lexical tones
1.7 Proposition and structure of dissertation
2 Hanoi Vietnamese phonetics and phonology: Tonophone approach
2.1 Introduction
2.2 Vietnamese syllable structure
2.2.1 Syllable structure
2.2.2 Syllable types
2.3 Vietnamese phonological system
2.3.1 Initial consonants
2.3.2 Final consonants
2.3.3 Medials or Pre-tonal sounds
2.3.4 Vowels and diphthongs
2.4 Vietnamese lexical tones
2.4.1 Tone system
2.4.2 Phonetics and phonology of tone
2.4.3 Tonal coarticulation
2.5 Grapheme-to-phoneme rules
2.5.1 X-SAMPA representation
2.5.2 Rules for consonants
2.5.3 Rules for vowels/diphthongs
2.6 Tonophone set
2.6.1 Tonophone
2.6.2 Tonophone set
2.6.3 Acoustic-phonetic tonophone set
2.7 PRO-SYLDIC, a pronounceable syllable dictionary
2.7.1 Syllable-orthographic rules
2.7.2 Pronounceable rhymes
2.7.3 PRO-SYLDIC
2.8 Conclusion
3 Corpus design, recording and pre-processing
3.1 Introduction
3.2 Raw text
3.2.1 Rich and balanced corpus
3.2.2 Raw text from different sources
3.3 Text pre-processing
3.3.1 Main tasks
3.3.2 Sentence segmentation
3.3.3 Tokenization into syllables and NSWs
3.3.4 Text cleaning
3.3.5 Text normalization
3.3.6 Text transcription
3.4 Phonemic distribution
3.4.1 Di-tonophone
3.4.2 Theoretical speech unit sets
3.4.3 Real speech unit sets
3.4.4 Distribution of speech units
3.5 Corpus design
3.5.1 Design process
3.5.2 The constraint of size
3.5.3 Full coverage of syllables and di-tonophones
3.5.4 VDTS corpus
3.6 Corpus recording
3.6.1 Recording environment
3.6.2 Quality control
3.7 Corpus preprocessing
3.7.1 Normalizing margin pauses
3.7.2 Automatic labeling
3.7.3 The VDTS speech corpus
3.8 Conclusion
4 Prosodic phrasing modeling
4.1 Introduction
4.2 Analysis corpora and Performance evaluation
4.2.1 Analysis corpora
4.2.2 Precision, Recall and F-score
4.2.3 Syntactic parsing evaluation
4.2.4 Pause prediction evaluation
4.3 Vietnamese syntactic parsing
4.3.1 Syntax theory
4.3.2 Vietnamese syntax
4.3.3 Syntactic parsing techniques
4.3.4 Adoption of parsing model
4.3.5 VTParser, a Vietnamese syntactic parser for TTS
4.4 Preliminary proposal on syntactic rules and breaks
4.4.1 Proposal process
4.4.2 Proposal of syntactic rules
4.4.3 Rule application and analysis
4.4.4 Evaluation of pause detection
4.5 Simple prosodic phrasing model using syntactic blocks
4.5.1 Duration patterns of breath groups
4.5.2 Duration pattern of syllable ancestors
4.5.3 Proposal of syntactic blocks
4.5.4 Optimization of syntactic block size
4.5.5 Simple model for final lengthening and pause prediction
4.6 Single-syllable-block-grouping model for final lengthening
4.6.1 Issue with single syllable blocks
4.6.2 Combination of single syllable blocks
4.7 Syntactic-block+link+POS model for pause prediction
4.7.1 Proposal of syntactic link
4.7.2 Rule-based model
4.7.3 Predictive model with J48
4.8 Conclusion
5 VTED, a Vietnamese HMM-based TTS system
5.1 Introduction
5.2 Typical HMM-based speech synthesis
5.2.1 Hidden Markov Model
5.2.2 Speech parameter modeling
5.2.3 Contextual features
5.2.4 Speech parameter generation
5.2.5 Waveform reconstruction with vocoder
5.3 Proposed architecture
5.3.1 Natural Language Processing (NLP) part
5.3.2 Training part
5.3.3 Synthesis part
5.4 Vietnamese contextual features
5.4.1 Basic Vietnamese training feature set
5.4.2 ToBI-related features
5.4.3 Prosodic phrasing features
5.5 Development platform and configurations
5.5.1 Mary TTS, a multilingual platform for TTS
5.5.2 Mary TTS workflow of adding a new language
5.5.3 HMM-based voice training for VTED
5.6 Vietnamese NLP for TTS
5.6.1 Word segmentation
5.6.2 Text normalization (vted-normalizer)
5.6.3 Grapheme-to-phoneme conversion (vted-g2p)
5.6.4 Part-of-speech (POS) tagger
5.6.5 Prosody modeling
5.6.6 Feature Processing
5.7 VTED training voices
5.8 Conclusion
6 Perceptual evaluations
6.1 Introduction
6.2 Evaluations of ToBI features
6.2.1 Subjective evaluation
6.2.2 Objective evaluation
6.3 Evaluations of general naturalness
6.3.1 Initial test
6.3.2 Final test
6.3.3 Discussion on the two tests
6.4 Evaluations of general intelligibility
6.4.1 Measurement
6.4.2 Preliminary test
6.4.3 Final test with Latin square
6.5 Evaluations of tone intelligibility
6.5.1 Stimuli and paradigm
6.5.2 Initial test
6.5.3 Final test
6.5.4 Confusion in tone intelligibility
6.6 Evaluations of prosodic phrasing model
6.6.1 Evaluations of model using syntactic rules
6.6.2 Evaluations of model using syntactic blocks
6.7 Conclusion
7 Conclusions and perspectives
7.1 Contributions and conclusions
7.1.1 Adopting technique and performing literature reviews
7.1.2 Proposing a new speech unit – tonophone
7.1.3 Designing and building a new corpus
7.1.4 Proposing a prosodic phrasing model
7.1.5 Designing and constructing VTED
7.1.6 Evaluating the TTS system
7.2 Perspectives
7.2.1 Improvement of synthetic voice quality
7.2.2 TTS for other Vietnamese dialects
7.2.3 Expressive speech synthesis
7.2.4 Voice reader
7.2.5 Reading machine