GMM emission probability hardware accelerator


Time-synchronous search

In time-synchronous search, the decoder iterates through all the tokens in the search space at each time step (speech frame). The tokens propagate through the search space according to the Viterbi algorithm, and all tokens finish propagating before the decoder moves on to the next time step; the search is therefore breadth-first. In practice, keeping a token in every HMM state is infeasible, because the number of tokens to propagate would be too large for the decoder. Pruning is therefore essential for practical applications, at the cost of introducing search errors. One common pruning technique is beam pruning [44, 43].
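As an illustration of the idea, the following Python sketch runs token passing frame by frame and drops every token whose score falls more than a fixed beam width below the best token of that frame. The function name `viterbi_beam`, the dictionary-based graph layout, and the toy scores are assumptions made for this example; they are not taken from the decoder described in the thesis.

```python
import math

def viterbi_beam(log_trans, log_emit, init_state, beam):
    """Time-synchronous Viterbi token passing with beam pruning (sketch).

    log_trans:  dict mapping state -> list of (next_state, log_prob)
    log_emit:   list over frames of dict mapping state -> log emission score
    init_state: state holding the single initial token
    beam:       tokens scoring below (frame best - beam) are discarded
    Returns (best final log score, number of tokens kept after each frame).
    """
    tokens = {init_state: 0.0}  # active tokens: state -> best log score
    kept = []
    for frame in log_emit:
        new_tokens = {}
        # Propagate every active token before moving to the next frame.
        for state, score in tokens.items():
            for nxt, lp in log_trans.get(state, []):
                if nxt not in frame:
                    continue
                cand = score + lp + frame[nxt]
                # Viterbi recombination: keep only the best token per state.
                if cand > new_tokens.get(nxt, -math.inf):
                    new_tokens[nxt] = cand
        if not new_tokens:
            return -math.inf, kept
        # Beam pruning relative to the best token of this frame.
        best = max(new_tokens.values())
        tokens = {s: v for s, v in new_tokens.items() if v >= best - beam}
        kept.append(len(tokens))
    return max(tokens.values()), kept

# Toy left-to-right model with three states and two frames (made-up scores).
TRANS = {0: [(0, -1.0), (1, -1.0)], 1: [(1, -1.0), (2, -1.0)], 2: [(2, -1.0)]}
FRAMES = [{0: -1.0, 1: -5.0, 2: -9.0}, {0: -1.0, 1: -2.0, 2: -3.0}]

wide = viterbi_beam(TRANS, FRAMES, 0, beam=10.0)
narrow = viterbi_beam(TRANS, FRAMES, 0, beam=2.0)
```

In this toy case the narrow beam keeps fewer tokens per frame and still finds the same best score. In general a narrow beam can discard the token that would have become the best path, which is the search error mentioned above.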

Overview of Juicer

Juicer is a software package consisting of two main parts [41, 42]. The first part handles WFST generation. Specifically, the software tools provided by the Juicer package generate three WFSTs (C, L and G) from various knowledge sources: the HMM definition (for generating C), the pronunciation dictionary (for generating L) and the language model (for generating G). Based on Equation (3.5), the three constituent WFSTs are composed into one fully-integrated WFST, which is further optimized by determinization, weight pushing and minimization. For WFST composition and optimization, Juicer relies on third-party tools such as the AT&T FSM Library [37] and the MIT FST Toolkit [24]. The second part of the software package is a WFST-based time-synchronous speech recognizer. At the initialization stage, the fully-integrated WFST search space and the HMM definition are loaded into the recognizer.

Fixed-point speech recognition system

For desktop applications, speech recognition algorithms often use floating-point arithmetic. However, many embedded systems lack a hardware floating-point unit. Hence, it is necessary to consider a fixed-point implementation of the speech recognizer and to evaluate its performance in terms of recognition accuracy and decoding speed. A typical speech recognition system processes numerical data with different dynamic ranges. To minimize the quantization error, it is essential to assign different precision formats to different data types. In this chapter, we propose a framework for converting data formats from floating-point to fixed-point. The speech recognition algorithm is partitioned into three sub-tasks, namely, feature extraction, emission probability calculation and Viterbi search.
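To make the notion of a fixed-point format concrete, here is a minimal Python sketch of converting between floating-point values and signed fixed-point integers with a chosen number of fractional bits. The helper names `to_fixed`, `to_float` and `fixed_mul` are hypothetical and only illustrate the general technique, not the conversion framework proposed in the chapter.

```python
def to_fixed(x, frac_bits, total_bits=32):
    """Quantize a float to a signed fixed-point integer, saturating on overflow."""
    scaled = int(round(x * (1 << frac_bits)))
    lo = -(1 << (total_bits - 1))
    hi = (1 << (total_bits - 1)) - 1
    return max(lo, min(hi, scaled))

def to_float(q, frac_bits):
    """Convert a fixed-point integer back to a float."""
    return q / (1 << frac_bits)

def fixed_mul(a, b, frac_bits):
    """Multiply two values that share one format; the shift restores the scale.

    No saturation is applied here, so the caller must ensure the result fits.
    """
    return (a * b) >> frac_bits

# With 20 fractional bits, one unit in the last place is 2**-20.
q = to_fixed(0.1, 20)
err = abs(to_float(q, 20) - 0.1)  # rounding error, at most 2**-21
```

More fractional bits reduce the quantization error but leave fewer integer bits for large values, which is why data with different dynamic ranges benefit from different formats.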

Emission probability with fixed-point formats

Table 4.2 shows the various data types and their fixed-point formats used during the calculation of the emission probabilities. As discussed in Section 4.1.2, the MFCC features are in the 32-bit Q11.20 format. To reduce the number of bits in subsequent calculations, each feature coefficient is truncated to 16 bits. Since the dynamic range of each coefficient is different, a separate quantizer is built for each coefficient [9]. As shown in (4.10), the Gaussian mixture mean is subtracted from the MFCC features. Therefore, the means and the features share the same fixed-point format. The dynamic range of each mean component, denoted by Rµ(d), is expressed as follows.
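The expression for Rµ(d) does not appear in this excerpt. Independently of its exact form, the per-coefficient truncation from Q11.20 to 16 bits might be sketched as follows. The sketch assumes that a maximum absolute value is known for each dimension and that the number of fractional bits is chosen so that this value just fits in a signed 16-bit word; this selection rule and all function names are assumptions for illustration, not the quantizer design of [9].

```python
import math

def frac_bits_for_range(max_abs, total_bits=16):
    """Choose fractional bits so that |x| <= max_abs fits in a signed word."""
    if max_abs <= 0:
        return total_bits - 1
    # Integer bits needed to represent magnitudes up to max_abs.
    int_bits = max(0, math.floor(math.log2(max_abs)) + 1)
    return total_bits - 1 - int_bits

def requantize(q_in, in_frac_bits, out_frac_bits):
    """Truncate a fixed-point value to fewer fractional bits (arithmetic shift).

    Assumes in_frac_bits >= out_frac_bits.
    """
    return q_in >> (in_frac_bits - out_frac_bits)

def build_quantizers(max_abs_per_dim):
    """One format (number of fractional bits) per feature dimension."""
    return [frac_bits_for_range(r) for r in max_abs_per_dim]

def quantize_features(features_q20, frac_bits_per_dim):
    """Truncate Q11.20 coefficients to 16 bits, one format per dimension."""
    return [requantize(x, 20, f) for x, f in zip(features_q20, frac_bits_per_dim)]

# Three hypothetical dimensions with very different dynamic ranges.
formats = build_quantizers([100.0, 3.2, 0.4])
# 50.25, -1.5 and 0.25 expressed with 20 fractional bits.
features_q20 = [52690944, -1572864, 262144]
features_16 = quantize_features(features_q20, formats)
```

Because a mean is subtracted from the feature of the same dimension, storing both in the same per-dimension format lets the subtraction be done directly on the integers without rescaling.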


Target platform – Altera Nios II processor

The target platform is based on the Altera Nios II processor, which is a softcore embedded processor [3]. In contrast to a hardcore processor, a softcore processor allows designers to configure the processor core to suit their application needs. A softcore processor is typically described in a hardware description language (HDL) and can be synthesized on an FPGA device. In addition, the softcore processor-based approach enables designers to develop the entire system by integrating the processor core with various types of peripherals on a programmable chip (system-on-a-programmable-chip, or SOPC). Besides standard peripherals, custom peripherals can also be built. Therefore, a softcore processor-based system offers a flexible platform for hardware-software co-design.

Contents:

  • 1 Introduction
    • 1.1 Objectives of the thesis
    • 1.2 Current embedded ASR system architectures
    • 1.3 Contributions of the thesis
    • 1.4 Outline of the thesis
  • 2 Fundamentals of speech recognition
    • 2.1 The decoding problem
    • 2.2 Feature extraction
    • 2.3 Acoustic modelling
      • 2.3.1 Hidden Markov Model (HMM)
      • 2.3.2 Evaluation of HMM
      • 2.3.3 Decoding of HMM
      • 2.3.4 Training of HMM
      • 2.3.5 HMM/GMM system
      • 2.3.6 Hybrid HMM/ANN system
    • 2.4 Language modelling
    • 2.5 Decoding
    • 2.6 Search space representation
      • 2.6.1 Re-entrant lexical tree
      • 2.6.2 Weighted finite state transducer (WFST)
    • 2.7 Search algorithm
      • 2.7.1 Time-synchronous search
      • 2.7.2 Time-asynchronous search
    • 2.8 Performance metrics
    • 2.9 Summary
  • 3 WFST-based speech recognizer
    • 3.1 WFST theory
    • 3.2 Static WFST composition
    • 3.3 Overview of Juicer
    • 3.4 Summary
  • 4 Fixed-point speech recognition system
    • 4.1 Feature extraction
      • 4.1.1 Algorithm
      • 4.1.2 Feature extraction with fixed-point formats
    • 4.2 Emission probability calculation
      • 4.2.1 Algorithm
      • 4.2.2 Emission probability with fixed-point formats
    • 4.3 Viterbi search
      • 4.3.1 Algorithm
      • 4.3.2 Viterbi search with fixed-point formats
    • 4.4 Recognition accuracy
    • 4.5 Summary
  • 5 Pure software-based system
    • 5.1 Target platform – Altera Nios II processor
    • 5.2 System architecture
    • 5.3 Timing profile
    • 5.4 Summary
  • 6 Hardware-software co-processing system
    • 6.1 System architecture
    • 6.2 GMM emission probability hardware accelerator
      • 6.2.1 Datapath
      • 6.2.2 Timing profile
      • 6.2.3 Resource usage
    • 6.3 Adaptive pruning
      • 6.3.1 Algorithm
      • 6.3.2 Timing profile
    • 6.4 Performance evaluation
    • 6.5 Summary
  • 7 Dynamic composition of WFST
    • 7.1 Motivation
    • 7.2 Static WFST composition in ASR
    • 7.3 Current Approaches to Dynamic WFST Composition
    • 7.4 Proposed Approach to Dynamic WFST Composition
      • 7.4.1 Finding the Anticipated Output Labels
      • 7.4.2 The Dynamic Composition Algorithm
    • 7.5 Experimental Results
    • 7.6 Summary
  • 8 Conclusions
    • 8.1 Hardware-software co-processing ASR system
    • 8.2 Dynamic composition of WFST
    • 8.3 Future work
  • A Mel filter bank
  • B DCT and Liftering 

Embedded Speech Recognition Systems
