A real-time applicable hand gesture recognition system 


On the nature of hands

A number of theories have been developed to explain the evolutionary divergence between humans and the rest of the animal world. Putting the emphasis on one or only some of them would probably not do them justice, especially since many interrelations remain unclear, one of them being whether the development of the hand triggered the increase in brain size or vice versa. Trying to link the ideas on the origin of human intelligence, Frank Wilson [128] introduces the hand-thought-language nexus, basing it on the pillars of the two problem-solving strategies of tool manufacturing and communication via language. While tools are also used by a number of animals, it remains undisputed that the variety and complexity of those created by man is unmatched, not to mention that we use them in order to create even more (complex) tools.

All this would not have been possible if it were not for the complexity of the hand's bone arrangement along with its 27 degrees of freedom (DoF), which is one of the reasons the study of the underlying topic is so interesting. So where exactly does this complexity come from? Figure 1.6 displays the bone structure of the hand (left) and the resulting DoF (right). Some parts of the fingers can move in one dimension, others in two, which, together with the 6 DoF of the wrist, yields the overall 27 dimensions.

Creating a tool that enables machines to interpret and understand the complexity of the hand is a challenging and unsolved task. A great deal of research is directed toward this topic, as the applications, which will be laid out in detail later in this work, are countless. One could imagine replicas of human arms helping disabled people, robots mimicking human behaviour to achieve complex tasks, or systems controlled by the variety of hand gestures, as already happens in 2D on smartphones and tablets, and as envisioned in movies and partly realized by virtual reality environments.
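To make the count of 27 mentioned above concrete, a common breakdown in the literature assigns four DoF to each of the four fingers, five to the thumb, and six to the wrist; the exact partition varies slightly between authors, so this is one conventional accounting rather than necessarily the one depicted in Figure 1.6:

```latex
\[
\underbrace{4 \times 4}_{\text{fingers: 2 at MCP, 1 at PIP, 1 at DIP each}}
+ \underbrace{5}_{\text{thumb}}
+ \underbrace{6}_{\text{wrist: 3 translations, 3 rotations}}
= 27~\text{DoF}
\]
```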

Hand gesture recognition approaches

Several surveys have been conducted that give a concise overview of the vast body of work done so far in the field of hand gesture recognition [17], [20], [65], [83], [92], [117], [131], [64], with different foci, e.g. underlying techniques, HMI-related questions, algorithmic approaches, or current challenges in general. Regarding the algorithmic approach, there are two different kinds of techniques: the model-based approach, in which a hand model is created to resemble the 27 DoF of the hand, and the data-driven or appearance-based approach, which aims at interpreting e.g. the colors or the silhouette of a certain region in the data. Naturally, both methods can be combined to improve recognition results, with one obvious drawback being the increased computation time; however, the underlying thesis follows the purely data-driven approach, as one of the main goals is to demonstrate that hand gesture recognition can be performed in real-time without the need to formalize a complex model.
With respect to the interpretation task, one usually follows the three-step approach of detecting a region of interest, tracking it, and finally applying a recognition algorithm to assign some sort of meaning (hand pose/gesture) to the data. Each of these individual steps can be achieved by numerous techniques (cf. Rautaray [92] for an overview); however, the underlying thesis shows that more than satisfactory results can be achieved by establishing a large database and using one classifier with sophisticated fusion techniques, coupled with a simple definition of dynamic hand gestures, making Hidden Markov Models (HMM) or Finite-State Machines (FSM), and even computationally expensive tracking algorithms, obsolete.
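As an illustration only, the following minimal sketch shows how such a reduced, tracking-free pipeline could be organized per frame. The helpers crop_hand_region and compute_descriptor are hypothetical stand-ins for hand segmentation and an ESF/VFH-style shape descriptor, and the classifier is assumed to expose an sklearn-style predict_proba; none of this is the thesis's actual code.

```python
# Minimal sketch of a tracking-free recognition pipeline (not the thesis code).
from collections import Counter, deque

import numpy as np


def crop_hand_region(point_cloud):
    """Hypothetical detection step: return the hand points, or None."""
    return point_cloud if len(point_cloud) > 0 else None


def compute_descriptor(roi):
    """Hypothetical descriptor: a fixed-length histogram of point norms,
    standing in for ESF/VFH (both are histogram-based shape descriptors)."""
    norms = np.linalg.norm(roi, axis=1)
    hist, _ = np.histogram(norms, bins=64, range=(0.0, 1.0))
    return hist.astype(np.float64) / max(hist.sum(), 1)


class GesturePipeline:
    """Per-frame classification plus temporal majority voting; no tracking,
    no HMM/FSM, in the spirit of the reduced pipeline argued for above."""

    def __init__(self, classifier, history_len=10, conf_threshold=0.6):
        self.classifier = classifier             # e.g. a trained MLP
        self.history = deque(maxlen=history_len)  # recent per-frame labels
        self.conf_threshold = conf_threshold

    def process_frame(self, point_cloud):
        roi = crop_hand_region(point_cloud)      # detection (step 1)
        if roi is None:
            return None
        descriptor = compute_descriptor(roi)     # feature extraction
        probs = self.classifier.predict_proba(descriptor[None, :])[0]
        label, conf = int(np.argmax(probs)), float(np.max(probs))
        if conf < self.conf_threshold:           # reject uncertain frames
            return None
        self.history.append(label)
        # A majority vote over recent frames stands in for explicit
        # tracking or temporal models such as HMMs.
        return Counter(self.history).most_common(1)[0][0]
```

The point of the sketch is structural rather than definitive: per-frame classification followed by a short temporal vote already expresses a dynamic gesture, which is the kind of simplification the later chapters evaluate.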

Human-Machine Interaction

One definition of HMI refers to the research and development of interfaces focusing on the behaviour of humans during an interaction process as well as the design of computer technology. The term usability frequently co-occurs when speaking of HMI; on a general level, it denotes a measure of how easily something can be used, or learned to be used. HMI aims at improving the usability of the interfaces to be developed, and therefore, within the context of this thesis, the goal is to improve the usability of an interface controlled by human hand gestures. More specifically, the aim is to provide an interface which is intuitive to use, provides a low barrier to entry, and ideally lowers the cognitive load for the subject.

As computers have increased in computational power, with a simultaneous decrease in cost and size, new means of interaction have been developed. As stated by Wigdor et al. [127], these evolutionary steps have taken place more or less discontinuously, in phases. The outcomes of these phases still persist today: command typing via terminal, followed by the graphical user interface (GUI), are probably still the most widely used interfaces in most households and offices around the world. During these phases, some interfaces naturally proved more dominant than others, which is why they prevailed, especially for certain tasks. However, as the authors further claim, in many situations there exist hybrids, which a more in-depth analysis reveals. For instance, a console-like element is available in many different operating systems as a means to quickly search for items stored on a hard drive, simply because this has proven to be the easiest and most efficient way for trained users to solve the given task (as opposed to cumbersomely inspecting the contents of each folder). Another example of console-like elements is the possibility to enter complex formulas.


Table of contents:

I Foundations 
1 Problem description
1.1 3D sensors
1.1.1 Depth by triangulation
1.1.2 Depth measurement by Kinect
1.1.3 ToF sensors
1.1.3.1 ToF sensors with lightpulse emission
1.1.3.2 ToF sensors with wavelength modulation
1.1.4 Conclusion
1.2 On the nature of hands
1.2.1 Gesture taxonomy
1.2.2 Hand gesture recognition approaches
1.3 Human-Machine Interaction
1.3.1 HMI for Advanced Driver Assistance Systems
1.4 Related work
1.4.1 Contact-based approaches
1.4.2 2D vision-based approaches
1.4.3 3D vision-based approaches
1.4.4 Hybrid approaches
1.4.5 In-vehicle gesture recognition systems
1.4.6 Conclusion
2 3D data algorithms 
2.1 Introduction
2.2 Estimating surface normals in a point cloud
2.3 3D shape descriptors
2.3.1 Point Feature Histograms
2.3.2 Fast Point Feature Histograms
2.3.3 The ESF descriptor
2.3.4 The VFH descriptor
2.3.5 The randomized PFH descriptor
2.4 Conclusion
3 Machine Learning and multiclass classification
3.1 Object recognition
3.2 Introduction to Machine Learning
3.2.1 Support vector machines
3.2.2 Neural networks
3.2.3 Deep Learning
3.3 Introduction to multiclass classification
3.3.1 Multiclass classification with support vector machines
3.3.2 Multiclass classification with neural networks
II Own contribution 
Introduction
4 Hand gesture database 
4.1 Hand-forearm cropping with principal component analysis
5 Fusion techniques for multiclass classification with MLPs
5.1 An approach to multi-sensor data fusion
5.1.1 Descriptor parametrization
5.1.1.1 The ESF descriptor
5.1.1.2 The VFH descriptor
5.1.2 Neural network classification and fusion
5.1.3 Experiments with MLPs
5.1.4 Experiments with SVMs
5.1.5 Discussion
5.2 Extracting information from neuron activities
5.2.1 Contribution and novelty
5.2.2 Methods
5.2.2.1 Exploiting information from output neurons
5.2.2.2 Fusing features and neural activities
5.2.3 MLP structure and data set
5.2.4 Experiments – overview
5.2.5 Experiment 1 – training on output activities
5.2.6 Experiment 2 – fusing output activities with features
5.2.7 Experiment 3 – output neurons plus features with multiple MLPs
5.2.8 Experiment 4 – Generalization on unseen data
5.2.8.1 Neural network topology
5.2.8.2 Preparation of the data sets
5.2.8.3 Experiments and results
5.2.9 Experiment 5 – Temporal integration of information
5.2.9.1 Neural network topology and training parameters
5.2.9.2 Experiments and results
5.2.10 Experiment 6 – MNIST
5.2.11 Discussion
6 A real-time applicable hand gesture recognition system 
6.1 A multi-sensor setup for in-vehicle infotainment control
6.1.1 System description
6.1.2 Neural network classification and fusion
6.1.3 Experiments
6.1.3.1 Baseline performance
6.1.3.2 Generalization to unknown persons
6.1.3.3 Boosting by thresholding confidence
6.1.3.4 Improvement of generalization to unknown persons
6.1.4 Conclusion
6.2 Dynamic hand gesture recognition with a single ToF camera
6.2.1 In-car infotainment application
6.2.2 Dynamic hand gestures
6.2.3 Experiments and results
6.2.4 Discussion
7 Building a gesture-controlled Natural User Interface for in-vehicle infotainment control 
7.1 A multimodal freehand gesture recognition system for natural HMI
7.1.1 System setup
7.1.2 Key question – what is intuitive?
7.1.3 User study
7.1.4 Results
7.1.5 Discussion
8 Discussion and Outlook 
8.1 Discussion
8.2 Outlook
Bibliography
