VIDEO FINGERPRINTING AT WORK: TRACKART


TrackART: Synopsis

The present thesis advances a novel video fingerprinting methodology called TrackART, able to identify visual content subjected to different types of user-induced, mundane or malicious distortions.
Concisely, the challenges the TrackART video fingerprinting method addresses are threefold:

  • Uniqueness: the TrackART method aims at proposing a video fingerprint which represents the video content with mathematical accuracy and rigor.
  • Robustness: the TrackART method aims at providing a general mathematical decision rule for the robustness to distortions and at addressing the challenging use case of live camcorder recording (which has not yet been addressed in the state of the art).
  • Scalability: the TrackART method aims at being operative even in large-scale databases.

The functioning principle of the TrackART method consists of two phases, the offline phase and the online phase, as illustrated in Fig. II.1.
As the word “offline” suggests, the offline phase comprises the computations executed before the run-time phase. Its purpose is to process the reference video collection in order to enable the retrieval (if it exists) of the original version of the query from the reference database, i.e. to enable the localization and fingerprint modules. The offline phase consists of two modules: pre-processing and offline localization.
The pre-processing stage prepares the reference video sequences for further processing by performing parameter setting and common image processing operations, as detailed in Section II.2.1.
The offline localization stage (detailed in Section II.2.2) maps the reference video content to a representation space which allows the matching of video content and enables the localization module.
In the online phase, a query video sequence is submitted for identification by a user or by another system. By passing the query through the modules of the run-time phase, its original version (if it exists in the reference data set) is to be identified. The online phase consists of four modules whose roles are briefly given here and further detailed in the rest of this chapter: pre-processing, localization, fingerprint and reduced fingerprint.
The pre-processing module (detailed in Section II.3.1) sets the parameters of the query video sequence to predefined values in order to avoid the variations induced by distortions.
The online localization module (detailed in Section II.3.2) aims at significantly reducing the set of reference sequences which are candidates for matching the query (initially, all the video sequences) and at identifying just a few nominees for further testing. Moreover, in the localization module, a potential starting position of the query sequence (i.e. the frame number) is obtained for each nominated video sequence.
The fingerprint module computes and matches the fingerprints of the query and reference video sequences. It consists of three blocks: fingerprint computation (detailed in Section II.3.3.2), fingerprint matching (detailed in Section II.3.3.3) and synchronization (detailed in Section II.3.3.4). The synchronization block is designed to ensure the correct content correspondence between the query and reference video sequences, which can be altered by video format distortions.
The reduced fingerprint module (detailed in Section II.3.4) aims at reducing the amount of information needed for identifying a query video sequence. It consists of two blocks: reduced fingerprint computation (detailed in Section II.3.4.1) and reduced fingerprint matching (detailed in Section II.3.4.2).
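To fix the overall dataflow, a minimal Python sketch of the online phase is given below. All function names (preprocess, localize, compute_fingerprint, match_fingerprints) are hypothetical stubs standing in for the modules detailed in Sections II.3.1 to II.3.3, not the actual TrackART implementation.

    from typing import List, Optional, Tuple

    # Hypothetical stubs for the TrackART online modules; the real
    # computations are detailed in Sections II.3.1-II.3.3. These
    # placeholders only fix the dataflow between modules.
    def preprocess(video: str) -> str:
        return video

    def localize(query: str, references: List[str]) -> List[Tuple[str, int]]:
        return [(ref, 0) for ref in references]   # (nominee, start frame)

    def compute_fingerprint(video: str, start: int = 0) -> List[float]:
        return [0.0]

    def match_fingerprints(a: List[float], b: List[float]) -> float:
        return 0.0

    def identify(query_video: str, references: List[str]) -> Optional[str]:
        """Sketch of the online phase: pre-process, localize, then match fingerprints."""
        query = preprocess(query_video)
        nominees = localize(query, references)    # only a few candidates remain
        scores = [(ref, match_fingerprints(compute_fingerprint(query),
                                           compute_fingerprint(ref, start)))
                  for ref, start in nominees]
        return max(scores, key=lambda s: s[1])[0] if scores else None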

Offline phase

The offline phase enables the localization of a query sequence within a reference sequence. Its purpose is to process the reference video collection and to map the visual content to a representation space. The representation of the content in a new space enables the comparison of the reference and query sequences with respect to certain similarity measures and under different types of distortions. The offline phase consists of two modules: pre-processing (Section II.2.1) and offline localization (Section II.2.2).

Pre-processing

The pre-processing stage aims at bringing the reference video sequences to a common format in order to reduce the influence of video format distortions (detailed in Section I.3.4.1), as follows.
Due to the multitude of existing video formats and to the manipulations that video sequences undergo along their consumption chain (i.e. encodings, transcodings), the video frame rate can vary widely. In order to enable the TrackART method to cope with this fact, the reference video sequences are all brought to the same frame-rate value. In the current implementation, the frame rate was set to 25 fps due to its frequent use in video formats, but any other value could equally be chosen. The changes of the video parameters are done with the ffmpeg library, which contains dedicated functions for controlling the video parameters.
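For illustration, a minimal Python sketch of this step is given below; it assumes the ffmpeg command-line tool is installed, and the file names are purely illustrative.

    import subprocess

    # A minimal sketch of the frame-rate normalization step; the output
    # option "-r 25" forces the common 25 fps rate used by TrackART.
    def normalize_frame_rate(src: str, dst: str, fps: int = 25) -> None:
        subprocess.run(["ffmpeg", "-y", "-i", src, "-r", str(fps), dst],
                       check=True)

    normalize_frame_rate("reference.avi", "reference_25fps.avi")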
The black keyframes were discarded and the letterboxing (if present) was removed, in order to consider only the valid visual information.
The length of a typical film is between 150,000 and 250,000 frames (i.e. between one hour and a half and two hours and a half at 25 frames per second); in order to reduce the complexity, keyframes are extracted uniformly, at a rate of one frame per second.
Note: shot-boundary keyframes were also considered in the current study, but as they are not repeatable under video distortions, they yield poor results (i.e. due to various distortions, the shot boundaries of an original video and of its distorted version will not coincide).
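A minimal Python sketch of the uniform keyframe extraction is given below, assuming ffmpeg's fps video filter; the file names and output pattern are illustrative.

    import subprocess

    # A minimal sketch of uniform keyframe extraction: "fps=1" keeps one
    # frame per second and writes each kept frame as a numbered image.
    def extract_keyframes(src: str, pattern: str = "keyframe_%05d.png") -> None:
        subprocess.run(["ffmpeg", "-y", "-i", src, "-vf", "fps=1", pattern],
                       check=True)

    extract_keyframes("reference_25fps.avi")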

Offline localization

The offline localization stage relies on the Bag of Words model, advanced by Sivic and Zisserman in [SIV 03]. Similar to terms in a text document, an image has local interest points or keypoints, defined as patches (small regions) that contain rich local information about the image. Inspired by text retrieval techniques, Sivic and Zisserman developed an algorithm for image search based on representing the images as bags (i.e. collections) of visual words (i.e. visual descriptors).
The matching of images is ensured by comparing the associated bags of words and by testing their spatial consistency.
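For illustration, a minimal sketch of the bag comparison is given below; it assumes each image has already been reduced to a histogram of visual-word occurrences over a shared vocabulary, and uses the cosine similarity as one common choice of measure (the spatial consistency test being a separate, subsequent step). The histogram values are purely illustrative.

    import numpy as np

    # A minimal sketch of bag-of-words matching: each image is a histogram
    # of visual-word counts, compared here with the cosine similarity.
    def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
        denom = np.linalg.norm(a) * np.linalg.norm(b)
        return float(a @ b) / denom if denom else 0.0

    query_hist = np.array([3.0, 0.0, 1.0, 2.0, 0.0])   # query bag of words
    ref_hist = np.array([2.0, 1.0, 1.0, 2.0, 0.0])     # reference bag of words
    print(cosine_similarity(query_hist, ref_hist))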
Being a search algorithm, the Bag of Words framework has two phases: the offline phase and the run-time phase, as illustrated in Fig. II.3.
The purpose of the offline phase is to build a visual word representation space based on the reference image collection and to represent each reference image as a collection of visual words from the representation space.
The representation space is called a visual vocabulary. It is built by detecting the local features (detailed in Section II.2.2.1) in all the reference images, by describing these local features with a formalized descriptor (detailed in Section II.2.2.2) and by clustering the descriptors into visual words (detailed in Section II.2.2.3). Each image in the reference collection is then expressed as a collection of weighted visual words from the vocabulary (detailed in Section II.2.2.4). In order to ensure the retrieval of images in the run-time phase, an inverted file structure stores, for every visual word, its occurrences in the reference images (detailed in Section II.2.2.5).
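For illustration, a minimal sketch of the vocabulary construction and of the inverted file is given below; it assumes k-means clustering, as in [SIV 03], and uses random stand-in descriptors (in place of the real local-feature descriptors of Section II.2.2.2) and a purely illustrative vocabulary size.

    import numpy as np
    from collections import defaultdict
    from sklearn.cluster import KMeans

    # Random stand-ins for the per-image local-feature descriptors.
    rng = np.random.default_rng(seed=0)
    descriptors_per_image = {i: rng.random((50, 128)) for i in range(4)}

    # Build the visual vocabulary by clustering all descriptors; each
    # cluster center is one visual word (vocabulary size is illustrative).
    all_descriptors = np.vstack(list(descriptors_per_image.values()))
    vocabulary = KMeans(n_clusters=10, n_init=10,
                        random_state=0).fit(all_descriptors)

    # Inverted file: visual word -> set of images in which it occurs.
    inverted_file = defaultdict(set)
    for image_id, descriptors in descriptors_per_image.items():
        for word in vocabulary.predict(descriptors):
            inverted_file[int(word)].add(image_id)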
Within the framework of the TrackART method, the run-time phase of the Bag of Words framework takes place in the localization module of the TrackART fingerprinting system and is further detailed in Section II.3.2.
A local feature is an image pattern which differs from its immediate neighborhood [TUY 08]. It is usually associated with a change of one image property (e.g. intensity, color or texture) or of several properties simultaneously; a few examples are illustrated in Fig. II.4. To identify local features in images, the underlying intensity patterns in a local neighborhood of pixels need to be analyzed by using a local feature detector. Local features can be interest points, regions (blobs) or edge segments.
A set of local features can be used as a robust image representation that allows recognizing objects or scenes without the need for segmentation [TUY 08]. Consequently, local features have gained a lot of momentum in computer vision in the last fifteen years, because they are powerful tools in applications like image retrieval from large databases [SCH 97], object retrieval in video [SIV 03], [SIV 04a], visual data mining [SIV 04b], texture recognition [LAZ 03a], [LAZ 03b], shot location [SCH 03], robot localization [SE 02] and recognition of object categories [DOR 03]. The relevance of local features has also been demonstrated in the context of object recognition by the human visual system [BIE 98]: these experiments showed that removing the corners from images impedes human recognition, while removing most of the straight edges does not.
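For illustration, a minimal sketch of local feature detection is given below; it uses OpenCV's ORB detector as a stand-in for the detector detailed in Section II.2.2.1, with a placeholder image path.

    import cv2

    # A minimal sketch of local feature detection on one keyframe; ORB is
    # only an illustrative detector, and the image path is a placeholder.
    image = cv2.imread("keyframe_00001.png", cv2.IMREAD_GRAYSCALE)
    orb = cv2.ORB_create(nfeatures=500)     # cap the number of keypoints
    keypoints, descriptors = orb.detectAndCompute(image, None)
    print(f"{len(keypoints)} keypoints detected")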
Good local features exhibit a few properties which make them useful in the above applications [TUY 08]:

  • repeatability: the property of a local region of being re-detected in other images under different camera viewpoints, illumination conditions and noise
  • distinctiveness: the intensity patterns underlying the detected features should show a lot of variation, such that features can be distinguished and matched
  • locality: the features should be local, so as to reduce the probability of occlusion and to allow simple model approximations of the geometric and photometric deformations between two images taken under different viewing conditions
  • quantity: the number of detected features should be sufficiently large, such that a reasonable number of features are detected even on small objects; ideally, the number of detected features should be adaptable over a large range by a simple and intuitive threshold
  • accuracy: the detected features should be accurately localized, both in image location and with respect to scale and possibly shape
  • efficiency: the detection of features in a new image should allow for time-critical applications.

Table of contents :

ABSTRACT
PART I: VIDEO FINGERPRINTING OVERVIEW
I.1 Introduction
I.2 Definition
I.3 Theoretical properties and requirements
I.3.1 Uniqueness
I.3.2 Robustness
I.3.3 Database search efficiency
I.3.4 Evaluation framework
I.3.5 Video fingerprinting requirements
I.4 Applicative and industrial panorama
I.4.1 Video identification and retrieval
I.4.2 Authentication of multimedia content
I.4.3 Copyright infringement prevention
I.4.4 Digital watermarking
I.4.5 Broadcast monitoring
I.4.6 Business analytics
I.5 State of the art
I.5.1 Industrial solutions
I.5.2 Academic state of the art
I.6 Conclusion
References
PART II: VIDEO FINGERPRINTING AT WORK: TRACKART
II.1 TrackART: Synopsis
II.2 Offline phase
II.2.1 Pre-processing
II.2.2 Offline localization
II.3 Online phase
II.3.1 Pre-processing
II.3.2 Online localization
II.3.3 Fingerprint
II.3.4 Reduced fingerprint
II.4 TrackART possible configurations
II.5 Conclusion
References
PART III: TRACKART – EXPERIMENTAL RESULTS
III.1 Context
III.2 Testing corpus
III.3 Video retrieval use-case
III.3.1 TrackART Full Fingerprint evaluation
III.3.2 TrackART Reduced Fingerprint evaluation
III.4 Live camcorder recording use-case
III.4.1 TrackART Full Fingerprint evaluation
III.4.2 TrackART Reduced Fingerprint evaluation
III.5 Computational cost
III.6 Video fingerprint demonstrator
III.7 Conclusion
References
PART IV: FINAL CONCLUSIONS AND PERSPECTIVES
Conclusions
Perspectives
Appendix
A.1. Online localization illustrations
A.2. Publications
A.3. Selection of publications

