In this paper, we present machine learning algorithms and systems for similar video retrieval, where the query is itself a video. For the similarity measurement, exemplars, or representative frames of each video, are extracted by unsupervised learning. For this learning, we chose order-aware competitive learning. After obtaining a set of exemplars for each video, the similarity is computed. Because the numbers and positions of the exemplars differ from video to video, we use a similarity computing method called the M-distance, which generalizes existing global and local alignment methods by exploiting the followers of the exemplars. To represent each frame of a video, this paper adopts the Frame Signature of the ISO/IEC standard, which makes the total system, along with its graphical user interface, practical. Experiments on the detection of inserted plagiaristic scenes showed excellent precision-recall curves, with precision values very close to 1. Thus, the proposed system can work as a plagiarism detector for videos. In addition, this method can be regarded as the structuring of unstructured data via numerical labeling by exemplars. Finally, further sophistication of this labeling is discussed.
The collection of videos, or moving images, on the Internet is growing rapidly owing to the spread of smartphones and sensor cameras. The ease of collecting and uploading videos has created a contemporary mass of unorganized data that hinders the retrieval of appropriate videos.
Videos on the web usually have associated text, such as a title, description, and category, that has been annotated by hand. Such text-based meta information describes the content of a whole video. This means that the temporal resolution of textual information is low and cannot represent temporally local context well. Thus, retrieving a time span within a video according to visual context is impractical for text-based video retrieval systems. To enhance the temporal resolution, we would need to create more textual labels for a video, which incurs additional cost. Moreover, the retrieved results would still be obtained based only on the textual labels of the videos, not their visual contents.
Therefore, in this paper, we present a set of learning algorithms and a system for content-based video retrieval. The concept and variants of this type of retrieval are surveyed in [
To automatically generate numerical labels for videos, we perform the following steps.
Step 1: We extract the representative features for each frame of each video. (Note that every frame of a video is a still image.) Such features correspond to the labels.
Step 2: Using the frame-wise labels, a set of representative frames, or exemplars, is determined by an inter-frame learning algorithm. The set of order-aware exemplars, with their positions and followers, becomes the final label for each video.
Step 3: We use matching algorithms to compare two videos through these labels. This method computes the similarity of two videos through the positions of the exemplars and the numbers of their followers.
Step 4: For a given query video, the retrieved results are ranked and presented according to the similarity score.
The feature extraction for each frame of Step 1 can be implemented using different methods by a system designer. Our final selection in this system is the Frame Signature of ISO/IEC [
The organization of the rest of this paper is as follows. In Section 2, we explain the concept of content-based similar video retrieval. Section 3 presents feature extraction and conversion methods to generate video descriptors for each frame. In Section 4, we present a set of inter-frame learning algorithms for selecting order-aware exemplars. Section 5 is devoted to the M-distance similarity computation. Section 6 shows the performance of the designed system, which is evaluated by 11-point interpolated precision-recall curves. The results demonstrate that the proposed system is applicable to detecting the unlicensed insertion of video clips. In the concluding remarks of Section 7, we discuss the structuring of general data by more sophisticated descriptors.
Content-based video retrieval is the problem of retrieving videos from a database that are similar to an input video. The overall process of our system is illustrated in
Let $\mathcal{V}$ be a collection of videos, with elements $V \in \mathcal{V}$. Each video is a time series of frame images $V = \{v_t\}_{t=1}^{T}$ of length $T$; every frame image is indexed by $t$, from 1 to $T$. Here, we simply denote one of the videos in collection $\mathcal{V}$ as $V$, without an index, to avoid complicated notation. Although we omit the index, the lengths and contents of the videos in $\mathcal{V}$ can vary.
The first step of the process is to label videos with numerical features. This step is shown in the upper box in
conducted for each frame image $v_t$. After preprocessing, we extract representative descriptors $e_n \in X$ $(n = 1, \dots, N)$ to reduce the size of the video data. These descriptors are called exemplar frames, or simply exemplars. As mentioned in the previous section, we find exemplars and their followers using our machine learning algorithms, not by hand. We also calculate the number of followers $E_n$ for each exemplar $e_n$. As a result, we obtain the feature $E = \{e_n, E_n\}_{n=1}^{N}$ of a video $V$. Finally, we store $E$ in a database $\mathcal{K}$.
When a query video is used to search for videos in our system, the system retrieves videos in database $\mathcal{K}$ using their features. Because the query is generally not included in the database, the system extracts the feature of the query video before retrieval. Let $V_q = \{v_{q,t}\}_{t=1}^{T_q}$ be a query video of length $T_q$. The system obtains the query descriptors $X_q = \{x_{q,t}\}_{t=1}^{T_q}$ and feature $E_q = \{e_{q,n}, E_{q,n}\}_{n=1}^{N_q}$ using the same processes used to build $\mathcal{K}$, that is, using the machine learning algorithms. We then obtain the similarity ranking by comparing $E_q$ with all $E \in \mathcal{K}$. This step is shown in the lower box in
Each frame of a video is a still image whose complexity and size may differ from those of other videos. Therefore, we need a universal and concise expression, or a descriptor $x_t$, for frame $v_t$. In our system, we support two types of such descriptors: the Color Structure Descriptor (CSD) [
The CSD of MPEG-7 is a standard color histogram obtained using a small sliding window. In our system, we first quantize the color information of the image with $\{12, 8, 8\}$ levels in the HSV (Hue-Saturation-Value) color space. Then, we use a window of size $8 \times 8$ pixels. The window fills $12 \times 8 \times 8 = 768$ color bins to make the histogram while sliding across the image pixel by pixel. A simple example of this method is illustrated in
The final histogram is normalized to unity. Therefore, each video frame $v_t$ generates a nonnegative real-valued vector $x_t$ whose elements sum to 1. This vector has 768 dimensions and resides on a 767-dimensional simplex.
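For concreteness, the following is a minimal sketch of this CSD computation under the description above; the function name csd_descriptor and the use of OpenCV's 8-bit HSV convention are our own choices, not part of the MPEG-7 reference implementation.

```python
import numpy as np
import cv2

def csd_descriptor(frame_bgr: np.ndarray) -> np.ndarray:
    """Sketch of the CSD: an 8x8 window slides over the HSV-quantized
    image, and every color present in the window fills one of the
    12*8*8 = 768 bins. The histogram is finally normalized to unity."""
    hsv = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2HSV)  # 8-bit: H in [0,180), S and V in [0,256)
    h = (hsv[..., 0].astype(int) * 12) // 180         # quantize hue to 12 levels
    s = (hsv[..., 1].astype(int) * 8) // 256          # saturation to 8 levels
    v = (hsv[..., 2].astype(int) * 8) // 256          # value to 8 levels
    bins = (h * 8 + s) * 8 + v                        # one bin index per pixel, 0..767
    hist = np.zeros(768)
    H, W = bins.shape
    for i in range(H - 7):                            # slide the 8x8 window pixel by pixel
        for j in range(W - 7):
            # each color appearing in the window increments its bin once
            hist[np.unique(bins[i:i + 8, j:j + 8])] += 1
    return hist / hist.sum()                          # nonnegative, sums to 1
```

The direct double loop serves only to make the window semantics explicit; a production implementation would update the window contents incrementally.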
The Video Signature Tools is an ISO/IEC standard for multimedia content [
For the Frame Signature, we first resample each frame image $v_t$ to $32 \times 32 = 1024$ sub-blocks by dividing the width and height of the image into 32 parts. Each sub-block is assigned the average value of the luminance of the pixels in the block. The luminance is the Y component of YCbCr color, which is computed from RGB components [
$$Y = [0.299,\ 0.587,\ 0.114]\,[\mathrm{Red},\ \mathrm{Green},\ \mathrm{Blue}]^{\mathsf{T}} \qquad (1)$$
Note that the range of an RGB component is $[0, 1]$, and the value of the $Y$ component is quantized to 256 levels. Thus, we obtain a monochrome video $\hat{V} = \{\hat{v}_t\}_{t=1}^{T}$, where $\hat{v}_t$ contains the 1024 luminance values of the sub-blocks obtained by Equation (1).
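A sketch of this resampling follows; the helper name luminance_grid is ours, and the standard's exact rounding rules may differ.

```python
import numpy as np

def luminance_grid(frame_rgb: np.ndarray) -> np.ndarray:
    """Divide a frame into 32x32 sub-blocks and return the average
    luminance of Equation (1) per block, quantized to 256 levels."""
    rgb = frame_rgb.astype(np.float64) / 255.0          # RGB components in [0, 1]
    y = rgb @ np.array([0.299, 0.587, 0.114])           # Equation (1), per pixel
    H, W = y.shape
    hs, ws = H // 32, W // 32                           # sub-block size (remainder cropped)
    blocks = y[:hs * 32, :ws * 32].reshape(32, hs, 32, ws)
    v_hat = blocks.mean(axis=(1, 3))                    # 32x32 block averages
    return np.clip((v_hat * 256).astype(int), 0, 255)   # quantize to 256 levels
```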
Second, we generate a descriptor $X$ from $\hat{V}$. The Frame Signature algorithm provides a 380-dimensional vector $x_t$ in which every element $x_{t,d}$ takes a ternary value in $\{0, 1, 2\}$ according to $\hat{v}_t$. For the $d$-th element $x_{t,d}$ of $x_t$, the algorithm determines the value using the average luminance of a sub-region, which is a region composed of several sub-blocks. A sub-region is linked to a single dimension $d$ of $x_t$, and the composition of the sub-blocks in each sub-region is defined by the standard in [
・ For the first 32 dimensions $d = 1, \dots, 32$, $x_{t,d}$ is computed using the corresponding $y_d$ as follows.
$$x_{t,d} = \begin{cases} 2, & \text{if } y_d - 128 > \theta_{\text{type}} \\ 1, & \text{if } |y_d - 128| \le \theta_{\text{type}} \\ 0, & \text{if } y_d - 128 < -\theta_{\text{type}} \end{cases} \qquad (2)$$
Threshold $\theta_{\text{type}}$ is determined by the $y_d$ values across some of the dimensions [
・ For the remaining dimensions $d = 33, \dots, 380$, the computation of $x_{t,d}$ compares two non-overlapping sub-regions $y_{d,1}$ and $y_{d,2}$ using threshold $\theta_{\text{type}}$, as defined in [
$$x_{t,d} = \begin{cases} 2, & \text{if } y_{d,1} - y_{d,2} > \theta_{\text{type}} \\ 1, & \text{if } |y_{d,1} - y_{d,2}| \le \theta_{\text{type}} \\ 0, & \text{if } y_{d,1} - y_{d,2} < -\theta_{\text{type}} \end{cases} \qquad (3)$$
Finally, we obtain each $x_t$, which is a vector in $\{0, 1, 2\}^{380}$. The above steps are applied to each frame to obtain the final descriptor $X$ of video $V$ [
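The ternarization of Equations (2) and (3) can be sketched as follows. The actual sub-region compositions and per-type thresholds are tabulated in the ISO/IEC standard; here they appear as stand-in parameters.

```python
import numpy as np

def ternarize(y: np.ndarray, regions_a, regions_b, theta: float) -> np.ndarray:
    """y: the flattened 32x32 = 1024 sub-block luminances.
    regions_a[d], regions_b[d]: index lists of the sub-blocks forming the
    sub-regions compared at dimension d; regions_b[d] is None for the first
    32 one-region dimensions, which compare y_d against 128 (Equation (2))."""
    x = np.empty(len(regions_a), dtype=np.uint8)
    for d, (ra, rb) in enumerate(zip(regions_a, regions_b)):
        y1 = y[ra].mean()
        diff = (y1 - 128.0) if rb is None else (y1 - y[rb].mean())
        if diff > theta:
            x[d] = 2          # first case of Equations (2) and (3)
        elif diff < -theta:
            x[d] = 0          # third case
        else:
            x[d] = 1          # |diff| <= theta
    return x                  # in {0,1,2}^380 when 380 sub-region pairs are given
```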
In this section, we explain our machine learning methods that find exemplars and their influence regions from the frame descriptor sequence $X$. The influence region is the span of follower frames that reside on both sides of an exemplar. We obtain a set of exemplars automatically by machine learning. Five clustering algorithms were considered for this system. They are described as follows.
This method is based on affinity propagation [
We consider two competitive learning methods. The first type is the harmonic competition method [
First, we divide $V = \{v_t\}_{t=1}^{T}$ into non-overlapping frame sets along the time axis using a fixed length $b$. Thus, we have a total of $\lceil T/b \rceil$ blocks. In each block, a set of centroids is computed by the k-means method, i.e., vector quantization. These centroids are pseudo exemplars because they are not actual video frames. Therefore, we find the frame nearest to each pseudo exemplar; the set of such frames comprises the exemplars.
In our preliminary experiments, the TPKM method often generates large clusters that contain elements over a wide range of times. This case occurs when block length $b$ is large. In such a case, it is desirable to split clusters containing elements that are distant in time. Except for this splitting mechanism, the rest of this method is the same as TPKM.
Our preliminary experiments showed that TPKM and TSKM are faster than TBAP by two orders of magnitude. However, TBAP still has a theoretical merit: exemplars can be obtained without computing centroids as pseudo exemplars. Therefore, we consider k-means-style methods that can identify exemplars directly. For this reason, we omit further details regarding TPKM and TSKM; they can be found in [
Next, we pay attention to an approximate k-means method called the pairwise nearest neighbor vector quantization (PNN-VQ, or simply PNN) [
Step 1: Set δ to the desired (non-negative) minimum distance between exemplars.
Step 2: Compute the centroid of all data points, say c.
Step 3: Find the two points that are the closest in distance. This is the nearest neighbor pair.
Step 4: If the distance is equal to or less than threshold δ, then go to Step 5; otherwise, go to Step 6.
Step 5: Remove the member of the pair that is farther from centroid c, and return to Step 3. Note that the original PNN removes both points but inserts their centroid.
Step 6: Output the remaining data points as exemplars.
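A minimal sketch of this modified PNN, assuming Euclidean distance between descriptors (the function name select_exemplars is ours):

```python
import numpy as np

def select_exemplars(points: np.ndarray, delta: float) -> np.ndarray:
    """points: (T, D) array of frame descriptors. Returns the indices of
    the data points that survive as exemplars."""
    idx = list(range(len(points)))
    c = points.mean(axis=0)                            # Step 2: centroid of all points
    while len(idx) > 1:
        sub = points[idx]
        d = np.linalg.norm(sub[:, None] - sub[None, :], axis=-1)
        np.fill_diagonal(d, np.inf)
        i, j = np.unravel_index(np.argmin(d), d.shape) # Step 3: nearest pair
        if d[i, j] > delta:                            # Step 4: every pair now exceeds delta
            break
        # Step 5: remove the member of the pair farther from c, then repeat
        far = i if np.linalg.norm(sub[i] - c) > np.linalg.norm(sub[j] - c) else j
        idx.pop(far)
    return np.array(idx)                               # Step 6: remaining points are exemplars
```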
Based on our modified PNN, we propose the following learning algorithms to obtain the feature of a video.
The TB-PNN method is almost the same as the modified PNN above but uses a different method to find the closest pair. The modified PNN measures the distances between all data points to find the closest pair, whereas TB-PNN measures distances only between a data point and its surrounding ones along the time axis. This mechanism reduces the range over which the distance between data points must be measured. Consequently, an exemplar will not possess data points distant in time. Thus, we can find exemplars while taking the temporal order into account. The length of the bounding range is a design parameter that must be given.
The TP-PNN algorithm is similar to TPKM, but it utilizes our modified PNN instead of k-means. First, we divide $V = \{v_t\}_{t=1}^{T}$ into non-overlapping frame sets along the time axis by a fixed length $b$. Thus, there are $\lceil T/b \rceil$ blocks. Then, for each block, exemplars are obtained using the modified PNN.
In this work, we chose TP-PNN for our final content-based video retrieval system because of the results of our preliminary experiments. We hence present more details of TP-PNN here.
Step 1: Divide the descriptor sequence $X$ into $\lceil T/b \rceil$ blocks.
Step 2: For each partition:
Use the modified PNN to obtain the exemplar(s). Although each exemplar can be any of $\{v_t\}_{t=1}^{T}$, we denote the descriptor that appears as the $n$-th exemplar along the time axis in $X$ by $e_n$.
Meanwhile, we determine the number of data points following each exemplar. The followers removed at Step 5 of the modified PNN indicate the time range possessed by an exemplar. The method is as follows: if we leave $e_n$ and remove
Step 3: Output all the results acquired from all partitions in Step 2 as the feature $E = \{e_n, E_n\}_{n=1}^{N}$.
The output $E$ is the feature of the video. In actual cases, the total number of exemplars $N$ varies because of threshold δ of the modified PNN. The data size of a video is clearly reduced by TP-PNN; therefore, TP-PNN works as a data compressor.
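The following sketch of TP-PNN reuses the select_exemplars function from the previous sketch. The follower bookkeeping here simply assigns every frame of a block to its nearest surviving exemplar, which may differ in detail from the removal-based counting of Step 2.

```python
import numpy as np

def tp_pnn(X: np.ndarray, b: int, delta: float):
    """X: (T, D) descriptor sequence; b: block length; delta: PNN threshold.
    Returns the feature E as a list of (exemplar index, follower count) pairs."""
    feature = []
    for start in range(0, len(X), b):                  # ceil(T/b) non-overlapping blocks
        block = X[start:start + b]
        ex = select_exemplars(block, delta)            # exemplars within this block
        # count followers: each frame follows its nearest exemplar
        d = np.linalg.norm(block[:, None] - block[ex][None, :], axis=-1)
        counts = np.bincount(d.argmin(axis=1), minlength=len(ex))
        feature.extend((start + int(e), int(n)) for e, n in zip(ex, counts))
    return feature                                     # E = {(e_n, E_n)}, n = 1..N
```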
To find similar videos, we need a method to compare the exemplars. In addition, the comparison methods should reflect the number of followers of each exemplar. Therefore, we present a set of methods that extend the sequence alignment method. Sequence alignment is an algorithm that discovers the best matching pattern of two sequences and provides its degree of fitness as their similarity. We refer to the degree of fitness as the matching score, or simply the score. Our methods are based on the Levenshtein distance (L-distance) [
Sequence alignment algorithms can be computed once the similarity between any two elements of the two sequences is given. In our similar video retrieval system, a sequence is an array of exemplars
Because we use a distance measure in the exemplar selection, we can formulate the similarity measure using that distance measure.
In practical situations, the lengths of the sequences may differ. To align such sequences, we make their lengths even using padding. Furthermore, if a similar or identical sub-sequence pattern in one sequence is not found in the other, we insert padding that indicates there is no matching pattern. In sequence alignment, such padding is called a gap. A gap is interpreted as a special element among the exemplars. The similarity between a gap and a normal element is usually a negative constant value, because inserting gaps should reduce the total similarity of the sequences. For this reason, it is referred to as a gap penalty.
In the next two sections, we describe the two alignment M-distance algorithms. For clarity, we provide the notation here. The symbols are also described in Section 2. We compare two videos A and B. Each video consists of a sequence of frame images
The global alignment M-distance computes the similarity between videos A and B over their whole lengths through features
Step 1: Prepare an
Step 2: Set
Step 3: Fill the first row by
Step 4: Fill the first column by
Step 5: Starting from the position
Here, r is a coefficient of similarity that reflects the magnitude of the followers, and s is the similarity between the exemplars mentioned above. While filling the table, we insert an arrow as a pointer from position
Step 6: (Tracebacking) Traceback from the end of cells at
・ A diagonally upward arrow
・ The horizontal arrow ← from
・ The upward arrow from
Step 7: (Normalizing the global alignment score) Normalize the total score
where w is an averaging function.
Step 8: Output the best matching pattern found by the traceback in Step 6 and the score
In our experiments, the following function was chosen as
In function
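Because several of the defining equations above did not survive extraction, the following is only a structural sketch of this global alignment: a Needleman-Wunsch-style table fill in which the match score is weighted by a follower coefficient r, and normalization by max(N_a, N_b) stands in for the unspecified averaging function w. The gap penalty g and the functions r and s are assumptions, and the traceback of Step 6 is omitted for brevity.

```python
import numpy as np

def global_m_distance(Ea, Eb, s, r, g=-1.0):
    """Ea, Eb: lists of (exemplar, follower count) pairs for videos A and B.
    s(ea, eb): exemplar similarity; r(na, nb): follower coefficient;
    g: gap penalty. Returns the normalized global alignment score."""
    Na, Nb = len(Ea), len(Eb)
    F = np.zeros((Na + 1, Nb + 1))
    F[:, 0] = g * np.arange(Na + 1)                    # Steps 3-4: first column and row
    F[0, :] = g * np.arange(Nb + 1)
    for i in range(1, Na + 1):                         # Step 5: fill the table
        for j in range(1, Nb + 1):
            (ea, na), (eb, nb) = Ea[i - 1], Eb[j - 1]
            match = F[i - 1, j - 1] + r(na, nb) * s(ea, eb)
            F[i, j] = max(match, F[i - 1, j] + g, F[i, j - 1] + g)
    return F[Na, Nb] / max(Na, Nb)                     # Step 7: assumed normalization w
```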
In contrast to the global matching, the aim of the local alignment M-distance is to find matching local time spans in
Step 1: Prepare an
Step 2: Fill all the cells in the table with 0.
Step 3: Starting from position
In this step, we draw an arrow as a pointer from position
Step 4: Identify the cell that yields the maximum value
Step 5: (Tracebacking) Traceback along the arrows from the cell with
Step 6: Output the best matching local span found by the traceback in Step 5. Score
In this example, the traceback in Step 5 starts at
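Under the same assumptions as the global sketch above, the local variant clamps every cell at zero and reads the score off the maximum cell, in the Smith-Waterman style; the traceback of Steps 5 and 6 is again omitted.

```python
import numpy as np

def local_m_distance(Ea, Eb, s, r, g=-1.0):
    """Local alignment M-distance sketch; arguments as in global_m_distance."""
    Na, Nb = len(Ea), len(Eb)
    F = np.zeros((Na + 1, Nb + 1))                     # Step 2: zero initialization
    for i in range(1, Na + 1):                         # Step 3: fill the table
        for j in range(1, Nb + 1):
            (ea, na), (eb, nb) = Ea[i - 1], Eb[j - 1]
            match = F[i - 1, j - 1] + r(na, nb) * s(ea, eb)
            F[i, j] = max(0.0, match, F[i - 1, j] + g, F[i, j - 1] + g)
    return F.max()                                     # Step 4: score of the best local span
```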
We prepared 2100 videos to form 100 sets of 21 videos from [
・ The video resolution is
・ The frame rate is 30 frames per second (fps).
・ Their lengths range from 30 to 180 seconds. That is, we retrieved videos of different lengths.
From each set of 21 videos, we selected one video at random as the query. Then, we chose a portion comprising more than 10% of the query. This part was modified by the following five methods.
1) The frame rate was sped up by removing one frame out of every 6, 3, or 2 frames.
2) The video was changed to monochromatic scenes using gray-scale transformation.
Note that the coefficients in the above equation are equivalent to those of the Y component of the YCbCr color space, except that their values have higher precision than in Equation (1).
3) The brightness of the RGB components was multiplied by a factor chosen randomly from 0.6 to 0.9.
4) The size was randomly changed by a factor of 0.5 to 2.0.
5) The JPEG compression quality was changed using OpenCV, which is an open source image processing library. Specifically, we set the CV_IMWRITE_JPEG_QUALITY parameter to a randomly chosen integer value between 20 and 80 for each video. Its default and maximum values are 95 and 100, respectively.
The videos modified by the above five types were randomly inserted into the remaining 20 videos. Thus, these target videos included plagiarized material.
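As an illustration of modification 5), the following sketch re-encodes a single frame at a random JPEG quality with OpenCV; IMWRITE_JPEG_QUALITY is the current name of the CV_IMWRITE_JPEG_QUALITY flag mentioned above, and the function name degrade_jpeg is ours.

```python
import random
import cv2

def degrade_jpeg(frame, q_low=20, q_high=80):
    """Re-encode a BGR frame as JPEG at a random quality in [q_low, q_high]."""
    q = random.randint(q_low, q_high)                  # default quality is 95, maximum 100
    ok, buf = cv2.imencode(".jpg", frame, [cv2.IMWRITE_JPEG_QUALITY, q])
    assert ok
    return cv2.imdecode(buf, cv2.IMREAD_COLOR)         # decoded, quality-degraded frame
```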
In plagiarism detection, the local alignment method of Section 5.3 is appropriate because the global alignment of Section 5.2 absorbs local characteristics as much as possible. To evaluate the set of pseudo-illegal videos prepared in Section 6.1, we used the following steps.
Step 1: A query video q of length
Step 2: A feature
Step 3: The local alignment M-distance described in Section 5.3 is computed for the pre-computed exemplar sets in the video database as labels.
Step 4: The similarity ranking is obtained from the M-distance scores.
Note that, for other experiments in which we find whole video properties, we need to use the global alignment M-distance of Section 5.2.
We use precision-recall curves to evaluate our content-based similar video retrieval system. Recall and precision are defined as follows.
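Since the original formulas were lost in extraction, we restate the standard information-retrieval definitions here:

$$\text{Recall} = \frac{|\{\text{relevant videos}\} \cap \{\text{retrieved videos}\}|}{|\{\text{relevant videos}\}|}, \qquad \text{Precision} = \frac{|\{\text{relevant videos}\} \cap \{\text{retrieved videos}\}|}{|\{\text{retrieved videos}\}|}$$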
For precision, we use the 11-point interpolated precision because our system outputs a ranked result for the plagiarism detection task. Interpolated precision is widely used for ranked results in information retrieval. Let
Here, r is one of the 11 recall points $0.0, 0.1, \dots, 1.0$.
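Assuming the usual 11-point scheme, the interpolated precision at recall level $r$ is the maximum precision attained at any recall $r' \ge r$:

$$p_{\text{interp}}(r) = \max_{r' \ge r} p(r'), \qquad r \in \{0.0, 0.1, \dots, 1.0\}$$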
We designed a graphical user interface (GUI) to facilitate the experiments on similar video retrieval.
selection method, and alignment method. The parameter settings for the learning and alignment method are specified here. The middle portion of the window shows thumbnails of the videos ordered by similarity scores. The bottom part of the window shows frames that correspond to the exemplar sets of the query video and the selected video. In this illustration, four gaps are inserted on the query side by the M-distance computation.
We conducted experiments on plagiarism detection using the local alignment M-distance. The data were generated by the method described in Section 6.1. There are two sets of experiments depending on the descriptors: CSD or Frame Signature. For these experiments, we used the following computational resources.
1) A conventional standalone machine consisting of two Intel Xeon 2.10 GHz CPUs and 64 GB RAM: This machine was used for the GUI of
2) Eight virtual machines provided by Amazon Web Services (AWS), each equivalent to an Intel Xeon 2.8 GHz CPU with 32 GB RAM: These machines were used to compute the CSD, TBAP, TPKM, and TB-PNN.
First, we show the results yielded by the conventional CSD descriptor. We divided the HSV color space into $\{12, 8, 8\}$ levels, as described in Section 3.
・ The CSD descriptor shows sufficient performance with respect to frame rate and size changes. This performance is due to the ability to detect scattered color pixels.
・ For the JPEG quality changes, the CSD descriptor is appropriate only when the top several recall points are considered.
・ The CSD method is susceptible to changes in brightness.
・ If the video is changed to grayscale, the CSD video descriptor is not suitable.
The performance of CSD shows that this video descriptor does not capture much information about the frames.
When the descriptor is the Frame Signature, we need not normalize the obtained feature vectors. Therefore, we used a distance measure of
・ All precision-recall curves show that the Frame Signature is a creditable descriptor for content-based video detection.
・ The precision-recall curve for the frame-rate change degrades more than those for the other changes in the high-recall region. This performance degradation is due to excessively fast forwarding, which is similar to human behavior. However, choosing a smaller δ can address this problem.
・ Because the computation of the Frame Signature is based on integers, it is two orders of magnitude faster than the floating-point computation of the CSD vectors. This property is a great advantage in addition to the retrieval precision.
The content-based similar video retrieval system presented in this paper showed high performance, as evidenced by the experimental observations. The Frame Signature, standardized by ISO/IEC, is well matched as the frame descriptor for our content-based video retrieval system. The exemplar set extracted by our machine learning algorithms from each video can function as its numerical label. By attaching such numerical labels to each video, we can structure massive amounts of video data effectively. These labels have a wide range of applications; the detection of inserted plagiarized images in our experiments is one such application. The importance of the choice of descriptor was revealed by the experiments of
We have a longstanding ambition in the direction opposite to similar video retrieval: the machine generation of artificial videos from descriptors. In the case of still image generation by Generative Adversarial Nets [
Mr. Masafumi Moriwaki receives special gratitude for his initial contribution, together with the last author, to the invention of the M-distance. The authors also thank Dr. Ryota Yokote and Messrs. Shota Ninomiya, Akihiro Shikano, and Hiromichi Iwase for their ideas and help in developing the total system. This paper is part of the outcome of research performed under a Waseda University Grant for Special Research Projects (Project number: 2016K-176). In addition, this work was supported in part by the Japan Society for the Promotion of Science through Grants-in-Aid for Scientific Research (C) (17K00135).
Horie, T., Uchida, M. and Matsuyama, Y. (2018) Similar Video Retrieval via Order-Aware Exemplars and Alignment. Journal of Signal and Information Processing, 9, 73-91. https://doi.org/10.4236/jsip.2018.92005