Cite this as: Santhoshkumar SP, Kumar MP, Beaulah HL (2021) Visual experience recognition using adaptive support vector machine. Trends Comput Sci Inf Technol 6(3): 072-076. DOI: 10.17352/tcsit.000043
Copyright© 2021 Santhoshkumar SP, et al. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Video carries more information than isolated images, so processing, analyzing, and understanding video content is becoming increasingly important. Consumer videos are generally captured by amateurs using handheld cameras and contain considerable camera motion, occlusion, cluttered backgrounds, and large intraclass variations within the same type of event, making their visual cues highly variable and less discriminative. Visual event recognition is therefore an extremely challenging task in computer vision. A visual event recognition framework for consumer videos is built by leveraging a large amount of loosely labeled web videos. The videos are divided into training and testing sets manually. A simple method called Aligned Space-Time Pyramid Matching is proposed to effectively measure the distance between two video clips from different domains; each video is divided into space-time volumes over multiple levels. A new transfer learning method, referred to as Adaptive Multiple Kernel Learning, fuses the information from multiple pyramid levels and features and copes with the considerable variation in feature distributions between videos from the two domains, the web video domain and the consumer video domain. With the help of MATLAB Simulink, videos are divided and compared with web domain videos. The inputs are taken from the Kodak data set, and the results are given in the form of MATLAB simulations.
In the past few years, computer vision researchers have witnessed a surge of interest in human action analysis through videos. With the rapid adoption of digital cameras and mobile phone cameras, visual event recognition in personal videos produced by consumers has become an important research topic due to its usefulness in automatic video retrieval and indexing. Event recognition from visual cues is a challenging task because of complex motion, cluttered backgrounds, occlusions, and geometric and photometric variances of objects. Previous work on video event recognition can be roughly classified as either activity recognition or abnormal event recognition. First, a large corpus of training data is collected, in which the concept labels are generally obtained through expensive human annotation. Next, robust classifiers, also called models or concept detectors, are learned from the training data. Finally, the classifiers are used to detect the presence of the concepts in any test data. When sufficient and strongly labeled training samples are provided, these event recognition methods achieve promising results. However, it is well known that classifiers learned from a limited number of labeled training samples are usually not robust and do not generalize well. This project proposes a new event recognition framework for consumer videos by leveraging a large number of loosely labeled YouTube videos. A large number of loosely labeled YouTube videos can be readily obtained by keyword-based search. YouTube videos are downsampled and compressed by the web server, so their quality is generally lower than that of consumer videos. YouTube videos may also have been selected and edited to attract attention, while consumer videos are in their naturally captured state. Figure 1 shows four frames from two events, picnic and sports, as examples to illustrate the considerable appearance differences between consumer videos and YouTube videos.
Therefore, the feature distributions of samples from the two domains (the web video domain and the consumer video domain) may differ considerably in terms of statistical properties such as the mean, intraclass variance, and interclass variance.
The event recognition framework extends recent work on pyramid matching and presents a new matching method called Aligned Space-Time Pyramid Matching to effectively measure the distance between two video clips that may be from different domains. Each video is divided into space-time volumes over multiple levels, the pairwise distances between any two volumes are calculated, and the information from different volumes is then integrated with integer-flow Earth Mover's Distance to explicitly align the volumes. The Earth Mover's Distance (EMD) is a method to evaluate dissimilarity between two multi-dimensional distributions in some feature space. Given a ground distance between individual features, the EMD lifts this distance from individual features to full distributions.
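To make the idea of lifting a ground distance to whole distributions concrete, the EMD can be sketched as a small linear program. The following is a minimal illustration, not the authors' MATLAB implementation: the function name `emd`, its arguments, and the use of SciPy's `linprog` are assumptions made here for the example.

```python
import numpy as np
from scipy.optimize import linprog


def emd(weights_p, weights_q, ground_dist):
    """Earth Mover's Distance between two weighted sets of features.

    weights_p[u] and weights_q[v] are the masses of the features, and
    ground_dist[u, v] is the ground distance between feature u and feature v.
    Returns the EMD value and the optimal flow matrix.
    """
    p = np.asarray(weights_p, dtype=float)
    q = np.asarray(weights_q, dtype=float)
    d = np.asarray(ground_dist, dtype=float)
    n_p, n_q = d.shape

    # Flow variables f[u, v] are flattened row-major into one vector.
    # Each source u may ship at most p[u]; each sink v may receive at most q[v].
    row_caps = np.kron(np.eye(n_p), np.ones((1, n_q)))
    col_caps = np.kron(np.ones((1, n_p)), np.eye(n_q))
    a_ub = np.vstack([row_caps, col_caps])
    b_ub = np.concatenate([p, q])

    # The total flow must equal the smaller of the two total masses.
    a_eq = np.ones((1, n_p * n_q))
    b_eq = [min(p.sum(), q.sum())]

    res = linprog(d.ravel(), A_ub=a_ub, b_ub=b_ub,
                  A_eq=a_eq, b_eq=b_eq, bounds=(0, None))
    if not res.success:
        raise RuntimeError(res.message)

    flow = res.x.reshape(n_p, n_q)
    return float((flow * d).sum() / flow.sum()), flow


# Two uniform distributions over points on a line: {0, 1} versus {1, 2}.
points_p = np.array([0.0, 1.0])
points_q = np.array([1.0, 2.0])
ground = np.abs(points_p[:, None] - points_q[None, :])
distance, flow = emd([0.5, 0.5], [0.5, 0.5], ground)
```

In this toy case every unit of mass has to move one step to the right on average, so the returned distance reflects the shift between the two distributions rather than any single pair of points.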
One related technique uses local space-time features to classify six human actions (walk, jog, run, wave, clap, and box) in challenging real-world video sequences. This technique achieves comparable performance in the presence of camera motion, scale variation, and viewpoint changes. The factors that hinder the use of 2D local descriptors for object detection in static images also affect spatiotemporal local descriptors. A cross-domain learning method, referred to as Adaptive Multiple Kernel Learning (A-MKL), is proposed to cope with the considerable variation in feature distributions between videos from the web domain and the consumer domain. For each pyramid level and each type of local feature, a set of Adaptive SVM classifiers is trained on a combined training set from the two domains by using multiple base kernels of different kernel types and parameters; these classifiers are further fused with equal weights to obtain an average classifier. A new objective function is then used to learn an adapted classifier based on multiple base kernels and the learned average classifiers by minimizing both the structural risk functional and the mismatch of data distributions between the two domains.
Event recognition methods can be roughly categorized into model-based methods and appearance-based methods. Model-based approaches relied on various models, including HMM, coupled HMM, and Dynamic Bayesian Network, to model the temporal evolution. Appearance-based approaches employed space-time features extracted from salient regions with significant local variations in both spatial and temporal dimensions [2-4].
Statistical learning methods, including Support Vector Machine (SVM), probabilistic Latent Semantic Analysis (pLSA), and Boosting, were applied to the space-time features to obtain the final classification. Promising results [3,4,6,7] have been reported on video data sets captured under controlled settings, such as the Weizmann and KTH data sets. Classifier adaptation can be seen as an effort to solve the fundamental problem of mismatched distributions between the training and testing data. This problem occurs in concept detection in a video corpus such as TRECVID, which contains data from different source programs. In existing approaches [8-10], concept classifiers are built from and applied to data collected from all the programs without considering their difference in distribution. In this paper, a different scenario where classifiers trained from one or several.
The proposed classifier adaptation method is related to the work on drifting concept detection in the data mining community and to transfer learning and incremental learning in the machine learning community. Incremental learning methods, such as incremental SVMs [6,11], continuously update a model with new examples without re-training over all the examples. When the training and test distributions are identical, A-SVMs can be treated as a generic incremental method that can handle classifiers of any type. It is also more efficient than existing methods [6,11], whose training involves at least part of the previous examples (the support vectors).
Spatial pyramid matching and its space-time extension used fixed block-to-block matching and fixed volume-to-volume matching. In contrast, this aligned pyramid matching extends the methods of Spatially Aligned Pyramid Matching (SAPM) and Temporally Aligned Pyramid Matching (TAPM) from either the spatial domain or the temporal domain to the joint space-time domain, where volumes across different space and time locations may be matched.
Following prior work, each video clip is divided into $8^l$ non-overlapping space-time volumes over multiple levels, $l = 0, \ldots, L-1$, where the volume size is set as $1/2^l$ of the original video in width, height, and temporal dimension. Local space-time (ST) features, including Histograms of Oriented Gradient (HoG) and Histograms of Optical Flow (HoF), are extracted and concatenated to form long feature vectors. Each video clip is also sampled to extract image frames, from which static local SIFT features are extracted. The method consists of two matching stages. In the first matching stage, the pairwise distance $D_{rc}$ is calculated between each pair of space-time volumes $V_i(r)$ and $V_j(c)$, where $r, c = 1, \ldots, R$, with $R$ being the total number of volumes in a video. The space-time features are vector-quantized into visual words, and each space-time volume is then represented as a token-frequency feature. Equation (1) is used to measure the distance $D_{rc}$. Note that each space-time volume consists of a set of image blocks.
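The pyramid partition described above can be sketched in a few lines. This is an illustrative sketch only: the function name `space_time_volumes` and the (frames, height, width) array layout are assumptions made for the example, not details taken from the paper.

```python
import numpy as np


def space_time_volumes(video, level):
    """Split a video into 8**level non-overlapping space-time volumes.

    video is an array shaped (frames, height, width) or
    (frames, height, width, channels). At the given level, each of the
    temporal, vertical, and horizontal axes is cut into 2**level parts.
    """
    parts = 2 ** level
    # Integer cut points along each of the three partitioned axes.
    t_edges = np.linspace(0, video.shape[0], parts + 1).astype(int)
    y_edges = np.linspace(0, video.shape[1], parts + 1).astype(int)
    x_edges = np.linspace(0, video.shape[2], parts + 1).astype(int)

    volumes = []
    for t in range(parts):
        for y in range(parts):
            for x in range(parts):
                volumes.append(video[t_edges[t]:t_edges[t + 1],
                                     y_edges[y]:y_edges[y + 1],
                                     x_edges[x]:x_edges[x + 1]])
    return volumes


# A dummy clip of 16 frames at 32 x 48 pixels, partitioned at levels 0 and 1.
clip = np.zeros((16, 32, 48))
level0 = space_time_volumes(clip, 0)  # the whole clip as a single volume
level1 = space_time_volumes(clip, 1)  # eight volumes of half size per axis
```

At level 1 each volume covers half of the clip along every axis, which corresponds to the $1/2^l$ volume size stated in the text.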
Token-frequency (tf) features are extracted from each image block by vector-quantizing the corresponding SIFT features into visual words. Based on the SIFT features, the pairwise distance $D_{rc}$ between two volumes $V_i(r)$ and $V_j(c)$ is calculated by using the Earth Mover's Distance (EMD),

$$D_{rc} = \frac{\sum_{u=1}^{H} \sum_{v=1}^{I} f_{uv}\, d_{uv}}{\sum_{u=1}^{H} \sum_{v=1}^{I} f_{uv}},$$

where $H$ and $I$ are the numbers of image blocks in $V_i(r)$ and $V_j(c)$, respectively, $d_{uv}$ is the distance between two image blocks (Euclidean distance is used in this work), and $f_{uv}$ is the optimal flow, which can be obtained by solving the linear programming problem as follows:
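The linear program itself is not reproduced in this text. In the standard EMD formulation it can be written as below, under the assumption (made here, not stated above) that every image block carries a uniform weight of $1/H$ in $V_i(r)$ and $1/I$ in $V_j(c)$:

```latex
\begin{aligned}
\min_{f_{uv} \ge 0} \quad & \sum_{u=1}^{H} \sum_{v=1}^{I} f_{uv}\, d_{uv} \\
\text{s.t.} \quad & \sum_{u=1}^{H} \sum_{v=1}^{I} f_{uv} = 1, \\
& \sum_{v=1}^{I} f_{uv} \le \frac{1}{H} \quad \forall u, \qquad
  \sum_{u=1}^{H} f_{uv} \le \frac{1}{I} \quad \forall v .
\end{aligned}
```

The first constraint fixes the total amount of flow, and the two inequality constraints prevent any block from sending or receiving more than its own weight.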
In the second stage, the information from different volumes is further integrated with integer-flow EMD to explicitly align the volumes. The goal is to solve for a flow matrix containing binary elements that represent unique matches between volumes $V_i(r)$ and $V_j(c)$. Such a binary solution can be conveniently computed by using the standard Simplex method for linear programming.
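The second stage can be illustrated with a small sketch. Because a binary flow matrix with unique matches is a one-to-one assignment, the example below swaps in SciPy's assignment solver (`linear_sum_assignment`) in place of the Simplex method mentioned above; both yield a minimum-cost one-to-one matching. The function name `align_volumes` and the toy distance matrix are assumptions made for the example.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment


def align_volumes(pairwise_dist):
    """Match each volume of one video to exactly one volume of the other.

    pairwise_dist[r, c] is the first-stage distance between volume r of
    one video and volume c of the other. Returns the video-level distance
    (the mean distance over matched pairs) and the binary flow matrix.
    """
    dist = np.asarray(pairwise_dist, dtype=float)
    rows, cols = linear_sum_assignment(dist)  # minimum-cost unique matches
    flow = np.zeros_like(dist)
    flow[rows, cols] = 1.0
    return float((flow * dist).sum() / flow.sum()), flow


# Toy first-stage distances between the volumes of two videos (R = 2).
toy_dist = np.array([[4.0, 1.0],
                     [2.0, 3.0]])
video_distance, flow_matrix = align_volumes(toy_dist)
```

Here volume 1 of the first video is matched to volume 2 of the second and vice versa, since that pairing has a lower total cost than matching volumes in the same position, which is exactly the freedom that fixed volume-to-volume matching lacks.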
Learning: The proposed framework makes three contributions:
A visual event recognition framework for consumer videos with only a limited number of labeled consumer videos by leveraging a large amount of loosely labeled web videos.
An extension of pyramid matching, called Aligned Space-Time Pyramid Matching (ASTPM), that effectively measures the distances between two video clips.
A cross-domain learning method, Adaptive Multiple Kernel Learning (A-MKL), is used to cope with the considerable variation in feature distributions between videos from the web video domain and consumer video domain by minimizing both the structural risk functional and mismatch of data distributions from two domains.
The web video domain is taken as the auxiliary domain $D^A$ (the source domain) and the consumer video domain as the target domain $D^T$, with $D^T = D^T_l \cup D^T_u$, where $D^T_l$ and $D^T_u$ represent the labeled and unlabeled data in the target domain. Transfer learning methods, also called domain adaptation or cross-domain learning methods, have been proposed for many applications. To take advantage of all labeled patterns from both auxiliary and target domains, previous work proposed Feature Replication (FR), which uses augmented features for SVM training. In Adaptive SVM (A-SVM), the target classifier is adapted from an existing classifier, the auxiliary classifier, trained on the samples from the auxiliary domain. Figure 2 illustrates event recognition for consumer videos by leveraging a large number of loosely labeled YouTube videos.
Each video is divided into $8^l$ non-overlapping space-time volumes over multiple levels, $l = 0, \ldots, L-1$, where the volume size is set as $1/2^l$ of the original video in width, height, and temporal dimension. The partition is illustrated for two videos $V_i$ and $V_j$ at level 1. Local space-time (ST) features, including Histograms of Oriented Gradient (HoG) and Histograms of Optical Flow (HoF), are extracted and concatenated to form long feature vectors. Each video clip is also sampled to extract image frames, from which static local SIFT features are extracted.
The two matching stages are as follows. In the first matching stage, the pairwise distance $D_{rc}$ is calculated between each pair of space-time volumes $V_i(r)$ and $V_j(c)$, where $r, c = 1, \ldots, R$, with $R$ being the total number of volumes in a video.
In the second stage, the information from different volumes is further integrated with integer-flow Earth Mover's Distance to explicitly align the volumes. A flow matrix containing binary elements that represent unique matches between volumes $V_i(r)$ and $V_j(c)$ is solved for:
Then, the distance between two videos $V_i$ and $V_j$ can be directly calculated by
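The two expressions referred to here are not reproduced in this text. In the usual integer-flow EMD formulation, which is assumed here, the binary flow matrix with entries $F_{rc}$ and the resulting video-level distance take the following form:

```latex
\begin{aligned}
\hat{F} &= \arg\min_{F_{rc} \in \{0,1\}} \sum_{r=1}^{R} \sum_{c=1}^{R} F_{rc}\, D_{rc}
\quad \text{s.t.} \quad \sum_{c=1}^{R} F_{rc} = 1 \ \ \forall r, \qquad
\sum_{r=1}^{R} F_{rc} = 1 \ \ \forall c, \\[4pt]
D(V_i, V_j) &= \frac{\sum_{r=1}^{R} \sum_{c=1}^{R} \hat{F}_{rc}\, D_{rc}}
                    {\sum_{r=1}^{R} \sum_{c=1}^{R} \hat{F}_{rc}} .
\end{aligned}
```

Because every volume is matched exactly once, the denominator equals $R$, so $D(V_i, V_j)$ is the average first-stage distance over the matched pairs of volumes.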
The matching results are obtained by using the ASTPM method. Each pair of matched volumes from two videos is highlighted in the same color. Cross-domain learning methods have been proposed for many applications [11,14,15]. To take advantage of all labeled patterns from both auxiliary and target domains, Daumé III proposed Feature Replication (FR), which uses augmented features for SVM training. In Adaptive SVM (A-SVM), the target classifier $f^T(x)$ is adapted from an existing classifier $f^A(x)$, referred to as the auxiliary classifier, trained on the samples from the auxiliary domain.
The target decision function is defined as $f^T(x) = f^A(x) + \Delta f(x)$, where $\Delta f(x)$ is a perturbation function learned from the labeled data in the target domain. While A-SVM can also employ multiple auxiliary classifiers, these auxiliary classifiers are equally fused to obtain $f^A(x)$. Moreover, the target classifier $f^T(x)$ is learned based on only one kernel. Recently, Duan proposed Domain Transfer SVM (DTSVM) to simultaneously reduce the mismatch in the distributions between two domains and learn a target decision function.
A set of independent classifiers is trained for each pyramid level and each type of local feature using the training data from the two domains. These classifiers are then fused with equal weights to obtain average classifiers, which serve as pre-learned classifiers, that is, as a prior for learning a robust adapted target classifier.
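Equal-weight fusion simply averages the decision values of the individual classifiers. A minimal sketch follows, in which the helper name `average_classifier` and the two toy linear scorers are hypothetical stand-ins for the trained SVM decision functions:

```python
import numpy as np


def average_classifier(decision_functions):
    """Fuse several decision functions with equal weights.

    Each element maps a batch of samples to one decision value per sample;
    the fused classifier returns the mean of those values.
    """
    if not decision_functions:
        raise ValueError("at least one decision function is required")

    def fused(samples):
        scores = np.stack([np.asarray(f(samples), dtype=float)
                           for f in decision_functions])
        return scores.mean(axis=0)

    return fused


# Two hypothetical linear scorers standing in for pre-trained classifiers.
w_first = np.array([1.0, 0.0])
w_second = np.array([0.0, 1.0])
f_avg = average_classifier([lambda x: x @ w_first, lambda x: x @ w_second])
scores = f_avg(np.array([[2.0, 4.0], [1.0, -1.0]]))
```

The sign of each fused score would then give the predicted label, in the same way as for a single SVM decision function.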
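The kernel function $k$ is a linear combination of base kernels $k_m$, $k = \sum_{m=1}^{M} d_m k_m$, where $d_m$ is the linear combination coefficient and each kernel function $k_m$ is induced from a nonlinear feature mapping function $\varphi_m(\cdot)$, i.e., $k_m(x_i, x_j) = \varphi_m(x_i)^\top \varphi_m(x_j)$. In A-MKL, the first objective is to reduce the mismatch in data distributions between the two domains. With the combined kernel, this mismatch is linear in the coefficient vector $d = [d_1, \ldots, d_M]^\top$:

$$\Omega(d) = h^\top d,$$

where $h = [\mathrm{tr}(K_1 S), \ldots, \mathrm{tr}(K_M S)]^\top$ and $K_m = [k_m(x_i, x_j)]$ is the $m$th base kernel matrix defined on the samples from both auxiliary and target domains.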
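The vector $h$ and the combined kernel can be computed directly from the base kernel matrices. The sketch below assumes that $S = s s^\top$, where $s$ holds $1/n_A$ for each auxiliary sample and $-1/n_T$ for each target sample; this is the usual Maximum Mean Discrepancy construction, and $S$ is not defined in this text. The function names, the RBF base kernels, and the random toy data are likewise assumptions made for the example.

```python
import numpy as np


def rbf_kernel(samples, gamma):
    """RBF base kernel matrix over all samples (auxiliary and target)."""
    sq_dist = ((samples[:, None, :] - samples[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-gamma * sq_dist)


def mmd_selector(n_aux, n_tgt):
    """Assumed matrix S = s s^T, with auxiliary samples ordered first."""
    s = np.concatenate([np.full(n_aux, 1.0 / n_aux),
                        np.full(n_tgt, -1.0 / n_tgt)])
    return np.outer(s, s)


def mismatch_vector(base_kernels, selector):
    """h = [tr(K_1 S), ..., tr(K_M S)]."""
    return np.array([np.trace(k_m @ selector) for k_m in base_kernels])


def combine_kernels(base_kernels, coeffs):
    """K = sum_m d_m K_m for linear combination coefficients d_m."""
    coeffs = np.asarray(coeffs, dtype=float)
    if len(coeffs) != len(base_kernels):
        raise ValueError("one coefficient per base kernel is required")
    return sum(d_m * k_m for d_m, k_m in zip(coeffs, base_kernels))


# Toy data: 30 auxiliary (web) samples and 20 shifted target (consumer) samples.
rng = np.random.default_rng(0)
x_aux = rng.normal(0.0, 1.0, size=(30, 5))
x_tgt = rng.normal(0.5, 1.0, size=(20, 5))
x_all = np.vstack([x_aux, x_tgt])

kernels = [rbf_kernel(x_all, gamma) for gamma in (0.1, 1.0)]
selector = mmd_selector(len(x_aux), len(x_tgt))
h = mismatch_vector(kernels, selector)
d = np.array([0.5, 0.5])
mismatch = float(h @ d)  # equals tr(K S) for the combined kernel K
```

Since the trace is linear, evaluating the mismatch for any new coefficient vector only requires a dot product with the precomputed $h$, which is what makes the mismatch term cheap to include in an optimization over $d$.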
The second objective of A-MKL is to minimize the structural risk functional. MKL methods assume that the training data and the test data are drawn from the same domain. When they come from different distributions, MKL methods may fail to learn the optimal kernel, which would degrade the classification performance in the target domain. In contrast, A-MKL can make better use of the data from the two domains to improve the classification performance.
The mismatch is measured by Maximum Mean Discrepancy (MMD), based on the distance between the means of samples from the auxiliary domain $D^A$ and the target domain $D^T$ in the Reproducing Kernel Hilbert Space (RKHS), namely:

$$\mathrm{DIST}_k(D^A, D^T) = \left\| \frac{1}{n_A} \sum_{i=1}^{n_A} \varphi(x_i^A) - \frac{1}{n_T} \sum_{i=1}^{n_T} \varphi(x_i^T) \right\|_{\mathcal{H}},$$

where $x_i^A$ and $x_i^T$ are the samples from the auxiliary and target domains, respectively, and $n_A$ and $n_T$ are the numbers of samples in the two domains. A-SVM [4,17-22] also assumes that the target classifier $f^T(x)$ is adapted from existing auxiliary classifiers. An event in a consumer video is recognized using a large number of loosely labeled web videos and a limited number of labeled consumer videos. Aligned Space-Time Pyramid Matching is used to measure the similarity between videos. The cross-domain learning method, Adaptive Multiple Kernel Learning, handles the mismatch between the data distributions of the consumer video domain and the web video domain.
A new event recognition framework for consumer videos is built by leveraging a large amount of loosely labeled YouTube videos. A new pyramid matching method, ASTPM, and a novel transfer learning method, A-MKL, are proposed to better fuse the information from multiple pyramid levels and different types of local features and to cope with the mismatch between the feature distributions of consumer videos and web videos. A possible future research direction is to develop effective methods to select more useful videos from a large number of low-quality YouTube videos to construct the auxiliary domain.
The adaptation between the web domain and the consumer domain studied in this work is one example of domain adaptation; other examples that vision researchers have recently been working on include the adaptation of cross-category knowledge to a new category domain, knowledge transfer by mining semantic relatedness, and adaptation between two domains with different feature representations. In the future, A-MKL will be extended to internet vision applications.