H. Pirsiavash and D. Ramanan. Parsing Videos of Actions with Segmental Grammars. Proceedings of the 2014 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Columbus, OH, USA, pages 612-619. (2014)
DOI: 10.1109/CVPR.2014.85
Abstract
Real-world videos of human activities exhibit temporal structure at various scales; long videos are typically composed out of multiple action instances, where each instance is itself composed of sub-actions with variable durations and orderings. Temporal grammars can presumably model such hierarchical structure, but are computationally difficult to apply for long video streams. We describe simple grammars that capture hierarchical temporal structure while admitting inference with a finite-state-machine. This makes parsing linear time, constant storage, and naturally online. We train grammar parameters using a latent structural SVM, where latent subactions are learned automatically. We illustrate the effectiveness of our approach over common baselines on a new half-million frame dataset of continuous YouTube videos.
booktitle: Proceedings of the 2014 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Columbus, OH, USA
year: 2014
pages: 612-619
file: IEEE Digital Library:2014/PirsiavashRamanan14CVPR.pdf:PDF;Related MIT News:http\://newsoffice.mit.edu/2014/techniques-from-natural-language-processing-enable-computers-to-search-video-0514:URL
%0 Conference Paper
%1 PirsiavashRamanan14CVPR
%A Pirsiavash, Hamed
%A Ramanan, Deva
%B Proceedings of the 2014 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Columbus, OH, USA
%D 2014
%K v1410 ieee paper ai video image semantic analysis action recognition learn zzz.vitra
%P 612-619
%R 10.1109/CVPR.2014.85
%T Parsing Videos of Actions with Segmental Grammars
%X Real-world videos of human activities exhibit temporal structure at various scales; long videos are typically composed out of multiple action instances, where each instance is itself composed of sub-actions with variable durations and orderings. Temporal grammars can presumably model such hierarchical structure, but are computationally difficult to apply for long video streams. We describe simple grammars that capture hierarchical temporal structure while admitting inference with a finite-state-machine. This makes parsing linear time, constant storage, and naturally online. We train grammar parameters using a latent structural SVM, where latent subactions are learned automatically. We illustrate the effectiveness of our approach over common baselines on a new half-million frame dataset of continuous YouTube videos.
@inproceedings{PirsiavashRamanan14CVPR,
abstract = {Real-world videos of human activities exhibit temporal structure at various scales; long videos are typically composed out of multiple action instances, where each instance is itself composed of sub-actions with variable durations and orderings. Temporal grammars can presumably model such hierarchical structure, but are computationally difficult to apply for long video streams. We describe simple grammars that capture hierarchical temporal structure while admitting inference with a finite-state-machine. This makes parsing linear time, constant storage, and naturally online. We train grammar parameters using a latent structural SVM, where latent subactions are learned automatically. We illustrate the effectiveness of our approach over common baselines on a new half-million frame dataset of continuous YouTube videos.},
added-at = {2014-10-18T10:57:48.000+0200},
author = {Pirsiavash, Hamed and Ramanan, Deva},
biburl = {https://www.bibsonomy.org/bibtex/22a7d54c472dcb065b78eb681e939b8e2/flint63},
booktitle = {Proceedings of the 2014 {IEEE} Conference on Computer Vision and Pattern Recognition (CVPR), Columbus, OH, USA},
doi = {10.1109/CVPR.2014.85},
file = {IEEE Digital Library:2014/PirsiavashRamanan14CVPR.pdf:PDF;Related MIT News:http\://newsoffice.mit.edu/2014/techniques-from-natural-language-processing-enable-computers-to-search-video-0514:URL},
groups = {public},
interhash = {547dfbc2f3ff02ad40cf6992a804affb},
intrahash = {2a7d54c472dcb065b78eb681e939b8e2},
keywords = {v1410 ieee paper ai video image semantic analysis action recognition learn zzz.vitra},
pages = {612--619},
timestamp = {2018-04-16T11:30:15.000+0200},
title = {Parsing Videos of Actions with Segmental Grammars},
username = {flint63},
year = 2014
}