TextSweeper - A System for Content Extraction and Overview Page Detection
H. Lang, G. Wohlgenannt, and A. Weichselbraun. International Conference on Information Resources Management (Conf-IRM), Vienna, Austria, AIS, (2012). Forthcoming (accepted 6 February 2012).
Abstract
Web pages not only contain main content, but also other elements such as navigation panels, advertisements and links to related documents.
Furthermore, overview pages (summarization pages and entry points) duplicate and aggregate parts of articles and thereby create redundancies. Both the noise elements in Web pages and the overview pages degrade the performance of downstream processes such as Web-based Information Retrieval. The task of Content Extraction is to identify and extract the main content from a Web page.
In this research-in-progress paper we present an approach which not only identifies and extracts the main content,
but also detects overview pages and thereby allows skipping them. The content extraction part of the system is an extension of existing Text-to-Tag ratio methods; overview page detection is accomplished with the net text length heuristic. Preliminary results and an ad-hoc evaluation indicate promising system performance. A formal evaluation and a comparison to other state-of-the-art approaches are part of future work.
%0 Conference Paper
%1 lang2012
%A Lang, Heinz-Peter
%A Wohlgenannt, Gerhard
%A Weichselbraun, Albert
%B International Conference on Information Resources Management (Conf-IRM)
%C Vienna, Austria
%D 2012
%I AIS
%K Web-based content contextualized extraction, filtering, information language natural overview pages, processing, retrieval spaces, text
%T TextSweeper - A System for Content Extraction and Overview Page Detection
%X Web pages not only contain main content, but also other elements such as navigation panels, advertisements and links to related documents.
Furthermore, overview pages (summarization pages and entry points) duplicate and aggregate parts of articles and thereby create redundancies. Both the noise elements in Web pages and the overview pages degrade the performance of downstream processes such as Web-based Information Retrieval. The task of Content Extraction is to identify and extract the main content from a Web page.
In this research-in-progress paper we present an approach which not only identifies and extracts the main content,
but also detects overview pages and thereby allows skipping them. The content extraction part of the system is an extension of existing Text-to-Tag ratio methods; overview page detection is accomplished with the net text length heuristic. Preliminary results and an ad-hoc evaluation indicate promising system performance. A formal evaluation and a comparison to other state-of-the-art approaches are part of future work.
@inproceedings{lang2012,
abstract = {Web pages not only contain main content, but also other elements such as navigation panels, advertisements and links to related documents.
Furthermore, overview pages (summarization pages and entry points) duplicate and aggregate parts of articles and thereby create redundancies. Both the noise elements in Web pages and the overview pages degrade the performance of downstream processes such as Web-based Information Retrieval. The task of Content Extraction is to identify and extract the main content from a Web page.
In this research-in-progress paper we present an approach which not only identifies and extracts the main content,
but also detects overview pages and thereby allows skipping them. The content extraction part of the system is an extension of existing Text-to-Tag ratio methods; overview page detection is accomplished with the net text length heuristic. Preliminary results and an ad-hoc evaluation indicate promising system performance. A formal evaluation and a comparison to other state-of-the-art approaches are part of future work.},
added-at = {2012-04-16T19:17:24.000+0200},
address = {Vienna, Austria},
author = {Lang, Heinz-Peter and Wohlgenannt, Gerhard and Weichselbraun, Albert},
biburl = {https://www.bibsonomy.org/bibtex/2c5ee7d1dfc5093ca2516b16e31d99cb3/albert.weichselbraun},
booktitle = {International Conference on Information Resources Management (Conf-IRM)},
interhash = {889651476f1067731cd90b39905df0a8},
intrahash = {c5ee7d1dfc5093ca2516b16e31d99cb3},
keywords = {Web-based content contextualized extraction, filtering, information language natural overview pages, processing, retrieval spaces, text},
note = {Forthcoming (accepted 6 February 2012)},
owner = {albert},
publisher = {AIS},
timestamp = {2012-04-16T19:17:24.000+0200},
title = {TextSweeper - A System for Content Extraction and Overview Page Detection},
year = 2012
}