Potential and Pitfalls of Domain-Specific Information Extraction at Web Scale

A. Rheinländer, M. Lehmann, A. Kunkel, J. Meier, and U. Leser. Proceedings of the 2016 International Conference on Management of Data, pages 759--771. New York, NY, USA, ACM (2016)
DOI: 10.1145/2882903.2903736

Abstract

In many domains, a plethora of textual information is available on the web as news reports, blog posts, community portals, etc. Information extraction (IE) is the default technique to turn unstructured text into structured fact databases, but systematically applying IE techniques to web input requires highly complex systems, starting from focused crawlers over quality assurance methods to cope with the HTML input to long pipelines of natural language processing and IE algorithms. Although a number of tools for each of these steps exist, their seamless, flexible, and scalable combination into a web-scale end-to-end text analytics system is still a true challenge. In this paper, we report our experiences from building such a system for comparing the "web view" on health-related topics with that derived from a controlled scientific corpus, i.e., Medline. The system combines a focused crawler, applying shallow text analysis and classification to maintain focus, with a sophisticated text analytic engine inside the Big Data processing system Stratosphere. We describe a practical approach to seed generation which led us to crawl a corpus of ~1 TB web pages highly enriched for the biomedical domain. Pages were run through a complex pipeline of best-of-breed tools for a multitude of necessary tasks, such as HTML repair, boilerplate detection, sentence detection, linguistic annotation, parsing, and eventually named entity recognition for several types of entities. Results are compared with those from running the same pipeline (without the web-related tasks) on a corpus of 24 million scientific abstracts and a third corpus made of ~250K scientific full texts. We evaluate scalability, quality, and robustness of the employed methods and tools. The focus of this paper is to provide a large, real-life use case to inspire future research into robust, easy-to-use, and scalable methods for domain-specific IE at web scale.
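
The stages named in the abstract (tolerant HTML handling, boilerplate detection, sentence detection, named entity recognition) can be illustrated with a toy, single-machine sketch. This is not the authors' system, which runs best-of-breed tools inside Stratosphere; everything below is an assumption made for illustration: the class and function names, the word-count heuristic for boilerplate, the regex sentence splitter, and the two-entry lexicon are all hypothetical, and only the Python standard library is used.

```python
import re
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects visible text blocks; tolerant of unclosed tags (hypothetical helper)."""

    SKIP = {"script", "style"}
    BLOCK = {"p", "div", "li", "h1", "h2", "h3", "td", "br"}

    def __init__(self):
        super().__init__()
        self.blocks = []
        self._buf = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1
        elif tag in self.BLOCK:
            self._flush()

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth > 0:
            self._skip_depth -= 1
        elif tag in self.BLOCK:
            self._flush()

    def handle_data(self, data):
        if self._skip_depth == 0:
            self._buf.append(data)

    def _flush(self):
        # Normalize whitespace and store the block if it is non-empty.
        text = " ".join("".join(self._buf).split())
        if text:
            self.blocks.append(text)
        self._buf = []

    def close(self):
        super().close()
        self._flush()  # emit text left over from unclosed elements


def drop_boilerplate(blocks, min_words=8):
    """Assumed heuristic: very short blocks are navigation or footer text."""
    return [b for b in blocks if len(b.split()) >= min_words]


def split_sentences(text):
    """Naive splitter: break after ., ! or ? followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text) if s]


def tag_entities(sentence, lexicon):
    """Dictionary lookup standing in for a trained NER model."""
    found = []
    for term, label in lexicon.items():
        pattern = r"\b" + re.escape(term) + r"\b"
        for m in re.finditer(pattern, sentence, re.IGNORECASE):
            found.append((m.start(), m.group(0), label))
    # Return mentions in order of appearance.
    return [(text, label) for _, text, label in sorted(found)]


def run_pipeline(html, lexicon):
    """Chain the stages: HTML -> blocks -> sentences -> entity mentions."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    results = []
    for block in drop_boilerplate(parser.blocks):
        for sentence in split_sentences(block):
            results.append((sentence, tag_entities(sentence, lexicon)))
    return results


# Hypothetical input: note the unclosed <p>, which the parser tolerates.
SAMPLE_HTML = (
    "<html><body><div>Home | Login</div>"
    "<p>Aspirin is often taken to relieve a migraine attack. "
    "It has other uses as well, which we do not cover here."
    "<script>var x = 1;</script></body></html>"
)
LEXICON = {"aspirin": "DRUG", "migraine": "DISEASE"}

if __name__ == "__main__":
    for sentence, entities in run_pipeline(SAMPLE_HTML, LEXICON):
        print(sentence, entities)
```

Each stage is a plain function over the previous stage's output, so any one of them could be swapped for a stronger tool without touching the others. A production system would additionally need the focused crawler, linguistic annotation, and parsing steps that the abstract lists and this sketch omits.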

Links and resources

BibTeX key: rheinlander2016potential
entry type: inproceedings
address: New York, NY, USA
booktitle: Proceedings of the 2016 International Conference on Management of Data
year: 2016
pages: 759--771
publisher: ACM
series: SIGMOD '16
acmid: 2903736
isbn: 978-1-4503-3531-7
numpages: 13
location: San Francisco, California, USA
DOI: 10.1145/2882903.2903736
url: http://doi.acm.org/10.1145/2882903.2903736


Cite this publication

%0 Conference Paper
%1 rheinlander2016potential
%A Rheinländer, Astrid
%A Lehmann, Mario
%A Kunkel, Anja
%A Meier, Jörg
%A Leser, Ulf
%B Proceedings of the 2016 International Conference on Management of Data
%C New York, NY, USA
%D 2016
%I ACM
%K extraction ie information web
%P 759--771
%R 10.1145/2882903.2903736
%T Potential and Pitfalls of Domain-Specific Information Extraction at Web Scale
%U http://doi.acm.org/10.1145/2882903.2903736
%X In many domains, a plethora of textual information is available on the web as news reports, blog posts, community portals, etc. Information extraction (IE) is the default technique to turn unstructured text into structured fact databases, but systematically applying IE techniques to web input requires highly complex systems, starting from focused crawlers over quality assurance methods to cope with the HTML input to long pipelines of natural language processing and IE algorithms. Although a number of tools for each of these steps exists, their seamless, flexible, and scalable combination into a web scale end-to-end text analytics system still is a true challenge. In this paper, we report our experiences from building such a system for comparing the "web view" on health related topics with that derived from a controlled scientific corpus, i.e., Medline. The system combines a focused crawler, applying shallow text analysis and classification to maintain focus, with a sophisticated text analytic engine inside the Big Data processing system Stratosphere. We describe a practical approach to seed generation which led us crawl a corpus of ~1 TB web pages highly enriched for the biomedical domain. Pages were run through a complex pipeline of best-of-breed tools for a multitude of necessary tasks, such as HTML repair, boilerplate detection, sentence detection, linguistic annotation, parsing, and eventually named entity recognition for several types of entities. Results are compared with those from running the same pipeline (without the web-related tasks) on a corpus of 24 million scientific abstracts and a third corpus made of ~250K scientific full texts. We evaluate scalability, quality, and robustness of the employed methods and tools. The focus of this paper is to provide a large, real-life use case to inspire future research into robust, easy-to-use, and scalable methods for domain-specific IE at web scale.
%@ 978-1-4503-3531-7

@inproceedings{rheinlander2016potential,
  abstract = {In many domains, a plethora of textual information is available on the web as news reports, blog posts, community portals, etc. Information extraction (IE) is the default technique to turn unstructured text into structured fact databases, but systematically applying IE techniques to web input requires highly complex systems, starting from focused crawlers over quality assurance methods to cope with the HTML input to long pipelines of natural language processing and IE algorithms. Although a number of tools for each of these steps exists, their seamless, flexible, and scalable combination into a web scale end-to-end text analytics system still is a true challenge. In this paper, we report our experiences from building such a system for comparing the "web view" on health related topics with that derived from a controlled scientific corpus, i.e., Medline. The system combines a focused crawler, applying shallow text analysis and classification to maintain focus, with a sophisticated text analytic engine inside the Big Data processing system Stratosphere. We describe a practical approach to seed generation which led us crawl a corpus of ~1 TB web pages highly enriched for the biomedical domain. Pages were run through a complex pipeline of best-of-breed tools for a multitude of necessary tasks, such as HTML repair, boilerplate detection, sentence detection, linguistic annotation, parsing, and eventually named entity recognition for several types of entities. Results are compared with those from running the same pipeline (without the web-related tasks) on a corpus of 24 million scientific abstracts and a third corpus made of ~250K scientific full texts. We evaluate scalability, quality, and robustness of the employed methods and tools. The focus of this paper is to provide a large, real-life use case to inspire future research into robust, easy-to-use, and scalable methods for domain-specific IE at web scale.},
  acmid = {2903736},
  added-at = {2018-07-04T15:00:45.000+0200},
  address = {New York, NY, USA},
  author = {Rheinländer, Astrid and Lehmann, Mario and Kunkel, Anja and Meier, Jörg and Leser, Ulf},
  biburl = {https://www.bibsonomy.org/bibtex/260897a8b8d9fb046d7d1097231cff702/jaeschke},
  booktitle = {Proceedings of the 2016 International Conference on Management of Data},
  doi = {10.1145/2882903.2903736},
  interhash = {1cc83294b0ecfcc10788d7b85e8711b4},
  intrahash = {60897a8b8d9fb046d7d1097231cff702},
  isbn = {978-1-4503-3531-7},
  keywords = {extraction ie information web},
  location = {San Francisco, California, USA},
  numpages = {13},
  pages = {759--771},
  publisher = {ACM},
  series = {SIGMOD '16},
  timestamp = {2018-07-04T15:00:45.000+0200},
  title = {Potential and Pitfalls of Domain-Specific Information Extraction at Web Scale},
  url = {http://doi.acm.org/10.1145/2882903.2903736},
  year = 2016
}
