Article,

SPEEDING UP THE WEB CRAWLING PROCESS ON A MULTI-CORE PROCESSOR USING VIRTUALIZATION

H. Al-Bahadili, H. Qtishat, and R. Naoum.
International Journal on Web Service Computing (IJWSC), 4 (1): 19-37 (March 2013)
DOI: 10.5121/ijwsc.2013.4102

Abstract

A Web crawler is an important component of a Web search engine. It demands a large amount of hardware resources (CPU and memory) to crawl data from the rapidly growing and changing Web. The crawling process should therefore run continuously, repeated from time to time, to keep the crawled data up to date. This paper develops a new approach to speeding up the crawling process on a multi-core processor through virtualization and investigates its performance. In this approach, the multi-core processor is divided into a number of virtual machines (VMs) that run in parallel (concurrently), performing different crawling tasks on different data. The paper presents a description, implementation, and evaluation of a VM-based distributed Web crawler. To estimate the speedup factor achieved by the VM-based crawler over a non-virtualized crawler, extensive crawling experiments were carried out to measure the crawling times for various numbers of documents. Furthermore, the average crawling rate in documents per unit time is computed, and the effect of the number of VMs on the speedup factor is investigated. For example, on an Intel® Core™ i5-2300 CPU @ 2.80 GHz with 8 GB of memory, a speedup factor of ~1.48 is achieved when crawling 70000 documents on 3 and 4 VMs.
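
The two metrics named in the abstract can be illustrated with a minimal sketch. It assumes the conventional definitions: speedup factor as the ratio of non-virtualized crawl time to VM-based crawl time, and crawling rate as documents divided by elapsed time. The abstract does not spell these out, so the paper's exact formulas may differ. The function names and both timing values are hypothetical; the timings are placeholders chosen only so that the ratio matches the reported ~1.48, and they are not measurements from the paper.

```python
def speedup_factor(t_baseline: float, t_vm: float) -> float:
    """Ratio of non-virtualized crawl time to VM-based crawl time (assumed definition)."""
    if t_baseline <= 0 or t_vm <= 0:
        raise ValueError("crawl times must be positive")
    return t_baseline / t_vm


def crawling_rate(documents: int, elapsed: float) -> float:
    """Average number of documents crawled per unit time (assumed definition)."""
    if elapsed <= 0:
        raise ValueError("elapsed time must be positive")
    return documents / elapsed


# The document count comes from the abstract; the timings below are
# HYPOTHETICAL placeholders, not results reported in the paper.
documents = 70_000
t_baseline_s = 7_400.0  # hypothetical non-virtualized crawl time, seconds
t_vm_s = 5_000.0        # hypothetical VM-based crawl time, seconds

speedup = speedup_factor(t_baseline_s, t_vm_s)
rate_vm = crawling_rate(documents, t_vm_s)

print(f"speedup factor: {speedup:.2f}")
print(f"VM-based crawling rate: {rate_vm:.1f} documents/s")
```

A speedup above 1 means the VM-based crawler finished sooner than the non-virtualized one for the same number of documents.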

BibTeX key: noauthororeditor
entry type: article
year: 2013
month: March
journal: International Journal on Web Service Computing (IJWSC)
number: 1
pages: 19-37
volume: 4
language: English
issn: 0976 - 9811 (Online) ; 2230 - 7702 (print)
DOI: 10.5121/ijwsc.2013.4102
Document: http://airccse.org/journal/jwsc/papers/4113ijwsc02.pdf

Cite this publication

@article{noauthororeditor,
  abstract = {A Web crawler is an important component of a Web search engine. It demands a large amount of hardware resources (CPU and memory) to crawl data from the rapidly growing and changing Web. The crawling process should therefore run continuously, repeated from time to time, to keep the crawled data up to date. This paper develops a new approach to speeding up the crawling process on a multi-core processor through virtualization and investigates its performance. In this approach, the multi-core processor is divided into a number of virtual machines (VMs) that run in parallel (concurrently), performing different crawling tasks on different data. The paper presents a description, implementation, and evaluation of a VM-based distributed Web crawler. To estimate the speedup factor achieved by the VM-based crawler over a non-virtualized crawler, extensive crawling experiments were carried out to measure the crawling times for various numbers of documents. Furthermore, the average crawling rate in documents per unit time is computed, and the effect of the number of VMs on the speedup factor is investigated. For example, on an Intel® Core™ i5-2300 CPU @ 2.80 GHz with 8 GB of memory, a speedup factor of ~1.48 is achieved when crawling 70000 documents on 3 and 4 VMs.},
  added-at = {2019-12-12T08:39:19.000+0100},
  author = {Al-Bahadili, Hussein and Qtishat, Hamzah and Naoum, Reyadh S.},
  biburl = {https://www.bibsonomy.org/bibtex/2466717bb0f50558a7c7f6bd14eae124b/ijwsc},
  doi = {10.5121/ijwsc.2013.4102},
  interhash = {9745af98c72222f21b185085f81a9f2e},
  intrahash = {466717bb0f50558a7c7f6bd14eae124b},
  issn = {0976 - 9811 (Online) ; 2230 - 7702 (print)},
  journal = {International Journal on Web Service Computing (IJWSC)},
  keywords = {Web crawler crawling distributed distribution engine machines methodologies methodology multi-core processor processor-farm search virtual virtualization},
  language = {English},
  month = {March},
  number = 1,
  pages = {19-37},
  timestamp = {2019-12-12T08:39:19.000+0100},
  title = {SPEEDING UP THE WEB CRAWLING PROCESS ON A MULTI-CORE PROCESSOR USING VIRTUALIZATION},
  url = {http://airccse.org/journal/jwsc/papers/4113ijwsc02.pdf},
  volume = 4,
  year = 2013
}
