Implementation of A Distributed Web Community Crawler

Seonyoung Park; Youngseok Lee

Asia-Pacific Network Operations and Management Symposium

2014

Session Number:TS5

Number:TS5-3

Publication Date:2014/09/17

Online ISSN:2188-5079

DOI:10.34385/proc.21.TS5-3

Summary:

A web community is an important space for online users to exchange information, ideas, and thoughts. Because web communities aggregate collective intelligence, marketing and advertising activities have focused heavily on these sites. Although articles in web communities are open to the public, they cannot be easily collected and analyzed, because they are written in natural language and their formats are diverse. Many web crawlers are available, but they are not well suited to gathering articles from web communities. First, the URLs of web articles change frequently and are often redundant, which makes crawling difficult. Second, the number of articles is so large that the crawler must be designed to scale. Therefore, we propose a distributed web crawler optimized for collecting articles from popular communities. Our experiments show that our implementation achieves higher throughput than the open-source crawler Nutch.
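The two difficulties named above, redundant article URLs and the need to scale, can be illustrated with a minimal Python sketch. It is not the implementation described in the paper, whose details the summary does not give. The ignored query parameters, the example URLs, and the function names `canonicalize`, `dedupe`, and `assign_worker` are all assumptions made for illustration.

```python
import hashlib
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Assumption: query parameters that change how an article is displayed
# but not which article it is. A real community site would need its own list.
IGNORED_PARAMS = {"page", "sort", "ref", "sessionid"}


def canonicalize(url: str) -> str:
    """Map redundant URL variants of one article to a single canonical form."""
    parts = urlsplit(url)
    # Drop ignored parameters and sort the rest so parameter order is irrelevant.
    query = sorted(
        (key, value)
        for key, value in parse_qsl(parts.query)
        if key.lower() not in IGNORED_PARAMS
    )
    # Treat "/board/view/" and "/board/view" as the same path.
    path = parts.path.rstrip("/") or "/"
    # The fragment (e.g. "#comments") is discarded.
    return urlunsplit(
        (parts.scheme.lower(), parts.netloc.lower(), path, urlencode(query), "")
    )


def dedupe(urls):
    """Return canonical URLs in first-seen order, skipping duplicates."""
    seen = set()
    unique = []
    for url in urls:
        key = canonicalize(url)
        if key not in seen:
            seen.add(key)
            unique.append(key)
    return unique


def assign_worker(url: str, num_workers: int) -> int:
    """Pick a worker by hashing the canonical URL (simple hash partitioning)."""
    digest = hashlib.sha1(canonicalize(url).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_workers


if __name__ == "__main__":
    a = "http://Example.com/board/view?id=42&page=3"
    b = "http://example.com/board/view/?page=7&id=42#comments"
    print(dedupe([a, b]))
    print(assign_worker(a, 4) == assign_worker(b, 4))
```

Because the worker is chosen from the canonical form, every variant of an article URL is routed to the same worker, so duplicates can be detected locally without a shared lookup table. This is one common design choice for distributed crawlers and is shown here only as a sketch under the stated assumptions.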