ApacheCon 2016 has ended
Thursday, May 12 • 2:30pm - 3:20pm
Focused Crawling with Apache Nutch - Sujen Shah, NASA Jet Propulsion Laboratory

The vast nature of the Web has forced researchers to continually develop advanced data acquisition strategies that overcome a multitude of obstacles in order to acquire relevant topical content and fit it to their needs. Many groups have researched focused Web crawling techniques to better guide their data acquisition efforts; however, few approaches consider the scenario where one wishes to undertake DD on the open Web when no prior semantic knowledge resources are available. Sujen and his team have investigated and developed a new application of the cosine similarity metric (CSM), implemented as part of a novel strategy for domain-specific DD.

In this presentation, Sujen will review recent work in focused crawling and the ability to run similarity scoring within a production-ready, scalable Web crawler, Apache Nutch.
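
As a rough illustration of the idea behind similarity scoring in a focused crawl, the sketch below ranks fetched pages by the cosine similarity between their term-frequency vectors and a reference text describing the topic of interest. It is a minimal, standalone sketch and not the code presented in the talk or shipped with Apache Nutch; the function names, the simple tokenizer, and the example URLs are all assumptions made for illustration.

```python
import math
import re
from collections import Counter


def term_vector(text):
    """Lowercase the text, split on non-alphanumerics, count term frequencies."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse term-frequency vectors."""
    dot = sum(count * b[term] for term, count in a.items() if term in b)
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty document is treated as unrelated
    return dot / (norm_a * norm_b)


def rank_pages(reference_text, pages):
    """Return (score, url) pairs, most similar to the reference text first.

    `pages` is a hypothetical mapping of URL -> extracted page text.
    """
    reference = term_vector(reference_text)
    scored = [
        (cosine_similarity(reference, term_vector(body)), url)
        for url, body in pages.items()
    ]
    return sorted(scored, reverse=True)


if __name__ == "__main__":
    # Hypothetical reference text and pages, for illustration only.
    reference = "mars rover mission science instruments"
    pages = {
        "http://example.org/rover": "The rover mission carries science instruments to Mars.",
        "http://example.org/recipes": "A simple recipe for tomato soup.",
    }
    for score, url in rank_pages(reference, pages):
        print(f"{score:.3f}  {url}")
```

In a crawler, a score like this could be used to prioritize outlinks from pages that resemble the reference text, so that the crawl stays on topic without requiring a prior ontology or other semantic resource.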

Sujen Shah

Scientific Applications Software Engineer, NASA Jet Propulsion Laboratory

Thursday May 12, 2016 2:30pm - 3:20pm PDT
Plaza B