ApacheCon 2016 has ended

Web Server
Thursday, May 12


Focused Crawling with Apache Nutch - Sujen Shah, NASA Jet Propulsion Laboratory
The vast nature of the Web has forced researchers to continually develop advanced data acquisition strategies that overcome a multitude of obstacles in order to acquire relevant topical content and adapt it to their needs. Many groups have researched focused Web crawling techniques to better guide their data acquisition efforts; however, few approaches consider the scenario where one wishes to undertake DD on the open Web for which no prior semantic knowledge resources are available. Sujen and his team have investigated and developed a new application of the cosine similarity metric (CSM), which has been implemented as part of a novel strategy for domain-specific DD.

In this presentation, Sujen will review recent work in focused crawling and the ability to run similarity scoring within a production-ready, scalable Web crawler, Apache Nutch.
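
For readers unfamiliar with the cosine similarity metric mentioned in the abstract, the following is a minimal sketch of how a fetched page could be scored against a topic description. It is an illustration only, not the Nutch scoring plugin: the function names (term_vector, cosine_similarity, score_page) and the simple bag-of-words tokenization are assumptions made for this example.

```python
import math
import re
from collections import Counter


def term_vector(text):
    """Build a bag-of-words vector: lowercase alphanumeric tokens and their counts."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse term-count vectors."""
    # Only terms present in both vectors contribute to the dot product.
    dot = sum(a[term] * b[term] for term in a.keys() & b.keys())
    norm_a = math.sqrt(sum(count * count for count in a.values()))
    norm_b = math.sqrt(sum(count * count for count in b.values()))
    if norm_a == 0 or norm_b == 0:
        # An empty document has no direction; treat it as unrelated.
        return 0.0
    return dot / (norm_a * norm_b)


def score_page(page_text, topic_text):
    """Hypothetical helper: relevance of a page to a topic, in the range [0, 1]."""
    return cosine_similarity(term_vector(page_text), term_vector(topic_text))
```

A focused crawler could use a score like this to decide which outlinks to follow first, for example by preferring pages whose score exceeds a chosen threshold. The threshold and the text used to describe the topic would both be choices made by the crawl operator.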

Sujen Shah

Research Intern, NASA Jet Propulsion Laboratory
Sujen is a Master's student pursuing Computer Science at the University of Southern California, Los Angeles. As a committer and member of the Apache Nutch PMC, his work includes augmenting the focused crawling capabilities of Nutch. These new scoring plugins are supporting the efforts...

Thursday May 12, 2016 2:30pm - 3:20pm
Plaza B


Multi-tier web caching with user generated content - Alan Carroll, Yahoo!
One of the major drivers of CDN capabilities is hosting user-generated content. This is much more challenging than hosting static content and, although similar in many ways to video content delivery, it presents a number of distinct challenges.

This talk will cover the design basics of Yahoo's next-generation CDN. The overall architecture will be described, followed by an examination of how it supports rapid delivery of very large sets of user-generated content. Trade-offs between third-party object storage and internally hosted content will be discussed, along with mechanisms for blending the two with caching to improve performance while managing costs. Multi-tenant issues will be examined, both at the level of handling differing requirements and in terms of how to build in adaptability for changing requirements.


Thursday May 12, 2016 3:30pm - 4:20pm
Plaza B