Name: Using Tika and Spark to Cluster the Crawl Output of Nutch - ThammeGowda Narayanaswamy, University of Southern California
Start: 2016-05-12T16:40:00-0700
End: 2016-05-12T17:30:00-0700

Using Tika and Spark to Cluster the Crawl Output of Nutch - ThammeGowda Narayanaswamy, University of Southern California

Most users who consume data from the web are concerned with only a subset of the documents in a crawler's output. Though crawlers like Nutch offer flexible configuration mechanisms to focus the crawl on interesting pages, it is almost impossible to isolate the less interesting data from the more important information that the crawler should be focused on. In this presentation, Thamme Gowda and Chris Mattmann will describe a useful clustering technique they formulated by combining various similarity measures on the DOM structure and CSS styles of web pages, including Tree Edit Distance and Jaccard Similarity. The resulting clusters can then be used for extracting interesting data and applying special analysis based on cluster content. They will also showcase an implementation of this technique, which they plan to contribute to Apache Tika, and show how it can be scaled to web scale using Spark's MLlib.
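
As a rough illustration of one of the measures named in the abstract, the sketch below computes a Jaccard similarity over the sets of CSS class names used by two pages. It is not the speakers' implementation: the names `StyleCollector`, `css_classes`, `jaccard`, and `style_similarity` are hypothetical, the choice to compare class names (rather than some other style feature) is an assumption, and only the Python standard library is used.

```python
from html.parser import HTMLParser


class StyleCollector(HTMLParser):
    """Collects the set of CSS class names used in an HTML page (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.classes = set()

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value may be None.
        for name, value in attrs:
            if name == "class" and value:
                self.classes.update(value.split())


def css_classes(html):
    """Return the set of class names that appear anywhere in the page."""
    collector = StyleCollector()
    collector.feed(html)
    return collector.classes


def jaccard(a, b):
    """Size of the intersection divided by size of the union.

    Two empty sets are treated as identical (an assumption made here).
    """
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def style_similarity(html_a, html_b):
    """Jaccard similarity between the CSS class sets of two pages."""
    return jaccard(css_classes(html_a), css_classes(html_b))


page_a = '<div class="post body"><p class="text">Hello</p></div>'
page_b = '<div class="post body"><span class="meta">2016</span></div>'
print(style_similarity(page_a, page_b))  # 2 shared classes out of 4 distinct: 0.5
```

A pairwise score like this could feed a clustering step alongside a structural measure such as tree edit distance; how the talk weights or combines the measures is not stated in the abstract.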

Speakers

Thamme Gowda

Graduate Student, University of Southern California

Thamme Gowda is a grad student at the Univ. of Southern California, Los Angeles, CA, and also an intern at NASA Jet Propulsion Laboratory, Pasadena, CA, USA. He is a co-founder of Datoin.com, a software as a service platform built using Hadoop and Spark. He is also a committer and...

Apache Con Slides Nutch Clustering pdf

Thursday May 12, 2016 4:40pm - 5:30pm PDT
Regency B

Web Applications, Any

ApacheCon 2016
