ApacheCon 2016 has ended
Thursday, May 12 • 4:40pm - 5:30pm
Using Tika and Spark to Cluster the Crawl Output of Nutch - ThammeGowda Narayanaswamy, University of Southern California


Most users who consume data from the web are concerned with only a subset of the documents in a crawler's output. Though crawlers like Nutch offer flexible configuration mechanisms to focus the crawl on interesting pages, it is almost impossible to isolate the less interesting data from the more important information that the crawler should be focused on. In this presentation, Thamme Gowda and Chris Mattmann will describe a clustering technique they formulated by combining several similarity measures over the DOM structure and CSS styles of web pages, including Tree Edit Distance and Jaccard Similarity. The resulting clusters can then be used to extract interesting data and to apply special analysis based on cluster content. They will also showcase an implementation of this technique, which they plan to contribute to Apache Tika, and show how it can be scaled to web scale using Spark's MLlib.
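
As a rough illustration of the idea of combining a style measure with a structure measure, the sketch below scores two pages using only the Python standard library. It is not the speakers' implementation: the names `PageFeatures`, `extract`, `jaccard`, `page_similarity`, and the `style_weight` parameter are all hypothetical, and the structural score uses Jaccard similarity over root-to-node tag paths as a simple stand-in for Tree Edit Distance.

```python
from html.parser import HTMLParser

# Void elements never get a closing tag, so they must not stay on the path stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class PageFeatures(HTMLParser):
    """Collects CSS class names and root-to-node tag paths from one page."""

    def __init__(self):
        super().__init__()
        self.stack = []           # currently open tags, outermost first
        self.css_classes = set()  # style feature: every class name seen
        self.tag_paths = set()    # structure feature: e.g. "html/body/div"

    def handle_starttag(self, tag, attrs):
        self.tag_paths.add("/".join(self.stack + [tag]))
        for name, value in attrs:
            if name == "class" and value:
                self.css_classes.update(value.split())
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag; ignore stray closing tags.
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass


def extract(html):
    """Return (css_classes, tag_paths) for an HTML string."""
    parser = PageFeatures()
    parser.feed(html)
    parser.close()
    return parser.css_classes, parser.tag_paths


def jaccard(a, b):
    """Jaccard similarity |A & B| / |A | B|; two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def page_similarity(html_a, html_b, style_weight=0.5):
    """Weighted mix of style and structure similarity (weight is an assumption)."""
    classes_a, paths_a = extract(html_a)
    classes_b, paths_b = extract(html_b)
    style = jaccard(classes_a, classes_b)
    structure = jaccard(paths_a, paths_b)  # stand-in for tree edit distance
    return style_weight * style + (1 - style_weight) * structure
```

Pairwise scores like these could be fed to any clustering algorithm that accepts a similarity or distance matrix. A real tree edit distance would also account for node order and nesting depth, which the path-set approximation here ignores.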


Thamme Gowda

Graduate Student, University of Southern California
Thamme Gowda is a grad student at the Univ. of Southern California, Los Angeles, CA, and also an intern at NASA Jet Propulsion Laboratory, Pasadena, CA, USA. He is a co-founder of Datoin.com, a software as a service platform built using Hadoop and Spark. He is also a committer and...

Thursday May 12, 2016 4:40pm - 5:30pm PDT
Regency B