ApacheCon 2016 has ended


Web Applications
Thursday, May 12
 

4:40pm

Using Tika and Spark to Cluster the Crawl Output of Nutch - ThammeGowda Narayanaswamy, University of Southern California
Most users who consume data from the web are concerned with only a subset of the documents in the crawler's output. Though crawlers like Nutch offer flexible configuration mechanisms to focus the crawl on interesting pages, it is almost impossible to isolate the less interesting data from the more important information that the crawler should focus on. In this presentation, Thamme Gowda and Chris Mattmann will describe a useful clustering technique they formulated by combining various similarity measures on the DOM structure and CSS styles of web pages, including Tree Edit Distance and Jaccard Similarity. The clusters can then be used for extracting interesting data and applying special analysis based on cluster content. They will also showcase an implementation of this technique, which they plan to contribute to Apache Tika, and show how it can be scaled to web scale using Spark's MLlib.
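As a rough illustration of the kind of measure the abstract mentions, the sketch below computes Jaccard similarity between the sets of CSS class names used on two pages and blends it with a structural score. It is not the speakers' implementation: the function names, the sample class names, and the weighted-sum combination are assumptions made for this example, and the structural score (for instance, one derived from tree edit distance on the DOM) is assumed to be computed elsewhere and passed in as a number between 0 and 1.

```python
def jaccard_similarity(a, b):
    """Jaccard similarity of two collections: |A & B| / |A | B|."""
    a, b = set(a), set(b)
    if not a and not b:
        # Two empty sets are treated as identical (a convention chosen here).
        return 1.0
    return len(a & b) / len(a | b)


def combined_similarity(structure_sim, style_sim, weight=0.5):
    """Weighted blend of a structural score and a style score.

    Hypothetical combination rule: the abstract does not say how the
    measures are combined, so a simple weighted sum is assumed.
    """
    if not 0.0 <= weight <= 1.0:
        raise ValueError("weight must be between 0 and 1")
    return weight * structure_sim + (1.0 - weight) * style_sim


# Hypothetical CSS class names collected from two pages.
page_a_classes = {"header", "nav", "article", "footer"}
page_b_classes = {"header", "nav", "sidebar", "footer"}

# Three shared classes out of five distinct classes overall.
style_sim = jaccard_similarity(page_a_classes, page_b_classes)

# structure_sim would come from a DOM comparison; 1.0 is a placeholder.
score = combined_similarity(structure_sim=1.0, style_sim=style_sim)
```

A pairwise score like this could feed any clustering method that accepts a similarity matrix; which algorithm the talk uses is not stated in the abstract.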

Speakers

Thamme Gowda

Graduate Student, University of Southern California
Thamme Gowda is a grad student at the Univ. of Southern California, Los Angeles, CA, and also an intern at NASA Jet Propulsion Laboratory, Pasadena, CA, USA. He is a co-founder of Datoin.com, a software as a service platform built using Hadoop and Spark. He is also a committer and...



Thursday May 12, 2016 4:40pm - 5:30pm
Regency B