Category Archives: big-data

Challenges in Large-Scale Web Crawling

Simple web crawling is easy, but once you start crawling several hundred million pages, a number of difficult challenges arise. Last Friday, I gave a talk at Berkeley on how to overcome some of the challenges of large-scale web crawling. Below are the slides from that talk. [...]
Also posted in crawling | 1 Comment

hector.rb: the pleasant JRuby Cassandra client (wraps Hector)

Hector is a Java Cassandra client. It’s a nice abstraction over making raw Thrift calls. Hector’s features include: an object-oriented way to interface with Cassandra, serialization helpers, failover support, connection pooling, and JMX support. There is already a Ruby cassandra gem, but it uses the Ruby Thrift bindings, which do not work well with JRuby. In [...]
Posted in big-data | 2 Comments

cascading-simhash: a library to cluster by minhashes in Hadoop

simhashing: Say you have a large corpus of web documents and you want to group them by some notion of “similarity”. For instance, you may want to detect plagiarism or find content that appears on multiple pages of a site. In this scenario, a pairwise comparison of all documents is impractical. Fortunately, [...]
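
As a rough illustration of how grouping can avoid pairwise comparison, here is a minimal single-machine sketch of the general minhash idea in Python. It is not the cascading-simhash implementation or its API: the function names, the use of MD5 as the seeded hash, the 3-word shingles, and the choice of 4 hashes are all assumptions made for this example. Each document is reduced to a short signature, and documents are grouped by signature, so no two documents are ever compared directly.

```python
import hashlib
from collections import defaultdict


def shingles(text, k=3):
    """Return the set of k-word shingles in text (hypothetical helper)."""
    words = text.lower().split()
    if len(words) < k:
        return {" ".join(words)}
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}


def _seeded_hash(seed, shingle):
    """Deterministic 64-bit hash of a shingle under a given seed (MD5 is an assumption)."""
    digest = hashlib.md5(f"{seed}:{shingle}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def minhash_signature(text, num_hashes=4, k=3):
    """One minimum hash value per seed, taken over the document's shingles."""
    doc_shingles = shingles(text, k)
    return tuple(
        min(_seeded_hash(seed, s) for s in doc_shingles)
        for seed in range(num_hashes)
    )


def cluster_by_signature(docs, num_hashes=4, k=3):
    """Group document ids whose full signatures match; one pass, no pairwise comparison."""
    groups = defaultdict(list)
    for doc_id, text in docs.items():
        groups[minhash_signature(text, num_hashes, k)].append(doc_id)
    return list(groups.values())
```

In this sketch, using more hashes per signature makes a match stricter, and fewer makes it looser. In a Hadoop setting the signature would presumably serve as the grouping key, but that mapping is an assumption here and not taken from the library.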
Also posted in programming | Leave a comment