Category Archives: big-data

Custom Hive UDFs in Clojure

Introduction: We process all of our web-crawl data in Hadoop. If I’m writing jobs that will only be run by my team, then Cascalog is my tool of choice. Unfortunately, not everyone is going to learn Cascalog (much less Cascading or Clojure). However, many people know a little SQL and the best tool for [...]
Also posted in crawling

Cascading, TF-IDF, and BufferedSum (Part 1)

Introduction: A common technique in MapReduce is to read a group of records, calculate a value from that group, and emit each record with the new value attached. While this is easy to do in raw MR jobs, the solution in Cascading is not obvious. This tutorial introduces a new operation to Cascading called [...]
Posted in big-data | 1 Comment

How to use a raw MapReduce job in Cascading

Cascading is a great abstraction over MapReduce. However, sometimes you may have code for an existing MapReduce job or want to drop directly to Hadoop for efficiency. Even if you’re using raw MapReduce jobs, Cascading can still be useful in planning the overall data pipeline. The code below is an example of how to use [...]
Posted in big-data | 1 Comment