Category Archives: big-data
Cascading, TF-IDF, and BufferedSum (Part 1)
Introduction: A common technique in MapReduce is to read a group of records, calculate a value from that group, and emit each record with the new value attached. While this is easy to do in raw MapReduce jobs, the equivalent solution in Cascading is less obvious. This tutorial introduces a new operation to Cascading called [...]
How to use a raw MapReduce job in Cascading
Cascading is a great abstraction over MapReduce. However, sometimes you may have code for an existing MapReduce job or want to drop directly to Hadoop for efficiency. Even if you’re using raw MapReduce jobs, Cascading can still be useful in planning the overall data pipeline. The code below is an example of how to use [...]
Custom Hive UDFs in Clojure