Tag Archives: hadoop

Cascading, TF-IDF, and BufferedSum (Part 1)

Introduction: A common technique in MapReduce is to read a group of records, calculate a value from that group, and emit each record with the new value attached. While this is easy to do in raw MR jobs, the solution in Cascading is not obvious. This tutorial introduces a new operation to Cascading called [...]
Posted in big-data | 1 Comment
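
The group-then-annotate pattern described in the excerpt above can be illustrated with a minimal sketch in plain Python. This is not the Cascading operation the post introduces, and it uses no Hadoop or Cascading API; the function name `attach_group_sum` and the sample records are hypothetical, chosen only to show the idea of buffering a group, computing one value from it, and emitting every record with that value attached.

```python
from collections import defaultdict


def attach_group_sum(records):
    """Return each (key, value) record with its group's total appended.

    Hypothetical helper for illustration only; it mimics what a reducer
    would do for one key at a time, but runs in memory over all keys.
    """
    # Buffer the values of every group, keyed by the grouping field.
    groups = defaultdict(list)
    for key, value in records:
        groups[key].append(value)

    annotated = []
    for key, values in groups.items():
        # Calculate a single value from the whole group...
        total = sum(values)
        # ...then emit each buffered record with that value attached.
        for value in values:
            annotated.append((key, value, total))
    return annotated


# Assumed sample input: (document id, term count) pairs.
records = [("doc1", 2), ("doc1", 3), ("doc2", 5)]
annotated = attach_group_sum(records)
```

The sketch buffers each group in memory before emitting anything, which is the reason the pattern is straightforward in a raw reducer but needs a buffering operation in a higher-level pipeline.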

How to use a raw MapReduce job in Cascading

Cascading is a great abstraction over MapReduce. However, sometimes you may have code for an existing MapReduce job or want to drop directly to Hadoop for efficiency. Even if you’re using raw MapReduce jobs, Cascading can still be useful in planning the overall data pipeline. The code below is an example of how to use [...]
Posted in big-data | 1 Comment

“Easily” set up a monitored Hadoop / Hive Cluster in EC2 with PoolParty

Summary: Setting up a scalable Hadoop cluster isn’t easy, but PoolParty makes it easier and more manageable. By the time we’re done with this tutorial, you’ll have a Hadoop cluster consisting of one master node and two slaves. The slaves are formatted with HDFS and process MapReduce jobs delegated to them by the master. [...]
Posted in cloud-computing | 1 Comment