<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Eigenjoy &#187; programming</title>
	<atom:link href="http://eigenjoy.com/category/programming/feed/" rel="self" type="application/rss+xml" />
	<link>http://eigenjoy.com</link>
	<description>a programming blog</description>
	<lastBuildDate>Wed, 14 Dec 2011 18:22:45 +0000</lastBuildDate>
	
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
			<item>
		<title>Understanding run n with conde in the Reasoned Schemer</title>
		<link>http://eigenjoy.com/2011/12/11/understanding-run-n-with-conde-in-the-reasoned-schemer/</link>
		<comments>http://eigenjoy.com/2011/12/11/understanding-run-n-with-conde-in-the-reasoned-schemer/#comments</comments>
		<pubDate>Sun, 11 Dec 2011 21:13:38 +0000</pubDate>
		<dc:creator>Nate Murray</dc:creator>
				<category><![CDATA[programming]]></category>

		<guid isPermaLink="false">http://eigenjoy.com/?p=567</guid>
		<description><![CDATA[I&#8217;m working through The Reasoned Schemer and had some trouble seeing clearly how results are returned when using run n. After Googling a bit, I found a clear walkthrough of the logic on an obscure message board. I&#8217;m reposting here to make sure it doesn&#8217;t disappear.
The key insight is that the n in run n [...]]]></description>
			<content:encoded><![CDATA[<div id="attachment_568" class="wp-caption alignleft" style="width: 235px"><a href="http://eigenjoy.com/wp-content/uploads/2011/12/mO1ZqWZT.jpeg"><img src="http://eigenjoy.com/wp-content/uploads/2011/12/mO1ZqWZT-225x300.jpg" alt="The Reasoned Schemer" title="The Reasoned Schemer" width="225" height="300" class="size-medium wp-image-568" /></a><p class="wp-caption-text">The Reasoned Schemer</p></div>
<p>I&#8217;m working through <a href="http://mitpress.mit.edu/catalog/item/default.asp?ttype=2&#038;tid=10663">The Reasoned Schemer</a> and had some trouble seeing clearly how results are returned when using <code>run n</code>. After Googling a bit, I found a clear walkthrough of the logic on <a href="http://www.groupsrv.com/computers/about538892.html">an obscure message board</a>. I&#8217;m reposting here to make sure it doesn&#8217;t disappear.</p>
<p>The key insight is that the <code>n</code> in <code>run n</code> refers not to the number of <em>goals</em> that should be tried, but rather the number of <em>answers</em> that should be returned, if available. Full text below:</p>
<blockquote><p>
Posted: Mon Jan 14, 2008 2:19 pm	</p>
<p>Hi all, </p>
<p>I&#8217;m currently re-reading the Reasoned Schemer, but am having a bit of<br />
difficulty understanding how values are computed for recursive<br />
functions. For example, in chapter 2, the recursive function lolo and<br />
section 24 asks what is the value of the following: </p>
<p>(run 5 (x)<br />
(lolo ((a b) (c d) . x))) </p>
<p>The first value (), is obvious to me. The variable x is fresh, so it<br />
is associated with () via the nullo check on line 4 of the method. </p>
<p>Here&#8217;s my explanation for how the second value is derived. Since we<br />
are asking for another value, and the expressions are evaluated within<br />
the conde, we refresh x and try the next line. At this point, the caro<br />
associates the fresh variable x with the cons of the fresh variable a<br />
and the fresh variable d-prime (introduced by the caro). Then, listo<br />
is called on a. Since a is fresh, this call associates a with ().<br />
Since this question succeeds, we try the answer. The cdro associates x<br />
with the cons of fresh variable a-prime (introduced by the cdro) and<br />
the fresh variable d. Then, lolo is called on d, and so d (being<br />
fresh), is associated with (). </p>
<p>Since a-prime and a co-share, and a is (), and d and d-prime co-share,<br />
and d is (), x can be successfully associated with the cons of () and<br />
(), so the result is (()). </p>
<p>Here&#8217;s where I get confused. This invocation of run 2 has actually<br />
produced #s 3 times, once to produce () and twice to produce (())<br />
(once in each of the calls to listo and the recursive call to lolo).<br />
However, we have *asked* for only two goals to be met, in order to<br />
produce two values. If I invoke &#8220;run 3 &#8230;,&#8221; however, which conde<br />
lines in which recursive frames are being evaluated, and having their<br />
associations preserved? Or rather, how is the result expanding beyond<br />
(())? </p>
<p>At the high level, I understand why the answer is (() (()) (() ()) (()<br />
() ()) (() () () ())). But I&#8217;m still missing something at the lower<br />
level that makes it harder for me to understand how some of the more<br />
advanced examples, like the adder, work. </p>
<p>Thanks,<br />
Joe<br />
=====================</p>
<p>Posted: Tue Jan 15, 2008 3:30 pm</p>
<p>(run 2 (q) g1 g2 g3) does *not* specify that at most two goal<br />
invocations should succeed, or that at most two #s goals should be<br />
tried. Rather, run 2 specifies that we want two answers (if there are<br />
two answers to be had), regardless of how many goals must be tried,<br />
must succeed, or must fail in order to get those answers. </p>
<p>Hope this helps. </p>
<p>&#8211;Will
</p>
</blockquote>
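<p>To make the distinction concrete, here is a minimal C++ sketch. It is only an analogy, not how miniKanren implements <code>run</code>: the names <code>run_n</code>, <code>RunResult</code> and <code>goals_tried</code> are made up for illustration, and a plain list of candidates stands in for the search. The function keeps trying candidates until it has collected n answers or runs out, and counts how many it had to try along the way.</p>

```cpp
#include <cstddef>
#include <functional>
#include <vector>

// Hypothetical result type: the answers found, plus how much work it took.
struct RunResult {
    std::vector<int> answers;   // candidates that satisfied the goal
    std::size_t goals_tried;    // how many candidates were tested
};

// Collect up to n answers. The limit n applies to the answers returned,
// not to the number of candidates that are tried.
RunResult run_n(std::size_t n, const std::vector<int>& candidates,
                const std::function<bool(int)>& goal)
{
    RunResult result{{}, 0};
    for (int candidate : candidates) {
        if (result.answers.size() == n) break;  // enough answers, stop searching
        ++result.goals_tried;
        if (goal(candidate)) result.answers.push_back(candidate);
    }
    return result;
}
```

<p>Asking for 2 even numbers from the list 1 to 8 tries four candidates to produce two answers. Asking for 5 tries all eight and returns only the four answers that exist, which is the &#8220;if available&#8221; part.</p>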
]]></content:encoded>
			<wfw:commentRss>http://eigenjoy.com/2011/12/11/understanding-run-n-with-conde-in-the-reasoned-schemer/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Binary Search Revisited</title>
		<link>http://eigenjoy.com/2011/09/09/binary-search-revisited/</link>
		<comments>http://eigenjoy.com/2011/09/09/binary-search-revisited/#comments</comments>
		<pubDate>Fri, 09 Sep 2011 17:11:08 +0000</pubDate>
		<dc:creator>Matt Pulver</dc:creator>
				<category><![CDATA[code]]></category>
		<category><![CDATA[programming]]></category>
		<category><![CDATA[C++]]></category>

		<guid isPermaLink="false">http://www.xcombinator.com/?p=506</guid>
		<description><![CDATA[As ubiquitous as it is, the standard binary search algorithm can still be further optimized by using bit operations to iterate through its sorted list, in place of arithmetic. Admittedly, this is primarily an academic discussion, since the code improvement does not decrease the logarithmic complexity of the standard algorithm.  Nevertheless, a well-developed programming [...]]]></description>
			<content:encoded><![CDATA[<p>As ubiquitous as it is, the standard binary search algorithm can still be further optimized by using bit operations to iterate through its sorted list, in place of arithmetic. Admittedly, this is primarily an academic discussion, since the code improvement does not decrease the logarithmic complexity of the standard algorithm.  Nevertheless, a well-developed programming intuition should by default implement (or at a minimum consider) a solution similar to the one presented here, prior to &#8220;the obvious&#8221; standard arithmetic solution.</p>
<p>Here&#8217;s an example. We want to find the largest index i such that haystack[i] &lt;= needle. Needle = 15, and haystack is a sorted list of the first 8 prime numbers, indexed by binary numbers:</p>
<p><code>000</code> &nbsp; &nbsp; 2<br />
<code>001</code> &nbsp; &nbsp; 3<br />
<code>010</code> &nbsp; &nbsp; 5<br />
<code>011</code> &nbsp; &nbsp; 7<br />
<code>100</code> &nbsp; &nbsp; 11<br />
<code>101</code> &nbsp; &nbsp; 13<br />
<code>110</code> &nbsp; &nbsp; 17<br />
<code>111</code> &nbsp; &nbsp; 19</p>
<p>Note first that the haystack index requires no more than 3 bits.  Therefore we start with the highest order bit set: b=<code>100</code>. (<code>This</code> <code>font</code> denotes binary numbers here.) Is haystack[<code>100</code>] &lt;= 15?  Yes, 11 &lt;= 15. Observe this means that the first bit of the index we are looking for is <code>1</code>. We look at the next bit. Is haystack[<code>110</code>] &lt;= 15? No, 17 &lt;= 15 is false. This means the 2nd bit is <code>0</code>. Finally, is haystack[<code>101</code>] &lt;= 15? Yes, 13 &lt;= 15. Therefore the index we are looking for is <code>101</code>.</p>
<p>In essence, the main loop to find the index i is simply:</p>

<div class="wp_syntax"><div class="code"><pre class="cpp" style="font-family:monospace;">    <span style="color: #0000ff;">for</span><span style="color: #008000;">&#40;</span> <span style="color: #008080;">;</span> b <span style="color: #008080;">;</span> b <span style="color: #000080;">&gt;&gt;=</span> <span style="color: #0000dd;">1</span> <span style="color: #008000;">&#41;</span>
        <span style="color: #0000ff;">if</span><span style="color: #008000;">&#40;</span> haystack<span style="color: #008000;">&#91;</span>i<span style="color: #000040;">|</span>b<span style="color: #008000;">&#93;</span> <span style="color: #000080;">&lt;=</span> needle <span style="color: #008000;">&#41;</span> i <span style="color: #000040;">|</span><span style="color: #000080;">=</span> b<span style="color: #008080;">;</span></pre></div></div>

<p>where b always has exactly one bit set and starts at b=<code>100</code>, and i starts at <code>0</code>.</p>
<p><em>It is the natural structure of the bits within the index variable that automatically tracks both the upper and lower bounds of the search window for each iteration.</em> Compare this with the less efficient and more verbose arithmetic performed in the main loops of standard binary search algorithms.</p>
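<p>For comparison, here is a sketch of the conventional arithmetic version with the same contract (greatest index i with haystack[i] &lt;= needle). The name <code>bsearch_arith</code> is made up for this illustration. Note the two explicit bounds and the midpoint calculation that the bit-based loop does not need.</p>

```cpp
// Conventional arithmetic binary search, for comparison only.
// Returns the greatest index i with haystack[i] <= needle,
// or unsigned(-1) if no such index exists.
template <typename T>
unsigned bsearch_arith(const T haystack[], unsigned haystack_size,
                       const T& needle)
{
    unsigned lo = 0;              // every index below lo holds a value <= needle
    unsigned hi = haystack_size;  // every index at or above hi holds a value > needle
    while (lo < hi) {
        unsigned mid = lo + (hi - lo) / 2;  // midpoint without overflow
        if (haystack[mid] <= needle) lo = mid + 1;
        else hi = mid;
    }
    return lo ? lo - 1 : unsigned(-1);
}
```

<p>With the eight primes above and needle = 15 this visits indices 4, 6 and 5, the same three probes as the bit-based walk, and returns 5.</p>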
<p>The full C++ code for the improved binary search algorithm fbsearch() is:</p>

<div class="wp_syntax"><div class="code"><pre class="cpp" style="font-family:monospace;"><span style="color: #666666;">// Binary search revisited.</span>
&nbsp;
<span style="color: #666666;">// Define only one of these:</span>
<span style="color: #666666;">// #define SETBIT_FAST</span>
<span style="color: #666666;">// #define SETBIT_FASTER</span>
<span style="color: #339900;">#define SETBIT_FASTEST</span>
&nbsp;
<span style="color: #339900;">#ifdef SETBIT_FAST</span>
<span style="color: #339900;">#include &lt;math.h&gt;</span>
<span style="color: #339900;">#endif</span>
&nbsp;
<span style="color: #666666;">// Return 1 &lt;&lt; log_2(list_size-1), or 0 if list_size == 1.</span>
<span style="color: #666666;">// This sets the initial value of b in fbsearch().</span>
<span style="color: #0000ff;">inline</span> <span style="color: #0000ff;">unsigned</span> init_bit<span style="color: #008000;">&#40;</span> <span style="color: #0000ff;">unsigned</span> list_size <span style="color: #008000;">&#41;</span>
<span style="color: #008000;">&#123;</span>
<span style="color: #339900;">#ifdef SETBIT_FAST</span>
    <span style="color: #0000ff;">return</span> list_size <span style="color: #000080;">==</span> <span style="color: #0000dd;">1</span> <span style="color: #008080;">?</span> <span style="color: #0000dd;">0</span> <span style="color: #008080;">:</span>
        <span style="color: #0000dd;">1</span> <span style="color: #000080;">&lt;&lt;</span> <span style="color: #0000ff;">int</span><span style="color: #008000;">&#40;</span> <span style="color: #0000dd;">log</span><span style="color: #008000;">&#40;</span>list_size<span style="color: #000040;">-</span><span style="color: #0000dd;">1</span><span style="color: #008000;">&#41;</span> <span style="color: #000040;">/</span> M_LN2 <span style="color: #008000;">&#41;</span><span style="color: #008080;">;</span>
<span style="color: #339900;">#endif</span>
<span style="color: #339900;">#ifdef SETBIT_FASTER</span>
    <span style="color: #0000ff;">return</span> list_size <span style="color: #000080;">==</span> <span style="color: #0000dd;">1</span> <span style="color: #008080;">?</span> <span style="color: #0000dd;">0</span> <span style="color: #008080;">:</span>
        <span style="color: #0000dd;">1</span> <span style="color: #000080;">&lt;&lt;</span> <span style="color: #008000;">&#40;</span> <span style="color: #0000dd;">sizeof</span><span style="color: #008000;">&#40;</span><span style="color: #0000ff;">unsigned</span><span style="color: #008000;">&#41;</span> <span style="color: #000080;">&lt;&lt;</span> <span style="color: #0000dd;">3</span> <span style="color: #008000;">&#41;</span> <span style="color: #000040;">-</span> <span style="color: #0000dd;">1</span>
             <span style="color: #000040;">-</span> __builtin_clz<span style="color: #008000;">&#40;</span>list_size<span style="color: #000040;">-</span><span style="color: #0000dd;">1</span><span style="color: #008000;">&#41;</span><span style="color: #008080;">;</span>
<span style="color: #339900;">#endif</span>
<span style="color: #339900;">#ifdef SETBIT_FASTEST</span>
    <span style="color: #0000ff;">unsigned</span> b<span style="color: #008080;">;</span>
    __asm__ <span style="color: #008000;">&#40;</span> <span style="color: #FF0000;">&quot;decl %%eax;&quot;</span>
              <span style="color: #FF0000;">&quot;je 1f;&quot;</span> <span style="color: #666666;">// local label, safe if this function is inlined more than once</span>
              <span style="color: #FF0000;">&quot;bsrl %%eax, %%ecx;&quot;</span> <span style="color: #666666;">// BSR - Bit Scan Reverse (386+)</span>
              <span style="color: #FF0000;">&quot;movl $1, %%eax;&quot;</span>
              <span style="color: #FF0000;">&quot;shll %%cl, %%eax;&quot;</span>
              <span style="color: #FF0000;">&quot;1:&quot;</span> <span style="color: #008080;">:</span> <span style="color: #FF0000;">&quot;=a&quot;</span> <span style="color: #008000;">&#40;</span>b<span style="color: #008000;">&#41;</span> <span style="color: #008080;">:</span> <span style="color: #FF0000;">&quot;a&quot;</span> <span style="color: #008000;">&#40;</span>list_size<span style="color: #008000;">&#41;</span>
              <span style="color: #008080;">:</span> <span style="color: #FF0000;">&quot;ecx&quot;</span>, <span style="color: #FF0000;">&quot;cc&quot;</span> <span style="color: #666666;">// ecx and the flags are modified by the asm</span>
    <span style="color: #008000;">&#41;</span><span style="color: #008080;">;</span>
    <span style="color: #0000ff;">return</span> b<span style="color: #008080;">;</span>
<span style="color: #339900;">#endif</span>
<span style="color: #008000;">&#125;</span>
&nbsp;
<span style="color: #666666;">// Return the greatest unsigned i where haystack[i] &lt;= needle.</span>
<span style="color: #666666;">// If i does not exist (haystack is empty, or needle &lt; haystack[0])</span>
<span style="color: #666666;">// then return unsigned(-1). T can be any type for which the binary</span>
<span style="color: #666666;">// operator &lt;= is defined.</span>
<span style="color: #0000ff;">template</span> <span style="color: #000080;">&lt;</span><span style="color: #0000ff;">typename</span> T<span style="color: #000080;">&gt;</span>
<span style="color: #0000ff;">unsigned</span> fbsearch<span style="color: #008000;">&#40;</span> <span style="color: #0000ff;">const</span> T haystack<span style="color: #008000;">&#91;</span><span style="color: #008000;">&#93;</span>, <span style="color: #0000ff;">unsigned</span> haystack_size,
                   <span style="color: #0000ff;">const</span> T<span style="color: #000040;">&amp;</span> needle <span style="color: #008000;">&#41;</span>
<span style="color: #008000;">&#123;</span>
    <span style="color: #0000ff;">if</span><span style="color: #008000;">&#40;</span> haystack_size <span style="color: #000080;">==</span> <span style="color: #0000dd;">0</span> <span style="color: #008000;">&#41;</span> <span style="color: #0000ff;">return</span> <span style="color: #0000ff;">unsigned</span><span style="color: #008000;">&#40;</span><span style="color: #000040;">-</span><span style="color: #0000dd;">1</span><span style="color: #008000;">&#41;</span><span style="color: #008080;">;</span>
    <span style="color: #0000ff;">unsigned</span> i <span style="color: #000080;">=</span> <span style="color: #0000dd;">0</span><span style="color: #008080;">;</span>
    <span style="color: #0000ff;">for</span><span style="color: #008000;">&#40;</span> <span style="color: #0000ff;">unsigned</span> b <span style="color: #000080;">=</span> init_bit<span style="color: #008000;">&#40;</span>haystack_size<span style="color: #008000;">&#41;</span> <span style="color: #008080;">;</span> b <span style="color: #008080;">;</span> b <span style="color: #000080;">&gt;&gt;=</span> <span style="color: #0000dd;">1</span> <span style="color: #008000;">&#41;</span>
    <span style="color: #008000;">&#123;</span>
        <span style="color: #0000ff;">unsigned</span> j <span style="color: #000080;">=</span> i <span style="color: #000040;">|</span> b<span style="color: #008080;">;</span>
        <span style="color: #0000ff;">if</span><span style="color: #008000;">&#40;</span> haystack_size <span style="color: #000080;">&lt;=</span> j <span style="color: #008000;">&#41;</span> <span style="color: #0000ff;">continue</span><span style="color: #008080;">;</span>
        <span style="color: #0000ff;">if</span><span style="color: #008000;">&#40;</span> haystack<span style="color: #008000;">&#91;</span>j<span style="color: #008000;">&#93;</span> <span style="color: #000080;">&lt;=</span> needle <span style="color: #008000;">&#41;</span> i <span style="color: #000080;">=</span> j<span style="color: #008080;">;</span>
        <span style="color: #0000ff;">else</span>
        <span style="color: #008000;">&#123;</span>
            <span style="color: #0000ff;">for</span><span style="color: #008000;">&#40;</span> b <span style="color: #000080;">&gt;&gt;=</span> <span style="color: #0000dd;">1</span> <span style="color: #008080;">;</span> b <span style="color: #008080;">;</span> b <span style="color: #000080;">&gt;&gt;=</span> <span style="color: #0000dd;">1</span> <span style="color: #008000;">&#41;</span>
                <span style="color: #0000ff;">if</span><span style="color: #008000;">&#40;</span> haystack<span style="color: #008000;">&#91;</span>i<span style="color: #000040;">|</span>b<span style="color: #008000;">&#93;</span> <span style="color: #000080;">&lt;=</span> needle <span style="color: #008000;">&#41;</span> i <span style="color: #000040;">|</span><span style="color: #000080;">=</span> b<span style="color: #008080;">;</span>
            <span style="color: #0000ff;">break</span><span style="color: #008080;">;</span>
        <span style="color: #008000;">&#125;</span>
    <span style="color: #008000;">&#125;</span>
    <span style="color: #0000ff;">return</span> i <span style="color: #000040;">||</span> <span style="color: #000040;">*</span>haystack <span style="color: #000080;">&lt;=</span> needle <span style="color: #008080;">?</span> i <span style="color: #008080;">:</span> <span style="color: #0000ff;">unsigned</span><span style="color: #008000;">&#40;</span><span style="color: #000040;">-</span><span style="color: #0000dd;">1</span><span style="color: #008000;">&#41;</span><span style="color: #008080;">;</span>
<span style="color: #008000;">&#125;</span>
&nbsp;
<span style="color: #666666;">// Example Usage</span>
<span style="color: #339900;">#include &lt;iostream&gt;</span>
<span style="color: #0000ff;">using</span> <span style="color: #0000ff;">namespace</span> std<span style="color: #008080;">;</span>
&nbsp;
<span style="color: #0000ff;">int</span> main<span style="color: #008000;">&#40;</span><span style="color: #008000;">&#41;</span>
<span style="color: #008000;">&#123;</span>
    <span style="color: #0000ff;">const</span> <span style="color: #0000ff;">int</span> sorted_list<span style="color: #008000;">&#91;</span><span style="color: #008000;">&#93;</span> <span style="color: #000080;">=</span> <span style="color: #008000;">&#123;</span> <span style="color: #0000dd;">2</span>, <span style="color: #0000dd;">3</span>, <span style="color: #0000dd;">5</span>, <span style="color: #0000dd;">7</span>, <span style="color: #0000dd;">11</span>, <span style="color: #0000dd;">13</span>, <span style="color: #0000dd;">17</span>, <span style="color: #0000dd;">19</span>, <span style="color: #0000dd;">23</span> <span style="color: #008000;">&#125;</span><span style="color: #008080;">;</span>
    <span style="color: #0000ff;">const</span> <span style="color: #0000ff;">unsigned</span> list_size <span style="color: #000080;">=</span> <span style="color: #0000dd;">sizeof</span><span style="color: #008000;">&#40;</span>sorted_list<span style="color: #008000;">&#41;</span><span style="color: #000040;">/</span><span style="color: #0000dd;">sizeof</span><span style="color: #008000;">&#40;</span><span style="color: #0000ff;">int</span><span style="color: #008000;">&#41;</span><span style="color: #008080;">;</span>
    <span style="color: #0000ff;">int</span> needle <span style="color: #000080;">=</span> <span style="color: #0000dd;">15</span><span style="color: #008080;">;</span>
    <span style="color: #0000dd;">cout</span> <span style="color: #000080;">&lt;&lt;</span> <span style="color: #FF0000;">&quot;fbsearch(sorted_list,&quot;</span><span style="color: #000080;">&lt;&lt;</span>list_size<span style="color: #000080;">&lt;&lt;</span><span style="color: #FF0000;">','</span><span style="color: #000080;">&lt;&lt;</span>needle<span style="color: #000080;">&lt;&lt;</span><span style="color: #FF0000;">&quot;) = &quot;</span>
         <span style="color: #000080;">&lt;&lt;</span> fbsearch<span style="color: #008000;">&#40;</span>sorted_list,list_size,needle<span style="color: #008000;">&#41;</span> <span style="color: #000080;">&lt;&lt;</span> endl<span style="color: #008080;">;</span>
    <span style="color: #666666;">// fbsearch(sorted_list,9,15) = 5</span>
    <span style="color: #0000ff;">return</span> <span style="color: #0000dd;">0</span><span style="color: #008080;">;</span>
<span style="color: #008000;">&#125;</span></pre></div></div>

<p>Ancillary notes:</p>
<ul>
<li>The function init_bit() sets the initial value of b to a single bit, in the highest position needed to index the array. The three SETBIT_FAST, SETBIT_FASTER, and SETBIT_FASTEST code blocks within it are simply 3 different ways of calculating that value, and their speed ranking is the reverse of their portability. The first method is the most portable and calculates it from a base-2 logarithm (M_LN2 is a POSIX addition to math.h rather than part of standard C++, so some compilers need an extra define to expose it). The second method uses the faster GCC builtin __builtin_clz (thanks Chao Xu for the suggestion), which counts the zero bits above the highest set bit. The third and fastest method uses the x86 assembly instruction BSR, which returns the position of the highest set bit directly.</li>
<li>The null return value of unsigned(-1) (=4294967295 for 32-bit unsigned) is the unique value that will never otherwise be returned. Therefore it can safely be checked for by the calling function without risk of misinterpretation.</li>
<li>The one <code lang="cpp">else</code> block can safely be removed without changing the overall functionality. It exists only to avoid checking that the array index is in bounds on every iteration: once a comparison at an in-bounds index j has failed, every index tried afterwards is smaller than j, so it is guaranteed to be in bounds.</li>
</ul>
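<p>If none of the three variants suits your toolchain, a plain shift loop is a fourth option. This is a sketch of my own rather than part of the code above, and the name <code>init_bit_portable</code> is made up. It needs no math library, compiler builtin or assembly, at the cost of a loop that runs once per bit of the largest index.</p>

```cpp
// Portable alternative to init_bit(): returns the highest set bit of
// list_size - 1, or 0 if list_size is 0 or 1.
inline unsigned init_bit_portable(unsigned list_size)
{
    if (list_size <= 1) return 0;
    unsigned max_index = list_size - 1;
    unsigned b = 1;
    while (max_index >>= 1) b <<= 1;  // shift b up once per remaining bit
    return b;
}
```

<p>For the 9-element example list the largest index is 8 (<code>1000</code>), so this returns 8. For an 8-element list the largest index is 7 (<code>111</code>), so it returns 4.</p>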
]]></content:encoded>
			<wfw:commentRss>http://eigenjoy.com/2011/09/09/binary-search-revisited/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>cascading-simhash a library to cluster by minhashes in Hadoop</title>
		<link>http://eigenjoy.com/2011/05/09/cascading-simhash-a-library-to-cluster-by-minhashes-in-hadoop/</link>
		<comments>http://eigenjoy.com/2011/05/09/cascading-simhash-a-library-to-cluster-by-minhashes-in-hadoop/#comments</comments>
		<pubDate>Mon, 09 May 2011 22:21:11 +0000</pubDate>
		<dc:creator>Nate Murray</dc:creator>
				<category><![CDATA[big-data]]></category>
		<category><![CDATA[programming]]></category>

		<guid isPermaLink="false">http://www.xcombinator.com/?p=455</guid>
		<description><![CDATA[simhashing
Say you have a large corpus of web documents and you want to group them together by some notion of &#8220;similarity&#8221;. For instance, we may want to detect plagiarism or find content that appears on multiple pages of a site.  
In this scenario, it&#8217;s impractical to do a pairwise comparison of all documents. Fortunately, [...]]]></description>
			<content:encoded><![CDATA[<h2>simhashing</h2>
<p>Say you have a large corpus of web documents and you want to group them together by some notion of &#8220;similarity&#8221;. For instance, we may want to detect plagiarism or find content that appears on multiple pages of a site.  </p>
<p>In this scenario, it&#8217;s impractical to do a pairwise comparison of all documents. Fortunately, we can use <em>simhashing</em>. </p>
<p>Broadly speaking, simhashing is an algorithm that calculates a &#8220;cluster id&#8221; (the minimum hash, or <em>minhash</em>) from the content. Because the minhash for an item is calculated independently of the other items in the set, minhashing is an ideal candidate for MapReduce. </p>
<p>Ryan Moulton has written a <a href="http://knol.google.com/k/simple-simhashing">wonderful article on Simhashing</a>. I&#8217;m not going to repeat his content here, so if you&#8217;re unfamiliar with simhashing I encourage you to go and read his article first.</p>
<p>In his article, Ryan sketches the proof that the probability that any two sets (in this case, documents) share the same minhash is equal to their <a href="http://en.wikipedia.org/wiki/Jaccard_index">Jaccard similarity coefficient</a>. This is a really neat result because we are able to get the Jaccard index without having to actually compare the intersection of the two sets directly. </p>
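<p>To make that result concrete, here is a minimal sketch that compares the exact Jaccard index of two token sets with the fraction of seeded hash functions under which both sets share the same minhash. The class, the method names and the toy hash family are assumptions made for illustration; they are not taken from Ryan&#8217;s article or from any library. With enough seeds the estimate should land in the neighbourhood of the exact value, though a toy hash family like this one makes no guarantees.</p>

```java
import java.util.Arrays;

// Illustrative sketch only: class, method names, and the hash are hypothetical.
final class MinhashSketch {

    // Seeded integer hash: XOR the seed into String.hashCode(), then apply a
    // bit-mixing step so that different seeds order the tokens differently.
    static int seededHash(String token, int seed) {
        int h = token.hashCode() ^ seed;
        h ^= (h >>> 16);
        h *= 0x85ebca6b;
        h ^= (h >>> 13);
        h *= 0xc2b2ae35;
        h ^= (h >>> 16);
        return h;
    }

    // The minhash of a set is the smallest hash over all of its tokens.
    static int minhash(String[] tokens, int seed) {
        int min = Integer.MAX_VALUE;
        for (String t : tokens) {
            min = Math.min(min, seededHash(t, seed));
        }
        return min;
    }

    // Exact Jaccard index: size of the intersection divided by size of the union.
    static double jaccard(String[] a, String[] b) {
        String[] sa = Arrays.stream(a).distinct().sorted().toArray(String[]::new);
        String[] sb = Arrays.stream(b).distinct().sorted().toArray(String[]::new);
        int shared = 0;
        for (String t : sa) {
            if (Arrays.binarySearch(sb, t) >= 0) shared++;
        }
        return shared / (double) (sa.length + sb.length - shared);
    }

    // Fraction of seeds for which both sets produce the same minhash.
    static double estimate(String[] a, String[] b, int seeds) {
        int matches = 0;
        for (int seed = 0; seed < seeds; seed++) {
            if (minhash(a, seed) == minhash(b, seed)) matches++;
        }
        return matches / (double) seeds;
    }

    public static void main(String[] args) {
        String[] docA = "my dog has fleas".split(" ");
        String[] docC = "my dog has hair".split(" ");
        System.out.println("jaccard  = " + jaccard(docA, docC));
        System.out.println("estimate = " + estimate(docA, docC, 1000));
    }
}
```

<p>Note that <tt>estimate</tt> never intersects the two sets; it only compares one integer per seed, which is what makes the approach attractive for large corpora.</p>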
<h2>cascading-simhash</h2>
<p>I&#8217;ve created a <a href="https://github.com/jashmenn/cascading-simhash">library for calculating simhashes in Hadoop</a>. It&#8217;s written in Clojure and Java and uses Cascalog and Cascading. </p>
<p>To use it, you 1) supply input tuples of the form <tt>(document_id, body)</tt> and 2) define how to tokenize your <tt>body</tt>. The job emits tuples of the form <tt>(minhash, document_id, body)</tt>. You can then use <tt>minhash</tt> as the key for your next phase. (All records that share a <tt>minhash</tt> are potential duplicates.) </p>
<p>The library can be called from either Clojure or Java. Additionally, <tt>Simhash</tt> returns a <tt><a href="http://www.cascading.org/javadoc/cascading/flow/Flow.html">Flow</a></tt> so you can use it in your <tt>Cascade</tt> if you want to make it part of a bigger pipeline.</p>
<h2>A Java example</h2>
<p>Here&#8217;s a quick example of how to use the library from Java:</p>

<div class="wp_syntax"><div class="code"><pre class="java" style="font-family:monospace;"><span style="color: #008000; font-style: italic; font-weight: bold;">/**
 * Simple Simhash - an example of how to use Simhash
 *
 * To run this example:
 *   lein uberjar
 *   lein classpath &gt; classpath
 *   java -cp `cat classpath`:build/cascading-simhash-1.0.0-SNAPSHOT-standalone.jar simhash.examples.SimpleSimhash &quot;test-resources/test-documents.txt&quot;
 **/</span>
<span style="color: #000000; font-weight: bold;">public</span> <span style="color: #000000; font-weight: bold;">class</span> SimpleSimhash <span style="color: #009900;">&#123;</span>
  <span style="color: #000000; font-weight: bold;">private</span> <span style="color: #000000; font-weight: bold;">static</span> <span style="color: #000000; font-weight: bold;">final</span> Logger LOG <span style="color: #339933;">=</span> Logger.<span style="color: #006633;">getLogger</span><span style="color: #009900;">&#40;</span> SimpleSimhash.<span style="color: #000000; font-weight: bold;">class</span> <span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
&nbsp;
  <span style="color: #008000; font-style: italic; font-weight: bold;">/**
   * Create a tokenizer that is a subclass of clojure.lang.AFn and
   * implements invoke(Object body)
   **/</span>
  <span style="color: #000000; font-weight: bold;">public</span> <span style="color: #000000; font-weight: bold;">static</span> <span style="color: #000000; font-weight: bold;">class</span> Tokenizer <span style="color: #000000; font-weight: bold;">extends</span> AFn <span style="color: #009900;">&#123;</span>
&nbsp;
    <span style="color: #008000; font-style: italic; font-weight: bold;">/**
     * Your tokenization logic goes here
     *
     * @param String body
     * @return something seq-able
     */</span>
    <span style="color: #000000; font-weight: bold;">public</span> <span style="color: #003399;">Object</span> invoke<span style="color: #009900;">&#40;</span><span style="color: #003399;">Object</span> body<span style="color: #009900;">&#41;</span> <span style="color: #000000; font-weight: bold;">throws</span> <span style="color: #003399;">Exception</span> <span style="color: #009900;">&#123;</span>
      <span style="color: #003399;">String</span> b <span style="color: #339933;">=</span> <span style="color: #009900;">&#40;</span><span style="color: #003399;">String</span><span style="color: #009900;">&#41;</span>body<span style="color: #339933;">;</span>
      <span style="color: #000000; font-weight: bold;">return</span> b.<span style="color: #006633;">split</span><span style="color: #009900;">&#40;</span><span style="color: #0000ff;">&quot; &quot;</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
    <span style="color: #009900;">&#125;</span>
  <span style="color: #009900;">&#125;</span>
&nbsp;
  <span style="color: #000000; font-weight: bold;">public</span> <span style="color: #000000; font-weight: bold;">static</span> <span style="color: #000066; font-weight: bold;">void</span> main<span style="color: #009900;">&#40;</span> <span style="color: #003399;">String</span><span style="color: #009900;">&#91;</span><span style="color: #009900;">&#93;</span> args <span style="color: #009900;">&#41;</span> <span style="color: #009900;">&#123;</span>
    Tap inputTap <span style="color: #339933;">=</span> <span style="color: #000000; font-weight: bold;">new</span> Hfs<span style="color: #009900;">&#40;</span> <span style="color: #000000; font-weight: bold;">new</span> TextDelimited<span style="color: #009900;">&#40;</span> 
                                <span style="color: #000000; font-weight: bold;">new</span> Fields<span style="color: #009900;">&#40;</span><span style="color: #0000ff;">&quot;docid&quot;</span>, <span style="color: #0000ff;">&quot;body&quot;</span><span style="color: #009900;">&#41;</span>, <span style="color: #0000ff;">&quot;<span style="color: #000099; font-weight: bold;">\t</span>&quot;</span> <span style="color: #009900;">&#41;</span>,
                            args<span style="color: #009900;">&#91;</span><span style="color: #cc66cc;">0</span><span style="color: #009900;">&#93;</span> <span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
    Tap outputTap <span style="color: #339933;">=</span> <span style="color: #000000; font-weight: bold;">new</span> StdoutTap<span style="color: #009900;">&#40;</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
&nbsp;
    <span style="color: #666666; font-style: italic;">// create the flow</span>
    Flow simhashFlow <span style="color: #339933;">=</span> Simhash.<span style="color: #006633;">simhash</span><span style="color: #009900;">&#40;</span>inputTap, outputTap, 
                                       <span style="color: #cc66cc;">2</span>, <span style="color: #666666; font-style: italic;">// combine the n lowest minhashes (here, 2) </span>
                                       SimpleSimhash.<span style="color: #006633;">Tokenizer</span>.<span style="color: #000000; font-weight: bold;">class</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
    simhashFlow.<span style="color: #006633;">complete</span><span style="color: #009900;">&#40;</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span> <span style="color: #666666; font-style: italic;">// or add to your Cascade, etc</span>
  <span style="color: #009900;">&#125;</span>
<span style="color: #009900;">&#125;</span></pre></div></div>

<p>Notice a few things here:</p>
<ul>
<li>We&#8217;re inputting a tap of two fields: <tt>(docid, body)</tt></li>
<li>The <tt>2</tt> parameter is the number of minhashes to combine. In this case, we will combine the 2 lowest hashes to create one minhash. This parameter controls the overlap required for a match: here, two sets must share the same 2 lowest hashes in order to match.</li>
<li>The Tokenizer is a subclass of <tt>clojure.lang.AFn</tt>. Override the <tt>invoke(Object)</tt> method and you will be passed the body of the current record. In this case, we&#8217;re tokenizing by doing a simple <tt>String</tt> split.</li>
</ul>
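<p>The idea of combining the n lowest hashes can be sketched roughly as follows. This is an assumption-laden illustration, not the library&#8217;s code: the integer hash and the XOR combine step are stand-ins (the output shown further down suggests the library uses a much longer hash), and <tt>CombineLowestSketch</tt>, <tt>hash</tt> and <tt>combinedMinhash</tt> are hypothetical names.</p>

```java
import java.util.Arrays;

// Illustrative sketch only: the hash and the XOR combine step are assumptions,
// not the library's implementation.
final class CombineLowestSketch {

    // Hypothetical integer hash: String.hashCode() passed through a bit mixer.
    static int hash(String token) {
        int h = token.hashCode();
        h ^= (h >>> 16);
        h *= 0x85ebca6b;
        h ^= (h >>> 13);
        h *= 0xc2b2ae35;
        h ^= (h >>> 16);
        return h;
    }

    // Hash every distinct token, keep the n lowest hashes, and XOR them into one key.
    static int combinedMinhash(String[] tokens, int n) {
        int[] hashes = Arrays.stream(tokens)
                .distinct()
                .mapToInt(CombineLowestSketch::hash)
                .sorted()
                .toArray();
        int combined = 0;
        for (int i = 0; i < Math.min(n, hashes.length); i++) {
            combined ^= hashes[i];
        }
        return combined;
    }
}
```

<p>Because the key depends only on the n lowest hashes, token order and repeated tokens do not affect it. Raising n makes a match stricter, since more of the lowest hashes have to agree.</p>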
<p>If you&#8217;ve <a href="https://github.com/jashmenn/cascading-simhash">checked out the source</a> you can run it like this:</p>

<div class="wp_syntax"><div class="code"><pre class="bash" style="font-family:monospace;">lein uberjar
lein classpath <span style="color: #000000; font-weight: bold;">&gt;</span> classpath
java <span style="color: #660033;">-cp</span> <span style="color: #000000; font-weight: bold;">`</span><span style="color: #c20cb9; font-weight: bold;">cat</span> classpath<span style="color: #000000; font-weight: bold;">`</span>:build<span style="color: #000000; font-weight: bold;">/</span>cascading-simhash-1.0.0-SNAPSHOT-standalone.jar simhash.examples.SimpleSimhash <span style="color: #ff0000;">&quot;test-resources/test-documents.txt&quot;</span></pre></div></div>

<p>Given:</p>

<div class="wp_syntax"><div class="code"><pre class="bash" style="font-family:monospace;"><span style="color: #666666; font-style: italic;"># test-resources/test-documents.txt</span>
<span style="color: #666666; font-style: italic;"># docid \t body</span>
DocA	my dog has fleas
DocB	my dog has fleas
DocC	my dog has hair
DocD	see spot run
DocE	We hold these truths</pre></div></div>

<p>We get:</p>

<div class="wp_syntax"><div class="code"><pre class="bash" style="font-family:monospace;">RESULTS
<span style="color: #660033;">-----------------------</span>
23fd68296bc65391799c8c441faf4403c729256f	DocE	We hold these truths
402183e1cbc52e7c87eb230c281f35e4b27c2a39	DocD	see spot run
49c31c1459a7603bd5680d11285a5716c4ba3903	DocA	my dog has fleas
49c31c1459a7603bd5680d11285a5716c4ba3903	DocB	my dog has fleas
58e5a2035461323a37102e22273c9b25cbb9df61	DocC	my dog has hair
<span style="color: #660033;">-----------------------</span></pre></div></div>

<h2>A Clojure example</h2>
<p>Similarly, here&#8217;s how to run the library from Clojure. This time we use bi-grams as the tokens.</p>

<div class="wp_syntax"><div class="code"><pre class="lisp" style="font-family:monospace;"><span style="color: #66cc66;">&#40;</span>ns simhash<span style="color: #66cc66;">.</span>examples<span style="color: #66cc66;">.</span>bigrams
  <span style="color: #66cc66;">&#40;</span><span style="color: #66cc66;">:</span><span style="color: #555;">use</span> 
   <span style="color: #66cc66;">&#91;</span>simhash core util<span style="color: #66cc66;">&#93;</span>
   <span style="color: #66cc66;">&#91;</span>cascalog api testing<span style="color: #66cc66;">&#93;</span><span style="color: #66cc66;">&#41;</span>
  <span style="color: #66cc66;">&#40;</span><span style="color: #66cc66;">:</span><span style="color: #555;">require</span> 
   <span style="color: #66cc66;">&#91;</span>simhash <span style="color: #66cc66;">&#91;</span>taps <span style="color: #66cc66;">:</span><span style="color: #555;">as</span> t<span style="color: #66cc66;">&#93;</span> <span style="color: #66cc66;">&#91;</span>ops <span style="color: #66cc66;">:</span><span style="color: #555;">as</span> ops<span style="color: #66cc66;">&#93;</span><span style="color: #66cc66;">&#93;</span>
   <span style="color: #66cc66;">&#91;</span>clojure<span style="color: #66cc66;">.</span>contrib<span style="color: #66cc66;">.</span>str-utils <span style="color: #66cc66;">:</span><span style="color: #555;">as</span> stu<span style="color: #66cc66;">&#93;</span><span style="color: #66cc66;">&#41;</span>
  <span style="color: #66cc66;">&#40;</span><span style="color: #66cc66;">:</span><span style="color: #555;">gen-class</span><span style="color: #66cc66;">&#41;</span><span style="color: #66cc66;">&#41;</span>
&nbsp;
<span style="color: #66cc66;">&#40;</span>defn my-source <span style="color: #66cc66;">&#91;</span>path<span style="color: #66cc66;">&#93;</span>
  <span style="color: #66cc66;">&#40;</span><span style="color: #66cc66;">&lt;</span>- <span style="color: #66cc66;">&#91;</span>?docid ?body<span style="color: #66cc66;">&#93;</span>
      <span style="color: #66cc66;">&#40;</span><span style="color: #66cc66;">&#40;</span>hfs-textline path<span style="color: #66cc66;">&#41;</span> ?line<span style="color: #66cc66;">&#41;</span>
      <span style="color: #66cc66;">&#40;</span>ops/re-split-op <span style="color: #66cc66;">&#91;</span>#<span style="color: #ff0000;">&quot;<span style="color: #000099; font-weight: bold;">\t</span>&quot;</span> <span style="color: #cc66cc;">2</span><span style="color: #66cc66;">&#93;</span> ?line <span style="color: #66cc66;">:&gt;</span> ?docid ?body<span style="color: #66cc66;">&#41;</span>
      <span style="color: #66cc66;">&#40;</span><span style="color: #66cc66;">:</span><span style="color: #555;">distinct</span> false<span style="color: #66cc66;">&#41;</span><span style="color: #66cc66;">&#41;</span><span style="color: #66cc66;">&#41;</span>
&nbsp;
<span style="color: #66cc66;">&#40;</span>defn tokenize 
  <span style="color: #ff0000;">&quot;tokenize into bi-grams (sliding window)&quot;</span>
  <span style="color: #66cc66;">&#91;</span>body<span style="color: #66cc66;">&#93;</span>
  <span style="color: #66cc66;">&#40;</span>map
   <span style="color: #66cc66;">&#40;</span>fn <span style="color: #66cc66;">&#91;</span>tokens<span style="color: #66cc66;">&#93;</span> <span style="color: #66cc66;">&#40;</span>stu/str-join <span style="color: #ff0000;">&quot; &quot;</span> tokens<span style="color: #66cc66;">&#41;</span><span style="color: #66cc66;">&#41;</span>
   <span style="color: #66cc66;">&#40;</span>partition <span style="color: #cc66cc;">2</span> <span style="color: #cc66cc;">1</span> <span style="color: #66cc66;">&#40;</span>stu/re-split #<span style="color: #ff0000;">&quot;<span style="color: #000099; font-weight: bold;">\s</span>+&quot;</span> body<span style="color: #66cc66;">&#41;</span><span style="color: #66cc66;">&#41;</span><span style="color: #66cc66;">&#41;</span><span style="color: #66cc66;">&#41;</span>
&nbsp;
<span style="color: #66cc66;">&#40;</span>defn -main <span style="color: #66cc66;">&#91;</span><span style="color: #66cc66;">&amp;</span> args<span style="color: #66cc66;">&#93;</span>
  <span style="color: #66cc66;">&#40;</span>?- <span style="color: #66cc66;">&#40;</span>stdout<span style="color: #66cc66;">&#41;</span> 
      <span style="color: #66cc66;">&#40;</span>simhash-q <span style="color: #66cc66;">&#40;</span>my-source <span style="color: #66cc66;">&#40;</span>first args<span style="color: #66cc66;">&#41;</span><span style="color: #66cc66;">&#41;</span>
                 <span style="color: #cc66cc;">2</span> <span style="color: #808080; font-style: italic;">;; number of minhashes</span>
                 tokenize<span style="color: #66cc66;">&#41;</span><span style="color: #66cc66;">&#41;</span><span style="color: #66cc66;">&#41;</span></pre></div></div>

<p>A few things to point out about the Clojure example:</p>
<ul>
<li><tt>simhash-q</tt> is just a Cascalog query. Unlike the Java example (which required a <tt>Tap</tt> as the input) <tt>simhash-q</tt> can accept any other Cascalog query as the input.</li>
<li>You must use <tt>gen-class</tt> on the namespace that holds your <tt>tokenize</tt> function. This is because Cascading will serialize your <tt>Flow</tt> and it has a hard time with functions generated at run-time. Generally speaking, if the <tt>tokenize</tt> function isn&#8217;t AOT-compiled into a class you&#8217;re going to run into problems.</li>
</ul>
<p>The project also includes a tokenizer for extracting text from HTML documents. See <tt><a href="https://github.com/jashmenn/cascading-simhash/blob/master/src/clj/simhash/tokenizers/html_text.clj">tokenizers.html_text.clj</a></tt> for an example of how to write a tokenizer in Clojure, and <tt><a href="https://github.com/jashmenn/cascading-simhash/blob/master/src/java/simhash/examples/HtmlSimhash.java">HtmlSimhash.java</a></tt> for a Java example of how to use it.</p>
<h2>Summary</h2>
<p>Simhashing in MapReduce is a quick way to find clusters in a huge amount of data. By using Cascading and Cascalog we&#8217;re able to work with MapReduce jobs at the level of functions rather than individual map-reduce phases.</p>
<p>Have any data you need clustered? Try <tt><a href="https://github.com/jashmenn/cascading-simhash">cascading-simhash</a></tt> and let me know how it goes!</p>
<p>Learn more about big data <a href="http://twitter.com/xcombinator">by following me on twitter</a>.</p>
<p>You can get the jars via Clojars:<br />
Leiningen:</p>

<div class="wp_syntax"><div class="code"><pre class="lisp" style="font-family:monospace;">  <span style="color: #66cc66;">&#91;</span>cascading-simhash <span style="color: #ff0000;">&quot;1.0.0-SNAPSHOT&quot;</span><span style="color: #66cc66;">&#93;</span></pre></div></div>

<p>Maven:</p>

<div class="wp_syntax"><div class="code"><pre class="xml" style="font-family:monospace;"><span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;dependency<span style="color: #000000; font-weight: bold;">&gt;</span></span></span>
  <span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;groupId<span style="color: #000000; font-weight: bold;">&gt;</span></span></span>cascading-simhash<span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;/groupId<span style="color: #000000; font-weight: bold;">&gt;</span></span></span>
  <span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;artifactId<span style="color: #000000; font-weight: bold;">&gt;</span></span></span>cascading-simhash<span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;/artifactId<span style="color: #000000; font-weight: bold;">&gt;</span></span></span>
  <span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;version<span style="color: #000000; font-weight: bold;">&gt;</span></span></span>1.0.0-SNAPSHOT<span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;/version<span style="color: #000000; font-weight: bold;">&gt;</span></span></span>
<span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;/dependency<span style="color: #000000; font-weight: bold;">&gt;</span></span></span></pre></div></div>

<p><a href="https://github.com/jashmenn/cascading-simhash">View the source on github</a>.</p>
<p>References</p>
<ol>
<li><a href="http://knol.google.com/k/simple-simhashing">http://knol.google.com/k/simple-simhashing</a></li>
<li><a href="http://en.wikipedia.org/wiki/Jaccard_index">http://en.wikipedia.org/wiki/Jaccard_index</a></li>
<li><a href="http://en.wikipedia.org/wiki/MinHash">http://en.wikipedia.org/wiki/MinHash</a></li>
</ol>
]]></content:encoded>
			<wfw:commentRss>http://eigenjoy.com/2011/05/09/cascading-simhash-a-library-to-cluster-by-minhashes-in-hadoop/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Why is XOR the default way to combine hashes</title>
		<link>http://eigenjoy.com/2011/05/04/why-is-xor-the-default-way-to-combine-hashes/</link>
		<comments>http://eigenjoy.com/2011/05/04/why-is-xor-the-default-way-to-combine-hashes/#comments</comments>
		<pubDate>Wed, 04 May 2011 21:24:08 +0000</pubDate>
		<dc:creator>Nate Murray</dc:creator>
				<category><![CDATA[programming]]></category>

		<guid isPermaLink="false">http://www.xcombinator.com/?p=438</guid>
		<description><![CDATA[I&#8217;ve been reading up on Simhashing as a way to find duplicate content in web data. I came across Ryan Moulton&#8217;s excellent article on Simhashing. Ryan&#8217;s preferred method of Simhashing is to take the n-minimum hash values from the set. In his pseudocode to combine the n hash values he XORs them. This got me [...]]]></description>
			<content:encoded><![CDATA[<p>I&#8217;ve been reading up on Simhashing as a way to find duplicate content in web data. I came across Ryan Moulton&#8217;s <a href="http://knol.google.com/k/simple-simhashing">excellent article on Simhashing</a>. Ryan&#8217;s preferred method of Simhashing is to take the n-minimum hash values from the set. In his pseudocode to combine the n hash values he <tt>XOR</tt>s them. This got me wondering:</p>
<p>Why is <tt>XOR</tt> the default way to combine hashes? </p>
<p>I posted this question on Stack Overflow and got <a href="http://stackoverflow.com/questions/5889238/why-is-xor-the-default-way-to-combine-hashes">this excellent and concise answer</a> from Greg Hewgill:</p>
<blockquote><p>
Assuming random (1-bit) inputs, the AND function output probability distribution is 75% 0 and 25% 1. Conversely, OR is 25% 0 and 75% 1.</p>
<p>The XOR function is 50% 0 and 50% 1, therefore it is good for combining uniform probability distributions.
</p>
</blockquote>
<p>The distributions become clear when we chart the output of each of the operations:</p>
<p><a href="http://www.xcombinator.com/wp-content/uploads/2011/05/prob.jpg"><img src="http://www.xcombinator.com/wp-content/uploads/2011/05/prob.jpg" alt="and-or-xor" title="and-or-xor" width="347" height="528" class="size-full wp-image-441" /></a></p>
<p>If you are combining two random bits with AND you have a 75% chance of <tt>0</tt>. Similarly, if you combine two random bits with OR you have a 75% chance of <tt>1</tt>. With XOR, each outcome is equally likely.</p>
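<p>The distributions are easy to check by enumerating the four equally likely input pairs and counting how often each operation produces a <tt>1</tt>. The class and method names below are made up for this sketch.</p>

```java
// Illustrative sketch: enumerate all four (a, b) bit pairs and count the 1 outputs.
final class BitCombineSketch {

    // Returns {ones from AND, ones from OR, ones from XOR} over the four input pairs.
    static int[] onesCount() {
        int and = 0;
        int or = 0;
        int xor = 0;
        for (int a = 0; a < 2; a++) {
            for (int b = 0; b < 2; b++) {
                and += a & b;
                or += a | b;
                xor += a ^ b;
            }
        }
        return new int[] {and, or, xor};
    }

    public static void main(String[] args) {
        int[] c = onesCount();
        System.out.println("AND yields 1 in " + c[0] + " of 4 cases"); // 1 of 4
        System.out.println("OR  yields 1 in " + c[1] + " of 4 cases"); // 3 of 4
        System.out.println("XOR yields 1 in " + c[2] + " of 4 cases"); // 2 of 4
    }
}
```

<p>Only XOR splits the outcomes evenly, so only XOR keeps a uniformly distributed bit uniform after combining.</p>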
<p>My friend <a href="http://www.zinkov.com/">Rob Zinkov</a> [1] explains it by framing it in terms of entropy:</p>
<blockquote><p>
[When combining hashes] essentially you want an operation that maximizes entropy since lower entropy implies the hash is storing less data than it appears.</p>
<p>With AND and OR lots of the variation in bits is being removed, so the hash you end up with tends to be biased towards 1s (as in the OR case) or 0s (as in the AND case)</p>
<p>That bias means each bit is storing slightly less information (since that bias is present).
</p>
</blockquote>
<p>Summary: If you need to combine two hashes, XOR them together and you&#8217;ll have a better chance at maintaining the entropy of the original hashes.</p>
<p><em><br />
[1] Rob runs the <a href="http://www.meetup.com/LA-Machine-Learning/">L.A. Machine Learning</a> meetup group. If you&#8217;re in L.A. <tt>AND</tt> interested in machine learning, stop by sometime.<br />
</em></p>
]]></content:encoded>
			<wfw:commentRss>http://eigenjoy.com/2011/05/04/why-is-xor-the-default-way-to-combine-hashes/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
		<item>
		<title>Clojure&#8217;s keyword can fill up your PermGen space</title>
		<link>http://eigenjoy.com/2011/03/02/clojures-keyword-can-fill-up-your-permgen-space/</link>
		<comments>http://eigenjoy.com/2011/03/02/clojures-keyword-can-fill-up-your-permgen-space/#comments</comments>
		<pubDate>Wed, 02 Mar 2011 15:49:08 +0000</pubDate>
		<dc:creator>Nate Murray</dc:creator>
				<category><![CDATA[programming]]></category>
		<category><![CDATA[clojure]]></category>

		<guid isPermaLink="false">http://www.xcombinator.com/?p=413</guid>
		<description><![CDATA[We&#8217;ve been working on a custom web-crawler for a few months now. Recently we were having a problem where after a few minutes the JVM would run out of PermGen space.
If you&#8217;re not familiar with PermGen space, it is a portion of memory reserved for the JVM itself. It is used for storing information about [...]]]></description>
			<content:encoded><![CDATA[<p>We&#8217;ve been working on a custom web-crawler for a few months now. Recently we were having a problem where after a few minutes the JVM would run out of PermGen space.</p>
<p>If you&#8217;re not familiar with PermGen space, it is a portion of memory reserved for the JVM itself. It is used for storing information about Classes and interned Strings. </p>
<p>When you intern a String, the JVM stores a single copy of that String in the PermGen space. This can save RAM because only one copy of the String exists in the system. It can also speed up <tt>==</tt> comparisons of two interned Strings, because only the references need to be compared, not the characters.</p>
<p>The problem is, the PermGen space is typically very small (64m is a common default). So if you have many classes or a lot of interned Strings, you can easily blow out the PermGen space.</p>
<p>This problem was showing up in our crawler and we traced it to how we were parsing <tt>robots.txt</tt>.</p>
<blockquote><p>
<tt>robots.txt</tt> is a convention website owners can use that will instruct<br />
crawlers how to act while they are on their site. All polite crawlers<br />
use them. For example: </p>
<pre>
Disallow: /no-crawl/
Allow: /
Sitemap: http://www.foo.com/sitemap.xml
</pre>
</blockquote>
<p>In our crawler, we&#8217;ve written a custom <tt>robots.txt</tt> parsing library: <a href="https://github.com/retiman/clj-robots">clj-robots (github)</a>.</p>
<p>In <tt>clj-robots</tt>, there was one section of the code where we were taking the left-hand side of each <tt>robots.txt</tt> line and converting it to a keyword. This made for cleaner code than comparing Strings. Since there is only a fixed number of <tt>robots.txt</tt> directives, this should be safe, right?</p>
<p>It turns out it isn&#8217;t safe. First, you don&#8217;t know what people are actually going to put in their <tt>robots.txt</tt>. Second, we forgot that many sites don&#8217;t have a <tt>robots.txt</tt> and, instead of returning an empty 404, often return their custom 404 HTML page. As a result, we were parsing an HTML page as a <tt>robots.txt</tt> and interning everything that looked like a <tt>robots.txt</tt> directive.</p>
<p>The result was a spectacular &#8220;java.lang.OutOfMemoryError: PermGen space&#8221; after just a few minutes. The general principle here is that you should never allow user-generated input to become an interned String.</p>
<p>Lessons learned:</p>
<ul>
<li>PermGen stores Classes and interned Strings</li>
<li>Clojure&#8217;s <tt>keyword</tt> interns a String </li>
<li>Don&#8217;t call <tt>keyword</tt> on user-generated input</li>
<li>A profiling tool (e.g. <a href="http://www.ej-technologies.com/products/jprofiler/overview.html">JProfiler</a>) can be your best friend in these situations</li>
</ul>
<p>References:</p>
<ul>
<li><a href="http://java.sun.com/docs/hotspot/gc1.4.2/faq.html">Java GC FAQ</a></li>
<li><a href="https://github.com/richhickey/clojure/blob/master/src/jvm/clojure/lang/Keyword.java">Keyword.java from Clojure Core</a></li>
<li><a href="http://mindprod.com/jgloss/interned.html">interned Strings : Java Glossary</a></li>
</ul>
]]></content:encoded>
			<wfw:commentRss>http://eigenjoy.com/2011/03/02/clojures-keyword-can-fill-up-your-permgen-space/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
		<item>
		<title>World&#8217;s Fastest Binary Search?</title>
		<link>http://eigenjoy.com/2011/01/21/worlds-fastest-binary-search/</link>
		<comments>http://eigenjoy.com/2011/01/21/worlds-fastest-binary-search/#comments</comments>
		<pubDate>Fri, 21 Jan 2011 12:17:48 +0000</pubDate>
		<dc:creator>Matt Pulver</dc:creator>
				<category><![CDATA[programming]]></category>

		<guid isPermaLink="false">http://www.xcombinator.com/?p=379</guid>
		<description><![CDATA[(Edited 9 Sept 2011: A few minor improvements have been made; see the updated post instead: Binary Search Revisited.)
Every CS student knows how to write a binary search algorithm. The question is, are we really making full use of the fact that we are using a binary computer to do a binary search? While it [...]]]></description>
			<content:encoded><![CDATA[<p>(Edited 9 Sept 2011: A few minor improvements have been made; see the updated post instead: <a href="http://www.xcombinator.com/2011/09/09/binary-search-revisited/">Binary Search Revisited</a>.)</p>
<p>Every CS student knows how to write a <a href="http://en.wikipedia.org/wiki/Binary_search_algorithm">binary search algorithm</a>. The question is, are we really making full use of the fact that we are using a binary computer to do a binary search? While it is true that computers are good at doing arithmetic, they are even better at doing pure bit logic. Oftentimes our grammar-school arithmetic way of thinking can get in the way of solving a problem that has a more elegant solution in terms of lower-level bit logic. Binary search in an ordered list is one such problem.</p>
<p>Your homework problem: rewrite the typical textbook algorithm for binary search in an ordered list, making it <em>faster</em>, without doing any arithmetic within the main loop of the search algorithm (the underlying pointer arithmetic of direct array element access is OK).<br />
<span id="more-379"></span></p>
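<p>For comparison, the textbook version the problem refers to might look like the sketch below. It is not from the original post: the name <tt>textbook_search</tt> is hypothetical, and it is written to follow the same contract as the solution (index of the largest element less than or equal to x, or -1). Note the addition, subtraction and division inside the loop.</p>

```cpp
#include <cstddef>

// Hypothetical baseline (not from the original post): a textbook binary
// search that returns the index of the largest element <= x, or -1 if x is
// less than all elements or the list is empty.
template <typename T>
int textbook_search(const T* sorted_list, std::size_t list_size, T x)
{
    std::size_t lo = 0;          // every index below lo holds a value <= x
    std::size_t hi = list_size;  // every index at or above hi holds a value > x
    while (lo < hi) {
        std::size_t mid = lo + (hi - lo) / 2;  // arithmetic inside the loop
        if (sorted_list[mid] <= x)
            lo = mid + 1;
        else
            hi = mid;
    }
    // lo is now the number of elements <= x, so the answer is lo - 1.
    return static_cast<int>(lo) - 1;
}
```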
<p>Solution in C++:<br />
<font face="monospace"><br />
<font color="#8080ff">// Fastest binary search</font><br />
<font color="#ff40ff">#include </font><font color="#ff6060">&lt;iostream&gt;</font><br />
<font color="#ff40ff">#include </font><font color="#ff6060">&lt;math.h&gt;</font></p>
<p><font color="#8080ff">// Given a value x, return the index of the largest</font><br />
<font color="#8080ff">// element in a sorted list less than or equal to x.</font><br />
<font color="#8080ff">// Return -1 if x is less than all elements, or list</font><br />
<font color="#8080ff">// is empty. T can be any type for which &lt;= is defined.</font><br />
<font color="#00ff00">template</font>&nbsp;&lt;<font color="#00ff00">typename</font>&nbsp;T&gt;<br />
<font color="#00ff00">int</font>&nbsp;fbsearch( <font color="#00ff00">const</font>&nbsp;T *sorted_list, <font color="#00ff00">size_t</font>&nbsp;list_size, T x )<br />
{<br />
&nbsp;&nbsp;&nbsp;&nbsp;<font color="#ffff00">if</font>( list_size &lt;= <font color="#ff6060">1</font>&nbsp;)<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<font color="#ffff00">return</font>&nbsp;list_size &amp;&amp; *sorted_list &lt;= x ? <font color="#ff6060">0</font>&nbsp;: -<font color="#ff6060">1</font>;<br />
&nbsp;&nbsp;&nbsp;&nbsp;<font color="#00ff00">unsigned</font>&nbsp;i = <font color="#ff6060">0</font>;<br />
&nbsp;&nbsp;&nbsp;&nbsp;<font color="#00ff00">unsigned</font>&nbsp;b = <font color="#ff6060">1</font>&nbsp;&lt;&lt; <font color="#00ff00">int</font>( log(list_size-<font color="#ff6060">1</font>) / <font color="#ff6060">M_LN2</font>&nbsp;);<br />
&nbsp;&nbsp;&nbsp;&nbsp;<font color="#ffff00">for</font>( ; b ; b &gt;&gt;= <font color="#ff6060">1</font>&nbsp;)<br />
&nbsp;&nbsp;&nbsp;&nbsp;{<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<font color="#00ff00">unsigned</font>&nbsp;j = i | b;<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<font color="#ffff00">if</font>( j &lt; list_size &amp;&amp; sorted_list[j] &lt;= x ) i = j;<br />
&nbsp;&nbsp;&nbsp;&nbsp;}<br />
&nbsp;&nbsp;&nbsp;&nbsp;<font color="#ffff00">return</font>&nbsp;i || *sorted_list &lt;= x ? i : -<font color="#ff6060">1</font>;<br />
}</p>
<p><font color="#ffff00">using</font>&nbsp;<font color="#00ff00">namespace</font>&nbsp;std;</p>
<p><font color="#00ff00">int</font>&nbsp;main()<br />
{<br />
&nbsp;&nbsp;&nbsp;&nbsp;<font color="#00ff00">const</font>&nbsp;<font color="#00ff00">int</font>&nbsp;sorted_list[] =<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;{ <font color="#ff6060">2</font>, <font color="#ff6060">3</font>, <font color="#ff6060">5</font>, <font color="#ff6060">7</font>, <font color="#ff6060">11</font>, <font color="#ff6060">13</font>, <font color="#ff6060">17</font>, <font color="#ff6060">19</font>, <font color="#ff6060">23</font>, <font color="#ff6060">29</font>&nbsp;};<br />
&nbsp;&nbsp;&nbsp;&nbsp;<font color="#00ff00">const</font>&nbsp;<font color="#00ff00">size_t</font>&nbsp;list_size = <font color="#ffff00">sizeof</font>(sorted_list)/<font color="#ffff00">sizeof</font>(<font color="#00ff00">int</font>);<br />
&nbsp;&nbsp;&nbsp;&nbsp;<font color="#00ff00">int</font>&nbsp;i = <font color="#ff6060">7</font>;<br />
&nbsp;&nbsp;&nbsp;&nbsp;cout &lt;&lt; <font color="#ff6060">&quot;fbsearch(sorted_list,&quot;</font>&lt;&lt;list_size&lt;&lt;<font color="#ff6060">&#39;,&#39;</font>&lt;&lt;i&lt;&lt;<font color="#ff6060">&quot;) = &quot;</font><br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; &lt;&lt; fbsearch(sorted_list,list_size,i) &lt;&lt; endl;<br />
&nbsp;&nbsp;&nbsp;&nbsp;<font color="#ffff00">return</font>&nbsp;<font color="#ff6060">0</font>;<br />
}<br />
</font></p>
<p>Technical Notes:</p>
<ul>
<li>Feel free to add a statement within the loop that returns the index if found. Note that for large n, you will spend more time on average checking for equality than the time you will save by returning early.</li>
<li>Instead of dividing by ln(2), feel free to save the constant 1/ln(2) and multiply by it instead.</li>
</ul>
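<p>Another option for the setup step, sketched here as an assumption rather than as part of the original solution, is to find the starting bit with shifts alone, so that no floating-point <tt>log</tt> is needed. The name <tt>fbsearch_shift</tt> is hypothetical; the main loop is unchanged apart from using <tt>size_t</tt> for the indices.</p>

```cpp
#include <cstddef>

// Hypothetical variant of fbsearch (not from the original post): the starting
// bit is found by shifting instead of calling log(). Returns the index of the
// largest element <= x, or -1 if x is less than all elements or the list is
// empty. Assumes list_size is small enough that (b << 1) cannot overflow.
template <typename T>
int fbsearch_shift(const T* sorted_list, std::size_t list_size, T x)
{
    if (list_size == 0 || !(sorted_list[0] <= x))
        return -1;
    // Largest power of two not exceeding list_size - 1 (1 if list_size is 1).
    std::size_t b = 1;
    while ((b << 1) < list_size)
        b <<= 1;
    // Same bit-by-bit search as the original: try to set each bit of the
    // index from the highest to the lowest.
    std::size_t i = 0;
    for (; b; b >>= 1) {
        std::size_t j = i | b;
        if (j < list_size && sorted_list[j] <= x)
            i = j;
    }
    return static_cast<int>(i);
}
```

<p>The shift loop runs once per bit of the index, so it costs about as many steps as the search itself.</p>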
<p>So what&#8217;s going on here? Take a list of the first 10 prime numbers, for example, and enumerate them using base-2:</p>
<pre>
0000   2
0001   3
0010   5
0011   7
0100  11
0101  13
0110  17
0111  19
1000  23
1001  29
</pre>
<p>If we are given one of the items on the list, say 13, then we are looking for its 4-bit index. Each bit is a yes/no question. The first bit asks, &#8220;Are you at index 1000 or higher?&#8221; If no, then the second bit asks, &#8220;Are you at index 0100 or higher?&#8221; If yes, then the third bit asks, &#8220;Are you at 0110 or higher?&#8221; If no, then the last bit asks, &#8220;Are you at 0101 or higher?&#8221; By now we have asked all 4 questions, and thus have all 4 bits of the index. For 13 the answers are no, yes, no, yes, which gives the index 0101.</p>
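<p>The questions can be made visible with a small tracing sketch. It is not part of the original solution: the helper <tt>trace_index_bits</tt> is hypothetical, it uses <tt>std::vector</tt> for brevity, and it assumes a non-empty sorted list whose first element is less than or equal to x. It runs the same loop and records each answer as 1 (yes) or 0 (no).</p>

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper (not from the original post): performs the bit-by-bit
// search and returns the answers to the yes/no questions as a string of
// '1' (yes) and '0' (no), highest bit first.
// Assumes sorted_list is sorted, non-empty, and sorted_list[0] <= x.
std::string trace_index_bits(const std::vector<int>& sorted_list, int x)
{
    // Largest power of two not exceeding size - 1 (1 for a one-element list).
    std::size_t top = 1;
    while ((top << 1) < sorted_list.size())
        top <<= 1;

    std::string bits;
    std::size_t i = 0;
    for (std::size_t b = top; b; b >>= 1) {
        std::size_t j = i | b;  // "Are you at index j or higher?"
        bool yes = j < sorted_list.size() && sorted_list[j] <= x;
        if (yes)
            i = j;
        bits += yes ? '1' : '0';
    }
    return bits;
}
```

<p>With the ten primes listed above, <tt>trace_index_bits(primes, 13)</tt> yields <tt>0101</tt>, which is index 5.</p>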
<p>Moral of the story: Sometimes it&#8217;s better to think like a silicon being than a carbon one.</p>
<p>p.s. Happy birthday, Nate!</p>
]]></content:encoded>
			<wfw:commentRss>http://eigenjoy.com/2011/01/21/worlds-fastest-binary-search/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
		<item>
		<title>Extract Text from a HTML Document in Clojure</title>
		<link>http://eigenjoy.com/2010/12/01/extract-text-from-a-html-document-in-clojure/</link>
		<comments>http://eigenjoy.com/2010/12/01/extract-text-from-a-html-document-in-clojure/#comments</comments>
		<pubDate>Wed, 01 Dec 2010 21:03:32 +0000</pubDate>
		<dc:creator>Nate Murray</dc:creator>
				<category><![CDATA[programming]]></category>
		<category><![CDATA[clojure]]></category>

		<guid isPermaLink="false">http://www.xcombinator.com/?p=361</guid>
		<description><![CDATA[There are many Java HTML parsers and it can be tricky to figure out which one to use. If you need to quickly extract just the text of a document I&#8217;d recommend using the Jericho HTML Parser.
Here&#8217;s a quick example on how to use it:

;; lein dependency: [net.htmlparser.jericho/jericho-html &#34;3.1&#34;]
&#40;ns foo.preprocess
  &#40;:import 
   [...]]]></description>
			<content:encoded><![CDATA[<p>There are many <a href="http://java-source.net/open-source/html-parsers">Java HTML parsers</a> and it can be tricky to figure out which one to use. If you need to quickly extract just the text of a document I&#8217;d recommend using the <a href="http://jericho.htmlparser.net/docs/index.html">Jericho HTML Parser</a>.</p>
<p>Here&#8217;s a quick example on how to use it:</p>

<div class="wp_syntax"><div class="code"><pre class="lisp" style="font-family:monospace;"><span style="color: #808080; font-style: italic;">;; lein dependency: [net.htmlparser.jericho/jericho-html &quot;3.1&quot;]</span>
<span style="color: #66cc66;">&#40;</span>ns foo<span style="color: #66cc66;">.</span>preprocess
  <span style="color: #66cc66;">&#40;</span><span style="color: #66cc66;">:</span><span style="color: #555;">import</span> 
   <span style="color: #66cc66;">&#91;</span>java<span style="color: #66cc66;">.</span>io File BufferedInputStream FileInputStream<span style="color: #66cc66;">&#93;</span>
   <span style="color: #66cc66;">&#91;</span>net<span style="color: #66cc66;">.</span>htmlparser<span style="color: #66cc66;">.</span>jericho Source TextExtractor<span style="color: #66cc66;">&#93;</span><span style="color: #66cc66;">&#41;</span><span style="color: #66cc66;">&#41;</span>
&nbsp;
<span style="color: #66cc66;">&#40;</span>defn extract-text 
  <span style="color: #ff0000;">&quot;given File returns a String of the extracted text&quot;</span>
  <span style="color: #66cc66;">&#91;</span>f<span style="color: #66cc66;">&#93;</span>
  <span style="color: #66cc66;">&#40;</span><span style="color: #b1b100;">let</span> <span style="color: #66cc66;">&#91;</span>source <span style="color: #66cc66;">&#40;</span>Source<span style="color: #66cc66;">.</span> <span style="color: #66cc66;">&#40;</span>BufferedInputStream<span style="color: #66cc66;">.</span> <span style="color: #66cc66;">&#40;</span>FileInputStream<span style="color: #66cc66;">.</span> f<span style="color: #66cc66;">&#41;</span><span style="color: #66cc66;">&#41;</span><span style="color: #66cc66;">&#41;</span><span style="color: #66cc66;">&#93;</span>
    <span style="color: #66cc66;">&#40;</span><span style="color: #66cc66;">.</span>toString <span style="color: #66cc66;">&#40;</span>TextExtractor<span style="color: #66cc66;">.</span> source<span style="color: #66cc66;">&#41;</span><span style="color: #66cc66;">&#41;</span><span style="color: #66cc66;">&#41;</span><span style="color: #66cc66;">&#41;</span>
&nbsp;
<span style="color: #66cc66;">&#40;</span>def filename <span style="color: #ff0000;">&quot;data/some-index.html&quot;</span><span style="color: #66cc66;">&#41;</span>
<span style="color: #66cc66;">&#40;</span>extract-text <span style="color: #66cc66;">&#40;</span>java<span style="color: #66cc66;">.</span>io<span style="color: #66cc66;">.</span>File<span style="color: #66cc66;">.</span> filename<span style="color: #66cc66;">&#41;</span><span style="color: #66cc66;">&#41;</span></pre></div></div>

<p>TextExtractor has sensible defaults and ignores CSS and JavaScript out of the box. See the <a href="http://jericho.htmlparser.net/docs/javadoc/index.html">TextExtractor</a> class for more details.</p>
]]></content:encoded>
			<wfw:commentRss>http://eigenjoy.com/2010/12/01/extract-text-from-a-html-document-in-clojure/feed/</wfw:commentRss>
		<slash:comments>4</slash:comments>
		</item>
		<item>
		<title>Boyer-Moore string search algorithm in Ruby</title>
		<link>http://eigenjoy.com/2010/10/27/boyer-moore-string-search-algorithm-in-ruby/</link>
		<comments>http://eigenjoy.com/2010/10/27/boyer-moore-string-search-algorithm-in-ruby/#comments</comments>
		<pubDate>Wed, 27 Oct 2010 14:09:46 +0000</pubDate>
		<dc:creator>Nate Murray</dc:creator>
				<category><![CDATA[programming]]></category>

		<guid isPermaLink="false">http://www.xcombinator.com/?p=318</guid>
		<description><![CDATA[Just a quick post. I&#8217;ve converted the C code from the wikipedia entry (this version) on the Boyer-Moore string search algorithm to Ruby. I&#8217;ve extended it to support searches on token arrays and regular expressions.
You can find the code on github.
Usage:

    BoyerMoore.search&#40;haystack, needle&#41;   # returns index of needle or nil

Examples:
Basic [...]]]></description>
			<content:encoded><![CDATA[<p>Just a quick post. I&#8217;ve converted the C code from the <a href="http://en.wikipedia.org/wiki/Boyer%E2%80%93Moore_string_search_algorithm">Wikipedia entry</a> <small>(<a href="http://en.wikipedia.org/w/index.php?title=Boyer%E2%80%93Moore_string_search_algorithm&#038;diff=391986850&#038;oldid=391398281">this version</a>)</small> on the Boyer-Moore string search algorithm to Ruby. I&#8217;ve extended it to support searches on token arrays and regular expressions.</p>
<p>You can find the <a href="http://github.com/jashmenn/boyermoore">code on github</a>.</p>
<p>Usage:</p>

<div class="wp_syntax"><div class="code"><pre class="ruby" style="font-family:monospace;">    BoyerMoore.<span style="color:#9900CC;">search</span><span style="color:#006600; font-weight:bold;">&#40;</span>haystack, needle<span style="color:#006600; font-weight:bold;">&#41;</span>   <span style="color:#008000; font-style:italic;"># returns index of needle or nil</span></pre></div></div>

<p>Examples:</p>
<p>Basic search in string:</p>

<div class="wp_syntax"><div class="code"><pre class="ruby" style="font-family:monospace;">    BoyerMoore.<span style="color:#9900CC;">search</span><span style="color:#006600; font-weight:bold;">&#40;</span><span style="color:#996600;">&quot;ANPANMAN&quot;</span>, <span style="color:#996600;">&quot;ANP&quot;</span><span style="color:#006600; font-weight:bold;">&#41;</span>   <span style="color:#008000; font-style:italic;"># =&gt; 0</span>
    BoyerMoore.<span style="color:#9900CC;">search</span><span style="color:#006600; font-weight:bold;">&#40;</span><span style="color:#996600;">&quot;ANPANMAN&quot;</span>, <span style="color:#996600;">&quot;ANPXX&quot;</span><span style="color:#006600; font-weight:bold;">&#41;</span> <span style="color:#008000; font-style:italic;"># =&gt; nil </span>
    BoyerMoore.<span style="color:#9900CC;">search</span><span style="color:#006600; font-weight:bold;">&#40;</span><span style="color:#996600;">&quot;foobar&quot;</span>, <span style="color:#996600;">&quot;bar&quot;</span><span style="color:#006600; font-weight:bold;">&#41;</span>     <span style="color:#008000; font-style:italic;"># =&gt; 3</span></pre></div></div>

<p>You can also search an array of tokens:</p>

<div class="wp_syntax"><div class="code"><pre class="ruby" style="font-family:monospace;">    BoyerMoore.<span style="color:#9900CC;">search</span><span style="color:#006600; font-weight:bold;">&#40;</span><span style="color:#006600; font-weight:bold;">&#91;</span><span style="color:#996600;">&quot;&lt;b&gt;&quot;</span>, <span style="color:#996600;">&quot;hi&quot;</span>, <span style="color:#996600;">&quot;&lt;/b&gt;&quot;</span><span style="color:#006600; font-weight:bold;">&#93;</span>, <span style="color:#006600; font-weight:bold;">&#91;</span><span style="color:#996600;">&quot;hi&quot;</span><span style="color:#006600; font-weight:bold;">&#93;</span><span style="color:#006600; font-weight:bold;">&#41;</span>         <span style="color:#008000; font-style:italic;"># =&gt; 1 </span>
    BoyerMoore.<span style="color:#9900CC;">search</span><span style="color:#006600; font-weight:bold;">&#40;</span><span style="color:#006600; font-weight:bold;">&#91;</span><span style="color:#996600;">&quot;bam&quot;</span>, <span style="color:#996600;">&quot;foo&quot;</span>, <span style="color:#996600;">&quot;bar&quot;</span><span style="color:#006600; font-weight:bold;">&#93;</span>, <span style="color:#006600; font-weight:bold;">&#91;</span><span style="color:#996600;">&quot;foo&quot;</span>, <span style="color:#996600;">&quot;bar&quot;</span><span style="color:#006600; font-weight:bold;">&#93;</span><span style="color:#006600; font-weight:bold;">&#41;</span> <span style="color:#008000; font-style:italic;"># =&gt; 1 </span>
    BoyerMoore.<span style="color:#9900CC;">search</span><span style="color:#006600; font-weight:bold;">&#40;</span><span style="color:#006600; font-weight:bold;">&#91;</span><span style="color:#996600;">&quot;bam&quot;</span>, <span style="color:#996600;">&quot;bar&quot;</span>, <span style="color:#996600;">&quot;baz&quot;</span><span style="color:#006600; font-weight:bold;">&#93;</span>, <span style="color:#006600; font-weight:bold;">&#91;</span><span style="color:#996600;">&quot;foo&quot;</span><span style="color:#006600; font-weight:bold;">&#93;</span><span style="color:#006600; font-weight:bold;">&#41;</span>        <span style="color:#008000; font-style:italic;"># =&gt; nil</span></pre></div></div>

<p>A token can be a regular expression:</p>

<div class="wp_syntax"><div class="code"><pre class="ruby" style="font-family:monospace;">    BoyerMoore.<span style="color:#9900CC;">search</span><span style="color:#006600; font-weight:bold;">&#40;</span><span style="color:#006600; font-weight:bold;">&#91;</span><span style="color:#996600;">&quot;Sing&quot;</span>, <span style="color:#996600;">&quot;99&quot;</span>, <span style="color:#996600;">&quot;Luftballon&quot;</span><span style="color:#006600; font-weight:bold;">&#93;</span>, <span style="color:#006600; font-weight:bold;">&#91;</span><span style="color:#006600; font-weight:bold;">/</span>\d<span style="color:#006600; font-weight:bold;">+/</span><span style="color:#006600; font-weight:bold;">&#93;</span><span style="color:#006600; font-weight:bold;">&#41;</span> == <span style="color:#006666;">1</span>
    BoyerMoore.<span style="color:#9900CC;">search</span><span style="color:#006600; font-weight:bold;">&#40;</span><span style="color:#006600; font-weight:bold;">&#91;</span><span style="color:#996600;">&quot;Nate Murray&quot;</span>, <span style="color:#996600;">&quot;5 Pine Street&quot;</span>, <span style="color:#996600;">&quot;Los Angeles&quot;</span>, <span style="color:#996600;">&quot;CA&quot;</span>, <span style="color:#996600;">&quot;90210&quot;</span><span style="color:#006600; font-weight:bold;">&#93;</span>, <span style="color:#006600; font-weight:bold;">&#91;</span><span style="color:#006600; font-weight:bold;">/</span>^\w<span style="color:#006600; font-weight:bold;">&#123;</span><span style="color:#006666;">2</span><span style="color:#006600; font-weight:bold;">&#125;</span>$<span style="color:#006600; font-weight:bold;">/</span>, <span style="color:#006600; font-weight:bold;">/</span>^\d<span style="color:#006600; font-weight:bold;">&#123;</span><span style="color:#006666;">5</span><span style="color:#006600; font-weight:bold;">&#125;</span>$<span style="color:#006600; font-weight:bold;">/</span><span style="color:#006600; font-weight:bold;">&#93;</span><span style="color:#006600; font-weight:bold;">&#41;</span> == <span style="color:#006666;">3</span></pre></div></div>

<p>Notes:</p>
<p>The regular-expression token matching is a bit of a hack and will be fairly slow because every hash miss is compared against every regular expression key. You probably shouldn&#8217;t use the regular expression token search for anything more than a toy.</p>
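<p>To make the cost in the note above concrete, here is a minimal sketch of a lookup table that keeps string tokens in a hash and falls back to scanning regular-expression keys on a miss. The <code>TokenTable</code> class and its methods are hypothetical names made up for illustration; this is not the code from the repository, only one way such a fallback could work.</p>

```ruby
# Hypothetical sketch, not the code from the boyermoore repository.
class TokenTable
  def initialize
    @exact = {}     # String tokens: constant-time hash lookup
    @patterns = []  # [Regexp, value] pairs: scanned linearly on a miss
  end

  # Store string keys in the hash and regexp keys in the pattern list.
  def []=(key, value)
    if key.is_a?(Regexp)
      @patterns.push([key, value])
    else
      @exact[key] = value
    end
  end

  # Look up a token; on a hash miss, try every regular expression in turn.
  def [](token)
    return @exact[token] if @exact.key?(token)
    pair = @patterns.find { |pattern, _| pattern.match?(token) }
    pair ? pair.last : nil
  end
end

table = TokenTable.new
table["foo"] = 0
table[/^\d{5}$/] = 1

table["foo"]    # exact hit, no regexp work
table["90210"]  # hash miss, then matched by the regexp key
table["bar"]    # hash miss, every regexp tried, then nil
```

<p>With only string keys every lookup stays a single hash access. Each regular-expression key adds one more match attempt to every miss, which is why the approach gets slow as the number of regexp tokens grows.</p>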
<p>Download the <a href="http://github.com/jashmenn/boyermoore">Boyer-Moore string search algorithm in Ruby</a>.</p>
]]></content:encoded>
			<wfw:commentRss>http://eigenjoy.com/2010/10/27/boyer-moore-string-search-algorithm-in-ruby/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>A Paging UIScrollView in Cocos2d (with previews)</title>
		<link>http://eigenjoy.com/2010/09/08/a-paging-uiscrollview-in-cocos2d-with-previews/</link>
		<comments>http://eigenjoy.com/2010/09/08/a-paging-uiscrollview-in-cocos2d-with-previews/#comments</comments>
		<pubDate>Thu, 09 Sep 2010 03:46:36 +0000</pubDate>
		<dc:creator>Nate Murray</dc:creator>
				<category><![CDATA[programming]]></category>

		<guid isPermaLink="false">http://www.xcombinator.com/?p=300</guid>
		<description><![CDATA[I&#8217;ve created a sample project that shows how to do a paged UIScrollView within Cocos2d. Here&#8217;s a video showing the effect:

You can find the code on github
My solution&#8217;s main ideas are adapted from these two pages:

http://getsetgames.com/2009/08/21/cocos2d-and-uiscrollview/
http://blog.proculo.de/archives/180-Paging-enabled-UIScrollView-With-Previews.html

My contribution is combining the UIScrollView with previews with Cocos2d and cleaning it up.
If you haven&#8217;t tried to implement this [...]]]></description>
			<content:encoded><![CDATA[<p>I&#8217;ve created a sample project that shows how to do a paged <code>UIScrollView</code> within Cocos2d. Here&#8217;s a video showing the effect:</p>
<p><object width="499" height="306"><param name="movie" value="http://www.youtube.com/v/2IgbRzGfBHk?fs=1&amp;hl=en_US&amp;rel=0&amp;hd=1"></param><param name="allowFullScreen" value="true"></param><param name="allowscriptaccess" value="always"></param><embed src="http://www.youtube.com/v/2IgbRzGfBHk?fs=1&amp;hl=en_US&amp;rel=0&amp;hd=1" type="application/x-shockwave-flash" allowscriptaccess="always" allowfullscreen="true" width="499" height="306"></embed></object></p>
<p>You can find the code <a href="http://github.com/jashmenn/shapes-panels">on GitHub</a>.</p>
<p>My solution&#8217;s main ideas are adapted from these two pages:</p>
<ul>
<li><a href="http://getsetgames.com/2009/08/21/cocos2d-and-uiscrollview/">http://getsetgames.com/2009/08/21/cocos2d-and-uiscrollview/</a></li>
<li><a href="http://blog.proculo.de/archives/180-Paging-enabled-UIScrollView-With-Previews.html">http://blog.proculo.de/archives/180-Paging-enabled-UIScrollView-With-Previews.html</a></li>
</ul>
<p>My contribution is combining the <code>UIScrollView</code> with previews with Cocos2d and cleaning it up.</p>
<p>If you haven&#8217;t tried to implement this before, it might not be obvious why it is tricky. Apple&#8217;s <code>UIScrollView</code> gives you a view that scrolls and optionally snaps to pages. The effect you see in the video above (and in the Angry Birds level selection and many other apps) shows a preview of the neighboring panel on either side. This lets you easily see whether a next or previous page exists, along with a preview of that page.</p>
<p>The problem is that Apple&#8217;s <code>UIScrollView</code> doesn&#8217;t let you set the page width separately from its frame, so you can&#8217;t page by less than a whole screen (well, by less than the whole width of the <code>UIScrollView</code>; more on that later).</p>
<p>To get around this I originally tried writing my own paging controller. If you&#8217;ve tried this you&#8217;ll know that it is extremely tricky to get the same interaction dynamics as Apple&#8217;s. (For instance, pull out your phone and play with the Photos application. Notice that if you just drag slowly, you lack enough inertia to go to the next page, so it will snap back to the frame you are on. If you flick fast over a small area, the page will skip to the next frame. Etc.) While at first glance the rules seem easy to reimplement, you have to cover a lot of edge cases to recreate the familiar paging interaction.</p>
<p>So ideally we should find a way to use Apple&#8217;s <code>UIScrollView</code> and save ourselves a lot of work.</p>
<p>As we said above, a <code>UIScrollView</code> will only page by the width of the entire <code>UIScrollView</code>. So to get this preview effect you can create a <code>UIScrollView</code> that is narrower than the screen. The problem here is that any touches that land outside of that <code>UIScrollView</code> (say, on the edge of the screen) won&#8217;t be sent to the <code>UIScrollView</code>.</p>
<p>Our solution (again, borrowed largely from the above links) looks like this:</p>
<p><a href="http://www.xcombinator.com/wp-content/uploads/2010/09/shapes-panels-post.jpg"><img src="http://www.xcombinator.com/wp-content/uploads/2010/09/shapes-panels-post.jpg" alt="Jacob's Shapes Panels Layers" title="shapes-panels-post" width="507" height="235" class="aligncenter size-full wp-image-306" /></a></p>
<p>The idea is this:</p>
<ul>
<li>We create a <code>CCMenu</code> and add it to a <code>CCLayer</code></li>
<li>The <code>UIScrollView</code> is resized to the width of our panel images (smaller than the whole screen)</li>
<li>The <code>UIScrollView</code> transforms its scrolling action into moving the position of the <code>CCLayer</code> containing our <code>CCMenu</code></li>
<li>We create a full-screen <code>TouchDelegatingView</code> that simply forwards its touches on to the <code>UIScrollView</code></li>
</ul>
<h2>More Details</h2>
<p>In <a href="http://www.littlehiccup.com">Jacob&#8217;s Shapes</a> (JS), we have a <code>GameController</code> which knows all of the levels. For the sake of this example, we&#8217;re just going to store all the level names in an <code>NSArray</code>.</p>

<div class="wp_syntax"><div class="code"><pre class="objc" style="font-family:monospace;"><span style="color: #11740a; font-style: italic;">// HCUPPanelScene.m (in onEnter)</span>
<span style="color: #400080;">NSArray</span><span style="color: #002200;">*</span> panelNames <span style="color: #002200;">=</span> <span style="color: #002200;">&#91;</span><span style="color: #400080;">NSArray</span> arrayWithObjects<span style="color: #002200;">:</span> 
    <span style="color: #bf1d1a;">@</span><span style="color: #bf1d1a;">&quot;amazon&quot;</span>, <span style="color: #bf1d1a;">@</span><span style="color: #bf1d1a;">&quot;arctic&quot;</span>,
    <span style="color: #bf1d1a;">@</span><span style="color: #bf1d1a;">&quot;brkfst&quot;</span>, <span style="color: #bf1d1a;">@</span><span style="color: #bf1d1a;">&quot;camp&quot;</span>, 
    <span style="color: #bf1d1a;">@</span><span style="color: #bf1d1a;">&quot;city&quot;</span>, <span style="color: #a61390;">nil</span><span style="color: #002200;">&#93;</span>;
<span style="color: #a61390;">int</span> numberOfPages <span style="color: #002200;">=</span> <span style="color: #002200;">&#91;</span>panelNames count<span style="color: #002200;">&#93;</span>;
&nbsp;
<span style="color: #11740a; font-style: italic;">// create an empty layer for us to work with</span>
CCLayer<span style="color: #002200;">*</span> panels <span style="color: #002200;">=</span> <span style="color: #002200;">&#91;</span>CCLayer node<span style="color: #002200;">&#93;</span>;</pre></div></div>

<h2>Custom CCMenu and CCMenuItem</h2>
<p>We use custom subclasses of <code>CCMenu</code> and <code>CCMenuItem</code>: <code>NMPanelMenu</code> and <code>NMPanelMenuItem</code>, respectively. <code>NMPanelMenu</code> tweaks how the current item is determined. Subclassing <code>CCMenuItem</code> as <code>NMPanelMenuItem</code> allows us to add metadata about the panel, play sounds, and optimize how we use the images for selected panels.</p>

<div class="wp_syntax"><div class="code"><pre class="objc" style="font-family:monospace;"><span style="color: #11740a; font-style: italic;">// HCUPPanelScene.m</span>
NMPanelMenu<span style="color: #002200;">*</span> menu <span style="color: #002200;">=</span> <span style="color: #002200;">&#91;</span>NMPanelMenu menuWithItems<span style="color: #002200;">:</span> <span style="color: #a61390;">nil</span><span style="color: #002200;">&#93;</span>;
<span style="color: #a61390;">float</span> onePanelWide <span style="color: #002200;">=</span> <span style="color: #002200;">-</span><span style="color: #2400d9;">1</span>;
&nbsp;
<span style="color: #11740a; font-style: italic;">// Now add the panels</span>
<span style="color: #a61390;">for</span><span style="color: #002200;">&#40;</span><span style="color: #a61390;">int</span> i<span style="color: #002200;">=</span><span style="color: #2400d9;">0</span>; i <span style="color: #002200;">&lt;</span> numberOfPages; i<span style="color: #002200;">++</span><span style="color: #002200;">&#41;</span> <span style="color: #002200;">&#123;</span>
    <span style="color: #400080;">NSString</span><span style="color: #002200;">*</span> currentName <span style="color: #002200;">=</span> <span style="color: #002200;">&#91;</span>panelNames objectAtIndex<span style="color: #002200;">:</span>i<span style="color: #002200;">&#93;</span>;
    CCSprite<span style="color: #002200;">*</span> pane2 <span style="color: #002200;">=</span> <span style="color: #002200;">&#91;</span>CCSprite spriteWithFile<span style="color: #002200;">:</span><span style="color: #002200;">&#91;</span><span style="color: #400080;">NSString</span> stringWithFormat<span style="color: #002200;">:</span> <span style="color: #bf1d1a;">@</span><span style="color: #bf1d1a;">&quot;%@-panel.png&quot;</span>, currentName<span style="color: #002200;">&#93;</span><span style="color: #002200;">&#93;</span>;
    NMPanelMenuItem<span style="color: #002200;">*</span> menuItem2 <span style="color: #002200;">=</span> <span style="color: #002200;">&#91;</span><span style="color: #002200;">&#91;</span>NMPanelMenuItem alloc<span style="color: #002200;">&#93;</span> initFromNormalSprite<span style="color: #002200;">:</span>pane2 
                                                                selectedSprite<span style="color: #002200;">:</span>pane2
                                                                  activeSprite<span style="color: #002200;">:</span>pane2
                                                                disabledSprite<span style="color: #002200;">:</span>pane2
                                                                          name<span style="color: #002200;">:</span>currentName
                                                                        target<span style="color: #002200;">:</span>self selector<span style="color: #002200;">:</span><span style="color: #a61390;">@selector</span><span style="color: #002200;">&#40;</span>levelPicked<span style="color: #002200;">:</span><span style="color: #002200;">&#41;</span><span style="color: #002200;">&#93;</span>;
    menuItem2.world <span style="color: #002200;">=</span> i;
    menuItem2.name <span style="color: #002200;">=</span> currentName;
    <span style="color: #002200;">&#91;</span>menu addChild<span style="color: #002200;">:</span> menuItem2<span style="color: #002200;">&#93;</span>;
    <span style="color: #002200;">&#91;</span>menuItem2 release<span style="color: #002200;">&#93;</span>;
    <span style="color: #11740a; font-style: italic;">// set onePanelWide to be the width of the first panel</span>
    <span style="color: #a61390;">if</span><span style="color: #002200;">&#40;</span>i<span style="color: #002200;">==</span><span style="color: #2400d9;">0</span><span style="color: #002200;">&#41;</span> onePanelWide <span style="color: #002200;">=</span> <span style="color: #002200;">&#91;</span>pane2 textureRect<span style="color: #002200;">&#93;</span>.size.width;
<span style="color: #002200;">&#125;</span></pre></div></div>

<p>Here we used <code>CCSprite#spriteWithFile</code>, but in JS we use Zwoptex-created sprite sheets for the panels and then create sprites from those frames. This makes a huge difference in the load time of this scene when you have 20 panels. In JS, instead of loading 20 textures (one for each panel), we load only 2 textures, each containing 10 panels.</p>
<p>JS is graphics heavy, and we definitely had to pay attention to file sizes to keep it under 22MB. Originally I had created two versions of each panel, one for &#8220;off&#8221; and one with a glow for &#8220;on&#8221; (active/selected). Each panel, as a transparent PNG, was somewhere around 100KB. So 100KB x 2 states x 20 panels came to roughly 4MB just for this single scene.</p>
<p>We decided to sacrifice a bit of the quality of the glow for the &#8220;on&#8221; state and just create one transparent image for the glow and reuse that for every panel.</p>
<p>To use the glow, a portion of our <code>NMPanelMenuItem</code> looks like this:</p>

<div class="wp_syntax"><div class="code"><pre class="objc" style="font-family:monospace;"><span style="color: #11740a; font-style: italic;">// NMPanelMenuItem.m</span>
<span style="color: #002200;">-</span><span style="color: #002200;">&#40;</span><span style="color: #a61390;">void</span><span style="color: #002200;">&#41;</span> activate
<span style="color: #002200;">&#123;</span>
    isActive_ <span style="color: #002200;">=</span> <span style="color: #a61390;">YES</span>;
    <span style="color: #11740a; font-style: italic;">// play sound here</span>
    <span style="color: #002200;">&#91;</span>super activate<span style="color: #002200;">&#93;</span>;
<span style="color: #002200;">&#125;</span>
&nbsp;
<span style="color: #002200;">-</span><span style="color: #002200;">&#40;</span><span style="color: #a61390;">void</span><span style="color: #002200;">&#41;</span> draw
<span style="color: #002200;">&#123;</span>
    <span style="color: #a61390;">if</span><span style="color: #002200;">&#40;</span>isActive_<span style="color: #002200;">&#41;</span> <span style="color: #002200;">&#123;</span>
        <span style="color: #002200;">&#91;</span>self.activeImage draw<span style="color: #002200;">&#93;</span>;
        <span style="color: #a61390;">if</span><span style="color: #002200;">&#40;</span>self.showGlow<span style="color: #002200;">&#41;</span> <span style="color: #002200;">&#91;</span>self.glow draw<span style="color: #002200;">&#93;</span>;
    <span style="color: #002200;">&#125;</span> <span style="color: #a61390;">else</span> <span style="color: #002200;">&#123;</span>
        <span style="color: #002200;">&#91;</span>super draw<span style="color: #002200;">&#93;</span>;
    <span style="color: #002200;">&#125;</span>
<span style="color: #002200;">&#125;</span></pre></div></div>

<p>Where <code>self.glow</code> is a <code>CCSprite</code> attached to the <code>NMPanelMenuItem</code>. </p>
<h2>Adding the Cocos2d Panels</h2>
<p>Next we need to set up some basic options for how much padding we want and what the total width of the panels layer is going to be. Then we add the panels to our scene and set the position.</p>

<div class="wp_syntax"><div class="code"><pre class="objc" style="font-family:monospace;"><span style="color: #11740a; font-style: italic;">// HCUPPanelScene.m</span>
<span style="color: #a61390;">float</span> padding <span style="color: #002200;">=</span> <span style="color: #2400d9;">15</span>;
<span style="color: #a61390;">float</span> totalPanelWidth <span style="color: #002200;">=</span> onePanelWide <span style="color: #002200;">+</span> padding<span style="color: #002200;">*</span><span style="color: #2400d9;">2</span>;
<span style="color: #a61390;">float</span> totalWidth <span style="color: #002200;">=</span> numberOfPages <span style="color: #002200;">*</span> totalPanelWidth; <span style="color: #11740a; font-style: italic;">// (wait, do we need padding in here?)</span>
&nbsp;
<span style="color: #a61390;">int</span> currentWorldOffset <span style="color: #002200;">=</span> <span style="color: #2400d9;">0</span>;    <span style="color: #11740a; font-style: italic;">// current world number. </span>
<span style="color: #11740a; font-style: italic;">// int currentWorldOffset = 1; // Try changing to 1 and see what happens</span>
&nbsp;
<span style="color: #002200;">&#91;</span>menu alignItemsHorizontallyWithPadding<span style="color: #002200;">:</span> padding<span style="color: #002200;">*</span><span style="color: #2400d9;">2</span><span style="color: #002200;">&#93;</span>;
&nbsp;
<span style="color: #11740a; font-style: italic;">// add our panels layer</span>
<span style="color: #002200;">&#91;</span>panels addChild<span style="color: #002200;">:</span>menu<span style="color: #002200;">&#93;</span>;
<span style="color: #002200;">&#91;</span>self addChild<span style="color: #002200;">:</span>panels<span style="color: #002200;">&#93;</span>;
&nbsp;
<span style="color: #11740a; font-style: italic;">// set the position of the menu to the center of the very first panel</span>
menu.position <span style="color: #002200;">=</span> ccpAdd<span style="color: #002200;">&#40;</span>menu.position, ccp<span style="color: #002200;">&#40;</span>totalWidth<span style="color: #002200;">/</span><span style="color: #2400d9;">2</span> <span style="color: #002200;">-</span> totalPanelWidth<span style="color: #002200;">/</span><span style="color: #2400d9;">2</span>, <span style="color: #2400d9;">0</span><span style="color: #002200;">&#41;</span><span style="color: #002200;">&#41;</span>;</pre></div></div>

<p>Note that the panels are the visual representation but we haven&#8217;t added in any scrolling dynamics. To do that we need to add a <code>UIScrollView</code>.</p>
<h2>Adding the UIScrollView</h2>
<p>Here we do two things: </p>
<ol>
<li>We add our <code>CocosOverlayScrollView</code>, which is only one panel wide (less than the whole screen). If we had only this view, we wouldn&#8217;t be notified of touches on the edge of the screen.</li>
<li>We add the <code>TouchDelegatingView</code>, which is full screen. The <code>TouchDelegatingView</code> will delegate any touches it receives to our paging scroll view.</li>
</ol>

<div class="wp_syntax"><div class="code"><pre class="objc" style="font-family:monospace;"><span style="color: #11740a; font-style: italic;">// HCUPPanelScene.m</span>
<span style="color: #11740a; font-style: italic;">// Note that we're only concerned with a horizontal iPhone. If your game is</span>
<span style="color: #11740a; font-style: italic;">// vertical, change accordingly</span>
touchDelegatingView <span style="color: #002200;">=</span> <span style="color: #002200;">&#91;</span><span style="color: #002200;">&#91;</span>TouchDelegatingView alloc<span style="color: #002200;">&#93;</span> initWithFrame<span style="color: #002200;">:</span>CGRectMake<span style="color: #002200;">&#40;</span><span style="color: #2400d9;">0</span>, <span style="color: #2400d9;">0</span>, <span style="color: #2400d9;">320</span>, <span style="color: #2400d9;">480</span><span style="color: #002200;">&#41;</span><span style="color: #002200;">&#93;</span>;
scrollView <span style="color: #002200;">=</span> <span style="color: #002200;">&#91;</span><span style="color: #002200;">&#91;</span>CocosOverlayScrollView alloc<span style="color: #002200;">&#93;</span> initWithFrame<span style="color: #002200;">:</span>CGRectMake<span style="color: #002200;">&#40;</span><span style="color: #2400d9;">0</span>, <span style="color: #2400d9;">0</span>, <span style="color: #2400d9;">320</span>, totalPanelWidth<span style="color: #002200;">&#41;</span>
                                                  numPages<span style="color: #002200;">:</span> numberOfPages
                                                     width<span style="color: #002200;">:</span> totalPanelWidth
                                                     layer<span style="color: #002200;">:</span> panels<span style="color: #002200;">&#93;</span>;
touchDelegatingView.scrollView <span style="color: #002200;">=</span> scrollView;
&nbsp;
<span style="color: #11740a; font-style: italic;">// this is just to pre-set the scroll view to a particular panel</span>
<span style="color: #002200;">&#91;</span>scrollView setContentOffset<span style="color: #002200;">:</span> CGPointMake<span style="color: #002200;">&#40;</span><span style="color: #2400d9;">0</span>, currentWorldOffset <span style="color: #002200;">*</span> totalPanelWidth<span style="color: #002200;">&#41;</span> animated<span style="color: #002200;">:</span> <span style="color: #a61390;">NO</span><span style="color: #002200;">&#93;</span>;
&nbsp;
<span style="color: #11740a; font-style: italic;">// Add views to cocos2d</span>
<span style="color: #11740a; font-style: italic;">// We called it a TouchDelegatingView, but it actually isn't containing anything at all.</span>
<span style="color: #11740a; font-style: italic;">// In reality it is just taking up any space under our ScrollView and delegating the touches. </span>
<span style="color: #002200;">&#91;</span><span style="color: #002200;">&#91;</span><span style="color: #002200;">&#91;</span>CCDirector sharedDirector<span style="color: #002200;">&#93;</span> openGLView<span style="color: #002200;">&#93;</span> addSubview<span style="color: #002200;">:</span>touchDelegatingView<span style="color: #002200;">&#93;</span>;
<span style="color: #002200;">&#91;</span><span style="color: #002200;">&#91;</span><span style="color: #002200;">&#91;</span>CCDirector sharedDirector<span style="color: #002200;">&#93;</span> openGLView<span style="color: #002200;">&#93;</span> addSubview<span style="color: #002200;">:</span>scrollView<span style="color: #002200;">&#93;</span>;
&nbsp;
<span style="color: #002200;">&#91;</span>scrollView release<span style="color: #002200;">&#93;</span>;
<span style="color: #002200;">&#91;</span>touchDelegatingView release<span style="color: #002200;">&#93;</span>;</pre></div></div>

<p>You can configure your <code>UIScrollView</code> options by changing the code in <code>CocosOverlayScrollView#initWithFrame:numPages:width:layer:</code>. (Note that this class was originally written by <a href="http://blog.proculo.de/archives/180-Paging-enabled-UIScrollView-With-Previews.html">Alexander Repty</a>.)</p>
<p>The <code>TouchDelegatingView</code> simply delegates any touches it receives to the <code>CocosOverlayScrollView</code>.</p>
<p>And there you have it! Feel free to <a href="http://github.com/jashmenn/shapes-panels">fork and make any changes to the code</a> and send me a pull request.</p>
<p>What do you think? Have any ideas for cleaning it up? Leave your comments below!</p>
]]></content:encoded>
			<wfw:commentRss>http://eigenjoy.com/2010/09/08/a-paging-uiscrollview-in-cocos2d-with-previews/feed/</wfw:commentRss>
		<slash:comments>12</slash:comments>
		</item>
		<item>
		<title>index and working tree do not reflect changes that are now in HEAD</title>
		<link>http://eigenjoy.com/2010/09/04/index-and-working-tree-do-not-reflect-changes-that-are-now-in-head/</link>
		<comments>http://eigenjoy.com/2010/09/04/index-and-working-tree-do-not-reflect-changes-that-are-now-in-head/#comments</comments>
		<pubDate>Sat, 04 Sep 2010 17:30:54 +0000</pubDate>
		<dc:creator>Nate Murray</dc:creator>
				<category><![CDATA[programming]]></category>

		<guid isPermaLink="false">http://www.xcombinator.com/?p=296</guid>
		<description><![CDATA[After my recent git class someone asked this question:

 I was trying some remote git repo tests last night, and when I went to push my changes, I received this warning message:
warning: updating the currently checked out branch; this may cause confusion, as the index and working tree do not reflect changes that are now [...]]]></description>
			<content:encoded><![CDATA[<p>After my recent git class someone asked this question:</p>
<blockquote><p>
 I was trying some remote git repo tests last night, and when I went to push my changes, I received this warning message:</p>
<p><code>warning: updating the currently checked out branch; this may cause confusion, as the index and working tree do not reflect changes that are now in HEAD.</code></p>
<p>On remote server I created a directory and did &#8220;<code>git init</code>&#8221;. Then cloned it from my local machine, did changes, committed, and then push. All seemed straightforward there. Any thoughts?</p>
</blockquote>
<p>The issue git is warning you about is that your remote has both the <em>repository</em> and a <em>working copy</em>. That is, on the remote server you have a directory <code>project/</code> with files in it (the working copy) and the folder <code>project/.git</code> (the repository).</p>
<p>If you push from your local machine to the remote, you will only be updating files in the repository and <strong>not</strong> the working copy. That is, the non-git files will not be changed. This can be confusing because you might log into the remote after you push and expect the working copy to be different.</p>
<p>To deal with this possible confusion, <code>git init</code> provides a <code>--bare</code> option, which creates the repository only (no working copy). You can then <code>push</code> to and <code>pull</code> from the remote as you would with a central svn server.</p>
<p>Let me show an example. Say I have an existing git repository on my local machine and I want to create a new remote to back it up. My workflow would look like this:<br />
<code><br />
ssh me@myserver.com<br />
mkdir ~/git/newproject.git<br />
cd ~/git/newproject.git<br />
git init --bare<br />
exit<br />
git remote add myserver me@myserver.com:/home/nmurray/git/newproject.git<br />
git push myserver master</code></p>
<p>If you want, you could even chain these together as a single command:</p>
<p><code>ssh me@myserver.com "mkdir ~/git/newproject.git &#038;&#038; cd ~/git/newproject.git &#038;&#038; git init --bare" &#038;&#038; echo git remote add myserver me@myserver.com:/home/nmurray/git/newproject.git</code></p>
<p>Hope this helps!</p>
]]></content:encoded>
			<wfw:commentRss>http://eigenjoy.com/2010/09/04/index-and-working-tree-do-not-reflect-changes-that-are-now-in-head/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Getting Cascading to Read Sequence Files Created Somewhere Else</title>
		<link>http://eigenjoy.com/2010/09/02/getting-cascading-to-read-sequence-files-created-somewhere-else/</link>
		<comments>http://eigenjoy.com/2010/09/02/getting-cascading-to-read-sequence-files-created-somewhere-else/#comments</comments>
		<pubDate>Thu, 02 Sep 2010 20:54:09 +0000</pubDate>
		<dc:creator>Nate Murray</dc:creator>
				<category><![CDATA[programming]]></category>

		<guid isPermaLink="false">http://www.xcombinator.com/?p=291</guid>
		<description><![CDATA[Sometimes you can&#8217;t control where your data comes from or how it&#8217;s formatted. For instance, where I work a lot of data is stored in SequenceFiles. Unfortunately, the files are not taking advantage of the typing SequenceFiles provide and instead each record is a single field containing a delimited string.
I like to use Cascading (or cascalog) for [...]]]></description>
			<content:encoded><![CDATA[<p><em>Sometimes you can&#8217;t control</em> where your data comes from or how it&#8217;s formatted. For instance, where I work a lot of data is stored in <a href="http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/io/SequenceFile.html"><code>SequenceFile</code>s</a>. Unfortunately, the files do not take advantage of the typing <code>SequenceFile</code>s provide; instead, each record is a single field containing a delimited string.</p>
<p>I like to use Cascading (or cascalog) for my Hadoop jobs, but out of the box Cascading doesn&#8217;t support <code>SequenceFile</code>s that were created outside of Cascading. That is to say, Cascading requires that the values in your <code>SequenceFile</code>s be instances of <code>Tuple</code>.</p>
<p>The solution is to create your own <code>Scheme</code> that parses a <code>SequenceFile</code> according to your own format. In my case I just want to read each record&#8217;s value as a single string.</p>
<p>The code is simple but may not be obvious for a first-time Cascading user. I hope this will save someone a few minutes.</p>

<div class="wp_syntax"><div class="code"><pre class="java" style="font-family:monospace;">    <span style="color: #000000; font-weight: bold;">package</span> <span style="color: #006699;">com.xcombinator</span><span style="color: #339933;">;</span>
&nbsp;
    <span style="color: #000000; font-weight: bold;">import</span> <span style="color: #006699;">java.io.IOException</span><span style="color: #339933;">;</span>
&nbsp;
    <span style="color: #000000; font-weight: bold;">import</span> <span style="color: #006699;">cascading.tap.Tap</span><span style="color: #339933;">;</span>
    <span style="color: #000000; font-weight: bold;">import</span> <span style="color: #006699;">cascading.tuple.Fields</span><span style="color: #339933;">;</span>
    <span style="color: #000000; font-weight: bold;">import</span> <span style="color: #006699;">cascading.tuple.Tuple</span><span style="color: #339933;">;</span>
    <span style="color: #000000; font-weight: bold;">import</span> <span style="color: #006699;">cascading.tuple.TupleEntry</span><span style="color: #339933;">;</span>
    <span style="color: #000000; font-weight: bold;">import</span> <span style="color: #006699;">cascading.tuple.Tuples</span><span style="color: #339933;">;</span>
    <span style="color: #000000; font-weight: bold;">import</span> <span style="color: #006699;">cascading.scheme.SequenceFile</span><span style="color: #339933;">;</span>
    <span style="color: #000000; font-weight: bold;">import</span> <span style="color: #006699;">org.apache.hadoop.mapred.JobConf</span><span style="color: #339933;">;</span>
    <span style="color: #000000; font-weight: bold;">import</span> <span style="color: #006699;">org.apache.hadoop.mapred.OutputCollector</span><span style="color: #339933;">;</span>
    <span style="color: #000000; font-weight: bold;">import</span> <span style="color: #006699;">org.apache.hadoop.mapred.SequenceFileInputFormat</span><span style="color: #339933;">;</span>
    <span style="color: #000000; font-weight: bold;">import</span> <span style="color: #006699;">org.apache.hadoop.mapred.SequenceFileOutputFormat</span><span style="color: #339933;">;</span>
&nbsp;
    <span style="color: #008000; font-style: italic; font-weight: bold;">/**
     * A SequenceFileAsText is a type of {@link SequenceFile}, however the
     * SequenceFile has been created outside of Cascading and is assumed to have a
     * value of a string.
     */</span>
    <span style="color: #000000; font-weight: bold;">public</span> <span style="color: #000000; font-weight: bold;">class</span> SequenceFileAsText <span style="color: #000000; font-weight: bold;">extends</span> SequenceFile
      <span style="color: #009900;">&#123;</span>
      <span style="color: #008000; font-style: italic; font-weight: bold;">/** Field serialVersionUID */</span>
      <span style="color: #000000; font-weight: bold;">private</span> <span style="color: #000000; font-weight: bold;">static</span> <span style="color: #000000; font-weight: bold;">final</span> <span style="color: #000066; font-weight: bold;">long</span> serialVersionUID <span style="color: #339933;">=</span> 1L<span style="color: #339933;">;</span>
&nbsp;
      <span style="color: #008000; font-style: italic; font-weight: bold;">/** Protected for use by TempDfs and other subclasses. Not for general consumption. */</span>
      <span style="color: #000000; font-weight: bold;">protected</span> SequenceFileAsText<span style="color: #009900;">&#40;</span><span style="color: #009900;">&#41;</span>
        <span style="color: #009900;">&#123;</span>
        <span style="color: #000000; font-weight: bold;">super</span><span style="color: #009900;">&#40;</span> <span style="color: #000066; font-weight: bold;">null</span> <span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
        <span style="color: #009900;">&#125;</span>
&nbsp;
      <span style="color: #008000; font-style: italic; font-weight: bold;">/**
       * Creates a new SequenceFileAsText instance that stores the given field names.
       *
       * @param fields
       */</span>
      <span style="color: #000000; font-weight: bold;">public</span> SequenceFileAsText<span style="color: #009900;">&#40;</span> Fields fields <span style="color: #009900;">&#41;</span>
        <span style="color: #009900;">&#123;</span>
        <span style="color: #000000; font-weight: bold;">super</span><span style="color: #009900;">&#40;</span> fields <span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
        <span style="color: #009900;">&#125;</span>
&nbsp;
      @Override
      <span style="color: #000000; font-weight: bold;">public</span> Tuple source<span style="color: #009900;">&#40;</span> <span style="color: #003399;">Object</span> key, <span style="color: #003399;">Object</span> value <span style="color: #009900;">&#41;</span>
      <span style="color: #009900;">&#123;</span>
        <span style="color: #000000; font-weight: bold;">if</span> <span style="color: #009900;">&#40;</span>value <span style="color: #000000; font-weight: bold;">instanceof</span> Tuple<span style="color: #009900;">&#41;</span>
        <span style="color: #009900;">&#123;</span>
          <span style="color: #000000; font-weight: bold;">return</span> <span style="color: #009900;">&#40;</span>Tuple<span style="color: #009900;">&#41;</span> value<span style="color: #339933;">;</span>
        <span style="color: #009900;">&#125;</span>
        <span style="color: #000000; font-weight: bold;">else</span> <span style="color: #000000; font-weight: bold;">if</span> <span style="color: #009900;">&#40;</span>value <span style="color: #000000; font-weight: bold;">instanceof</span> <span style="color: #003399;">Comparable</span><span style="color: #009900;">&#41;</span>
        <span style="color: #009900;">&#123;</span>
          <span style="color: #000000; font-weight: bold;">return</span> <span style="color: #000000; font-weight: bold;">new</span> Tuple<span style="color: #009900;">&#40;</span><span style="color: #009900;">&#40;</span><span style="color: #003399;">Comparable</span><span style="color: #009900;">&#41;</span> value<span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
        <span style="color: #009900;">&#125;</span>
        <span style="color: #000000; font-weight: bold;">else</span> <span style="color: #000000; font-weight: bold;">if</span> <span style="color: #009900;">&#40;</span>value <span style="color: #339933;">!=</span> <span style="color: #000066; font-weight: bold;">null</span><span style="color: #009900;">&#41;</span>
        <span style="color: #009900;">&#123;</span>
          <span style="color: #000000; font-weight: bold;">return</span> <span style="color: #000000; font-weight: bold;">new</span> Tuple<span style="color: #009900;">&#40;</span><span style="color: #003399;">String</span>.<span style="color: #006633;">valueOf</span><span style="color: #009900;">&#40;</span>value<span style="color: #009900;">&#41;</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
        <span style="color: #009900;">&#125;</span>
        <span style="color: #000000; font-weight: bold;">else</span>
        <span style="color: #009900;">&#123;</span>
          <span style="color: #000000; font-weight: bold;">return</span> <span style="color: #000000; font-weight: bold;">new</span> Tuple<span style="color: #009900;">&#40;</span><span style="color: #009900;">&#40;</span><span style="color: #003399;">Comparable</span><span style="color: #009900;">&#41;</span><span style="color: #000066; font-weight: bold;">null</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
        <span style="color: #009900;">&#125;</span>
      <span style="color: #009900;">&#125;</span>
&nbsp;
    <span style="color: #009900;">&#125;</span></pre></div></div>

]]></content:encoded>
			<wfw:commentRss>http://eigenjoy.com/2010/09/02/getting-cascading-to-read-sequence-files-created-somewhere-else/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>git cheatsheet and class notes</title>
		<link>http://eigenjoy.com/2010/09/01/git-cheat-sheet-and-class-notes/</link>
		<comments>http://eigenjoy.com/2010/09/01/git-cheat-sheet-and-class-notes/#comments</comments>
		<pubDate>Wed, 01 Sep 2010 19:27:48 +0000</pubDate>
		<dc:creator>Nate Murray</dc:creator>
				<category><![CDATA[programming]]></category>

		<guid isPermaLink="false">http://www.xcombinator.com/?p=274</guid>
		<description><![CDATA[I recently gave a talk at work about git. I created a cheat sheet based on <a href="http://clojure.org/cheatsheet">Steve Tayon's Clojure Cheatsheet</a>. 

[caption id="attachment_276" align="center" width="496" caption="Git Cheat Sheet Preview"]<a href="http://www.xcombinator.com/wp-content/uploads/2010/09/git-class-cheat-sheet.pdf"><img src="http://www.xcombinator.com/wp-content/uploads/2010/09/git-cheat-sheet-preview.jpg" alt="Git Cheat Sheet Preview" title="Git Cheat Sheet Preview" width="496" height="347" class="size-full wp-image-276" /></a>[/caption]

I realize there are a <a href="http://zrusin.blogspot.com/2007/09/git-cheat-sheet.html">number</a> <a href="http://github.com/guides/git-cheat-sheet">of</a> <a href="http://cheat.errtheblog.com/s/git">cheatsheets</a> for git already. However, I wanted a simple, one-page sheet specifically for my attendees. 

You can download it here:
<ul>
	<li><a href="http://www.xcombinator.com/wp-content/uploads/2010/09/git-class-cheat-sheet.pdf">git cheatsheet pdf</a></li>
	<li><a href="http://github.com/jashmenn/talks/raw/master/git/cheat-sheet/git-class-cheat-sheet.tex">git cheatsheet LaTeX source</a></li>
</ul>

You can find the raw notes of my talk after the jump.




 
]]></description>
			<content:encoded><![CDATA[<p>I recently gave a talk at work about git. I created a cheatsheet based on <a href="http://clojure.org/cheatsheet">Steve Tayon&#8217;s Clojure Cheatsheet</a>. </p>
<div id="attachment_276" class="wp-caption center" style="width: 506px"><a href="http://www.xcombinator.com/wp-content/uploads/2010/09/git-class-cheat-sheet.pdf"><img src="http://www.xcombinator.com/wp-content/uploads/2010/09/git-cheat-sheet-preview.jpg" alt="Git Cheat Sheet Preview" title="Git Cheat Sheet Preview" width="496" height="347" class="size-full wp-image-276" /></a><p class="wp-caption-text">Git Cheat Sheet Preview</p></div>
<p>I realize there are a <a href="http://zrusin.blogspot.com/2007/09/git-cheat-sheet.html">number</a> <a href="http://github.com/guides/git-cheat-sheet">of</a> <a href="http://cheat.errtheblog.com/s/git">cheatsheets</a> for git already. However, I wanted a simple, one-page sheet specifically for my attendees. </p>
<p>You can download it here:</p>
<ul>
<li><a href="http://www.xcombinator.com/wp-content/uploads/2010/09/git-class-cheat-sheet.pdf">git cheatsheet pdf</a></li>
<li><a href="http://github.com/jashmenn/talks/raw/master/git/cheat-sheet/git-class-cheat-sheet.tex">git cheatsheet LaTeX source</a></li>
</ul>
<p>Like it? Hate it? Find a typo? <a href="http://www.xcombinator.com/2010/09/01/git-cheat-sheet-and-class-notes/#comments">Leave your feedback in the comments!</a></p>
<hr/>
<p>Here are my raw notes from the talk:<br />
<code></p>
<pre>
;; -*- mode: Markdown; -*-

# How to read:
commands are indented
actions to perform while presenting are marked with @
left to right

# Welcome
see progit.org
what is version control

why use it:

  * backup/restore
  * synchronization sharing
  * track changes
  * ownership
  * branching and merging

who has used subversion 

git
  * you've heard it's distributed
  * b/c branching and merging

pace - slow, no slides

leave with practical understanding

# Install &amp; Config

    sudo port install git-core +svn
    git config --global user.name "Nate Murray"
    git config --global user.email "nate@natemurray.com"

# Basic Commands

    cd ~
    mkdir -p projects/demo       # explain only a little
    cd projects/demo
    git init
    git status                   # nothing here
    ls -a                        # talk .git repository vs. working copy
    echo "version 1" > README.txt
    git status                   # untracked file
    git add README.txt
    git status                   # changes to be committed
    git commit -m "added version one of the file"
    git status                   # clean

stop, draw the picture of the local operation phases - e.g. svn vs. git

> Principle 1: (almost) everything is local

so now that you know about the staging area, let's do it again

    echo "new file" > sheep.rb
    git status                   # draw untracked
    git add sheep.rb
    git status                   # draw staged
    git commit -m "added"

    cat README.txt                 # draw unmodified
    echo "version 2" > README.txt
    git status                   # draw modified
    git commit -a -m "updated version" # -a: stage modified tracked files, then commit
    git status

Tips:

    git config --global alias.st status
    git st

# Git Internals

* Before we can talk about branching you *have* to understand how git works (tried to avoid this)
* files and folders

three objects -  @ Draw first commit

  * blob        - raw data
  * tree        - folder (stores blobs and trees)
  * commit      - snapshot of the repo + meta 

You won't need to use `git cat-file` on a daily basis. however, understanding
the concepts we're going to talk about is really important for branching.

    git log # view the log
    git show ----  # first commit, whatever it is

    git cat-file -p  ---- # first commit
    git cat-file -p  ---- # tree
    git cat-file -p  ---- # blob

draw the rest using `git cat-file`

    git log           # show the log again
    git cat-file -p ---- # second commit

draw the picture. point out the parent connection.
note committer / author

    git cat-file -p ---- # tree

note here there are two blobs!

finish drawing out the second commit
* git stores reference to first file.
* snapshot of the *whole project*
* git stores each file once
* filename is in the `tree` 
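
rough sketch of the idea in python (a toy model, not git's real storage format):
objects live in a dict keyed by a hash of their content, and each snapshot maps
filenames to those keys. `ObjectStore`, `put_blob` and the dict-as-tree are
made-up names for illustration; real git also hashes a small header and
compresses its objects.

```python
import hashlib


class ObjectStore:
    """Toy content-addressed store: the key is a hash of the content."""

    def __init__(self):
        self.objects = {}

    def put_blob(self, content: bytes) -> str:
        # Plain SHA-1 of the content (assumption: real git adds a header first).
        key = hashlib.sha1(content).hexdigest()
        # Identical content always maps to the same key, so it is stored once.
        self.objects[key] = content
        return key


store = ObjectStore()

# First snapshot: one file.
snapshot1 = {"README.txt": store.put_blob(b"version 1\n")}

# Second snapshot: README.txt is unchanged, sheep.rb is new.
snapshot2 = {
    "README.txt": store.put_blob(b"version 1\n"),
    "sheep.rb": store.put_blob(b"new file\n"),
}

# Both snapshots list README.txt, but the store only holds two blobs.
print(len(store.objects))
```

the filename never goes into the blob, only into the snapshot, which is why
the unchanged README costs nothing extra in the second commit.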

draw the last commit

     git log
     git cat-file -p ---- # third commit

> Principle #2 : Git commits are snapshots

* A commit in git is a snapshot of the entire project, not just a list of diffs.
* snapshot is based on the SHA hash function. guarantees file integrity

# refs/branches

questions?

@ stop. redraw commits as *linear* . looking only at commits

ready to define a branch
a branch is a pointer to a commit
text file with a sha. that's it. 

start with one branch called `master`

    git branch

bash prompt

    # skip this
    tree .git/refs/
    cat .git/refs/heads/master
    git log
    # compare the SHAs

update diagram by adding a `ref` to our commit. (`master`). 

@ draw circle pointing to commit

create testing branch

# branching

So let's create another branch:

    git branch testing
    git branch

only created, didn't switch. just created a ref pointing to this
commit

@ update diagram

How does git know what branch we are "on"?

special ref called `HEAD` that points to the local branch
since we are still on master HEAD points to master

@ add HEAD

To switch working copy, use `git checkout`

    git checkout testing
    git branch

HEAD moves from `master` to `testing`

@ update diagram

master and testing point to the same commit, working directory isn't changed

checkout means something different in git than it does in svn.
in git, checkout switches our working directory to a particular commit. 

now make changes:

    cat README.txt
    echo "we are on the testing branch!" > README.txt
    cat README.txt
    git commit -a -m "updated the readme"
    git log

@ update diagram, adding new commit. move the testing ref and the HEAD ref with it
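
the diagram so far can be sketched as a toy model in python, assuming nothing
more than "a branch is a name pointing at a commit, and HEAD names the current
branch". `ToyRepo` and the `c1`, `c2`, ... ids are hypothetical stand-ins for a
real repository and real SHAs:

```python
class ToyRepo:
    """Toy model: branches point at commit ids, HEAD names the current branch."""

    def __init__(self):
        self.parents = {}                  # commit id -> parent commit id (or None)
        self.branches = {"master": None}   # branch name -> commit id
        self.head = "master"               # name of the branch we are "on"
        self._count = 0

    def commit(self) -> str:
        self._count += 1
        commit_id = f"c{self._count}"
        # The new commit's parent is wherever the current branch pointed.
        self.parents[commit_id] = self.branches[self.head]
        # Only the current branch moves forward.
        self.branches[self.head] = commit_id
        return commit_id

    def branch(self, name: str) -> None:
        # Creating a branch just copies a pointer; it does not switch to it.
        self.branches[name] = self.branches[self.head]

    def checkout(self, name: str) -> None:
        if name not in self.branches:
            raise KeyError(f"no such branch: {name}")
        self.head = name


repo = ToyRepo()
for _ in range(3):
    repo.commit()             # three commits on master
repo.branch("testing")        # new ref, same commit as master
repo.checkout("testing")      # HEAD now names testing
repo.commit()                 # testing moves, master stays put
print(repo.branches)
```

nothing is copied when the branch is created, which is the whole reason
branching is cheap.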

add a "test"

    echo "this is a test" > test.rb
    git add test.rb                    # stage it for our commit
    git commit -m "added a test"       # now commit
    git log

@ update diagram - should have two commits

hotfix - scenario: you need to switch back to master

    git checkout master
    ls

@ move HEAD

so notice two things.
1) switching to this branch was fast - everything is local
2) our file test.rb is absent

and if we

    cat README.txt

it says 'version 2' just like we would expect

    echo "applying fix" >> sheep.rb
    cat sheep.rb
    git commit -a -m "applied important fix"
    git log
    git cat-file -p ---- # last commit

@ draw the new commit, and draw its reference back to the parent. move HEAD and master

now fixed, can push into production
and get back to work in `testing`

    git checkout testing
    cat README.txt
    cat test.rb

This is a general pattern:

> Principle #3: Branching is cheap, use it often

If you are working on a particular feature, create a branch. 

If you're coming from svn, making frequent branches might seem unnatural.
in svn, a branch is global -> namespace issues.
vs. git: private branches
name your branch 'test' and it won't collide with anyone else's

But branching itself isn't that useful unless it's easy to merge.

* how many of you have merged a branch in svn?
* how many of you enjoyed it?

merging is one of git's strengths and git makes it relatively easy

# merging

    cat sheep.rb

two branches: `master` and `testing` - need to merge

    git checkout master
    git merge testing
    git show HEAD

instead of a 'parent' we have a line that says 'merge'
a merge commit has more than one parent
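
a sketch for checking the two parents yourself, in a scratch repo (the empty commits stand in for real work; the branch names and messages are only for illustration):

```shell
# scratch repo with two diverged branches
cd "$(mktemp -d)"
git init -q .
git config user.name demo
git config user.email demo@example.com
git commit -q --allow-empty -m "base"
base=$(git symbolic-ref --short HEAD)    # master or main, depending on config
git checkout -q -b testing
git commit -q --allow-empty -m "work on testing"
git checkout -q "$base"
git commit -q --allow-empty -m "hotfix"
git merge -q --no-edit testing           # histories diverged, so git records a merge commit
git cat-file -p HEAD                     # raw commit object: two "parent" lines
git log -1 --pretty=%P                   # the two parent SHAs on one line
```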

@ draw the commit object
@ draw lines to the commits

    gitx

sometimes merging doesn't go as planned - conflicts

    git checkout -b breaker

this is shorthand for creating a new branch based on the current HEAD
and then checking it out

    vi sheep.rb # changing fix
    git commit -a -m "changed the fix"
    git checkout master
    vi sheep.rb # improving fix
    git commit -a -m "improved the fix"

@(update diagram, adding breaker and master refs)

    git merge breaker
    git status

there are many diff viewing tools.
* perforce
* opendiff - from apple

    git mergetool -t opendiff

I don't really like using the visual tools.
Sometimes you need character level editing

    vi sheep.rb
    git add sheep.rb
    git commit -a

talk about merge with conflicts

@ update diagram draw new merge commit

    gitx

Questions?

# Remotes

Everything so far on one machine. 

I work offline (I take the train)
If I break something I can roll back and see where I was an hour ago

want to share our changes.
might seem scary or messy because changes happen on totally independent lines of code.
but in practice it's not a problem.

svn version numbers are incremental - so two repos would get out of
step
no easy way of merging two separate repositories.

git blob identifiers are a SHA of the content.
if the same content is created anywhere in the universe you'll still
have the same SHA
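
to check this yourself, a sketch (assumes git's default SHA-1 object format and that `sha1sum` is installed; on a mac `shasum` is the usual equivalent):

```shell
# same bytes in, same id out; no repository needed
printf 'hello\n' | git hash-object --stdin
printf 'hello\n' | git hash-object --stdin   # identical to the line above
# the id is the SHA-1 of a short header ("blob", the size in bytes, a NUL byte) plus the content
printf 'blob 6\0hello\n' | sha1sum | cut -d' ' -f1
```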

git doesn't care about where your commits come from or how you get them

Protocols:
  * ssh
  * git
  * http
  * local file system

sample project on our github

    cd ..
    open http://XXX/nmurray/simple-echo
    git clone git@XXX:nmurray/simple-echo.git
    cd simple-echo
    git log

svn checkout just HEAD
vs. git - whole repo

To be able to collaborate with others you have to manage 'remote repositories'.
When you clone a project, you have a default remote called 'origin'. 

    git remote -v

Remotes are pointers to other repositories that are _usually_ over the network.
'pull' and 'push' changes.

    vi README.mkd
    # make a change
    git commit -a -m "make a change"
    git push

If someone else makes a change:

    git pull origin master

This means pull the branch `master` from `origin` into the branch I'm on (here, local `master`). You can often just run

    git pull

which means pull from origin whatever branch I'm on (i.e. HEAD) into this branch.
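
with default settings a pull is roughly a fetch followed by a merge (pull can also be configured to rebase instead). a sketch with two scratch repos; `upstream` and `mine` are made-up names and the empty commits stand in for real changes:

```shell
cd "$(mktemp -d)"
git init -q upstream
git -C upstream -c user.name=demo -c user.email=demo@example.com commit -q --allow-empty -m "one"
git clone -q upstream mine               # `origin` in mine now points at upstream
git -C upstream -c user.name=demo -c user.email=demo@example.com commit -q --allow-empty -m "two"
cd mine
branch=$(git symbolic-ref --short HEAD)  # master or main, depending on config
git fetch -q origin                      # downloads "two", only origin/$branch moves
git log -1 --pretty=%s                   # still: one
git merge -q "origin/$branch"            # fast-forwards the local branch
git log -1 --pretty=%s                   # now: two
```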

Now let's say someone pushes a change and I make a change
I can't push unless I pull first. This is good.

# remote forks

So that is while we are on the same line. What if we're on different lines?

@(open up webbrowser again)

Bh has also forked my project. But when we say forked, all that means is he has
created his own development line from some of my commits

    git remote add bh git@XXX:bhenderson/simple-echo.git
    git remote -v

Now you shouldn't be surprised to learn that adding the remote doesn't change
anything. First we have to `fetch` his changes

    git fetch bh

`fetch` brings his commits into my repo but again, doesn't change my working copy.

fetch brought branches + commits into repo
work with those branches just like any other branch.

    git branch -a

So you see here we have 

* `master`, which is our local master
* at the bottom we have `origin/master`, which is the master branch of the origin we pulled from
* and then we have `bh/master`, which is bhenderson's master branch

These are all regular branches: they are just pointers to commits. We
can even check one out like a branch

    git checkout bh/master

scary message

    git checkout master

So how would we merge bhenderson's changes with our own? I'm sure you could guess by now. Simply:

    git merge bh/master # don't press enter!!!

But let's take it up a notch.
say you didn't want to merge bh's changes into your master branch.
in the real world, you might not know if his changes would merge cleanly
and you don't want to mess up your master branch.

What we are going to do is
* create a new branch,
* merge bh's branch into THAT branch
* then we're going to merge to master.

It will make more sense when we do it. Let's try:

Okay we first want to create a new branch based on our master

    git checkout -b bh-merge
    git branch -a 

Now let's merge his changes

    cat simple-echo.rb
    git merge bh/master
    cat simple-echo.rb
    git log                  # see bh as the author of the commit

okay everything was clean! *phew* now let's go back to master

    git checkout master
    git merge bh-merge
    git log

and there we go! merged nicely.
now I don't need bhenderson's merge branch anymore, so let's delete it

    git branch -d bh-merge
    git branch -a

git is distributed

Instead of one central server, that everyone has to sync to,
* independent lines of work can go on.
* If someone creates something good in their branch, they just tell people about it.
* permission-less 

you can see why it is so good for open-source development

questions about branching?

# Advanced

* tagging
* rebase
* cherry pick
* git bisect
* hooks
* tracking branches
* submodules
* interactive staging
* squashing commits
* git-svn
* setting up your own server
* patches via email
* gitjour
</pre>
<p></code></p>
]]></content:encoded>
			<wfw:commentRss>http://eigenjoy.com/2010/09/01/git-cheat-sheet-and-class-notes/feed/</wfw:commentRss>
		<slash:comments>3</slash:comments>
		</item>
		<item>
		<title>a simple netty HTTP server in clojure</title>
		<link>http://eigenjoy.com/2010/07/30/a-simple-netty-http-server-in-clojure/</link>
		<comments>http://eigenjoy.com/2010/07/30/a-simple-netty-http-server-in-clojure/#comments</comments>
		<pubDate>Fri, 30 Jul 2010 22:42:45 +0000</pubDate>
		<dc:creator>Nate Murray</dc:creator>
				<category><![CDATA[programming]]></category>

		<guid isPermaLink="false">http://www.xcombinator.com/?p=237</guid>
		<description><![CDATA[Recently I&#8217;ve been toying with various clojure wrappers around java web servers. My goal is to write a small evented server that can queue up HTTP requests and then kick off some long-running processes. 
So far I&#8217;ve tried aleph, compojure/ring/jetty and saturnine. 
The compojure stack is by far the cleanest, but it&#8217;s geared more
towards synchronous [...]]]></description>
			<content:encoded><![CDATA[<p>Recently I&#8217;ve been toying with various clojure wrappers around java web servers. My goal is to write a small evented server that can queue up HTTP requests and then kick off some long-running processes. </p>
<p>So far I&#8217;ve tried <a href="http://github.com/ztellman/aleph">aleph</a>, <a href="http://mmcgrana.github.com/2010/07/develop-deploy-clojure-web-applications.html">compojure/ring/jetty</a> and <a href="http://github.com/texodus/saturnine">saturnine</a>. </p>
<p>The <code>compojure</code> stack is by far the cleanest, but it&#8217;s geared more<br />
towards synchronous request/response cycles. I&#8217;d like to use<br />
something like EventMachine and <code>saturnine</code> seems the closest to<br />
that goal. However, <code>saturnine</code>&#8217;s current HTTP implementation is<br />
lacking.</p>
<p>Now before I could contribute to <code>saturnine</code> I first needed to understand <a href="http://jboss.org/netty">netty</a>. If you&#8217;re trying to learn netty, I recommend you first read the <a href="http://www.jboss.org/netty/documentation.html">users guide</a> and then jump straight to the API docs on <a href="http://docs.jboss.org/netty/3.2/api/org/jboss/netty/channel/ChannelPipeline.html"><code>ChannelPipeline</code></a>.</p>
<p>Next, I needed to write a basic HTTP server using netty. <a href="http://stackoverflow.com/questions/1735776/server-programming-with-clojure">This post</a> on StackOverflow and <a href="http://docs.jboss.org/netty/3.2/xref/org/jboss/netty/example/http/snoop/package-summary.html">this sample code</a> on netty&#8217;s website helped me get a basic HTTP server up and running. </p>
<p>Below is a naive translation of the netty sample code into clojure. Note that this is <em>not</em> listed here as an example of sexy clojure code, but rather a starting point for someone looking to get dirty with the netty libraries in clojure.  </p>
<pre><code>    (ns xcombinator.netty.server
      (:gen-class)
      (:use clojure.contrib.import-static)
      (:import
         [java.net InetSocketAddress]
         [java.util.concurrent Executors]
         [org.jboss.netty.bootstrap ServerBootstrap]
         [org.jboss.netty.channel Channels ChannelPipelineFactory
           SimpleChannelHandler SimpleChannelUpstreamHandler]
         [org.jboss.netty.channel.socket.nio NioServerSocketChannelFactory]
         [org.jboss.netty.buffer ChannelBuffers]
         [org.jboss.netty.handler.codec.http HttpRequestDecoder
           HttpResponseEncoder DefaultHttpResponse]
    ))

    (import-static org.jboss.netty.handler.codec.http.HttpVersion HTTP_1_1)
    (import-static org.jboss.netty.handler.codec.http.HttpResponseStatus OK)
    (import-static org.jboss.netty.handler.codec.http.HttpHeaders$Names CONTENT_TYPE)

    (declare make-handler)

    (defrecord Server [#^ServerBootstrap bootstrap channel])

    (defn start
      "Start a Netty server. Returns a Server record holding the bootstrap and the bound channel."
      [port handler]
      (let [channel-factory (NioServerSocketChannelFactory.
                              (Executors/newCachedThreadPool)
                              (Executors/newCachedThreadPool))
            bootstrap (ServerBootstrap. channel-factory)
            pipeline (.getPipeline bootstrap)]
        (.addLast pipeline "decoder" (new HttpRequestDecoder))
        (.addLast pipeline "encoder" (new HttpResponseEncoder))
        (.addLast pipeline "handler" (make-handler))
        (.setOption bootstrap "child.tcpNoDelay", true)
        (.setOption bootstrap "child.keepAlive", true)
        (new Server bootstrap (.bind bootstrap (InetSocketAddress. port))))) 

    (defn stop-server
      {:doc "Stops a Server instance"
       :arglists '([server])}
      [{bootstrap :bootstrap channel :channel}]
      (do (.unbind channel)
          (.releaseExternalResources bootstrap)))

    (defn http-response
      [status]
      (doto (DefaultHttpResponse. HTTP_1_1 status)
        (.setHeader CONTENT_TYPE "text/plain; charset=UTF-8")
        (.setContent (ChannelBuffers/copiedBuffer
                       (str "Success: " status) "UTF-8"))))

    (defn make-handler
      "Returns a Netty handler."
      []
      (proxy [SimpleChannelUpstreamHandler] []
        (messageReceived [ctx e]
          (let [c (.getChannel e)
                cb (.getMessage e)
                ]
            (println "HTTP request from" c)
            (.write c (http-response OK))
            (-&gt; e .getChannel .close)))

        (exceptionCaught
          [ctx e]
          (let [throwable (.getCause e)]
            (println "@exceptionCaught" throwable))
          (-&gt; e .getChannel .close))))

    (comment
      (def *server* (start 3335 make-handler))
      (stop-server *server*)
    )
</code></pre>
<p>Here&#8217;s a <code>project.clj</code> that will load up the right dependencies:</p>
<pre><code>    (defproject server "0.0.1"
      :description "Simple Netty HTTP server"
      :repositories [["JBoss" "http://repository.jboss.org/maven2"]]
      :dependencies
        [[org.clojure/clojure "1.2.0-beta1"]
        [org.clojure/clojure-contrib "1.2.0-beta1"]
        [org.jboss.netty/netty "3.2.0.BETA1"]
        [log4j/log4j           "1.2.14"]]
      :dev-dependencies [[autodoc              "0.7.0"]
                         [lein-clojars         "0.5.0-SNAPSHOT"]
                         [lein-run "1.0.0-SNAPSHOT"]
                         [swank-clojure "1.2.1"]]

      :namespaces [xcombinator.netty.server])
</code></pre>
<p>Then <code>lein swank</code> and evaluate the <code>(def *server*...</code> line in your<br />
REPL. You&#8217;ll have a simple netty HTTP server running on port <code>3335</code>.</p>
]]></content:encoded>
			<wfw:commentRss>http://eigenjoy.com/2010/07/30/a-simple-netty-http-server-in-clojure/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>fileutils for clojure</title>
		<link>http://eigenjoy.com/2010/07/28/fileutils-for-clojure/</link>
		<comments>http://eigenjoy.com/2010/07/28/fileutils-for-clojure/#comments</comments>
		<pubDate>Wed, 28 Jul 2010 22:37:08 +0000</pubDate>
		<dc:creator>Nate Murray</dc:creator>
				<category><![CDATA[programming]]></category>

		<guid isPermaLink="false">http://www.xcombinator.com/?p=231</guid>
		<description><![CDATA[I just pushed clj-file-utils to clojars. 
I was looking for an easy way to replicate ruby&#8217;s fileutils library in clojure and I came across a version by Mark McGranaghan in his clj-garden project which wraps around the Apache Commons IO library. 
I extended Mark&#8217;s library by using multimethods to allow the use of strings (rather [...]]]></description>
			<content:encoded><![CDATA[<p>I just pushed <a href="http://clojars.org/clj-file-utils"><code>clj-file-utils</code></a> to clojars. </p>
<p>I was looking for an easy way to replicate ruby&#8217;s <code>fileutils</code> library in clojure and I came across a version by Mark McGranaghan in his <a href="http://github.com/mmcgrana/clj-garden"><code>clj-garden</code></a> project which wraps around the Apache Commons IO library. </p>
<p>I extended Mark&#8217;s library by using <a href="http://clojure.org/multimethods">multimethods</a> to allow the use of strings (rather than requiring File objects) for parameters.</p>
<h2>Usage</h2>
<pre><code>user=&gt; (use 'clj-file-utils.core)
nil
user=&gt; (exist "foo.txt")
false
user=&gt; (touch "foo.txt")
nil
user=&gt; (exist "foo.txt")
true
user=&gt; (rm "foo.txt")
nil
user=&gt; (file "foo.txt")
#&lt;File foo.txt&gt;
user=&gt; (.getParent (file "/path/to/foo.txt"))
"/path/to"
</code></pre>
<h2>As A Dependency</h2>
<p>leiningen</p>
<pre><code>[clj-file-utils "0.1.1"]
</code></pre>
<p>maven</p>
<pre><code>&lt;dependency&gt;
  &lt;groupId&gt;clj-file-utils&lt;/groupId&gt;
  &lt;artifactId&gt;clj-file-utils&lt;/artifactId&gt;
  &lt;version&gt;0.1.1&lt;/version&gt;
&lt;/dependency&gt;
</code></pre>
<h2>Code</h2>
<ul>
<li><a href="http://github.com/jashmenn/clj-file-utils">jashmenn/clj-file-utils on github</a></li>
</ul>
]]></content:encoded>
			<wfw:commentRss>http://eigenjoy.com/2010/07/28/fileutils-for-clojure/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Cascading, TF-IDF, and BufferedSum (Part 2)</title>
		<link>http://eigenjoy.com/2010/05/14/cascading-tf-idf-and-bufferedsum-part-2/</link>
		<comments>http://eigenjoy.com/2010/05/14/cascading-tf-idf-and-bufferedsum-part-2/#comments</comments>
		<pubDate>Fri, 14 May 2010 15:49:55 +0000</pubDate>
		<dc:creator>Nate Murray</dc:creator>
				<category><![CDATA[programming]]></category>

		<guid isPermaLink="false">http://www.xcombinator.com/?p=222</guid>
		<description><![CDATA[Introduction
The tf-idf weight (term frequency-inverse document frequency) is a weight often used in information retrieval and text mining. This weight is a statistical measure used to evaluate how important a word is to a document in a collection or corpus. [1]
To calculate tf-idf we need the following four values:

The number of times a term appears [...]]]></description>
			<content:encoded><![CDATA[<h2>Introduction</h2>
<p>The tf-idf weight (term frequency-inverse document frequency) is a weight often used in information retrieval and text mining. This weight is a statistical measure used to evaluate how important a word is to a document in a collection or corpus. [<a href="http://en.wikipedia.org/wiki/Tf-idf" class="external-link">1</a>]</p>
<p>To calculate tf-idf we need the following four values:</p>
<ul>
<li>The number of times a term appears in a given document (<tt>n<sub>i,j</sub></tt>)</li>
<li>The total number of terms in a given document (<tt>&#8721;<sub>k</sub> n<sub>k,j</sub></tt>)</li>
<li>The number of documents that contain a given term (<tt>|{d : t<sub>i</sub> &#8712; d}|</tt>)</li>
<li>The total number of documents in the corpus (<tt>D</tt>)</li>
</ul>
<h2>Mathematical Details</h2>
<p>We want to score the importance of term <tt>t<sub>i</sub></tt> in document <tt>d<sub>j</sub></tt>.</p>
<p>Term frequency is defined by:</p>
<p><img src="http://www.xcombinator.com/wp-content/uploads/2010/05/tf.png" alt="tf" title="tf" width="119" height="43" class="size-full wp-image-225" /></p>
<p>Where <tt>n<sub>i,j</sub></tt> is the number of occurrences of term <tt>t<sub>i</sub></tt> in document <tt>d<sub>j</sub></tt>.</p>
<p>Inverse document frequency is defined by:</p>
<p><img src="http://www.xcombinator.com/wp-content/uploads/2010/05/idf.png" alt="idf" title="idf" width="189" height="47" class="size-full wp-image-223" /></p>
<p>Where:</p>
<ul>
<li><tt>D</tt> is the total number of documents in the corpus and</li>
<li><tt>|{d : t<sub>i</sub> &#8712; d}|</tt> is the number of documents in which the term <tt>t<sub>i</sub></tt> appears.</li>
</ul>
<p>Then:</p>
<p><img src="http://www.xcombinator.com/wp-content/uploads/2010/05/tf-idf.png" alt="tf-idf" title="tf-idf" width="175" height="22" class="size-full wp-image-224" /></p>
<p>Refer to [<a href="http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.115.8343" class="external-link" >2</a>] for more information on tf-idf.</p>
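<p>Since the formulas above are embedded as images, it may help to restate them in LaTeX, reconstructed from the definitions in the text. The base of the logarithm is an assumption (the natural log is used below), and the counts in the worked example are made up purely for illustration:</p>
<pre><code>\mathrm{tf}_{i,j} = \frac{n_{i,j}}{\sum_k n_{k,j}}
\qquad
\mathrm{idf}_i = \log \frac{D}{|\{d : t_i \in d\}|}
\qquad
\mathrm{tfidf}_{i,j} = \mathrm{tf}_{i,j} \times \mathrm{idf}_i

% hypothetical counts: the term appears 3 times in a 100-term document,
% and in 1,000 of the 10,000,000 documents in the corpus
\mathrm{tf} = \frac{3}{100} = 0.03
\qquad
\mathrm{idf} = \ln \frac{10^7}{10^3} = \ln 10^4 \approx 9.21
\qquad
\mathrm{tfidf} \approx 0.03 \times 9.21 \approx 0.276
</code></pre>
<p>A term that appears in nearly every document has an idf close to zero, so it scores low no matter how often it occurs in a single document.</p>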
<h2>Operation Input</h2>
<p><a href="http://www.xcombinator.com/2009/12/18/cascading-tf-idf-and-bufferedsum-part-1/">Last time</a> we discussed the technique of taking a group of records, calculating a value from that group and emitting each record with the calculated value attached. We called this operation a BufferedSum. We&#8217;re going to build on our previous work and create a reusable component (called a <a href="http://www.cascading.org/javadoc/cascading/pipe/SubAssembly.html" class="external-link">SubAssembly</a>) for calculating tf-idf using Cascading on Hadoop.</p>
<p>To make our tf-idf operation we need to decide what the input arguments will be. Last time, we used an input corpus of the format <tt>(document_id, body)</tt> and emitted <tt>(document_id, term, term_count_in_document)</tt> for all terms in each document. This last tuple will be the input format to our tf-idf operation.</p>
<h2>Creating a SubAssembly</h2>
<p>The general format of a SubAssembly in Cascading is as follows:</p>

<div class="wp_syntax"><div class="code"><pre class="java" style="font-family:monospace;"><span style="color: #000000; font-weight: bold;">public</span> <span style="color: #000000; font-weight: bold;">class</span> MySubAssembly <span style="color: #000000; font-weight: bold;">extends</span> SubAssembly <span style="color: #009900;">&#123;</span>
  <span style="color: #000000; font-weight: bold;">public</span> MySubAssembly<span style="color: #009900;">&#40;</span>Pipe pipe<span style="color: #009900;">&#41;</span> <span style="color: #009900;">&#123;</span>
    <span style="color: #666666; font-style: italic;">// do something with `pipe`</span>
    setTails<span style="color: #009900;">&#40;</span>pipe<span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span> <span style="color: #666666; font-style: italic;">// must register all assembly tails</span>
  <span style="color: #009900;">&#125;</span>
<span style="color: #009900;">&#125;</span></pre></div></div>

<p>In our operation we are assuming that the total number of documents in the corpus is known or could be found with a simple MapReduce job. Given that we have the total number of documents, we take that number as an input to our SubAssembly:</p>

<div class="wp_syntax"><div class="code"><pre class="java" style="font-family:monospace;"><span style="color: #000000; font-weight: bold;">public</span> <span style="color: #000000; font-weight: bold;">class</span> TfIdfIndexSubAssmbly <span style="color: #000000; font-weight: bold;">extends</span> SubAssembly <span style="color: #009900;">&#123;</span>
  <span style="color: #000000; font-weight: bold;">private</span> <span style="color: #003399;">Integer</span> totalNumberOfDocuments<span style="color: #339933;">;</span>
  <span style="color: #000000; font-weight: bold;">public</span> TfIdfIndexSubAssmbly<span style="color: #009900;">&#40;</span>Pipe pipe, <span style="color: #003399;">Integer</span> totalNumberOfDocuments<span style="color: #009900;">&#41;</span> <span style="color: #009900;">&#123;</span>
    <span style="color: #000000; font-weight: bold;">this</span>.<span style="color: #006633;">totalNumberOfDocuments</span> <span style="color: #339933;">=</span> totalNumberOfDocuments<span style="color: #339933;">;</span>
    <span style="color: #666666; font-style: italic;">// do something with pipe</span>
    setTails<span style="color: #009900;">&#40;</span>pipe<span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span> <span style="color: #666666; font-style: italic;">// must register all assembly tails</span>
  <span style="color: #009900;">&#125;</span>
<span style="color: #009900;">&#125;</span></pre></div></div>

<h2>Gathering Variables</h2>
<p>To compute our final tf-idf score, we first need to compute the intermediate variables.</p>
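<p>As a rough sketch of where these variables end up, the following plain-Java snippet (no Cascading involved) combines the four values the same way the custom operation later in this post does: <tt>tf = term_count_in_document / total_terms_in_document</tt> and <tt>idf = ln(total_documents / (1 + number_of_documents_containing_term))</tt>. The class name <tt>TfIdfFormulaSketch</tt> and the sample numbers are made up for illustration.</p>

```java
// Hypothetical helper, not part of the SubAssembly: it only illustrates the arithmetic.
final class TfIdfFormulaSketch {
  // tf: share of the document's terms that are this term
  static double tf(double termCountInDocument, double totalTermsInDocument) {
    return termCountInDocument / totalTermsInDocument;
  }

  // idf: natural log of total documents over (1 + documents containing the term)
  static double idf(double totalDocuments, double documentsContainingTerm) {
    return Math.log(totalDocuments / (1 + documentsContainingTerm));
  }

  static double tfIdf(double termCount, double totalTerms,
                      double totalDocuments, double documentsContainingTerm) {
    return tf(termCount, totalTerms) * idf(totalDocuments, documentsContainingTerm);
  }

  public static void main(String[] args) {
    // made-up sample: the term occurs 3 times in a 100-term document
    // and appears in 9 of 1000 documents
    System.out.println("tf    = " + tf(3, 100));
    System.out.println("idf   = " + idf(1000, 9));
    System.out.println("tfidf = " + tfIdf(3, 100, 1000, 9));
  }
}
```

<p>With these sample numbers the idf term is <tt>ln(1000 / 10) = ln(100)</tt>, so a term that is rare across the corpus gets a larger weight than one that appears in most documents.</p>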
<h3><tt>total_terms_in_document</tt></h3>
<p>Since our input is <tt>(document_id, term, term_count_in_document)</tt>, we already have the first variable, <tt>n<sub>i,j</sub></tt> (the count of a term in a document). We can now calculate the total number of terms in each document:</p>

<div class="wp_syntax"><div class="code"><pre class="java" style="font-family:monospace;"><span style="color: #666666; font-style: italic;">// input: (document_id, term, term_count_in_document)</span>
<span style="color: #666666; font-style: italic;">// emits: (document_id, term, term_count_in_document, total_terms_in_document)</span>
pipe <span style="color: #339933;">=</span> <span style="color: #000000; font-weight: bold;">new</span> GroupBy<span style="color: #009900;">&#40;</span>pipe, <span style="color: #000000; font-weight: bold;">new</span> Fields<span style="color: #009900;">&#40;</span><span style="color: #0000ff;">&quot;document_id&quot;</span><span style="color: #009900;">&#41;</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
pipe <span style="color: #339933;">=</span> <span style="color: #000000; font-weight: bold;">new</span> Every<span style="color: #009900;">&#40;</span>pipe,
    <span style="color: #000000; font-weight: bold;">new</span> BufferedSum<span style="color: #009900;">&#40;</span><span style="color: #000000; font-weight: bold;">new</span> Fields<span style="color: #009900;">&#40;</span><span style="color: #0000ff;">&quot;total_terms_in_document&quot;</span><span style="color: #009900;">&#41;</span>,
                   <span style="color: #000000; font-weight: bold;">new</span> Fields<span style="color: #009900;">&#40;</span><span style="color: #0000ff;">&quot;term_count_in_document&quot;</span><span style="color: #009900;">&#41;</span>,
                   <span style="color: #000000; font-weight: bold;">new</span> Fields<span style="color: #009900;">&#40;</span><span style="color: #0000ff;">&quot;document_id&quot;</span>, <span style="color: #0000ff;">&quot;term&quot;</span>, <span style="color: #0000ff;">&quot;term_count_in_document&quot;</span><span style="color: #009900;">&#41;</span><span style="color: #009900;">&#41;</span>,
    Fields.<span style="color: #006633;">SWAP</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span></pre></div></div>

<p>Remember that <tt>BufferedSum</tt> takes three arguments:</p>
<ul>
<li>The name of the <tt>Field</tt> to output</li>
<li>The name of the <tt>Field</tt> to sum</li>
<li>The other <tt>Fields</tt> to &#8220;pull through&#8221; the operation</li>
</ul>
<p>So here we are grouping by <tt>document_id</tt>, and summing <tt>term_count_in_document</tt> for each group and placing the value into the field <tt>total_terms_in_document</tt>.</p>
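<p>If the grouping step is hard to picture, here is a minimal plain-Java sketch of what it produces. It does not use Cascading and says nothing about how <tt>BufferedSum</tt> is implemented. It only shows that every input row is kept and gains the sum for its <tt>document_id</tt> group. The class <tt>TotalTermsSketch</tt>, its <tt>Row</tt> type and the sample rows are hypothetical.</p>

```java
// Plain-Java stand-in for the GroupBy + BufferedSum step; names here are hypothetical.
final class TotalTermsSketch {
  // one (document_id, term, term_count_in_document) row
  static final class Row {
    final String documentId;
    final String term;
    final int termCount;

    Row(String documentId, String term, int termCount) {
      this.documentId = documentId;
      this.term = term;
      this.termCount = termCount;
    }
  }

  // sums term_count_in_document over every row of one document_id group
  static int totalTermsInDocument(Row[] rows, String documentId) {
    int total = 0;
    for (Row row : rows) {
      if (row.documentId.equals(documentId)) {
        total += row.termCount;
      }
    }
    return total;
  }

  // made-up sample input
  static Row[] sampleRows() {
    return new Row[] {
      new Row("doc1", "cat", 2),
      new Row("doc1", "dog", 1),
      new Row("doc2", "dog", 3)
    };
  }

  public static void main(String[] args) {
    Row[] rows = sampleRows();
    // every input row is kept and gains the total for its document
    for (Row row : rows) {
      System.out.println("(" + row.documentId + ", " + row.term + ", " + row.termCount
          + ", " + totalTermsInDocument(rows, row.documentId) + ")");
    }
  }
}
```

<p>The sketch rescans all rows for each lookup, which is fine for three rows. In the real job the grouping is what brings the rows of one document together.</p>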
<h3><tt>number_of_documents_containing_term</tt></h3>
<p>Next we need to calculate the number of documents that contain each term. Our input was already aggregated by <tt>document_id</tt> and <tt>term</tt>, so we know there is only one record for a given <tt>document_id</tt>/<tt>term</tt> pair.</p>
<p>Rather than counting the number of <tt>document_id</tt>/<tt>term</tt> pairs directly we are simply going to assign a count of 1 to each record and then sum that value. This allows us to reuse the code we&#8217;ve written for <tt>BufferedSum</tt>.</p>

<div class="wp_syntax"><div class="code"><pre class="java" style="font-family:monospace;"><span style="color: #666666; font-style: italic;">// calculate the number of documents containing each term</span>
<span style="color: #666666; font-style: italic;">// input: (document_id, term, term_count_in_document, total_terms_in_document)</span>
<span style="color: #666666; font-style: italic;">// emit:  (document_id, term, term_count_in_document, total_terms_in_document, number_of_documents_containing_term)</span>
pipe <span style="color: #339933;">=</span> <span style="color: #000000; font-weight: bold;">new</span> Each<span style="color: #009900;">&#40;</span>pipe, <span style="color: #000000; font-weight: bold;">new</span> Insert<span style="color: #009900;">&#40;</span><span style="color: #000000; font-weight: bold;">new</span> Fields<span style="color: #009900;">&#40;</span><span style="color: #0000ff;">&quot;term_in_doc&quot;</span><span style="color: #009900;">&#41;</span>, <span style="color: #cc66cc;">1</span><span style="color: #009900;">&#41;</span>, Fields.<span style="color: #006633;">ALL</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span> <span style="color: #666666; font-style: italic;">// we're going to sum these, easier than creating BufferedCount</span>
pipe <span style="color: #339933;">=</span> <span style="color: #000000; font-weight: bold;">new</span> GroupBy<span style="color: #009900;">&#40;</span>pipe, <span style="color: #000000; font-weight: bold;">new</span> Fields<span style="color: #009900;">&#40;</span><span style="color: #0000ff;">&quot;term&quot;</span><span style="color: #009900;">&#41;</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
pipe <span style="color: #339933;">=</span> <span style="color: #000000; font-weight: bold;">new</span> Every<span style="color: #009900;">&#40;</span>pipe,
    <span style="color: #000000; font-weight: bold;">new</span> BufferedSum<span style="color: #009900;">&#40;</span><span style="color: #000000; font-weight: bold;">new</span> Fields<span style="color: #009900;">&#40;</span><span style="color: #0000ff;">&quot;number_of_documents_containing_term&quot;</span><span style="color: #009900;">&#41;</span>,
                   <span style="color: #000000; font-weight: bold;">new</span> Fields<span style="color: #009900;">&#40;</span><span style="color: #0000ff;">&quot;term_in_doc&quot;</span><span style="color: #009900;">&#41;</span>,
                   <span style="color: #000000; font-weight: bold;">new</span> Fields<span style="color: #009900;">&#40;</span><span style="color: #0000ff;">&quot;document_id&quot;</span>, <span style="color: #0000ff;">&quot;term&quot;</span>, <span style="color: #0000ff;">&quot;term_count_in_document&quot;</span>, <span style="color: #0000ff;">&quot;total_terms_in_document&quot;</span>, <span style="color: #0000ff;">&quot;term_in_doc&quot;</span><span style="color: #009900;">&#41;</span><span style="color: #009900;">&#41;</span>,
    Fields.<span style="color: #006633;">SWAP</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span></pre></div></div>

<p>Here we group on <tt>term</tt> and, for every term group, calculate the number of documents that contain that term. Note that with a very large corpus some groups may become memory constrained: very common words such as &#8220;the&#8221; form groups containing nearly the entire corpus. It is a good idea to remove stop-words during pre-processing.</p>
<p>After we&#8217;ve calculated the value for <tt>number_of_documents_containing_term</tt> we don&#8217;t need the <tt>term_in_doc</tt> field any longer. Using Cascading&#8217;s <tt>Identity</tt> operation we can reorder and discard fields.</p>

<div class="wp_syntax"><div class="code"><pre class="java" style="font-family:monospace;">pipe <span style="color: #339933;">=</span> <span style="color: #000000; font-weight: bold;">new</span> Each<span style="color: #009900;">&#40;</span>pipe, <span style="color: #666666; font-style: italic;">// reorder and remove fields</span>
    <span style="color: #000000; font-weight: bold;">new</span> Fields<span style="color: #009900;">&#40;</span><span style="color: #0000ff;">&quot;document_id&quot;</span>, <span style="color: #0000ff;">&quot;term&quot;</span>, <span style="color: #0000ff;">&quot;term_count_in_document&quot;</span>, <span style="color: #0000ff;">&quot;total_terms_in_document&quot;</span>, <span style="color: #0000ff;">&quot;number_of_documents_containing_term&quot;</span><span style="color: #009900;">&#41;</span>,
    <span style="color: #000000; font-weight: bold;">new</span> <span style="color: #003399;">Identity</span><span style="color: #009900;">&#40;</span><span style="color: #009900;">&#41;</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span></pre></div></div>

<h3><tt>total_documents</tt></h3>
<p>Next we do a hard insert of the number of documents. Again, you can calculate this value with a relatively simple MapReduce job (e.g. using a counter), but this is probably not the best place to do it. Because you have to count every document in the corpus, it is better to calculate the number of documents when you generate the input documents file.</p>

<div class="wp_syntax"><div class="code"><pre class="java" style="font-family:monospace;"><span style="color: #000066; font-weight: bold;">int</span> D <span style="color: #339933;">=</span> <span style="color: #000000; font-weight: bold;">this</span>.<span style="color: #006633;">totalNumberOfDocuments</span><span style="color: #339933;">;</span>
pipe <span style="color: #339933;">=</span> <span style="color: #000000; font-weight: bold;">new</span> Each<span style="color: #009900;">&#40;</span>pipe, <span style="color: #000000; font-weight: bold;">new</span> Insert<span style="color: #009900;">&#40;</span><span style="color: #000000; font-weight: bold;">new</span> Fields<span style="color: #009900;">&#40;</span><span style="color: #0000ff;">&quot;total_documents&quot;</span><span style="color: #009900;">&#41;</span>, D<span style="color: #009900;">&#41;</span>, Fields.<span style="color: #006633;">ALL</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span></pre></div></div>

<p>The <tt>Insert</tt> operation simply inserts the number of documents into the tuple stream.</p>
<h3>Calculating Tf-idf with a Custom Operation</h3>
<p>Now that we have all four values we can calculate tf-idf. We are going to create a custom operation to do this. We will perform this operation on each <tt>Tuple</tt> in the <tt>Pipe</tt> with the following:</p>

<div class="wp_syntax"><div class="code"><pre class="java" style="font-family:monospace;"><span style="color: #666666; font-style: italic;">// calculate tf * idf</span>
<span style="color: #666666; font-style: italic;">// input:  (document_id, term, term_count_in_document, total_terms_in_document, number_of_documents_containing_term, total_documents)</span>
<span style="color: #666666; font-style: italic;">// emit:   (document_id, term, term_count_in_document, total_terms_in_document, number_of_documents_containing_term, total_documents, tf, idf, tfidf)</span>
pipe <span style="color: #339933;">=</span> <span style="color: #000000; font-weight: bold;">new</span> Each<span style="color: #009900;">&#40;</span>pipe, <span style="color: #000000; font-weight: bold;">new</span> TfIdfOperation<span style="color: #009900;">&#40;</span><span style="color: #009900;">&#41;</span>, Fields.<span style="color: #006633;">ALL</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span></pre></div></div>

<p>To create an operation in Cascading you simply subclass <tt>BaseOperation</tt>. For example:</p>

<div class="wp_syntax"><div class="code"><pre class="java" style="font-family:monospace;"><span style="color: #000000; font-weight: bold;">private</span> <span style="color: #000000; font-weight: bold;">static</span> <span style="color: #000000; font-weight: bold;">class</span> MyOperation <span style="color: #000000; font-weight: bold;">extends</span> BaseOperation <span style="color: #000000; font-weight: bold;">implements</span> Function <span style="color: #009900;">&#123;</span>
  <span style="color: #000000; font-weight: bold;">public</span> MyOperation<span style="color: #009900;">&#40;</span><span style="color: #009900;">&#41;</span> <span style="color: #009900;">&#123;</span>
    <span style="color: #000000; font-weight: bold;">super</span><span style="color: #009900;">&#40;</span><span style="color: #000000; font-weight: bold;">new</span> Fields<span style="color: #009900;">&#40;</span><span style="color: #0000ff;">&quot;out_field_1&quot;</span>, <span style="color: #0000ff;">&quot;out_field_2&quot;</span><span style="color: #009900;">&#41;</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
  <span style="color: #009900;">&#125;</span>
  <span style="color: #000000; font-weight: bold;">public</span> <span style="color: #000066; font-weight: bold;">void</span> operate<span style="color: #009900;">&#40;</span>FlowProcess flowProcess, FunctionCall functionCall<span style="color: #009900;">&#41;</span>
  <span style="color: #009900;">&#123;</span>
    TupleEntry inputTuple <span style="color: #339933;">=</span> functionCall.<span style="color: #006633;">getArguments</span><span style="color: #009900;">&#40;</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
    <span style="color: #666666; font-style: italic;">// take values from inputTuple</span>
    <span style="color: #666666; font-style: italic;">// transform them to make outputTuple</span>
    Tuple outputTuple <span style="color: #339933;">=</span> <span style="color: #000000; font-weight: bold;">new</span> Tuple<span style="color: #009900;">&#40;</span><span style="color: #0000ff;">&quot;a value 1&quot;</span>, <span style="color: #0000ff;">&quot;a value 2&quot;</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
    functionCall.<span style="color: #006633;">getOutputCollector</span><span style="color: #009900;">&#40;</span><span style="color: #009900;">&#41;</span>.<span style="color: #006633;">add</span><span style="color: #009900;">&#40;</span>outputTuple<span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
  <span style="color: #009900;">&#125;</span>
<span style="color: #009900;">&#125;</span></pre></div></div>

<p>Here are the key things you need to do to create a Cascading operation:</p>
<ul>
<li>subclass <tt>BaseOperation</tt> and implement <tt>Function</tt> (there are other types of operations)</li>
<li>call <tt>super</tt> and declare the names of the <tt>Fields</tt> this operation will emit. See <a href="http://www.cascading.org/javadoc/cascading/operation/BaseOperation.html">BaseOperation</a> for details</li>
<li><tt>functionCall.getArguments()</tt> returns a <a href="http://www.cascading.org/javadoc/cascading/tuple/TupleEntry.html">TupleEntry</a> containing the input <tt>Tuple</tt> and input <tt>Fields</tt>.</li>
<li><tt>functionCall.getOutputCollector()</tt> returns the collector you use to emit <tt>Tuples</tt> from this operation.</li>
<li>call <tt>add()</tt> on that collector to emit a <tt>Tuple</tt>. You can emit 0..n <tt>Tuples</tt>.</li>
</ul>
<p>For our tf-idf operation we want to emit three Fields: <tt>tf</tt>, <tt>idf</tt>, and <tt>tfidf</tt>. We do this with the following code.</p>

<div class="wp_syntax"><div class="code"><pre class="java" style="font-family:monospace;"><span style="color: #666666; font-style: italic;">// note that we're using a private nested class. This is not required.</span>
<span style="color: #000000; font-weight: bold;">private</span> <span style="color: #000000; font-weight: bold;">static</span> <span style="color: #000000; font-weight: bold;">class</span> TfIdfOperation <span style="color: #000000; font-weight: bold;">extends</span> BaseOperation <span style="color: #000000; font-weight: bold;">implements</span> Function <span style="color: #009900;">&#123;</span>
  <span style="color: #000000; font-weight: bold;">public</span> TfIdfOperation<span style="color: #009900;">&#40;</span><span style="color: #009900;">&#41;</span> <span style="color: #009900;">&#123;</span> <span style="color: #000000; font-weight: bold;">super</span><span style="color: #009900;">&#40;</span><span style="color: #000000; font-weight: bold;">new</span> Fields<span style="color: #009900;">&#40;</span><span style="color: #0000ff;">&quot;tf&quot;</span>, <span style="color: #0000ff;">&quot;idf&quot;</span>, <span style="color: #0000ff;">&quot;tfidf&quot;</span><span style="color: #009900;">&#41;</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span> <span style="color: #009900;">&#125;</span>
  <span style="color: #000000; font-weight: bold;">public</span> <span style="color: #000066; font-weight: bold;">void</span> operate<span style="color: #009900;">&#40;</span>FlowProcess flowProcess, FunctionCall functionCall<span style="color: #009900;">&#41;</span>
  <span style="color: #009900;">&#123;</span>
    TupleEntry arguments <span style="color: #339933;">=</span> functionCall.<span style="color: #006633;">getArguments</span><span style="color: #009900;">&#40;</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
    <span style="color: #666666; font-style: italic;">// tf</span>
    <span style="color: #003399;">Double</span> termCount  <span style="color: #339933;">=</span> arguments.<span style="color: #006633;">getDouble</span><span style="color: #009900;">&#40;</span><span style="color: #0000ff;">&quot;term_count_in_document&quot;</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
    <span style="color: #003399;">Double</span> totalTerms <span style="color: #339933;">=</span> arguments.<span style="color: #006633;">getDouble</span><span style="color: #009900;">&#40;</span><span style="color: #0000ff;">&quot;total_terms_in_document&quot;</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
    <span style="color: #003399;">BigDecimal</span> tf <span style="color: #339933;">=</span> <span style="color: #000000; font-weight: bold;">new</span> <span style="color: #003399;">BigDecimal</span><span style="color: #009900;">&#40;</span>termCount <span style="color: #339933;">/</span> totalTerms<span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
    <span style="color: #666666; font-style: italic;">// idf</span>
    <span style="color: #003399;">Double</span> totalDocuments  <span style="color: #339933;">=</span> arguments.<span style="color: #006633;">getDouble</span><span style="color: #009900;">&#40;</span><span style="color: #0000ff;">&quot;total_documents&quot;</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
    <span style="color: #003399;">Double</span> td <span style="color: #339933;">=</span> arguments.<span style="color: #006633;">getDouble</span><span style="color: #009900;">&#40;</span><span style="color: #0000ff;">&quot;number_of_documents_containing_term&quot;</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
    <span style="color: #003399;">BigDecimal</span> idf <span style="color: #339933;">=</span> <span style="color: #000000; font-weight: bold;">new</span> <span style="color: #003399;">BigDecimal</span><span style="color: #009900;">&#40;</span><span style="color: #003399;">Math</span>.<span style="color: #006633;">log</span><span style="color: #009900;">&#40;</span>totalDocuments <span style="color: #339933;">/</span> <span style="color: #009900;">&#40;</span><span style="color: #cc66cc;">1</span> <span style="color: #339933;">+</span> td<span style="color: #009900;">&#41;</span><span style="color: #009900;">&#41;</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
    <span style="color: #666666; font-style: italic;">// tfidf</span>
    <span style="color: #003399;">BigDecimal</span> tfidf <span style="color: #339933;">=</span> tf.<span style="color: #006633;">multiply</span><span style="color: #009900;">&#40;</span>idf<span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
    functionCall.<span style="color: #006633;">getOutputCollector</span><span style="color: #009900;">&#40;</span><span style="color: #009900;">&#41;</span>.<span style="color: #006633;">add</span><span style="color: #009900;">&#40;</span>
        <span style="color: #000000; font-weight: bold;">new</span> Tuple<span style="color: #009900;">&#40;</span> tf.<span style="color: #006633;">toPlainString</span><span style="color: #009900;">&#40;</span><span style="color: #009900;">&#41;</span>, idf.<span style="color: #006633;">toPlainString</span><span style="color: #009900;">&#40;</span><span style="color: #009900;">&#41;</span>, tfidf.<span style="color: #006633;">toPlainString</span><span style="color: #009900;">&#40;</span><span style="color: #009900;">&#41;</span> <span style="color: #009900;">&#41;</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
  <span style="color: #009900;">&#125;</span>
<span style="color: #009900;">&#125;</span></pre></div></div>

<p>When we convert a very small or very large <tt>Double</tt> to a string, we get an exponent (e.g. <tt>2.948E-5</tt>). The exponent can be cumbersome to work with, so we use <tt>BigDecimal</tt> to convert the number into a string without an exponent using <tt>toPlainString()</tt>.</p>
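<p>To see the difference, here is a small standalone sketch using only the JDK. The class name <tt>PlainStringSketch</tt> and the sample value are made up. It prints the same number once through <tt>Double.toString()</tt> and once through <tt>BigDecimal.toPlainString()</tt>.</p>

```java
import java.math.BigDecimal;

// Hypothetical demo class; it only uses java.math.BigDecimal from the JDK.
final class PlainStringSketch {
  // same conversion the operation uses: double -> BigDecimal -> plain string
  static String plain(double value) {
    return new BigDecimal(value).toPlainString();
  }

  public static void main(String[] args) {
    double small = 0.00002948; // made-up value below 10^-3
    System.out.println(Double.toString(small)); // uses an exponent
    System.out.println(plain(small));           // no exponent, but many digits
  }
}
```

<p>Note that <tt>new BigDecimal(double)</tt> captures the exact binary value of the double, so the plain string can be quite long. If shorter output matters, <tt>BigDecimal.valueOf(double)</tt> is one alternative worth trying, since it starts from the double's shortest decimal representation.</p>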
<h2>Building the Index</h2>
<p>Now that we have the <tt>term</tt>, <tt>document_id</tt>, and <tt>tfidf</tt> score we can build our index. First we strip out the unnecessary fields and move the <tt>term</tt> to the front.</p>

<div class="wp_syntax"><div class="code"><pre class="java" style="font-family:monospace;"><span style="color: #666666; font-style: italic;">// input:  (document_id, term, term_count_in_document, total_terms_in_document, number_of_documents_containing_term, total_documents, tf, idf, tfidf)</span>
<span style="color: #666666; font-style: italic;">// emit:   (term, document_id, tfidf)</span>
pipe <span style="color: #339933;">=</span> <span style="color: #000000; font-weight: bold;">new</span> Each<span style="color: #009900;">&#40;</span>pipe, <span style="color: #666666; font-style: italic;">// reorder and rm some fields</span>
    <span style="color: #000000; font-weight: bold;">new</span> Fields<span style="color: #009900;">&#40;</span><span style="color: #0000ff;">&quot;term&quot;</span>, <span style="color: #0000ff;">&quot;document_id&quot;</span>, <span style="color: #0000ff;">&quot;tfidf&quot;</span><span style="color: #009900;">&#41;</span>,
    <span style="color: #000000; font-weight: bold;">new</span> <span style="color: #003399;">Identity</span><span style="color: #009900;">&#40;</span><span style="color: #009900;">&#41;</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span></pre></div></div>

<p>In our last step we want to build a single row that tells us &#8220;given a term, what documents are most relevant to it&#8221;. We format this list of <tt>(document_id, score)</tt> pairs as a JSON hash (<tt>JSONObject</tt>). (Formatting our records in this way is called using &#8220;stripes&#8221;.)</p>

<div class="wp_syntax"><div class="code"><pre class="java" style="font-family:monospace;">pipe <span style="color: #339933;">=</span> <span style="color: #000000; font-weight: bold;">new</span> GroupBy<span style="color: #009900;">&#40;</span>pipe, <span style="color: #000000; font-weight: bold;">new</span> Fields<span style="color: #009900;">&#40;</span><span style="color: #0000ff;">&quot;term&quot;</span><span style="color: #009900;">&#41;</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
<span style="color: #666666; font-style: italic;">// &quot;stripe&quot; our group e.g.:</span>
<span style="color: #666666; font-style: italic;">// input: (term,</span>
<span style="color: #666666; font-style: italic;">//          (document_id_1, tfidf_1),</span>
<span style="color: #666666; font-style: italic;">//          (document_id_2, tfidf_2),</span>
<span style="color: #666666; font-style: italic;">//          ...)</span>
<span style="color: #666666; font-style: italic;">// emit: (term, {document_id_1:tfidf_1, document_id_2:tfidf_2})</span>
pipe <span style="color: #339933;">=</span> <span style="color: #000000; font-weight: bold;">new</span> Every<span style="color: #009900;">&#40;</span>pipe,
    <span style="color: #000000; font-weight: bold;">new</span> Fields<span style="color: #009900;">&#40;</span><span style="color: #0000ff;">&quot;document_id&quot;</span>, <span style="color: #0000ff;">&quot;tfidf&quot;</span><span style="color: #009900;">&#41;</span>,
    <span style="color: #000000; font-weight: bold;">new</span> JSONTupleAggregator<span style="color: #009900;">&#40;</span><span style="color: #000000; font-weight: bold;">new</span> Fields<span style="color: #009900;">&#40;</span><span style="color: #0000ff;">&quot;scores&quot;</span><span style="color: #009900;">&#41;</span>, <span style="color: #0000ff;">&quot;JSONObject&quot;</span><span style="color: #009900;">&#41;</span>,
    <span style="color: #000000; font-weight: bold;">new</span> Fields<span style="color: #009900;">&#40;</span><span style="color: #0000ff;">&quot;term&quot;</span>, <span style="color: #0000ff;">&quot;scores&quot;</span><span style="color: #009900;">&#41;</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span></pre></div></div>

<p><tt>JSONTupleAggregator</tt> is an operation that can be found in the <a href="http://github.com/jashmenn/cascading.json/blob/master/src/main/java/cascading/json/operation/aggregator/JSONTupleAggregator.java" class="external-link">cascading.json</a> project. It takes a group of tuples and emits them as either a <tt>JSONArray</tt> (nested list) or <tt>JSONObject</tt> (hash).</p>
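<p>To make the &#8220;stripe&#8221; idea concrete, here is a minimal plain-Java sketch that folds the <tt>(document_id, tfidf)</tt> pairs of one term group into a single JSON-style hash. It is not the <tt>JSONTupleAggregator</tt> implementation and its exact output format may differ. The class <tt>StripeSketch</tt> is hypothetical, and it assumes the ids and scores contain no characters that need JSON escaping.</p>

```java
// Hypothetical stand-in for the striping step; not the cascading.json code.
final class StripeSketch {
  // Each pair is {document_id, tfidf}. Scores stay strings, matching the
  // toPlainString() output of the tf-idf operation.
  static String stripe(String[][] pairs) {
    StringBuilder json = new StringBuilder("{");
    String separator = "";
    for (String[] pair : pairs) {
      json.append(separator)
          .append('"').append(pair[0]).append("\":\"").append(pair[1]).append('"');
      separator = ",";
    }
    return json.append("}").toString();
  }

  public static void main(String[] args) {
    // made-up group for a single term
    String[][] group = {
      {"doc1", "0.12"},
      {"doc2", "0.03"}
    };
    System.out.println(stripe(group));
  }
}
```

<p>Each term then carries all of its documents in one row, which is what lets a single lookup on <tt>term</tt> return every score at once.</p>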
<h2>Using the Index</h2>
<p>To use the index, perform a <tt>CoGroup</tt> that joins the term field of your other pipe against the <tt>term</tt> field of our index. See Cascading&#8217;s documentation on <a href="http://www.cascading.org/javadoc/cascading/pipe/CoGroup.html" class="external-link" >CoGroup</a> for more information.</p>
<h2>Full Code Listing</h2>

<div class="wp_syntax"><div class="code"><pre class="java" style="font-family:monospace;"><span style="color: #000000; font-weight: bold;">package</span> <span style="color: #006699;">com.xcombinator.cascading.pipes</span><span style="color: #339933;">;</span>
<span style="color: #000000; font-weight: bold;">import</span> <span style="color: #006699;">com.xcombinator.cascading.operations.*</span><span style="color: #339933;">;</span>
<span style="color: #000000; font-weight: bold;">import</span> <span style="color: #006699;">com.xcombinator.cascading.operations.buffers.*</span><span style="color: #339933;">;</span>
<span style="color: #000000; font-weight: bold;">import</span> <span style="color: #006699;">cascading.tuple.Fields</span><span style="color: #339933;">;</span>
<span style="color: #000000; font-weight: bold;">import</span> <span style="color: #006699;">cascading.pipe.*</span><span style="color: #339933;">;</span>
<span style="color: #000000; font-weight: bold;">import</span> <span style="color: #006699;">cascading.operation.regex.RegexSplitter</span><span style="color: #339933;">;</span>
<span style="color: #000000; font-weight: bold;">import</span> <span style="color: #006699;">cascading.operation.Identity</span><span style="color: #339933;">;</span>
<span style="color: #000000; font-weight: bold;">import</span> <span style="color: #006699;">cascading.operation.text.DateParser</span><span style="color: #339933;">;</span>
<span style="color: #000000; font-weight: bold;">import</span> <span style="color: #006699;">cascading.operation.Insert</span><span style="color: #339933;">;</span>
<span style="color: #000000; font-weight: bold;">import</span> <span style="color: #006699;">cascading.flow.FlowProcess</span><span style="color: #339933;">;</span>
<span style="color: #000000; font-weight: bold;">import</span> <span style="color: #006699;">cascading.tuple.TupleEntry</span><span style="color: #339933;">;</span>
<span style="color: #000000; font-weight: bold;">import</span> <span style="color: #006699;">cascading.operation.BaseOperation</span><span style="color: #339933;">;</span>
<span style="color: #000000; font-weight: bold;">import</span> <span style="color: #006699;">cascading.operation.Function</span><span style="color: #339933;">;</span>
<span style="color: #000000; font-weight: bold;">import</span> <span style="color: #006699;">cascading.operation.FunctionCall</span><span style="color: #339933;">;</span>
<span style="color: #000000; font-weight: bold;">import</span> <span style="color: #006699;">cascading.pipe.Each</span><span style="color: #339933;">;</span>
<span style="color: #000000; font-weight: bold;">import</span> <span style="color: #006699;">cascading.pipe.Every</span><span style="color: #339933;">;</span>
<span style="color: #000000; font-weight: bold;">import</span> <span style="color: #006699;">cascading.pipe.GroupBy</span><span style="color: #339933;">;</span>
<span style="color: #000000; font-weight: bold;">import</span> <span style="color: #006699;">cascading.tuple.Tuple</span><span style="color: #339933;">;</span>
<span style="color: #000000; font-weight: bold;">import</span> <span style="color: #006699;">cascading.operation.Debug</span><span style="color: #339933;">;</span>
<span style="color: #000000; font-weight: bold;">import</span> <span style="color: #006699;">cascading.json.operation.aggregator.*</span><span style="color: #339933;">;</span>
<span style="color: #000000; font-weight: bold;">import</span> <span style="color: #006699;">java.math.BigDecimal</span><span style="color: #339933;">;</span>
<span style="color: #008000; font-style: italic; font-weight: bold;">/**
 *
 * The goal of this SubAssembly is to create an index that can be used to find
 * the most relevant document given a term.
 * Required: you need to input the number of total documents.
 * Input: list of (document_id, term, count). e.g. :
 *
 *   (document_id,  term,  2)
 *   (document_id,  term2, 1)
 *   (document_id2, term2, 3)
 *   # etc
 *
 * where `count` is the number of times that term appears in that document
 *
 *
 * Emits: list of (term: {document_id_1:score_1,document_id_2:score_2...})
 *
 * Note that the numbers in the emitted tuple are all String representations
 * of decimal numbers. This is to allow easy (and correct) parsing if you write
 * to a file after this.
 *
 * It is assumed you've already done any normalization of the terms such as stemming etc.
 */</span>
<span style="color: #000000; font-weight: bold;">public</span> <span style="color: #000000; font-weight: bold;">class</span> TfIdfIndexSubAssmbly <span style="color: #000000; font-weight: bold;">extends</span> SubAssembly <span style="color: #009900;">&#123;</span>
  <span style="color: #000000; font-weight: bold;">private</span> <span style="color: #000000; font-weight: bold;">static</span> <span style="color: #000000; font-weight: bold;">class</span> TfIdfOperation <span style="color: #000000; font-weight: bold;">extends</span> BaseOperation <span style="color: #000000; font-weight: bold;">implements</span> Function <span style="color: #009900;">&#123;</span>
    <span style="color: #000000; font-weight: bold;">public</span> TfIdfOperation<span style="color: #009900;">&#40;</span><span style="color: #009900;">&#41;</span> <span style="color: #009900;">&#123;</span> <span style="color: #000000; font-weight: bold;">super</span><span style="color: #009900;">&#40;</span><span style="color: #000000; font-weight: bold;">new</span> Fields<span style="color: #009900;">&#40;</span><span style="color: #0000ff;">&quot;tf&quot;</span>, <span style="color: #0000ff;">&quot;idf&quot;</span>, <span style="color: #0000ff;">&quot;tfidf&quot;</span><span style="color: #009900;">&#41;</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span> <span style="color: #009900;">&#125;</span>
    <span style="color: #000000; font-weight: bold;">public</span> <span style="color: #000066; font-weight: bold;">void</span> operate<span style="color: #009900;">&#40;</span>FlowProcess flowProcess, FunctionCall functionCall<span style="color: #009900;">&#41;</span>
    <span style="color: #009900;">&#123;</span>
      TupleEntry arguments <span style="color: #339933;">=</span> functionCall.<span style="color: #006633;">getArguments</span><span style="color: #009900;">&#40;</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
      <span style="color: #666666; font-style: italic;">// tf</span>
      <span style="color: #003399;">Double</span> termCount  <span style="color: #339933;">=</span> arguments.<span style="color: #006633;">getDouble</span><span style="color: #009900;">&#40;</span><span style="color: #0000ff;">&quot;term_count_in_document&quot;</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
      <span style="color: #003399;">Double</span> totalTerms <span style="color: #339933;">=</span> arguments.<span style="color: #006633;">getDouble</span><span style="color: #009900;">&#40;</span><span style="color: #0000ff;">&quot;total_terms_in_document&quot;</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
      <span style="color: #003399;">BigDecimal</span> tf <span style="color: #339933;">=</span> <span style="color: #000000; font-weight: bold;">new</span> <span style="color: #003399;">BigDecimal</span><span style="color: #009900;">&#40;</span>termCount <span style="color: #339933;">/</span> totalTerms<span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
      <span style="color: #666666; font-style: italic;">// idf</span>
      <span style="color: #003399;">Double</span> totalDocuments  <span style="color: #339933;">=</span> arguments.<span style="color: #006633;">getDouble</span><span style="color: #009900;">&#40;</span><span style="color: #0000ff;">&quot;total_documents&quot;</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
      <span style="color: #003399;">Double</span> td <span style="color: #339933;">=</span> arguments.<span style="color: #006633;">getDouble</span><span style="color: #009900;">&#40;</span><span style="color: #0000ff;">&quot;number_of_documents_containing_term&quot;</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
      <span style="color: #003399;">BigDecimal</span> idf <span style="color: #339933;">=</span> <span style="color: #000000; font-weight: bold;">new</span> <span style="color: #003399;">BigDecimal</span><span style="color: #009900;">&#40;</span><span style="color: #003399;">Math</span>.<span style="color: #006633;">log</span><span style="color: #009900;">&#40;</span>totalDocuments <span style="color: #339933;">/</span> <span style="color: #009900;">&#40;</span><span style="color: #cc66cc;">1</span> <span style="color: #339933;">+</span> td<span style="color: #009900;">&#41;</span><span style="color: #009900;">&#41;</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
      <span style="color: #666666; font-style: italic;">// tfidf</span>
      <span style="color: #003399;">BigDecimal</span> tfidf <span style="color: #339933;">=</span> tf.<span style="color: #006633;">multiply</span><span style="color: #009900;">&#40;</span>idf<span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
      functionCall.<span style="color: #006633;">getOutputCollector</span><span style="color: #009900;">&#40;</span><span style="color: #009900;">&#41;</span>.<span style="color: #006633;">add</span><span style="color: #009900;">&#40;</span>
          <span style="color: #000000; font-weight: bold;">new</span> Tuple<span style="color: #009900;">&#40;</span> tf.<span style="color: #006633;">toPlainString</span><span style="color: #009900;">&#40;</span><span style="color: #009900;">&#41;</span>, idf.<span style="color: #006633;">toPlainString</span><span style="color: #009900;">&#40;</span><span style="color: #009900;">&#41;</span>, tfidf.<span style="color: #006633;">toPlainString</span><span style="color: #009900;">&#40;</span><span style="color: #009900;">&#41;</span> <span style="color: #009900;">&#41;</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
    <span style="color: #009900;">&#125;</span>
  <span style="color: #009900;">&#125;</span>
  <span style="color: #000000; font-weight: bold;">private</span> <span style="color: #003399;">Integer</span> totalNumberOfDocuments<span style="color: #339933;">;</span>
  <span style="color: #000000; font-weight: bold;">public</span> TfIdfIndexSubAssmbly<span style="color: #009900;">&#40;</span>Pipe pipe, <span style="color: #003399;">Integer</span> totalNumberOfDocuments<span style="color: #009900;">&#41;</span> <span style="color: #009900;">&#123;</span>
    <span style="color: #000000; font-weight: bold;">this</span>.<span style="color: #006633;">totalNumberOfDocuments</span> <span style="color: #339933;">=</span> totalNumberOfDocuments<span style="color: #339933;">;</span>
    <span style="color: #666666; font-style: italic;">// calculate the total terms in each document. note that the input set is</span>
    <span style="color: #666666; font-style: italic;">// smaller because we've already counted the occurrence of each term in</span>
    <span style="color: #666666; font-style: italic;">// each document</span>
    <span style="color: #666666; font-style: italic;">// input: (document_id, term, term_count_in_document)</span>
    <span style="color: #666666; font-style: italic;">// emits: (document_id, term, term_count_in_document, total_terms_in_document)</span>
    <span style="color: #666666; font-style: italic;">// pipe = new Each(pipe, new Debug(true));</span>
    pipe <span style="color: #339933;">=</span> <span style="color: #000000; font-weight: bold;">new</span> GroupBy<span style="color: #009900;">&#40;</span>pipe, <span style="color: #000000; font-weight: bold;">new</span> Fields<span style="color: #009900;">&#40;</span><span style="color: #0000ff;">&quot;document_id&quot;</span><span style="color: #009900;">&#41;</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
    pipe <span style="color: #339933;">=</span> <span style="color: #000000; font-weight: bold;">new</span> Every<span style="color: #009900;">&#40;</span>pipe,
        <span style="color: #000000; font-weight: bold;">new</span> BufferedSum<span style="color: #009900;">&#40;</span><span style="color: #000000; font-weight: bold;">new</span> Fields<span style="color: #009900;">&#40;</span><span style="color: #0000ff;">&quot;total_terms_in_document&quot;</span><span style="color: #009900;">&#41;</span>,
                       <span style="color: #000000; font-weight: bold;">new</span> Fields<span style="color: #009900;">&#40;</span><span style="color: #0000ff;">&quot;term_count_in_document&quot;</span><span style="color: #009900;">&#41;</span>,
                       <span style="color: #000000; font-weight: bold;">new</span> Fields<span style="color: #009900;">&#40;</span><span style="color: #0000ff;">&quot;document_id&quot;</span>, <span style="color: #0000ff;">&quot;term&quot;</span>, <span style="color: #0000ff;">&quot;term_count_in_document&quot;</span><span style="color: #009900;">&#41;</span><span style="color: #009900;">&#41;</span>,
        Fields.<span style="color: #006633;">SWAP</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
    <span style="color: #666666; font-style: italic;">// calculate the number of documents containing each term</span>
    <span style="color: #666666; font-style: italic;">// input: (document_id, term, term_count_in_document, total_terms_in_document)</span>
    <span style="color: #666666; font-style: italic;">// emit:  (document_id, term, term_count_in_document, total_terms_in_document, number_of_documents_containing_term)</span>
    pipe <span style="color: #339933;">=</span> <span style="color: #000000; font-weight: bold;">new</span> Each<span style="color: #009900;">&#40;</span>pipe, <span style="color: #000000; font-weight: bold;">new</span> Insert<span style="color: #009900;">&#40;</span><span style="color: #000000; font-weight: bold;">new</span> Fields<span style="color: #009900;">&#40;</span><span style="color: #0000ff;">&quot;term_in_doc&quot;</span><span style="color: #009900;">&#41;</span>, <span style="color: #cc66cc;">1</span><span style="color: #009900;">&#41;</span>, Fields.<span style="color: #006633;">ALL</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span> <span style="color: #666666; font-style: italic;">// we're going to sum these, easier than creating BufferedCount</span>
    pipe <span style="color: #339933;">=</span> <span style="color: #000000; font-weight: bold;">new</span> GroupBy<span style="color: #009900;">&#40;</span>pipe, <span style="color: #000000; font-weight: bold;">new</span> Fields<span style="color: #009900;">&#40;</span><span style="color: #0000ff;">&quot;term&quot;</span><span style="color: #009900;">&#41;</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
    pipe <span style="color: #339933;">=</span> <span style="color: #000000; font-weight: bold;">new</span> Every<span style="color: #009900;">&#40;</span>pipe,
        <span style="color: #000000; font-weight: bold;">new</span> BufferedSum<span style="color: #009900;">&#40;</span><span style="color: #000000; font-weight: bold;">new</span> Fields<span style="color: #009900;">&#40;</span><span style="color: #0000ff;">&quot;number_of_documents_containing_term&quot;</span><span style="color: #009900;">&#41;</span>,
                       <span style="color: #000000; font-weight: bold;">new</span> Fields<span style="color: #009900;">&#40;</span><span style="color: #0000ff;">&quot;term_in_doc&quot;</span><span style="color: #009900;">&#41;</span>,
                       <span style="color: #000000; font-weight: bold;">new</span> Fields<span style="color: #009900;">&#40;</span><span style="color: #0000ff;">&quot;document_id&quot;</span>, <span style="color: #0000ff;">&quot;term&quot;</span>, <span style="color: #0000ff;">&quot;term_count_in_document&quot;</span>, <span style="color: #0000ff;">&quot;total_terms_in_document&quot;</span>, <span style="color: #0000ff;">&quot;term_in_doc&quot;</span><span style="color: #009900;">&#41;</span><span style="color: #009900;">&#41;</span>,
        Fields.<span style="color: #006633;">SWAP</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
    pipe <span style="color: #339933;">=</span> <span style="color: #000000; font-weight: bold;">new</span> Each<span style="color: #009900;">&#40;</span>pipe, <span style="color: #666666; font-style: italic;">//  reorder and rm some fields</span>
        <span style="color: #000000; font-weight: bold;">new</span> Fields<span style="color: #009900;">&#40;</span><span style="color: #0000ff;">&quot;document_id&quot;</span>, <span style="color: #0000ff;">&quot;term&quot;</span>, <span style="color: #0000ff;">&quot;term_count_in_document&quot;</span>, <span style="color: #0000ff;">&quot;total_terms_in_document&quot;</span>, <span style="color: #0000ff;">&quot;number_of_documents_containing_term&quot;</span><span style="color: #009900;">&#41;</span>,
        <span style="color: #000000; font-weight: bold;">new</span> <span style="color: #003399;">Identity</span><span style="color: #009900;">&#40;</span><span style="color: #009900;">&#41;</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
    <span style="color: #666666; font-style: italic;">// here we do a hard-insert of the number of documents. again, you</span>
    <span style="color: #666666; font-style: italic;">// could/should calculate this with MR, but this is not the place. It</span>
    <span style="color: #666666; font-style: italic;">// would be better to calculate the number of documents when you are</span>
    <span style="color: #666666; font-style: italic;">// generating the input documents file.</span>
    <span style="color: #000066; font-weight: bold;">int</span> D <span style="color: #339933;">=</span> <span style="color: #000000; font-weight: bold;">this</span>.<span style="color: #006633;">totalNumberOfDocuments</span><span style="color: #339933;">;</span>
    pipe <span style="color: #339933;">=</span> <span style="color: #000000; font-weight: bold;">new</span> Each<span style="color: #009900;">&#40;</span>pipe, <span style="color: #000000; font-weight: bold;">new</span> Insert<span style="color: #009900;">&#40;</span><span style="color: #000000; font-weight: bold;">new</span> Fields<span style="color: #009900;">&#40;</span><span style="color: #0000ff;">&quot;total_documents&quot;</span><span style="color: #009900;">&#41;</span>, D<span style="color: #009900;">&#41;</span>, Fields.<span style="color: #006633;">ALL</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
    <span style="color: #666666; font-style: italic;">// now calculate tf * idf</span>
    <span style="color: #666666; font-style: italic;">// input:  (document_id, term, term_count_in_document, total_terms_in_document, number_of_documents_containing_term, total_documents)</span>
    <span style="color: #666666; font-style: italic;">// emit:   (document_id, term, term_count_in_document, total_terms_in_document, number_of_documents_containing_term, total_documents, tf, idf, tfidf)</span>
    pipe <span style="color: #339933;">=</span> <span style="color: #000000; font-weight: bold;">new</span> Each<span style="color: #009900;">&#40;</span>pipe, <span style="color: #000000; font-weight: bold;">new</span> TfIdfOperation<span style="color: #009900;">&#40;</span><span style="color: #009900;">&#41;</span>, Fields.<span style="color: #006633;">ALL</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
    pipe <span style="color: #339933;">=</span> <span style="color: #000000; font-weight: bold;">new</span> Each<span style="color: #009900;">&#40;</span>pipe, <span style="color: #666666; font-style: italic;">// reorder and rm some fields</span>
        <span style="color: #000000; font-weight: bold;">new</span> Fields<span style="color: #009900;">&#40;</span><span style="color: #0000ff;">&quot;term&quot;</span>, <span style="color: #0000ff;">&quot;document_id&quot;</span>, <span style="color: #0000ff;">&quot;tfidf&quot;</span><span style="color: #009900;">&#41;</span>,
        <span style="color: #000000; font-weight: bold;">new</span> <span style="color: #003399;">Identity</span><span style="color: #009900;">&#40;</span><span style="color: #009900;">&#41;</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
    pipe <span style="color: #339933;">=</span> <span style="color: #000000; font-weight: bold;">new</span> GroupBy<span style="color: #009900;">&#40;</span>pipe, <span style="color: #000000; font-weight: bold;">new</span> Fields<span style="color: #009900;">&#40;</span><span style="color: #0000ff;">&quot;term&quot;</span><span style="color: #009900;">&#41;</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
    <span style="color: #666666; font-style: italic;">// &quot;stripe&quot; our group e.g.:</span>
    <span style="color: #666666; font-style: italic;">// input: (term,</span>
    <span style="color: #666666; font-style: italic;">//          (document_id_1, tfidf_1),</span>
    <span style="color: #666666; font-style: italic;">//          (document_id_2, tfidf_2),</span>
    <span style="color: #666666; font-style: italic;">//          ...)</span>
    <span style="color: #666666; font-style: italic;">// emit: (term, {document_id_1:tfidf_1, document_id_2:tfidf_2})</span>
    pipe <span style="color: #339933;">=</span> <span style="color: #000000; font-weight: bold;">new</span> Every<span style="color: #009900;">&#40;</span>pipe,
        <span style="color: #000000; font-weight: bold;">new</span> Fields<span style="color: #009900;">&#40;</span><span style="color: #0000ff;">&quot;document_id&quot;</span>, <span style="color: #0000ff;">&quot;tfidf&quot;</span><span style="color: #009900;">&#41;</span>,
        <span style="color: #000000; font-weight: bold;">new</span> JSONTupleAggregator<span style="color: #009900;">&#40;</span><span style="color: #000000; font-weight: bold;">new</span> Fields<span style="color: #009900;">&#40;</span><span style="color: #0000ff;">&quot;scores&quot;</span><span style="color: #009900;">&#41;</span>, <span style="color: #0000ff;">&quot;JSONObject&quot;</span><span style="color: #009900;">&#41;</span>,
        <span style="color: #000000; font-weight: bold;">new</span> Fields<span style="color: #009900;">&#40;</span><span style="color: #0000ff;">&quot;term&quot;</span>, <span style="color: #0000ff;">&quot;scores&quot;</span><span style="color: #009900;">&#41;</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
    <span style="color: #666666; font-style: italic;">// must register all assembly tails</span>
    setTails<span style="color: #009900;">&#40;</span>pipe<span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
  <span style="color: #009900;">&#125;</span>
<span style="color: #009900;">&#125;</span></pre></div></div>

]]></content:encoded>
			<wfw:commentRss>http://eigenjoy.com/2010/05/14/cascading-tf-idf-and-bufferedsum-part-2/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>How to use Cascading with Hadoop Streaming</title>
		<link>http://eigenjoy.com/2009/11/18/how-to-use-cascading-with-hadoop-streaming/</link>
		<comments>http://eigenjoy.com/2009/11/18/how-to-use-cascading-with-hadoop-streaming/#comments</comments>
		<pubDate>Wed, 18 Nov 2009 19:45:46 +0000</pubDate>
		<dc:creator>Nate Murray</dc:creator>
				<category><![CDATA[programming]]></category>

		<guid isPermaLink="false">http://www.xcombinator.com/?p=152</guid>
		<description><![CDATA[Last time we talked about how to use a raw MapReduce job in Cascading. Now we are going to up the ante by using Hadoop Streaming as a Flow in Cascading. In this example, we hook a Python streaming job into a Cascade.
It&#8217;s pretty easy once you know how to do it: 

Create a JobConf [...]]]></description>
			<content:encoded><![CDATA[<p><a href="http://www.xcombinator.com/2009/11/11/how-to-use-a-raw-mapreduce-job-in-cascading/">Last time</a> we talked about how to use a raw MapReduce job in Cascading. Now we are going to up the ante by using Hadoop Streaming as a Flow in Cascading. In this example, we hook a Python streaming job into a Cascade.</p>
<p>It&#8217;s pretty easy once you know how to do it:</p>
<ul>
<li>Create a JobConf that defines the parameters for the streaming job</li>
<li>Send up the <code>hadoop-*-streaming.jar</code> with your Cascading job by putting it in your <code>jar</code></li>
<li>Send up the scripts (Python, in this case) by using the <code>-file</code> option</li>
<li>Send up any other dependencies, corpora, etc. by using the <code>-file</code>, <code>-cacheFile</code>, or <code>-cacheArchive</code> options (See the <a href="http://hadoop.apache.org/common/docs/r0.20.0/streaming.html">Hadoop Streaming</a> page for more details)</li>
</ul>
<h2>Resources</h2>
<h3>NLTK</h3>
<p>To generate the <code>nltkandyaml.mod</code> zip file do the following:</p>
<pre><code># download nltk and unzip
cd nltk
zip -r nltkandyaml.zip nltk yaml
mv nltkandyaml.zip nltkandyaml.mod
</code></pre>
<p>Note that this technique is taken from <a href="http://www.cloudera.com/node/48">Cloudera</a>.</p>
<h3>WordNet</h3>
<p>The WordNet zip file needs to be flat, i.e. don&#8217;t zip up the files inside a subdirectory. You could create this file like so:</p>
<pre><code># download and unzip the wordnet corpus
cd wordnet
zip -r ../wordnet-flat.zip *
</code></pre>
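<p>Before shipping the archive, it can be worth checking that it really is flat. The following is a small sketch using Python&#8217;s standard <code>zipfile</code> module (written for Python 3). The helper name <code>is_flat_zip</code> and the file names are hypothetical; the two throwaway archives exist only to demonstrate the check.</p>

```python
import os
import tempfile
import zipfile


def is_flat_zip(path):
    """Return True if no entry in the archive sits inside a directory."""
    with zipfile.ZipFile(path) as archive:
        # Zip entry names always use "/" as the separator.
        return all("/" not in name for name in archive.namelist())


# Build two tiny archives: one flat, one with a subdirectory.
workdir = tempfile.mkdtemp()
flat_zip = os.path.join(workdir, "flat.zip")
nested_zip = os.path.join(workdir, "nested.zip")
with zipfile.ZipFile(flat_zip, "w") as archive:
    archive.writestr("data.noun", "placeholder")
with zipfile.ZipFile(nested_zip, "w") as archive:
    archive.writestr("wordnet/data.noun", "placeholder")

print("flat: %s, nested: %s" % (is_flat_zip(flat_zip), is_flat_zip(nested_zip)))
# prints: flat: True, nested: False
```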
<h2>Streaming Script</h2>
<p>In Python, we&#8217;ll be using <code>zipimport.zipimporter</code> to import the <code>nltk</code> libraries from a zip file. Hadoop 0.20.0 didn&#8217;t decompress our <code>wordnet-flat.zip</code> file automatically (we&#8217;ve heard reports that some versions do, but I&#8217;m not sure which). For us the <code>.zip</code> file was placed in <code>lib</code> relative to the <code>pwd</code> of the script. This allowed us to keep the WordNet corpus as a zip and read it in that format.</p>

<div class="wp_syntax"><div class="code"><pre class="python" style="font-family:monospace;">wn = WordNetCorpusReader<span style="color: black;">&#40;</span>nltk.<span style="color: black;">data</span>.<span style="color: black;">find</span><span style="color: black;">&#40;</span><span style="color: #483d8b;">'lib/wordnet-flat.zip'</span><span style="color: black;">&#41;</span><span style="color: black;">&#41;</span></pre></div></div>
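<p>If importing from a zip file is unfamiliar, the mechanism can be tried in isolation. The sketch below (Python 3) is not the streaming script: it builds a throwaway archive with one tiny module and imports it by putting the archive on <code>sys.path</code>, which goes through the same <code>zipimport</code> machinery as calling <code>zipimport.zipimporter</code> directly. The names <code>bundle.mod</code> and <code>zipped_greeting</code> are made up for the example.</p>

```python
import importlib
import os
import sys
import tempfile
import zipfile

# Create a throwaway archive holding a single module. The extension does not
# have to be .zip; the import machinery inspects the file contents.
workdir = tempfile.mkdtemp()
archive_path = os.path.join(workdir, "bundle.mod")
with zipfile.ZipFile(archive_path, "w") as archive:
    archive.writestr(
        "zipped_greeting.py",
        "def hello():\n    return 'hello from the zip'\n",
    )

# With the archive on sys.path, a normal import finds the module inside it.
sys.path.insert(0, archive_path)
zipped_greeting = importlib.import_module("zipped_greeting")

print(zipped_greeting.hello())  # prints: hello from the zip
```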

<p>(In the script below we&#8217;re not using the Python reducer.)</p>

<div class="wp_syntax"><div class="code"><pre class="python" style="font-family:monospace;"><span style="color: #808080; font-style: italic;">#!/usr/bin/env python </span>
<span style="color: #ff7700;font-weight:bold;">import</span> <span style="color: #dc143c;">os</span>
<span style="color: #ff7700;font-weight:bold;">import</span> <span style="color: #dc143c;">re</span>
<span style="color: #ff7700;font-weight:bold;">import</span> <span style="color: #dc143c;">sys</span>
<span style="color: #ff7700;font-weight:bold;">import</span> <span style="color: #dc143c;">zipimport</span>
&nbsp;
importer = <span style="color: #dc143c;">zipimport</span>.<span style="color: black;">zipimporter</span><span style="color: black;">&#40;</span><span style="color: #483d8b;">'nltkandyaml.mod'</span><span style="color: black;">&#41;</span>
yaml = importer.<span style="color: black;">load_module</span><span style="color: black;">&#40;</span><span style="color: #483d8b;">'yaml'</span><span style="color: black;">&#41;</span>
nltk = importer.<span style="color: black;">load_module</span><span style="color: black;">&#40;</span><span style="color: #483d8b;">'nltk'</span><span style="color: black;">&#41;</span>
punct = <span style="color: #dc143c;">re</span>.<span style="color: #008000;">compile</span><span style="color: black;">&#40;</span><span style="color: #483d8b;">'[^<span style="color: #000099; font-weight: bold;">\w</span><span style="color: #000099; font-weight: bold;">\s</span>]+'</span><span style="color: black;">&#41;</span>
&nbsp;
<span style="color: #ff7700;font-weight:bold;">from</span> nltk.<span style="color: black;">corpus</span>.<span style="color: black;">reader</span> <span style="color: #ff7700;font-weight:bold;">import</span> wordnet
<span style="color: #ff7700;font-weight:bold;">from</span> nltk.<span style="color: black;">corpus</span>.<span style="color: black;">reader</span> <span style="color: #ff7700;font-weight:bold;">import</span> WordNetCorpusReader
&nbsp;
nltk.<span style="color: black;">data</span>.<span style="color: black;">path</span> += <span style="color: black;">&#91;</span><span style="color: #483d8b;">&quot;.&quot;</span><span style="color: black;">&#93;</span>
wn = WordNetCorpusReader<span style="color: black;">&#40;</span>nltk.<span style="color: black;">data</span>.<span style="color: black;">find</span><span style="color: black;">&#40;</span><span style="color: #483d8b;">'lib/wordnet-flat.zip'</span><span style="color: black;">&#41;</span><span style="color: black;">&#41;</span>
&nbsp;
<span style="color: #ff7700;font-weight:bold;">def</span> mapper<span style="color: black;">&#40;</span>args<span style="color: black;">&#41;</span>:
  line = <span style="color: #dc143c;">sys</span>.<span style="color: black;">stdin</span>.<span style="color: #dc143c;">readline</span><span style="color: black;">&#40;</span><span style="color: black;">&#41;</span>
  <span style="color: #ff7700;font-weight:bold;">try</span>:
    <span style="color: #ff7700;font-weight:bold;">while</span> line:
      line = line.<span style="color: black;">strip</span><span style="color: black;">&#40;</span><span style="color: black;">&#41;</span>
      word = line
      all_synonyms = <span style="color: black;">&#91;</span><span style="color: black;">&#93;</span>
&nbsp;
      string_synsets = wn.<span style="color: black;">synsets</span><span style="color: black;">&#40;</span>word<span style="color: black;">&#41;</span>
&nbsp;
      <span style="color: #ff7700;font-weight:bold;">for</span> synset <span style="color: #ff7700;font-weight:bold;">in</span> string_synsets:
        synonyms = <span style="color: black;">&#91;</span>lemma.<span style="color: black;">name</span> <span style="color: #ff7700;font-weight:bold;">for</span> lemma <span style="color: #ff7700;font-weight:bold;">in</span> wn.<span style="color: black;">synset</span><span style="color: black;">&#40;</span>synset.<span style="color: black;">name</span><span style="color: black;">&#41;</span>.<span style="color: black;">lemmas</span><span style="color: black;">&#93;</span>
        synonyms.<span style="color: black;">pop</span><span style="color: black;">&#40;</span><span style="color: #ff4500;">0</span><span style="color: black;">&#41;</span>
        <span style="color: #ff7700;font-weight:bold;">for</span> synonym <span style="color: #ff7700;font-weight:bold;">in</span> synonyms:
          synonym = <span style="color: #dc143c;">re</span>.<span style="color: black;">sub</span><span style="color: black;">&#40;</span><span style="color: #483d8b;">&quot;_&quot;</span>, <span style="color: #483d8b;">&quot; &quot;</span>, synonym<span style="color: black;">&#41;</span>
          all_synonyms.<span style="color: black;">append</span><span style="color: black;">&#40;</span>synonym<span style="color: black;">&#41;</span> 
&nbsp;
      <span style="color: #ff7700;font-weight:bold;">print</span> <span style="color: #483d8b;">&quot;<span style="color: #000099; font-weight: bold;">\t</span>&quot;</span>.<span style="color: black;">join</span><span style="color: black;">&#40;</span><span style="color: black;">&#91;</span>word, <span style="color: #483d8b;">','</span>.<span style="color: black;">join</span><span style="color: black;">&#40;</span>all_synonyms<span style="color: black;">&#41;</span><span style="color: black;">&#93;</span><span style="color: black;">&#41;</span>
      line = <span style="color: #dc143c;">sys</span>.<span style="color: black;">stdin</span>.<span style="color: #dc143c;">readline</span><span style="color: black;">&#40;</span><span style="color: black;">&#41;</span>
  <span style="color: #ff7700;font-weight:bold;">except</span> <span style="color: #008000;">EOFError</span>:
    <span style="color: #ff7700;font-weight:bold;">return</span> <span style="color: #008000;">None</span>
&nbsp;
<span style="color: #808080; font-style: italic;"># we're not using this, but we could</span>
<span style="color: #ff7700;font-weight:bold;">def</span> reducer<span style="color: black;">&#40;</span>args<span style="color: black;">&#41;</span>:
  <span style="color: #ff7700;font-weight:bold;">for</span> line <span style="color: #ff7700;font-weight:bold;">in</span> <span style="color: #dc143c;">sys</span>.<span style="color: black;">stdin</span>:
    line = line.<span style="color: black;">strip</span><span style="color: black;">&#40;</span><span style="color: black;">&#41;</span>
    <span style="color: #ff7700;font-weight:bold;">print</span> line
&nbsp;
<span style="color: #ff7700;font-weight:bold;">if</span> __name__ == <span style="color: #483d8b;">&quot;__main__&quot;</span>:
  <span style="color: #ff7700;font-weight:bold;">if</span> <span style="color: #dc143c;">sys</span>.<span style="color: black;">argv</span><span style="color: black;">&#91;</span><span style="color: #ff4500;">1</span><span style="color: black;">&#93;</span> == <span style="color: #483d8b;">&quot;mapper&quot;</span>:
    mapper<span style="color: black;">&#40;</span><span style="color: #dc143c;">sys</span>.<span style="color: black;">argv</span><span style="color: black;">&#91;</span><span style="color: #ff4500;">2</span>:<span style="color: black;">&#93;</span><span style="color: black;">&#41;</span>
  <span style="color: #ff7700;font-weight:bold;">elif</span> <span style="color: #dc143c;">sys</span>.<span style="color: black;">argv</span><span style="color: black;">&#91;</span><span style="color: #ff4500;">1</span><span style="color: black;">&#93;</span> == <span style="color: #483d8b;">&quot;reducer&quot;</span>:
    reducer<span style="color: black;">&#40;</span><span style="color: #dc143c;">sys</span>.<span style="color: black;">argv</span><span style="color: black;">&#91;</span><span style="color: #ff4500;">2</span>:<span style="color: black;">&#93;</span><span style="color: black;">&#41;</span></pre></div></div>

<h2>Cascading Code</h2>
<p>Here&#8217;s the bulk of the code. Like last time, we use two intermediate taps: one holds the input to the streaming job and the other holds its output. For simplicity, all of the taps are plain TextLine files. If you don&#8217;t want the intermediate files hanging around, the comments near the bottom of <code>run()</code> show how to remove them once the job has finished.</p>

<div class="wp_syntax"><div class="code"><pre class="java" style="font-family:monospace;"><span style="color: #000000; font-weight: bold;">package</span> <span style="color: #006699;">com.xcombinator.hadoopjobs.cascadingstreamingtest</span><span style="color: #339933;">;</span>
&nbsp;
<span style="color: #000000; font-weight: bold;">import</span> <span style="color: #006699;">cascading.cascade.*</span><span style="color: #339933;">;</span>
<span style="color: #000000; font-weight: bold;">import</span> <span style="color: #006699;">cascading.flow.Flow</span><span style="color: #339933;">;</span>
<span style="color: #000000; font-weight: bold;">import</span> <span style="color: #006699;">cascading.flow.FlowConnector</span><span style="color: #339933;">;</span>
<span style="color: #000000; font-weight: bold;">import</span> <span style="color: #006699;">cascading.flow.MapReduceFlow</span><span style="color: #339933;">;</span>
<span style="color: #000000; font-weight: bold;">import</span> <span style="color: #006699;">cascading.operation.aggregator.Count</span><span style="color: #339933;">;</span>
<span style="color: #000000; font-weight: bold;">import</span> <span style="color: #006699;">cascading.operation.regex.*</span><span style="color: #339933;">;</span>
<span style="color: #000000; font-weight: bold;">import</span> <span style="color: #006699;">cascading.pipe.*</span><span style="color: #339933;">;</span>
<span style="color: #000000; font-weight: bold;">import</span> <span style="color: #006699;">cascading.scheme.*</span><span style="color: #339933;">;</span>
<span style="color: #000000; font-weight: bold;">import</span> <span style="color: #006699;">cascading.tap.*</span><span style="color: #339933;">;</span>
<span style="color: #000000; font-weight: bold;">import</span> <span style="color: #006699;">cascading.tuple.Fields</span><span style="color: #339933;">;</span>
<span style="color: #000000; font-weight: bold;">import</span> <span style="color: #006699;">cascading.operation.Identity</span><span style="color: #339933;">;</span>
<span style="color: #000000; font-weight: bold;">import</span> <span style="color: #006699;">java.util.Properties</span><span style="color: #339933;">;</span>
<span style="color: #000000; font-weight: bold;">import</span> <span style="color: #006699;">org.apache.hadoop.conf.Configuration</span><span style="color: #339933;">;</span>
<span style="color: #000000; font-weight: bold;">import</span> <span style="color: #006699;">org.apache.hadoop.conf.Configured</span><span style="color: #339933;">;</span>
<span style="color: #000000; font-weight: bold;">import</span> <span style="color: #006699;">org.apache.hadoop.io.LongWritable</span><span style="color: #339933;">;</span>
<span style="color: #000000; font-weight: bold;">import</span> <span style="color: #006699;">org.apache.hadoop.io.Text</span><span style="color: #339933;">;</span>
<span style="color: #000000; font-weight: bold;">import</span> <span style="color: #006699;">org.apache.hadoop.mapred.FileInputFormat</span><span style="color: #339933;">;</span>
<span style="color: #000000; font-weight: bold;">import</span> <span style="color: #006699;">org.apache.hadoop.mapred.FileOutputFormat</span><span style="color: #339933;">;</span>
<span style="color: #000000; font-weight: bold;">import</span> <span style="color: #006699;">org.apache.hadoop.mapred.JobConf</span><span style="color: #339933;">;</span>
<span style="color: #000000; font-weight: bold;">import</span> <span style="color: #006699;">org.apache.hadoop.mapred.TextInputFormat</span><span style="color: #339933;">;</span>
<span style="color: #000000; font-weight: bold;">import</span> <span style="color: #006699;">org.apache.hadoop.mapred.TextOutputFormat</span><span style="color: #339933;">;</span>
<span style="color: #000000; font-weight: bold;">import</span> <span style="color: #006699;">org.apache.hadoop.mapred.lib.IdentityMapper</span><span style="color: #339933;">;</span>
<span style="color: #000000; font-weight: bold;">import</span> <span style="color: #006699;">org.apache.hadoop.mapred.lib.IdentityReducer</span><span style="color: #339933;">;</span>
<span style="color: #000000; font-weight: bold;">import</span> <span style="color: #006699;">org.apache.hadoop.util.Tool</span><span style="color: #339933;">;</span>
<span style="color: #000000; font-weight: bold;">import</span> <span style="color: #006699;">org.apache.hadoop.util.ToolRunner</span><span style="color: #339933;">;</span>
<span style="color: #000000; font-weight: bold;">import</span> <span style="color: #006699;">org.apache.log4j.Logger</span><span style="color: #339933;">;</span>
<span style="color: #000000; font-weight: bold;">import</span> <span style="color: #006699;">cascading.operation.Debug</span><span style="color: #339933;">;</span>
<span style="color: #000000; font-weight: bold;">import</span> <span style="color: #006699;">org.apache.hadoop.streaming.StreamJob</span><span style="color: #339933;">;</span>
<span style="color: #000000; font-weight: bold;">import</span> <span style="color: #006699;">java.io.IOException</span><span style="color: #339933;">;</span>
&nbsp;
<span style="color: #008000; font-style: italic; font-weight: bold;">/**
 * An example file to use a Hadoop Streaming job in cascading
 */</span>
<span style="color: #000000; font-weight: bold;">public</span> <span style="color: #000000; font-weight: bold;">class</span> Main <span style="color: #000000; font-weight: bold;">extends</span> Configured <span style="color: #000000; font-weight: bold;">implements</span> Tool
  <span style="color: #009900;">&#123;</span>
  <span style="color: #000000; font-weight: bold;">private</span> <span style="color: #000000; font-weight: bold;">static</span> <span style="color: #000000; font-weight: bold;">final</span> Logger LOG <span style="color: #339933;">=</span> Logger.<span style="color: #006633;">getLogger</span><span style="color: #009900;">&#40;</span> Main.<span style="color: #000000; font-weight: bold;">class</span> <span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
&nbsp;
  <span style="color: #000000; font-weight: bold;">public</span> <span style="color: #000066; font-weight: bold;">int</span> run<span style="color: #009900;">&#40;</span><span style="color: #003399;">String</span><span style="color: #009900;">&#91;</span><span style="color: #009900;">&#93;</span> args<span style="color: #009900;">&#41;</span>
  <span style="color: #009900;">&#123;</span>
    JobConf conf <span style="color: #339933;">=</span> <span style="color: #000000; font-weight: bold;">new</span> JobConf<span style="color: #009900;">&#40;</span>getConf<span style="color: #009900;">&#40;</span><span style="color: #009900;">&#41;</span>, <span style="color: #000000; font-weight: bold;">this</span>.<span style="color: #006633;">getClass</span><span style="color: #009900;">&#40;</span><span style="color: #009900;">&#41;</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
    <span style="color: #003399;">Properties</span> properties <span style="color: #339933;">=</span> <span style="color: #000000; font-weight: bold;">new</span> <span style="color: #003399;">Properties</span><span style="color: #009900;">&#40;</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
    FlowConnector.<span style="color: #006633;">setApplicationJarClass</span><span style="color: #009900;">&#40;</span>properties, <span style="color: #000000; font-weight: bold;">this</span>.<span style="color: #006633;">getClass</span><span style="color: #009900;">&#40;</span><span style="color: #009900;">&#41;</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
&nbsp;
    CascadeConnector cascadeConnector <span style="color: #339933;">=</span> <span style="color: #000000; font-weight: bold;">new</span> CascadeConnector<span style="color: #009900;">&#40;</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
    FlowConnector flowConnector <span style="color: #339933;">=</span> <span style="color: #000000; font-weight: bold;">new</span> FlowConnector<span style="color: #009900;">&#40;</span>properties<span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
&nbsp;
    <span style="color: #003399;">String</span> inputPath  <span style="color: #339933;">=</span> args<span style="color: #009900;">&#91;</span><span style="color: #cc66cc;">0</span><span style="color: #009900;">&#93;</span><span style="color: #339933;">;</span>
    <span style="color: #003399;">String</span> outputPath <span style="color: #339933;">=</span> args<span style="color: #009900;">&#91;</span><span style="color: #cc66cc;">1</span><span style="color: #009900;">&#93;</span><span style="color: #339933;">;</span>
    <span style="color: #003399;">String</span> intermediatePath1 <span style="color: #339933;">=</span> args<span style="color: #009900;">&#91;</span><span style="color: #cc66cc;">1</span><span style="color: #009900;">&#93;</span> <span style="color: #339933;">+</span> <span style="color: #0000ff;">&quot;-mr-input&quot;</span><span style="color: #339933;">;</span>
    <span style="color: #003399;">String</span> intermediatePath2 <span style="color: #339933;">=</span> args<span style="color: #009900;">&#91;</span><span style="color: #cc66cc;">1</span><span style="color: #009900;">&#93;</span> <span style="color: #339933;">+</span> <span style="color: #0000ff;">&quot;-mr-output&quot;</span><span style="color: #339933;">;</span>
&nbsp;
    Scheme textLineScheme <span style="color: #339933;">=</span> <span style="color: #000000; font-weight: bold;">new</span> TextLine<span style="color: #009900;">&#40;</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
&nbsp;
    Tap sourceTap <span style="color: #339933;">=</span> <span style="color: #000000; font-weight: bold;">new</span> Hfs<span style="color: #009900;">&#40;</span>textLineScheme, inputPath<span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
    Tap intermediateTap1 <span style="color: #339933;">=</span> <span style="color: #000000; font-weight: bold;">new</span> Hfs<span style="color: #009900;">&#40;</span>textLineScheme, intermediatePath1<span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
    Tap intermediateTap2 <span style="color: #339933;">=</span> <span style="color: #000000; font-weight: bold;">new</span> Hfs<span style="color: #009900;">&#40;</span>textLineScheme, intermediatePath2<span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
    Tap sinkTap   <span style="color: #339933;">=</span> <span style="color: #000000; font-weight: bold;">new</span> Hfs<span style="color: #009900;">&#40;</span>textLineScheme, outputPath<span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
&nbsp;
    <span style="color: #666666; font-style: italic;">// create our first flow: split each line into words, sink to intermediateTap1</span>
    Pipe wsPipe <span style="color: #339933;">=</span> <span style="color: #000000; font-weight: bold;">new</span> Each<span style="color: #009900;">&#40;</span><span style="color: #0000ff;">&quot;wordsplit&quot;</span>, 
        <span style="color: #000000; font-weight: bold;">new</span> Fields<span style="color: #009900;">&#40;</span><span style="color: #0000ff;">&quot;line&quot;</span><span style="color: #009900;">&#41;</span>, 
        <span style="color: #000000; font-weight: bold;">new</span> RegexSplitGenerator<span style="color: #009900;">&#40;</span><span style="color: #000000; font-weight: bold;">new</span> Fields<span style="color: #009900;">&#40;</span><span style="color: #0000ff;">&quot;word&quot;</span><span style="color: #009900;">&#41;</span>, <span style="color: #0000ff;">&quot;<span style="color: #000099; font-weight: bold;">\\</span>s+&quot;</span><span style="color: #009900;">&#41;</span>, 
        <span style="color: #000000; font-weight: bold;">new</span> Fields<span style="color: #009900;">&#40;</span><span style="color: #0000ff;">&quot;word&quot;</span><span style="color: #009900;">&#41;</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
&nbsp;
    Flow parsedLogFlow <span style="color: #339933;">=</span> flowConnector.<span style="color: #006633;">connect</span><span style="color: #009900;">&#40;</span>sourceTap, intermediateTap1, wsPipe<span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
&nbsp;
    <span style="color: #666666; font-style: italic;">// Create a pipe and set our mr job for it </span>
    Pipe importPipe <span style="color: #339933;">=</span> <span style="color: #000000; font-weight: bold;">new</span> Pipe<span style="color: #009900;">&#40;</span><span style="color: #0000ff;">&quot;mr pipe&quot;</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
    Flow mrFlow<span style="color: #339933;">;</span>
&nbsp;
    <span style="color: #000000; font-weight: bold;">try</span> <span style="color: #009900;">&#123;</span>
      JobConf streamConf <span style="color: #339933;">=</span> StreamJob.<span style="color: #006633;">createJob</span><span style="color: #009900;">&#40;</span> <span style="color: #000000; font-weight: bold;">new</span> <span style="color: #003399;">String</span><span style="color: #009900;">&#91;</span><span style="color: #009900;">&#93;</span><span style="color: #009900;">&#123;</span>
          <span style="color: #0000ff;">&quot;-input&quot;</span>, intermediateTap1.<span style="color: #006633;">getPath</span><span style="color: #009900;">&#40;</span><span style="color: #009900;">&#41;</span>.<span style="color: #006633;">toString</span><span style="color: #009900;">&#40;</span><span style="color: #009900;">&#41;</span>, 
          <span style="color: #0000ff;">&quot;-output&quot;</span>, intermediateTap2.<span style="color: #006633;">getPath</span><span style="color: #009900;">&#40;</span><span style="color: #009900;">&#41;</span>.<span style="color: #006633;">toString</span><span style="color: #009900;">&#40;</span><span style="color: #009900;">&#41;</span>,
&nbsp;
          <span style="color: #666666; font-style: italic;">// straight unix</span>
          <span style="color: #666666; font-style: italic;">// &quot;-mapper&quot;, &quot;/bin/cat&quot;,</span>
          <span style="color: #666666; font-style: italic;">// &quot;-reducer&quot;, &quot;/usr/bin/wc&quot;</span>
&nbsp;
          <span style="color: #666666; font-style: italic;">// ruby</span>
          <span style="color: #666666; font-style: italic;">// &quot;-mapper&quot;, &quot;src/main/ruby/word_count_mapper.rb&quot;,</span>
          <span style="color: #666666; font-style: italic;">// &quot;-reducer&quot;, &quot;src/main/ruby/word_count_reducer.rb&quot;,</span>
          <span style="color: #666666; font-style: italic;">// &quot;-file&quot;, &quot;src/main/ruby/word_count_mapper.rb&quot;,</span>
          <span style="color: #666666; font-style: italic;">// &quot;-file&quot;, &quot;src/main/ruby/word_count_reducer.rb&quot;</span>
&nbsp;
          <span style="color: #666666; font-style: italic;">// python</span>
          <span style="color: #0000ff;">&quot;-mapper&quot;</span>, <span style="color: #0000ff;">&quot;python synsets.py mapper&quot;</span>,
          <span style="color: #0000ff;">&quot;-reducer&quot;</span>, <span style="color: #0000ff;">&quot;org.apache.hadoop.mapred.lib.IdentityReducer&quot;</span>,
          <span style="color: #0000ff;">&quot;-file&quot;</span>, <span style="color: #0000ff;">&quot;src/main/python/synsets.py&quot;</span>,
          <span style="color: #0000ff;">&quot;-file&quot;</span>, <span style="color: #0000ff;">&quot;resources/nltkandyaml.mod&quot;</span>,
          <span style="color: #0000ff;">&quot;-file&quot;</span>, <span style="color: #0000ff;">&quot;resources/lib/wordnet-flat.zip&quot;</span>,
          <span style="color: #009900;">&#125;</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
      mrFlow <span style="color: #339933;">=</span> <span style="color: #000000; font-weight: bold;">new</span> MapReduceFlow<span style="color: #009900;">&#40;</span><span style="color: #0000ff;">&quot;streaming flow&quot;</span>, streamConf, intermediateTap1,
        intermediateTap2, <span style="color: #000066; font-weight: bold;">false</span>, <span style="color: #000066; font-weight: bold;">true</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
    <span style="color: #009900;">&#125;</span> <span style="color: #000000; font-weight: bold;">catch</span><span style="color: #009900;">&#40;</span><span style="color: #003399;">IOException</span> ioe<span style="color: #009900;">&#41;</span> <span style="color: #009900;">&#123;</span>
       ioe.<span style="color: #006633;">printStackTrace</span><span style="color: #009900;">&#40;</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
       <span style="color: #000000; font-weight: bold;">return</span> <span style="color: #cc66cc;">1</span><span style="color: #339933;">;</span>
    <span style="color: #009900;">&#125;</span>
&nbsp;
    <span style="color: #666666; font-style: italic;">// create our third flow, a &quot;regular&quot; cascading pipe. this is a bit contrived, but</span>
    <span style="color: #666666; font-style: italic;">// the idea is to substitute all 'e's with 'x's. it's just here to show how to</span>
    <span style="color: #666666; font-style: italic;">// take the output of a streaming job back into cascading</span>
    Pipe subPipe <span style="color: #339933;">=</span> <span style="color: #000000; font-weight: bold;">new</span> Pipe<span style="color: #009900;">&#40;</span><span style="color: #0000ff;">&quot;subber&quot;</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
    subPipe <span style="color: #339933;">=</span> <span style="color: #000000; font-weight: bold;">new</span> Each<span style="color: #009900;">&#40;</span>subPipe,
        <span style="color: #000000; font-weight: bold;">new</span> Fields<span style="color: #009900;">&#40;</span><span style="color: #0000ff;">&quot;line&quot;</span><span style="color: #009900;">&#41;</span>,
        <span style="color: #000000; font-weight: bold;">new</span> RegexReplace<span style="color: #009900;">&#40;</span><span style="color: #000000; font-weight: bold;">new</span> Fields<span style="color: #009900;">&#40;</span><span style="color: #0000ff;">&quot;linx&quot;</span><span style="color: #009900;">&#41;</span>, <span style="color: #0000ff;">&quot;e&quot;</span>, <span style="color: #0000ff;">&quot;x&quot;</span>, <span style="color: #000066; font-weight: bold;">true</span><span style="color: #009900;">&#41;</span>,
        <span style="color: #000000; font-weight: bold;">new</span> Fields<span style="color: #009900;">&#40;</span><span style="color: #0000ff;">&quot;linx&quot;</span><span style="color: #009900;">&#41;</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
    Flow subFlow <span style="color: #339933;">=</span> flowConnector.<span style="color: #006633;">connect</span><span style="color: #009900;">&#40;</span>intermediateTap2, sinkTap, subPipe<span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
&nbsp;
    Cascade cascade <span style="color: #339933;">=</span> cascadeConnector.<span style="color: #006633;">connect</span><span style="color: #009900;">&#40;</span>parsedLogFlow, mrFlow, subFlow<span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
    cascade.<span style="color: #006633;">complete</span><span style="color: #009900;">&#40;</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
&nbsp;
    <span style="color: #666666; font-style: italic;">// to get rid of the intermediate files you could do this for each intermediate</span>
    <span style="color: #666666; font-style: italic;">// tap (needs org.apache.hadoop.fs.Path and org.apache.hadoop.fs.FileSystem):</span>
    <span style="color: #666666; font-style: italic;">// Path tmp = intermediateTap1.getPath();</span>
    <span style="color: #666666; font-style: italic;">// FileSystem fs = tmp.getFileSystem(conf);</span>
    <span style="color: #666666; font-style: italic;">// fs.delete(tmp, true);</span>
&nbsp;
    <span style="color: #000000; font-weight: bold;">return</span> <span style="color: #cc66cc;">0</span><span style="color: #339933;">;</span>
  <span style="color: #009900;">&#125;</span>
&nbsp;
  <span style="color: #000000; font-weight: bold;">public</span> <span style="color: #000000; font-weight: bold;">static</span> <span style="color: #000066; font-weight: bold;">void</span> main<span style="color: #009900;">&#40;</span><span style="color: #003399;">String</span><span style="color: #009900;">&#91;</span><span style="color: #009900;">&#93;</span> args<span style="color: #009900;">&#41;</span> <span style="color: #000000; font-weight: bold;">throws</span> <span style="color: #003399;">Exception</span> 
  <span style="color: #009900;">&#123;</span>
    <span style="color: #000066; font-weight: bold;">int</span> res <span style="color: #339933;">=</span> ToolRunner.<span style="color: #006633;">run</span><span style="color: #009900;">&#40;</span><span style="color: #000000; font-weight: bold;">new</span> Configuration<span style="color: #009900;">&#40;</span><span style="color: #009900;">&#41;</span>, <span style="color: #000000; font-weight: bold;">new</span> Main<span style="color: #009900;">&#40;</span><span style="color: #009900;">&#41;</span>, args<span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
    <span style="color: #003399;">System</span>.<span style="color: #006633;">exit</span><span style="color: #009900;">&#40;</span>res<span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
  <span style="color: #009900;">&#125;</span>
&nbsp;
  <span style="color: #009900;">&#125;</span></pre></div></div>

]]></content:encoded>
			<wfw:commentRss>http://eigenjoy.com/2009/11/18/how-to-use-cascading-with-hadoop-streaming/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Slides for &#8220;Introduction to Cascading&#8221; Presentation</title>
		<link>http://eigenjoy.com/2009/11/13/slides-for-introduction-to-cascading-presentation/</link>
		<comments>http://eigenjoy.com/2009/11/13/slides-for-introduction-to-cascading-presentation/#comments</comments>
		<pubDate>Sat, 14 Nov 2009 00:57:39 +0000</pubDate>
		<dc:creator>Nate Murray</dc:creator>
				<category><![CDATA[programming]]></category>

		<guid isPermaLink="false">http://www.xcombinator.com/?p=140</guid>
		<description><![CDATA[This week I gave an introductory presentation on Cascading. These are the slides from that presentation.
Intro To Cascading
View more documents from Nate Murray.

]]></description>
			<content:encoded><![CDATA[<p>This week I gave an introductory presentation on <a href="http://www.cascading.org">Cascading</a>. These are the slides from that presentation.</p>
<div style="width:425px;text-align:left" id="__ss_2487571"><a style="font:14px Helvetica,Arial,Sans-serif;display:block;margin:12px 0 3px 0;text-decoration:underline;" href="http://www.slideshare.net/jashmenn/intro-to-cascading" title="Intro To Cascading">Intro To Cascading</a><object style="margin:0px" width="425" height="355"><param name="movie" value="http://static.slidesharecdn.com/swf/ssplayer2.swf?doc=intro-to-cascading-091112163237-phpapp01&#038;rel=0&#038;stripped_title=intro-to-cascading" /><param name="allowFullScreen" value="true"/><param name="allowScriptAccess" value="always"/><embed src="http://static.slidesharecdn.com/swf/ssplayer2.swf?doc=intro-to-cascading-091112163237-phpapp01&#038;rel=0&#038;stripped_title=intro-to-cascading" type="application/x-shockwave-flash" allowscriptaccess="always" allowfullscreen="true" width="425" height="355"></embed></object></p>
<div style="font-size:11px;font-family:tahoma,arial;height:26px;padding-top:2px;">View more <a style="text-decoration:underline;" href="http://www.slideshare.net/">documents</a> from <a style="text-decoration:underline;" href="http://www.slideshare.net/jashmenn">Nate Murray</a>.</div>
</div>
]]></content:encoded>
			<wfw:commentRss>http://eigenjoy.com/2009/11/13/slides-for-introduction-to-cascading-presentation/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>testing erlang gen_server with gen_server_mock</title>
		<link>http://eigenjoy.com/2009/08/11/testing-erlang-gen_server-with-gen_server_mock/</link>
		<comments>http://eigenjoy.com/2009/08/11/testing-erlang-gen_server-with-gen_server_mock/#comments</comments>
		<pubDate>Tue, 11 Aug 2009 15:42:49 +0000</pubDate>
		<dc:creator>Nate Murray</dc:creator>
				<category><![CDATA[programming]]></category>

		<guid isPermaLink="false">http://www.xcombinator.com/2009/08/11/testing-erlang-gen_server-with-gen_server_mock/</guid>
		<description><![CDATA[Testing by synchronous pattern matching
Testing multi-process erlang gen_servers can be tricky. Typically one relies simply on pattern matching to verify that the response matches what you would expect.

&#123;expected, Response&#125; = gen_server:call&#40;Pid, hi&#41;.

As long as the gen_server call hi returns expected as the first element of the tuple, then the tests pass.
The technique is also the [...]]]></description>
			<content:encoded><![CDATA[<h2>Testing by synchronous pattern matching</h2>
<p>Testing multi-process erlang gen_servers can be tricky. Typically one relies simply on pattern matching to verify that the response matches what you would expect.</p>

<div class="wp_syntax"><div class="code"><pre class="erlang" style="font-family:monospace;"><span style="color: #109ab8;">&#123;</span>expected<span style="color: #6bb810;">,</span> <span style="color: #45b3e6;">Response</span><span style="color: #109ab8;">&#125;</span> <span style="color: #014ea4;">=</span> <span style="color: #ff4e18;">gen_server</span>:<span style="color: #ff3c00;">call</span><span style="color: #109ab8;">&#40;</span><span style="color: #45b3e6;">Pid</span><span style="color: #6bb810;">,</span> hi<span style="color: #109ab8;">&#41;</span><span style="color: #6bb810;">.</span></pre></div></div>

<p>As long as the <code>gen_server</code> call <code>hi</code> returns <code>expected</code> as the first element of the tuple, then the tests pass.</p>
<p>The technique is also the same when building client-server code where both client and server are <code>gen_server</code>s. The common case is to simply test one side at a time; test the response of all client calls and then (independently) test the responses of the server calls.</p>
<h2>What about asynchronous <code>cast</code>?</h2>
<p><code>gen_server:call</code> is convenient because it is synchronous and returns a value.<br />
<code>gen_server:cast</code>, on the other hand, is asynchronous and always returns the atom <code>ok</code>. This can make <code>cast</code>s difficult to test.</p>
<p><code>gen_server_mock</code> is a library to mock <code>gen_server</code> processes that expect specific, ordered sets of messages. It allows you to unit test <code>gen_server</code>s by verifying they are receiving the expected set of messages.</p>
<h2>Example 1</h2>

<div class="wp_syntax"><div class="code"><pre class="erlang" style="font-family:monospace;">     <span style="color: #109ab8;">&#123;</span>ok<span style="color: #6bb810;">,</span> <span style="color: #45b3e6;">Mock</span><span style="color: #109ab8;">&#125;</span> <span style="color: #014ea4;">=</span> gen_server_mock:<span style="color: #ff3c00;">new</span><span style="color: #109ab8;">&#40;</span><span style="color: #109ab8;">&#41;</span><span style="color: #6bb810;">,</span>
     gen_server_mock:<span style="color: #ff3c00;">expect</span><span style="color: #109ab8;">&#40;</span><span style="color: #45b3e6;">Mock</span><span style="color: #6bb810;">,</span> call<span style="color: #6bb810;">,</span> <span style="color: #ff3c00;">fun</span><span style="color: #109ab8;">&#40;</span><span style="color: #109ab8;">&#123;</span>foo<span style="color: #6bb810;">,</span> hi<span style="color: #109ab8;">&#125;</span><span style="color: #6bb810;">,</span> <span style="color: #45b3e6;">_From</span><span style="color: #6bb810;">,</span> <span style="color: #45b3e6;">_State</span><span style="color: #109ab8;">&#41;</span> <span style="color: #6bb810;">-&gt;</span> <span style="color: #006600;">ok</span> <span style="color: #186895;">end</span><span style="color: #109ab8;">&#41;</span><span style="color: #6bb810;">,</span>
     gen_server_mock:<span style="color: #ff3c00;">expect_call</span><span style="color: #109ab8;">&#40;</span><span style="color: #45b3e6;">Mock</span><span style="color: #6bb810;">,</span> <span style="color: #ff3c00;">fun</span><span style="color: #109ab8;">&#40;</span><span style="color: #109ab8;">&#123;</span>bar<span style="color: #6bb810;">,</span> bye<span style="color: #109ab8;">&#125;</span><span style="color: #6bb810;">,</span> <span style="color: #45b3e6;">_From</span><span style="color: #6bb810;">,</span> <span style="color: #45b3e6;">_State</span><span style="color: #109ab8;">&#41;</span> <span style="color: #6bb810;">-&gt;</span> <span style="color: #006600;">ok</span> <span style="color: #186895;">end</span><span style="color: #109ab8;">&#41;</span><span style="color: #6bb810;">,</span>
&nbsp;
     ok <span style="color: #014ea4;">=</span> <span style="color: #ff4e18;">gen_server</span>:<span style="color: #ff3c00;">call</span><span style="color: #109ab8;">&#40;</span><span style="color: #45b3e6;">Mock</span><span style="color: #6bb810;">,</span> <span style="color: #109ab8;">&#123;</span>foo<span style="color: #6bb810;">,</span> hi<span style="color: #109ab8;">&#125;</span><span style="color: #109ab8;">&#41;</span><span style="color: #6bb810;">,</span>  
     ok <span style="color: #014ea4;">=</span> <span style="color: #ff4e18;">gen_server</span>:<span style="color: #ff3c00;">call</span><span style="color: #109ab8;">&#40;</span><span style="color: #45b3e6;">Mock</span><span style="color: #6bb810;">,</span> <span style="color: #109ab8;">&#123;</span>bar<span style="color: #6bb810;">,</span> bye<span style="color: #109ab8;">&#125;</span><span style="color: #109ab8;">&#41;</span><span style="color: #6bb810;">,</span>  
&nbsp;
     ok <span style="color: #014ea4;">=</span> gen_server_mock:<span style="color: #ff3c00;">assert_expectations</span><span style="color: #109ab8;">&#40;</span><span style="color: #45b3e6;">Mock</span><span style="color: #109ab8;">&#41;</span></pre></div></div>

<p>This <code>Mock</code> expects two <code>call</code>s: <code>{foo, hi}</code> and <code>{bar, bye}</code>. Since <code>Mock</code> receives both of these messages, <code>assert_expectations</code> does not raise any errors.</p>
<h2>What is verified</h2>
<p><code>gen_server_mock:assert_expectations(Mock)</code> verifies that:</p>
<ol>
<li>all expected messages were received</li>
<li>no messages were received that were not expected</li>
</ol>
<p>If either check fails, <code>assert_expectations</code> exits. You can catch the <code>exit</code> by using the following:</p>

<div class="wp_syntax"><div class="code"><pre class="erlang" style="font-family:monospace;">     <span style="color: #45b3e6;">Result</span> <span style="color: #014ea4;">=</span> <span style="color: #186895;">try</span> gen_server_mock:<span style="color: #ff3c00;">assert_expectations</span><span style="color: #109ab8;">&#40;</span><span style="color: #45b3e6;">Mock</span><span style="color: #109ab8;">&#41;</span>
     <span style="color: #186895;">catch</span>
         exit:<span style="color: #006600;">Exception</span> <span style="color: #6bb810;">-&gt;</span> <span style="color: #45b3e6;">Exception</span>
     <span style="color: #186895;">end</span><span style="color: #6bb810;">,</span>
     <span style="color: #666666; font-style: italic;">% etc...</span></pre></div></div>

<h2>Example 2</h2>

<div class="wp_syntax"><div class="code"><pre class="erlang" style="font-family:monospace;">     <span style="color: #109ab8;">&#123;</span>ok<span style="color: #6bb810;">,</span> <span style="color: #45b3e6;">Mock</span><span style="color: #109ab8;">&#125;</span> <span style="color: #014ea4;">=</span> gen_server_mock:<span style="color: #ff3c00;">new</span><span style="color: #109ab8;">&#40;</span><span style="color: #109ab8;">&#41;</span><span style="color: #6bb810;">,</span>
&nbsp;
     gen_server_mock:<span style="color: #ff3c00;">expect_call</span><span style="color: #109ab8;">&#40;</span><span style="color: #45b3e6;">Mock</span><span style="color: #6bb810;">,</span> <span style="color: #ff3c00;">fun</span><span style="color: #109ab8;">&#40;</span>one<span style="color: #6bb810;">,</span>  <span style="color: #45b3e6;">_From</span><span style="color: #6bb810;">,</span> <span style="color: #45b3e6;">_State</span><span style="color: #109ab8;">&#41;</span>            <span style="color: #6bb810;">-&gt;</span> <span style="color: #006600;">ok</span> <span style="color: #186895;">end</span><span style="color: #109ab8;">&#41;</span><span style="color: #6bb810;">,</span>
     gen_server_mock:<span style="color: #ff3c00;">expect_call</span><span style="color: #109ab8;">&#40;</span><span style="color: #45b3e6;">Mock</span><span style="color: #6bb810;">,</span> <span style="color: #ff3c00;">fun</span><span style="color: #109ab8;">&#40;</span>two<span style="color: #6bb810;">,</span>  <span style="color: #45b3e6;">_From</span><span style="color: #6bb810;">,</span>  <span style="color: #45b3e6;">State</span><span style="color: #109ab8;">&#41;</span>            <span style="color: #6bb810;">-&gt;</span> <span style="color: #109ab8;">&#123;</span>ok<span style="color: #6bb810;">,</span> <span style="color: #45b3e6;">State</span><span style="color: #109ab8;">&#125;</span> <span style="color: #186895;">end</span><span style="color: #109ab8;">&#41;</span><span style="color: #6bb810;">,</span>
     gen_server_mock:<span style="color: #ff3c00;">expect_call</span><span style="color: #109ab8;">&#40;</span><span style="color: #45b3e6;">Mock</span><span style="color: #6bb810;">,</span> <span style="color: #ff3c00;">fun</span><span style="color: #109ab8;">&#40;</span>three<span style="color: #6bb810;">,</span> <span style="color: #45b3e6;">_From</span><span style="color: #6bb810;">,</span>  <span style="color: #45b3e6;">State</span><span style="color: #109ab8;">&#41;</span>           <span style="color: #6bb810;">-&gt;</span> <span style="color: #109ab8;">&#123;</span>ok<span style="color: #6bb810;">,</span> good<span style="color: #6bb810;">,</span> <span style="color: #45b3e6;">State</span><span style="color: #109ab8;">&#125;</span> <span style="color: #186895;">end</span><span style="color: #109ab8;">&#41;</span><span style="color: #6bb810;">,</span>
     gen_server_mock:<span style="color: #ff3c00;">expect_call</span><span style="color: #109ab8;">&#40;</span><span style="color: #45b3e6;">Mock</span><span style="color: #6bb810;">,</span> <span style="color: #ff3c00;">fun</span><span style="color: #109ab8;">&#40;</span><span style="color: #109ab8;">&#123;</span>echo<span style="color: #6bb810;">,</span> <span style="color: #45b3e6;">Response</span><span style="color: #109ab8;">&#125;</span><span style="color: #6bb810;">,</span> <span style="color: #45b3e6;">_From</span><span style="color: #6bb810;">,</span> <span style="color: #45b3e6;">State</span><span style="color: #109ab8;">&#41;</span> <span style="color: #6bb810;">-&gt;</span> <span style="color: #109ab8;">&#123;</span>ok<span style="color: #6bb810;">,</span> <span style="color: #45b3e6;">Response</span><span style="color: #6bb810;">,</span> <span style="color: #45b3e6;">State</span><span style="color: #109ab8;">&#125;</span> <span style="color: #186895;">end</span><span style="color: #109ab8;">&#41;</span><span style="color: #6bb810;">,</span>
     gen_server_mock:<span style="color: #ff3c00;">expect_cast</span><span style="color: #109ab8;">&#40;</span><span style="color: #45b3e6;">Mock</span><span style="color: #6bb810;">,</span> <span style="color: #ff3c00;">fun</span><span style="color: #109ab8;">&#40;</span>fish<span style="color: #6bb810;">,</span> <span style="color: #45b3e6;">State</span><span style="color: #109ab8;">&#41;</span> <span style="color: #6bb810;">-&gt;</span> <span style="color: #109ab8;">&#123;</span>ok<span style="color: #6bb810;">,</span> <span style="color: #45b3e6;">State</span><span style="color: #109ab8;">&#125;</span> <span style="color: #186895;">end</span><span style="color: #109ab8;">&#41;</span><span style="color: #6bb810;">,</span>
     gen_server_mock:<span style="color: #ff3c00;">expect_info</span><span style="color: #109ab8;">&#40;</span><span style="color: #45b3e6;">Mock</span><span style="color: #6bb810;">,</span> <span style="color: #ff3c00;">fun</span><span style="color: #109ab8;">&#40;</span>cat<span style="color: #6bb810;">,</span>  <span style="color: #45b3e6;">State</span><span style="color: #109ab8;">&#41;</span> <span style="color: #6bb810;">-&gt;</span> <span style="color: #109ab8;">&#123;</span>ok<span style="color: #6bb810;">,</span> <span style="color: #45b3e6;">State</span><span style="color: #109ab8;">&#125;</span> <span style="color: #186895;">end</span><span style="color: #109ab8;">&#41;</span><span style="color: #6bb810;">,</span>
&nbsp;
     ok <span style="color: #014ea4;">=</span> <span style="color: #ff4e18;">gen_server</span>:<span style="color: #ff3c00;">call</span><span style="color: #109ab8;">&#40;</span><span style="color: #45b3e6;">Mock</span><span style="color: #6bb810;">,</span> one<span style="color: #109ab8;">&#41;</span><span style="color: #6bb810;">,</span>
     ok <span style="color: #014ea4;">=</span> <span style="color: #ff4e18;">gen_server</span>:<span style="color: #ff3c00;">call</span><span style="color: #109ab8;">&#40;</span><span style="color: #45b3e6;">Mock</span><span style="color: #6bb810;">,</span> two<span style="color: #109ab8;">&#41;</span><span style="color: #6bb810;">,</span>
     good <span style="color: #014ea4;">=</span> <span style="color: #ff4e18;">gen_server</span>:<span style="color: #ff3c00;">call</span><span style="color: #109ab8;">&#40;</span><span style="color: #45b3e6;">Mock</span><span style="color: #6bb810;">,</span> three<span style="color: #109ab8;">&#41;</span><span style="color: #6bb810;">,</span>
     tree <span style="color: #014ea4;">=</span> <span style="color: #ff4e18;">gen_server</span>:<span style="color: #ff3c00;">call</span><span style="color: #109ab8;">&#40;</span><span style="color: #45b3e6;">Mock</span><span style="color: #6bb810;">,</span> <span style="color: #109ab8;">&#123;</span>echo<span style="color: #6bb810;">,</span> tree<span style="color: #109ab8;">&#125;</span><span style="color: #109ab8;">&#41;</span><span style="color: #6bb810;">,</span>
     ok <span style="color: #014ea4;">=</span> <span style="color: #ff4e18;">gen_server</span>:<span style="color: #ff3c00;">cast</span><span style="color: #109ab8;">&#40;</span><span style="color: #45b3e6;">Mock</span><span style="color: #6bb810;">,</span> fish<span style="color: #109ab8;">&#41;</span><span style="color: #6bb810;">,</span>
     <span style="color: #45b3e6;">Mock</span> <span style="color: #014ea4;">!</span> cat<span style="color: #6bb810;">,</span>
&nbsp;
     gen_server_mock:<span style="color: #ff3c00;">assert_expectations</span><span style="color: #109ab8;">&#40;</span><span style="color: #45b3e6;">Mock</span><span style="color: #109ab8;">&#41;</span></pre></div></div>

<p>Currently three types of messages are supported: <code>call</code>, <code>cast</code>, and <code>info</code>.</p>
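<p>The signature of the <code>fun</code> of each expectation is the same as the corresponding <code>gen_server</code> callback, <code>handle_*</code>. So the <code>fun</code> for <code>expect_call</code> has the same signature as <code>handle_call</code>: <code>fun(Request, From, State)</code>, while the <code>fun</code>s for <code>expect_cast</code> and <code>expect_info</code> take two arguments, as in <code>fun(fish, State)</code> and <code>fun(cat, State)</code> above. See <code>man gen_server</code> for more information.</p>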
<p>However, the return value of the <code>fun</code> <em>must</em> be one of:</p>

<div class="wp_syntax"><div class="code"><pre class="erlang" style="font-family:monospace;">    ok |
    <span style="color: #109ab8;">&#123;</span>ok<span style="color: #6bb810;">,</span> <span style="color: #45b3e6;">NewState</span><span style="color: #109ab8;">&#125;</span> |
    <span style="color: #109ab8;">&#123;</span>ok<span style="color: #6bb810;">,</span> <span style="color: #45b3e6;">ResponseValue</span><span style="color: #6bb810;">,</span> <span style="color: #45b3e6;">NewState</span><span style="color: #109ab8;">&#125;</span></pre></div></div>

<p>Anything else will be an error. Note that you can change the state of your <code>Mock</code> by returning <code>NewState</code>.</p>
<p>Arbitrary, non-<code>gen_server</code> messages are handled with <code>expect_info</code>, e.g. <code>Mock ! cat</code> fulfills the <code>expect_info</code> in the example above.</p>
<h2>References</h2>
<ul>
<li><a href="http://github.com/jashmenn/gen_server_mock">Github Repo</a> (Patches readily accepted)</li>
<li>Work inspired by <a href="http://erlang.org/pipermail/erlang-questions/2008-April/034140.html">this post</a></li>
<li><a href="http://martinfowler.com/articles/mocksArentStubs.html">Mocks Aren&#8217;t Stubs</a></li>
</ul>
]]></content:encoded>
			<wfw:commentRss>http://eigenjoy.com/2009/08/11/testing-erlang-gen_server-with-gen_server_mock/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>temporarily undo commit(s) on a remote server</title>
		<link>http://eigenjoy.com/2009/05/20/temporarily-undo-commits-on-a-remote-server/</link>
		<comments>http://eigenjoy.com/2009/05/20/temporarily-undo-commits-on-a-remote-server/#comments</comments>
		<pubDate>Wed, 20 May 2009 23:51:20 +0000</pubDate>
		<dc:creator>brian</dc:creator>
				<category><![CDATA[programming]]></category>

		<guid isPermaLink="false">http://www.xcombinator.com/2009/05/20/temporarily-undo-commits-on-a-remote-server/</guid>
		<description><![CDATA[I do not claim to be a git wizard, and I cannot guarantee that what I describe here is correct, but it seemed to work for me, and I would appreciate any comments.
My goal was to temporarily revert one or many commits that I had pushed to the remote server.
http://cheat.errtheblog.com/s/git &#8212; specifically the &#8220;Fix mistakes / [...]]]></description>
			<content:encoded><![CDATA[<p>I do not claim to be a git wizard, and I cannot guarantee that what I describe here is correct, but it seemed to work for me, and I would appreciate any comments.</p>
<p>My goal was to temporarily revert one or many commits that I had pushed to the remote server.</p>
<p><a href="http://cheat.errtheblog.com/s/git">http://cheat.errtheblog.com/s/git</a> &#8212; specifically the &#8220;<span highlight="Search">Fix</span> mistakes / Undo&#8221; section was helpful.</p>
<p>What I found:</p>

<div class="wp_syntax"><div class="code"><pre class="bash" style="font-family:monospace;">git revert <span style="color: #660033;">-n</span> &lt;sha&gt;</pre></div></div>

<p>Run this for each commit you would like to &#8220;undo&#8221;.</p>
<p>(The -n means that no commit is created; instead, the reverse of the changes made by your &lt;sha&gt; commit is staged in your index. git status will show you this.)</p>

<div class="wp_syntax"><div class="code"><pre class="bash" style="font-family:monospace;">git commit <span style="color: #660033;">-a</span> <span style="color: #666666; font-style: italic;"># commit the staged reverts (I have "git ci" aliased to "git commit")</span>
git push <span style="color: #666666; font-style: italic;"># to master on origin (these steps effectively create one commit that is the product of reversing all the commits you picked in the git revert -n step)</span></pre></div></div>

<p>Now your history (as shown by git log) looks like:</p>
<ul>
<li>&lt;sha1&gt;&#8230; revert of &lt;commit&#8230;s&gt;</li>
<li>&lt;sha2&gt;&#8230; commit4</li>
<li>&lt;sha3&gt;&#8230; commit3</li>
</ul>
<p>Now, let&#8217;s say the time has come to reapply your commits. Because you didn&#8217;t just do</p>

<div class="wp_syntax"><div class="code"><pre class="bash" style="font-family:monospace;">git reset <span style="color: #660033;">--hard</span> &lt;commit3&gt;</pre></div></div>

<p>or something like that, your original commits are still in the history. All you have to do is git revert &lt;sha1&gt;, which creates a new commit that will &#8220;undo your undo&#8221;</p>
<p>then</p>

<div class="wp_syntax"><div class="code"><pre class="bash" style="font-family:monospace;">git push <span style="color: #666666; font-style: italic;">#origin master</span></pre></div></div>

<p> again and you are back to where you were.</p>
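<p>The whole round trip can be tried safely in a throwaway repository. The following is only a sketch under assumptions: the temporary directory, the file name file.txt, the commit messages, and the demo identity are all made up for illustration, and there is no remote here, so the two git push steps are left out. It reapplies the commits by reverting the revert commit, so history only ever grows and a plain push would succeed without forcing.</p>

```shell
#!/bin/sh
set -eu

# Work in a throwaway repository so nothing real is touched (hypothetical setup).
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.email "demo@example.com"
git config user.name "Demo"

# Build a small history; commit3 and commit4 are the ones to undo temporarily.
echo base > file.txt
git add file.txt
git commit -q -m "commit2"
echo three >> file.txt
git commit -q -am "commit3"
echo four >> file.txt
git commit -q -am "commit4"

# Stage the reverse of each commit, newest first, without committing yet.
git revert -n HEAD
git revert -n HEAD~1

# One commit that carries the combined undo.
git commit -q -m "temporarily undo commit3 and commit4"
undo_sha=$(git rev-parse HEAD)
content_after_undo=$(cat file.txt)

# Later: undo the undo by reverting the revert commit itself.
git revert --no-edit "$undo_sha"
content_after_redo=$(cat file.txt)
```

<p>After the first commit the file holds only its original line; after the final revert the lines added by commit3 and commit4 are back, and git log shows every step.</p>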
]]></content:encoded>
			<wfw:commentRss>http://eigenjoy.com/2009/05/20/temporarily-undo-commits-on-a-remote-server/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>runaway process&#8230; on a mac</title>
		<link>http://eigenjoy.com/2009/04/18/runaway-process-on-a-mac/</link>
		<comments>http://eigenjoy.com/2009/04/18/runaway-process-on-a-mac/#comments</comments>
		<pubDate>Sat, 18 Apr 2009 23:58:59 +0000</pubDate>
		<dc:creator>brian</dc:creator>
				<category><![CDATA[programming]]></category>

		<guid isPermaLink="false">http://www.xcombinator.com/2009/04/18/runaway-process-on-a-mac/</guid>
		<description><![CDATA[Wrestling with the command line, and some cool CLI programs I learned on the way
I want desperately to know how to detach a process from one terminal, and re-attach it to another, without using screen from the get-go. The more I read about it, the more I figure that the answer (you can&#8217;t) probably has [...]]]></description>
			<content:encoded><![CDATA[<p>Wrestling with the command line, and some cool CLI programs I learned along the way</p>
<p>I desperately want to know how to detach a process from one terminal and re-attach it to another, without using screen from the get-go. The more I read about it, the more I figure that the answer (you can&#8217;t) probably has more to do with my lack of understanding of how processes and terminals work. I read a great post <a href="http://www.xaprb.com/blog/2008/08/01/how-to-leave-a-program-running-after-you-log-out/">here</a> that introduced me to <tt>disown -h</tt> (careful), a bash builtin, and <tt>nohup</tt>, a standalone utility. I thought, OK, let&#8217;s try it. This is where I got stuck.</p>
<p>I tried</p>

<div class="wp_syntax"><div class="code"><pre class="bash" style="font-family:monospace;"><span style="color: #000000; font-weight: bold;">while</span> <span style="color: #c20cb9; font-weight: bold;">true</span>; <span style="color: #000000; font-weight: bold;">do</span> <span style="color: #c20cb9; font-weight: bold;">sleep</span> <span style="color: #000000;">10</span>; <span style="color: #000000; font-weight: bold;">done</span> <span style="color: #000000; font-weight: bold;">&amp;</span>
<span style="color: #7a0874; font-weight: bold;">disown</span> <span style="color: #660033;">-h</span> <span style="color: #000000; font-weight: bold;">%</span>1
<span style="color: #7a0874; font-weight: bold;">exit</span></pre></div></div>

<p>The <tt>disown</tt> builtin handles a problem with background processes. From the bash man page (also quoted in the post linked above):</p>

<div class="wp_syntax"><div class="code"><pre class="bash" style="font-family:monospace;">The shell exits by default upon receipt of a SIGHUP</pre></div></div>

<p>and, when it is interactive, it resends that SIGHUP to its jobs, therefore killing the process you &#8216;thought&#8217; you had put in the background before you logged out to go home for the night. On its own, <tt>disown</tt> removes the job from the shell&#8217;s jobs table so that it never gets that SIGHUP; with the <tt>-h</tt> switch, the job stays in the jobs table but is marked so that the shell does not send it a SIGHUP. I thought, cool, maybe if it stays in the jobs table, I could also transfer it to the jobs table of another tty. (No, you can&#8217;t&#8230;)<br />
But now I had a process on my hands that wasn&#8217;t attached to a terminal, and would just run forever unless I rebooted.</p>
<p>The while loop itself didn&#8217;t show up in <tt>ps</tt> under a name of its own, which is interesting (it runs inside a backgrounded bash subshell), and because the loop starts a fresh <tt>sleep</tt> on every pass, the sleep command&#8217;s PID kept changing, so a normal <tt>ps aux | grep slee[p] | awk '{print $2}' | xargs kill -9</tt> wasn&#8217;t working. (This post is losing focus fast, but the <tt>slee[p]</tt> in the above command is a cool trick I learned so that I didn&#8217;t need a <tt>grep -v grep</tt> in there.)</p>
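<p>Here is a small sketch of why the <tt>slee[p]</tt> trick works. The two-line <tt>ps</tt> listings are made up for illustration (the user name and PIDs are placeholders), so the result doesn&#8217;t depend on what is actually running on your machine:</p>

```shell
# Made-up `ps` output: a real sleep process, plus the grep that is looking for it.
# With a plain pattern, grep's own command line contains the word "sleep".
ps_plain='brian 4242 sleep 10
brian 4243 grep sleep'

# With the bracket pattern, grep's own command line contains "slee[p]" instead.
ps_bracket='brian 4242 sleep 10
brian 4250 grep slee[p]'

# The plain pattern matches both lines, including grep itself.
plain_count=$(printf '%s\n' "$ps_plain" | grep -c 'sleep')

# The regex slee[p] still matches the text "sleep", but it does not match
# the literal text "slee[p]", so grep no longer finds its own entry.
bracket_count=$(printf '%s\n' "$ps_bracket" | grep -c 'slee[p]')

echo "plain: $plain_count, bracket: $bracket_count"
```

<p>The bracket expression is a one-character class, so the pattern means exactly the same thing as <tt>sleep</tt>; only the way it is spelled on grep&#8217;s command line changes.</p>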
<p>I *did* find that I could use <tt>ps</tt> to figure out the ppid (parent process ID) and just kill -9 that, but I was also interested in knowing for sure that it wasn&#8217;t in charge of doing something else important. A little digging around, and I came across the UNIX utility <tt>pstree</tt> which of course didn&#8217;t come on my mac, but I quickly figured out that it could be installed with</p>

<div class="wp_syntax"><div class="code"><pre class="bash" style="font-family:monospace;"><span style="color: #c20cb9; font-weight: bold;">sudo</span> port <span style="color: #c20cb9; font-weight: bold;">install</span> <span style="color: #c20cb9; font-weight: bold;">pstree</span></pre></div></div>

<p>Yesterday, I had done a similar thing with the UNIX command <tt>watch</tt>, which also installed nicely using <tt>port</tt>.<br />
And, for those who don&#8217;t know, the UNIX command <tt>watch</tt> is a great polling utility that will display the first screen&#8217;s worth of output of any command and update it on a regular basis (every two seconds by default).</p>
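<p>I used <tt>ps | grep</tt> to find the ppid of the sleep process, then ran this command, with the ppid I had found in place of <tt>$PPID</tt> (typed literally, bash expands <tt>$PPID</tt> to the parent of the current shell, not of <tt>sleep</tt>):</p>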

<div class="wp_syntax"><div class="code"><pre class="bash" style="font-family:monospace;">watch <span style="color: #ff0000;">&quot;pstree <span style="color: #007800;">$PPID</span>&quot;</span></pre></div></div>

<p>This was way cool: every ten seconds, I watched the PID of <tt>sleep</tt> (the child process of the bash process I had just found) change.</p>
<p>Takeaway:</p>
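<p>For the ppid lookup itself, here is a minimal sketch, assuming a <tt>ps</tt> that accepts the POSIX <tt>-o</tt> and <tt>-p</tt> options. The helper name <tt>parent_of</tt> is made up for this example and is not what I typed at the time:</p>

```shell
# parent_of PID: print the parent process ID of PID.
# "ppid=" asks ps for the PPID column with an empty header, so only the number is printed;
# tr strips the padding spaces that ps puts around it.
parent_of() {
    ps -o ppid= -p "$1" | tr -d ' '
}

# Try it on the current shell: $$ is this shell's own PID.
shell_parent=$(parent_of "$$")
echo "this shell is $$, its parent is $shell_parent"
```

<p>To chase down a runaway loop, you would pass the PID of the current <tt>sleep</tt> instead of <tt>$$</tt>, and hand the number that comes back to <tt>pstree</tt>.</p>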

<div class="wp_syntax"><div class="code"><pre class="bash" style="font-family:monospace;"><span style="color: #c20cb9; font-weight: bold;">sudo</span> port <span style="color: #c20cb9; font-weight: bold;">install</span> watch
<span style="color: #c20cb9; font-weight: bold;">sudo</span> port <span style="color: #c20cb9; font-weight: bold;">install</span> <span style="color: #c20cb9; font-weight: bold;">pstree</span></pre></div></div>

]]></content:encoded>
			<wfw:commentRss>http://eigenjoy.com/2009/04/18/runaway-process-on-a-mac/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
	</channel>
</rss>

