Michael Ellsberg’s (Mixergy) Advice on Copywriting

Today we’re going to take a detour from the regular topics at eigenjoy and talk about something a lot of hackers need help with: copywriting.

If you’re reading this blog, you probably need help with copywriting.

A common trend I’ve seen with hackers is that while we’re strong on technical prowess, we’re short on communicating why our work is important. We can whip up a beautiful Rails app in a few hours and craft elegant functional combinators, but we struggle to explain why our product is worth someone’s attention (and money).

You may be thinking: wait a minute, copywriting is sleazy and I don’t want to stoop to that. I would argue that while some direct response marketing is scummy, copywriting, in the broader sense, is a lot more than cheap tricks. Being able to craft empathetic and direct copy can mean the difference between failure and success with your product.

Enter Michael Ellsberg.

I first took note of Michael when I read his excellent Forbes article on building a social economy through helping others. (He also expounds on this idea in his talk to the Thiel Fellows. Highly recommended.)


Michael Ellsberg at the Thiel Fellowship Retreat from Michael Ellsberg on Vimeo.

Michael was recently interviewed by Andrew Warner on Mixergy and they spent a bit of time talking about how to become a better copywriter.

Michael’s advice was this: create a new email address and sign up for the email lists of master copywriters. That way you can read their emails and learn their techniques for free. In the interview, Michael also mentioned a number of classic books in the field of copywriting.

I’ve collected links to each of these resources and I’ve listed them for you below.*

Direct Response Copywriters

Please note! These folks range from the soft sell to the extreme hard sell (e.g. Dan Kennedy). I’m not necessarily endorsing the hard-edged techniques; rather, I want to make you aware of the full range of tools you have in your toolbox.

Copywriting Books

Hope this helps!

P.S.
Since we’re on the topic of copywriting, I’d suggest you check out Marc-André Cournoyer’s Copy Writing for Geeks. It’s a great intro for hackers who want to quickly improve their writing skills.

* Note: None of these are affiliate links; I have no financial incentive with any of these folks.

Posted in marketing | 1 Comment

Understanding run n with conde in the Reasoned Schemer

I’m working through The Reasoned Schemer and had some trouble seeing clearly how results are returned when using run n. After Googling a bit, I found a clear walkthrough of the logic on an obscure message board. I’m reposting it here to make sure it doesn’t disappear.

The key insight is that the n in run n refers not to the number of goals that should be tried, but rather to the number of answers that should be returned, if available. Full text below:

Posted: Mon Jan 14, 2008 2:19 pm

Hi all,

I’m currently re-reading the Reasoned Schemer, but am having a bit of
difficulty understanding how values are computed for recursive
functions. For example, in chapter 2 the recursive function lolo is
defined, and section 24 asks for the value of the following:

(run 5 (x)
  (lolo `((a b) (c d) . ,x)))

The first value, (), is obvious to me. The variable x is fresh, so it
is associated with () via the nullo check on line 4 of the method.

Here’s my explanation for how the second value is derived. Since we
are asking for another value, and the expressions are evaluated within
the conde, we refresh x and try the next line. At this point, the caro
associates the fresh variable x with the cons of the fresh variable a
and the fresh variable d-prime (introduced by the caro). Then, listo
is called on a. Since a is fresh, this call associates a with ().
Since this question succeeds, we try the answer. The cdro associates x
with the cons of fresh variable a-prime (introduced by the cdro) and
the fresh variable d. Then, lolo is called on d, and so d (being
fresh), is associated with ().

Since a-prime and a co-share, and a is (), and d and d-prime co-share,
and d is (), x can be successfully associated with the cons of () and
(), so the result is (()).

Here’s where I get confused. This invocation of run 2 has actually
produced #s 3 times, once to produce () and twice to produce (())
(once in each of the calls to listo and the recursive call to lolo).
However, we have *asked* for only two goals to be met, in order to
produce two values. If I invoke “run 3 …,” however, which conde
lines in which recursive frames are being evaluated, and having their
associations preserved? Or rather, how is the result expanding beyond
(())?

At the high level, I understand why the answer is (() (()) (() ()) (()
() ()) (() () () ())). But I’m still missing something at the lower
level that makes it harder for me to understand how some of the more
advanced examples, like the adder, work.

Thanks,
Joe
=====================

Posted: Tue Jan 15, 2008 3:30 pm

(run 2 (q) g1 g2 g3) does *not* specify that at most two goal
invocations should succeed, or that at most two #s goals should be
tried. Rather, run 2 specifies that we want two answers (if there are
two answers to be had), regardless of how many goals must be tried,
must succeed, or must fail in order to get those answers.

Hope this helps.

–Will
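
To make Will’s point concrete for myself, here is a rough analogy in Python rather than Scheme. The generator and function names are mine, and this is only the shape of the idea, not miniKanren’s actual search semantics: think of a goal as a lazy stream of answers, and run n as taking at most n items from that stream, no matter how many branches are tried or fail along the way.

from itertools import islice

def answers():
    # A lazy stream of "answers". Each yield may be preceded by many
    # attempts (candidates tried and rejected) that the caller never sees.
    for candidate in range(100):
        if candidate % 7 == 0:   # most candidates "fail"
            yield candidate      # only some become answers

def run_n(n, stream):
    # Like (run n (q) ...): take up to n answers, regardless of how much
    # work the stream does internally to produce them.
    return list(islice(stream, n))

print(run_n(2, answers()))   # [0, 7]: two answers, eight candidates examined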

Posted in programming | Leave a comment

Challenges in Large-Scale Web Crawling

Simple web crawling is easy, but when you start crawling several hundred million pages you run into a number of difficult challenges.

Last Friday I gave a talk at Berkeley on how to overcome some of the challenges of large-scale web crawling. Below are the slides from that talk.

Introduction to Web Crawling & Extraction
by Nate Murray

(Speaker note: ask for a show of hands. How many have written code to download data automatically? Downloaded megabytes, gigabytes, terabytes?)

Who am I? Nate Murray: AT&T Interactive (Yellowpages.com), TB-scale data since 2009, various crawlers since 2005.

What is web crawling?

Definition: a web crawler is a program that browses the web. The crawler downloads pages and puts them in storage somewhere.

Definition: web extraction is transforming unstructured web data into structured data. The reality is that if your data is truly unstructured you’re quickly talking about NLP, so in practice web extraction means transforming semistructured web data into structured data.

Motivation: a few projects that needed crawling and extraction.

- Find internet friends via delicious bookmarks ("bookmark buddies": a table of URL, Title, Users).
- Extract hours of operation for local businesses.
- Build a video player service that recommends videos based on co-occurring commenters.
- Build a power tools vertical search engine (affiliate fees), extracting Image, SKU, Name, Price, and Rating for each product.

Example of extracted business hours:

Day  Hours
Mon  Closed
Tue  11:30-14:30, 17:30-22:00
Wed  11:30-14:30, 17:30-22:00
Thu  11:30-14:30, 17:30-22:00
Fri  11:30-14:30, 17:30-22:00
Sat  12:00-14:30, 17:00-22:00
Sun  17:00-21:00

Desired properties

- Speed: really, you just want to download as many pages as you can, as fast as possible.
- Politeness: it’s easy to overburden small servers with any significant crawl.
- Distributed, with linear scalability: n machines should give you roughly n*m pages per second.
- Even partitioning: every machine should perform equal work.
- Minimum overlap: crawl each page exactly once.

Constraints and basic algorithm

Initialize:
    UrlsDone    = null
    UrlFrontier = {'google.com/index.html', ..}
Repeat:
    url  = UrlFrontier.getNext()
    ip   = DNSlookup(url.getHostname())
    html = DownloadPage(ip, url.getPath())
    UrlsDone.insert(url)
    newUrls = parseForLinks(html)
    For each newUrl:
        If not UrlsDone.contains(newUrl)
            then UrlsTodo.insert(newUrl)

The crawler downloads pages and puts them in storage somewhere.

Architecture overview (diagram): a crawl planner feeds a URL queue; fetchers take URLs from the queue, download web data from the internet, and write it to storage.

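For reference, here is a minimal single-threaded sketch of that loop in Python. This is my own illustration, not code from the talk: the names (crawl, store, LinkParser) are invented, the frontier is just an in-memory list, and it ignores DNS handling, politeness, robots.txt, and error handling, which is what the rest of the talk is about.

import urllib.request
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse

class LinkParser(HTMLParser):
    # Collect href values from <a> tags (parseForLinks in the pseudocode).
    def __init__(self):
        super().__init__()
        self.links = []
    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def store(url, html):
    pass   # put the page in storage somewhere (disk, HDFS, S3, ...)

def crawl(seed_urls, max_pages=100):
    urls_done = set()              # UrlsDone
    frontier = list(seed_urls)     # UrlFrontier / UrlsTodo
    while frontier and len(urls_done) < max_pages:
        url = frontier.pop(0)      # UrlFrontier.getNext()
        if url in urls_done:
            continue
        try:
            # The DNS lookup happens implicitly inside urlopen.
            html = urllib.request.urlopen(url, timeout=10).read().decode("utf-8", "replace")
        except Exception:
            continue
        urls_done.add(url)         # UrlsDone.insert(url)
        store(url, html)
        parser = LinkParser()
        parser.feed(html)
        for link in parser.links:
            new_url = urljoin(url, link)
            if urlparse(new_url).scheme in ("http", "https") and new_url not in urls_done:
                frontier.append(new_url)

# crawl(["http://www.ischool.berkeley.edu/"], max_pages=5)
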
Challenges

How hard this is depends on your ambitions. As the old saying goes, "everything is fast for small n." For a sense of scale, Google’s index size over time:

- 1998: 26 million
- 2005: 8 billion
- 2008: 1 trillion

http://www.nytimes.com/2005/08/15/technology/15search.html
http://googleblog.blogspot.com/2008/07/we-knew-web-was-big.html

Small crawls are easy, and not that interesting. I hate to draw too hard a line, but to give you an intuition about the numbers: I’d say a small crawl is roughly less than 10 million pages. Large crawls are where it gets interesting.

The main challenges: DNS lookup, URLs crawled, politeness, the URL frontier, queueing URLs, and extracting URLs.

Challenge: DNS lookup

The DNSlookup step of the basic algorithm can easily become a bottleneck. Consider running your own DNS servers (djbdns, PowerDNS, etc.), and be aware of software limitations: gethostbyaddr is synchronized, and the same is true of many "default" DNS clients. You’ll know when you need it.

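As one illustration of working around blocking resolver calls, here is a small Python sketch that runs lookups in a thread pool. The batching interface is mine; a large crawler would more likely talk to its own caching DNS servers as suggested above.

import socket
from concurrent.futures import ThreadPoolExecutor

def resolve_all(hostnames, workers=50):
    # Resolve many hostnames concurrently instead of making one blocking
    # gethostbyname call at a time.
    def resolve(host):
        try:
            return host, socket.gethostbyname(host)
        except socket.gaierror:
            return host, None
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return dict(pool.map(resolve, hostnames))

# resolve_all(["google.com", "archive.org", "no-such-host.invalid"])
# -> {"google.com": "<some ip>", "archive.org": "<some ip>", "no-such-host.invalid": None}
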
Challenge: URLs crawled

The crawler has to remember every URL it has already fetched (the UrlsDone set in the basic algorithm).

Napkin calculation for one machine, storing URLs in memory:

- ~50 bytes per URL (e.g. http://wiki.apache.org/cassandra/ArticlesAndPresentations)
- +8 bytes for time-last-crawled as a long (e.g. System.currentTimeMillis() -> 1314392455712)
- x 100 million URLs =~ 5.4 gigabytes

Doable if you have a lot of memory, but turn that 100 million into 1 billion and you’re really in trouble. Can we do better?

Bloom filters. A Bloom filter answers the question: is this item in the set? Ask "have we crawled http://www.xcombinator.com?" and it answers either "yes, probably" or "definitely not."

Napkin calculation for one machine with a Bloom filter:

- 100 million URLs
- 1 in 100 million chance of a false positive
- =~ 457 megabytes (see http://hur.st/bloomfilter?n=100000000&p=1.0E-8)

That’s roughly 1/10th the memory, with an acceptable error rate.

Bloom filter drawbacks, and solutions:

- Probabilistic, with occasional errors: estimate the number of items ahead of time (not hard; see Dynamic Bloom Filters).
- Can’t delete: pick a time granularity (e.g. days) and cascade the filters.

References:
http://en.wikipedia.org/wiki/Bloom_filter
http://spyced.blogspot.com/2009/01/all-you-ever-wanted-to-know-about.html
http://www.igvita.com/2010/01/06/flow-analysis-time-based-bloom-filters/

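For intuition about how a Bloom filter answers "yes, probably" or "definitely not," here is a toy implementation in Python. The class and parameter names are mine; the sizing formula and double-hashing scheme are the standard ones, and a real crawler would use a tuned library rather than this sketch.

import hashlib
import math

class BloomFilter:
    def __init__(self, expected_items, false_positive_rate):
        # Standard sizing: m = -n*ln(p)/(ln 2)^2 bits, k = (m/n)*ln 2 hashes.
        n, p = expected_items, false_positive_rate
        self.m = int(-n * math.log(p) / (math.log(2) ** 2))
        self.k = max(1, int(round((self.m / n) * math.log(2))))
        self.bits = bytearray((self.m + 7) // 8)

    def _positions(self, item):
        # Double hashing: derive k bit positions from one SHA-1 digest.
        digest = hashlib.sha1(item.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        return [(h1 + i * h2) % self.m for i in range(self.k)]

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))

seen = BloomFilter(expected_items=1_000_000, false_positive_rate=1e-6)
seen.add("http://www.xcombinator.com")
print("http://www.xcombinator.com" in seen)   # True ("yes, probably")
print("http://example.com/other" in seen)     # almost certainly False ("definitely not")
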
Challenge: politeness

Obey robots.txt, and as a rule of thumb wait 2 seconds between requests to the same IP. That won’t always work, especially for large sites.

In a distributed crawl you also have to enforce politeness across machines. Centralizing that decision (e.g. in ZooKeeper) brings single-point-of-failure and contention worries. Options:

- a central database
- distributed locks (Paxos / Sigma / ZooKeeper)
- controlled URL distribution

http://en.wikipedia.org/wiki/Paxos_(computer_science)
http://zookeeper.apache.org/

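On a single machine, the 2-second rule of thumb can be as simple as the sketch below (per host rather than per IP here, and the class name is mine). The options above are about enforcing something like this consistently across many crawl machines.

import time
from urllib.parse import urlparse

class PolitenessGate:
    def __init__(self, delay_seconds=2.0):
        self.delay = delay_seconds
        self.last_fetch = {}   # host -> timestamp of the last request

    def wait_for(self, url):
        # Sleep just long enough to keep at least `delay` seconds between
        # requests to the same host.
        host = urlparse(url).netloc
        elapsed = time.time() - self.last_fetch.get(host, 0.0)
        if elapsed < self.delay:
            time.sleep(self.delay - elapsed)
        self.last_fetch[host] = time.time()

gate = PolitenessGate()
for url in ["http://example.com/a", "http://example.com/b", "http://example.org/"]:
    gate.wait_for(url)   # sleeps ~2s before the second example.com request
    # fetch(url) ...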

Challenge: the URL frontier

Idea: consistently distribute URLs to crawl machines based on IP, using a hash modulo the number of machines:

IP                SHA-1        bucket (mod 5)
174.132.225.106   4dd14b0b…    2
74.125.224.115    cf4b7594…    1
157.166.255.19    0ac4d141…    4
69.22.138.129     6c1584fa…    4
98.139.50.166     327252c5…    3

Benefits: the same IP always goes to the same machine, and it’s simple. Drawbacks: it’s susceptible to skew, and you can’t add or remove nodes without pain, because rehashing the whole keyspace sends every URL to a new machine.

Consistent hashing addresses that (source: http://michaelnielsen.org/blog/consistent-hashing/). Benefits: only about 1/(n+1) of the URLs move when you add or remove a node, virtual nodes help with skew, and it’s robust, with no single point of failure. Drawback: the naive solution still won’t work for large sites.

Further reading:

- Chord: A Scalable Peer-to-Peer Lookup Protocol for Internet Applications (2001), Stoica et al.
- Dynamo: Amazon’s Highly Available Key-value Store, SOSP 2007
- Tapestry: A Resilient Global-Scale Overlay for Service Deployment (2004), Zhao et al.

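Here is a small consistent-hash ring in Python to make the idea concrete (SHA-1 bucketing as in the modulo table above, plus virtual nodes). The class and node names are mine, not from the talk.

import bisect
import hashlib

class ConsistentHashRing:
    def __init__(self, nodes, virtual_nodes=100):
        # The ring is a sorted list of (hash, node) pairs; each node appears
        # many times as "virtual nodes" to smooth out skew.
        self.ring = []
        for node in nodes:
            for i in range(virtual_nodes):
                self.ring.append((self._hash(f"{node}#{i}"), node))
        self.ring.sort()
        self.keys = [h for h, _ in self.ring]

    @staticmethod
    def _hash(value):
        return int(hashlib.sha1(value.encode("utf-8")).hexdigest(), 16)

    def node_for(self, key):
        # Walk clockwise to the first virtual node at or after the key's hash.
        idx = bisect.bisect(self.keys, self._hash(key)) % len(self.ring)
        return self.ring[idx][1]

ring = ConsistentHashRing(["crawler-1", "crawler-2", "crawler-3"])
print(ring.node_for("174.132.225.106"))   # the same IP always maps to the same node
# Adding "crawler-4" moves only ~1/4 of the IPs, instead of rehashing everything.
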
Challenge: queueing URLs

Situation: you have URLs that haven’t been crawled recently, are allowed by robots.txt, and can be fetched politely. How do you order them within a single machine?

One approach: hash each URL into a per-thread lane, so every worker thread sees a mix of hosts, e.g.:

http://yachtmaintenanceco.com/
http://www.amsterdamports.nl/
http://www.4s-dawn.com/
http://www.embassysuiteslittlerock.com/
http://members.tripod.com/airfields_freeman/NM/Airfields_NM_NW.htm
http://mdgroover.iweb.bsu.edu
http://music.imbc.com/
http://www.robertjbradshaw.com
http://www.kerkattenhoven.be
http://www.escolania.org/
http://www.musiciansdfw.org/
http://www.ariana.org/

How many threads? You can look up the Erlang B, Erlang C, and Engset formulas (Agner Erlang’s work on telephone lines, used in call centers to calculate the number of agents you need relative to call volume), but the "number of agents" model doesn’t really translate; in practice, run as many threads as possible.

Also: don’t sort your input URLs. A sorted batch puts a long run of URLs from the same host into one lane, for example:

http://abcnews.go.com/
http://abcnews.go.com/2020/ABCNEWSSpecial/
http://abcnews.go.com/2020/story?id=207269&page=1
http://abcnews.go.com/GMA/JoelSiegel/story?id=1734395
http://abcnews.go.com/International/News/story?id=203089&page=1
http://abcnews.go.com/International/Pope/
http://abcnews.go.com/International/story?id=81417&page=1

and the thread spends its life in fetch, wait, fetch, wait because of the politeness delay. With hosts mixed, as in the first list, there is no waiting.

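A tiny sketch of the lane idea in Python (the lane count and function names are mine): hashing by host keeps all of a host’s URLs in one lane, while each lane sees many different hosts, so threads rarely sit in politeness waits.

from urllib.parse import urlparse
import zlib

def lane_for(url, num_lanes):
    # Assign a URL to a worker lane by hashing its host: all URLs for a host
    # share one lane, while each lane gets a mix of hosts.
    host = urlparse(url).netloc
    return zlib.crc32(host.encode("utf-8")) % num_lanes

lanes = {i: [] for i in range(4)}
for url in ["http://abcnews.go.com/", "http://www.ariana.org/",
            "http://music.imbc.com/", "http://abcnews.go.com/2020/"]:
    lanes[lane_for(url, 4)].append(url)
print(lanes)   # the two abcnews URLs share a lane; other hosts hash independently
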
Challenge: extracting URLs

The internet is full of garbage: enormous pages, terrible markup, and ridiculous URLs (☃.net, "unicode snowman dot net"). Be prepared:

- use a streaming XML parser
- use a library that handles bad markup
- be aware that URLs aren’t ASCII
- use a URL normalizer

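As a sketch of "be prepared," here is a link extractor in Python using BeautifulSoup (mentioned on the software slide below) plus a simplified URL normalizer. The normalization rules are illustrative rather than complete, and bs4 is assumed to be installed.

from urllib.parse import urljoin, urlparse, urlunparse
from bs4 import BeautifulSoup   # tolerant of terrible markup

def extract_urls(base_url, html):
    soup = BeautifulSoup(html, "html.parser")
    urls = set()
    for a in soup.find_all("a", href=True):
        absolute = urljoin(base_url, a["href"])     # resolve relative links
        parts = urlparse(absolute)
        if parts.scheme not in ("http", "https"):
            continue
        normalized = urlunparse((parts.scheme.lower(),   # lowercase scheme
                                 parts.netloc.lower(),   # lowercase host
                                 parts.path or "/",      # default path
                                 parts.params,
                                 parts.query,
                                 ""))                    # drop fragments
        urls.add(normalized)
    return urls

# extract_urls("http://example.com/", "<a href='/a#frag'>a</a> <a href=b>b</a>")
# -> {'http://example.com/a', 'http://example.com/b'}
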
Software

Software advice: your goals determine your scale, and someone else has already done it; web crawling and scraping isn’t new.

The 2-second crawler:

function wgetspider() {
  wget --html-extension --convert-links --mirror \
       --page-requisites --progress=bar --level=5 \
       --no-parent --no-verbose \
       --no-check-certificate "$@";
}
$ wgetspider http://www.ischool.berkeley.edu/

Java crawlers:

- Heritrix (Internet Archive): http://crawler.archive.org/
- Nutch (Lucene): http://nutch.apache.org/
- Bixo (Hadoop / Cascading): http://bixo.101tec.com/

Extraction packages:

- mechanize: http://wwwsearch.sourceforge.net/mechanize/
- BeautifulSoup & urllib2: http://www.crummy.com/software/BeautifulSoup/
- Scrapy: http://scrapy.org/

Wrapper induction(ish):

- Ariel: http://ariel.rubyforge.org/index.html
- RoadRunner: http://www.dia.uniroma3.it/db/roadRunner/
- TemplateMaker: http://code.google.com/p/templatemaker/
- scrubyt: http://scrubyt.rubyforge.org/files/README.html

Questions? Feedback: nate@xcombinator.com, www.xcombinator.com, @xcombinator

Posted in big-data, crawling | 1 Comment