NAME
AnyEvent::Net::Curl::Queued - Moose wrapper for queued downloads via
Net::Curl & AnyEvent
VERSION
version 0.010
SYNOPSIS
#!/usr/bin/env perl

package CrawlApache;
use common::sense;

use HTML::LinkExtor;
use Moose;

extends 'AnyEvent::Net::Curl::Queued::Easy';

after finish => sub {
    my ($self, $result) = @_;

    say $result . "\t" . $self->final_url;

    if (
        not $self->has_error
        and $self->getinfo('content_type') =~ m{^text/html}
    ) {
        my @links;

        HTML::LinkExtor->new(sub {
            my ($tag, %links) = @_;
            push @links,
                grep { $_->scheme eq 'http' and $_->host eq 'localhost' }
                values %links;
        }, $self->final_url)->parse(${$self->data});

        for my $link (@links) {
            $self->queue->prepend(sub {
                CrawlApache->new({ initial_url => $link });
            });
        }
    }
};

no Moose;
__PACKAGE__->meta->make_immutable;

1;

package main;
use common::sense;

use AnyEvent::Net::Curl::Queued;

my $q = AnyEvent::Net::Curl::Queued->new;
$q->append(sub {
    CrawlApache->new({ initial_url => 'http://localhost/manual/' })
});
$q->wait;
DESCRIPTION
Efficient and flexible batch downloader with a straightforward
interface:
* create a queue;
* append/prepend URLs;
* wait for downloads to end (retry on errors).
Download init/finish/error handling is defined through Moose's method
modifiers.
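For example, a custom worker with its own init/finish behavior might
look like the following minimal sketch. "MyDownloader" and the
"followlocation" option are illustrative, not part of the distribution,
and the liberal OPTION => VALUE form of "setopt" is assumed (it mirrors
the getinfo('content_type') call in the SYNOPSIS):

package MyDownloader;
use common::sense;
use Moose;

extends 'AnyEvent::Net::Curl::Queued::Easy';

# runs after the default initialization; a convenient place
# to set extra libcurl options
after init => sub {
    my ($self) = @_;
    $self->setopt(followlocation => 1);    # illustrative option
};

# runs after the transfer completes, successfully or not
after finish => sub {
    my ($self, $result) = @_;
    if ($self->has_error) {
        warn 'error fetching ' . $self->final_url . ": $result\n";
    } else {
        say 'fetched ' . $self->final_url;
    }
};

no Moose;
__PACKAGE__->meta->make_immutable;
1;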
MOTIVATION
I am very unhappy with the performance of LWP. It's almost perfect for
properly handling HTTP headers, cookies & stuff, but it comes at the
cost of *speed*. While this doesn't matter for a single download, batch
downloading becomes a real pain.
When I download a large batch of documents, I don't care about cookies
or headers; only the content and proper redirection matter. And, as
this is clearly an I/O-bound operation, I want to make as many parallel
requests as possible.
So, this is what CPAN offers to fulfill my needs:
* Net::Curl: Perl interface to the all-mighty libcurl; it is
well-documented (as opposed to WWW::Curl);
* AnyEvent: the DBI of event loops. Net::Curl also provides a nice and
well-documented example of AnyEvent usage (03-multi-event.pl);
* MooseX::NonMoose: Net::Curl uses a pure-Perl object implementation,
which is lightweight, but a bit messy for my Moose-based projects.
MooseX::NonMoose bridges this gap.
AnyEvent::Net::Curl::Queued is a glue module to wrap it all together. It
offers no callbacks and (almost) no default handlers. It's up to you to
extend the base class AnyEvent::Net::Curl::Queued::Easy so it will
actually download something and store it somewhere.
OVERHEAD
Obviously, the bottleneck of any kind of download agent is the
connection itself. However, socket handling and header parsing add a
lot of overhead. The script eg/benchmark.pl compares
AnyEvent::Net::Curl::Queued against several other download agents. Only
AnyEvent::Net::Curl::Queued itself, AnyEvent::Curl::Multi and lftp
support parallel connections natively; thus, forks are used to
reproduce the same behaviour for the remaining agents. Both
AnyEvent::Curl::Multi and LWP::Curl are frontends for WWW::Curl. The
download target is a local copy of the Apache documentation.
In the table below, columns are numbered as the rows; each cell shows
how much faster (positive) or slower (negative) the row's agent was,
relative to the agent in the correspondingly numbered column:

                                URL/s      1      2      3      4      5      6      7      8      9     10     11
 1. WWW::Mechanize                196     --   -60%   -80%   -85%   -86%   -88%   -89%   -92%   -97%   -97%  -100%
 2. LWP::UserAgent                484   148%     --   -51%   -63%   -66%   -70%   -72%   -80%   -93%   -93%   -99%
 3. HTTP::Lite                    989   405%   104%     --   -25%   -32%   -39%   -42%   -59%   -85%   -86%   -99%
 4. HTTP::Tiny                   1312   569%   170%    33%     --    -9%   -19%   -23%   -46%   -80%   -82%   -99%
 5. AnyEvent::Curl::Multi        1446   638%   198%    46%    10%     --   -10%   -16%   -41%   -78%   -80%   -98%
 6. lftp                         1609   722%   232%    63%    23%    11%     --    -6%   -34%   -75%   -77%   -98%
 7. AnyEvent::Net::Curl::Queued  1713   773%   253%    73%    30%    18%     6%     --   -30%   -74%   -76%   -98%
 8. AnyEvent::HTTP               2437  1144%   403%   146%    86%    69%    51%    42%     --   -63%   -66%   -97%
 9. curl                         6512  3228%  1244%   559%   397%   351%   305%   281%   167%     --    -8%   -93%
10. LWP::Curl                    7110  3524%  1364%   618%   442%   391%   341%   315%   191%     9%     --   -92%
11. wget                        88875 45240% 18215%  8877%  6675%  6045%  5418%  5092%  3544%  1262%  1151%     --
AnyEvent::HTTP & LWP::Curl are actually faster, but both lack
queueing/retry.
ATTRIBUTES
allow_dups
Allow duplicate requests (default: false). By default, requests to the
same URL (more precisely, requests with the same signature) are issued
only once. To include POST parameters in the signature, you must extend
the AnyEvent::Net::Curl::Queued::Easy class. Setting "allow_dups" to a
true value disables this check.
completed
Count completed requests.
cv
AnyEvent condition variable. Initialized automatically, unless you
specify your own.
max
Maximum number of parallel connections (default: 4; minimum value: 1).
multi
Net::Curl::Multi instance.
queue
"ArrayRef" to the queue. Has the following helper methods:
* queue_push: append item at the end of the queue;
* queue_unshift: prepend item at the top of the queue;
* dequeue: shift item from the top of the queue;
* count: number of items in queue.
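For instance (a sketch; "$q" is a queue instance as in the SYNOPSIS):

# number of workers still waiting in the queue
say 'queued: ', $q->count;

# remove and return the item at the top of the queue
my $item = $q->dequeue;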
share
Net::Curl::Share instance.
stats
AnyEvent::Net::Curl::Queued::Stats instance.
timeout
Timeout (default: 60 seconds).
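Putting the attributes together, a tuned queue might be built like this
(a sketch; the values are arbitrary):

use AnyEvent::Net::Curl::Queued;

my $q = AnyEvent::Net::Curl::Queued->new({
    max        => 10,   # 10 parallel connections instead of the default 4
    timeout    => 30,   # 30-second timeout instead of the default 60
    allow_dups => 1,    # disable the duplicate-request check
});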
METHODS
start()
Populate empty request slots with workers from the queue.
empty()
Check whether there are any active requests or requests left in the
queue.
add($worker)
Activate a worker.
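Unlike "append"/"prepend" below, the worker is activated immediately
rather than queued. For example (a sketch, using the same placeholder
URL as the examples below):

$q->add(
    AnyEvent::Net::Curl::Queued::Easy->new({ initial_url => 'http://.../' })
);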
append($worker)
Put the worker (instance of AnyEvent::Net::Curl::Queued::Easy) at the
end of the queue. For lazy initialization, wrap the worker in a "sub {
... }", the same way you do with the Moose "default => sub { ... }":
$queue->append(sub {
    AnyEvent::Net::Curl::Queued::Easy->new({ initial_url => 'http://.../' })
});
prepend($worker)
Put the worker (instance of AnyEvent::Net::Curl::Queued::Easy) at the
beginning of the queue. For lazy initialization, wrap the worker in a
"sub { ... }", the same way you do with the Moose "default => sub { ...
}":
$queue->prepend(sub {
    AnyEvent::Net::Curl::Queued::Easy->new({ initial_url => 'http://.../' })
});
wait()
Shortcut to "$queue->cv->recv".
SEE ALSO
* AnyEvent
* Moose
* Net::Curl
* WWW::Curl
* AnyEvent::Curl::Multi
AUTHOR
Stanislaw Pusep
COPYRIGHT AND LICENSE
This software is copyright (c) 2011 by Stanislaw Pusep.
This is free software; you can redistribute it and/or modify it under
the same terms as the Perl 5 programming language system itself.