TaskPipe is a framework for building scrapers and crawlers, written in Perl5.
TaskPipe was created to take as much of the effort as possible out of building directory style scraping systems. Such systems can be assembled quickly as a series of modular tasks, and tasks can be rearranged or reused elsewhere.
TaskPipe aims to be lightweight in terms of its own footprint, but heavyweight in terms of capability.
A command line tool is included to assist with quick project creation and project management.
The main purpose of this project is to act as the data gathering component of a web analytics software package. We are releasing it as open source via the GNU General Public License (GPL) v3.0. The usual disclaimer applies: this is experimental software in a relatively early stage of development; use at your own risk.
Note that what follows is the first part of a tutorial series, covering some basic TaskPipe concepts. The complete series is not yet available, but coming soon - please bear with us, and watch this space!
TaskPipe was really designed for those instances where you want to scrape data arranged in the format of an online directory, and create your own cross-referenced database of the results. For example, there may be some kind of list page which you want to refer to initially; each list entry may provide a link that points to a page with a sublist; and each sublist entry might point to a detail page.
Consider the accompanying diagram, which shows a simple scenario where a website is displaying some basic information about a list of companies. In our example each item in the list has a link to a company detail page, and a link to a sublist page, showing job postings that are associated with the company. Each item on the jobs sublist has a further detail page.
Unfortunately TaskPipe can't design your table schema for you - so it's important you can already do this in a way that makes sense for the data you are trying to collect. As a quick exercise, try writing down a database schema for our example situation. Specify which tables you would create and, for each table, which columns you would include. Your schema should capture all of the available data. Then refer to our suggested schema in the solution below.
| table | columns |
|---|---|
| `company` | `id`, `name`, `location`, `jobs` |
| `job_post` | `id`, `company_id`, `category`, `job_title`, `date_posted` |
If you noticed that *jobs* are not the same as *job posts*, and you designed your schema with two separate tables for each of these (so your `jobs` table would contain things like `job_title`, `job_description` and `salary`, whereas your `job_posts` table would only contain things specific to the post itself, such as `post_date`) then this is even better. Good work!
Going one step further, having a specific table called something like `job_category` - which would be a dedicated list of allowed job categories - might be a smart idea for the long term. You could then link category to job via a foreign key `category_id` on `job`.

If you noticed both these things, and your schema has a total of 4 tables, then that's great. However, in the interests of keeping this example simple, we will stick with our basic schema and pretend we only care about having `company` and `job_post` tables.
Perhaps you also decided that a table doesn't need an auto-incrementing `id` column. For example, you may have used the rationale that we expect company names to be unique, and thus defined `name` as a primary key on your `company` table. This is a legitimate approach provided we are sure we will never encounter distinct companies with the same name. Again, since this is just an example, we won't agonise over it too much.
Let's say we have created these tables in a MySQL database. TaskPipe is designed to work with any database that supports SQLSTATE - however it was built using MySQL, and not much testing has yet taken place with other database systems. It is probably safer to use MySQL if possible.
We tell TaskPipe how to pick up the data by providing a plan. A plan is a YAML file which basically outlines the tasks which are going to be performed, which order they are going to happen in, and which data is passed between them. It makes sense for our scrape to start with the "companies list" page, so the first item in our plan might look like this:
```yaml
---
- _name: Scrape_CompaniesList
  url: http://example.com/companies
  headers:
    Referer: http://example.com/
```
The three dashes (`---`) at the top are the YAML way of marking the top of the file. Below that we specify our first task as a list element (using the dash `-` to indicate a list element). If you are not familiar with YAML markup, you can refer to the documentation - or alternatively, just accept that a dash indicates a list element, and keep reading!
In TaskPipe, scraping tasks generally require the URL of the page to scrape, and a `Referer` header. Carefully specifying a `Referer` header helps to make sure the scraper proceeds between pages in a way that more closely resembles a human, and thus is less likely to raise red flags on the target website. However, you can adjust settings so your scraping task does not require a `Referer` header. Or indeed you can create your own custom task which takes whatever parameters you decide (but one step at a time…)
You'll notice that `_name` begins with an underscore. This is because an underscore indicates that it is a label. A label is something that allows tasks to refer to each other (which is usually the point of labels!). However, in general a TaskPipe label also has the following requirement: changing or removing the label does not affect the operation of the task.
Consider the following task specification:
```yaml
---
- _name: Scrape_CompaniesList
  _id: my_id
  url: http://example.com/companies
  headers:
    Referer: http://example.com
```
You'll note the extra `_id` parameter. Because this starts with an underscore, TaskPipe knows to ignore it, e.g. when caching results: it knows the added `_id` label will make no difference to the output for a given input.
The only exception to this rule is the `_name` label, which is special because it works both as a label (ie it can be used to refer to tasks) and it also affects the task output. (Actually, we couldn't decide if `_name` should get an underscore. Will this change in future? Maybe! Are we making this stuff up as we go along? Absolutely!)
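To put that another way, here is the same task specification again, this time annotated with our own comments:

```yaml
---
- _name: Scrape_CompaniesList   # a label, but a special one: changing it affects the output
  _id: my_id                    # a pure label: changing or removing it changes nothing
  url: http://example.com/companies
  headers:
    Referer: http://example.com
```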
We'll start out by creating a plan for just the right side of the diagram – ie the "company list page", the "company jobs list page" and the "job description" page. These things happen sequentially, so it's no surprise that we can put our tasks in a line:
```yaml
---
- _name: Scrape_CompaniesList
  url: http://example.com/companies
  headers:
    Referer: http://example.com
- _name: Scrape_JobsList
  url: $this{jobs_url}
  headers:
    Referer: http://example.com/companies
- _name: Scrape_JobDescription
  url: $this{jd_url}
  headers:
    Referer: $this[1]{jobs_url}
```
In general, a task takes a single set of inputs and generates a list of (sets of) outputs - so in general it is a one-to-many operation. For example, when we scrape `example.com/companies` we provide the URL and the `Referer` header (a single set of inputs), and we hope that the scraping task produces a list of outputs which look something like:
```perl
{
    company     => 'Yahoo',
    location    => 'US',
    jobs        => '3',
    company_url => 'http://example.com/info?company=yahoo',
    jobs_url    => 'http://example.com/jobs?company=yahoo'
},
{
    company     => 'BP',
    location    => 'UK',
    jobs        => '5',
    company_url => 'http://example.com/info?company=BP',
    jobs_url    => 'http://example.com/jobs?company=BP'
},
{
    company     => 'Honda',
    location    => 'Japan',
    jobs        => '2',
    company_url => 'http://example.com/info?company=honda',
    jobs_url    => 'http://example.com/jobs?company=honda'
}
```
So our scraping task somehow picks up the visible information (company, location, jobs) as well as the target URLs - which will probably be in the `href` attribute of `<a>` tags. (If you are wondering how exactly the `Scrape_CompaniesList` task produces this output, we'll get to that in due course. Hold on to your hat!)
So let's say our `Scrape_CompaniesList` task produces the output above. For each set of outputs, the next task in line gets executed: ie the outputs of `Scrape_CompaniesList` get fed into `Scrape_JobsList`, and in this case the `Scrape_JobsList` task gets executed 3 times.
When the inputs to the second task (`Scrape_JobsList`) are

```perl
{
    company     => 'Yahoo',
    location    => 'US',
    jobs        => '3',
    company_url => 'http://example.com/info?company=yahoo',
    jobs_url    => 'http://example.com/jobs?company=yahoo'
}
```
What do we expect the outputs from this task to be?
```perl
{
    category    => 'IT',
    job_title   => 'Coder',
    date_posted => '2 June',
    jd_url      => 'http://example.com/job?company=yahoo&job=coder'
},
{
    category    => 'Sales',
    job_title   => 'Salesman',
    date_posted => '2 June',
    jd_url      => 'http://example.com/job?company=yahoo&job=salesman'
},
{
    category    => 'Media',
    job_title   => 'Journalist',
    date_posted => '28 May',
    jd_url      => 'http://example.com/job?company=yahoo&job=journalist'
}
```
The names `category`, `job_title`, `date_posted` and `jd_url` are arbitrary. We decide what we are going to call each piece of information - but obviously we need to be consistent. If we are giving the parameter corresponding to `IT`, `Sales`, `Media` etc. the name `category` (with a small "c"), and the next task in line is looking for a parameter called `Category` (with a big "C") or `job_category` (or whatever), then you'll end up with some nulls on your database, as sketched below.
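As a minimal sketch of how a naming mismatch plays out (the `Store_JobPost` task name and its `category` parameter are hypothetical, invented purely for illustration):

```yaml
- _name: Store_JobPost            # hypothetical task name
  category: $this{category}       # matches the output name exactly - works
  # category: $this{Category}     # wrong case: no input with that name - a null
```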
Hopefully you will have noticed that tasks closer to the bottom of the plan tend to get executed more often than tasks nearer the top - and that's true of TaskPipe plans in general. So in this case, our task specifications form a straight line (ie one after another), but if we look at executed tasks, then these look more like a tree.
In TaskPipe it's often useful to think about "executed tasks" as well as plain tasks. For this reason we shorten "executed task" to *xtask*. A loose definition of an xtask is the combination *task + inputs*. So in our example, `Scrape_JobsList` is a task, but the combination of the task `Scrape_JobsList` plus the input `company=yahoo` is an xtask.
Can you draw up an "xtask diagram" that corresponds to the plan so far (ie the 3 sequential tasks `Scrape_CompaniesList`, `Scrape_JobsList` and `Scrape_JobDescription`)? How many times do we expect the `Scrape_JobDescription` task to be executed (in total)?
We expect the `Scrape_JobDescription` task to be executed exactly 10 times - because we know Yahoo has 3 jobs in total, BP has 5 jobs and Honda has 2 jobs. Of course, in a real situation we might not know in advance how many times a particular task was going to get executed.
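For reference, an xtask diagram for our example might look something like the sketch below (one node per executed task; the job labels come from our example outputs, with the BP and Honda jobs abbreviated):

```
Scrape_CompaniesList
├─ Scrape_JobsList (company=Yahoo)
│   ├─ Scrape_JobDescription (job=Coder)
│   ├─ Scrape_JobDescription (job=Salesman)
│   └─ Scrape_JobDescription (job=Journalist)
├─ Scrape_JobsList (company=BP)
│   └─ Scrape_JobDescription (×5)
└─ Scrape_JobsList (company=Honda)
    └─ Scrape_JobDescription (×2)
```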
Going back to our plan, we have a first task specification which looks like this:
```yaml
---
- _name: Scrape_CompaniesList
  url: http://example.com/companies
  headers:
    Referer: http://example.com
```
And the first set of results it produces look like this:
```perl
{
    company     => 'Yahoo',
    location    => 'US',
    jobs        => '3',
    company_url => 'http://example.com/info?company=yahoo',
    jobs_url    => 'http://example.com/jobs?company=yahoo'
}
```
Our second scraping task needs that `jobs_url`. We can tell TaskPipe to take `jobs_url` from the first task and insert it into the `url` parameter in the second task by using the `$this` parameter variable. `$this` means "use the input of this task" (remember that the input of this task is just the same as the output of the last task).
Let's take a moment to clarify some definitions, which will make discussing TaskPipe plans easier:

- *Task parameters*: `url` and `headers` are task parameters. `_name` and other labels are not task parameters.
- *Parameter variables*: words like `$this` which start with a dollar sign (similar to Perl variables), and are used to indicate that the word should be replaced by data coming from some other task (which exact task, and which specific data item, depends on the parameter variable and how it is specified. We will discuss parameter variables in more detail later).
- *Pinterps* (interpolated parameters): if a task contains the declaration `somevar: $this`, then the value of the `somevar` parameter is just the word `$this`, but the value of the `somevar` pinterp is the data which is actually accepted by the parameter, ie after `$this` has been interpolated.
Here's a practical example of this language use. Our second task specification looks like this:
```yaml
- _name: Scrape_JobsList
  url: $this{jobs_url}
  headers:
    Referer: http://example.com/companies
```
In the declaration `url: $this{jobs_url}`, we are using the `$this` parameter variable: `$this{jobs_url}` is the value of the `url` parameter.
Let's run our task against a set of inputs:

```perl
{
    company     => 'Yahoo',
    location    => 'US',
    jobs        => '3',
    company_url => 'http://example.com/info?company=yahoo',
    jobs_url    => 'http://example.com/jobs?company=yahoo'
}
```
This will make the pinterp value of `url` become `http://example.com/jobs?company=yahoo`: ie the pinterp of `url` becomes the value of the input named `jobs_url`. Remember that, in general, pinterp values are the things that are absorbed and used in the task.
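Side by side, that looks like this (a sketch, based on the inputs above):

```yaml
# the parameter, as written in the plan:
#   url: $this{jobs_url}
# the pinterp, ie the value the task actually absorbs after interpolation:
url: http://example.com/jobs?company=yahoo
```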
It is worth mentioning that a pinterp value does not have to refer to a parameter which is defined as a parameter variable. For example, in our first task we declared `url: http://example.com/companies`. In this case there is no parameter variable: we are saying we want `url` to be equal to the fixed value `http://example.com/companies`, whatever happens. This means the value of the parameter is `http://example.com/companies`, but the value of the pinterp is also `http://example.com/companies` (since there is no variable in there, it just "interpolates" statically and stays as it is).
A final observation on the subject of task "parameters" vs. task "pinterps": we could talk about the parameters in a task without needing inputs, but we needed a specific set of inputs to be able to discuss pinterps. Putting this another way, parameters are really a feature of tasks, whereas pinterps are a feature of xtasks.
Earlier we said that a "loose definition" of an xtask ("executed task") was the combination of a task and a specific set of inputs. The reason the definition was "loose" was because we neglected to mention input history. When a task completes and invokes the next task in line, it not only hands over its outputs (which become the inputs of the next task, remember) but also a complete history of all the input values that came beforehand. So when any task is invoked for execution, it is aware of everything that has happened previously.
The mechanics of this are not something you generally need to worry about when creating a scraper using TaskPipe. You just need to know how to instruct TaskPipe to grab values from earlier tasks using parameter variables.
One way of doing this may be seen in the third task specification of our example:
```yaml
- _name: Scrape_JobDescription
  url: $this{jd_url}
  headers:
    Referer: $this[1]{jobs_url}
```
See the `[1]` between `$this` and `{jobs_url}` in the `Referer` declaration? That `[1]` is called a match offset, and it indicates that instead of using the inputs of this task, TaskPipe should count one extra task back and take the value from those inputs instead. So in this case `$this[1]{jobs_url}` means "take the value of the input named `jobs_url` that was fed to the `Scrape_JobsList` task".
This is, of course, the same value that `Scrape_JobsList` accepted into the parameter `url`. It makes sense to arrange the `Referer` header this way: when you are clicking through webpages in a browser, the `Referer` is almost always the last page you visited. So it makes sense to keep `Referer` one step behind `url` in your scraping tasks. For a series of back-to-back scraping tasks, this effect can be achieved by specifying `$this` for `url` and `$this[1]` for `Referer`.
Suppose, somewhere in the middle of your plan, you were going to run the scraping tasks `Scrape_Something`, `Scrape_SomethingElse` and `Scrape_SomethingFurtherStill` (in that order, one after another). Suppose all of your tasks (including the ones that occur before `Scrape_Something`) are designed so they each output the url that the next scraping task is going to use - and they all use `url` as the name of the output. Write down this part of the plan: ie write down the 3 task specifications, including the task name, `url` and `Referer` header for each task, together with the relevant parameter values, and parameter variables where appropriate.
```yaml
# ...
- _name: Scrape_Something
  url: $this
  headers:
    Referer: $this[1]{url}
- _name: Scrape_SomethingElse
  url: $this
  headers:
    Referer: $this[1]{url}
- _name: Scrape_SomethingFurtherStill
  url: $this
  headers:
    Referer: $this[1]{url}
# ...
```
Let's go back to our `url: $this{jobs_url}` declaration. We noted that the `$this` parameter variable means "take the value from the inputs of this task". You may have already gathered that adding the `{jobs_url}` suffix tells TaskPipe "use the input named `jobs_url`". In this case we are putting the value of an input named `jobs_url` into a parameter named `url` - the name of the parameter is different to the name of the input, so we need to explicitly tell TaskPipe which input to use.
However, if we were expecting an input whose name was the same as the parameter - so e.g. our input was also named simply `url` (instead of `jobs_url`) - then we could have omitted `{jobs_url}` completely and just written `url: $this`.
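As a quick sketch (the `Scrape_Detail` task name here is hypothetical):

```yaml
# the previous task outputs an item named "url", matching the parameter name:
- _name: Scrape_Detail   # hypothetical task name
  url: $this             # equivalent to writing $this{url}
```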
Written in complete form, parameter variable declarations generally involve several parts – but most are optional. Those parts are (usually) as follows:
```yaml
# general format:
$<label_key>:<label_value>(<match_count>)[<match_offset>]{<input_key>}

# example:
url: $name:Scrape_Companies(2)[1]{jobs_url}
```
Here's a summary of what each of those parts means:
| part of parameter variable | meaning | required or optional? |
|---|---|---|
| `label_key` | The name of the parameter variable, e.g. `this` or `name`. | Always required |
| `label_value` | The value of the label to look for. In the example above, we are telling TaskPipe to look for the task which has a `_name` of `Scrape_Companies`. | Required in all cases except for `$this` |
| `match_count` | If more than one task matches the label, `match_count` indicates which match to use. In the example above, TaskPipe will look for the third task which has a `_name` of `Scrape_Companies`. | Optional |
| `match_offset` | Once a task matching the label has been found, count this many additional tasks back and take the inputs from that task instead. In the example above, `[1]` tells TaskPipe to use the inputs of the task one before the matched task. | Optional |
| `input_key` | The other parts of the parameter variable identify which set of inputs to use. The final step is to provide the name of the desired specific input within the set. This is the `input_key`. In the example above, TaskPipe looks for an input named `jobs_url`. | Optional |
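Putting the pieces together, here is a hedged sketch of a `name` label in use (the `Scrape_SomethingLater` task name is hypothetical; we assume `Scrape_JobsList` appears somewhere earlier in the plan):

```yaml
- _name: Scrape_SomethingLater   # hypothetical task name
  # take jobs_url from the inputs of the task whose _name is Scrape_JobsList,
  # however far back in the plan that task sits:
  url: $name:Scrape_JobsList{jobs_url}
  headers:
    Referer: http://example.com/companies
```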
Congratulations! You have reached the end of part 1 of the TaskPipe tutorial.
Unfortunately Part 2 of the series is not yet available - but it's coming soon, so watch this space! Alternatively get in touch if you have questions.
Have a great scrape!