Introduction
TaskPipe is a framework for building scrapers and crawlers, written in Perl5.
TaskPipe was created to take as much of the effort as possible out of building directory style scraping systems. Such systems can be assembled quickly as a series of modular tasks, and tasks can be rearranged or reused elsewhere.
TaskPipe aims to be lightweight in terms of its own footprint, but heavyweight in terms of capability, allowing (depending on settings):
- a desired number of parallel download threads to be specified
- auto launch of one Tor instance per thread
- the collection and use of open proxies
- auto page rendering via PhantomJS
A command line tool is included to assist with quick project creation, and project management.
The main purpose of this project is to act as the data gathering component of a web analytics software package. We are releasing this open source via the GNU General Public License (GPL) v3.0. The usual disclaimer applies: this is experimental software in a relatively early stage of development; use at your own risk.
Note that what follows is the first part of a tutorial series, covering some basic TaskPipe concepts. The complete series is not yet available, but coming soon - please bear with us, and watch this space!
Overview
TaskPipe was really designed for those instances where you want to scrape online data arranged in the format of an online directory, and create your own cross referenced database of the results. For example, there may be some kind of list page which you want to refer to initially; each list entry may provide a link that points to a page with a sublist; and each sublist entry might point to a detail page.
Consider the accompanying diagram, which shows a simple scenario where a website is displaying some basic information about a list of companies. In our example each item in the list has a link to a company detail page, and a link to a sublist page, showing job postings that are associated with the company. Each item on the jobs sublist has a further detail page.
Quick Exercise
Unfortunately TaskPipe can't design your table schema for you - so it's important you can already do this in a way that makes sense for the data you are trying to collect. As a quick exercise, try writing down a database schema for our example situation. Specify which tables you would create, and for each table which columns you would include. You should pick up all of the available data. Then refer to our suggested schema in the solution below.
Solution
table | columns |
---|---|
company | id, name, location |
job_post | id, company_id, category, job_title, date_posted |
Solution Notes
- Actually this probably isn't the best way to design a schema for this scenario. If you spotted that really `jobs` are not the same as `job posts`, and you designed your schema with two separate tables for each of these (so your `jobs` table would contain things like `job_title`, `job_description` and `salary`, whereas your `job_posts` table would only contain things specific to the post itself, such as `post_date`) then this is even better. Good work! Going one step further, having a specific table called something like `job_category` - which would be a dedicated list of allowed job categories - might be a smart idea for the long term. You could then link category to job via a foreign key `category_id` on job. If you noticed both these things, and your schema has a total of 4 tables then that's great. However, in the interests of keeping this example simple, we will stick with our basic schema, and pretend we only care about having `company` and `job_posts` tables.
- Another thing you might have done differently is not to relate tables using an `id` column. For example, you may have used the rationale that we expect company names to be unique, and thus defined `name` as a primary key on your `company` table. This is a legitimate approach provided we are sure we will never encounter distinct companies with the same name. Again, since this is just an example, we won't agonise over it too much.
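To make the basic two-table schema concrete, here is a minimal sketch in code. Python's built-in sqlite3 stands in for MySQL purely for illustration, and the column names are assumptions drawn from the example data that appears later in this tutorial:

```python
import sqlite3

# Illustrative sketch only: SQLite stands in for MySQL, and the column
# names are assumptions based on the example data later in the tutorial.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE company (
    id       INTEGER PRIMARY KEY,
    name     TEXT,
    location TEXT
);
CREATE TABLE job_post (
    id          INTEGER PRIMARY KEY,
    company_id  INTEGER REFERENCES company(id),
    category    TEXT,
    job_title   TEXT,
    date_posted TEXT
);
""")
tables = sorted(
    r[0] for r in conn.execute("SELECT name FROM sqlite_master WHERE type='table'")
)
print(tables)  # ['company', 'job_post']
```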
TaskPipe Plan Basics
Let's say we have created these tables in a MySQL database. TaskPipe is designed to work with any database that supports SQLSTATE - however it was built using MySQL, and not much testing has yet taken place with other database systems. It is probably safer to use MySQL if possible.
We tell TaskPipe how to pick up the data by providing a plan. A plan is a YAML file which basically outlines the tasks which are going to be performed, which order they are going to happen in, and which data is passed between them. It makes sense for our scrape to start with the "companies list" page, so the first item in our plan might look like this:
---
- _name: Scrape_CompaniesList
  url: http://example.com/companies
  headers:
    Referer: http://example.com/
The three dashes `---` at the top are the YAML way of marking the top of the file. Below that we specify our first task as a list element (using the dash `-` to indicate a list element). If you are not familiar with YAML markup, you can refer to the documentation - or alternatively, just accept that a dash indicates a list element, and keep reading!
In TaskPipe scraping tasks generally require the URL of the page to scrape, and a `Referer` header. Carefully specifying a `Referer` header helps to make sure the scraper proceeds between pages in a way that more closely resembles a human, and thus is less likely to raise red flags on the target website. However, you can adjust settings so your scraping task does not require a `Referer` header. Or indeed you can create your own custom task which takes whatever parameters you decide (but one step at a time...)
You'll notice that `_name` begins with an underscore. This is because an underscore indicates that it is a label. A label is something that allows tasks to refer to each other (which is usually the point of labels!) However, in general a TaskPipe label also has the following requirement: changing or removing the label does not affect the operation of the task. Consider the following task specification:
---
- _name: Scrape_CompaniesList
  _id: my_id
  url: http://example.com/companies
  headers:
    Referer: http://example.com
You'll note the extra `_id` parameter. Because this starts with an underscore, TaskPipe knows to ignore it e.g. when caching results. So it knows the added `_id` label will make no difference to the output for a given input.

The only exception to this rule is the `_name` label, which is special because it works both as a label (ie it can be used to refer to tasks) and it also affects the task output. (Actually we couldn't decide if `_name` should get an underscore. Will this change in future? Maybe! Are we making this stuff up as we go along? Absolutely!)
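The caching idea described above can be sketched in a few lines. This is a hypothetical helper written in Python for illustration (TaskPipe itself is Perl, and this is not its actual code): keys beginning with an underscore are labels and are stripped before computing a cache key, with `_name` as the one exception, since it affects the output:

```python
import hashlib
import json

def cache_key(task_spec):
    # Hypothetical sketch: drop underscore-prefixed labels, except _name,
    # which does affect the task output and so must stay in the key.
    significant = {
        k: v for k, v in task_spec.items()
        if not k.startswith("_") or k == "_name"
    }
    blob = json.dumps(significant, sort_keys=True)
    return hashlib.sha256(blob.encode()).hexdigest()

a = {"_name": "Scrape_CompaniesList", "url": "http://example.com/companies"}
b = dict(a, _id="my_id")  # adding a label...
print(cache_key(a) == cache_key(b))  # ...makes no difference: True
```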
Building our plan
We'll start out by creating a plan for just the right side of the diagram – ie the "company list page", the "company jobs list page" and the "job description" page. These things happen sequentially, so it's no surprise that we can put our tasks in a line:
---
- _name: Scrape_CompaniesList
  url: http://example.com/companies
  headers:
    Referer: http://example.com
- _name: Scrape_JobsList
  url: $this{jobs_url}
  headers:
    Referer: http://example.com/companies
- _name: Scrape_JobDescription
  url: $this{jd_url}
  headers:
    Referer: $this[1]{jobs_url}
In general a task takes a single set of inputs, and generates a list of (sets of) outputs. So in general it is a one-to-many operation. For example, when we scrape `example.com/companies` we provide the URL and the `Referer` header (a single set of inputs) and we hope that the scraping task produces a list of outputs which look something like:
{
    company => 'Yahoo',
    location => 'US',
    jobs => '3',
    company_url => 'http://example.com/info?company=yahoo',
    jobs_url => 'http://example.com/jobs?company=yahoo'
},
{
    company => 'BP',
    location => 'UK',
    jobs => '5',
    company_url => 'http://example.com/info?company=BP',
    jobs_url => 'http://example.com/jobs?company=BP'
},
{
    company => 'Honda',
    location => 'Japan',
    jobs => '2',
    company_url => 'http://example.com/info?company=honda',
    jobs_url => 'http://example.com/jobs?company=honda'
}
So our scraping task somehow picks up the visible information (company, location, jobs) as well as the target URLs - which will probably be in the `href` attribute of `<a>` tags. (If you are wondering how exactly the `Scrape_CompaniesList` task produces this output, we'll get to that in due course. Hold on to your hat!)
So let's say our `Scrape_CompaniesList` task produces the output above. For each set of outputs the next task in line gets executed. ie the outputs of `Scrape_CompaniesList` get fed into `Scrape_JobsList`, and in this case the `Scrape_JobsList` task gets executed 3 times.
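The fan-out just described can be sketched as a loop. Python stands in for TaskPipe's Perl here, and `scrape_jobs_list` is a hypothetical stand-in function, not a real TaskPipe task:

```python
# Each output set from one task becomes one input set for the next task
# in line, so the next task runs once per output set.
companies = [
    {"company": "Yahoo", "jobs_url": "http://example.com/jobs?company=yahoo"},
    {"company": "BP",    "jobs_url": "http://example.com/jobs?company=BP"},
    {"company": "Honda", "jobs_url": "http://example.com/jobs?company=honda"},
]

def scrape_jobs_list(inputs):
    # Hypothetical stand-in for the real scraping task: it would fetch
    # inputs["jobs_url"] and return one output set per job found.
    return [{"company": inputs["company"], "job_title": "..."}]

runs = [scrape_jobs_list(row) for row in companies]
print(len(runs))  # Scrape_JobsList gets executed 3 times
```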
Quick Exercise
When the inputs to the second task (`Scrape_JobsList`) are:
{
    company => 'Yahoo',
    location => 'US',
    jobs => '3'
}
What do we expect the outputs from this task to be?
Solution
{
    category => 'IT',
    job_title => 'Coder',
    date_posted => '2 June',
    jd_url => 'http://example.com/job?company=yahoo&job=coder'
},
{
    category => 'Sales',
    job_title => 'Salesman',
    date_posted => '2 June',
    jd_url => 'http://example.com/job?company=yahoo&job=salesman'
},
{
    category => 'Media',
    job_title => 'Journalist',
    date_posted => '28 May',
    jd_url => 'http://example.com/job?company=yahoo&job=journalist'
}
Solution Notes
- Those labels `category`, `job_title`, `date_posted` and `jd_url` are arbitrary. We decide what we are going to call each piece of information - but obviously we need to be consistent. If we are giving the parameter corresponding to `IT`, `Sales`, `Media` etc. the name `category` (with a small "c") and the next task in line is looking for a parameter called `Category` (with a big "C") or `job_category` (or whatever) then you'll end up with some nulls on your database.
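The naming pitfall in that note can be shown in a couple of lines (Python for illustration; a lookup under a mismatched key silently produces nothing):

```python
# A task output set, as in the solution above
output = {"category": "IT", "job_title": "Coder"}

# The next task looks up its expected input key; a mismatched name
# (wrong case, or a different word) silently yields no value at all.
print(output.get("category"))  # IT
print(output.get("Category"))  # None -- a null on your database
```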
Tasks vs xtasks
Hopefully you will have noticed that tasks closer to the bottom of the plan tend to get executed more often than tasks nearer the top - and that's true of TaskPipe plans in general. So in this case, our task specifications form a straight line (ie one after another), but if we look at executed tasks, then these look more like a tree.
In TaskPipe it's often useful to think about "executed tasks" as well as plain tasks. For this reason we shorten "executed tasks" to `xtasks`. A loose definition of an `xtask` is the combination `task + inputs`.

So in our example, `Scrape_JobsList` is a task, but the combination of the task `Scrape_JobsList` plus the input `company=yahoo` is an `xtask`.
Quick Exercise
Can you draw up an "xtask diagram" that corresponds to the plan so
far(ie the 3 sequential tasks Scrape_CompaniesList
,
Scrape_Jobs_list
and
Scrape_JobDescription
)? How many times do we expect
the Scrape_JobDescription
task to be executed (in
total)?
Solution
We expect the `Scrape_JobDescription` task to be executed exactly 10 times - because we know Yahoo has 3 jobs in total, BP has 5 jobs and Honda has 2 jobs. Of course, in a real situation we might not know in advance how many times a particular task was going to get executed.
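The count works out as a simple sum over the xtask tree (a quick Python sketch of the arithmetic):

```python
# One Scrape_CompaniesList xtask fans out into one Scrape_JobsList xtask
# per company; each of those fans out into one Scrape_JobDescription
# xtask per job posting.
jobs_per_company = {"Yahoo": 3, "BP": 5, "Honda": 2}

scrape_jobs_list_xtasks = len(jobs_per_company)
scrape_job_description_xtasks = sum(jobs_per_company.values())

print(scrape_jobs_list_xtasks)        # 3
print(scrape_job_description_xtasks)  # 10
```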
Passing data between tasks
Going back to our plan, we have a first task specification which looks like this:
---
- _name: Scrape_CompaniesList
  url: http://example.com/companies
  headers:
    Referer: http://example.com
And the first set of results it produces look like this:
{
    company => 'Yahoo',
    location => 'US',
    jobs => '3',
    company_url => 'http://example.com/info?company=yahoo',
    jobs_url => 'http://example.com/jobs?company=yahoo'
}
Our second scraping task needs that `jobs_url`. We can tell TaskPipe to take `jobs_url` from the first task and insert it into the `url` parameter in the second task by using the `$this` parameter variable. `$this` means use the input of this task (remember that the input of this task is just the same as the output of the last task).
Let's take a moment to clarify some definitions, which will make discussing TaskPipe plans easier:

- task inputs - We already mentioned these are the same as the outputs from the last task. This is a raw list (ie an array) of sets of data (ie Perl hashrefs).
- task parameters - These are the variables that the task accepts. For example, in our first task, `url` and `headers` are task parameters. `_name` and other labels are not task parameters.
- plan parameter variables - these are words like `$this` which start with a dollar sign (similar to Perl variables), and are used to indicate that the word should be replaced by data coming from some other task (which exact task, and which specific data item, depends on the parameter variable and how this is specified. We will discuss parameter variables in more detail later).
- task pinterp - this may sound like a strange name, but there is a logical reason! "pinterp" really means "parameter that has been interpolated". So e.g. if our task specification declares `somevar: $this` then the value of the `somevar` parameter is just the word `$this`, but the value of the `somevar` pinterp is the data which is actually accepted by the parameter, ie after `$this` has been interpolated.
Here's a practical example of this language use. Our second task specification looks like this:
- _name: Scrape_JobsList
  url: $this{jobs_url}
  headers:
    Referer: http://example.com/companies
In the declaration `url: $this{jobs_url}`, we are using the `$this` parameter variable. `$this{jobs_url}` is the value of the `url` parameter.
Let's run our task against a set of inputs:

{
    company => 'Yahoo',
    location => 'US',
    jobs => '3',
    company_url => 'http://example.com/info?company=yahoo',
    jobs_url => 'http://example.com/jobs?company=yahoo'
}
This will make the `pinterp` value of `url` become `http://example.com/jobs?company=yahoo`. ie the `pinterp` of `url` becomes the value of the input named `jobs_url`. Remember that, in general, `pinterp` values are the things that are absorbed and used in the task.
It is worth mentioning that a `pinterp` value does not have to refer to a parameter which is defined as a parameter variable. For example, in our first task, we declared `url: http://example.com/companies`. In this case there is no parameter variable. We are saying we want `url` to be equal to the fixed value of `http://example.com/companies` whatever happens. This means the value of the parameter is `http://example.com/companies`, but the value of the `pinterp` is also `http://example.com/companies` (since there is no variable in there, it just "interpolates" statically and stays as it is).
A final observation on the subject of task "parameters" vs. task "pinterps": we could talk about the parameters in a task without needing inputs, but we needed a specific set of inputs to be able to discuss pinterps. Putting this another way, "parameters" are really a feature of tasks whereas "pinterps" are a feature of xtasks.
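To tie the parameter/pinterp distinction together, here is a toy interpolation routine in Python. It is a hypothetical sketch, not TaskPipe's actual implementation, and it only handles the `$this` and `$this{key}` forms discussed so far:

```python
import re

def interpolate(param_value, inputs):
    """Turn a parameter value into a pinterp, given a set of inputs."""
    m = re.fullmatch(r"\$this\{(\w+)\}", param_value)
    if m:
        return inputs[m.group(1)]   # $this{key}: take the named input
    if param_value == "$this":
        return inputs               # bare $this: take the whole input set
    return param_value              # fixed value: "interpolates" to itself

inputs = {"jobs_url": "http://example.com/jobs?company=yahoo"}
print(interpolate("$this{jobs_url}", inputs))
# http://example.com/jobs?company=yahoo
print(interpolate("http://example.com/companies", inputs))
# http://example.com/companies
```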
Inputs and Input history
Earlier we said that a "loose definition" of an xtask ("executed task") was the combination of a task and a specific set of inputs. The reason the definition was "loose" was because we neglected to mention input history. When a task completes and invokes the next task in line, it not only hands over its outputs (which become the inputs of the next task, remember) but it also hands over a complete history of the values of all inputs which have taken place beforehand. So when any task is invoked for execution, it is aware of everything that has happened previously.
The mechanics of this are not something you generally need to worry about when creating a scraper using TaskPipe. You just need to know how to instruct TaskPipe to grab values from earlier tasks using parameter variables.
One way of doing this may be seen in the third task specification of our example:
- _name: Scrape_JobDescription
  url: $this{jd_url}
  headers:
    Referer: $this[1]{jobs_url}
See the `[1]` between `$this` and `{jobs_url}` in the `Referer` declaration? That `[1]` is called a match offset, and indicates that instead of using the inputs of this task, TaskPipe should count one extra task back and take the value from those inputs instead. So in this case `$this[1]{jobs_url}` means "take the value of the input named `jobs_url` that was fed to the `Scrape_JobsList` task".
This is, of course, the same value that `Scrape_JobsList` accepted into the parameter `url`. It makes sense to arrange the `Referer` header this way; when you are clicking through webpages in a browser, the `Referer` is almost always the last page you visited. So it makes sense to keep `Referer` one step behind `url` in your scraping tasks. For a series of back-to-back scraping tasks, this effect can be achieved by specifying `$this` for `url` and `$this[1]` for `Referer`.
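The match offset can be pictured as indexing backwards into the input history. Again this is a hypothetical Python illustration of the idea, not TaskPipe's actual mechanics:

```python
# Each xtask carries the history of input sets handed down the chain.
history = [
    {"url": "http://example.com/companies"},                       # oldest
    {"jobs_url": "http://example.com/jobs?company=yahoo"},         # one back
    {"jd_url": "http://example.com/job?company=yahoo&job=coder"},  # current
]

def this(history, offset=0, key=None):
    """Sketch of $this[offset]{key}: count `offset` extra steps back."""
    inputs = history[-1 - offset]
    return inputs[key] if key is not None else inputs

print(this(history, 0, "jd_url"))    # the url for Scrape_JobDescription
print(this(history, 1, "jobs_url"))  # one step back: the Referer value
```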
Quick Exercise
Suppose, somewhere in the middle of your plan, you were going to run the scraping tasks `Scrape_Something`, `Scrape_SomethingElse` and `Scrape_SomethingFurtherStill` (in that order, one after another). Suppose all of your tasks (including the ones that occur before `Scrape_Something`) are designed so they each output the url that the next scraping task is going to use - and they all use `url` as the name of the output. Write down this part of the plan. ie write down the 3 task specifications, including the task name, `url` and `Referer` header for each task, together with the relevant parameter values, and parameter variables where appropriate.
Show solution
# ...
- _name: Scrape_Something
  url: $this
  headers:
    Referer: $this[1]{url}
- _name: Scrape_SomethingElse
  url: $this
  headers:
    Referer: $this[1]{url}
- _name: Scrape_SomethingFurtherStill
  url: $this
  headers:
    Referer: $this[1]{url}
# ...
More about Parameter Variables
Let's go back to our `url: $this{jobs_url}` declaration. We noted that the `$this` parameter variable means "take the value from the inputs of this task". You may have already gathered that adding the `{jobs_url}` suffix tells TaskPipe "use the input named `jobs_url`".
In this case we are putting the value of an input named `jobs_url` into a parameter named `url` - the name of the parameter is different to the name of the input, so we need to explicitly tell TaskPipe which input to use. However, if we were expecting an input whose name was the same as the parameter - so e.g. our input was also named simply `url` (instead of `jobs_url`) - then we could have omitted `{jobs_url}` completely and just written `url: $this`.
Written in complete form, parameter variable declarations generally involve several parts – but most are optional. Those parts are (usually) as follows:
# general format:
$<label_key>:<label_value>(<match_count>)[<match_offset>]{<input_key>}

# example:
url: $name:Scrape_Companies(2)[1]{jobs_url}
Here's a summary of what each of those parts means:
part of parameter variable | meaning | Required or optional? |
---|---|---|
label_key | the name of the parameter variable, e.g. `this` or `name` | always required |
label_value | the value the label should match. In the example above, we are telling TaskPipe to look for the task which has a `_name` of `Scrape_Companies` | Required in all cases except for `$this` |
match_count | if more than one task matches the label, `match_count` specifies which match to use (counting from zero). In the example above, TaskPipe will look for the third task which has a `_name` of `Scrape_Companies` | Optional |
match_offset | once a task matching the label has been found, count this many additional tasks further back and take the inputs from there instead. In the example above, `[1]` tells TaskPipe to use the inputs of the task one step before the matched task | Optional |
input_key | The other parts of the parameter variable identify which set of inputs to use. The final step is to provide the name of the desired specific input within the set. This is the `input_key`. In the example above, TaskPipe looks for an input named `jobs_url` | Optional |
End of part one
Congratulations! You have reached the end of part 1 of the TaskPipe tutorial.
Unfortunately Part 2 of the series is not yet available - but it's coming soon, so watch this space! Alternatively get in touch if you have questions.
Have a great scrape!