TaskPipe Scraping Framework

Introduction

TaskPipe is a framework for building scrapers and crawlers, written in Perl 5.

TaskPipe was created to take as much of the effort as possible out of building directory style scraping systems. Such systems can be assembled quickly as a series of modular tasks, and tasks can be rearranged or reused elsewhere.

TaskPipe aims to be lightweight in terms of its own footprint, but heavyweight in terms of capability, allowing (depending on settings):

a desired number of parallel download threads to be specified
auto launch of 1 Tor instance per thread
the collection and use of open proxies
auto page rendering via PhantomJS

A command line tool is included to assist with quick project creation, and project management.

The main purpose of this project is to act as the data gathering component of a web analytics software package. We are releasing this as open source under the GNU General Public License (GPL) v3.0. The usual disclaimer applies: this is experimental software in a relatively early stage of development; use at your own risk.

Note that what follows is the first part of a tutorial series, covering some basic TaskPipe concepts. The complete series is not yet available, but coming soon - please bear with us, and watch this space!

Overview

TaskPipe was really designed for those instances where you want to scrape online data arranged in the format of an online directory, and create your own cross referenced database of the results. For example, there may be some kind of list page which you want to refer to initially; each list entry may provide a link that points to a page with a sublist; and each sublist entry might point to a detail page.

Consider the accompanying diagram, which shows a simple scenario where a website is displaying some basic information about a list of companies. In our example each item in the list has a link to a company detail page, and a link to a sublist page, showing job postings that are associated with the company. Each item on the jobs sublist has a further detail page.

Quick Exercise

Unfortunately TaskPipe can't design your table schema for you - so it's important you can already do this in a way that makes sense for the data you are trying to collect. As a quick exercise, try writing down a database schema for our example situation. Specify which tables you would create, and for each table which columns you would include. You should pick up all of the available data. Then refer to our suggested schema in the solution below.

Solution

table	columns
company	id name location description employees
job_post	id category job_title job_description salary commitment date_posted

Solution Notes

Actually, this probably isn't the best way to design a schema for this scenario. If you spotted that jobs are not really the same as job posts, and you designed your schema with a separate table for each (so your job table would contain things like job_title, job_description and salary, whereas your job_post table would only contain things specific to the post itself, such as date_posted), then this is even better. Good work! Going one step further, having a specific table called something like job_category, which would be a dedicated list of allowed job categories, might be a smart idea for the long term. You could then link category to job via a foreign key category_id on job. If you noticed both these things, and your schema has a total of 4 tables, then that's great. However, in the interests of keeping this example simple, we will stick with our basic schema, and pretend we only care about having company and job_post tables.

Another thing you might have done differently is not to relate tables using an id column. For example, you may have used the rationale that we expect company names to be unique, and thus defined name as a primary key on your company table. This is a legitimate approach provided we are sure we will never encounter distinct companies with the same name. Again, since this is just an example, we won't agonise over it too much.

TaskPipe Plan Basics

Let's say we have created these tables in a MySQL database. TaskPipe is designed to work with any database that supports SQLSTATE - however it was built using MySQL, and not much testing has yet taken place with other database systems. It is probably safer to use MySQL if possible.
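As a rough illustration of the basic two-table schema, here is a minimal Python sketch that creates the tables in an in-memory SQLite database. SQLite is only a stand-in so the sketch runs without a server; the column types are assumptions, and so is the company_id column on job_post, which is not part of the suggested schema but is one plausible way to cross reference posts to companies.

```python
import sqlite3

# SQLite stands in for MySQL here so the sketch runs without a database server.
# Column types are assumptions; the tutorial only names the columns.
SCHEMA = """
CREATE TABLE company (
    id          INTEGER PRIMARY KEY,
    name        TEXT,
    location    TEXT,
    description TEXT,
    employees   INTEGER
);
CREATE TABLE job_post (
    id              INTEGER PRIMARY KEY,
    company_id      INTEGER REFERENCES company(id),  -- assumption: not in the suggested schema
    category        TEXT,
    job_title       TEXT,
    job_description TEXT,
    salary          TEXT,
    commitment      TEXT,
    date_posted     TEXT
);
"""


def create_schema(conn):
    """Create both tables on the given connection."""
    conn.executescript(SCHEMA)


def table_columns(conn, table):
    """Return the column names of a table, in declaration order."""
    # PRAGMA table_info yields one row per column; index 1 is the column name.
    return [row[1] for row in conn.execute(f"PRAGMA table_info({table})")]


conn = sqlite3.connect(":memory:")
create_schema(conn)
print(table_columns(conn, "company"))
print(table_columns(conn, "job_post"))
```

For a real TaskPipe project the equivalent CREATE TABLE statements would be run against MySQL instead.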

We tell TaskPipe how to pick up the data by providing a plan. A plan is a YAML file which basically outlines the tasks which are going to be performed, which order they are going to happen in, and which data is passed between them. It makes sense for our scrape to start with the "companies list" page, so the first item in our plan might look like this:

---
-   _name: Scrape_CompaniesList
    url: http://example.com/companies
    headers:
        Referer: http://example.com/

The three dashes --- at the top are the YAML way of marking the start of a document. Below that we specify our first task as a list element (using the dash - to indicate a list element). If you are not familiar with YAML markup, you can refer to the YAML documentation – or alternatively, just accept that a dash indicates a list element, and keep reading!

In TaskPipe scraping tasks generally require the URL of the page to scrape, and a Referer header. Carefully specifying a Referer header helps to make sure the scraper proceeds between pages in a way that more closely resembles a human, and thus is less likely to raise red flags on the target website. However, you can adjust settings so your scraping task does not require a Referer header. Or indeed you can create your own custom task which takes whatever parameters you decide (but one step at a time…)

You'll notice that _name begins with an underscore. This is because an underscore indicates that it is a label. A label is something that allows tasks to refer to each other (which is usually the point of labels!) However, in general a TaskPipe label also has the following requirement: changing or removing the label does not affect the operation of the task. Consider the following task specification:

---
-   _name: Scrape_CompaniesList
    _id: my_id
    url: http://example.com/companies
    headers:
        Referer: http://example.com

You'll note the extra _id parameter. Because this starts with an underscore, TaskPipe knows to ignore it e.g. when caching results. So it knows the added _id label will make no difference to the output for a given input.

The only exception to this rule is the _name label, which is special because it works both as a label (ie it can be used to refer to tasks) and it also affects the task output.
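To make the label rule concrete, here is a small Python sketch of a cache key that skips every underscore-prefixed key except _name. This is an illustration of the rule as described, not TaskPipe's actual caching code; the cache_key function and the use of JSON for the key are assumptions.

```python
import json


def cache_key(task_spec):
    """Build a key from a task spec, skipping all labels except _name."""
    significant = {
        key: value
        for key, value in task_spec.items()
        if not key.startswith("_") or key == "_name"
    }
    # sort_keys makes the key independent of the order the parameters were written in.
    return json.dumps(significant, sort_keys=True)


plain = {
    "_name": "Scrape_CompaniesList",
    "url": "http://example.com/companies",
    "headers": {"Referer": "http://example.com"},
}
labelled = dict(plain, _id="my_id")                    # extra label only
renamed = dict(plain, _name="Scrape_SomethingElse")    # different task name

print(cache_key(plain) == cache_key(labelled))  # True: _id is ignored
print(cache_key(plain) == cache_key(renamed))   # False: _name affects the output
```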

(Actually we couldn't decide if _name should get an underscore. Will this change in future? Maybe! Are we making this stuff up as we go along? Absolutely!)

Building our plan

We'll start out by creating a plan for just the right side of the diagram – ie the "company list page", the "company jobs list page" and the "job description" page. These things happen sequentially, so it's no surprise that we can put our tasks in a line:

---
-   _name: Scrape_CompaniesList
    url: http://example.com/companies
    headers:
        Referer: http://example.com

-   _name: Scrape_JobsList
    url: $this{jobs_url}
    headers:
        Referer: http://example.com/companies

-   _name: Scrape_JobDescription
    url: $this{jd_url}
    headers:
        Referer: $this[1]{jobs_url}

In general a task takes a single set of inputs, and generates a list of (sets of) outputs. So in general it is a one to many operation. For example, when we scrape example.com/companies we provide the URL and the Referer header (a single set of inputs) and we hope that the scraping task produces a list of outputs which look something like:

{
    company => 'Yahoo',
    location => 'US',
    jobs => '3',
    company_url => 'http://example.com/info?company=yahoo',
    jobs_url => 'http://example.com/jobs?company=yahoo'
},

{
    company => 'BP',
    location => 'UK',
    jobs => '5',
    company_url => 'http://example.com/info?company=BP',
    jobs_url => 'http://example.com/jobs?company=BP'
},

{
    company => 'Honda',
    location => 'Japan',
    jobs => '2',
    company_url => 'http://example.com/info?company=honda',
    jobs_url => 'http://example.com/jobs?company=honda'
}

So our scraping task somehow picks up the visible information (company, location, jobs) as well as the target URLs – which will probably be in the href attribute of <a> tags. (If you are wondering how exactly the Scrape_CompaniesList task produces this output, we'll get to that in due course. Hold on to your hat!)

So let's say our Scrape_CompaniesList task produces the output above. For each set of outputs the next task in line gets executed. ie the outputs of Scrape_CompaniesList get fed into Scrape_JobsList, and in this case the Scrape_JobsList task gets executed 3 times.
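The one-to-many behaviour can be sketched in a few lines of Python. This is a toy model, not TaskPipe's implementation: the run_plan function and the stub tasks are hypothetical, and the stubs return canned data instead of downloading anything.

```python
from collections import Counter


def run_plan(tasks, inputs, counts):
    """Run tasks[0] on one set of inputs, then feed each output to the rest of the plan."""
    if not tasks:
        return
    task = tasks[0]
    counts[task.__name__] += 1
    for output in task(inputs):
        run_plan(tasks[1:], output, counts)


def scrape_companies_list(inputs):
    # Stub: a real task would download and parse inputs["url"].
    return [
        {"company": "Yahoo", "jobs_url": "http://example.com/jobs?company=yahoo"},
        {"company": "BP", "jobs_url": "http://example.com/jobs?company=BP"},
        {"company": "Honda", "jobs_url": "http://example.com/jobs?company=honda"},
    ]


def scrape_jobs_list(inputs):
    # Stub: returns no outputs, so nothing further runs after it.
    return []


counts = Counter()
run_plan(
    [scrape_companies_list, scrape_jobs_list],
    {"url": "http://example.com/companies"},
    counts,
)
print(counts["scrape_companies_list"])  # 1
print(counts["scrape_jobs_list"])       # 3
```

The first task runs once and yields three sets of outputs, so the second task runs once per set.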

Quick Exercise

When the inputs to the second task (Scrape_JobsList) are

{
    company => 'Yahoo',
    location => 'US',
    jobs => '3'
}

What do we expect the outputs from this task to be?

Solution

{
    category => 'IT',
    job_title => 'Coder',
    date_posted => '2 June',
    jd_url => 'http://example.com/job?company=yahoo&job=coder'
},

{
    category => 'Sales',
    job_title => 'Salesman',
    date_posted => '2 June',
    jd_url => 'http://example.com/job?company=yahoo&job=salesman'
},

{
    category => 'Media',
    job_title => 'Journalist',
    date_posted => '28 May',
    jd_url => 'http://example.com/job?company=yahoo&job=journalist'
}

Solution Notes

The labels category, job_title, date_posted and jd_url are arbitrary. We decide what we are going to call each piece of information – but obviously we need to be consistent. If we are giving the parameter corresponding to IT, Sales, Media etc. the name category (with a small "c") and the next task in line is looking for a parameter called Category (with a big "C") or job_category (or whatever) then you'll end up with some nulls in your database.

Tasks vs xtasks

Hopefully you will have noticed that tasks closer to the bottom of the plan tend to get executed more often than tasks nearer the top - and that's true of TaskPipe plans in general. So in this case, our task specifications form a straight line (ie one after another), but if we look at executed tasks, then these look more like a tree.

In TaskPipe it's often useful to think about "executed tasks" as well as plain tasks. For this reason we shorten "executed tasks" to xtasks. A loose definition of an xtask is the combination task + inputs. So in our example, Scrape_JobsList is a task, but the combination of the task Scrape_JobsList plus the input company=yahoo is an xtask.

Quick Exercise

Can you draw up an "xtask diagram" that corresponds to the plan so far (ie the 3 sequential tasks Scrape_CompaniesList, Scrape_JobsList and Scrape_JobDescription)? How many times do we expect the Scrape_JobDescription task to be executed (in total)?

Solution

We expect the Scrape_JobDescription task to be executed exactly 10 times – because we know Yahoo has 3 jobs in total, BP has 5 jobs and Honda has 2 jobs. Of course, in a real situation we might not know in advance how many times a particular task was going to get executed.
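The arithmetic behind that count can be written out as a short Python sketch. The xtask_counts function is hypothetical and only applies to this three-task plan, where the jobs figure on the companies list tells us how many entries each jobs list will have.

```python
def xtask_counts(jobs_per_company):
    """Count executions per task for the three-task plan, given jobs listed per company."""
    return {
        "Scrape_CompaniesList": 1,                                # one starting page
        "Scrape_JobsList": len(jobs_per_company),                 # once per company
        "Scrape_JobDescription": sum(jobs_per_company.values()),  # once per job
    }


counts = xtask_counts({"Yahoo": 3, "BP": 5, "Honda": 2})
print(counts["Scrape_JobDescription"])  # 10
print(sum(counts.values()))             # 14 xtasks in the whole tree
```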

Passing data between tasks

Going back to our plan, we have a first task specification which looks like this:

---
-   _name: Scrape_CompaniesList
    url: http://example.com/companies
    headers:
        Referer: http://example.com

And the first set of results it produces look like this:

{
    company => 'Yahoo',
    location => 'US',
    jobs => '3',
    company_url => 'http://example.com/info?company=yahoo',
    jobs_url => 'http://example.com/jobs?company=yahoo'
}

Our second scraping task needs that jobs_url. We can tell TaskPipe to take jobs_url from the first task and insert it into the url parameter in the second task by using the $this parameter variable.

$this means use the input of this task (remember that the input of this task is just the same as the output of the last task).

Let's take a moment to clarify some definitions, which will make discussing TaskPipe plans easier:

task inputs – We already mentioned these are the same as the outputs from the last task. This is a raw list (ie an array) of sets of data (ie Perl hashrefs).
task parameters – These are the variables that the task accepts. For example, in our first task, url and headers are task parameters. _name and other labels are not task parameters.
plan parameter variables – these are words like $this which start with a dollar sign (similar to Perl variables), and are used to indicate that the word should be replaced by data coming from some other task (which exact task, and which specific data item, depends on the parameter variable and how this is specified. We will discuss parameter variables in more detail later).
task pinterp – this may sound like a strange name, but there is a logical reason! "pinterp" really means "parameter that has been interpolated". So e.g. if our task specification declares somevar: $this then the value of the somevar parameter is just the word $this, but the value of the somevar pinterp is the data which is actually accepted by the parameter, ie after $this has been interpolated.

Here's a practical example of this language use. Our second task specification looks like this:

-   _name: Scrape_JobsList
    url: $this{jobs_url}
    headers:
        Referer: http://example.com/companies

In the declaration url: $this{jobs_url}, we are using the $this parameter variable. $this{jobs_url} is the value of the url parameter.

Let's run our task against a set of inputs:

{
    company => 'Yahoo',
    location => 'US',
    jobs => '3',
    company_url => 'http://example.com/info?company=yahoo',
    jobs_url => 'http://example.com/jobs?company=yahoo'
}

This will make the pinterp value of url become http://example.com/jobs?company=yahoo. ie the pinterp of url becomes the value of the input named jobs_url. Remember that, in general pinterp values are the things that are absorbed and used in the task.

It is worth mentioning that a pinterp value does not have to refer to a parameter which is defined as a parameter variable. For example, in our first task, we declared url: http://example.com/companies. In this case there is no parameter variable. We are saying we want url to be equal to the fixed value of http://example.com/companies, whatever the inputs are. This means the value of the parameter is http://example.com/companies but the value of the pinterp is also http://example.com/companies (since there is no variable in there, it just "interpolates" statically and stays as it is).

A final observation on the subject of task "parameters" vs. task "pinterp": we could talk about the parameters in the task without needing inputs, but we needed a specific set of inputs to be able to discuss pinterp. Putting this another way, "parameters" are really a feature of tasks whereas "pinterp" are a feature of xtasks.
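The parameter/pinterp distinction can be illustrated with a short Python sketch. This is not TaskPipe's interpolation code: the pinterp function is hypothetical, and it only handles the $this{input_key} form and fixed values discussed so far.

```python
import re

# Assumed syntax: only the "$this{input_key}" form is handled in this sketch.
THIS_VAR = re.compile(r"^\$this\{(\w+)\}$")


def pinterp(param_value, inputs):
    """Return the interpolated value of one parameter for one set of inputs."""
    match = THIS_VAR.match(param_value)
    if match is None:
        # No parameter variable: the value "interpolates" to itself.
        return param_value
    return inputs[match.group(1)]


inputs = {
    "company": "Yahoo",
    "location": "US",
    "jobs": "3",
    "company_url": "http://example.com/info?company=yahoo",
    "jobs_url": "http://example.com/jobs?company=yahoo",
}

print(pinterp("$this{jobs_url}", inputs))               # http://example.com/jobs?company=yahoo
print(pinterp("http://example.com/companies", inputs))  # http://example.com/companies
```

Note that the function needs the inputs argument to produce anything, which mirrors the point that pinterp values belong to xtasks rather than tasks.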

Inputs and Input history

Earlier we said that a "loose definition" of an xtask ("executed task") was the combination of a task and a specific set of inputs. The reason the definition was "loose" was because we neglected to mention input history. When a task completes and invokes the next task in line, it not only hands over its outputs (which become the inputs of the next task, remember) but it also hands over a complete history of the values of all inputs which have taken place beforehand. So when any task is invoked for execution, it is aware of everything that has happened previously.

The mechanics of this are not something you generally need to worry about when creating a scraper using TaskPipe. You just need to know how to instruct TaskPipe to grab values from earlier tasks using parameter variables.

One way of doing this may be seen in the third task specification of our example:

-   _name: Scrape_JobDescription
    url: $this{jd_url}
    headers:
        Referer: $this[1]{jobs_url}

See the [1] between $this and {jobs_url} in the Referer declaration? That [1] is called a match offset, and indicates that instead of using the inputs of this task, count one extra task back and take the value from those inputs instead. So in this case $this[1]{jobs_url} means "take the value of the input named jobs_url that was fed to the Scrape_JobsList task".

This is, of course, the same value that Scrape_JobsList accepted into the parameter url. It makes sense to arrange the Referer header this way; when you are clicking through webpages in a browser, the Referer is almost always the last page you visited. So it makes sense to keep Referer one step behind url in your scraping tasks. For a series of back-to-back scraping tasks, this effect can be achieved by specifying $this for url and $this[1] for Referer.
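One way to picture the match offset is as an index into the input history. The Python sketch below is a simplified model under that assumption, not TaskPipe's own code: resolve_this is hypothetical, and the history is modelled as a plain list of input sets with the current task's inputs last.

```python
import re

# Assumed syntax: "$this{key}" or "$this[offset]{key}".
THIS_VAR = re.compile(r"^\$this(?:\[(\d+)\])?\{(\w+)\}$")


def resolve_this(param_value, history):
    """Resolve a $this variable against the input history.

    history lists the inputs of each xtask so far, oldest first;
    history[-1] holds the inputs of the current task.
    """
    match = THIS_VAR.match(param_value)
    if match is None:
        return param_value
    offset = int(match.group(1) or 0)  # no [n] means "this task"
    return history[-1 - offset][match.group(2)]


history = [
    {},  # inputs of Scrape_CompaniesList (nothing came before it)
    {"company": "Yahoo", "jobs_url": "http://example.com/jobs?company=yahoo"},
    {"job_title": "Coder", "jd_url": "http://example.com/job?company=yahoo&job=coder"},
]

print(resolve_this("$this{jd_url}", history))       # url for Scrape_JobDescription
print(resolve_this("$this[1]{jobs_url}", history))  # Referer: the page visited one step earlier
```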

Quick Exercise

Suppose, somewhere in the middle of your plan, you were going to run the scraping tasks Scrape_Something, Scrape_SomethingElse and Scrape_SomethingFurtherStill (in that order, one after another). Suppose all of your tasks (including the ones that occur before Scrape_Something) are designed so they each output the url that the next scraping task is going to use – and they all use url as the name of the output. Write down this part of the plan. ie write down the 3 task specifications, including the task name, url and Referer header for each task, together with the relevant parameter values, and parameter variables where appropriate.

Solution

# ...

-   _name: Scrape_Something
    url: $this
    headers:
        Referer: $this[1]{url}

-   _name: Scrape_SomethingElse
    url: $this
    headers:
        Referer: $this[1]{url}

-   _name: Scrape_SomethingFurtherStill
    url: $this
    headers:
        Referer: $this[1]{url}

# ...

More about Parameter Variables

Let's go back to our url: $this{jobs_url} declaration. We noted that the $this parameter variable means "take the value from the inputs of this task". You may have already gathered that adding the {jobs_url} suffix tells TaskPipe "use the input named jobs_url".

In this case we are putting the value of an input named jobs_url into a parameter named url – the name of the parameter is different to the name of the input, so we need to explicitly tell TaskPipe which input to use. However, if we were expecting an input whose name was the same as the parameter – so e.g. our input was also named simply url (instead of jobs_url) – then we could have omitted {jobs_url} completely and just written url: $this.

Written in complete form, parameter variable declarations generally involve several parts – but most are optional. Those parts are (usually) as follows:

# general format:
$<label_key>:<label_value>(<match_count>)[<match_offset>]{<input_key>}

# example:
url: $name:Scrape_Companies(2)[1]{jobs_url}

Here's a summary of what each of those parts means:

part of parameter variable	meaning	Required or optional?
label_key	The name of the parameter variable, e.g. `this` or `name`.	Always required
label_value	At the time of writing `label_value` is required for all parameter variables except `$this`. Most of the time the `label_key` tells TaskPipe which label to use to identify the task (e.g. `_name` or `_id`). We then narrow down to the task where that label has the value of `label_value`. In the example above, we are telling TaskPipe to look for the task which has a `_name` of `Scrape_Companies`.	Required in all cases except for `$this`
match_count	If `label_key:label_value` matches more than one task, TaskPipe will take the last task that matched (ie the first task that matches tracing upwards through the plan). However, specifying `match_count` explicitly tells TaskPipe which of the matches to use. `match_count` is zero based, so a `match_count` of 1 means use the second matching task. In the example above, TaskPipe will look for the third task which has a `_name` of `Scrape_Companies`.	Optional
match_offset	Once a task matching `<label_key>:<label_value>(<match_count>)` has been identified, `match_offset` can be used to count back a further number of individual tasks to get the final match. In the example above `match_offset` is set to 1. The `$name` parameter variable normally points to the outputs of the matching task. Adding in a `match_offset` of 1 means the match now points to the inputs of that task.	Optional
input_key	The other parts of the parameter variable identify which set of inputs to use. The final step is to provide the name of the desired specific input within the set. This is the `input_key`. If `input_key` is omitted, TaskPipe will assume the name of the input is the same as the name of the parameter in question. In the example above, TaskPipe looks for an input named `jobs_url` to insert into the `url` parameter. However, if `{jobs_url}` had been omitted, TaskPipe would look for an input named `url`.	Optional
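To see how the parts fit together, here is a Python sketch that splits a parameter variable into the five parts of the general format. The regular expression is an assumption based on that format, and parse_param_var is a hypothetical helper; TaskPipe's own parser may accept more or less than this.

```python
import re

# Assumed grammar: $<label_key>:<label_value>(<match_count>)[<match_offset>]{<input_key>}
# with every part after label_key optional.
PARAM_VAR = re.compile(
    r"^\$(?P<label_key>\w+)"
    r"(?::(?P<label_value>\w+))?"
    r"(?:\((?P<match_count>\d+)\))?"
    r"(?:\[(?P<match_offset>\d+)\])?"
    r"(?:\{(?P<input_key>\w+)\})?$"
)


def parse_param_var(text):
    """Split a parameter variable into its parts, or return None for a fixed value."""
    match = PARAM_VAR.match(text)
    if match is None:
        return None
    parts = match.groupdict()
    # Convert the numeric parts; omitted parts stay as None.
    for key in ("match_count", "match_offset"):
        if parts[key] is not None:
            parts[key] = int(parts[key])
    return parts


print(parse_param_var("$name:Scrape_Companies(2)[1]{jobs_url}"))
print(parse_param_var("$this"))
print(parse_param_var("http://example.com/companies"))  # None: not a parameter variable
```

Parts that are omitted come back as None, which leaves it to the caller to apply the defaults described in the table (for example, falling back to the parameter name when input_key is missing).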

End of part one

virtual.blue

Software House