TaskPipe is a framework for building scrapers and crawlers, written in Perl5.

TaskPipe was created to take as much of the effort as possible out of building directory style scraping systems. Such systems can be assembled quickly as a series of modular tasks, and tasks can be rearranged or reused elsewhere.

TaskPipe aims to be lightweight in terms of its own footprint, but heavyweight in terms of capability, allowing (depending on settings):

  • a desired number of parallel download threads to be specified
  • auto launch of 1 TOR instance per thread
  • the collection and use of open proxies
  • auto page rendering via PhantomJS

A command line tool is included to assist with quick project creation and project management.

The main purpose of this project is to act as the data gathering component of a web analytics software package. We are releasing this open source via the GNU General Public License (GPL) v3.0. The usual disclaimer applies: this is experimental software in a relatively early stage of development; use at your own risk.

Note that what follows is the first part of a tutorial series, covering some basic TaskPipe concepts. The complete series is not yet available, but coming soon – please bear with us, and watch this space!


TaskPipe was really designed for those instances where you want to scrape data arranged in the format of an online directory, and create your own cross-referenced database of the results. For example, there may be some kind of list page which you want to refer to initially; each list entry may provide a link that points to a page with a sublist; and each sublist entry might point to a detail page.

Consider the accompanying diagram, which shows a simple scenario where a website is displaying some basic information about a list of companies. In our example each item in the list has a link to a company detail page, and a link to a sublist page, showing job postings that are associated with the company. Each item on the jobs sublist has a further detail page.

Quick Exercise

Unfortunately TaskPipe can’t design your table schema for you – so it’s important you can already do this in a way that makes sense for the data you are trying to collect. As a quick exercise, try writing down a database schema for our example situation. Specify which tables you would create, and for each table which columns you would include. You should pick up all of the available data. Then refer to our suggested schema in the solution below.

Solution
company table columns:
  • id
  • name
  • location
  • description
  • employees

job_posts table columns:
  • id
  • category
  • job_title
  • job_description
  • salary
  • commitment
  • date_posted


  1. Actually this probably isn’t the best way to design a schema for this scenario. If you spotted that really jobs are not the same as job posts, and you designed your schema with two separate tables for each of these (so your jobs table would contain things like job_title, job_description and salary, whereas your job_posts table would only contain things specific to the post itself, such as post_date) then this is even better. Good work!

    Going one step further, having a specific table called something like job_category – which would be a dedicated list of allowed job categories – might be a smart idea for the long term. You could then link category to job via a foreign key category_id on job.

    If you noticed both these things, and your schema has a total of 4 tables then that’s great. However, in the interests of keeping this example simple, we will stick with our basic schema, and pretend we only care about having company and job_posts tables.

  2. Another thing you might have done differently is not to relate tables using an id column. For example, you may have used the rationale that we expect company names to be unique, and thus defined name as a primary key on your company table. This is a legitimate approach provided we are sure we will never encounter distinct companies with the same name. Again, since this is just an example, we won’t agonise over it too much.

TaskPipe Plan Basics

Let’s say we have created these tables in a MySQL database. TaskPipe is designed to work with any database that supports SQLSTATE – however it was built using MySQL, and not much testing has yet taken place with other database systems. It is probably safer to use MySQL if possible.

We tell TaskPipe how to pick up the data by providing a plan. A plan is a YAML file which basically outlines the tasks which are going to be performed, which order they are going to happen in, and which data is passed between them. It makes sense for our scrape to start with the “companies list” page, so the first item in our plan might look like this:
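Something along these lines (a minimal sketch – the Referer value here is invented for illustration):

```yaml
---
- _name: Scrape_CompaniesList
  url: http://example.com/companies
  headers:
    Referer: http://example.com/
```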

The three dashes --- at the top are the YAML way of marking the start of the file. Below that we specify our first task as a list element (using the dash - to indicate a list element). If you are not familiar with YAML markup, you can refer to the documentation – or alternatively, just accept that a dash indicates a list element, and keep reading!

In TaskPipe scraping tasks generally require the URL of the page to scrape, and a Referer header. Carefully specifying a Referer header helps to make sure the scraper proceeds between pages in a way that more closely resembles a human, and thus is less likely to raise red flags on the target website. However, you can adjust settings so your scraping task does not require a Referer header. Or indeed you can create your own custom task which takes whatever parameters you decide (but one step at a time…)

You’ll notice that _name begins with an underscore. This is because an underscore indicates that it is a label. A label is something that allows tasks to refer to each other (which is usually the point of labels!) However, in general a TaskPipe label also has the following requirement: changing or removing the label does not affect the operation of the task. Consider the following task specification:
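For example, a sketch of a task specification carrying an extra label (the _id value here is made up):

```yaml
- _name: Scrape_CompaniesList
  _id: first_scrape
  url: http://example.com/companies
  headers:
    Referer: http://example.com/
```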

You’ll note the extra _id parameter. Because this starts with an underscore, TaskPipe knows to ignore it e.g. when caching results. So it knows the added _id label will make no difference to the output for a given input.

The only exception to this rule is the _name label, which is special because it works both as a label (ie it can be used to refer to tasks) and it also affects the task output.

(Actually we couldn’t decide if _name should get an underscore. Will this change in future? Maybe! Are we making this stuff up as we go along? Absolutely!)

Building Our Plan

We’ll start out by creating a plan for just the right side of the diagram – ie the “company list page”, the “company jobs list page” and the “job description” page. These things happen sequentially, so it’s no surprise that we can put our tasks in a line:
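A sketch of such a plan (the Referer values and the parameter variables on the later tasks are illustrative – they are explained in the sections that follow):

```yaml
---
- _name: Scrape_CompaniesList
  url: http://example.com/companies
  headers:
    Referer: http://example.com/

- _name: Scrape_JobsList
  url: $this{jobs_url}
  headers:
    Referer: http://example.com/companies

- _name: Scrape_JobDescription
  url: $this{job_url}
  headers:
    Referer: $this[1]{jobs_url}
```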

In general a task takes a single set of inputs, and generates a list of (sets of) outputs. So in general it is a one to many operation. For example, when we scrape example.com/companies we provide the URL and the Referer header (a single set of inputs) and we hope that the scraping task produces a list of outputs which look something like:
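For our example company list, that output might look something like this (the locations and exact URLs are invented for illustration):

```yaml
- company: yahoo
  location: Sunnyvale
  jobs: 3
  jobs_url: http://example.com/jobs?company=yahoo
- company: bp
  location: London
  jobs: 5
  jobs_url: http://example.com/jobs?company=bp
- company: honda
  location: Tokyo
  jobs: 2
  jobs_url: http://example.com/jobs?company=honda
```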

So our scraping task somehow picks up the visible information (company, location, jobs) as well as the target URLs – which will probably be in the href attribute of <a> tags. (If you are wondering how exactly the Scrape_CompaniesList task produces this output, we’ll get to that in due course. Hold on to your hat!)

So let’s say our Scrape_CompaniesList task produces the output above. For each set of outputs the next task in line gets executed. ie the outputs of Scrape_CompaniesList get fed into Scrape_JobsList, and in this case the Scrape_JobsList task gets executed 3 times.

Quick Exercise

When the inputs to the second task (Scrape_JobsList) are
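for example, the yahoo set (the location value is invented for illustration):

```yaml
company: yahoo
location: Sunnyvale
jobs: 3
jobs_url: http://example.com/jobs?company=yahoo
```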

What do we expect the outputs from this task to be?

Solution
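A sketch of the expected outputs – one set per job listed on the yahoo jobs page (the job titles, dates and URLs are invented; the category values follow the notes below):

```yaml
- category: IT
  job_title: Systems Administrator
  date_posted: 2017-01-10
  job_url: http://example.com/job?id=101
- category: Sales
  job_title: Account Manager
  date_posted: 2017-01-12
  job_url: http://example.com/job?id=102
- category: Media
  job_title: Content Editor
  date_posted: 2017-01-15
  job_url: http://example.com/job?id=103
```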

Solution Notes

  1. The labels category, job_title, date_posted and job_url are arbitrary. We decide what we are going to call each piece of information – but obviously we need to be consistent. If we give the parameter corresponding to IT, Sales, Media etc. the name category (with a small “c”) and the next task in line is looking for a parameter called Category (with a big “C”) or job_category (or whatever), then we’ll end up with some nulls in our database.

Tasks Vs XTasks

Hopefully you will have noticed that tasks closer to the bottom of the plan tend to get executed more often than tasks nearer the top – and that’s true of TaskPipe plans in general. So in this case, our task specifications form a straight line (ie one after another), but if we look at executed tasks, then these look more like a tree.

In TaskPipe it’s often useful to think about “executed tasks” as well as plain tasks. For this reason we shorten “executed tasks” to xtasks. A loose definition of an xtask is the combination task + inputs. So in our example, Scrape_JobsList is a task, but the combination of the task Scrape_JobsList plus the input company=yahoo is an xtask.

Quick Exercise

Can you draw up an “xtask diagram” that corresponds to the plan so far (ie the 3 sequential tasks Scrape_CompaniesList, Scrape_JobsList and Scrape_JobDescription)? How many times do we expect the Scrape_JobDescription task to be executed (in total)?

Solution
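A sketch of the xtask tree (the company names and job counts come from our example output):

```
Scrape_CompaniesList
├── Scrape_JobsList (company=yahoo)
│   └── Scrape_JobDescription × 3
├── Scrape_JobsList (company=bp)
│   └── Scrape_JobDescription × 5
└── Scrape_JobsList (company=honda)
    └── Scrape_JobDescription × 2
```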

We expect the Scrape_JobDescription task to be executed exactly 10 times – because we know Yahoo has 3 jobs in total, BP has 5 jobs and Honda has 2 jobs. Of course, in a real situation we might not know in advance how many times a particular task was going to get executed.

Passing Data Between Tasks

Going back to our plan, we have a first task specification which looks like this:
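That is, something like (the Referer value is illustrative):

```yaml
- _name: Scrape_CompaniesList
  url: http://example.com/companies
  headers:
    Referer: http://example.com/
```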

And the first set of results it produces look like this:
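For example, the yahoo set (the location value is invented for illustration):

```yaml
company: yahoo
location: Sunnyvale
jobs: 3
jobs_url: http://example.com/jobs?company=yahoo
```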

Our second scraping task needs that jobs_url. We can tell TaskPipe to take jobs_url from the first task and insert it into the url parameter in the second task by using the $this parameter variable.

$this means use the input of this task (remember that the input of this task is just the same as the output of the last task).

Let’s take a moment to clarify some definitions, which will make discussing TaskPipe plans easier:

  • task inputs – We already mentioned these are the same as the outputs from the last task. This is a raw list (ie an array) of sets of data (ie Perl hashrefs).
  • task parameters – These are the variables that the task accepts. For example, in our first task, url and headers are task parameters. _name and other labels are not task parameters.
  • plan parameter variables – these are words like $this which start with a dollar sign (similar to Perl variables), and are used to indicate that the word should be replaced by data coming from some other task (which exact task, and which specific data item, depends on the parameter variable and how this is specified. We will discuss parameter variables in more detail later).
  • task pinterp – this may sound like a strange name, but there is a logical reason! “pinterp” really means “parameter that has been interpolated”. So e.g. if our task specification declares somevar: $this then the value of the somevar parameter is just the word $this, but the value of the somevar pinterp is the data which is actually accepted by the parameter, ie after $this has been interpolated.

Here’s a practical example of this language use. Our second task specification looks like this:
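Something like this (the Referer value is illustrative):

```yaml
- _name: Scrape_JobsList
  url: $this{jobs_url}
  headers:
    Referer: http://example.com/companies
```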

In the declaration url: $this{jobs_url}, we are using the $this parameter variable. $this{jobs_url} is the value of the url parameter.

Let’s run our task against a set of inputs:
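Say, the yahoo set again (the location value is invented for illustration):

```yaml
company: yahoo
location: Sunnyvale
jobs: 3
jobs_url: http://example.com/jobs?company=yahoo
```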

This will make the pinterp value of url become http://example.com/jobs?company=yahoo. ie the pinterp of url becomes the value of the input named jobs_url. Remember that, in general pinterp values are the things that are absorbed and used in the task.

It is worth mentioning that a pinterp value does not have to come from a parameter which is defined as a parameter variable. For example, in our first task, we declared url: http://example.com/companies. In this case there is no parameter variable. We are saying we want url to be equal to the fixed value http://example.com/companies, whatever the inputs. This means the value of the parameter is http://example.com/companies, and the value of the pinterp is also http://example.com/companies (since there is no variable in there, it just “interpolates” statically and stays as it is).

A final observation on the subject of task “parameters” vs. task “pinterps”: we could talk about the parameters in a task without needing any inputs, but we needed a specific set of inputs to be able to discuss pinterps. Putting this another way, “parameters” are really a feature of tasks whereas “pinterps” are a feature of xtasks.

Inputs and Input History

Earlier we said that a “loose definition” of an xtask (“executed task”) was the combination of a task and a specific set of inputs. The reason the definition was “loose” was because we neglected to mention input history. When a task completes and invokes the next task in line, it not only hands over its outputs (which become the inputs of the next task, remember) but also a complete history of all the input values that came beforehand. So when any task is invoked for execution, it is aware of everything that has happened previously.

The mechanics of this are not something you generally need to worry about when creating a scraper using TaskPipe. You just need to know how to instruct TaskPipe to grab values from earlier tasks using parameter variables.

One way of doing this may be seen in the third task specification of our example:
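A sketch of that third task specification (the url parameter variable here is an assumption, based on the job_url output mentioned earlier):

```yaml
- _name: Scrape_JobDescription
  url: $this{job_url}
  headers:
    Referer: $this[1]{jobs_url}
```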

See the [1] between $this and {jobs_url} in the Referer declaration? That [1] is called a match offset, and indicates that instead of using the inputs of this task, count one extra task back and take the value from those inputs instead. So in this case $this[1]{jobs_url} means “take the value of the input named jobs_url that was fed to the Scrape_JobsList task”.

This is, of course, the same value that Scrape_JobsList accepted into the parameter url. It makes sense to arrange the Referer header this way; when you are clicking through webpages in a browser, the Referer is almost always the last page you visited. So it makes sense to keep Referer one step behind url in your scraping tasks. For a series of back-to-back scraping tasks, this effect can be achieved by specifying $this for url and $this[1] for Referer.

Quick Exercise

Suppose, somewhere in the middle of your plan, you were going to run the scraping tasks Scrape_Something, Scrape_SomethingElse and Scrape_SomethingFurtherStill (in that order, one after another). Suppose all of your tasks (including the ones that occur before Scrape_Something) are designed so they each output the url that the next scraping task is going to use – and they all use url as the name of the output. Write down this part of the plan. ie write down the 3 task specifications, including the task name, url and Referer header for each task, together with the relevant parameter values, and parameter variables where appropriate.

Solution
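A sketch of the three task specifications – each url takes the url output of the previous task, and each Referer reaches one task further back (note the explicit {url} suffix on Referer, since the parameter name differs from the input name):

```yaml
- _name: Scrape_Something
  url: $this{url}
  headers:
    Referer: $this[1]{url}

- _name: Scrape_SomethingElse
  url: $this{url}
  headers:
    Referer: $this[1]{url}

- _name: Scrape_SomethingFurtherStill
  url: $this{url}
  headers:
    Referer: $this[1]{url}
```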

More about Parameter Variables

Let’s go back to our url: $this{jobs_url} declaration. We noted that the $this parameter variable means “take the value from the inputs of this task”. You may have already gathered that adding the {jobs_url} suffix tells TaskPipe to “use the input named jobs_url”.

In this case we are putting the value of an input named jobs_url into a parameter named url – the name of the parameter is different to the name of the input, so we need to explicitly tell TaskPipe which input to use. However, if we were expecting an input whose name was the same as the parameter – so e.g. our input was also named simply url (instead of jobs_url) – then we could have omitted {jobs_url} completely and just written url: $this.

Written in complete form, parameter variable declarations generally involve several parts – but most are optional. Those parts are (usually) as follows:
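Schematically, the general shape is:

```
$<label_key>:<label_value>(<match_count>)[<match_offset>]{<input_key>}

e.g.  $name:Scrape_Companies(2)[1]{jobs_url}
```

(The Scrape_Companies declaration is a made-up example that uses every part at once.)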

Here’s a summary of what each of those parts means:

label_key – the name of the parameter variable, e.g. this or name. Always required.

label_value – required in all cases except for $this (at the time of writing). Most of the time the label_key tells TaskPipe which label to use to identify the task (e.g. _name or _id). We then narrow down to the task where that label has the value of label_value. In the example above, we are telling TaskPipe to look for the task which has a _name of Scrape_Companies.

match_count – optional. If label_key:label_value matches more than one task, TaskPipe will take the last task that matched (ie the first task that matches tracing upwards through the plan). However, specifying match_count explicitly tells TaskPipe which of the matches to use. match_count is zero based, so a match_count of 1 means use the second matching task. In the example above, TaskPipe will look for the third task which has a _name of Scrape_Companies.

match_offset – optional. Once a task matching <label_key>:<label_value>(<match_count>) has been identified, match_offset can be used to count back a further number of individual tasks to get the final match. In the example above match_offset is set to 1. The $name parameter variable normally points to the outputs of the matching task; adding a match_offset of 1 means the match now points to the inputs of that task.

input_key – optional. The other parts of the parameter variable identify which set of inputs to use; the final step is to provide the name of the desired specific input within the set. This is the input_key. If input_key is omitted, TaskPipe will assume the name of the input is the same as the name of the parameter in question. In the example above, TaskPipe looks for an input named jobs_url to insert into the url parameter. However, if {jobs_url} had been omitted, TaskPipe would look for an input named url.


End of Part One

Congratulations! You have reached the end of part 1 of the TaskPipe tutorial.

Unfortunately Part 2 of the series is not yet available – but it’s coming soon, so watch this space! Alternatively get in touch if you have questions.

Have a great scrape!