Consultancy

If you find our advice valuable, but would prefer to keep development in house, then our consultancy may be for you. Did we mention your first consultation is free?

Tell us the nature of your problem and we’ll get back to you with our honest opinion. We won’t push you into services you don’t need, and we’ll point out any simpler solution we can think of.

Send us as much detail as you can, including as much of the code as you feel comfortable with – because the more information we have, the better the advice we can give. We won’t take any payment details and there is no obligation to pay for any further services.

Examples of Our Free Advice

  • Our initial advice is free of charge – and there is no obligation to sign up for any services afterwards.
  • If you ask for something we think you don’t need, we’ll tell you straight. We will refer you to other products/services if we think they will save you time/money.
  • We don’t expect you to be experts – that’s the part we are here for. At the same time we know long-term reputation is built on complete transparency – so you can expect a candid, plain-spoken reply.
  • In some cases we may even send you a free solution – if it turns out to be simple. We like to keep our coders busy!
  • Just have the odd question? We don’t bite! Send it through and we’ll be glad to help.
  • A sample selection of our advice is given below (presented with the permission of the enquirer where applicable).

Can you let me know the feasibility of the following project, and how much it would cost to build: I am looking to build an automatic system to create e-mail accounts randomly, at various email providers.

The system will search for and identify possible email providers, singling out those with weaker security.

We need the system to work with at least 5 email providers. The program must have proxy support.

Some quick thoughts on the feasibility of what you are suggesting:

An email service provider is not going to last long if they have poor sign-up security. Either they will upgrade their security, or they will fail and disappear in a reasonably short time. This is because a system with poor security will be exploited by spammers, and the email service will quickly be blacklisted.

Of course no system is impenetrable, and even gmail is used for sending spam. But the security needs to be good enough that the ratio of spam to normal email sent stays below blacklisting limits. This means successful email service providers must have security that only the most sophisticated spammers can beat. For what you are suggesting, this means your system is going to need to be as clever as the most sophisticated spammers.

That, unfortunately, is not all: email security (as with web security in general) is a constantly evolving arms race. You might beat a website’s security today, only to discover they have introduced new security measures tomorrow.

Even without these issues – say you were just trying to scrape 5 websites – you still would not be able to spend [the budget specified] and have a system that is guaranteed to work into the future. You would need to be ready to make constant changes, because the details you are grabbing from the websites are going to change too.

Some thoughts on how this could be done with today’s security standards:

Email providers now tend to ask for a phone number to which a verification SMS is sent. I know gmail allows 5 email addresses to be registered against the same number.

There are websites which present a certain number of phone numbers per period of time (perhaps per day) to which SMS messages can be sent. The number of phone numbers tends to be small, and the services may require a subscription payment.

One possible solution might be to continually poll these websites for new numbers. As soon as a new number becomes available, attempt to register an email address using it. No doubt other people are already doing this, so it would be a race to register an address successfully before the maximum per number is used up (e.g. 5 for gmail). You will probably win this race sometimes. It is difficult to say how many email addresses you could register per day using this method – but my guess is it would only be a few. (What are your requirements?)

So this may be doable, but I’m afraid it is not guaranteed to work into the future. It will also need continuous adjustment, and is likely to be a continuous cash drain rather than a one-off payment. Unfortunately I think it will end up costing considerably more than [the budget you specified].

I need a skilled developer to make some changes to a site I built years ago. We need to, first & foremost, use a more reliable email module and improve form input to better deal with special characters.

1) I appreciate that without investing some time to get fully familiar with the site, you can’t provide an estimate with a high degree of precision, but please provide an approximate hours estimate for the time needed to create a testing environment – a clone of the site/database for development.

2) The current forms on the site don’t do a great job of converting input data to safely formatted data, and back again when it is sent out via email. Can you suggest whether we need a better library, or to develop a custom routine to do a better job of this? All the various input forms on the site will then require some updates.

3) The current library the site uses to send email has a fairly high delivery failure rate, so we should also implement a better solution for emails from the site. Once we have a more reliable emailing module in place, we need to consolidate and clean up the various routines which currently send out email – they are scattered and overly complex. Again, what is your opinion on the best way to solve this problem?

1. It does depend on what software you have installed, but if your setup is fairly standard (apache server, mysql db, scripts and modules) it can be done in a couple of hours. Usually one or two things don’t work, though, and the bottleneck can be figuring out what went wrong. We won’t charge for anything we can’t resolve in reasonable time, however, as we don’t believe in charging for our own stupidity! Let’s say 4 hours tops for this.
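
For a fairly standard setup, the clone itself is mostly a dump-and-restore job. As a rough sketch of the process (all paths, database names and credentials below are hypothetical):

    #!/usr/bin/perl
    use strict;
    use warnings;

    # Hypothetical locations -- substitute your own.
    my $live_dir = '/var/www/livesite';
    my $dev_dir  = '/var/www/devsite';
    my $live_db  = 'livesite';
    my $dev_db   = 'devsite';

    # 1. Copy the document root, preserving permissions.
    system('rsync', '-a', "$live_dir/", "$dev_dir/") == 0
        or die "rsync failed: $?";

    # 2. Dump the live database and load it into the dev copy.
    system("mysqldump --single-transaction $live_db > /tmp/$live_db.sql") == 0
        or die "mysqldump failed: $?";
    system("mysql $dev_db < /tmp/$live_db.sql") == 0
        or die "mysql import failed: $?";

    # The 'one or two things that don't work' usually turn out to be
    # config: hardcoded URLs, absolute paths, or credentials that
    # still point at the live site.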

2. I suspect you are right when you say ‘better library’ – there’s generally no need for custom routines to make formatted data safe, as web languages are generally saturated with good modules in this space.
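
If the site is in Perl, for instance, HTML::Entities handles the round trip; a minimal sketch:

    use strict;
    use warnings;
    use HTML::Entities qw(encode_entities decode_entities);

    my $raw = q{O'Brien & Sons <info@example.com>};

    # Encode once, when the data enters the system...
    my $safe = encode_entities($raw);   # e.g. O&#39;Brien &amp; Sons &lt;...&gt;

    # ...and decode only where entities are not wanted,
    # e.g. the plain-text part of an outgoing email.
    my $plain = decode_entities($safe);

    print "$safe\n$plain\n";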

3. Are you sure it’s the code that’s the problem? If mails aren’t making it through, it’s more likely down to the ever-increasing security measures that receiving mail servers now check for – SPF, DKIM etc. Are you running your own mail server? If you are, then you just need to set these up by adding the correct records to your domain’s DNS. If you are just using local sendmail, then your mail headers probably carry values taken from your hosting provider, with incorrect or absent SPF etc. – and this is why they are not making it through.
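
A quick way to check whether your sending domain publishes an SPF record at all is to look at its TXT records. A minimal sketch using Perl’s Net::DNS (example.com stands in for your domain):

    use strict;
    use warnings;
    use Net::DNS;

    my $domain   = 'example.com';          # substitute your sending domain
    my $resolver = Net::DNS::Resolver->new;

    my $reply = $resolver->query($domain, 'TXT')
        or die "No TXT records found for $domain\n";

    for my $rr (grep { $_->type eq 'TXT' } $reply->answer) {
        my $txt = $rr->txtdata;
        print "SPF record found: $txt\n" if $txt =~ /^v=spf1/;
    }

DKIM works along similar lines, except the record lives under a selector (something like selector._domainkey.yourdomain.com) and your mail server also has to sign outgoing messages with the matching key.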

Setting up mail can be a real pain – which is why most people these days use a third party service like [some examples]. Many of these have a free option if you are not sending a lot of mails. Most website owners we work with who use these services say they save a lot of trouble.

Our login is not very secure – the password reminder emails the username and password (in separate emails) to the user. The registration process doesn’t prevent users from registering multiple accounts with the same email, etc. It either needs some improvements or to be replaced with an account management module. Please explain to me: do you think it would be better to patch up what’s there, or is there a module you would suggest using instead?

In terms of the password reminder – normal protocol is to email a link which takes the user to a page where they can reset their password. This is considered the better approach because the user is forced to choose a password which is not written down anywhere, and the link expires in an hour or so – which means that even if the email were somehow intercepted, the information in it would quickly become useless. However, emailing a temporary password is not actually much worse, provided you force the user to choose a new password once they log in. The simplest fix here might be to direct the user to the admin page where they can change their password immediately on login – but it does depend on how your site is structured and how the existing code works.
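
As a rough sketch of the reset-link approach (the table and column names here are made up, and you’d want a proper CSPRNG for the token in production):

    use strict;
    use warnings;
    use DBI;
    use Digest::SHA qw(sha256_hex);

    my $dbh = DBI->connect('dbi:mysql:database=site', 'user', 'pass',
                           { RaiseError => 1 });

    my $user_email = 'user@example.com';   # the address on the account

    # Generate a hard-to-guess token (use a real CSPRNG in production)
    # and store only its hash, with an expiry an hour from now.
    my $token = sha256_hex(time . $$ . rand());
    $dbh->do(
        'UPDATE users SET reset_hash = ?, reset_expires = ? WHERE email = ?',
        undef, sha256_hex($token), time + 3600, $user_email,
    );

    # Email the user a link carrying the raw token, e.g.
    #   https://example.com/reset?token=...
    # On arrival: hash the presented token, look it up, check the
    # expiry, and only then let the user choose a new password.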

Re. being able to register multiple accounts with the same email – normally you would specify email as a unique field in the database to make this situation impossible. Then just check the email does not already exist prior to attempting the insert – if it does, deliver a message back to the user saying the address is already in use. (Actually most sites do this check with ajax now, so the user knows prior to submission whether the address is available.)
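
In concrete terms that is one schema change plus a lookup before the insert. A sketch, assuming a Perl/MySQL stack and made-up table and column names:

    use strict;
    use warnings;
    use DBI;

    my $dbh = DBI->connect('dbi:mysql:database=site', 'user', 'pass',
                           { RaiseError => 1 });

    # One-off schema change, making duplicates impossible at the DB level:
    #   ALTER TABLE users ADD UNIQUE KEY uniq_email (email);

    my $email = 'new.user@example.com';

    # Friendly pre-check so we can show a sensible message...
    my ($exists) = $dbh->selectrow_array(
        'SELECT 1 FROM users WHERE email = ?', undef, $email);

    if ($exists) {
        print "That email address is already registered.\n";
    } else {
        # ...but the unique key remains the real guarantee, since two
        # registrations could still race between the check and the insert.
        $dbh->do('INSERT INTO users (email) VALUES (?)', undef, $email);
    }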

Answering the final part of your question: there is no module which deals with all aspects of security, so the answer is that we would need to patch up/enhance what you already have. However, there are modules which can help us along the way.

I haven’t fully analysed what I’ve seen – and obviously I’ve only had a small snapshot – but it seems clear there are a number of security issues which do need addressing quite urgently. The biggest problem I think is… (Redacted)

Your site may also be vulnerable to sql injection as… (Redacted)

Our site is not designed to prevent re-submission of forms on page refresh, and this causes some trouble. It would be beneficial to identify the various places where this is an issue and prevent re-submissions. What do you see as the best solution for this?

I’m not sure this is entirely a problem: there is no way to tell whether a second form submission is an accident or a deliberate attempt by the user to submit something a second time. Normally you make the form submission method=’POST’, which means that if the user tries to refresh a page that’s been posted they will get a warning from the browser. There’s not much that can be done beyond this. What’s important on the server side is that you have unique key fields properly set up on the database, so the user can’t inadvertently add records that shouldn’t be added.
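
With the unique keys in place, the insert itself can shrug off an accidental resubmission. A sketch assuming MySQL, whose native duplicate-key error code is 1062 (the table and column names are made up):

    use strict;
    use warnings;
    use DBI;

    my $dbh = DBI->connect('dbi:mysql:database=site', 'user', 'pass',
                           { RaiseError => 1, PrintError => 0 });

    # order_ref is a unique key, so a refreshed POST simply collides
    # with the row it already created.
    my ($order_ref, $item) = ('ABC-123', 'widget');

    eval {
        $dbh->do('INSERT INTO orders (order_ref, item) VALUES (?, ?)',
                 undef, $order_ref, $item);
    };
    if ($@) {
        if (defined $dbh->err && $dbh->err == 1062) {
            print "Duplicate submission ignored.\n";
        } else {
            die $@;   # some other problem -- rethrow
        }
    }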

We have an old web scraper which currently gathers certain information from a small number of websites. We want to rebuild this so it scrapes millions of websites. It is important that the sites are rendered, to get all the information.

Also, we want to be able to scale to tens of servers that can all connect to the same database to see which sites to scrape next. We need help building a general scraping engine – we are hoping you have the expertise to understand the original and expand it. We expect the initial build to take a few months, but we are looking for a long-term commitment to help further build the product.

Interesting project! It does sound quite ambitious, however, and so we have a few questions:

– I am wondering what it is you are going to scrape. Is the information you are looking for going to be presented in a known format on the webpages you are scraping? This can happen where the website invokes a third party service – e.g. you can find the same piece of google analytics code on a huge number of websites. If you were trying to find out what proportion of sites used google analytics, you could indeed scrape millions of websites to find this information (see the first sketch after these questions). If on the other hand you are trying to scrape websites to find something that could be presented in arbitrary format, like “the price of gold”, then you are going to need an AI level of intelligence, and it is likely to be expensive and take considerably longer than 3 months.

– How will you get the list of websites to scrape? Will there be a predefined list which you loop through, or will the system be a “crawler” which pulls URLs from the sites it has already scraped?

– Are you sure the websites need to be rendered? If the system is a crawler, and you absolutely insist it picks up 100% of the URLs it encounters, then you will indeed need to render the pages, because no doubt some sites will use javascript to construct their links. You’ll also need to render if you are looking for a certain block of HTML which may or may not be created using javascript – and may not be created using the same piece of javascript each time. However, I would say in many other situations there are ways of avoiding rendering – generally by looking at what the js is doing and following the flow (see the second sketch below). I just mention this because on more than one occasion I’ve seen rendering systems in a terrible mess, and after investigating what they were scraping it seemed they didn’t actually need to render at all.
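
To make the “known format” case concrete, here is a first sketch: a minimal Perl check of one page for a google analytics snippet. The URL and patterns are purely illustrative:

    use strict;
    use warnings;
    use LWP::UserAgent;

    my $ua = LWP::UserAgent->new(timeout => 15, agent => 'example-scraper/0.1');

    my $url  = 'https://example.com/';     # one of the millions of sites
    my $resp = $ua->get($url);
    die 'Fetch failed: ' . $resp->status_line . "\n" unless $resp->is_success;

    my $html = $resp->decoded_content;

    # GA snippets have a recognisable shape whatever the site, which is
    # what makes this kind of scrape tractable at scale.
    if ($html =~ /google-analytics\.com|googletagmanager\.com/) {
        print "$url appears to use google analytics\n";
    } else {
        print "$url: no analytics snippet found\n";
    }

And a second sketch, to illustrate “following the flow”: often the page’s javascript just fetches JSON from an endpoint you can call directly, with no rendering at all. The endpoint and field names below are hypothetical – you would find the real ones by watching the browser’s network tab while the page loads:

    use strict;
    use warnings;
    use LWP::UserAgent;
    use JSON::PP qw(decode_json);

    my $ua = LWP::UserAgent->new(timeout => 15);

    # The (hypothetical) endpoint the page's own JS uses for its listings.
    my $resp = $ua->get('https://example.com/api/listings?page=1');
    die 'Fetch failed: ' . $resp->status_line . "\n" unless $resp->is_success;

    my $data = decode_json($resp->decoded_content);

    # The structured data arrives ready-made -- no rendering needed.
    for my $item (@{ $data->{listings} || [] }) {
        print "$item->{title}\t$item->{price}\n";
    }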

Your project sounds similar to a job we worked on sometime last year, which was (redacted)… They were crawling, rendering and had multiple instances working in parallel. When we came to look at their system it had already basically ground to a halt, and I’m afraid we weren’t able to rescue it. The main problem was that they had made a series of poor design choices from the outset (for a job like this, some real thought needs to go into the architecture before writing any code at all) – and then the code was not well organised. Eventually the technical debt piled up, the code became too confusing, and the system became non-operational. They’d also been working on it for a lot longer than they expected. So my advice would be to be really very careful with the design, and to make sure the coding is very clean and well organised. Even then it may take longer than you are hoping, so you should probably anticipate this.
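
On the “tens of servers, one database” requirement specifically: the usual architecture is a shared URL queue that workers claim atomically, so no two servers scrape the same site. A rough sketch of one way to do this in MySQL – the schema and table names are hypothetical:

    use strict;
    use warnings;
    use DBI;
    use Sys::Hostname;

    my $dbh = DBI->connect('dbi:mysql:database=crawler', 'user', 'pass',
                           { RaiseError => 1 });

    my $worker = hostname() . ":$$";

    # Claim exactly one unclaimed URL. The single UPDATE is atomic,
    # so two workers can never grab the same row.
    my $claimed = $dbh->do(
        q{UPDATE url_queue SET claimed_by = ?, claimed_at = NOW()
          WHERE claimed_by IS NULL
          ORDER BY id LIMIT 1},
        undef, $worker,
    );

    if ($claimed and $claimed > 0) {
        my ($id, $url) = $dbh->selectrow_array(
            q{SELECT id, url FROM url_queue
              WHERE claimed_by = ? ORDER BY claimed_at DESC, id DESC LIMIT 1},
            undef, $worker);
        print "Worker $worker scraping $url\n";
        # ...scrape, mark the row done, and insert any newly found URLs.
    }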

We have an old information gathering process which gathers data from [a website]. However it is currently suffering from rate-limiting problems. We had a developer look at this recently, but he was not able to significantly improve the rate of data collection. [My manager] is convinced that, with the number of proxies we are using, it should be possible to gather the data without this limiting issue… Could you take a look at the code and give me your assessment?

We’ve had a look over the code, but have not tried any of it out yet. (I am wondering if there is a version set up in a development environment somewhere – which might save us time setting this up?)

Generally we are quite wary of jobs involving [this structure], as it’s easy to fall into the trap of trying to rescue something which ends up being too tangled to comprehend – which is why we like to inspect things carefully before making any recommendations.

On the bright side, I would say this is in a good enough state to make amendments to. It’s obviously written in a more traditional style and there are a few bad practices here and there, but nothing too serious. (I am relieved to see that Parallel::ForkManager was chosen over the terrible ‘threads’ module.)

In terms of maintenance, there’s a fair amount of hardcoding (URLs, keys etc.) which means you are always going to need a programmer to make adjustments. If you are hoping to keep using the code long term, it would be sensible to pull these parameters out and put them in separate config files. Normally we would also recommend refactoring into OO-style modules during this process, as it makes the code more understandable and manageable. You could possibly bypass this as the code is not too badly organised, but refactoring is often a good exercise in revealing how the code works – and it might take as long just to read and understand the existing code anyway.
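
To illustrate what pulling the parameters out might look like – Config::Tiny is one common Perl choice; the filename and keys below are made up:

    use strict;
    use warnings;
    use Config::Tiny;

    # scraper.conf might look like:
    #   [target]
    #   base_url = https://example.com
    #   api_key  = CHANGEME
    #   [limits]
    #   requests_per_minute = 30

    my $cfg = Config::Tiny->read('scraper.conf')
        or die 'Cannot read config: ' . Config::Tiny->errstr . "\n";

    my $base_url = $cfg->{target}{base_url};
    my $rpm      = $cfg->{limits}{requests_per_minute};

    # Adjustments now mean editing a config file, not the code.
    print "Scraping $base_url at $rpm requests/minute\n";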

I cannot say definitively just from looking at the code whether the algorithm in place is efficient – but my feeling is that it probably isn’t, because it doesn’t seem well organised and the values appear arbitrarily chosen.

To improve this we would likely work along the following lines:

  1. investigate [the website] and determine what the bottleneck factors are in making requests from their site
  2. create a simulator package that mocks [the website’s] rate restriction and banning behaviour (see the sketch after this list)
  3. use the package to run tests to create an algorithm that maximises efficiency
  4. implement the algorithm in the existing code
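
A very rough sketch of what the simulator in step 2 might look like – the window size and limits are placeholders until step 1 tells us the real ones:

    package MockRateLimiter;
    use strict;
    use warnings;

    # Simulates a site that allows N requests per rolling window and
    # 'bans' any client that exceeds it. The numbers are placeholders.
    sub new {
        my ($class, %args) = @_;
        return bless {
            max_requests => $args{max_requests} // 10,
            window_secs  => $args{window_secs}  // 60,
            history      => {},   # client id => arrayref of request times
            banned       => {},
        }, $class;
    }

    sub request {
        my ($self, $client, $now) = @_;
        $now //= time;
        return 'banned' if $self->{banned}{$client};

        # Drop requests that have fallen out of the rolling window,
        # then record this one.
        my $hist = $self->{history}{$client} //= [];
        @$hist = grep { $_ > $now - $self->{window_secs} } @$hist;
        push @$hist, $now;

        if (@$hist > $self->{max_requests}) {
            $self->{banned}{$client} = 1;
            return 'banned';
        }
        return 'ok';
    }

    1;

Candidate request-pacing algorithms can then be driven against the mock, counting how many ‘ok’ responses each achieves before a ban – far quicker (and safer) than experimenting against the live site.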

I hope this makes sense but of course do get back to us with any questions.