Solving Common Crawling Obstacles
All decisions made in SEO are based on information that requires careful collection. Guidance must come from educated guesses, informed by trustworthy data.
One of the best tools for retrieving this data is Screaming Frog, a crawler which can be tweaked in many different ways to accomplish specific tasks of data gathering. However, ensuring the integrity of your data is rarely a straightforward path.
There are many potential obstacles in the journey towards information enlightenment — here, we’ll discuss some of the more common issues you might run into.
The Site is Too BIG
Screaming Frog runs off of your system’s memory (RAM), not hard disk, which means it is blisteringly fast but can encounter problems crawling on a large scale. You might have 1TB of hard disk space on your computer, but a typical computer won’t have more than 16GB of RAM.
Generally speaking, I run into issues with a plain text crawl after roughly 75k to 100k URLs using a machine with 16GB of RAM and a 2.2GHz i7 processor.
It’s possible to run Screaming Frog on a remote server to increase the power available. This is most easily accomplished through Amazon Web Services (AWS) EC2 instances, which can be sized in processor and RAM to suit your needs.
For a full walkthrough, check out Mike King’s post here on ipullrank.com.
Other cloud services such as Google Compute Engine are available, but from my perspective AWS is the simplest solution.
Another, possibly quicker, solution if you only occasionally deal with large domains is to crawl in segments. Use Screaming Frog’s include/exclude features, disallow “crawl outside of start folder” in settings, and run a separate crawl for each main subfolder, presuming each subfolder is within your RAM’s capacity.
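To plan those segments, it helps to know roughly how many URLs live under each main subfolder before you start. Below is a minimal Python sketch, assuming you already have a list of the site's URLs from somewhere like an XML sitemap or an earlier partial crawl. The function names and the per-crawl URL budget are my own placeholders, not anything built into Screaming Frog.

```python
from collections import Counter
from urllib.parse import urlsplit


def top_level_folder(url):
    """Return the first path segment as '/segment/', or '/' for root-level pages."""
    path = urlsplit(url).path
    segments = [s for s in path.split("/") if s]
    # A lone segment without a trailing slash (e.g. /about) is a root-level page.
    if not segments or (len(segments) == 1 and not path.endswith("/")):
        return "/"
    return f"/{segments[0]}/"


def plan_segments(urls, max_urls_per_crawl):
    """Count URLs per top-level folder and flag folders over the per-crawl budget."""
    counts = Counter(top_level_folder(u) for u in urls)
    return {
        folder: {"urls": n, "fits": n <= max_urls_per_crawl}
        for folder, n in counts.most_common()
    }


if __name__ == "__main__":
    # Hypothetical URLs for illustration only.
    sample = [
        "https://example.com/blog/post-1",
        "https://example.com/blog/post-2",
        "https://example.com/shop/item-1",
        "https://example.com/about",
    ]
    for folder, info in plan_segments(sample, max_urls_per_crawl=75_000).items():
        print(folder, info)
```

Any folder flagged as not fitting would need to be split further, for example by its own subfolders, before crawling.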
As a side note, you can increase Screaming Frog’s available RAM allocation (the standard is 1GB for a 32-bit machine, and 2GB for 64-bit). If a crawl uses all of its allocated memory, Screaming Frog will warn you that it may become unstable unless you increase the allocation.
The Crawler is Blocked By the Server
There are many evil crawlers on the internet, full of malice and server load. Large domains tend to take an inexorable approach, allowing entry only to those defined as pure by the royal domain. All others shall receive the forbidden gates of 403.
There are many potential causes of this, and it behooves you to determine exactly which cause is to blame before approaching the domain’s gatekeeper with a request for entry. Specificity is appreciated by technical individuals. Below are the most common causes to test for.
User-Agent Whitelist or Blacklist
Try crawling with Chrome as your user-agent, but connect through a VPN first in case your IP was already blacklisted when you approached the server identified as a crawler. I always prefer to identify as a crawler in my first crawl in case user-agent blacklisting is inactive, because then my requests are easy to separate from real visitors in the server logs later. Filtering your IP out of the logs afterwards isn’t a huge job, but I’m constantly on VPNs, so it can be a mild pain.
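If you want to check a single URL before committing to a full crawl, the comparison can be sketched in a few lines of Python using only the standard library. This is an illustration under assumptions: the user-agent strings below are placeholders (substitute whatever your crawler and browser actually send), and the verdicts are rough heuristics rather than proof of how the server is configured.

```python
import urllib.error
import urllib.request

# Placeholder user-agent strings for illustration; substitute the real ones you use.
USER_AGENTS = {
    "crawler": "Screaming Frog SEO Spider",
    "browser": (
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
        "(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36"
    ),
}


def fetch_status(url, user_agent, timeout=10):
    """Return the HTTP status code the server sends for this user-agent."""
    request = urllib.request.Request(url, headers={"User-Agent": user_agent})
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        # 4xx/5xx responses raise HTTPError; the status code is what we want.
        return err.code


def diagnose(url, fetch=fetch_status):
    """Request the URL once per user-agent and suggest a likely cause."""
    statuses = {name: fetch(url, ua) for name, ua in USER_AGENTS.items()}
    if statuses["crawler"] == 403 and statuses["browser"] == 200:
        verdict = "user-agent filtering likely"
    elif statuses["crawler"] == 403 and statuses["browser"] == 403:
        verdict = "blocked regardless of user-agent; check IP or rate limits"
    else:
        verdict = "no user-agent block observed"
    return statuses, verdict
```

Calling `diagnose("https://example.com/")` from behind your VPN returns the two status codes and a one-line verdict you can quote when you contact the site owner.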
Server Rate Limiting
Crawling too quickly can upset a server, as bots can be used in distributed denial of service (DDoS) attacks to crash the system. You may receive 403 messages after a decent number of URLs initially responded 200 OK. This is usually because of rate limiting.
Try slowing your crawl to under 5 URLs per second and limit threads to 1 or 2: this seems to be the most common acceptable crawl rate I’ve found in my work. This method takes much longer for huge domains; do the math and run it overnight, over the weekend, or on a server.
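Screaming Frog handles this through its own speed settings, but if you script your own spot checks alongside a crawl, the same cap is worth applying there too. Here is a minimal single-threaded sketch in Python; the `Throttle` class is a hypothetical helper of my own, and it simply enforces a minimum gap between requests.

```python
import time


class Throttle:
    """Enforce a minimum interval between requests (a single-threaded rate cap)."""

    def __init__(self, max_per_second, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = 1.0 / max_per_second
        self.clock = clock
        self.sleep = sleep
        self.next_allowed = None  # earliest time the next request may start

    def wait(self):
        """Block until the next request is allowed, then reserve the following slot."""
        now = self.clock()
        if self.next_allowed is not None and now < self.next_allowed:
            self.sleep(self.next_allowed - now)
            now = self.next_allowed
        self.next_allowed = now + self.min_interval


if __name__ == "__main__":
    throttle = Throttle(max_per_second=4)
    for url in ["https://example.com/a", "https://example.com/b"]:  # placeholder URLs
        throttle.wait()
        print("would fetch", url)
```

Call `wait()` immediately before each request; the first call returns straight away and later calls sleep just long enough to stay under the cap.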
For example, a 1 million page site at an average of 5 URLs per second will take a little over two days to crawl: 86,400 seconds per day x 5 URLs per second = 432,000 URLs crawled per day, and 1,000,000 / 432,000 is roughly 2.3 days.
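The same math is easy to rerun for other site sizes and crawl rates. A small Python sketch follows; the function name is my own, and it assumes the crawl sustains the given average rate without interruption.

```python
SECONDS_PER_DAY = 86_400


def crawl_days(total_urls, urls_per_second):
    """Estimate crawl duration in days at a steady average rate."""
    urls_per_day = urls_per_second * SECONDS_PER_DAY
    return total_urls / urls_per_day


if __name__ == "__main__":
    print(f"{crawl_days(1_000_000, 5):.2f} days")  # prints "2.31 days"
```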
IP Whitelist or Blacklist
If you’ve already tried changing user-agent, limiting your crawl rate, and changing your IP via VPN, you’re probably dealing with a server that has a whitelist. It’s also possible that you’ve already landed yourself on the site’s blacklist by being too liberal with your crawl testing.
At this point, you can approach the gatekeeper with the steps you’ve attempted, and they will most likely happily put your IP on a whitelist, because you know how to crawl respectfully.
Unless, of course, you did get yourself blacklisted. Then you may be heckled.
The Crawler Can’t See the Desired Content
This is a conundrum when specific data, like a view count per post, needs to be extracted from a large number of URLs on an external domain; for example, a site where you have a guest column, or one you are studying for competitive analysis.
Enter artoo.js, a client-side scraping companion. This is the cleanest solution I have found for extracting data from the rendered DOM quickly and at scale. Artoo automatically injects jQuery and exports in pretty JSON.
GG, WP. Happy crawlings.