Solving Common Crawling Obstacles

All decisions made in SEO are based on information that requires careful collection. Guidance must come from educated guesses, informed by trustworthy data.

One of the best tools for retrieving this data is Screaming Frog, a crawler that can be tweaked in many different ways to accomplish specific data-gathering tasks. However, ensuring the integrity of your data is rarely a straightforward path.

There are many potential obstacles in the journey towards information enlightenment — here, we’ll discuss some of the more common issues you might run into.

The Site is Too BIG

Screaming Frog runs off your system’s memory (RAM), not your hard disk, which makes it blisteringly fast but prone to problems when crawling at a large scale. You might have 1TB of hard disk space on your computer, but a typical computer won’t have more than 16GB of RAM.

Generally speaking, I run into issues with a plain text crawl after around 75k to 100k URLs, using a machine with 16GB of RAM and a 2.2GHz i7 processor.

It’s possible to run Screaming Frog on a remote server to increase the power available. This is most easily accomplished through Amazon Web Services (AWS) EC2 instances, which scale in processor and RAM to whatever specification you need.

For a full walkthrough, check out Mike King’s post here on ipullrank.com.

Other cloud services such as Google’s Compute Engine are available, but from my perspective AWS is the simplest solution.

Another, possibly quicker, solution if you only occasionally deal with large domains is to crawl in segments: use Screaming Frog’s include/exclude features, disable “crawl outside of start folder” in the configuration, and run separate crawls of each main subfolder, presuming each subfolder fits within your RAM’s capacity (a sketch of the include/exclude patterns follows below).
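The include and exclude fields accept regular expressions, one per line. As a minimal sketch, with a hypothetical domain and folder structure:

    Configuration > Include (crawl only the blog section):
        https://www.example.com/blog/.*

    Configuration > Exclude (skip parameterized URLs within it):
        https://www.example.com/blog/.*\?.*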

As a side note, you can increase Screaming Frog’s RAM allocation (the default is 1GB on a 32-bit machine, and 2GB on 64-bit). If you exhaust the allocated memory, you’ll receive a warning to increase it; continue past that point and the crawl becomes unstable.
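On a Windows install, at the time of writing, this allocation lives in the ScreamingFrogSEOSpider.l4j.ini file in the installation directory; each line is a Java VM argument, and the 8GB heap below is just an example value sized for a 16GB machine:

    -Xmx8g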

The Crawler is Blocked By the Server

There are many evil crawlers on the internet, full of malice and server load. Large domains tend to take an unforgiving approach, allowing entry only to those deemed pure by the royal domain. All others are met with the forbidden gates of 403.

There are many potential causes for this, and it behooves you to determine exactly which one is to blame before approaching the domain’s gatekeeper with a request for entry; specificity is appreciated by technical individuals. Below are the most common causes to test for.

User-Agent Whitelist or Blacklist

Try crawling with a Chrome user-agent, but connect to a VPN first to ensure you haven’t already been blacklisted by IP after approaching the server identified as a crawler. I always prefer to identify as a crawler on my first crawl, in case user-agent blacklisting is inactive, because the server logs will be cleaner later. Filtering your IP out of the logs afterwards isn’t a huge pain, but I’m constantly on VPNs, and it can be a mild one.
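A quick way to test for user-agent filtering before committing to a full crawl is to request the same URL under two identities and compare the responses. A minimal sketch with curl, using a hypothetical URL (the Chrome UA string is abbreviated; substitute a current one):

    # Identify as a crawler; a 403 here but a 200 below suggests UA blacklisting
    curl -I -A "Screaming Frog SEO Spider" https://www.example.com/

    # Identify as a regular Chrome browser
    curl -I -A "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36" https://www.example.com/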

Server Rate Limiting

Crawling too quickly can upset a server, since bots can be used in distributed denial of service (DDoS) attacks to crash a system. If you receive 403 responses after a decent number of URLs initially returned 200 OK, rate limiting is usually the culprit.

Try slowing your crawl to under 5 URLs per second and limiting threads to 1 or 2; this is the most commonly accepted crawl rate I’ve found in my work. This method takes much longer on huge domains, so do the math and run it overnight, over a weekend, or on a server.

For example, a 1 million page site at an average of 5 URLs per second will take a couple of days to crawl: 5 URLs per second x 86,400 seconds per day = 432,000 URLs crawled per day, so 1,000,000 ÷ 432,000 ≈ 2.3 days.

IP Whitelist or Blacklist

If you’ve already tried changing user-agent, limiting your crawl rate, and changing your IP via VPN, you’re probably dealing with a server that has a whitelist. It’s also possible that you’ve already landed yourself on the site’s blacklist by being too liberal with your crawl testing.

At this point, you can approach the gatekeeper with the steps you’ve attempted, and they will most likely happily put your IP on a whitelist, because you know how to crawl respectfully.

Unless, of course, you did get yourself blacklisted. Then you may be heckled.

The Crawler Can’t See the Desired Content

Sometimes the HTML source is different from the rendered document object model (DOM). Observe the differences between “view page source” (the raw HTML) and “inspect element” (which reveals the rendered DOM). Usually, this is because of fancy new JavaScript libraries that modify the DOM, like AngularJS, ReactJS, and jQuery.

Fortunately, Screaming Frog can use the Chromium project library as a rendering engine: enable JavaScript rendering in the “Rendering” tab of Screaming Frog’s configuration. Crawling this way is slower, so only do it when necessary, but it’s becoming necessary on more domains every day.

However, Screaming Frog cannot do it all, and its JavaScript rendering solution is not perfect.

If the elements you need to extract from a page are delivered through JavaScript, certain data will not appear in the rendered DOM unless the “check external links” setting is enabled. Combine this with major domains that implement “welcome pages” and/or custom tag-tracking URIs (Forbes, for example), and it becomes impossible to ignore JavaScript-based redirects to the content while also allowing JavaScript-rendered content to appear in the extractable HTML during the crawl.

This is a conundrum when specific data, like a view count per post, needs to be extracted from a large number of URLs on an external domain; for example, one where you have a guest column, or a competitor you’re analyzing.

Enter artoo.js, a client-side scraping companion. It’s the cleanest solution I’ve found for extracting the rendered DOM quickly and at scale. Artoo automatically injects jQuery and exports pretty JSON.
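As a minimal sketch of what that looks like, run from the browser console on the rendered page with artoo loaded (via its bookmarklet, for instance); the '.post', '.post-title', and '.view-count' selectors are hypothetical and would need to match the target site’s actual markup:

    // Scrape one record per rendered post; artoo extracts text by default
    var posts = artoo.scrape('.post', {
      title: {sel: '.post-title'},
      views: {sel: '.view-count'}
    });

    // Download the results as a prettified JSON file
    artoo.savePrettyJson(posts, {filename: 'view_counts.json'});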

GG, WP. Happy crawling.

Nicholas Chimonas
For most of the 2010s, Nicholas Chimonas was Head of Technical SEO projects at Page One Power. He recently moved on from the agency life to an in-house role as Director of SEO at WTP Inc, a holding company of investors and tech strategists. You’ll find him in technicalSEO.slack.com or a national forest in the pacific northwest.
You’re Not Ready for Link Building…Yet!

Your site isn’t ready for link building.

It can’t handle it. The penguins are hungry.

You think you want links, because someone once told you that links lead to page one rankings.

The thing is, making it to the first page of Google’s search results and staying there is about so much more than just links. And links are about so much more than just getting to the first page of Google’s search results! Even though backlinks are among the most heavily weighted ranking factors, and indeed a democratic system of votes that can certainly affect rankings, links are not a silver bullet.

That’s why we run many of our new clients through the narrative I’m about to present to you, and quite frequently their varied situations require us to provide more than just link development.

Part One: Humans

If your site isn’t good enough for people, it isn’t good enough for links either.

There is a proverb at Page One Power that is fairly simple to understand: “The links you get will only be as good as your site.”

It’s the truth. To earn a worthwhile link, you’ve got to deserve it: your website needs to provide real value to visitors. Why would anyone want to link to an e-commerce page that’s filled purely with products and their prices? Would you give out free advertising for nothing in return?

Does your site really deserve to rank on the first page of Google?

Links are an essential part of a strong website, but links alone won’t ensure success. It’s true that whether your goals include a higher ranking in Google or more visitors to your website, links can help you reach them. But the links you get will only be as good as your website.

People want to link to pages that spark their interest — something that holds value for them or the people who visit their site. In the SEO industry, we call these pages linkable assets. Product pages are rarely a linkable asset, unless you provide a truly unique or rare set of products, like hoverboards or lightsabers.

Think about what makes your business unique, common questions people have about your industry or company, or information that’s new and exciting, then build content based on those ideas.

If you don’t have anything special or unique to offer, it’s easy to get lost in the noise of the web. Taking a step back: if a business doesn’t have a unique selling proposition (USP), it’s probably not ready for marketing at all. It might be time to re-evaluate, create something worth marketing, and then present it to a target audience that could find value in it.

Simple marketing basics that need to be more integral to link building:

  • What is your USP?
  • Is your content solving a problem for the customer?
  • Who is your target audience/demographic?
  • Will seasonality play a role in the success of this campaign?
  • What sites and people are in a position of authority and influence?
  • What kinds of relationships are worth nurturing?

After answering those questions, how can you align your marketing (and link building) efforts to intrinsically reflect the answers?

The websites that rank on the first page of Google are special because they’ve invested work into creating worthwhile content and brands that people want to see — there’s no shortcut to that. To join their high ranks, you’ll need to do the same.

Links are connections between people and ideas. That has always been their true purpose, long before the dawn of Google. When you connect the right people with the right ideas, especially with your brand attached, relationships are built with real individuals, people who could become brand evangelists for you. When you connect the right people with the right ideas, everybody wins.

Part Two: The Technical Stuff

Links act as a signal of authority and trustworthiness. The more trusted links that point to your site, the more likely it is that Google will see your website as valuable and trustworthy.

Links also help boost the exposure that Google’s crawlers — the robots that run the search engine — have with your website. Links direct those crawlers to your site; the more links that lead to your website, the more opportunities you have to be discovered by Google, and in turn, Google’s users.

If these crawlers show up on your page and discover content that no actual human would ever read, sloppy technical SEO, and lackluster design, that great link has lost its potential value. All the links in the world will not turn trash into treasure.

Of course, if a site is bad, the likelihood of obtaining good links in the first place is very low. People are willing to endorse pages on the web which they find valuable. Making a single quality page that offers something new or helpful is more worthwhile than creating many pages of mediocre content.

Worse yet, if you have to force a link to be created or use dubious link building tactics, you could actually be harming your site by inviting a manual or algorithmic penalty.

If you want to sustainably improve your search engine rankings and help your customers find you more easily on the web, make sure you’re ready by running through the checklist below.

Part Three: The Pre-Flight Checklist

Do you have a website?
You can’t build links without a website.

Have you implemented Google Search Console and checked if you have any manual penalties?
Whether your site’s rankings are held back by an algorithm or a manual penalty (that’s a penalty given to you by a person, not a robot), these issues need to be addressed before you worry about links. Otherwise, the links you gain might not have much impact on organic SERP positions. Marie Haynes is the lady with the powerful knowledge you need.

Is your technical and on-page SEO tight or a blight?
Before you start working on external SEO, leverage the potential your site already has.

Do you know who your target audience is?
Whether you’re developing your marketing strategy, working on your website, or researching target sites for link building opportunities, you must be performing marketing research. Don’t “trust your gut.”

Defining your audience and buyer personas is a crucial task for understanding where in the marketing funnel your campaign is operating. Mike King has created an extensive write-up on these processes.

Do you know how to connect with your target audience?

What’s important to the people in your target audience? Discovering who is in a position of authority or influence and understanding what topics resonate with them is not a step you should skip. BuzzSumo is a tremendously helpful tool for accomplishing this task.

Can people easily navigate your site?

Don’t just say yes to this. Test, test, test…with real humans. Consider these helpful resources.

Will people find value in your site?

We’re not talking about your products. When it comes to building real links, you need to have content on your website that offers real value for free. When you’re contacting strangers in the wilds of the web, the value of the page you’re hoping to secure links to is usually all you have to offer in exchange for the link.

Will the links you build help real people other than you, the client, or the brand?

Yes? You’re ready to win.
