A 12-Step Guide to Follow Before Crawling

Crawl-first SEO focuses on two of the main parts of the search engine infrastructure: crawling and indexing.

If the pages on a site aren’t crawled, they can’t be indexed. And if your pages can’t be indexed, they won’t appear in search engine results pages (SERPs).

This is what makes crawl-first SEO important: it’s all about making sure all the pages on a site are crawled and indexed, so they can rank well in the SERPs.

Crawl-first SEO can:

Help you understand how Google crawls your sites.
Identify incompatibilities.
Help Google access useful pages.
Help Google understand the content.

But before you crawl your clients’ sites, make sure you follow this 12-step guide.

Before Configuring the Crawl

Collect Information & Data from the Client

1. Send a Crawl Questionnaire Document to Your Client

In this document, you should ask the following questions:

How many products do you have on your site?

This is a question that you can’t answer yourself. You can’t know how many products are in their databases or exactly how many of them are available online.

Your client, on the other hand, usually knows the answer by heart and can give it to you easily most of the time.

I say “most of the time” because I have come across clients who don’t know how many products they have on their sites. This can happen, too.

The number of products the client has is one of the most important pieces of information you need before crawling the site. In fact, it is one of the main reasons you are going to conduct a “crawl-first SEO audit” on their site.

You need to know the number of products available online because you will want to answer two essential questions at the end of your SEO audit:

Can the crawler access all the product pages on the site? If it cannot, the best approach is to investigate the web server logs. This will help you understand whether the search engine bot, let’s say Googlebot, can access the product pages even though your crawler cannot. Otherwise, many things could be causing the problem, including JavaScript.
Is the crawler accessing more product pages on the site than it should? If your crawl contains more product URLs than there should be, this indicates a problem with how the site is crawled. In the worst case, there may be a crawler trap, which your audit should uncover.
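As a minimal sketch of how these two questions could be checked after a crawl, the snippet below compares the number of distinct product URLs found with the figure the client gave. The `/product/` path pattern, the example URLs, and the function names are assumptions made for illustration; a real site will need its own rule for recognizing product pages.

```python
from urllib.parse import urlsplit


def product_coverage(crawled_urls, expected_count, is_product_url):
    """Compare distinct product URLs found by the crawl with the client's figure."""
    found = {url for url in crawled_urls if is_product_url(url)}
    difference = len(found) - expected_count
    if difference < 0:
        verdict = "missing"  # the crawler reached fewer product pages than exist
    elif difference > 0:
        verdict = "excess"   # more URLs than products: duplicates or a crawler trap?
    else:
        verdict = "match"
    return {
        "found": len(found),
        "expected": expected_count,
        "difference": difference,
        "verdict": verdict,
    }


def looks_like_product(url):
    # Hypothetical rule: product pages live under /product/.
    return urlsplit(url).path.startswith("/product/")


crawled = [
    "https://example.com/product/1",
    "https://example.com/product/2",
    "https://example.com/product/2?sort=asc",  # same product, extra URL
    "https://example.com/about",
]
report = product_coverage(crawled, expected_count=2, is_product_url=looks_like_product)
print(report)
```

Here the parameterized duplicate pushes the count above the client's figure, which is the kind of signal that points to excess product URLs.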

I have been asked before whether articles can be treated as products on other types of sites. The answer is yes.

When we ask clients for the number of products available on their sites, by “products” we mainly mean whatever the site offers as its long tail. Besides products, that can be articles, news, podcasts, videos, etc.

Do the pages on your site return different content based on user-agent?

You are asking if the content on the pages changes with user-agent.

Do the pages on your site return different content based on perceived country or preferred language?

You would like to learn if the content on the pages changes with geolocated IPs or languages.

Are there crawl blocking accesses or restrictions on your site?

First, you are asking if they are blocking certain IPs or user-agents from crawling. Second, you would like to learn if there are any crawl restrictions on the site.

As an example of a crawl restriction, the server may respond with an HTTP status code other than 200 once a certain number of requests per second is exceeded.

For instance, the server may respond with HTTP status code 503 (Service Temporarily Unavailable) when a crawler’s requests exceed 10 pages per second.

What’s the bandwidth of your server?

Usually, they don’t know the answer to this question.

Basically, you should explain to your client that you are asking how many pages per second you can crawl on their site.

In any case, I recommend that you agree with your client on the number of pages per second at which you can crawl their site.

This agreement protects you from uncomfortable situations later, such as causing a server failure with your crawl requests.
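Once a rate is agreed, the crawler has to stick to it. The following is a minimal sketch of a request throttle, assuming a single-threaded crawler; the `Throttle` class and its parameter names are hypothetical and not taken from any particular crawling tool.

```python
import time


class Throttle:
    """Keep requests at or below an agreed number of pages per second."""

    def __init__(self, pages_per_second, clock=time.monotonic, sleep=time.sleep):
        if pages_per_second <= 0:
            raise ValueError("pages_per_second must be positive")
        self.min_interval = 1.0 / pages_per_second
        self._clock = clock  # injectable so the class can be tested without waiting
        self._sleep = sleep
        self._last = None

    def wait(self):
        """Block until enough time has passed since the previous request."""
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now


# Assumed agreement with the client: at most 5 pages per second.
throttle = Throttle(pages_per_second=5)
# In the crawl loop, call throttle.wait() before sending each request.
```

Calling `wait()` before every request spaces them at least `1 / pages_per_second` seconds apart.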

Do you have preferred crawl days or hours?

Your client may have preferred crawl days or hours. For example, they may want their sites to be crawled on weekends or in the evenings.

However, if the client has such preferences and the available crawl days and hours are very limited, it is important to let them know that the SEO audit will take longer as a result.

2. Access & Collect SEO Data

Ask your client for access to their SEO data, for example:

Web server logs.
Google Analytics.
Google Search Console.

You should also download the sitemaps of the site, where available.

Verify the Crawler

3. Follow up Search Engine Bots’ HTTP Headers

As an SEO consultant, you should keep track of which HTTP headers search engine bots send in their crawl requests.

If your SEO audit concerns Googlebot, you should know which HTTP headers Googlebot sends to an HTTP or HTTPS server.

This is vital because when you tell your clients that you will crawl their sites as Googlebot does, for instance, you should be sure to send the same HTTP request headers as Googlebot to their servers.

The responses, and later the data, that you collect from a server depend on what your crawler sends in its HTTP request headers.

For example, imagine a server which supports brotli and your crawler requests:

Accept-Encoding: gzip,deflate

but not:

Accept-Encoding: gzip,deflate,br

At the end of your SEO audit, you may tell your client that there are crawl performance problems on their site, but this may not be true.

In this example, it is your crawler that doesn’t support brotli, and the site may not have any crawl performance problems.
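A quick way to catch this kind of mismatch is to compare the `Accept-Encoding` value your crawler sends with the value sent by the bot you want to imitate. The sketch below uses the two header values from the example above; the function names are hypothetical, and the parsing only handles the simple comma-separated form with optional `q` values.

```python
def parse_accept_encoding(header_value):
    """Return the content codings named in an Accept-Encoding value.

    Codings explicitly refused with q=0 are left out.
    """
    codings = set()
    for part in header_value.split(","):
        token, _, params = part.strip().partition(";")
        token = token.strip().lower()
        if not token:
            continue
        params = params.strip().lower()
        if params.startswith("q="):
            try:
                if float(params[2:]) == 0:
                    continue  # q=0 means "not acceptable"
            except ValueError:
                pass  # malformed q value: keep the coding
        codings.add(token)
    return codings


def missing_codings(crawler_header, reference_header):
    """Codings the reference bot accepts but the crawler does not."""
    return parse_accept_encoding(reference_header) - parse_accept_encoding(crawler_header)


missing = missing_codings("gzip,deflate", "gzip,deflate,br")
print(missing)
```

If the result is not empty, the crawler may receive differently compressed responses than the bot, so performance measurements from the crawl should be read with that in mind.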

4. Check Your Crawler

What HTTP headers does the crawler send?
Maximum number of pages per second you can crawl with your crawler?
Maximum number of links per page the crawler takes into account?
Does your crawler respect crawl instructions in:
- Robots.txt?
- Source code?
- HTTP headers?
How does the crawler handle redirections?
How many redirections can it follow?
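For the robots.txt item in this checklist, one option is to compare your crawler's decisions against an independent parser. The sketch below uses Python's standard `urllib.robotparser` on a made-up robots.txt; the sample rules and URLs are assumptions, and this parser may not match a given search engine's interpretation in every edge case.

```python
from urllib.robotparser import RobotFileParser


def build_parser(robots_txt_text):
    """Parse robots.txt content that has already been downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_txt_text.splitlines())
    return parser


# Hypothetical robots.txt content for illustration.
SAMPLE_ROBOTS = """\
User-agent: *
Disallow: /cart/
"""

parser = build_parser(SAMPLE_ROBOTS)
allowed = parser.can_fetch("Googlebot", "https://example.com/product/1")
blocked = parser.can_fetch("Googlebot", "https://example.com/cart/checkout")
print(allowed, blocked)
```

If your crawler fetched a URL for which `can_fetch` returns `False`, it is worth checking how the crawler handles robots.txt before trusting its crawl data.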

Verify & Analyze Collected Information & Data While Taking Decisions for the Crawl Configuration

5. Request Sample URLs from Your Client’s Site with Various:

User-agents.
Geolocated IPs.
Languages.

Do not rely solely on the answers your client gave in the crawl questionnaire document. This is not because your clients might lie to you, but simply because they don’t know everything about their sites.

I recommend that you carry out your own site-specific crawl tests before crawling the site.

I have been asked before whether this part is really important. Yes, it is, because the content on a site can change by user-agent, IP, or language.

For example, some sites practice cloaking. In your site-specific crawl tests, you should check whether the content on the site changes for the Googlebot user-agent.

On the other hand, some sites may send different content on the same URL based on language or geolocated IP. Google calls these “locale-adaptive pages”, and its support document on the subject, “How Google crawls locale-adaptive pages”, was modified recently.

Googlebot’s crawling behavior concerning locale-adaptive pages may be modified again in the future. The best approach is to know whether content on a site changes by the visitor’s perceived country or preferred language, to know how Googlebot and other search engine bots handle such pages at that time, and to adapt your crawlers accordingly.

Furthermore, these tests can help you identify a crawl problem on the site before crawling, which can be an important finding in your SEO audit.
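One simple way to run such a test is to fetch the same URL with each user-agent, IP, or language setting and compare fingerprints of the response bodies. The sketch below covers only the comparison step and assumes the bodies have already been fetched; the variant names and HTML snippets are invented for illustration.

```python
import hashlib


def fingerprint(body):
    """Hash a response body after collapsing whitespace differences."""
    normalized = " ".join(body.split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def varying_variants(responses, baseline):
    """Return the variants whose body differs from the baseline variant."""
    base = fingerprint(responses[baseline])
    return sorted(
        name for name, body in responses.items() if fingerprint(body) != base
    )


# Hypothetical bodies returned for the same URL under different request settings.
samples = {
    "browser": "<h1>Shoes</h1> <p>Price: 50</p>",
    "googlebot": "<h1>Shoes</h1>\n<p>Price: 50</p>",
    "browser-fr": "<h1>Chaussures</h1> <p>Prix : 50</p>",
}
changed = varying_variants(samples, baseline="browser")
print(changed)
```

Exact hashing will also flag harmless differences such as timestamps or session tokens, so a flagged variant is a prompt to inspect the page rather than proof of cloaking or locale adaptation.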

6. Get to Know the Server

Gather information about the server and the crawl performance of the site. It is good to know what kind of server you are going to send your crawl requests to, and to have an idea of the site’s crawl performance before crawling.

To get an idea of the crawl performance, you can examine the site-specific crawl requests you performed in step 5. This part is necessary to find the optimum crawl rate to define in your crawl configuration file.

From my standpoint, the most difficult element to set up in a crawl is the crawl rate.

7. Pre-Identify the Crawl Waste

Before preparing an efficient crawl configuration file, it is important to pre-identify the crawl waste on your client’s site.

You can identify crawl waste on the site from the SEO data you collected: web server logs, Google Analytics, Google Search Console, and the sitemaps.
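As a rough sketch of this pre-identification, the snippet below flags URLs that bots requested (taken from the logs) but that are absent from the sitemap or carry query parameters you suspect of producing waste. The parameter names, the sample URLs, and the idea that these two signals indicate waste are assumptions; treat the output as candidates to review, not a verdict.

```python
from urllib.parse import parse_qsl, urlsplit


def find_waste_candidates(log_urls, sitemap_urls, suspect_params):
    """Map each suspicious URL from the logs to the reasons it was flagged."""
    sitemap = set(sitemap_urls)
    candidates = {}
    for url in sorted(set(log_urls)):
        reasons = []
        if url not in sitemap:
            reasons.append("not in sitemap")
        query = urlsplit(url).query
        params = {key for key, _ in parse_qsl(query, keep_blank_values=True)}
        hits = sorted(params & suspect_params)
        if hits:
            reasons.append("parameters: " + ", ".join(hits))
        if reasons:
            candidates[url] = reasons
    return candidates


# Hypothetical data extracted from web server logs and the sitemap.
log_urls = [
    "https://example.com/product/1",
    "https://example.com/product/1?sessionid=abc",
    "https://example.com/list?sort=price&page=2",
]
sitemap_urls = ["https://example.com/product/1"]
waste = find_waste_candidates(log_urls, sitemap_urls, {"sessionid", "sort"})
for url, reasons in waste.items():
    print(url, reasons)
```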

8. Decide Whether to Follow, Not Follow, or Just Keep in the Crawl Database:

URLs blocked by robots.txt.
Links to other websites, including subdomains of the client’s site.
URLs with a specific scheme (protocol), for example, HTTP.
Specific content types (for instance, PDFs or images).

Your choices depend on the type of SEO audit you are going to perform.

In your SEO audit, for example, you may want to analyze first the URLs blocked by robots.txt, then the links to other websites or to subdomains of the site, and next the links to URLs with a specific scheme (protocol).

However, keep in mind that following these URLs, or just keeping them in your crawl database, increases the volume of data and therefore the complexity of the data analysis later.
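These decisions can be written down as a small policy that maps each kind of URL to an action. The sketch below is one possible shape for it, assuming a site crawled on the bare domain `example.com` over HTTPS; the category names, the action labels, and the extension list are all hypothetical, and URLs blocked by robots.txt are left out because they need a robots.txt parser.

```python
import posixpath
from urllib.parse import urlsplit


def categorize(url, site_host, followed_scheme, non_html_extensions):
    """Put a discovered URL into one of the categories discussed above."""
    parts = urlsplit(url)
    host = (parts.hostname or "").lower()
    extension = posixpath.splitext(parts.path)[1].lstrip(".").lower()
    if host != site_host:
        return "subdomain" if host.endswith("." + site_host) else "external"
    if parts.scheme != followed_scheme:
        return "other_scheme"
    if extension in non_html_extensions:
        return "content_type"
    return "internal"


# Assumed policy: follow = crawl it, keep = store the URL only, ignore = drop it.
POLICY = {
    "internal": "follow",
    "other_scheme": "keep",
    "subdomain": "keep",
    "external": "ignore",
    "content_type": "keep",
}


def decide(url):
    category = categorize(url, "example.com", "https", {"pdf", "jpg", "zip"})
    return POLICY[category]


for url in [
    "https://example.com/shoes",
    "http://example.com/shoes",
    "https://blog.example.com/post",
    "https://other.org/",
    "https://example.com/catalog.pdf",
]:
    print(url, decide(url))
```

Changing a single entry in `POLICY` is then enough to switch, say, subdomains from "keep" to "follow" for a different type of audit.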

Crawl Configuration

9. Create an Efficient Crawl Configuration File

Select the right:
- User-agent.
- Geolocated IP.
- Language.
Set the optimum:
- Crawl rate.
- Crawl depth.
Choose the initial URL wisely:
- Scheme (protocol)?
- Which TLD?
- With or without subdomain in hostname?
Choose to follow, not to follow or keep in the crawl database the URLs:
- Blocked by robots.txt.
- With a specific scheme (protocol).
- Belonging to subdomains of the client’s domain.
- Of other domains.
- Of a content type with specific extensions (for example, pdf, zip, jpg, doc, ppt).
Avoid crawling crawl waste (especially if you have limited resources).

As for the crawl depth, I recommend that you select a small crawl depth at the beginning and increase it progressively in your crawl configuration.

This is especially helpful if you are going to crawl a big website. Of course, you can only do this if your crawler allows you to increase the crawl depth step by step.

Additionally, there are intelligent crawlers that can identify crawl waste on their own while crawling. If you have such a crawler, you don’t need to handle crawl waste manually in your crawl configuration.

After Configuring the Crawl

10. Inform the Client About Your User-agent & IP

This is especially crucial if you are going to crawl a big website, as it prevents the client from blocking your crawl. For small sites it is less important, but I recommend doing it anyway.

In my opinion, this is a good professional habit. Moreover, it shows your clients that you are an expert in crawling.

11. Run a Test Crawl

Analyze your test crawl data. Find out if there are unexpected results.
Some issues to check:
- What is the crawl rate? Does it need adjustments?
- Is the crawler interpreting the robots.txt of the domains you are following in your crawl configuration correctly?
- Are you receiving a lot of HTTP status codes other than 200, especially 503? What may be the reason?
- Is there a crawl restriction?
- Does your crawl configuration produce the expected results? Check first the domains you want to follow, then the links you want to keep in your crawl database, and finally the links you don’t want to follow.
- Is there an issue with the crawl depth? Is the crawl depth that the crawler reports for the crawled URLs plausible? Several things can affect the crawl depth: first, the crawler itself; second, what you have already crawled, such as an HTML sitemap.
- Are you crawling some unwanted content type?
- Is there any crawl waste in your crawl data?
Take action accordingly by modifying your crawl configuration file. Sometimes you may even want to change your crawler.
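For the status code check, a short summary of the test crawl is usually enough to see whether something is off. The sketch below assumes you can export the list of status codes from your crawler; the sample values are invented.

```python
from collections import Counter


def status_summary(status_codes):
    """Summarize how often the test crawl received non-200 and 503 responses."""
    counts = Counter(status_codes)
    total = sum(counts.values())
    non_200 = total - counts.get(200, 0)
    return {
        "counts": dict(counts),
        "non_200_share": non_200 / total if total else 0.0,
        "share_503": counts.get(503, 0) / total if total else 0.0,
    }


# Hypothetical status codes exported from a test crawl.
summary = status_summary([200, 200, 503, 200, 404, 503, 200, 200])
print(summary)
```

A noticeable share of 503 responses is a hint that the crawl rate may be above what the server or a rate limit allows.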

Finally, Launch the Crawl

12. If All Is Fine

Calculate how long the crawl will take.
Finally, launch your crawl.
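A back-of-the-envelope estimate of the crawl duration can be sketched as follows, assuming a constant crawl rate and an estimated number of URLs; the figures in the example are invented, and the `crawl_hours_per_day` parameter is a hypothetical way to account for a client's preferred crawl hours.

```python
import math


def estimate_crawl_duration(url_count, pages_per_second, crawl_hours_per_day=24.0):
    """Return (crawl hours, calendar days) for a crawl at a constant rate."""
    hours = url_count / pages_per_second / 3600
    days = math.ceil(hours / crawl_hours_per_day)
    return hours, days


# Assumed figures: 1,000,000 URLs at 5 pages per second, crawling 8 hours a day.
hours, days = estimate_crawl_duration(1_000_000, 5, crawl_hours_per_day=8)
print(f"{hours:.1f} crawl hours over {days} days")
```

With these figures the crawl needs about 55.6 hours of crawling, which spreads over 7 days when only 8 hours a day are available.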
