With the customised settings of the OnPage.org crawler, you can decide what your crawler should and should not do. We’ll show you how in this blog post.
OnPage.org has its own crawler, which examines the website you want to analyse. Its crawling technology closely resembles the way Google crawls websites on the internet. The OnPage.org Crawler starts with a certain page (usually the home page) and makes its way from page to page by following the internal links. Just like the Google Crawler, the OnPage.org Crawler can be controlled: you can tell it to exclude certain directories or pages from the crawl (and with that, from the analysis). Subdomains can also be analysed in depth.
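To make the idea of following internal links concrete, here is a minimal sketch in Python. It is not the OnPage.org implementation: the function `crawl`, its parameters and the tiny `SITE` link graph are made up for illustration, and the sketch walks an in-memory dictionary instead of fetching real pages.

```python
from collections import deque


def crawl(link_graph, start_url, excluded_prefixes=(), max_urls=100):
    """Walk internal links breadth-first, skipping excluded paths.

    link_graph maps each URL to the URLs it links to (a hypothetical
    stand-in for fetching and parsing real pages).
    """
    def is_excluded(url):
        return any(url.startswith(prefix) for prefix in excluded_prefixes)

    if is_excluded(start_url):
        return []

    crawled = []
    seen = {start_url}
    queue = deque([start_url])
    # Stop when there is nothing left to visit or the URL limit is reached.
    while queue and len(crawled) < max_urls:
        url = queue.popleft()
        crawled.append(url)
        for target in link_graph.get(url, []):
            if target in seen or is_excluded(target):
                continue
            seen.add(target)
            queue.append(target)
    return crawled


# Hypothetical site structure, used only for this example.
SITE = {
    "/": ["/blog/", "/login/", "/products/"],
    "/blog/": ["/", "/blog/post-1/"],
    "/login/": ["/login/reset/"],
    "/products/": ["/"],
    "/blog/post-1/": [],
}

pages = crawl(SITE, "/", excluded_prefixes=("/login/",))
```

Because `/login/` is excluded, the page behind it (`/login/reset/`) is never discovered either, which is the same effect an excluded directory has on the analysis.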
Customising the settings of the OnPage.org Crawler creates the following benefits:
- Pages that are irrelevant for ranking (e.g. the login section) are not analysed
- Crawling of single subdomains
- Crawling of only those pages that are actually indexed in Google
- Plus many more
The crawler settings are divided into four categories (tabs): Crawls, Basic Settings, Advanced Settings and Test Settings. You can find them under “Settings” within a selected project, so they can be modified individually for every single project. Let’s take a closer look at the settings!
“Crawls” lists all recently performed crawls within a project and who executed them. The table also shows the Crawling Limit (learn more about extending your crawl budget), the number of URLs found, the number of URLs crawled and the number of URLs excluded according to your settings.
Tip: If the number of found URLs is significantly higher than the number of crawled URLs, it is a good idea to extend the Crawling Limit. Make sure the number of excluded URLs isn’t too high, as this could hint at incorrect robots.txt settings.
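The tip above can be sketched as a small check on the numbers from the “Crawls” table. This is only an illustration: the function `review_crawl` and both thresholds (`found_ratio` and `excluded_share`) are assumptions chosen for the example, not values used by OnPage.org.

```python
def review_crawl(found, crawled, excluded, found_ratio=1.5, excluded_share=0.3):
    """Return hints based on the figures of a finished crawl.

    found_ratio and excluded_share are hypothetical thresholds; adjust
    them to whatever "significantly higher" and "too high" mean for you.
    """
    hints = []
    if crawled and found / crawled >= found_ratio:
        hints.append("Consider extending the Crawling Limit.")
    if found and excluded / found >= excluded_share:
        hints.append("Check the robots.txt settings for unwanted exclusions.")
    return hints


# Example figures, invented for illustration.
hints = review_crawl(found=12000, crawled=5000, excluded=300)
```

With these example figures only the first hint is returned, because 300 excluded URLs are a small share of the 12,000 URLs found.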
You can make several adjustments in the Basic Settings. Let’s go through them step by step.
Maximum number of URLs
Set the maximum number of URLs to be crawled. If you don’t know how many URLs your site has, try a site query in Google, for example: “site:en.ryte.com”. The number shown at the top of the results tells you roughly how many of the domain’s pages are listed in the Google index, and you can use it as a guide. The OnPage.org crawler is able to crawl from 100 to 21,000,000 URLs.
Here you enter the URL the crawler should start from and treat as the homepage. It should be the actual homepage. If that’s not what you want, just leave the box blank.
If your website relies heavily on cookies, you can allow the crawler to accept cookies here. This option is disabled by default in order to reveal problems that occur as soon as users (or other crawlers) reject cookies, such as session IDs in URLs or cloaking. These errors are often overlooked because browsers have cookies enabled by default. This is an advanced option and should be enabled with caution.
The OnPage.org crawler treats images as separate resources and crawls them by default. If you prefer to save resources and only allow the crawling of HTML content, untick this option. Broken and deleted images will then no longer appear in the reports, however. We recommend crawling images in order to receive a thorough error analysis of your website.
If your website has a lot of subdomains, you can crawl all of them by ticking this box. This option is disabled by default, meaning subdomains are not included in the crawl and links to them are treated as external links. Please note that crawling subdomains requires more resources.
Tip: You can also define a subdomain address as the crawler’s starting point (see point 2). The crawler will then start directly from the given subdomain. The prerequisite, however, is that the subdomain setting is ticked.
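The effect of the subdomain setting on how links are classified can be illustrated with a short Python sketch. The function `is_internal` and the flag `crawl_subdomains` are hypothetical names, and the host comparison is deliberately simplified (for instance, it does not treat `www.example.com` and `example.com` as the same host); how OnPage.org decides this internally is not described here.

```python
from urllib.parse import urlsplit


def is_internal(link, project_host, crawl_subdomains=False):
    """Decide whether a link belongs to the crawled project.

    With crawl_subdomains disabled, only the exact project host counts
    as internal and every subdomain is treated as external.
    """
    host = (urlsplit(link).hostname or "").lower()
    project_host = project_host.lower()
    if host == project_host:
        return True
    # Subdomains count as internal only when the option is enabled.
    return crawl_subdomains and host.endswith("." + project_host)


# example.com is a placeholder domain for this example.
default_result = is_internal("https://blog.example.com/post", "example.com")
enabled_result = is_internal(
    "https://blog.example.com/post", "example.com", crawl_subdomains=True
)
```

Here `default_result` is `False` and `enabled_result` is `True`: the same link switches from external to internal once subdomain crawling is ticked.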
KPIs and reports for indexable sites
Do you only want to analyse the pages that can be indexed by Google? Then tick this box. We don’t really recommend it and have disabled it by default, as there could be important pages which are not listed in the index yet. Moreover, you can set the filter in the reports to “only indexable sites” instead.
You can tell the OnPage.org crawler to respect or ignore the robots.txt. If you deliberately exclude content from Google in the robots.txt, you can exclude it from OnPage.org as well. If you would rather include all available content, let the crawler ignore the robots.txt and all pages will be crawled.
If you have the Business, Agency or Enterprise package, you can send us an adjusted robots.txt. The crawler can then check your site with the adjusted robots.txt settings. This works very well for relaunches, as it lets us simulate how Google’s crawler would behave.
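If you want to preview what an adjusted robots.txt would change before a relaunch, you can compare two versions yourself. The following sketch uses Python’s standard `urllib.robotparser` module; the helper `allowed_urls`, the two robots.txt texts and the example URLs are assumptions made up for illustration and say nothing about how the OnPage.org crawler evaluates the file.

```python
from urllib.robotparser import RobotFileParser


def allowed_urls(robots_txt, urls, user_agent="*"):
    """Return the URLs a crawler may fetch under the given robots.txt text."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [url for url in urls if parser.can_fetch(user_agent, url)]


# Hypothetical current and adjusted robots.txt files.
CURRENT_ROBOTS = "User-agent: *\nDisallow: /login/\nDisallow: /shop/\n"
ADJUSTED_ROBOTS = "User-agent: *\nDisallow: /login/\n"

URLS = [
    "https://example.com/",
    "https://example.com/login/",
    "https://example.com/shop/item",
]

before = allowed_urls(CURRENT_ROBOTS, URLS)
after = allowed_urls(ADJUSTED_ROBOTS, URLS)
# URLs that only become crawlable with the adjusted file.
newly_crawlable = [url for url in after if url not in before]
```

In this example `newly_crawlable` contains only the shop URL, since the adjusted file drops the `Disallow: /shop/` rule and keeps the login section blocked.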
In this first part on the crawler settings, you have learned about ways to adjust the robots.txt handling and other useful settings. In the second part, we will take a look at the advanced settings, which let you adjust the reports to your individual needs.
Keep on optimising!