OnPage.org’s extended settings let you tailor the crawler to your individual projects. Boilerplate settings and live testing of your configuration are now possible, too.
In Part 1 of this post, we showed you how the basic settings can be adapted to get even better analyses of your website. In this second part, we’ll talk about the extended settings, which give you even more control over the crawler. Let’s start right away!
Extended Crawler Settings
Apart from the Basic Settings, you can customise further options to tailor the crawling of your project. In the following, we’ll explain all of these settings in depth.
Crawler Settings in OnPage.org Zoom
When OnPage.org crawls a website, it puts load on your server, which has to process more requests than usual. Especially if your server isn’t very powerful, meaning it can’t “cope” with the crawling and therefore slows down or returns timeouts, we recommend lowering the number of Parallel Requests; if your server handles the load well, raising the value lets us crawl your site considerably faster. We offer 10 parallel requests by default – however, up to 100 Parallel Requests are possible.
Unfortunately, the unit “Parallel Requests” is often misinterpreted as “requests per second”. It’s simplest to visualise the setting as “people who are clicking through my site”. Setting Parallel Requests to 1 would mean one user is on your website, waiting for a page to load; once it has loaded, they follow the first link, wait again, follow the next link, and so on. Increasing the value adds a corresponding number of “simulated” users clicking through your site. That’s why this value can usually be raised without fear of bringing the website to a standstill.
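The “simulated users” model above can be sketched in a few lines. This is a toy simulation, not the actual crawler: the page-load time is an assumed constant, and each worker stands in for one user loading pages strictly one after another.

```python
import time
from concurrent.futures import ThreadPoolExecutor

PAGE_LOAD_SECONDS = 0.05  # assumed, constant server response time

def simulated_user(pages):
    """One 'user': load a page, wait for it, follow the next link, repeat."""
    for _ in range(pages):
        time.sleep(PAGE_LOAD_SECONDS)
    return pages

def crawl(parallel_requests, total_pages):
    """Split the crawl across N simulated users clicking in parallel."""
    per_user = total_pages // parallel_requests
    with ThreadPoolExecutor(max_workers=parallel_requests) as pool:
        # each worker is one simulated user; they wait for pages independently
        return sum(pool.map(simulated_user, [per_user] * parallel_requests))

start = time.monotonic()
crawl(parallel_requests=1, total_pages=20)
serial = time.monotonic() - start

start = time.monotonic()
crawl(parallel_requests=10, total_pages=20)
parallel = time.monotonic() - start

print(f"1 user: {serial:.2f}s, 10 users: {parallel:.2f}s")
```

With one simulated user the 20 page loads happen back to back; with ten users they overlap, so the same crawl finishes much sooner – which is why the setting is not “requests per second”.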
The OnPage.org crawler can crawl your site even if it isn’t public but password protected. If your website requires such a login, simply insert your username and password; if you don’t need this feature, just leave it blank. Please note that this only works for “.htaccess” protected sections – self-built login systems are not yet supported.
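As a sketch of what happens under the hood: HTTP Basic Auth, the mechanism behind .htaccess protection, simply adds an Authorization header carrying base64-encoded credentials to each request. The URL and credentials below are made up for illustration.

```python
import base64
from urllib.request import Request

def with_basic_auth(url, username, password):
    """Attach an HTTP Basic Auth header, as used by .htaccess/.htpasswd
    protected sections. Credentials are base64-encoded, not encrypted."""
    credentials = base64.b64encode(f"{username}:{password}".encode()).decode()
    req = Request(url)
    req.add_header("Authorization", f"Basic {credentials}")
    return req

# hypothetical staging site and login
req = with_basic_auth("https://staging.example.com/", "alice", "s3cret")
print(req.get_header("Authorization"))
```

Note that base64 is an encoding, not encryption – which is why such logins should only travel over HTTPS.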
Crawler User Agent
By default, our crawler identifies itself as the Googlebot while crawling. That way, we can simulate Google’s crawling and try to look at a website through Google’s eyes. In rare cases, it may be necessary to use a different user agent for the crawler, which can be set here. You can also set the user agent to “OnPage Crawler” so that your server logs show it’s us.
Watch out though: scripts may run, intentionally or not, that display content exclusively to the Googlebot (cloaking) – that’s why we recommend crawling the site as Googlebot in order to reveal problems of that kind.
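For illustration, here is how a crawler might set its user agent on a request, using Python’s standard library. The Googlebot string shown is the commonly documented desktop form; this is a minimal sketch, not OnPage.org’s implementation.

```python
from urllib.request import Request

# Googlebot's documented desktop user-agent string; swap in "OnPage Crawler"
# if you want server logs to show it's the tool rather than Google.
GOOGLEBOT_UA = "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"

def build_request(url, user_agent=GOOGLEBOT_UA):
    """Build a request that identifies itself with the given user agent."""
    return Request(url, headers={"User-Agent": user_agent})

req = build_request("https://www.example.com/")
print(req.get_header("User-agent"))
```

Crawling the same URL once with the Googlebot string and once with a neutral user agent, then diffing the responses, is a simple way to spot cloaking.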
Additional Request Header: X-Request-With
Some webservers have IP-based blocking systems. In such scenarios, the “X-Request-With” header can be used to keep crawling with the Googlebot user agent while identifying the requests to the webserver as coming from the OnPage Crawler. You can insert any value, which we will then send with every request and which can, for instance, be used for whitelists.
Remove the Boilerplate *new*
The Boilerplate settings are all new! Our text statistics are based on the entire page content by default. If you prefer the reports to be based on the “main content” only, you can remove the so-called boilerplate. In doing so, we’ll try to ignore your site’s header, footer and sidebar, so that only the main content is analysed. However, the quality of this algorithm strongly depends on the quality of your website’s code – that’s why the feature should be used with caution.
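A naive version of boilerplate removal can be sketched with Python’s standard library. The real algorithm is more sophisticated, but the core idea – skip text inside header, footer and sidebar elements and keep the rest – is the same, and the sketch shows why markup quality matters: it only works if those regions are marked up as such.

```python
from html.parser import HTMLParser

BOILERPLATE_TAGS = {"header", "footer", "aside", "nav"}

class MainContentExtractor(HTMLParser):
    """Naive boilerplate removal: drop text nested inside header/footer/
    sidebar/nav elements, keep everything else."""
    def __init__(self):
        super().__init__()
        self.depth = 0    # nesting level inside boilerplate elements
        self.chunks = []  # collected main-content text

    def handle_starttag(self, tag, attrs):
        if tag in BOILERPLATE_TAGS:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag in BOILERPLATE_TAGS and self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth == 0 and data.strip():
            self.chunks.append(data.strip())

html = """<body><header>Site name</header>
<main><p>The actual article text.</p></main>
<footer>Imprint</footer></body>"""
parser = MainContentExtractor()
parser.feed(html)
print(" ".join(parser.chunks))
```

If a site renders its navigation in plain `<div>`s instead of semantic elements, a heuristic like this has nothing to hold on to – hence the caution above.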
The Boilerplate settings
Adapting the detection of headers
Depending on your Content Management System, or for other reasons, the H1 tag may not be used as the main heading tag. Here you can specify a different tag; CSS classes can be used as well. Valid inputs are ‘h2’, ‘h3’, or CSS selectors such as ‘div.headline’.
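To illustrate how such an input might be interpreted, here is a sketch that accepts both plain tag names and simple “tag.class” selectors. This is not OnPage.org’s actual parser, just a minimal model of the two input forms mentioned above.

```python
def parse_selector(selector):
    """Split a simple selector like 'div.headline' into (tag, class).
    Plain inputs like 'h2' yield (tag, None). A sketch, not full CSS."""
    tag, _, cls = selector.partition(".")
    return (tag or None, cls or None)

def matches(selector, tag, classes=()):
    """Check whether an element with the given tag and classes would be
    treated as the main heading under the configured selector."""
    want_tag, want_cls = parse_selector(selector)
    return (want_tag in (None, tag)) and (want_cls is None or want_cls in classes)

assert matches("h2", "h2")
assert matches("div.headline", "div", classes=["headline", "big"])
assert not matches("div.headline", "div", classes=["teaser"])
```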
With the Business, Agency or Enterprise package, you can also choose to crawl only a particular subfolder. With this setting, insert the respective URL (relative path) and we will examine ONLY that subfolder.
If your server displays content based on the Accept-Language request header, you can set your desired value here. Leave the box empty to use standard values. Valid values are ISO language codes, such as “de” for German or “en-us” for American English. By default, we don’t send an Accept-Language header and let the web server decide on the language.
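To see why the value matters, here is a sketch of how a server might pick a language from the Accept-Language header. It is simplified – real servers also honour quality (q) values – and the list of available languages is an assumption for the example.

```python
def pick_language(accept_language, available=("en-us", "de", "fr")):
    """Minimal sketch of server-side language selection from the
    Accept-Language header; q-values are ignored for brevity."""
    if not accept_language:
        return available[0]  # no header sent: server falls back to its default
    for token in accept_language.split(","):
        code = token.split(";")[0].strip().lower()
        if code in available:
            return code
    return available[0]

assert pick_language("de") == "de"
assert pick_language("en-us, de;q=0.8") == "en-us"
assert pick_language("") == "en-us"  # empty setting: the server decides
```

Setting the crawler’s value to “de” would thus make such a server serve the German variant of each page, which is what the crawl then analyses.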
Would you like the crawler to download and analyse your sitemap.xml file(s)? This option is needed for the “sitemap.xml” report. If your website has a lot of sitemaps (20+), deactivating this option can improve crawling performance.
If you’re not using the standard file name for your sitemap.xml and haven’t referenced it in your robots.txt, you can enter the URL of your sitemap.xml here. Alternatively, you can use this option to analyse a sitemap.xml other than the one served to other bots (for instance, when testing a new sitemap.xml). You can also specify several sitemaps here (such as a video sitemap or an image sitemap). We of course also support sitemap index files and gzip-compressed sitemap files.
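For reference, a sitemap.xml is a plain XML file in the sitemaps.org format, and extracting its URLs takes only a few lines. The sketch below uses Python’s standard library and a made-up two-URL sitemap.

```python
import xml.etree.ElementTree as ET

# namespace defined by the sitemaps.org protocol
SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"

def sitemap_urls(xml_text):
    """Extract all <loc> entries from a sitemap.xml urlset."""
    root = ET.fromstring(xml_text)
    return [loc.text for loc in root.iter(f"{SITEMAP_NS}loc")]

sitemap = """<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://www.example.com/</loc></url>
  <url><loc>https://www.example.com/blog/</loc></url>
</urlset>"""
print(sitemap_urls(sitemap))
```

A sitemap index file works the same way, except that its `<loc>` entries point at further sitemap files rather than at pages.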
Crawling the sitemap files
If the OnPage.org crawler comes across a sitemap, not all of the URLs it contains are crawled automatically: we only crawl URLs that are also reached via links or redirects. Tick the box if you want the URLs in the sitemap to be crawled anyway.
Testing the Crawler Settings
The OnPage.org crawler settings can be tested directly and live via the “Test Settings” tab. Insert any URL and we’ll show you the crawling result, live! This way, you can test, for instance, whether your login data or your adjusted heading detection works properly.
Live testing of the Crawler Settings
As you can see, there are various options for improving and individualising your reports. Make use of the crawler settings to get a feel for how Google crawls your site and how this affects your website.
Keep on optimising!