We have added a new subfeature to the Duplicate Content Report: from now on, our customers can analyse their site’s content in even more depth – with the help of the “Near Duplicate Content Report”. In this blog post, we’ll explain the difference between Duplicate Content and Near Duplicate Content.
Our customers can now dive even deeper into their websites. Especially for Panda-affected websites, the Near Duplicate Content Report can help reveal content-related trouble spots.
What’s our definition of “Duplicate Content”?
To point out the difference, I’ll quickly explain our definition of “Duplicate Content”. Over the past two years, this report has gone through a lot of modifications. The first version of OnPage.org still calculated “Text Uniqueness” – a fantastic metric, as it could determine what percentage of a site’s content was unique. The problem, though: it didn’t scale. Once the number of analysed pages grew into the millions, the metric could no longer keep up. That’s why we started looking for a substitute. The result is a fingerprint that is calculated from a page’s content – just like other tools do it.
By the way, we only calculate the fingerprint for a page’s content, not for its source code. Additionally, we strip all numbers before calculating the fingerprint, as a single changed character would already change the fingerprint. If a page displays changing metrics, such as “How fast was the page loaded?”, this would result in a different fingerprint every time – excluding the numbers prevents that.
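The post doesn’t disclose which hash function is used, but the idea can be sketched as follows – `hashlib.sha1` and the normalisation steps here are illustrative assumptions, not the actual implementation:

```python
import hashlib
import re

def content_fingerprint(text: str) -> str:
    """Illustrative content fingerprint (the real hash function is an
    assumption here): strip all numbers, collapse whitespace, then hash."""
    without_numbers = re.sub(r"\d+", "", text)      # drop volatile metrics such as load times
    normalized = " ".join(without_numbers.split())  # collapse the whitespace left behind
    return hashlib.sha1(normalized.encode("utf-8")).hexdigest()

# Two pages that differ only in a displayed number get the same fingerprint:
print(content_fingerprint("Welcome! This page loaded in 120 ms.") ==
      content_fingerprint("Welcome! This page loaded in 98 ms."))   # True
```

Because the numbers are removed before hashing, a page whose only variation is a displayed metric keeps a stable fingerprint across crawls.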
We then compare this fingerprint with those of other pages. As soon as we find another URL with the same fingerprint, we point it out to our users in the “Duplicate Content Report” and leave it up to them to decide what to do with it. We only compare indexable pages: pages that point to another page via a canonical tag or are excluded in a similar way (robots.txt block, noindex, …) are not included.
What is “Near Duplicate Content”?
If you read our definition of “Duplicate Content” closely, you may have stumbled upon the following: a single differing character is enough for two pages to no longer be regarded as Duplicate Content. That’s why we have been working on a solution to this problem. What used to be the very accurate Text Uniqueness is now the calculation of “Near Duplicate Content”.
The new algorithm uses fingerprints as well – however, a large number of pages are compared bit by bit, which sheds light on the actual similarity of different texts. You can now tell how many changes are necessary to get from text A to text B. We have been testing this algorithm, and since the number of “false positives” is now low enough, we decided to include the report.
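The post doesn’t name the algorithm, but bit-level comparison of fingerprints is the hallmark of simhash-style schemes, where the Hamming distance between two fingerprints approximates how similar the underlying texts are. The sketch below shows how such a comparison could work; it is an assumption for illustration, not OnPage.org’s actual implementation:

```python
import hashlib

def simhash(text: str, bits: int = 64) -> int:
    """Simhash-style fingerprint (an illustrative assumption): each token
    votes on every bit; the sign of the vote sum becomes the bit value."""
    weights = [0] * bits
    for token in text.lower().split():
        h = int(hashlib.md5(token.encode("utf-8")).hexdigest(), 16)
        for i in range(bits):
            weights[i] += 1 if (h >> i) & 1 else -1
    return sum(1 << i for i in range(bits) if weights[i] > 0)

def hamming_distance(a: int, b: int) -> int:
    """Number of differing bits between two fingerprints."""
    return bin(a ^ b).count("1")

# Hypothetical product pages that differ in a single token:
page_a = simhash("Adidas running shoe size 39 with breathable mesh upper "
                 "cushioned sole and reflective details for evening runs")
page_b = simhash("Adidas running shoe size 40 with breathable mesh upper "
                 "cushioned sole and reflective details for evening runs")
page_c = simhash("Our team focuses on search engine optimisation software "
                 "for agencies and in-house marketing teams worldwide")

# Near duplicates differ in only a few bits; unrelated pages in roughly half.
print(hamming_distance(page_a, page_b), hamming_distance(page_a, page_c))
```

Unlike an exact fingerprint, which flips completely on any change, this scheme degrades gradually: the fewer bits differ, the fewer edits are needed to get from text A to text B.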
Our goal: detecting very similar pages that differ in only 2-3 sentences and, all in all, do not offer any added value. Another example would be product pages à la “Adidas Shoe Size 39” and “Adidas Shoe Size 40” – the only difference is the stated size, so there is actually no added value.
In the end, it looks like this: the graph accumulates the number of all critical Near Duplicates found per page.
All found critical Near Duplicates per page
If you click through to the detailed view (the magnifier next to the number), you’ll get a more detailed listing that also includes lighter variants of Near Duplicate Content:
Detailed view with percentages
Why is that important?
Anyone who is acquainted with the patents of Google and other search engines knows that such “Near Duplicate Content” algorithms are extremely important: they help search engines adjust their crawlers. If a site repeatedly displays similar content that does not provide any added value, search engines will prefer to invest their resources in domains where valuable content is more likely to be found.
If a page resembles a previously crawled page too closely, crawlers may ignore such pages and their links. We, of course, do not proceed like that: our goal is to show you as much of your domain’s content as possible, to give you an overview of potentially dormant parts of your website that could affect your domain metrics negatively.
Of course, you can also use such algorithms to reveal entire spam projects. If hardly any page of your domain is actually unique, it’ll only be a matter of time until the Panda filter strikes again.
You’ll find our new “Near Duplicate Content” report in the main menu under “Content”, below “Duplicate Content”. It’ll help you reveal very similar pages within your website. The inspector’s detailed view ranges from “Very Critical Near Duplicate Content” to “Potential Near Duplicate Content”.