Duplicate Content: Definition, Causes and How to Get Rid of It
All search engines, including Google, have problems with duplicate content. When the same text appears at multiple locations on the internet, a search engine can’t determine which URL should be shown in the search engine results pages (SERPs). That can hurt a web page’s ranking. The problem is only exacerbated when altered versions of the content are linked to. In this article, we’ll help you understand some of the reasons why duplicate content exists and how to solve the problem.
Duplicate content defined
If you stand at a crossroad and multiple road signs indicate different directions to the same destination, you won’t know which direction to go. If, on top of that, the final destinations are even slightly different, the problem is even bigger. As a web user, you won’t really care because you’ll find the content needed, but a search engine needs to choose which page should be shown in its results because it does not want to display the same content more than once.
Let’s presume an article about ‘keyword a’ is shown at http://www.website.com/keyword-a/, but the same content is also shown at http://www.website.com/category/keyword-a/. This scenario occurs a lot in CMSs. If this article is shared by numerous bloggers, but a few of them link to the first URL while the rest link to the second, the search engine’s problem becomes your problem, because the links now each promote a different URL. As a result of this split, you are less likely to rank for ‘keyword a’; it would be much better if all the links pointed to the same URL.
How to use a duplicate content checker
Using a duplicate content checker to identify internal duplicates across a whole website is easy. It is a necessary step in SEO work, because Google and other search engines favor unique content that brings value to readers. Duplicate content and duplicate meta tags can get a website demoted, for example under Google’s Panda update, which means your pages may rank much lower or not appear in the SERPs at all.
How Google penalizes sites for duplicate content
When duplicate content is found on a website, there is a good chance that Google will act on it. What can happen? In most cases, website owners lose traffic, because Google stops indexing the page where the duplicated text is detected. When several pages carry the same content, Google chooses which page is most valuable to the user and most likely to appear in the SERPs. As a result, some pages stop being visible to users. In severe cases, Google can impose a duplicate content penalty, which means you are suspected of manipulating search results. If the text was copied from another site, you may also receive a DMCA notice for copyright violation.
There are numerous reasons why duplicate content exists, mostly technical. Humans don’t often store the same content in more than one place without ensuring that it is clear which one is original. The technical reasons mostly happen because developers don’t think like browsers or even users, let alone search engine bots. In the example mentioned above, a developer will see the article as only existing once.
The developers are not crazy, but they do see things from a different perspective. A CMS powering a website will only have one article in the database, but the site’s software will allow the same article to be retrieved via more than one URL. From the developer’s perspective, the article’s unique identifier isn’t the URL, but the database ID for the article. A search engine, however, sees the URL as the unique identifier of any piece of text. Once this is explained to developers, they will understand the problem. This article also provides solutions to it.
E-commerce websites keep tabs on visitors and enable them to add the products they want to a shopping cart. This is achieved by giving each user a ‘session’: a short history of the visitor’s actions on the site, which can include things like items in a shopping cart. To preserve a session when a visitor moves between pages, the session ID has to be stored somewhere. This is most commonly done with cookies. Search engines, however, usually do not store cookies.
Some systems add Session IDs to the URL, resulting in all internal links on the site getting a Session ID appended to the URL. As Session IDs are unique to a session, new URLs are created, and this results in duplicate content.
Parameters passed via URLs
Duplicate content is also created when URL parameters are used, e.g. in tracking links, but the page’s content does not change. Search engines see http://www.website.com/keyword-a/ and http://www.website.com/keyword-a/?source=facebook as different URLs. Although the latter helps you track where users came from, it might make it more difficult for your page to rank highly, and this is not something you want!
The same applies to every other type of parameter that is added to a URL without changing the content. Other examples are parameters that change the sort order or display a different sidebar.
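As a rough illustration, a crawler or site audit script could collapse such URLs by dropping parameters that are known not to affect the content. The sketch below is only one way to do this; the function name and the list of parameter names are assumptions made for the example and would have to match whatever your own site uses.

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

# Assumed list of parameters that never change the page content.
# Adjust it to the tracking parameters your own site actually uses.
TRACKING_PARAMS = {"source", "utm_source", "utm_medium", "utm_campaign"}


def strip_tracking_params(url: str) -> str:
    """Return the URL with the known tracking parameters removed."""
    parts = urlsplit(url)
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key.lower() not in TRACKING_PARAMS
    ]
    # Rebuild the URL with only the parameters that were kept.
    return urlunsplit(parts._replace(query=urlencode(kept)))


print(strip_tracking_params("http://www.website.com/keyword-a/?source=facebook"))
```

Both example URLs from this section reduce to http://www.website.com/keyword-a/ after this step, so they can be counted as one page.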
Content syndication and scraping
Duplicate content is mostly caused by your website or by you. It does happen, however, that other websites scrape content from your site without linking back to the original article. In these cases, search engines don’t know which version is the original and treat the copy as if it were simply another version of the article. The more popular a site is, the more scrapers use its content, which only makes the problem worse.
CMSs generally don’t use straightforward URLs, but URLs that look like /?id=4&cat=6, where id is the article number and cat the category number. The URL /?cat=6&id=4 will display the same result on most websites, but the two are not the same for search engines.
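One way to see that two such URLs are really the same page is to put their parameters into a fixed order before comparing them. The following is a minimal sketch of that idea, with a hypothetical function name; it assumes the order of parameters never matters on your site, which is true for most systems but not guaranteed.

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode


def sort_query_params(url: str) -> str:
    """Return the URL with its query parameters in alphabetical order."""
    parts = urlsplit(url)
    params = sorted(parse_qsl(parts.query, keep_blank_values=True))
    return urlunsplit(parts._replace(query=urlencode(params)))


first = sort_query_params("http://www.website.com/?id=4&cat=6")
second = sort_query_params("http://www.website.com/?cat=6&id=4")
print(first == second)  # both become http://www.website.com/?cat=6&id=4
```

The same ordering rule can be applied wherever the site generates its internal links, so that only one variant is ever published.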
Pagination of comments
In WordPress and other systems, it is possible to paginate comments. The result is that content is duplicated across the URL of the article and the URL of the article followed by /comment-page-x, and so on.
Pages designed to be printer friendly
If pages that are printer friendly are created and these are linked to from article pages, search engines will normally pick them up, unless they’re specifically blocked. Google then has to decide which version to show – the one that only shows the article or the one with peripheral content and ads.
With or without WWW
Although this one has been around for ages, search engines still sometimes get it wrong. If both the WWW and non-WWW versions of a website are accessible, this creates a duplicate content problem. A similar issue that occurs, although not as often, is http and https URLs serving the same text.
Canonical URLs – a potential solution
Although several URLs could point to the same piece of text, this problem is easy to solve. To do this, one person in the organization needs to determine without a shadow of a doubt what the ‘correct’ URL for a piece of content should be. Search engines know a piece of content’s ‘correct’ URL as the Canonical URL.
Finding duplicate content problems
If you’re not sure whether you have duplicate content issues on your website, there are a number of ways to find out. Keep track of any content changes on your website, because they can harm your on-page optimization.
Google Search Console
Pages with duplicate descriptions or titles are not good. Clicking on these in the tool will show the relevant URLs and will assist you in identifying the problem. If, for example, you have written an article on keyword A that is displayed in more than one category, the titles may differ. They could be ‘Keyword A – Category Y – Website’ and ‘Keyword A – Category Z – Website’. Google won’t see those as duplicate titles, but you can identify them if you search.
Search for snippets or titles
You could use some helpful search operators to assist you in these cases. If you need to identify all URLs on the site with the keyword A article, use the following string in Google:
site:website.com intitle:"Keyword A"
Google will display all pages within website.com that have Keyword A in the title. If you are very specific with intitle, it will be easy to identify duplicates. The same method can be used to find plagiarized content across the internet. If the article’s full title is ‘Keyword A is great’, you can search as follows:
intitle:"Keyword A is great"
For that query, Google will show all pages that match the title. It is also worthwhile to search for a few whole sentences from an article, as scrapers could make the title different. Google sometimes shows a notice below the results stating that some similar results have been left out. This shows that Google is ‘de-duping’ the results, but as this is still not good, click on the link and look at the full results to determine whether any of them can be fixed.
There is, however, a faster way to find out whether somebody is duplicating your content. You can use a duplicate content checker to get quick answers to the most pressing questions. Such tools check the content of your website’s pages and give you a score. Use one to find the internal and external sources that duplicate your website’s content. As search engines prefer text that is unique and valuable for users, it is important for SEO that you do not copy whole articles, or parts of them, from other web pages. A duplicate checker finds text that is repeated on other pages. In most cases, these tools work like plagiarism checkers and compare the content of your page with all the sites that match it on individual phrases and words. They do everything described above, only quicker.
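To give a feel for how phrase-level comparison can produce a score, here is a minimal sketch of one common technique: splitting both texts into overlapping three-word phrases (‘shingles’) and measuring how many phrases they share. This is not how any particular checker is implemented; the function names and the phrase length of three words are assumptions chosen for the example.

```python
import re


def shingles(text: str, size: int = 3) -> set:
    """Split text into a set of overlapping word sequences of the given size."""
    words = re.findall(r"\w+", text.lower())
    return {tuple(words[i:i + size]) for i in range(len(words) - size + 1)}


def similarity(text_a: str, text_b: str, size: int = 3) -> float:
    """Return the share of phrases the two texts have in common (0.0 to 1.0)."""
    set_a, set_b = shingles(text_a, size), shingles(text_b, size)
    if not set_a or not set_b:
        return 0.0  # a text shorter than one phrase cannot be compared
    return len(set_a & set_b) / len(set_a | set_b)


print(similarity("the quick brown fox jumps", "the quick brown fox sleeps"))
```

A score near 1.0 suggests that two pages carry nearly the same text, while a score near 0.0 suggests they are unrelated. Where to draw the line between ‘similar’ and ‘duplicate’ is a judgment call.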
Solving duplicate content problems
Once you know which URL should be used as the canonical URL for specific content, start canonicalizing your site. This entails letting search engines know which version of a page is the canonical one and allowing them to find it as fast as possible. There are a number of methods to solve the problem:
- Don’t create duplicate content.
- Redirect similar pages to a canonical URL.
- Add canonical links to all duplicate pages.
- Add HTML links from all duplicate pages to the canonical page.
Don’t create duplicate content
Various causes of duplicate content mentioned above can be fixed easily:
- Session IDs in URLs should be disabled in the system settings.
- Printer-friendly pages are unnecessary; print style sheets should be used instead.
- Comment pagination options should be disabled.
- Parameters should always be ordered in the same sequence.
- To avoid tracking link issues, use hash-based tracking (a # fragment) and not parameters.
- Either use WWW or not, but stick with one and redirect the other to it.
Even if a problem is not easy to fix, it may still be worth fixing. The ultimate aim should, however, be to prevent duplicate content completely.
Redirect similar pages to a canonical URL
It might be impossible to prevent your system from generating a wrong URL entirely, but you might still be able to redirect these. If you do manage to fix some duplicate content problems, ensure that the URLs for old duplicate content are redirected to the correct canonical URLs.
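The decision behind such a redirect can be sketched as a small function that returns the URL a request should be sent to with a 301, or nothing if the URL is already the preferred one. The preferred scheme and host below are assumptions chosen only to match the example URLs in this article, and the function name is hypothetical. In practice this logic usually lives in the web server or CMS configuration.

```python
from typing import Optional
from urllib.parse import urlsplit, urlunsplit

# Assumed preferences for this sketch: one scheme and one host name.
CANONICAL_SCHEME = "http"
CANONICAL_HOST = "www.website.com"
# All host names under which the same site is reachable (assumption).
KNOWN_HOSTS = {"website.com", "www.website.com"}


def redirect_target(url: str) -> Optional[str]:
    """Return the URL to 301-redirect to, or None if no redirect is needed."""
    parts = urlsplit(url)
    if parts.netloc.lower() not in KNOWN_HOSTS:
        return None  # not one of our host names, leave it alone
    canonical = urlunsplit(
        parts._replace(scheme=CANONICAL_SCHEME, netloc=CANONICAL_HOST)
    )
    return None if canonical == url else canonical


print(redirect_target("http://website.com/keyword-a/"))
```

Whichever scheme and host you choose, what matters is that every other variant answers with a permanent redirect to it.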
Add a canonical link to all duplicate pages
Sometimes it is impossible to delete duplicate versions of an article, even if they use the wrong URL. The canonical link element was introduced by search engines to solve this issue. The element is put in the <head> section of a page like this:
<link rel="canonical" href="http://website.com/correct_article/"/>
Put the canonical URL of the article in the href attribute. Search engines that support the canonical element will treat it like a soft 301 redirect, transferring most of the link value gathered by the page to the canonical page.
If possible, a normal 301 redirect is still better as it is faster.
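To verify that duplicate pages really carry the canonical link element, you can read it back out of their HTML. The sketch below uses only Python’s standard html.parser module; the class and function names are made up for this example, and fetching the pages is left out.

```python
from html.parser import HTMLParser
from typing import Optional


class CanonicalFinder(HTMLParser):
    """Remember the href of the first <link rel="canonical"> element seen."""

    def __init__(self) -> None:
        super().__init__()
        self.canonical: Optional[str] = None

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        rel_values = (attributes.get("rel") or "").lower().split()
        if tag == "link" and "canonical" in rel_values and self.canonical is None:
            self.canonical = attributes.get("href")


def find_canonical(html: str) -> Optional[str]:
    """Return the canonical URL declared in the HTML, or None if there is none."""
    finder = CanonicalFinder()
    finder.feed(html)
    return finder.canonical


page = '<html><head><link rel="canonical" href="http://website.com/correct_article/"/></head></html>'
print(find_canonical(page))
```

Running this over every known duplicate URL shows quickly which pages still lack the element or point to the wrong address.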
Add an HTML link from all duplicate pages to the canonical page
If none of the solutions mentioned above are feasible, you can add links to the original article below or above the duplicate article. You may also want to implement this in the RSS feed by inserting a link to your original article. Although some scrapers might filter the link out, others may leave it as is. If Google finds a number of links all pointing to the original article, it will assume that that is the canonical version.
Duplicate content can also cause serious problems on paginated pages. Depending on how your pagination is structured, it is highly likely that some pages contain similar or identical content. In addition, you will often find the same title and meta description tags repeated across those pages. In this case, duplicate content makes it difficult for search engines to determine the most relevant page for a particular search query.
You can remove pagination pages from the index using the “noindex” tag. In most cases, this method is the first choice because it can be implemented quickly. Its essence is to exclude all pagination pages from the index, except the first.
It is implemented by adding the following meta tag
<meta name="robots" content="noindex, follow" />
to the <head> section of all pages except the first. Thus, we exclude all pagination pages from the index, except for the main page of the catalog, while the follow value lets search engines keep following the links on those pages, which ensures the indexation of all pages that belong to the catalog.
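In a template, the rule ‘every pagination page except the first’ comes down to a single condition on the page number. A minimal sketch, assuming pages are numbered from 1 and using a hypothetical helper name:

```python
def robots_meta(page_number: int) -> str:
    """Return the robots meta tag for a pagination page, or '' for page 1."""
    if page_number < 1:
        raise ValueError("page numbers start at 1")
    if page_number == 1:
        return ""  # the first page stays indexable, so no tag is needed
    return '<meta name="robots" content="noindex, follow" />'


for number in (1, 2, 3):
    print(number, robots_meta(number))
```

The returned string would be placed in the <head> section by whatever template system the site uses.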