Duplicate Content

December 31, 2011

It’s widely known that having duplicate content is a bad thing, but we see it all the time. Sometimes it’s done intentionally. Other times, it happens without us even realizing it. How bad can it be?

The Truth

There are assumptions that repeating keywords, phrases, links, and content in general throughout your website will improve the ranking of a page or a website. This is much like keyword stuffing within a single page, but on a larger scale – and both are bad practice. Duplicate content, intentional or not, is considered gaming a search engine’s algorithm. Google’s Webmaster Guidelines state:

Don’t create multiple pages, subdomains, or domains with substantially duplicate content.

How to avoid Duplicate Content?

There are a number of reasons why duplicate content exists, the most common being a technical glitch somewhere. Technical duplicates will be addressed a little later. The other common reasons are an attempt to cite content from another site, or an attempt to increase the keyword density of a page or a website. The latter is a clear violation of Google’s golden rule: Create content for users, not search engines.

So then, how can search engines identify and distinguish when content is being cited versus a block of content that was copy-and-pasted? Well, the key is in Google’s wording:

Don’t create multiple pages, subdomains, or domains with substantially duplicate content.

So here’s how I would approach it: If you come across content that you would like to use on your own site, grab a snippet and cite the original source. Better still, paraphrase the snippet so that it’s completely original content and you wouldn’t have to worry about duplicating content.

Duplicate Content Threshold

So just how large a snippet can you grab without being penalized for duplicate content? Well, that all comes down to how an algorithm defines substantial. The duplicate-threshold check can be broken down into two groups: exact duplicates and similar duplicates. Exact duplicates are easy to identify – these are blocks of text that match, word for word, against another indexed document. Citing another body of work isn’t a bad thing, but there’s a threshold of duplication beyond which penalties are applied.

Similar duplicates are handled, well, similarly. Blocks of text containing words that resemble the words in another indexed document can be identified. How is this possible? It’s the crawler’s job to find documents and index them into the database. After a document has been indexed, it is parsed into chunks of data to be handled later. At some point, Document A’s words, in all of their variations, are compared against Document B’s words and all of their variations. The threshold for similar duplicates is higher than it is for exact duplicates, but flags, if not penalties, will result from this violation, too.
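The comparison described above can be sketched in a few lines of Python. This is not how any search engine actually does it (those details aren’t public); it is a minimal illustration that assumes overlapping three-word windows ("shingles") as the chunks and Jaccard similarity as the measure. The function names and the 0.3 cutoff are made up for the example, and the sketch skips the word-variation matching mentioned above.

```python
import re


def shingles(text, k=3):
    """Return the set of overlapping k-word windows found in text."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    if len(words) < k:
        # Too short to form a full window: treat the whole text as one chunk.
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a, b):
    """Jaccard similarity: size of the intersection over size of the union."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def classify(doc_a, doc_b, threshold=0.3):
    """Label a pair of documents as 'exact', 'similar', or 'distinct'.

    threshold is an arbitrary, illustrative cutoff, not a known value.
    """
    score = jaccard(shingles(doc_a), shingles(doc_b))
    if score == 1.0:
        return "exact", score
    if score >= threshold:
        return "similar", score
    return "distinct", score


if __name__ == "__main__":
    original = "the quick brown fox jumps over the lazy dog"
    reworded = "the quick brown fox leaps over the lazy dog"
    print(classify(original, original))
    print(classify(original, reworded))
```

Changing a single word in a nine-word sentence breaks three of its seven shingles, which is why even light paraphrasing lowers the score quickly while a straight copy-and-paste scores 1.0.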

What kind of penalties should I expect?

To thoroughly answer this, I would need to explain indexing, scoring, and how database retrieval works – subjects which all deserve their own focus. Suffice it to say, the penalties are heavy. They can range anywhere from a lower quality score, to demotion into a lower database tier, to being de-indexed or, worse yet, blacklisted.

Technical Duplicates and Solutions

As I mentioned earlier, sometimes duplicate content is not intentional. A common source of duplicates is dynamically generated content, or content which allows parameters to be inserted into the URL. Crawlers see each variation as a different page, and therefore as duplicates.

Let’s take a look at these URLs:

The content on these pages is the same, but arranged or displayed differently because of a user action – in this case, sorting widgets by price or popularity. Search engines will see these pages as separate documents, and they will be judged as such. They will, of course, be deemed duplicate content, and your site will be penalized for it. So what can you do about this?

Robots.txt – block URLs with parameters from being crawled: (Learn how to use the robots.txt file)
User-agent: *
Disallow: /*?sort=

Meta Tag – block crawlers from indexing specific pages
<meta name="robots" content="noindex">
<meta name="googlebot" content="noindex">

Canonicalization – Search engines may select one of your duplicate pages to index. You can specify which page to index by declaring a canonical URL:
<link rel="canonical" href="http://www.fakedoodads.com/widgets.php"/>

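.htaccess – globally redirect pages for crawlers and users. RedirectMatch only tests the URL path and never sees the query string, so matching ?sort= takes mod_rewrite (with RewriteEngine On already set):
RewriteCond %{QUERY_STRING} (^|&)sort=
RewriteRule ^widgets\.php$ http://www.fakedoodads.com/widgets.php? [R=301,L]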
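All four techniques aim at the same outcome: every sorted variation collapses into one address. If you want to check which of your URLs would fold together, here is a rough Python sketch. It assumes sort is the only parameter that changes presentation without changing content; the function names, the sample URLs, and the page parameter are invented for illustration.

```python
from collections import defaultdict
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Assumption: "sort" is the only parameter that merely rearranges the page.
PRESENTATION_PARAMS = {"sort"}


def canonical_url(url):
    """Drop presentation-only parameters and lowercase the host name."""
    parts = urlsplit(url)
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key not in PRESENTATION_PARAMS
    ]
    # The fragment is dropped as well, since crawlers ignore it.
    return urlunsplit(
        (parts.scheme, parts.netloc.lower(), parts.path, urlencode(kept), "")
    )


def group_duplicates(urls):
    """Map each canonical URL to the list of variants that reduce to it."""
    groups = defaultdict(list)
    for url in urls:
        groups[canonical_url(url)].append(url)
    return dict(groups)


if __name__ == "__main__":
    # Illustrative URLs only; substitute a list from your own crawl or logs.
    crawled = [
        "http://www.fakedoodads.com/widgets.php",
        "http://www.fakedoodads.com/widgets.php?sort=price",
        "http://www.fakedoodads.com/widgets.php?sort=popularity",
    ]
    for canonical, variants in group_duplicates(crawled).items():
        print(canonical, "<-", len(variants), "variant(s)")
```

Any group with more than one variant is a candidate for a canonical tag or a redirect.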

While you’re in the .htaccess file, you might as well address the www vs. non-www canonicalization issue, too.

Wander into Google’s Webmaster Tools, and select your preferred domain. And, of course, keep your internal linking consistent. If you link using www, stick with that. If you use all-lowercase or mixed-case URLs, stick with that, too.

Sometimes duplicate content is unavoidable, especially when others take your content and duplicate it elsewhere. The nature of websites sometimes requires content to be duplicated to an extent. Try my suggestion of taking snippets and/or paraphrasing them. Definitely go through the technical solutions. Just follow the golden rule of creating content for users, and you’ll be fine.

