More thoughts on duplicate content masking

Inspired by this post on Eli’s excellent blog:

The question that really has to be asked here is:
How do Google (and other engines) know that your content is duplicate content?

Obviously, this cannot be a one-to-one match, as any change (be it ever so small) would then mark your text as original.

I presume (and Eli’s success with his technique seems to confirm this) that the dupe checkers work along the lines of the duplicate finders used in academia, for example.

So how do those checks work?
Basically, a duplicate finder works in 3 steps:

  1. Take x snippets from your text (normally chunks of varying length, 3-7 words each)
  2. Search Google and other engines for those snippets
  3. Reanalyze sites / documents that return x positive matches.
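
The three steps above can be sketched roughly as follows. This is only my guess at the general idea, not how Google or any real checker actually works. The function names (`sample_snippets`, `count_matches`) are made up for this example, and the "hit the engines" step is simulated by searching a local string instead of the web.

```python
import random
import re


def words(text):
    """Lowercase the text and split it into plain word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


def sample_snippets(text, count, min_len=3, max_len=7, seed=0):
    """Step 1: take `count` random chunks of 3-7 consecutive words."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    tokens = words(text)
    if len(tokens) < min_len:
        return []  # text too short to sample from
    snippets = []
    for _ in range(count):
        length = rng.randint(min_len, min(max_len, len(tokens)))
        start = rng.randint(0, len(tokens) - length)
        snippets.append(" ".join(tokens[start:start + length]))
    return snippets


def count_matches(snippets, candidate_text):
    """Steps 2-3, simulated locally: count snippets found verbatim in a candidate."""
    # Pad with spaces so a snippet only matches on whole-word boundaries.
    haystack = " " + " ".join(words(candidate_text)) + " "
    return sum(1 for s in snippets if f" {s} " in haystack)


original = (
    "Teeth whitening strips work by applying a thin layer of peroxide gel "
    "to the enamel for several minutes each day."
)
snippets = sample_snippets(original, count=5)
print(count_matches(snippets, original), "of", len(snippets), "snippets matched")
```

A document that returns enough matches would then be handed to a closer comparison. Note that any snippet landing on a changed word no longer matches verbatim, which is why small scattered edits can bring the match count down.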

Q: So, what to do about this?
A: Take more text from several sources and add Eli’s technique on top.

Basically, what you want to do is this (in this example we take teeth whitening):

  • Take text from 2-3 sources on teeth whitening
    You might not want to take the most popular sites for this. Hit some article directories and maybe even some forgotten sites.
  • Mix the texts thoroughly.
    Take one sentence from here, another one from over there… leave a paragraph out.
    Add Eli’s technique.
    Change some words using Markov or manually.
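
The "one sentence from here, another from over there" part could look something like the sketch below. It is a toy under my own assumptions: `split_sentences` and `mix_sources` are hypothetical names, the sentence splitting is naive (it breaks after `.`, `!` or `?`), and the `skip_every` parameter is just one simple way to "leave something out". Eli’s technique and the Markov word changes are not covered here.

```python
import re
from itertools import zip_longest


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def mix_sources(sources, skip_every=0):
    """Take one sentence from each source in turn; optionally drop every n-th one."""
    columns = [split_sentences(text) for text in sources]
    mixed = []
    for row in zip_longest(*columns):
        # Shorter sources run out first; zip_longest pads them with None.
        mixed.extend(s for s in row if s is not None)
    if skip_every > 1:
        mixed = [s for i, s in enumerate(mixed, start=1) if i % skip_every != 0]
    return " ".join(mixed)


print(mix_sources(["A one. A two.", "B one. B two. B three."]))
# → A one. B one. A two. B two. B three.
```

The result still needs a manual read-through, since sentences pulled from different texts rarely flow into each other on their own.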