Autodiscovery and RSS Scraping

Jonathan Bailey, September 5, 2007

4 minutes read

Feed autodiscovery is one of the most powerful tools available for encouraging feed usage and subscription. Theoretically at least, by giving browsers and feed readers an easy way to identify the feed and users an intuitive way to subscribe to it, more people will take advantage of it.

However, when a reader of this site had a template issue and was forced to do away with feed autodiscovery on his site, he wondered if it might help him put a stop to the spammers who have been scraping his feed.

Though the idea was tempting, the sad truth is that autodiscovery does not play a major role one way or another in dealing with feed scraping. Though it helps browsers and users find the feed, spammers have other methods of feed detection that bypass not only the tags in your HTML, but your site altogether.

Discovering Autodiscovery

Feed autodiscovery is little more than a series of hidden tags that enable a browser to locate the feed automatically. They are usually embedded in the head of the document and are used in conjunction with buttons and subscription links.
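
As a rough illustration of what those hidden tags look like and how a client might read them, the Python sketch below scans a page's HTML for link elements marked rel="alternate" with an RSS or Atom type. The sample page, its URL and the function name are made up for this example and are not taken from any particular site or browser.

```python
from html.parser import HTMLParser

# MIME types conventionally used in feed autodiscovery tags.
FEED_TYPES = {"application/rss+xml", "application/atom+xml"}


class AutodiscoveryParser(HTMLParser):
    """Collects feed URLs advertised by <link rel="alternate"> tags."""

    def __init__(self):
        super().__init__()
        self.feeds = []

    def handle_starttag(self, tag, attrs):
        if tag != "link":
            return
        # Attribute values can be None (bare attributes), so normalize them.
        attr = {name: (value or "") for name, value in attrs}
        rels = attr.get("rel", "").lower().split()
        if "alternate" in rels and attr.get("type", "").lower() in FEED_TYPES:
            self.feeds.append(attr.get("href", ""))


def find_autodiscovery_feeds(html):
    """Return the feed URLs a page advertises through autodiscovery."""
    parser = AutodiscoveryParser()
    parser.feed(html)
    return parser.feeds


# Hypothetical page, used only for illustration.
SAMPLE_PAGE = """
<html><head>
<title>Example Blog</title>
<link rel="alternate" type="application/rss+xml"
      title="Example Blog RSS" href="https://example.com/feed/">
</head><body><p>Hello</p></body></html>
"""

print(find_autodiscovery_feeds(SAMPLE_PAGE))
```

Removing the link element from the head makes this function return an empty list, which is all that disabling autodiscovery amounts to. The feed itself stays exactly where it was.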

Though some sites do without autodiscovery, most sites, including this one, do take advantage of it. Autodiscovery also comes pre-configured in most blog templates and blogging applications, meaning many sites that never activated the feature may still have it switched on.

The advantage of autodiscovery is that it makes it easier for visitors to subscribe to the feed by letting them do it through their browser directly. However, much of that advantage is likely mitigated by the fact that most sites also use RSS buttons and most users subscribe via those methods. Furthermore, any user who is accustomed to autodiscovery will likely look for such a button immediately after finding it isn’t there.

So, if disabling autodiscovery would help with feed scraping, it would be an appealing solution. Unlike truncated feeds, which negatively impact end users, disabling autodiscovery could be a way to deal with scraping without harming your actual visitors.

Sadly though, that is not the case and, even though disabling autodiscovery may not do a great deal of harm, it won’t do much good either.

Spammer Autodiscovery

The problem with disabling autodiscovery is that spammers don’t use it any more than ordinary users do. In fact, many spammers never even see your original site when they scrape your feed.

Large spam sites and spam networks get their blog posts and RSS feeds the same way search engines such as Technorati and Google Blog Search get theirs: through pinging services.

They look for updated content, check to see if it has the desired keywords and scrape what is interesting to them. If they see a feed that looks particularly promising, some of the applications will take that feed and make sure to get future entries. However, most scraping on the larger networks is done on a post-by-post basis, often focusing in on just the keywords desired.

This works well for spammers as most blog applications are set up to ping the major services by default and few people switch that feature off. It gives them the easiest access to the largest amount of content possible.
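
To make the ping mechanism concrete, here is a minimal Python sketch of the kind of XML-RPC message a blog application might send to a ping service when a post is published. It assumes the widely used weblogUpdates.ping method, which takes a blog name and a blog URL. The blog name and URL are hypothetical, and the sketch only builds the message rather than sending it anywhere.

```python
import xmlrpc.client


def build_ping_request(blog_name, blog_url):
    """Build the XML-RPC request body for a weblogUpdates.ping call."""
    return xmlrpc.client.dumps(
        (blog_name, blog_url),
        methodname="weblogUpdates.ping",
    )


# Hypothetical blog details, used only for illustration.
payload = build_ping_request("Example Blog", "https://example.com/")
print(payload)
```

Nothing in this message comes from the head of your HTML. The ping hands over the site's address directly, so anyone consuming the ping stream learns about new content whether or not the autodiscovery tags are present.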

Smaller spam sites and networks often build up their scraped content by hand, trolling the Web for promising feeds and scraping them. Since they have to copy and paste the content from their browser to their application, it makes sense that they’d use the feed buttons and other links rather than the autodiscovery.

Even those who do use autodiscovery by default will, just like legitimate users, not likely be swayed from adding the feed so long as there is a clear RSS link somewhere on the page. With RSS feed icons so widely present and easily understood, it is unlikely that any blog reader, least of all a spammer, would be confused so long as one is present.

The only type of scraping operation that might be partially foiled would be any that relied on a traditional Internet spider to search the Web for RSS feeds. However, almost certainly, any such spider would be smart enough to follow links in the page itself and would discover the RSS feed when it ran across the icon. It would be a major oversight if such a spider could parse the Web looking for RSS feeds, but could not recognize ones in regular hyperlinks.
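
A small Python sketch shows how little work that would take. It finds feed candidates on a page with no autodiscovery tags by looking at ordinary hyperlinks. The keyword heuristic, the sample page and the function name are all assumptions for illustration. A real spider would presumably fetch each candidate and check that it really is a feed.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

# Hypothetical heuristic: substrings that commonly appear in feed URLs.
FEED_HINTS = ("rss", "atom", "feed")


class FeedLinkParser(HTMLParser):
    """Collects ordinary <a href> links that look like feed URLs."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.candidates = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href") or ""
        if any(hint in href.lower() for hint in FEED_HINTS):
            # Resolve relative links against the page's own address.
            self.candidates.append(urljoin(self.base_url, href))


def find_linked_feeds(html, base_url):
    """Return likely feed URLs found in a page's regular hyperlinks."""
    parser = FeedLinkParser(base_url)
    parser.feed(html)
    return parser.candidates


# Hypothetical page with no autodiscovery tags, only an RSS icon link.
PAGE_WITHOUT_AUTODISCOVERY = """
<html><head><title>Example Blog</title></head>
<body>
<a href="/about/">About</a>
<a href="/feed/"><img src="/images/rss-icon.png" alt="RSS"></a>
</body></html>
"""

print(find_linked_feeds(PAGE_WITHOUT_AUTODISCOVERY, "https://example.com/"))
```

Even this crude check picks up the feed behind the icon, which is why removing the tags from the head does little against a spider that already follows links.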

In short, it is unlikely that any spammers out there depend on autodiscovery to find new feeds, and any that do likely will not for long.

Real Protection

Since most spammers never visit your site before scraping your feed, the best protection for your feed still resides with the feed itself. I’ve talked previously about Copyfeed as well as other WordPress plugins that can help stop RSS scraping. I’ve also mentioned the role a service such as FeedBurner can play in such a matter.

Those tools remain the best bet for serious RSS protection.

However, the best weapon of all is simple vigilance and awareness. By being aware of the problem and on the lookout for it, you’ll be doing more to protect your content than any trick or plugin. Given the large number of people out there unaware of these problems or otherwise removed from them, the fact that you’re reading this column and pondering these issues seriously puts you well ahead of the pack.

In short, by doing something at all, you’ll be doing a lot more than the vast majority of bloggers and writers out there.

Conclusions

All in all, there are no easy answers to the issue of stopping feed scraping. That includes disabling autodiscovery on your site.

Though many sites will disable that feature for other reasons, preventing content theft should not be one of them. It is clear that new technology will have to be developed to deal with this issue and, until those tools come along, protecting our content is going to be a matter of personal vigilance and active enforcement.

In the meantime, I will continue to look for new ways to protect content, especially RSS feeds, and work with other bloggers to develop strategies to make the process easier and more effective.
