Spam in RSS, PubSub's Thoughts

+ Posted by Josh Hallett on 01.26.05 // 10:01 PM

On Sunday night I posted an item about 'How Spam is Moving into RSS.' In the post I talked about how I am starting to see RSS-based spam via feed services such as PubSub.

Knowing that PubSub is a client of Steve Rubel over at CooperKatz, I e-mailed Steve to find out what PubSub thought of the emerging issue. Before the next morning I had a reply from Bob Wyman, CTO of PubSub. I was shocked, to say the least. It looked like Cluetrain Thesis #83: We want you to take 50 million of us as seriously as you take one reporter from The Wall Street Journal.

Bob and I have traded e-mails over the past few days. PubSub has been working on the issue, but as with traditional spam there is no simple solution. Although parts of this post may read like an interview, it was not one. I am simply merging the comments that he and I exchanged over the past few days. Bob's comments are in italics:

Is PubSub aware of the issue, and what's their plan? (To be fair, the same issues can happen with Technorati's watchlists.)

This is a great question about a subject that we've done a great deal of work on but haven't discussed much publicly. We probably should talk about this more if only to see if we can get some good ideas from the community on how to address it.

A good idea. We've seen how a great solution such as Google's nofollow can quickly be adopted by the blogging community.

The problem of "spam" in RSS posts is a real one and one that is getting worse as time passes. To date, the issue has been very major for blog comments; however, the number of spam postings, other than comments, has been growing quite a bit. We are somewhat "protected" from the comment spam problem since most sites don't provide RSS or Atom feeds for their comments. However, we're hoping to be able to capture more comments in the future and we'll have to do more to address the comment spam issue.

Comment spam has been a well-documented issue. I, like many, have battled it on my blog. MT-Blacklist does help control the problem, but it's not a perfect solution.

The primary "protection" we have against spam posts today is our LinkRank feature. (see: http://pubsub.com/linkranks.php). One of the major motivations for implementing LinkRanks was to allow subscribers to use LinkRanks in limiting and filtering the results we return. Thus, we provide the ability to say that you only want posts from the top 1%, 5%, etc. of blogs based on their LinkRank. Since people typically don't link to spam posts, spam blogs typically have *very* low LinkRanks. Thus, even a LinkRank filter of 75% does a pretty good job of filtering out most spam. We use a default LinkRank filter on our "front page" subscription box precisely to filter out spam. On any of the "focus" subscription pages, like the one for Weblogs at: http://www.pubsub.com/weblogs.php , the user is able to define exactly how much of a filter they want.
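To make the percentile idea concrete, here is a minimal sketch of a top-N% filter. It is only an illustration: the `Post` class, the function names, and the use of raw inbound-link counts as the score are all assumptions, since the post does not describe how LinkRank is actually computed or applied.

```python
from dataclasses import dataclass


@dataclass
class Post:
    title: str
    blog: str


def top_percent_blogs(link_counts, percent):
    """Return the blogs that fall within the top `percent` of all blogs.

    `link_counts` maps a blog name to its inbound-link count, a hypothetical
    stand-in for whatever score LinkRank really uses.
    """
    ranked = sorted(link_counts, key=lambda blog: link_counts[blog], reverse=True)
    keep = max(1, round(len(ranked) * percent / 100))
    return set(ranked[:keep])


def filter_posts(posts, link_counts, percent):
    """Keep only posts whose blog is in the top `percent` by link count."""
    allowed = top_percent_blogs(link_counts, percent)
    return [post for post in posts if post.blog in allowed]


# Made-up sample data: the spam blog has no inbound links.
link_counts = {"big-blog": 900, "mid-blog": 40, "small-blog": 3, "spam-blog": 0}
posts = [
    Post("RSS spam", "big-blog"),
    Post("Buy pills", "spam-blog"),
    Post("New idea", "small-blog"),
]
kept = filter_posts(posts, link_counts, 75)
```

With this sample data a 75% filter drops only the unlinked spam blog, while a 25% filter would also drop the small blog with the new idea.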

My first reaction to the LinkRank option was, 'The real power of blogs is that the next big thing or interesting idea can come from a small, relatively unknown blog, so why would I want to limit my results?' So I do not currently use the filter.

Perhaps in the future the price of that choice will be having to deal with spam. It's very similar to the situation of corporations that refuse to install any spam filtering because it just might reject that 'one' e-mail from a potential or existing client. Bob agrees:

I have the same concern whenever I put a LinkRank filter on a subscription. The real beauty of what we do is allow content to surface no matter who the author is or how well known they might be. When PubSub is running without LinkRank filters, you find content based on what was said, not based on who said it or where it was said. Thus, I like to say that the basic PubSub system produces a more "democratic" form of content discovery than traditional systems which rely on things like building personal lists of sites to poll on a regular basis. The introduction of LinkRank tends to chew into that "democratic" nature of the system by limiting what gets seen to only those things that have been recognized by others to be useful and interesting. The effect is unfortunate, but often a necessary cost in order to avoid spam. The key to "effective" use of our system in this area is to be careful in deciding which subscriptions should have filters and which should not.

If you've got a subscription to keep up to date on the latest buzz in a particular area, then you probably want to use a LinkRank filter to eliminate the noise and get only the "good stuff". On the other hand, if you're looking for new ideas, inspirations, new viewpoints, etc. then you should probably run with raw, unfiltered subscriptions and deal with the noise...

Automated tools may not be the answer either. Like e-mail spam filters and blacklists, RSS spam detection tools are not 100% accurate.

There are other automated means that we could use to filter spam. For instance, we could do the same kind of Bayesian spam filtering that has become common in email systems. We've experimented with this in the past and have considered providing a subscription option that would allow one to say that they didn't want anything that had been flagged as "spam." However, it appears that these filters aren't quite as effective in catching blog spam as they are in catching email spam. We are also concerned about the reaction we might get from publishers who might object to our classifying their posts as spam... Nonetheless, we continue to experiment and hope to come up with a solution tailored to blog spam.
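For readers unfamiliar with the technique, here is a minimal sketch of Bayesian (naive Bayes) text filtering of the kind common in e-mail systems. It is not PubSub's experiment: the class name, the whitespace tokenizer, and the four training posts are made up for illustration, and a real filter would need far more training data.

```python
import math
from collections import Counter


class NaiveBayesSpamFilter:
    """Tiny multinomial naive Bayes classifier with add-one smoothing."""

    def __init__(self):
        self.word_counts = {"spam": Counter(), "ham": Counter()}
        self.doc_counts = {"spam": 0, "ham": 0}

    @staticmethod
    def tokenize(text):
        # Simplistic tokenizer (an assumption): lowercase, split on whitespace.
        return text.lower().split()

    def train(self, text, label):
        self.word_counts[label].update(self.tokenize(text))
        self.doc_counts[label] += 1

    def _log_score(self, words, label):
        counts = self.word_counts[label]
        total_words = sum(counts.values())
        vocab = len(set(self.word_counts["spam"]) | set(self.word_counts["ham"]))
        # Log prior: share of training posts carrying this label.
        score = math.log(self.doc_counts[label] / sum(self.doc_counts.values()))
        for word in words:
            # Add-one smoothing keeps unseen words from zeroing out the score.
            score += math.log((counts[word] + 1) / (total_words + vocab))
        return score

    def is_spam(self, text):
        words = self.tokenize(text)
        return self._log_score(words, "spam") > self._log_score(words, "ham")


# Made-up training posts, two per label.
spam_filter = NaiveBayesSpamFilter()
spam_filter.train("cheap pills buy now", "spam")
spam_filter.train("buy cheap watches now", "spam")
spam_filter.train("thoughts on rss feeds and aggregators", "ham")
spam_filter.train("new atom feed for blog comments", "ham")
```

Calling `spam_filter.is_spam("buy cheap pills")` flags the post, while `spam_filter.is_spam("rss feeds for blog comments")` does not. A spam blog that copies ordinary-looking text would score much like a legitimate one here, which hints at why such filters may do worse on blog spam than on e-mail.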

One of the ideas I had in my original post was to have some sort of human intervention, but with 8 million plus feeds currently monitored, that becomes quite a task. But segmentation might be the answer.

More organized human-oriented filtering systems would use the "feedgroup" support that we currently use to distinguish between different types of blogs (i.e. the option users have to filter out "personal journals."). The idea is that some organization or person would present us with a list of "approved" spam-free blogs. We would assign a "feedgroup" id to the list and then attach the feedgroup id to each post that came from a blog in the list. Then, users would be able to include the feedgroup id in their subscriptions. (i.e. "RSS AND FEEDGROUP:27" would give you posts containing the word "RSS" only if they appeared on a blog in the list of blogs assigned to feedgroup 27.)
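A rough sketch of how a subscription such as "RSS AND FEEDGROUP:27" might be evaluated follows. Everything beyond that one example query is an assumption: the `FEEDGROUPS` table, the example.com URL, the function names, and the simple AND-only query syntax are hypothetical and are not PubSub's actual implementation.

```python
import re

# Hypothetical data: feedgroup id -> set of blog URLs on that "approved" list.
FEEDGROUPS = {27: {"http://example.com/good-blog"}}


def feedgroups_for(blog_url):
    """Return the ids of every feedgroup whose list contains this blog."""
    return {gid for gid, blogs in FEEDGROUPS.items() if blog_url in blogs}


def matches(query, post_text, blog_url):
    """Check a post against a query made of terms joined by ' AND '."""
    words = set(re.findall(r"\w+", post_text.lower()))
    groups = feedgroups_for(blog_url)
    for term in query.split(" AND "):
        term = term.strip()
        if term.upper().startswith("FEEDGROUP:"):
            # Feedgroup term: the post's blog must be on that approved list.
            if int(term.split(":", 1)[1]) not in groups:
                return False
        elif term.lower() not in words:
            # Plain term: the word must appear in the post text.
            return False
    return True
```

For example, `matches("RSS AND FEEDGROUP:27", "Spam in RSS feeds", "http://example.com/good-blog")` is true, while the same post from a blog outside the list is rejected.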

Human intervention would also involve implementing some sort of abuse submission process, but that's not as simple as it sounds. First you would have to write a set of policy guidelines on what is defined as RSS spam and what is not. This of course would be a subject of friction with the offending bloggers: 'What do you mean my blog is spam? I blog about my eBay auctions, what's wrong with that?' Next, how do you staff it, and what would the volume be? Bob points out the obvious issues that will arise:

Often, the big problem with allowing people to submit abuse reports is dealing with the submitters -- not the alleged abusers... For instance, it is quite easy to create quite a bit of heat in the system when you start getting messages like: "I've told PubSub 16 times that those stupid anti-abortion folks' "spam" posts should be blocked... Why don't they cut them off?" or "I consider anything published by Scientologists to be spam but PubSub keeps publishing the stuff! Are they in league with the Scientologists?" You can, I'm sure, imagine other similar complaints...

Working against the system are the spammers and direct marketers trying to exploit the medium to sell their products. Last year I found this blog that discusses many of those issues.

I must admit that I do not consider kindly the many who seem to put the greatest amount of their effort into figuring out how to game the system rather than focusing on building good products and compelling value propositions. Trading ideas on how to effectively present your product or how to serve customers is one thing. Trading ideas on how to artificially increase your rank in one or another system seems to be a bit questionable.

So where do we go from here?

We are, of course, very interested in hearing any ideas that anyone might have about things that we could do to reduce the burden of spam on our users.

Bob and PubSub's open channel of communication on this subject is refreshing. Having clients that 'get' it makes things much easier for PR practitioners.

The discussion has begun.

What are your ideas? We know that PubSub is listening.

Visitor Comments

Thank you for publishing the full discussion! Technorati is working on a spam squashing summit to address this issue.

Hi,

I think pubsub, Technorati et al could be helped by spamblog-tagging I suggested at http://pascal.vanhecke.info/2005/06/01/hunting-spamblogs/

They definitely should include an easy way to indicate spamblogs in their own interfaces (like you can report spam in Yahoo or Gmail), but centralised, independent "blogspam reservoirs" such as http://del.icio.us/tag/spamblog or http://www.furl.net/furled.jsp?topic=spamblog could help as well...

Linking Blogs

Listed below are links to weblogs that reference Spam in RSS, PubSub's Thoughts:

» Search Spam That's Coming... (Time-based) from Genuine VC
We all know about search engine spam. Wikipedia defines it using the coined term “spamdexing” as, “the practice of deliberately and dishonestly modifying HTML pages to increase the chance of them being placed close to the beginning of search engi...
