RSS Abuse


As RSS and Atom feeds become more pervasive – and news readers and aggregators become more popular with the masses – there is a great risk of abuse. The blame for this abuse falls everywhere: webmasters who create the feeds, vendors who create software to read them and the end-users. The issue centers on sampling or polling times from these news readers – they are just too fast.

Without counting I would estimate that I read about 25 different feeds on a daily basis. That means that for each of those 25 feeds I have to request a document on their server every time that I want an update. This isn’t a whole lot different than if I had all of the respective sites bookmarked in my web browser and visited them every morning – while I’m eating my muffin. Say that conservatively, I only spend one minute at each site that I like to read. After 25 minutes I am finished. At this point however, it has been a while since I was at the first site that I visited – so I visit them again. And again. After a while I will notice two things: I haven’t done anything useful all day, and the last couple times I was at each website there was nothing new to read.

Obviously this isn’t behavior that one is likely to mimic – so most people tend to visit sites once or twice a day and all is well. However, news readers change this paradigm because they essentially automate the download of all of those sites that you would otherwise be reading. In fact, due to the simplicity and convenience, most people read many more feeds than they would otherwise. I will tell you that I certainly did not read 25 sites regularly before I started using an aggregator. Factor in millions of people each requesting all of their feeds every 30 minutes and you can start to see a problem.
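To put rough numbers on that, here is a back-of-the-envelope calculation. The 25 feeds and the 30-minute interval come from the figures above; the one million readers is just a round number I am assuming for illustration, not a measured audience.

```python
FEEDS_PER_READER = 25        # my own estimate from earlier
POLL_INTERVAL_MINUTES = 30   # the polling rate discussed above
READERS = 1_000_000          # assumed round number, for illustration only

# How many times each feed is requested per day at this interval.
polls_per_day = 24 * 60 // POLL_INTERVAL_MINUTES

# Requests generated by one reader, and by the whole assumed audience.
requests_per_reader = FEEDS_PER_READER * polls_per_day
total_requests = requests_per_reader * READERS

print(polls_per_day, requests_per_reader, total_requests)
```

Compare the per-reader figure with the 25 to 50 page loads a day that visiting each site once or twice by hand would produce.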

Even more than a year ago, Wired published “Will RSS Readers Clog The Net?”, which discusses this shift in content delivery from a more global perspective. The internet wasn’t designed for everyone to be using automated readers, and as such there is speculation that this might cause some trouble for us in the long run if we aren’t careful.

Another perspective is that of a site webmaster – RSS feeds are quickly becoming the most requested documents on a lot of sites. Bandwidth and server load are serious considerations when feeds start to become even the least bit popular. A recent hint from Mac OSX Hints provides a means to alter the frequency at which Safari – the recently RSS-aware browser for Macs – will update the feeds that it is tracking. The hint is accompanied by a plea from Rob Griffith, who maintains the site, saying that the default rate of every 30 minutes is the fastest that he would want people to use.

Slashdot.org is the same, specifying that their system is set to automatically ban IPs that are polling their feeds more than once every 30 minutes. Ian mentions that he was banned recently for doing just that. I don’t follow Slashdot so I’m not familiar with the feed contents or the site activity, but I would imagine that 30 minutes is more than enough. I personally have my aggregator, NetNewsWire Lite, set to refresh every four hours. Even at that rate it is still a distraction, and I might upgrade to the full version, which allows you to specify a refresh rate for each individual feed.
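For anyone writing their own reader, a per-feed refresh rate with a hard floor is not much code. The following is only a sketch of the idea: the function name, the shape of the feeds dictionary and the use of plain Unix timestamps are my own assumptions, and this is not how NetNewsWire or any other reader actually does it.

```python
MIN_INTERVAL_SECONDS = 30 * 60  # the 30-minute floor discussed above


def feeds_due(feeds, now):
    """Return the URLs whose refresh interval has elapsed.

    feeds maps a URL to an (interval_seconds, last_polled) pair, where
    last_polled is a Unix timestamp, or None for a feed never fetched.
    """
    due = []
    for url, (interval, last_polled) in feeds.items():
        # Never poll faster than the floor, whatever the user configured.
        interval = max(interval, MIN_INTERVAL_SECONDS)
        if last_polled is None or now - last_polled >= interval:
            due.append(url)
    return due
```

A reader would call this from its timer, fetch only the URLs returned, and record the new poll time for each one.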

Obviously a lower refresh rate is a start toward helping the sites whose feeds you read, but there are responsibilities for others as well. One of the simple things that both content creators and software vendors have to be aware of is the 304 response. Most of us are familiar with the 404 and 403 error codes returned from some web sites, which indicate “File Not Found” and “Forbidden” respectively - however there are a lot more of those codes which we don’t see. In fact, when everything is OK, the code is 200. Any page that redirects you to another page likely responded with a 302 or 301 code. The 304 code, which web browsers already use, indicates “Not Modified”. Used with news readers, it is the server’s way of saying that the feed has not been modified since the last time the reader checked, and so it isn’t necessary to download the content. In order for this to work, the reader needs to send the date and time of its last check in an If-Modified-Since header. However, there are other issues present as well.
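On the reader side, the exchange described above might look something like this. It is a minimal sketch using only Python’s standard library; the function names and the opener parameter are my own inventions for illustration. Note that urllib reports a 304 as an HTTPError, which is why the “not modified” case is handled in the except branch.

```python
import urllib.error
import urllib.request
from email.utils import formatdate


def build_conditional_request(url, last_checked=None):
    """Build a GET request that carries If-Modified-Since when possible.

    last_checked is the Unix timestamp of the previous successful fetch.
    """
    request = urllib.request.Request(url)
    if last_checked is not None:
        # HTTP dates are always expressed in GMT.
        request.add_header(
            "If-Modified-Since", formatdate(last_checked, usegmt=True)
        )
    return request


def fetch_if_modified(url, last_checked=None, opener=urllib.request.urlopen):
    """Return the feed body as bytes, or None on 304 Not Modified."""
    request = build_conditional_request(url, last_checked)
    try:
        with opener(request) as response:
            return response.read()
    except urllib.error.HTTPError as err:
        if err.code == 304:
            return None  # nothing new; keep using the cached copy
        raise
```

When the function returns None the reader simply keeps showing what it already has, and only a few hundred bytes of headers have crossed the wire instead of the whole feed.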

If a feed is being generated automatically from the contents of a database every time that it is accessed, then the server is always going to think that it is newer and deliver the whole document. From a webmaster’s point of view it is optimal to only have the feed updated when the content changes, or possibly even less often depending on the needs of the site. If you do have a dynamic RSS feed, then you might want to consider using a function like my setModifiedDate, which you may take and use if you want. The other side of the coin is that some news readers, which are quick scripts that someone has thrown together, don’t actually send the If-Modified-Since header with their request. Without that header sent by the reader, the server is going to return all of the content every time.
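For a dynamically generated feed, the server-side decision can be sketched roughly as follows. To be clear, this is not my setModifiedDate function; it is a separate illustration in Python, and the function name and parameters are made up for the example. It assumes you can cheaply ask the database when the newest item was saved.

```python
from datetime import datetime, timezone
from email.utils import format_datetime, parsedate_to_datetime


def conditional_feed_response(newest_item_time, if_modified_since=None):
    """Return (status, headers) for a feed last changed at newest_item_time.

    newest_item_time: timezone-aware datetime of the most recent change.
    if_modified_since: raw header value from the request, or None.
    """
    # HTTP dates have one-second resolution, so drop microseconds.
    last_modified = newest_item_time.astimezone(timezone.utc).replace(
        microsecond=0
    )
    headers = {"Last-Modified": format_datetime(last_modified, usegmt=True)}

    if if_modified_since:
        try:
            since = parsedate_to_datetime(if_modified_since)
        except (TypeError, ValueError):
            since = None  # unparseable header: send the full feed
        if since is not None and since.tzinfo is None:
            since = since.replace(tzinfo=timezone.utc)
        if since is not None and last_modified <= since:
            return 304, headers  # reader is up to date; send no body

    return 200, headers  # generate and send the whole feed
```

The point of the design is that the expensive part, building the feed from the database, only happens in the 200 case. A reader that never sends If-Modified-Since always lands there, which is exactly the problem described above.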

In closing – something to remember when you are setting up a news reader. If a site is only updated once or twice a day, do you really need to check it every 30 minutes? And if you only read your feeds when you have a chance, maybe you should consider using your news reader’s manual refresh functionality – if it has it.

Written by Colin Bate