Ted Leung on the air: Open Source, Java, Python, and ...
Scoble asked me to write this post, so here goes. I don't mean that RSS aggregators are the kind of killer app that sells a billion computers and creates new markets (there is that possibility, though). I mean the app that does so much that it consumes all available CPU, memory, network, and disk. Perhaps I really mean that they're the "killing my computer" app.
If I take out software development activities, the application that is pushing the limits of my hardware is my RSS aggregator. This is not in any way a slam on NetNewsWire, which is a very, very fine application. It's a reflection of the way that my relationship to the web has changed. I hardly use a standalone browser anymore -- mostly for searching or printing. I don't have time to go and visit all the web sites that have information that is useful to me. Fortunately, the aggregator takes care of that. Once the aggregator has the information, I want it to fold, spindle, and mutilate it. I'm at over 1000 feeds, and on an average day, it's not uncommon to have 4000 new items flow through the aggregator. It takes 25 minutes (spread out over two sessions) just to pull the data down and process it -- and I have a very fast connection. NetNewsWire uses WebKit to render HTML inline -- a feature that makes it easy to cut through piles of entries, but one which is demanding of CPU and memory.
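Just to make the scale concrete, here is a rough sketch of what pulling a subscription list that size could look like in code. It is purely illustrative -- it assumes the Python feedparser library and a hypothetical subscriptions.txt file, and it is not how NetNewsWire actually works:

    # Sketch: fetch a large subscription list concurrently instead of serially.
    # Assumes the feedparser library and a plain-text file of feed URLs.
    from concurrent.futures import ThreadPoolExecutor

    import feedparser  # pip install feedparser


    def fetch(url):
        """Download and parse one feed; return (url, list of entries)."""
        parsed = feedparser.parse(url)
        return url, parsed.entries


    def fetch_all(urls, workers=20):
        """Fetch many feeds in parallel so the time isn't spent one feed at a time."""
        with ThreadPoolExecutor(max_workers=workers) as pool:
            return dict(pool.map(fetch, urls))


    if __name__ == "__main__":
        with open("subscriptions.txt") as f:
            feeds = [line.strip() for line in f if line.strip()]
        results = fetch_all(feeds)
        print(sum(len(entries) for entries in results.values()), "items pulled")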
But that's just the basics. What happens when we start doing Bayesian stuff on 4000 items a day? Latent Semantic Indexing? Clustering? Reinforcement Learning? Oh, and I want to do all of those things on all the stuff that I ever pulled down, not just the new stuff. What happens if I want to build a "real-time" trend analyzer using RSS feed data as the input? The processor vendors should be licking their chops...
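For a sense of what the "Bayesian stuff" might look like, here is a minimal sketch of scoring incoming items with a naive Bayes classifier. It assumes scikit-learn and entirely made-up training items and labels; the real analysis would run over everything ever pulled down, not two example strings:

    # Sketch: classify new feed items as worth reading or not with naive Bayes.
    # The training items and labels below are hypothetical.
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.naive_bayes import MultinomialNB

    # Items the user has already judged (assumption: such judgements are kept).
    seen_items = [
        "New release of the Python bytecode compiler",
        "Celebrity gossip roundup for the week",
        "Latent semantic indexing explained with examples",
        "Ten weird tricks advertisers hate",
    ]
    labels = ["keep", "skip", "keep", "skip"]

    vectorizer = CountVectorizer()
    X = vectorizer.fit_transform(seen_items)
    model = MultinomialNB().fit(X, labels)

    # Score today's new items (two shown; in practice, thousands a day).
    new_items = ["Clustering RSS entries with k-means", "Shocking diet secrets"]
    print(model.predict(vectorizer.transform(new_items)))
    # Predictions are only as good as the (real) training data behind them.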
You need a more scalable aggregator.
I currently subscribe to 5386 feeds and the only way I could get an aggregator that could handle that amount of data was to write my own.
It's called Aggrevator and it's available at: [http://www.oshineye.com/software/aggrevator.html]. Now it's not as pretty or, to be frank, as user-friendly as something like NetNewsWire, but it will easily handle the number of feeds you're subscribed to. I currently have a database of 841,905 entries dating back to early 2004, when it finally became sufficiently usable that I could abandon RSS Bandit.
Scalability must be designed in rather than tacked on afterwards, and most people just aren't thinking about it. See this discussion on James Robertson's blog [http://www.cincomsmalltalk.com/blog/blogView?showComments=true&entry=3283281867] for an example.
Posted by ade at Wed Feb 23 04:32:04 2005
Posted by Chris at Wed Feb 23 05:08:10 2005
Posted by Lance Lavandowska at Wed Feb 23 09:57:31 2005
Posted by Stefan at Wed Feb 23 15:13:05 2005
Wow. I guess my post doesn't address the issue of what happens if you want to read that many blogs; I guess over time it can build up. I think a 3-tier solution is pretty good. I do like the Bloglines approach.
Posted by Berlin Brown at Wed Feb 23 16:07:42 2005
Posted by Nick Lothian at Wed Feb 23 16:31:08 2005
People don't sit in a library and read every new book that is published as it comes in. Instead we trust that the things we want will be there if and when we need them. We can do this because we know that a library has mechanisms that will let us find the relevant content and ignore the irrelevant.
Tools like Aggrevator are meant to stop the, frankly silly, scenario wherein people try to read everything that gets published by their peer group as it comes out. People only feel this manic rush because their aggregators throw away entries after a set period of time.
Tools that retain everything (and somehow solve the painful scalability issues associated with that) can start to offer people useful ways of analysing, manipulating and managing that data.
For instance, I can take advantage of the power-law distribution of scores amongst those 5000 feeds and only read the top 20, because they are clearly my favourites at the moment. I could imagine ever more sophisticated features like cluster analysis, vector space search or contextual network graphs (but not Bayesian classification or latent semantic analysis, because they just don't scale well enough for desktop usage) being added over time.
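As a back-of-the-envelope illustration of that top-20 idea (the scores and feed names below are made up, not Aggrevator's actual data or code):

    # Sketch: rank feeds by accumulated score and skim only the head of the curve.
    from heapq import nlargest

    scores = {
        "example.org/planet-python": 412,
        "example.org/daily-links": 97,
        "example.org/rarely-updated": 3,
        # ... thousands more feeds
    }

    top_feeds = nlargest(20, scores.items(), key=lambda kv: kv[1])
    for feed, score in top_feeds:
        print(f"{score:5d}  {feed}")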
Posted by ade at Wed Feb 23 17:35:01 2005
Yes, I need a more scalable aggregator, and I'll take a look at Aggrevator. It sounds like the strn or Gnus style Usenet newsreaders. But it looks like Aggrevator is another 3-paned aggregator, which doesn't fit my definition of scalable. Part of my workflow involves being able to skim large numbers of articles without messing with the keyboard/mouse at all. 3-paned aggregators fail at that.
Chris,
I don't use Bloglines because it can't work offline. Also, some of the ways that I'd like to slice and dice my feed data involve data that I don't want to give to Bloglines or any other service.
Stefan,
Getting the data is only a small part of the story -- it's analyzing it once you've got it that is going to get more and more expensive. What ade is doing with Aggrevator is just the beginning.
Nick,
Yes, that's the right direction, and yes, it may end up I/O bound.
Posted by Ted Leung at Wed Feb 23 18:55:28 2005
I think there are two kinds of scalability we are talking about.
1) Technical scalability. This is what Ade is dealing with. It is an issue (I destroyed a year's worth of blog-reading history when the stack overflowed in an HSQLDB-based blog reader), but it isn't a seriously hard issue. Stick it in a SQL database, keep records on the file system and index them with Lucene -- whatever (a rough sketch of this approach follows below). Just don't try keeping everything in one big XML file or something!
2) The Information Architecture. This is closer to what Ted is talking about: how should an aggregator work to present the best information possible and keep all the junk out of the way? It seems stupid that 40-50 feeds seems like "a lot" of feeds, and yet that is typically where most people max out. I don't think anyone has come close to solving that problem yet.
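As a sketch of the "stick it in a SQL database and index it" idea -- using SQLite's built-in full-text search rather than the SQL-plus-Lucene combination described above, and not what any aggregator mentioned here actually does:

    # Sketch: keep every entry forever and make it searchable with SQLite FTS5.
    # Requires an SQLite build with the FTS5 extension (standard in modern Python).
    import sqlite3

    conn = sqlite3.connect("entries.db")
    conn.execute(
        "CREATE VIRTUAL TABLE IF NOT EXISTS entries USING fts5(feed, title, body)"
    )

    # Store items as they arrive instead of expiring them after a few days.
    conn.execute(
        "INSERT INTO entries VALUES (?, ?, ?)",
        ("example.org/feed", "Scaling RSS aggregators", "Full text of the entry..."),
    )
    conn.commit()

    # Later: find the relevant content and ignore the irrelevant.
    for feed, title in conn.execute(
        "SELECT feed, title FROM entries WHERE entries MATCH ?", ("aggregators",)
    ):
        print(feed, "->", title)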
Posted by Nick Lothian at Wed Feb 23 20:26:37 2005
Apart from the huge load problem you have, desktop RSS aggregators are a place for spyware and adware.
Check out JAS RSS Portal.
Posted by Jay at Thu Feb 24 00:10:47 2005
(Personally, I've tried Findory a few times and haven't found that the articles it surfaces in response to my reading pattern are all that interesting to me. When I use Bloglines I find myself oversubscribing and watching the backlog pile up, which I'm not crazy about.)
Posted by Maarten at Fri Feb 25 12:26:05 2005