Ted Leung on the air: Open Source, Java, Python, and ...
Scoble asked me to write this post, so here goes. I don't mean that RSS aggregators are the kind of killer app that sells a billion computers and creates new markets (there is that possibility, though). I mean the app that does so much that it consumes all available CPU, memory, network, and disk. Perhaps I really mean that they're the "killing my computer" app.
If I take out software development activities, the application that is pushing the limits of my hardware is my RSS aggregator. This is not in any way a slam on NetNewsWire, which is a very, very fine application. It's a reflection of the way that my relationship to the web has changed. I hardly use a standalone browser anymore -- mostly for searching or printing. I don't have time to go and visit all the web sites that have information that is useful to me. Fortunately, the aggregator takes care of that. Once the aggregator has the information, I want it to fold, spindle, and mutilate it. I'm at over 1000 feeds, and on an average day, it's not uncommon to have 4000 new items flow through the aggregator. It takes 25 minutes (spread out over two sessions) just to pull the data down and process it -- and I have a very fast connection. NetNewsWire uses WebKit to render HTML inline -- a feature that makes it easy to cut through piles of entries, but one which is demanding of CPU and memory.
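Just to make the scale concrete, here is a rough sketch of what pulling a subscription list that size could look like in code. It is purely illustrative -- it assumes the Python feedparser library and a hypothetical subscriptions.txt file, and it is not how NetNewsWire actually works:

    # Sketch: fetch a large subscription list concurrently instead of serially.
    # Assumes the feedparser library and a plain-text file of feed URLs.
    from concurrent.futures import ThreadPoolExecutor

    import feedparser  # pip install feedparser


    def fetch(url):
        """Download and parse one feed; return (url, list of entries)."""
        parsed = feedparser.parse(url)
        return url, parsed.entries


    def fetch_all(urls, workers=20):
        """Fetch many feeds in parallel so the time isn't spent one feed at a time."""
        with ThreadPoolExecutor(max_workers=workers) as pool:
            return dict(pool.map(fetch, urls))


    if __name__ == "__main__":
        with open("subscriptions.txt") as f:
            feeds = [line.strip() for line in f if line.strip()]
        results = fetch_all(feeds)
        print(sum(len(entries) for entries in results.values()), "items pulled")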
But that's just the basics. What happens when we start doing Bayesian stuff on 4000 items a day? Latent Semantic Indexing? Clustering? Reinforcement Learning? Oh, and I want to do all of those things on all the stuff that I ever pulled down, not just the new stuff. What happens if I want to build a "real-time" trend analyzer using RSS feed data as the input? The processor vendors should be licking their chops...
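For a sense of what the "Bayesian stuff" might look like, here is a minimal sketch of scoring incoming items with a naive Bayes classifier. It assumes scikit-learn and entirely made-up training items and labels; the real analysis would run over everything ever pulled down, not two example strings:

    # Sketch: classify new feed items as worth reading or not with naive Bayes.
    # The training items and labels below are hypothetical.
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.naive_bayes import MultinomialNB

    # Items the user has already judged (assumption: such judgements are kept).
    seen_items = [
        "New release of the Python bytecode compiler",
        "Celebrity gossip roundup for the week",
        "Latent semantic indexing explained with examples",
        "Ten weird tricks advertisers hate",
    ]
    labels = ["keep", "skip", "keep", "skip"]

    vectorizer = CountVectorizer()
    X = vectorizer.fit_transform(seen_items)
    model = MultinomialNB().fit(X, labels)

    # Score today's new items (two shown; in practice, thousands a day).
    new_items = ["Clustering RSS entries with k-means", "Shocking diet secrets"]
    print(model.predict(vectorizer.transform(new_items)))
    # Predictions are only as good as the (real) training data behind them.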
You need a more scalable aggregator.
I currently subscribe to 5386 feeds and the only way I could get an aggregator that could handle that amount of data was to write my own.
It's called Aggrevator and it's available at: [http://www.oshineye.com/software/aggrevator.html]. Now it's not as pretty or, to be frank, as user-friendly as something like NetNewsWire, but it will easily handle the number of feeds you're subscribed to. I currently have a database of 841,905 entries dating back to early 2004, when it finally became sufficiently usable that I could abandon RSS Bandit.
Scalability must be designed in rather than tacked on afterwards, and most people just aren't thinking about it. See this discussion on James Robertson's blog [http://www.cincomsmalltalk.com/blog/blogView?showComments=true&entry=3283281867] for an example.
Posted by ade at Wed Feb 23 04:32:04 2005
Posted by Chris at Wed Feb 23 05:08:10 2005
Posted by Lance Lavandowska at Wed Feb 23 09:57:31 2005
Posted by Stefan at Wed Feb 23 15:13:05 2005
Wow. I guess my post doesn't address the issue of what happens if you want to read that many blogs; I guess over time it can build up. I think a 3-tier solution is pretty good. I do like the Bloglines approach.
Posted by Berlin Brown at Wed Feb 23 16:07:42 2005
Posted by Nick Lothian at Wed Feb 23 16:31:08 2005
People don't sit in a library and read every new book that is published as it comes in. Instead we trust that the things we want will be there if and when we need them. We can do this because we know that a library has mechanisms that will let us find the relevant content and ignore the irrelevant.
Tools like Aggrevator are meant to stop the, frankly silly, scenario wherein people try to read everything that gets published by their peer group as it comes out. People only feel this manic rush because their aggregators throw away entries after a set period of time.
Tools that retain everything (and somehow solve the painful scalability issues associated with that) can start to offer people useful ways of analysing, manipulating and managing that data.
For instance, I can take advantage of the power-law distribution of scores amongst those 5000 feeds and only read the top 20, because they are clearly my favourites at the moment. I could imagine ever more sophisticated features like cluster analysis, vector space search or contextual network graphs (but not Bayesian classification or latent semantic analysis, because they just don't scale well enough for desktop usage) being added over time.
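As a back-of-the-envelope illustration of that top-20 idea (the scores and feed names below are made up, not Aggrevator's actual data or code):

    # Sketch: rank feeds by accumulated score and skim only the head of the curve.
    from heapq import nlargest

    scores = {
        "example.org/planet-python": 412,
        "example.org/daily-links": 97,
        "example.org/rarely-updated": 3,
        # ... thousands more feeds
    }

    top_feeds = nlargest(20, scores.items(), key=lambda kv: kv[1])
    for feed, score in top_feeds:
        print(f"{score:5d}  {feed}")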
Posted by ade at Wed Feb 23 17:35:01 2005
Yes, I need a more scalable aggregator, and I'll take a look at Aggrevator. It sounds like the strn or Gnus style Usenet newsreaders. But it looks like Aggrevator is another 3-paned aggregator, which doesn't fit my definition of scalable. Part of my workflow involves being able to skim large numbers of articles without messing with the keyboard/mouse at all. 3-paned aggregators fail at that.
Chris,
I don't use Bloglines because it can't work offline. Also, some of the ways that I'd like to slice and dice my feed data involve data that I don't want to give to Bloglines or any other service.
Stefan,
Getting the data is only a small part of the story -- it's analyzing it once you've got it that is going to get more and more expensive. What ade is doing with Aggrevator is just the beginning.
Nick,
Yes, that's the right direction, and yes, it may end up I/O bound.
Posted by Ted Leung at Wed Feb 23 18:55:28 2005
I think there are two kinds of scalability we are talking about.
1) Technical scalability. This is what Ade is dealing with. It is an issue (I destroyed a year's worth of blog-reading history when the stack overflowed in an HSQLDB-based blog reader), but it isn't a seriously hard issue. Stick it in a SQL database, keep records on the file system and index them with Lucene -- whatever (a rough sketch of this approach follows below). Just don't try keeping everything in one big XML file or something!
2) The Information Architecture. This is closer to what Ted is talking about: how should an aggregator work to present the best information possible and keep all the junk out of the way? It seems stupid that 40-50 feeds seems like "a lot" of feeds, and yet that is typically where most people max out. I don't think anyone has come close to solving that problem yet.
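As a sketch of the "stick it in a SQL database and index it" idea -- using SQLite's built-in full-text search rather than the SQL-plus-Lucene combination described above, and not what any aggregator mentioned here actually does:

    # Sketch: keep every entry forever and make it searchable with SQLite FTS5.
    # Requires an SQLite build with the FTS5 extension (standard in modern Python).
    import sqlite3

    conn = sqlite3.connect("entries.db")
    conn.execute(
        "CREATE VIRTUAL TABLE IF NOT EXISTS entries USING fts5(feed, title, body)"
    )

    # Store items as they arrive instead of expiring them after a few days.
    conn.execute(
        "INSERT INTO entries VALUES (?, ?, ?)",
        ("example.org/feed", "Scaling RSS aggregators", "Full text of the entry..."),
    )
    conn.commit()

    # Later: find the relevant content and ignore the irrelevant.
    for feed, title in conn.execute(
        "SELECT feed, title FROM entries WHERE entries MATCH ?", ("aggregators",)
    ):
        print(feed, "->", title)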
Posted by Nick Lothian at Wed Feb 23 20:26:37 2005
Apart from the huge load problem you have, desktop RSS aggregators are a place for spyware and adware.
Check out JAS RSS Portal.
Posted by Jay at Thu Feb 24 00:10:47 2005
(Personally, I've tried Findory a few times and haven't found that the articles it surfaces in response to my reading pattern are all that interesting to me. When I use Bloglines I find myself oversubscribing and watching the backlog pile up, which I'm not crazy about.)
Posted by Maarten at Fri Feb 25 12:26:05 2005