as promised, something tech related.

in a nutshell, the above is part of the visualization of my RSS clustering work. RSS clustering is a technique i've been using for over a year now (i didn't invent it, but i did implement it kind of naively) as a way of doing two main things: cutting down the redundancy across a big pile of feeds, and ranking what's left by how popular it is. to accomplish this, you gather a pile of related feeds and break it into terms, find the most highly linked terms, and group around them. i'm doing this naively, as i said above, but it's been a gateway to more mature CS and methods than i would have expected.
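
to make the "break it into terms, find the most highly linked terms, and group around them" idea concrete, here's a minimal sketch in python. it is not my actual code: the names `terms` and `cluster`, the tiny stopword list, and the `min_hits` cutoff are all made up for illustration, and "highly linked" is approximated as "shows up in several headlines".

```python
import re
from collections import defaultdict

# hypothetical stopword list; a real one would be much longer
STOPWORDS = {"the", "a", "an", "of", "in", "on", "to", "for",
             "and", "as", "at", "is", "by", "with"}

def terms(text):
    """lowercase the text and return its set of non-stopword terms."""
    return {w for w in re.findall(r"[a-z]+", text.lower())
            if len(w) > 2 and w not in STOPWORDS}

def cluster(headlines, min_hits=2):
    """group headlines around terms that appear in at least min_hits of them.

    returns a list of (term, [headlines]) pairs, most shared term first.
    """
    by_term = defaultdict(list)
    for h in headlines:
        for t in terms(h):
            by_term[t].append(h)
    # keep only terms shared by several items
    shared = {t: hs for t, hs in by_term.items() if len(hs) >= min_hits}
    # sort by group size (descending), then alphabetically for stable output
    return sorted(shared.items(), key=lambda kv: (-len(kv[1]), kv[0]))

# invented sample headlines, just to exercise the functions
sample = [
    "Hurricane nears Florida coast",
    "Florida braces for hurricane",
    "Election polls tighten",
    "Hurricane warnings issued",
]
for term, group in cluster(sample):
    print(term, len(group))
```

a headline can land in more than one group here, which is fine for a first pass: the point is only to see which terms pull the most items together.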

the RSS reader model simply cannot scale to hundreds of feeds (which i find myself using every day). you can't parse that much information, even if the fetching has been automated, without suffering from overload and the eventual numbness. at that point, you're essentially back to square one (minus the effort of finding sites to visit and finding the new items). i have much more to do with my days than stare at my newsreader.

what clustering does is reduce the redundancy inherent in the data stream. in the case of world news from dozens of outlets, they'll often be talking about the same topics: conflicts, politics, science, events, and the like. and they'll often be using the same terms to do this. so, if you can find the overlap and reduce the visibility of the repeats, you've streamlined the process some.
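
one way to picture "finding the overlap" is to measure how many terms two items share and fold near-duplicates into one group. the sketch below uses Jaccard similarity (shared terms divided by all terms) with a greedy pass. that's a stand-in i picked for illustration, not a description of my implementation, and the `collapse` name and the 0.3 threshold are assumptions.

```python
def jaccard(a, b):
    """overlap between two term sets: |a & b| / |a | b|."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def collapse(items, threshold=0.3):
    """greedily fold items into groups of overlapping stories.

    items is a list of (title, term_set) pairs. each item joins the first
    group whose lead item it overlaps with enough, else starts a new group.
    """
    groups = []
    for title, item_terms in items:
        for group in groups:
            lead_terms = group[0][1]
            if jaccard(item_terms, lead_terms) >= threshold:
                group.append((title, item_terms))
                break
        else:
            # no existing group was close enough
            groups.append([(title, item_terms)])
    return groups

# invented sample data
stories = [
    ("outlet one", {"hurricane", "florida", "coast"}),
    ("outlet two", {"hurricane", "florida", "braces"}),
    ("outlet three", {"election", "polls"}),
]
for group in collapse(stories):
    # show one representative per group plus how many items it hides
    print(group[0][0], "+", len(group) - 1, "more")
```

showing one lead item per group and a count of the rest is the "reduce the visibility" part: the repeats are still there if you want them, they just stop crowding the page.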

now take this a step further. you know how many hits you have for any term or topic, so you can rank them by popularity or by how linked any of these topics are within your data set. that means you can order the presentation of the data using that information and make your surfing more efficient. have a look at what the hot topic of the hour is, using the inherent operations of the world news organizations to act as a collaborative filter.
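
the two rankings mentioned above (popularity versus how linked a term is) can be sketched like this. i'm assuming "hits" means the number of items a term appears in and "links" means the number of distinct other terms it shows up alongside. both definitions, and the `rank_terms` name, are illustrative guesses rather than the exact measures i use.

```python
from collections import Counter, defaultdict
from itertools import combinations

def rank_terms(items):
    """rank terms across a list of term sets (one set per feed item).

    returns (term, hits, links) tuples, sorted by hits, then links,
    then alphabetically so the order is stable.
    """
    hits = Counter()
    neighbors = defaultdict(set)
    for item_terms in items:
        hits.update(item_terms)  # each term counts once per item
        for a, b in combinations(sorted(item_terms), 2):
            neighbors[a].add(b)
            neighbors[b].add(a)
    rows = [(t, hits[t], len(neighbors[t])) for t in hits]
    return sorted(rows, key=lambda r: (-r[1], -r[2], r[0]))

# invented sample data
items = [
    {"hurricane", "florida"},
    {"hurricane", "warnings"},
    {"election", "polls"},
]
for term, n_hits, n_links in rank_terms(items):
    print(term, n_hits, n_links)
```

sorting on hits first gives the "hot topic of the hour" view. swapping the sort key to put links first would favor terms that tie many stories together instead.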

like i said, i've been doing this for over a year now with world news. it's effective, scales well, and provides more information beyond the headlines and blurbs syndicated. and now i'm sharing some of the gory details.

Last modified: Friday, Sep 03, 2004 @ 07:36am
