visualizing RSS clustering

abstract

the rapid growth of RSS feeds, both in production and in reading, has facilitated the kind of information consumption that has been part of the promise of the Internet for several years. one consequence of this is that many RSS users struggle to manage their feeds because of a never-ending stream of new stories. for subjects like world news, many of these stories are redundant, adding a burden on the reader to sort out which stories they have already read. to deal with these twin problems of flooding and redundancy, a mechanism that exploits both attributes has been developed. this system reduces the number of items to read and uses the overlapping information to divine interesting topics. a prototype system has been developed to gather, distill, and present world news and has been used by the author for over one and a half years with success.

introduction

the growth of the quantity of content on the internet is due in no small part to the ability of any user of the medium to publish to a wide audience with little effort and, in most cases, a nominal fee. this revolution in the ability to communicate to a broad audience is as sweeping as the development of the gutenberg press, although it is significantly more accessible to the average world citizen. because of this, the amount of content a savvy internet user will want to view as a steady stream can grow substantially. automated tools have appeared to facilitate this process, helping to ease the burden of keeping current.

RSS aggregators are typically used as a means of efficiently gathering and browsing large amounts of information. this works well for sites that regularly update or change their content. however, the scalability of this tool is rapidly challenged when a person attempts to monitor more than a few dozen sites using the approach offered by first generation RSS aggregators.

a typical RSS aggregator has a layout similar to a usenet reader or a mail user agent. in this model, a 3-paned layout is used, with navigation typically going from the list of feeds (panel 1) to the list of headlines for the current feed (panel 2) to the extracted content for a particular title (panel 3). new articles are typically indicated by highlighting the feed and headline. this is illustrated in the figure below.


RSS reader layout

layout of the UI of a typical RSS aggregator application. the feeds list is often on the lefthand side, but sometimes on the top. the stories within any single feed are usually listed in a window above the individual entry being viewed.


the main goal of the use of an RSS reader is to automatically gather content from dynamic sites and highlight new material. while this works well for a small number of RSS feeds, this usage model quickly breaks down when the number of feeds grows to several dozen or even hundreds. under these circumstances, the stream of new content is transformed into a flood of material, and the aggregator tool has only automated the gathering of more material to read. because an RSS reader gathers material that the user requests, it is reasonable to assume that they will want to examine all of it. however, a typical human reaction under the circumstances of information overload is to begin triaging material by discarding it or by skimming it more rapidly. in both cases information is lost. furthermore, the information is usually presented without any semantic tags to indicate which material is of higher value than the rest. the user is left to make that determination on their own. finally, for feeds of a similar nature (ie news feeds of global events), a significant amount of redundancy will be observed, aggravating the problem.


highlighted stories in an RSS reader

how the new stories discovered by an RSS aggregator are indicated. stories that have not yet been read are indicated in red, stories that have been read are indicated in black. individual feeds are indicated as blue boxes.


as a means of improving the scalability of the RSS aggregation approach, i have begun doing second order analysis on the aggregated materials to make use of the redundancy in the information. i dub this technique "RSS clustering" because i group stories by topic. the redundancy observed in any collection of RSS feeds can be used for two main purposes. the first is to highlight the interesting bits of news within a pool of feeds, based on the assumption that the appearance of a topic in multiple entries is proportional to the importance of that topic. the second is that entries can be clustered around these topics, reducing the volume of information presented to the user at any one time.


clustering stories together

stories that are related by a common topic are grouped together to indicate their relation and streamline the RSS reading process. these stories are pulled from individual feeds.


this technique is not novel, and has been demonstrated by sites such as google news, topix, daypop, and to a larger extent popdex and blogdex. all of these sites aggregate dynamic content and use a set of popularity heuristics to determine topics and content found interesting by the community at large. in the case of news sites like topix, daypop, and google news, the community is the news publication community. in the case of blogdex and popdex it is the blog publication community that collectively indicates what is interesting. this acts as a collaborative filter which can be leveraged by aggregating enough information.

additionally, topic maps take a similar approach and are much more mature. when visualized, they show occurrences organized by topic. the approach described here goes one step further by actually clustering the similar occurrences.

methods

  1. collect the data by downloading and parsing the RSS feeds in question. this can be a collection of feeds on any topic you want; the technique works best if you either have a focused topic and feeds surrounding it or, if you want a broader set of topics, scale up the quantity of feeds appropriately. (a short code sketch of steps 1 through 5 appears after this list.)
  2. normalize the data into a format that you can easily work with. this may mean inserting the entries into a database as rows or storing them in another format. storing them as XML can be done, but you'll have to use additional XML processing tools to gather the data, which incurs a lot of processing overhead and complexity. i find it most useful to rip the data from the feeds into a small database and then re-extract it for analysis. this allows me to select a fixed number of stories for analysis and also to store the data over a long period of time for more long-term analysis.
  3. normalize the case, remove punctuation, and strip stop words from the data to get at the bare and interesting terms. stop words are terms that appear in almost any entry and don't carry any meaning. these can be words like "i", "you", or "and", or even terms like "world" or "national". punctuation removal is pretty obvious: basically you want to get at the bare words. in this step a paragraph like the following:
    	A man named Fred was stopped outside of your home today. He was 
    	walking a dog named Chief past a fire hydrant on Maple street.
    
    would end up looking like this:
    	man named fred stopped outside home today walking dog named chief 
    	past fire hydrant maple street
    
    additional criteria are always possible to weed out the uninteresting bits.
  4. stem the words to find their root words. this accomplishes a simple task: it reduces false differences and increases legitimate overlap. words like "fires", "fired" and "fire" then all look the same once their endings have been removed. the sentence from the previous step would now look like this:
            man name fred stop outside home today walk dog name chief
            past fire hydrant maple street
    
  5. discover the interesting terms by correlating the data by any number of means. a static, simple approach is to count up the terms left from the previous steps and rank on those. more advanced approaches include NLP techniques to discover the subjects of the sentences. time-based approaches can look at a window of time and count appearances or, following the daypop approach, look at the relative trajectories of terms over time; then the interesting topics aren't the most popular ones but instead the fastest rising. another approach may be to look at topics that appear together, such as a subject and a location. a simple frequency count of the above sentences, which have been pruned and stemmed, would look like this:
    2 name
    1 walk
    1 today
    1 street
    1 stop
    1 past
    1 outside
    1 maple
    1 man
    1 hydrant
    1 home
    1 fred
    1 fire
    1 dog
    1 chief
    
  6. cluster news stories by finding which stories are related to the terms of interest. some people will allow a story to be reused multiple times for any terms that it contains, but i prefer to use a story only once. this reduces the size of the output, with the risk that the term used to pull in the story in the first place was not its best subject.
  7. display the results in any number of formats. the graphic below shows the results of clustering in a directed graph approach, or you can show it in a text format to read directly. the display should give some indication of which stories are grouped together and facilitate navigation of the output. (a sketch of steps 6 and 7 also appears below.)
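
to make the pipeline above concrete, the following is a minimal sketch of steps 1 through 5 in python. it is not the original implementation: the feed URLs and stop word list are placeholders, it assumes RSS 2.0 feeds, and it leans on NLTK's Porter stemmer rather than the original stemming code.

    import re
    import string
    import urllib.request
    import xml.etree.ElementTree as ET
    from collections import Counter

    from nltk.stem.porter import PorterStemmer  # stand-in for any Porter implementation

    # hypothetical feed URLs -- substitute your own list
    FEEDS = [
        "https://example.com/world.rss",
        "https://example.org/news.rss",
    ]

    # a tiny illustrative stop word list; a real one would be much longer
    STOP_WORDS = {"i", "you", "and", "the", "a", "of", "to", "in", "on", "was",
                  "he", "your", "world", "national"}

    def fetch_items(url):
        """step 1: download one feed and pull out (title, description) pairs."""
        with urllib.request.urlopen(url) as resp:
            tree = ET.fromstring(resp.read())
        for item in tree.iter("item"):          # RSS 2.0 <item> elements
            yield (item.findtext("title", default=""),
                   item.findtext("description", default=""))

    stemmer = PorterStemmer()

    def normalize(text):
        """steps 3 and 4: lowercase, strip punctuation and stop words, then stem."""
        text = text.lower().translate(str.maketrans("", "", string.punctuation))
        words = [w for w in re.split(r"\s+", text) if w and w not in STOP_WORDS]
        return [stemmer.stem(w) for w in words]

    # step 2: normalize into simple records (the real system uses a small database)
    stories = []
    for url in FEEDS:
        for title, desc in fetch_items(url):
            stories.append({"title": title, "desc": desc, "terms": normalize(title)})

    # step 5: a simple static frequency count of the stemmed headline terms
    counts = Counter(t for s in stories for t in s["terms"])
    for term, n in counts.most_common(15):
        print(n, term)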

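continuing that sketch, steps 6 and 7 might look roughly like this: each story is claimed by at most one topic (matching the preference in step 6), and the groupings are written out as a DOT file that Graphviz's neato can lay out. again, this is an illustrative sketch rather than the original code.

    # steps 6 and 7 (sketch): cluster stories around the top terms and emit DOT
    TOP_N = 20
    top_terms = [term for term, n in counts.most_common(TOP_N) if n > 1]

    clusters = {term: [] for term in top_terms}
    used = set()
    for term in top_terms:                    # more popular topics claim stories first
        for i, story in enumerate(stories):
            if i not in used and term in story["terms"]:
                clusters[term].append(story)
                used.add(i)

    with open("clusters.dot", "w") as out:
        out.write("graph rss_clusters {\n  overlap=false;\n")
        for term, group in clusters.items():
            if len(group) < 2:                # only topics with real overlap form a cluster
                continue
            # put the story with the longest description at the center of the cluster
            group.sort(key=lambda s: len(s["desc"]), reverse=True)
            center = group[0]["title"].replace('"', "'")
            for story in group[1:]:
                title = story["title"].replace('"', "'")
                out.write('  "%s" -- "%s";\n' % (center, title))
        out.write("}\n")
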
results

these steps have been implemented in a small toolkit written in the Python language. RSS feeds are grabbed directly from within the tool using the python URL library (urllib). RSS parsing is done using the python XML toolkit. stemming is done using the Porter stemming algorithm. the output is fed to the "neato" program from Graphviz. a list of over 60 individual RSS feeds from several dozen individual world news sites was processed using this technique. 1154 unique stories were gathered, of which 458 were correlated to other stories within this data set. the clustering that resulted from this process is shown in the figure below. this process, described in steps 1 through 6 above, took approximately 2 minutes on a pentium 4 1.4 GHz processor running Python 2.3, which included fetching all URLs listed in the list of feeds. processing using neato took approximately 3 minutes.
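
for reference, a DOT file like the one produced by the sketch above can be laid out with a command along these lines (the file names are illustrative):

    neato -Tsvg clusters.dot -o clusters.svg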


example output of RSS clustering done using the steps described above with input data from 66 sources (NYT, BBC, AP, UPI, etc). this data was gathered on 31 august 2004. the clustering output was processed using the "neato" tool from the Graphviz toolkit. the headline at the center of each cluster is the story with the most detail in its description, not necessarily the one closest to all of the other stories.

at this point there are no metrics to evaluate the quality of the groupings. this makes it difficult to understand the impact of any intended improvements on the approach.

a more practical use of the mechanism is shown in the next two figures. these methods have been used to develop a personal news site using several dozen news feeds from around the world. because of the structure of a typical news feed, subjects are clearly indicated. a typical news headline, for example, usually includes a subject and an action, and often a geographic location. most daily newspapers follow this format for regular stories, as they must clearly indicate to the reader what events are news and what they may want to read. magazines and other, less frequent sources of news often do not identify the subject so clearly. these sources are difficult to use in the aggregation system without deeper analysis.

the author of this paper has been using such a system for over one and a half years to gather, distill and present world news from dozens of sources. this system, called "monkey news", is shown below. the top six topics are indicated in the header. stories are arranged in descending order from most popular to least and are gathered by topic. the system presents this information as a static HTML page updated every 2 hours.
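
the 2-hour refresh could be driven by something as simple as a crontab entry; the script name and paths below are hypothetical, not the actual monkey news configuration.

    # regenerate the static page every 2 hours (hypothetical script and paths)
    0 */2 * * * /usr/bin/python /home/news/monkeynews.py > /var/www/news/index.html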


monkey news front page

image capture of the front page of the "monkey news" site showing the top six subjects and the first two clusters of stories grouped together by topic. stories appear in descending order of popularity as determined by topic mentions from the various sites.


an individual group is shown next. the top of the grouping includes the headline, source identification, and a small paragraph describing the topic. additional sources are grouped underneath this initial story, allowing for the browsing of these stories.


monkey news story grouping

screen capture of an individual story block from monkey news. the grouping is clearly identifiable based on the topic, and the usability of a single group is evident. the top story in the group has the description of the story available, with supplemental links and sources clearly identified.


using this method, the number of news sources can be easily expanded without increasing the burden on the reader. instead, the accuracy of the groups is improved, and additional views become available. when new sources are added, this helps prevent an explosion in the number of stories that must be read before a complete picture of world events is gained. this data is inferred from the quantity of sources talking about a subject, each of which provides a "vote" for the popularity of the topic. this data is inherent in a collection of stories and feeds and does not come from external sources.

monkey news has proven to be a useful site for the author. during the 2004 presidential election, for example, the viewpoints of various news sites were constantly available. during the invasion of iraq by US-led forces, a constant stream of updates was also available from different sources, both foreign and domestic. while this would have been available using a traditional RSS aggregator as described above, the number of stories would have been overwhelming. over 1000 stories a day are gathered, grouped, and presented using this system. while an RSS aggregator facilitates the gathering of these stories, it does not facilitate their consumption. this approach clearly reduces clutter and improves readability over a basic, flat approach. monkey news results are periodically evaluated by comparing the stories that have been identified as popular to sites such as google news, topix, and daypop. while these sites often offer more stories, the core events at the time are correctly identified by the monkey news system.

evaluation of the method

while the method described above works well for handling large quantities of world news feeds, it fails to work in a number of other situations. these are inherent limitations in the approach taken and require significant changes to the methods.

one of the primary limitations of this approach is that it requires that the sources all discuss the same topics using similar terms. for example, it would be difficult to group an article about the US presidential race if it described the race using terms from shakespeare. because no other stories would be using the same poetic approach, the true nature of the topic would be obscured in the analysis.

this leads directly into the second limitation, namely a requirement that the analysis all (or nearly all) be done in the same language. in the absence of robust machine translators, sources writing in different languages will not use the same terms to describe a topic.

a third limitation is that some threshold must be crossed for a grouping to form, meaning the number of sources required to get reasonable results scales with the diversity of topics. this is evidenced by the success of the clustering for world news based on approximately 65 feeds, while similar groupings for "memes" require thousands of blogs and other sites to be aggregated and analyzed.

finally, this approach does not work well to detect breaking news events, since it is a second order analysis. new, breaking topics (such as fast moving weather systems, terrorist attacks, or the like) are first reported in the press and must gather enough momentum or magnitude to compete with other stories in order to become visible. in this case, the topic would become most visible when ranked by momentum (ie rising in visibility most quickly) rather than by static rankings of visibility.
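
a momentum ranking of this kind could be sketched roughly as follows; the helper and its arguments are hypothetical, with each window assumed to be a list of stemmed-term lists as in the earlier examples.

    # hypothetical sketch: rank terms by how quickly they are rising, not raw count
    from collections import Counter

    def rising_terms(recent_windows, older_windows, top=10):
        """each argument is a list of stemmed-term lists from one time window."""
        now = Counter(t for terms in recent_windows for t in terms)
        before = Counter(t for terms in older_windows for t in terms)
        momentum = {t: now[t] - before.get(t, 0) for t in now}
        return sorted(momentum.items(), key=lambda kv: kv[1], reverse=True)[:top]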

future work

there are a number of improvements that can be made to the general strategy listed above. the original implementation of this approach (which was used to generate the cluster graph above) used a naive approach to discover news topics. it would be relatively easy to incorporate the Python natural language toolkit to do actual term analysis. additionally, digram and trigram analysis could be performed for special terms, such as "white house" and "sierra leone", rather than individual terms.
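
the digram analysis mentioned above might look something like this illustrative helper, which keeps adjacent word pairs together so phrases like "white house" survive as single terms:

    # illustrative: extract adjacent word pairs from an already-normalized term list
    def digrams(terms):
        return [" ".join(pair) for pair in zip(terms, terms[1:])]

    # e.g. digrams(["white", "house", "statement"]) -> ["white house", "house statement"]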

time resolved clustering can also be performed using techniques such as bayesian network analysis. this can be used to discover related terms and highlight relationships that were previously unknown.

additional time-resolved analysis could include term analysis based on rankings over time. this could be used to evaluate the effectiveness of public relations efforts and of major events such as political campaigns.

finally, clustering can be further improved by adding additional data to the system. currently the implementation only clusters based on terms that appear in the headlines. however, using the full news story body, additional contextual information can be gathered and used to produce more robust groupings.

it would be interesting to build a desktop application which acted in this fashion. the storage requirements for any single user would be minimal.

related