pimlog

pimlog - personal information categorization

pimlog is a personal information categorizer. that is, it aims to take personal information written in a "flat file" log and categorize it into the 3 traditional PIM categories: appointments, todo list entries and address book entries. this particular domain of written text does not have much in the way of public corpora (which is understandable) to support any "traditional" stochastic categorization. however, we can make some observations. consider some sample log entries (these were taken from my own log):
- need many more checks in xdr for compounds
- last bus outta here: 21:48
- get backups running
- lunch with jose thursday or friday
- get pictures developed
- france receipts tomorrow
- plane to accra departs friday 18:00
though these are by no means representative, they demonstrate the general nature of such log entries, at least in terms of what we have observed personally. notice that most of them are "very" ungrammatical and provide little semantic context. we argue that this is an effect of the fact that these are personal notes: since the only reader is also the writer, it becomes less important to ensure that the sentences are generally understandable. another important observation is that these log entries tend to be consistent in style, which provides an important anchor when attempting to categorize them. examples of style consistencies are using the same set of words to denote a certain action, and employing similar structures and substructures. for example, consider
"I have a meeting with Dan tomorrow at noon."
and
"Tomorrow, I have a meeting with Dan at noon."
note that, while "tomorrow" changed position, most of the other words remained in the same order relative to one another. these observations motivate what we call rule-based classification, the scheme of classification used in pimlog.

rule-based classification

in rule-based classification, rules are inferred from existing content (for example in the same log). for each rule, the author identifies the content that is actually crucial to classifying entries with that particular rule. for example, in
"Tomorrow, I have a meeting with Dan at noon. We will be discussing new business plans for the new millennium, and then contemplating exactly how it is we're going to take over the world."
only the boldfaced text is needed in order to determine that this should be a calendar entry for "tomorrow noon". given such a rule and an entry to be classified, we measure the distance between the rule and the entry. given the observations made in the previous section, such a distance needs to favor structural likeness as well as the use of "approximately" the same words; that is, we can infer the likeness of sets of words. we have augmented the levenshtein distance (aka minimum edit distance) algorithm to measure our structural likeness. the reasoning behind this choice is that minimum edit distance metrics are quite comparable to our notion of structural likeness; for example, simply "switching" two components does not cost as much as a wholesale substitution of said components with new ones. so, given a set of rules (the more the better), when a new entry is to be classified, pimlog measures said distance from each rule to the entry and chooses the rule which has the lowest distance (highest "rank"). the type attached to that rule then becomes the type of the entry.
in addition to measuring structural distance, pimlog normalizes target entries and rules before they are compared. such normalizations substitute specific names with more general categories, e.g. "Tuesday" -> "", and also normalize names, so that "Karl Johann Adler Bach Fassbinder Fleischer Weber Koch" compares favorably to "Marius Eriksen". rules look like this:
Meeting with Dave on Tuesday / appt
Appointment with Dan Friday / appt
Appointment at the Marriot on Sunday / appt
Meet with Dr. Karl tomorrow morning / appt
Meeting with Roger tomorrow at noon / appt
I need to finish my NLP homework / todo
Continue project Sekret / todo
Mary's phone number is 666-666-666 / addr
Get backups running / todo
Check out Ion / todo
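to make the matching scheme concrete, here is a minimal sketch of how a rules file in the format above could be read and matched against an entry. it is not pimlog's actual code. the function names, the DAY placeholder, the treatment of "switching" as a single adjacent-token transposition, and the rank formula (1 minus distance over the longer length) are all assumptions made for illustration; name normalization is left out.

```python
import re

# Hypothetical normalization table: weekday names collapse to one placeholder.
WEEKDAYS = {"monday", "tuesday", "wednesday", "thursday",
            "friday", "saturday", "sunday"}


def normalize(text):
    """Lowercase, tokenize, and replace weekday names with 'DAY'."""
    tokens = re.findall(r"[a-z0-9']+", text.lower())
    return ["DAY" if tok in WEEKDAYS else tok for tok in tokens]


def parse_rules(lines):
    """Read 'text / label' lines into (normalized tokens, label) pairs."""
    rules = []
    for line in lines:
        text, sep, label = line.strip().rpartition(" / ")
        if sep:  # skip blank or malformed lines
            rules.append((normalize(text), label.strip()))
    return rules


def token_distance(a, b):
    """Edit distance over tokens; swapping two adjacent tokens costs 1.

    This is the 'optimal string alignment' variant of Levenshtein distance,
    used here as one possible reading of the augmented distance.
    """
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i
    for j in range(cols):
        d[0][j] = j
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution or match
            if (i > 1 and j > 1 and a[i - 1] == b[j - 2]
                    and a[i - 2] == b[j - 1]):
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # transposition
    return d[-1][-1]


def rank(rule_tokens, entry_tokens):
    """Assumed rank: 1.0 for identical token lists, lower as distance grows."""
    longest = max(len(rule_tokens), len(entry_tokens), 1)
    return 1.0 - token_distance(rule_tokens, entry_tokens) / longest


def classify(entry, rules):
    """Return (rank, label) of the best-ranking rule for the entry."""
    tokens = normalize(entry)
    return max(((rank(r, tokens), label) for r, label in rules),
               key=lambda pair: pair[0])
```

with the two rules "Meeting with Dave on Tuesday / appt" and "Get backups running / todo", calling classify("Meeting with Dave on Friday", rules) picks the appt rule, since both sides normalize to the same token list.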
the ruleset may be augmented at any time. additionally, pimlog features a context-free grammar aimed at identifying and parsing time expressions, so that entries that have temporal references can be further resolved. pimlog uses dparser to perform the parsing. like many other parsers, dparser is troubled (computationally) by large all-encompassing wildcard expressions. such expressions could be used, for example, to identify time expressions anywhere within a large sentence. however, due to the large resource consumption of this methodology, we instead create a more limited grammar and apply it to each (i, j) window of tokens in the input. we thus parse O(n^2) times, but the cost of this turns out to be trivial in comparison with that of the more liberal grammars. of course, in this case, more than one parse may be successful, so we select the parse with the largest hamming weight, that is, the number of "entries" filled in (for example month, day and time). the hamming weight thus provides us with a weak measure of information.
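the windowing idea can be sketched as follows. this is only an illustration: a small hand-written matcher (parse_window, a hypothetical name) stands in for the dparser grammar, the field names and the "at"/"on" filler words are assumptions, and the weight of a parse is simply the number of fields it fills in.

```python
import re

WEEKDAYS = {"monday", "tuesday", "wednesday", "thursday",
            "friday", "saturday", "sunday"}
MONTHS = {"january", "february", "march", "april", "may", "june", "july",
          "august", "september", "october", "november", "december"}
FILLERS = {"at", "on"}  # assumed connectives allowed inside a time expression
TIME_RE = re.compile(r"(\d{1,2})(?::(\d{2}))?(am|pm|a|p)")


def parse_window(tokens):
    """Stand-in for the limited grammar.

    Succeeds only if every token in the window is a time component or a
    filler word. Returns a dict of filled-in fields, or None on failure.
    """
    fields = {}
    for raw in tokens:
        tok = raw.lower().strip(".,")
        if tok in FILLERS:
            continue
        if tok in WEEKDAYS:
            key, value = "weekday", tok
        elif tok in MONTHS:
            key, value = "month", tok
        elif tok.isdigit() and 1 <= int(tok) <= 31:
            key, value = "day", int(tok)
        elif TIME_RE.fullmatch(tok):
            key, value = "time", tok
        else:
            return None  # unknown token: this window does not parse
        if key in fields:
            return None  # the same field twice: reject the window
        fields[key] = value
    return fields


def best_time_expression(sentence):
    """Try every (i, j) token window and keep the parse with most fields."""
    tokens = sentence.split()
    best = {}
    for i in range(len(tokens)):
        for j in range(i + 1, len(tokens) + 1):
            fields = parse_window(tokens[i:j])
            if fields is not None and len(fields) > len(best):
                best = fields
    return best
```

each window is cheap to check because the matcher never has to skip over arbitrary text; the quadratic number of windows takes the place of the wildcard in the grammar.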

documentation

pimlog is quite simple to use. simply create a RULES file in the format demonstrated above. use the script learn.py to create a new rules database from a rules file:
$ python learn.py RULES RULES.db
now you are ready to start applying pimlog. if your entries are in a file called ENTRIES, one entry per line, pimlog categorizes them with
$ python match.py RULES.db ENTRIES
I have a meeting with Dan tomorrow at noon.
        Type 1 with Rank 0.833333
Appointment with Mary at 6p Thursday.
        Type 1 with Rank 1.000000
                6:0PM   
I need to figure out how to do my NLP assignment before Friday morning.
        Type 2 with Rank 0.777778
Some rather random text, that I have no idea how got here!
        Type 2 with Rank 0.142857
But, I do have an appointment with Mark tomorrow.  Hopefully at noon.
        Type 1 with Rank 0.833333
I need to get an appointment with Mary thursday jan 3 at 12p.
        Type 2 with Rank 0.571429
Remember my appointment with Angela on friday march 6 at 9am
        Type 1 with Rank 0.600000
                9:0AM Friday March 6
types 1, 2 and 3 refer to appointments, todo list entries and address book entries respectively. if the entry has been classified as an appointment, the time expression grammar is run on the entry, and if the parse is successful its output is appended to that of the matcher.

logistics

pimlog is still highly experimental, but if you are interested, i will be more than happy to give you the source code, which will be released under a BSD license (request via email). pimlog is written in python (which saved me countless hours) and requires the dparser package and eric brill's tagger. otherwise, pimlog is self-contained.