pimlog - personal information categorization
pimlog is a personal information categorizer. that is, it aims to
take personal information written in a "flat file" log and
categorize it into the 3 traditional PIM categories,
- todo list,
- appointments, and
- address book.
this particular domain of written text does not have much in the way
of public corpora (which is understandable) to support any
"traditional" stochastic categorization. however, we make the
following observations:
- the writing style in such logs is highly personal,
- log entries are typically ungrammatical, that is, they do
not constitute "proper" english,
- even individual users do not have a log of sufficient size
for stochastic techniques to be applied with much success.
consider some sample log entries (these were taken from my own log):
- need many more checks in xdr for compounds
- last bus outta here: 21:48
- get backups running
- lunch with jose thursday or friday
- get pictures developed
- france receipts tomorrow
- plane to accra departs friday 18:00
though these are by no means representative, they demonstrate the
general nature of such log entries, at least in terms of what we have
observed personally. notice that most of them are "very"
ungrammatical and provide little semantic context. we argue that this
is an effect of the fact that these are personal notes: since the only
reader is also the writer, it becomes less important to ensure that
the sentences are generally understandable. another important
observation is that these log entries tend to be consistent in style,
which provides an important anchor when attempting to categorize them.
examples of style consistencies are things such as using the same set
of words to denote a certain action, and employing similar structures
and substructures. for example, consider
"I have a meeting with Dan tomorrow at noon."
and
"Tomorrow, I have a meeting with Dan at noon."
note that, while "tomorrow" changed positions, most words remained in
the same order relative to some other set of words. these
observations motivate what we call rule-based classification --
the scheme of classification used in pimlog.
rule-based classification
in rule-based classification, rules are inferred from
existing content (for example in the same log). the author identifies
the content that is crucial to classifying an entry based on that
particular rule. for example, in
"Tomorrow, I have a meeting with Dan at noon. We
will be discussing new business plans for the new millennium, and then
contemplating exactly how it is we're going to take over the
world."
only the first sentence (boldfaced in the original) is needed in order
to determine that this should be a calendar entry for "tomorrow noon".
given such a rule and an entry to be classified, we measure the
distance between the rule and the entry. given the
observations made in the previous section, such a distance needs to
favor structural likeness as well as the use of "approximately" the
same words -- that is, we can infer the likeness of sets of
words.
we have augmented the levenshtein distance (aka
minimum edit distance) algorithm to measure our structural likeness.
the reasoning behind this choice is that minimum edit distance is
quite comparable to our notion of structural likeness; for example,
simply "switching" two components does not cost as much as a wholesale
substitution of said components with new ones.
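to make the idea concrete, here is a minimal sketch of the plain (unaugmented) levenshtein distance computed over word tokens rather than characters. the function names are made up for illustration, and pimlog's own augmentations are not reproduced here.

```python
def tokens(text):
    """Lowercase a sentence and split it into bare word tokens."""
    return [w.strip(".,!?") for w in text.lower().split()]


def edit_distance(a, b):
    """Standard Levenshtein distance between two token sequences.

    Insertions, deletions and substitutions each cost 1.
    """
    prev = list(range(len(b) + 1))  # distance from empty prefix of a
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1,              # delete x
                           cur[j - 1] + 1,           # insert y
                           prev[j - 1] + (x != y)))  # substitute x -> y
        prev = cur
    return prev[-1]


first = tokens("I have a meeting with Dan tomorrow at noon.")
second = tokens("Tomorrow, I have a meeting with Dan at noon.")
print(edit_distance(first, second))
```

moving "tomorrow" is accounted for as one deletion plus one insertion, while the rest of the sentence lines up at no cost.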
so, given a set of rules (the more the better), when a new entry is
classified, pimlog measures said distance from each rule to the
entry and chooses the rule which has the lowest distance (highest
"rank"). this rule then indicates the type of the entry being
classified.
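the selection step can be sketched as a nearest-rule lookup. this is only an illustration: the rank formula used here (1 minus the distance divided by the longer token count) is an assumption, since the text does not say how pimlog derives its rank, and the plain levenshtein distance stands in for the augmented one.

```python
def edit_distance(a, b):
    """Plain Levenshtein distance between two token sequences."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (x != y)))
        prev = cur
    return prev[-1]


def rank(rule_tokens, entry_tokens):
    """Assumed rank: 1.0 for identical sequences, 0.0 for nothing in common."""
    longest = max(len(rule_tokens), len(entry_tokens)) or 1
    return 1.0 - edit_distance(rule_tokens, entry_tokens) / longest


def classify(entry, rules):
    """Return (category, rank) of the rule closest to the entry."""
    entry_tokens = entry.lower().split()
    scored = [(rank(text.lower().split(), entry_tokens), category)
              for text, category in rules]
    best_rank, best_category = max(scored, key=lambda pair: pair[0])
    return best_category, best_rank


# A tiny hand-written ruleset, for illustration only.
RULES = [
    ("Meeting with Roger tomorrow at noon", "appt"),
    ("Get backups running", "todo"),
    ("Mary's phone number is 666-666-666", "addr"),
]

print(classify("Meeting with Dan tomorrow at noon", RULES))
```

with more rules per category, an entry has a better chance of landing near a rule of the right type, which is why a larger ruleset helps.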
in addition to structural distance, pimlog also normalizes
target entries and rules before they are compared. such
normalization substitutes more general categories for specific names,
e.g. "Tuesday" is replaced by a generic day placeholder, and also
normalizes names, so that "Karl
Johann Adler Bach Fassbinder Fleischer Weber Koch" compares favorably
to "Marius Eriksen."
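a rough sketch of such a normalization pass follows. the placeholder tokens DAY and NAME are hypothetical, and the capitalization heuristic used to spot names is an assumption made for the sake of a self-contained example; it is not how pimlog detects names.

```python
DAYS = {"monday", "tuesday", "wednesday", "thursday",
        "friday", "saturday", "sunday"}


def normalize(text):
    """Replace weekdays with DAY and runs of capitalized words with NAME.

    Crude heuristic: any capitalized word that is not the first word of
    the entry is treated as part of a name.
    """
    out = []
    for i, word in enumerate(text.split()):
        bare = word.strip(".,!?")
        if not bare:
            continue  # token was punctuation only
        if bare.lower() in DAYS:
            token = "DAY"
        elif i > 0 and bare[0].isupper():
            token = "NAME"
        else:
            token = bare.lower()
        # Collapse a multi-word name into a single NAME token.
        if token == "NAME" and out and out[-1] == "NAME":
            continue
        out.append(token)
    return out


long_name = normalize("Lunch with Karl Johann Adler Bach Fassbinder "
                      "Fleischer Weber Koch on Tuesday")
short_name = normalize("Lunch with Marius Eriksen on friday")
print(long_name)
print(short_name)
```

after normalization both entries reduce to the same token sequence, so their edit distance is zero regardless of how long the name is.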
rules look like this:
Meeting with Dave on Tuesday / appt
Appointment with Dan Friday / appt
Appointment at the Marriot on Sunday / appt
Meet with Dr. Karl tomorrow morning / appt
Meeting with Roger tomorrow at noon / appt
I need to finish my NLP homework / todo
Continue project Sekret / todo
Mary's phone number is 666-666-666 / addr
Get backups running / todo
Check out Ion / todo
the ruleset may be augmented at any time.
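for illustration, a rules file in this format could be read along the following lines. this is a sketch, not the code in learn.py: the function name is hypothetical, and the numeric type ids simply follow the numbering pimlog prints (1 for appointments, 2 for todo entries, 3 for address book entries).

```python
# Numeric types as printed by pimlog: 1 = appt, 2 = todo, 3 = addr.
TYPE_IDS = {"appt": 1, "todo": 2, "addr": 3}


def parse_rules(lines):
    """Parse 'text / category' lines into (text, category) pairs."""
    rules = []
    for number, line in enumerate(lines, 1):
        line = line.strip()
        if not line:
            continue  # skip blank lines
        # Split on the last " / " so the rule text may itself contain slashes.
        text, sep, category = line.rpartition(" / ")
        if not sep or category not in TYPE_IDS:
            raise ValueError("line %d is not 'text / category': %r" % (number, line))
        rules.append((text.strip(), category))
    return rules


sample = [
    "Meeting with Dave on Tuesday / appt",
    "",
    "Continue project Sekret / todo",
    "Mary's phone number is 666-666-666 / addr",
]
for text, category in parse_rules(sample):
    print(TYPE_IDS[category], text)
```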
additionally, pimlog features a context-free grammar, aimed at
identifying and parsing time expressions so that entries that have
temporal references can be further resolved. pimlog uses dparser to
perform the parsing.
like many other parsers, dparser is troubled (computationally) by
large all-encompassing wildcard expressions. such expressions could
be used to identify time expressions anywhere within a large sentence,
for example. however, due to the large resource consumption of this
methodology, we instead create a more limited grammar and apply it
to each (i, j) window of tokens in the
input. we therefore parse O(n^2) times, but this turns out
to be trivial in comparison with the cost of the more liberal
grammars. of course, more than one parse may then be successful. in
that case, we select the parse with the largest hamming weight, that
is, the number of "entries" filled in (for example month, day and
time). the hamming weight thus provides us with a weak measure of
information.
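the windowing strategy can be sketched as follows. the parse_window function below is a toy regex-based stand-in for the limited grammar (it does not use dparser, and the slot names are assumptions); the point is the loop over all (i, j) windows and the choice of the parse with the most slots filled.

```python
import re

DAYS = {"monday", "tuesday", "wednesday", "thursday",
        "friday", "saturday", "sunday"}
MONTHS = {"january", "february", "march", "april", "may", "june", "july",
          "august", "september", "october", "november", "december"}
FILLER = {"at", "on"}
TIME_RE = re.compile(r"\d{1,2}(:\d{2})?(am|pm|a|p)")


def parse_window(tokens):
    """Toy limited grammar: succeeds only if every token in the window
    belongs to a time expression. Returns the filled slots or None."""
    fields = {}
    for token in tokens:
        word = token.lower().strip(".,")
        if word in FILLER:
            continue
        if word in DAYS:
            slot, value = "day", word
        elif word in MONTHS:
            slot, value = "month", word
        elif TIME_RE.fullmatch(word):
            slot, value = "time", word
        elif word.isdigit():
            slot, value = "date", int(word)
        else:
            return None  # a token outside the grammar fails the parse
        if slot in fields:
            return None  # each slot may be filled only once
        fields[slot] = value
    return fields or None


def best_parse(tokens):
    """Try every (i, j) window and keep the parse with the most slots."""
    best = None
    for i in range(len(tokens)):
        for j in range(i + 1, len(tokens) + 1):
            fields = parse_window(tokens[i:j])
            if fields is not None and (best is None or len(fields) > len(best)):
                best = fields
    return best


entry = "Remember my appointment with Angela on friday march 6 at 9am"
print(best_parse(entry.split()))
```

here the number of filled slots plays the role of the hamming weight: a window that yields day, month, date and time beats one that yields only a time.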
documentation
pimlog is quite simple to use. simply create a RULES
file in the format demonstrated above. use the script
learn.py to create a new rules database from a rules file:
$ python learn.py RULES RULES.db
now you are ready to start applying pimlog. if your entries
are in a file called ENTRIES, one entry per line,
pimlog categorizes them with:
$ python match.py RULES.db ENTRIES
I have a meeting with Dan tomorrow at noon.
Type 1 with Rank 0.833333
Appointment with Mary at 6p Thursday.
Type 1 with Rank 1.000000
6:0PM
I need to figure out how to do my NLP assignment before Friday morning.
Type 2 with Rank 0.777778
Some rather random text, that I have no idea how got here!
Type 2 with Rank 0.142857
But, I do have an appointment with Mark tomorrow. Hopefully at noon.
Type 1 with Rank 0.833333
I need to get an appointment with Mary thursday jan 3 at 12p.
Type 2 with Rank 0.571429
Remember my appointment with Angela on friday march 6 at 9am
Type 1 with Rank 0.600000
9:0AM Friday March 6
types 1, 2 and 3 refer to appointments, todo list entries and address
book entries respectively. if the entry has been classified as an
appointment, the time expression grammar is run on the entry, and if
the parse is successful, its output is appended to that of the matcher.
logistics
pimlog is still highly experimental, but if you are interested,
i will be more than happy to give you the source code, which will be
released under a BSD license (request via email).
pimlog is written in python (which saved me countless
hours) and requires the dparser package and eric brill's tagger.
otherwise, pimlog is self-contained.
copyright (c) 2003 marius aamodt eriksen <marius@monkey.org>