Proposal for Senseval Scoring Scheme
I. Dan Melamed & Philip Resnik
----------------------------------------------
1. A principled evaluation metric can be derived by assigning
probabilities over sense tags output by WSD algorithms.
Algorithms that output multiple tags but do not assign
probabilities can be treated as assigning uniform probabilities
over the tags that they output. E.g. an algorithm that considers
senses A and B as possible, but eliminates senses C, D and E for a
word with 5 senses in the reference inventory is really saying:
sense prob.
----- -----
A .5
B .5
C 0
E 0
D 0
2. Given a probability distribution over sense tags and a single
known correct tag, the algorithm's score should be the probability
that the algorithm assigns to the correct tag. Note that the
exact-match criterion falls out as a special case of this metric,
where the algorithm selects exactly one sense, which is equivalent
to assigning 100% of the probability mass to it.
3. Given multiple possible correct tags for a given word token, the
algorithm's score should be the sum of ALL probabilities that it
assigns to ANY of the correct tags. The premise here is that it
is impossible to tell whether a multi-sense annotation was
intended as disjunctive or conjunctive, so algorithms should be
given the benefit of the doubt. E.g. annotator tags a word with
senses A and B:
algorithm's output score
------------------ -----
A 1
.5 A; .5 B 1
.3 A; .7 C .3
.3 A; .4 B; .3 C .7
A, B, C 2/3
4. The probabilistic treatment of sense tags can be extended to
handle tree-structured tagsets, such as HECTOR, if the structure
is interpreted as an IS-A hierarchy. E.g., if sense 3.2 is a
sub-sense of sense 3, then any word token of sense 3.2 *also* IS-A
token of sense 3. (Further extensions exist for more general
DAGs, such as WordNet, but they don't concern us here.)
The same scoring criterion can be used for structured tagsets as
for unstructured ones: What's the probability that the algorithm
assigns to any of the correct tags? The complication for
structured tagsets is that it is not obvious how to compare tags
that are in a parent-child relationship. This problem can be
solved by defining two kinds of probability distributions:
Pr(occurrence of parent sense | occurrence of child sense) and
Pr(occurrence of child sense | occurrence of parent sense). The
first one is easy: In a tree-structured IS-A hierarchy,
Pr(occurrence of parent node | occurrence of child node) = 1. The
second one is harder, unfortunately; in general, these
("downward") probabilities are unknown. In the absence of prior
knowledge about sense distributions over particular sense-tree
branches, the maximum entropy principle dictates that we assign a
uniform distribution over Pr(occurrence of child sense |
occurrence of parent sense) for each sense. Fortunately, this is
not such a bad assumption. It will be false in most individual
cases, but if we evaluate WSD algorithms by averaging performance
over many different word types, most of the biases should come out
in the wash.
Now, how do we use these conditional probabilities for scoring?
Treat each non-leaf sense-tag as underspecified. E.g. if sense 3
has just the two subsenses 3.1 and 3.2, then tagging a word with
sense 3 is equivalent to giving it a probability of one half of
being sense 3.1 and one half of being sense 3.2, given our
assumption of uniform downward probabilities. This applies both
to the tags in the output of WSD algorithms and to the manual
(correct, reference) annotations.
Example:
Suppose our sense-tree for a given word has senses 1 and 2, which
are not subdivided in any way. It has sense 3 divided into 3.1
and 3.2, as above. In addition 3.1 is subdivided into 3.1a and
3.1b. There is also a sense 4, which is split into 4.1, 4.2 and
4.3.
Under our assumption of uniform downward probabilities, we start
by deducing:
Pr(3.1 | 3) = .5
Pr(3.1a | 3.1) = .5
(and so) Pr(3.1a | 3) = .25
Pr(4.2 | 4) = 1/3, etc.
If any of the conditionals above are reversed, then the
probability is always 1. E.g. Pr(3 | 3.1a) = 1.
Next, we apply these probabilities to find Pr(any correct sense |
algorithm's assigned senses) as in the following example
cases:
manual annotation algorithm`s output score
----------------- ------------------ -----
2 3 0
3 3 1
3 3.1 1
3 3.1b 1
3.1 3 .5
3.1 & 3.2 3 .5 + .5* = 1
3.1a 3 .25
3.1a & 4.2 4 1/3 (= Pr(4.2 | 4))
3.1a & 4.2 3.1 .5
3.1a & 4.2 3.1 & 4.2 .5*.5 + .5*1 = .75
3.1a & 4.2 3.1 & 4 .5*.5 + .5*.333 = .41666
5. This scoring scheme depends only on the tree structure of the
hierarchy, and not on the types of nodes in it. In particular,
questions of whether part-of-speech and other syntactic
distinctions should be part of the sense inventory are orthogonal
to the issue addressed here.
==========================================================
References:
This proposal incorporates some ideas from section 3.1 of
Philip Resnik and David Yarowsky, "A perspective on word sense
disambiguation methods and their evaluation", position paper
presented at the ACL SIGLEX Workshop on Tagging Text with Lexical
Semantics: Why, What, and How?, held April 4-5, 1997 in
Washington, D.C., USA in conjunction with ANLP-97.
available from
http://www.umiacs.umd.edu/~resnik/papers/siglex97_perspective.ps