Commit 23578a23 authored by Tim Peters's avatar Tim Peters

OkapiIndex.query_weight(): return an upper bound on possible doc scores.

CosineIndex.query_weight():  rewrote to squash code duplication.  No
change in what it returns (it's always returned an upper bound on
possible doc scores, although people probably haven't thought of it
that way before).

Elsewhere:  consequent changes.

Problems:

+ mhindex.py needs repair, but I can't run it.  Note that its current
  use of query_weight isn't legitimate (the usage doesn't conform to
  the IIndex interface -- passing a string is passing "a sequence",
  but not the intended sequence <wink>).

+ ZCTextIndex doesn't pass query_weight() on.

+ We've defined no methods to help clients compute what needs to be
  passed to query_weight (a sequence of only the positive terms).

  I changed mailtest.py to cheat, but it's doing a wrong thing for
  negative terms.

+ I expect it will be impossible to shake people from the belief that
  100.0 * score / query_weight is some kind of "relevance score".  It
  isn't.  So perhaps better not to expose this in ZCTextIndex.
parent db0965b2
......@@ -199,8 +199,13 @@ class BaseIndex(Persistent):
raise NotImplementedError
# Subclass must override.
# It's not clear what it should do; so far, it only makes real sense
# for the cosine indexer.
# It's not clear what it should do. It must return an upper bound on
# document scores for the query. It would be nice if a document score
# divided by the query's query_weight gave the probability that a
# document was relevant, but nobody knows how to do that. For
# CosineIndex, the ratio is the cosine of the angle between the document
# and query vectors. For OkapiIndex, the ratio is a (probably
# unachievable) upper bound with no "intuitive meaning" beyond that.
def query_weight(self, terms):
    """Return an upper bound on document scores for a query.

    'terms' is a sequence of the query's positive terms (a term that
    appears more than once in the query appears more than once in
    'terms'; terms under a "not" are excluded).

    Subclasses must override.  The only defined property of the result
    is that it's an upper bound on the scores returned for documents
    matching the query.  It would be nice if score / query_weight gave
    the probability that a document is relevant, but nobody knows how
    to compute that.
    """
    raise NotImplementedError
......
......@@ -88,13 +88,8 @@ class CosineIndex(BaseIndex):
wids += self._lexicon.termToWordIds(term)
N = float(len(self._docweight))
sum = 0.0
for wid in wids:
if wid == 0:
continue
map = self._wordinfo.get(wid)
if map is None:
continue
wt = math.log(1.0 + N / len(map))
for wid in self._remove_oov_wids(wids):
wt = inverse_doc_frequency(len(self._wordinfo[wid]), N)
sum += wt ** 2.0
return scaled_int(math.sqrt(sum))
......
......@@ -55,6 +55,10 @@ class IIndex(Interface.Base):
'terms' is a sequence of all terms included in the query,
although not terms with a not. If a term appears more than
once in a query, it should appear more than once in terms.
Nothing is defined about what "weight" means, beyond that the
result is an upper bound on document scores returned for the
query.
"""
def index_doc(docid, text):
......
......@@ -142,11 +142,22 @@ class OkapiIndex(BaseIndex):
return L
def query_weight(self, terms):
    """Return an upper bound on possible document scores for 'terms'.

    'terms' is a sequence of the query's positive terms.  The score
    contribution of a term t for a document D is TF(D, t) * IDF(Q, t).
    IDF is computed directly from the word's document frequency, and
    TF(D, t) is bounded above by 1 + K1, so summing
    scaled_int(idf * (1 + K1)) over the in-vocabulary wids yields a
    (probably unachievable) upper bound on any document's score.
    """
    # Map the query terms to word ids; a term may expand to several.
    wids = []
    for term in terms:
        wids.extend(self._lexicon.termToWordIds(term))
    N = float(len(self._docweight))
    tfmax = 1.0 + self.K1  # upper bound on TF(D, t)
    total = 0
    # _remove_oov_wids drops wids with no _wordinfo entry (and wid 0).
    for t in self._remove_oov_wids(wids):
        idf = inverse_doc_frequency(len(self._wordinfo[t]), N)
        total += scaled_int(idf * tfmax)
    return total
def _get_frequencies(self, wids):
d = {}
......
......@@ -161,8 +161,10 @@ def query(rt, query_str):
print "query:", query_str
print "# results:", len(results), "of", num_results, \
"in %.2f ms" % (elapsed * 1000)
qw = idx.index.query_weight([query_str])
for docid, score in results:
print "docid %4d score %2d" % (docid, score)
scaled = 100.0 * score / qw
print "docid %7d score %6d scaled %5.2f%%" % (docid, score, scaled)
if VERBOSE:
msg = docs[docid]
ctx = msg.text.split("\n", CONTEXT)
......
Markdown is supported
0%
or
You are about to add 0 people to the discussion. Proceed with caution.
Finish editing this message first!
Please register or sign in to comment