Commits · 90bae6a7542be668cc65cc184ef300cf3d993fd7 · Kirill Smelkov / Zope

20 May, 2002 7 commits

Add Zope Copyright notice. · 90bae6a7
Guido van Rossum authored May 20, 2002

90bae6a7
Add Zope Copyright notice. · 53c5d967
Guido van Rossum authored May 20, 2002
```
Fix typo in docstring.
```
53c5d967

Guido van Rossum authored May 20, 2002

- Rephrased the description of the grammar, pointing out that the
  lexicon decides on globbing syntax.

- Refactored term and atom parsing (moving atom parsing into a
  separate method).  The previously checked-in version accidentally
  accepted some invalid forms like ``foo AND -bar''; this is fixed.

tests/testQueryParser.py:

- Each test is now in a separate method; this produces more output
  (alas) but makes pinpointing the errors much simpler.

- Added some tests catching ``foo AND -bar'' and similar.

- Added an explicit test class for the handling of stopwords.  The
  "and/" test no longer has to check self.__class__.

- Some refactoring of the TestQueryParser class; the utility methods
  are now in a base class TestQueryParserBase, in a different order;
  compareParseTrees() now shows the parse tree it got when raising an
  exception.  The parser is now self.parser instead of self.p (see
  below).

tests/testZCTextIndex.py:

- setUp() no longer needs to assign to self.p; the parser is
  consistently called self.parser now.

47bb995d

Fix unintended recursion in parseQueryEx(). (Unittests are coming up! · 98607a5c
Guido van Rossum authored May 20, 2002
```
:-)
```
98607a5c
Limit copyright to 2002; none of this code existed last year. · 9491bc84
Guido van Rossum authored May 20, 2002

9491bc84

Refactor the query parser to rely on the lexicon for parsing terms. · b82b2746

Guido van Rossum authored May 20, 2002

ILexicon.py:

  - Added parseTerms() and isGlob().

  - Added get_word(), get_wid() (get_word() is old; get_wid() for symmetry).

  - Reflowed some text.

IQueryParser.py:

  - Expanded docs for parseQuery().

  - Added getIgnored() and parseQueryEx().

IPipelineElement.py:

  - Added processGlob().

Lexicon.py:

  - Added parseTerms() and isGlob().

  - Added get_wid().

  - Some pipeline elements now support processGlob().

ParseTree.py:

  - Clarified the error message for calling executeQuery() on a
    NotNode.

QueryParser.py (lots of changes):

  - Change private names __tokens etc. into protected _tokens etc.

  - Add getIgnored() and parseQueryEx() methods.

  - The atom parser now uses the lexicon's parseTerms() and isGlob()
    methods.

  - Query parts that consist only of stopwords (as determined by the
    lexicon), or of stopwords and negated terms, yield None instead of
    a parse tree node; the ignored term is added to self._ignored.
    None is ignored when combining terms for AND/OR/NOT operators, and
    when an operator has no non-None operands, the operator itself
    returns None.  When this None percolates all the way to the top,
    the parser raises a ParseError exception.

tests/testQueryParser.py:

  - Changed test expressions of the form "a AND b AND c" to "aa AND bb
    AND cc" so that the terms won't be considered stopwords.

  - The test for "and/" can only work for the base class.

tests/testZCTextIndex.py:

  - Added copyright notice.

  - Refactor testStopWords() to have two helpers, one for success, one
    for failures.

  - Change testStopWords() to require parser failure for those queries
    that have only stopwords or stopwords plus negated terms.

  - Improve compareSet() to sort the sets of keys, and use a more
    direct way of extracting the keys.  This wasn't strictly needed
    (nothing fails without this), but the old approach of copying the
    keys into a dict in a loop depends on the dict hashing to always
    return keys in the same order.

b82b2746
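
A minimal sketch of the None-percolation rule described for QueryParser.py above, not the actual parser. The names `STOPWORDS`, `atom()`, `combine()` and `parse_and()` are hypothetical, the parse tree is reduced to plain tuples, and negated terms are left out for brevity:

```python
class ParseError(Exception):
    """Raised when nothing searchable is left in the query."""


STOPWORDS = {"a", "and", "the"}  # hypothetical stop list, for illustration only


def atom(word, ignored):
    """Return a leaf node, or None (recording the word) if it is a stopword."""
    if word in STOPWORDS:
        ignored.append(word)
        return None
    return ("ATOM", word)


def combine(op, operands):
    """Build an operator node, skipping None operands.

    Returns None when no operand survives, so the None keeps percolating up.
    """
    kept = [node for node in operands if node is not None]
    if not kept:
        return None
    if len(kept) == 1:
        return kept[0]
    return (op, kept)


def parse_and(words):
    """Parse a list of words as one AND clause; fail if only stopwords remain."""
    ignored = []
    tree = combine("AND", [atom(word, ignored) for word in words])
    if tree is None:
        raise ParseError("query contains only stopwords: %r" % ignored)
    return tree, ignored
```

With this stop list, `parse_and(["the", "aa"])` keeps only the `aa` leaf and reports `the` as ignored, while `parse_and(["a", "the"])` raises `ParseError`.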

revert stopper setup.py-age; stopper is not in the Zope module. ok · 5f66a3ce
Matt Behrens authored May 20, 2002
```
guido@.

when/if merge day comes for the installer this will make for less
confusion :-)
```
5f66a3ce

19 May, 2002 6 commits
- For queries, show the total number of results as well as the nbest number; · 7b3de8db
  Tim Peters authored May 19, 2002
```
display the search time in milliseconds too.
```
  7b3de8db
- Show index and pack times in minutes instead of seconds. Show timestamps · f357f8a6
  Tim Peters authored May 19, 2002
```
for start and end of run.  Show elapsed wall-clock time in minutes.
```
  f357f8a6
- Gave it a "-c NNN" context argument (how many leading lines of result · 5da9eb6b
  Tim Peters authored May 19, 2002
```
msgs to display).  Changed the module docstring to separate the index-
generation args from the query args.
```
  5da9eb6b
- Oops! Call the right routine (typo in code just checked in). · a0360090
  Tim Peters authored May 19, 2002
  
  a0360090
- Beef up the reindexing tests to check that they actually fail before the · 94b452e8
  Tim Peters authored May 19, 2002
```
original doc text gets restored.
```
  94b452e8
- QueryParser refactoring step 1: add the lexicon to the constructor args. · bd532bbe
  Guido van Rossum authored May 19, 2002
  
  bd532bbe
18 May, 2002 5 commits
- Rearrange the Okapi reindexing tests to make it easier to figure out what · 97fbb9c9
  Tim Peters authored May 18, 2002
```
went wrong if they fail.
```
  97fbb9c9
- Restore CONTEXT to its original value. · 466d0130
  Tim Peters authored May 18, 2002
  
  466d0130
- Revert braindead change to final pack (it was my change, so it's OK for · f835a0c2
  Tim Peters authored May 18, 2002
```
me to call it braindead <wink>).
```
  f835a0c2
- Pack at the end even if the # of msgs isn't an exact multiple of · 1e8f93fb
  Tim Peters authored May 18, 2002
```
PACK_INTERVAL.
```
  1e8f93fb
- Display total pack time at the end. · eb8de680
  Tim Peters authored May 18, 2002
  
  eb8de680
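
One way to read the "pack at the end" rule from 1e8f93fb above, as a sketch only. The function name, the callbacks and the value of `PACK_INTERVAL` are assumptions; the real script's logic (which f835a0c2 later touched) is not shown in this log:

```python
PACK_INTERVAL = 500  # hypothetical value


def index_messages(msgs, index_one, pack, interval=PACK_INTERVAL):
    """Index every message, packing after each full interval and once at the end.

    The final pack runs only when the last batch was incomplete, so a message
    count that is an exact multiple of the interval is not packed twice.
    """
    count = 0
    for msg in msgs:
        index_one(msg)
        count += 1
        if count % interval == 0:
            pack()
    if count % interval != 0:
        pack()  # leftover messages since the last periodic pack
    return count
```
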
17 May, 2002 22 commits

Special-case None search() results in AND, AND NOT, and OR contexts, and · dfbfbe55

Tim Peters authored May 17, 2002

uncomment the test cases that were failing in these contexts.

Read it and weep <wink>:  In an AND context, None is treated like the
universal set, which jibes with the convenient fiction that stop words
appear in every doc.  However, in AND NOT and OR contexts, None is
treated like the empty set, which doesn't jibe with anything except that
we want

    real_word AND NOT stop_word

and

    real_word OR stop_word

to act like

    real_word

If we treated None as if it were the universal set, these results would
be (respectively) the empty set and the universal set instead.

At a higher level, we *are* consistent with the notion that a query with
a stop word acts the same as if the clause with the stop word weren't
present.  That's what really drives this schizophrenic (context-dependent)
treatment of None.

dfbfbe55
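
The context-dependent treatment of None can be sketched with plain Python sets of docids. This is an illustration under assumptions: the function names are hypothetical, real results carry scores rather than bare docids, and `and_not()` assumes its left operand is a real result set:

```python
def and_(left, right):
    """AND: None acts like the universal set, so the other operand wins."""
    if left is None:
        return right
    if right is None:
        return left
    return left & right


def or_(left, right):
    """OR: None acts like the empty set, so the other operand wins."""
    if left is None:
        return right
    if right is None:
        return left
    return left | right


def and_not(left, right):
    """AND NOT: a None right operand acts like the empty set (excludes nothing)."""
    if right is None:
        return left
    return left - right


real_word = {1, 2, 3}  # docids matching a real word
stop_word = None       # a stop word search yields None
```

All three of `and_(real_word, stop_word)`, `or_(real_word, stop_word)` and `and_not(real_word, stop_word)` give back `real_word` unchanged, which is the "clause weren't present" behavior the message describes.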

Use the same stop list for both indexes. · f968ebb5
Jeremy Hylton authored May 17, 2002

f968ebb5
testDocUpdate(): assert that the common and unique wordsets aren't · 138b3120
Tim Peters authored May 17, 2002
```
empty.
```
138b3120
Added more little OOV query tests. · 4fe5e70c
Tim Peters authored May 17, 2002

4fe5e70c

Added a number of tests to trigger search-can-return-None bugs. The three · f4e63c3e

Tim Peters authored May 17, 2002

tests that currently fail are currently commented out.

Key question:  If someone does a search on a stopword, and nothing else is
in the query, what do we want to do?  Return all docs in a random order?
Return no docs?  Raise an exception?

Second question:  What if someone does a query on

    rare_word AND NOT stop_word

?

f4e63c3e

If -T is passed (query with old TextIndex), try as best as possible to · 86e12d94
Jeremy Hylton authored May 17, 2002
```
do the same query and work as ZCTextIndex would do.

Produce a result set, pump it into NBest, and extract the 10 best.
```
86e12d94
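
The "extract the 10 best" step can be approximated with the standard library. This sketch uses `heapq.nlargest` as a stand-in for NBest, whose actual interface is not shown here; `n_best()` and the `(docid, score)` pair layout are assumptions:

```python
import heapq


def n_best(scored_docs, n=10):
    """Return the n highest-scoring (docid, score) pairs, best first."""
    return heapq.nlargest(n, scored_docs, key=lambda pair: pair[1])
```
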
Reindex docs touching as few docid->w(docid, w) maps as possible. · 86fc53ee
Tim Peters authored May 17, 2002

86fc53ee
Add a little splitter that behaves pretty much like HTMLWordSplitter, · bad257b8
Jeremy Hylton authored May 17, 2002
```
but works with a TextIndex Lexicon.
```
bad257b8

_del_wordinfo(): Simplify. It's the caller's responsibility to ensure that · 81682acc

Tim Peters authored May 17, 2002

the index knows about the doc and the wid.

_del_wordinfo and _add_wordinfo:  s/map/doc2score/g.  map is a builtin
function, and it's needlessly confusing to name a variable that too.

81682acc
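
A small illustration of why the rename helps, using hypothetical helper names and a plain dict of dicts in place of the index's real mappings:

```python
def total_shadowed(wordinfo, wid):
    """Local name 'map' hides the builtin, so calling map() breaks here."""
    map = wordinfo[wid]
    try:
        return sum(map(float, map.values()))
    except TypeError:
        return None  # 'map' is now a dict, and a dict is not callable


def total_renamed(wordinfo, wid):
    """With the local called doc2score, the builtin map() stays usable."""
    doc2score = wordinfo[wid]
    return sum(map(float, doc2score.values()))
```
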

Improve OOV explanation, based on Guido's feedback. · 92c26bc8
Tim Peters authored May 17, 2002

92c26bc8
Implement unique using an IITreeSet as suggested by Tim. · 9b736188
Jeremy Hylton authored May 17, 2002

9b736188

Make sure stop words are used with old TextIndex. · 0d93f320

Jeremy Hylton authored May 17, 2002

I think that the default Lexicon for TextIndex does not use a stop
word list. For the comparison with ZCTextIndex, explicitly pass the
default stop word dict from TextIndex to the lexicon.

0d93f320

Shorten comment so it fits on line. · 8915733b
Jeremy Hylton authored May 17, 2002

8915733b

Two changes and a question posing as a comment. · 504af04c

Jeremy Hylton authored May 17, 2002

In unindex_doc() call _del_wordinfo() for each unique wid in the doc,
not for each wid.  Before we had WidCode and phrase searching,
_docwords stored a list of the unique wids.  The unindex code wasn't
updated when _docwords started storing all the wids, even duplicates.

Replace the try/except around __getitem__ in _add_wordinfo() with a
.get() call.

Add XXX comment about the purpose of the try/except(s) in
_del_wordinfo().  I suspect they only existed because _del_wordinfo()
was called repeatedly when a wid existed more than once.

504af04c
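
The two changes can be sketched as follows. This is not the index's code: plain dicts stand in for its persistent mappings, and the free functions with explicit `wordinfo` and `docwords` arguments are a simplification:

```python
def add_wordinfo(wordinfo, wid, score, docid):
    """Record a score, using .get() instead of try/except around __getitem__."""
    doc2score = wordinfo.get(wid)
    if doc2score is None:
        doc2score = wordinfo[wid] = {}
    doc2score[docid] = score


def del_wordinfo(wordinfo, wid, docid):
    """Forget one (wid, docid) pair; a second call for the same pair fails."""
    doc2score = wordinfo[wid]
    del doc2score[docid]
    if not doc2score:
        del wordinfo[wid]


def unindex_doc(wordinfo, docwords, docid):
    """Remove a doc, visiting each unique wid once even if it repeats."""
    for wid in set(docwords[docid]):
        del_wordinfo(wordinfo, wid, docid)
    del docwords[docid]
```

Iterating over `set(docwords[docid])` is what makes a strict `del_wordinfo()` safe: with duplicates in the list, the second deletion for the same wid would raise `KeyError`.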

Remove redundant imports of ZODB. · cd596b3f

Guido van Rossum authored May 17, 2002

A ZODB import is only redundant if it is not used and does not
precede an import from Persistence.

cd596b3f

search_glob(): nuke the OOV wids (if any) before calling _search_wids. · ac419c5b
Tim Peters authored May 17, 2002
```
It's possible to get OOV wids here due to words the lexicon knows
about that the index has no current instances of.
```
ac419c5b
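
A sketch of the filtering step, assuming the index keeps a mapping keyed by wid. The helper name `drop_oov_wids()` and its arguments are hypothetical:

```python
def drop_oov_wids(index_wordinfo, candidate_wids):
    """Keep only the wids that the index currently has entries for."""
    return [wid for wid in candidate_wids if wid in index_wordinfo]
```
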
Implement correct (albeit inefficient) reindexing, and stop cheating · 9032867d
Tim Peters authored May 17, 2002
```
in the reindexing test.
```
9032867d
Remove more needless imports. · eebb1a61
Tim Peters authored May 17, 2002

eebb1a61
Put an XXX on the important line. · 3e375ea8
Tim Peters authored May 17, 2002

3e375ea8

testDocUpdate(): Thanks to stop-word removal, there weren't actually · f2a03547

Tim Peters authored May 17, 2002

*any* words in common across the versions.  Helped Will along by adding
a pragmatic comment to his "knocking indeed" rant.  Reworked to use
the inscrutable magic of dict.setdefault.

f2a03547
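
For readers unfamiliar with `dict.setdefault`, here is a sketch of how it can group words by the document versions they appear in. The function names and the whitespace splitting are assumptions, not the test's actual code:

```python
def words_by_version(versions):
    """Map each word to the list of version indexes whose text contains it."""
    word2versions = {}
    for i, text in enumerate(versions):
        for word in set(text.split()):
            # setdefault returns the existing list, or installs and returns []
            word2versions.setdefault(word, []).append(i)
    return word2versions


def split_common_unique(versions):
    """Return (words in every version, words in exactly one version), sorted."""
    common, unique = [], []
    for word, seen in words_by_version(versions).items():
        if len(seen) == len(versions):
            common.append(word)
        elif len(seen) == 1:
            unique.append(word)
    return sorted(common), sorted(unique)
```
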

Moved a comment that got disconnected from its class. · 35879b41
Tim Peters authored May 17, 2002

35879b41

Factor out most of the code for indexing a doc. The cosine index may · 460bcba1

Tim Peters authored May 17, 2002

take longer to construct now; both indexers' _get_frequencies routines
were fiddled to return the same kind of stuff again, and I had
previously fiddled the cosine indexer's _get_frequencies to do something
weirder but (probably) faster than this.

460bcba1