Commit a6234aa8 authored by unknown

manual.texi revisions to FULLTEXT section.

manual.texi	other miscellaneous cleanups.
manual.texi	fix missing word


Docs/manual.texi:
  revisions to FULLTEXT section.
  other miscellaneous cleanups.
parent 7a4e83eb
@@ -33990,8 +33990,8 @@ DELETE FROM t1,t2 USING t1,t2,t3 WHERE t1.id=t2.id AND t2.id=t3.id
In the above case we delete matching rows just from tables @code{t1} and In the above case we delete matching rows just from tables @code{t1} and
@code{t2}. @code{t2}.
@code{ORDER BY} and using multiple tables in the @code{DELETE} is supported @code{ORDER BY} and using multiple tables in the @code{DELETE} statement
in MySQL 4.0. is supported in MySQL 4.0.
If an @code{ORDER BY} clause is used, the rows will be deleted in that order. If an @code{ORDER BY} clause is used, the rows will be deleted in that order.
This is really only useful in conjunction with @code{LIMIT}. For example: This is really only useful in conjunction with @code{LIMIT}. For example:
@@ -35947,16 +35947,17 @@ You can set the default isolation level for @code{mysqld} with
@cindex full-text search @cindex full-text search
@cindex FULLTEXT @cindex FULLTEXT
Since Version 3.23.23, MySQL has support for full-text indexing As of Version 3.23.23, MySQL has support for full-text indexing
and searching. Full-text indexes in MySQL are an index of type and searching. Full-text indexes in MySQL are an index of type
@code{FULLTEXT}. @code{FULLTEXT} indexes can be created from @code{VARCHAR} @code{FULLTEXT}. @code{FULLTEXT} indexes can be created from @code{VARCHAR}
and @code{TEXT} columns at @code{CREATE TABLE} time or added later with and @code{TEXT} columns at @code{CREATE TABLE} time or added later with
@code{ALTER TABLE} or @code{CREATE INDEX}. For large datasets, adding @code{ALTER TABLE} or @code{CREATE INDEX}. For large datasets, it will be
@code{FULLTEXT} index with @code{ALTER TABLE} (or @code{CREATE INDEX}) much faster to load your data into a table that has no @code{FULLTEXT}
would be much faster than inserting rows into the empty table that has index, then create the index with @code{ALTER TABLE} (or @code{CREATE
a @code{FULLTEXT} index. INDEX}). Loading data into a table that already has a @code{FULLTEXT}
index will be slower.
Full-text search is performed with the @code{MATCH} function. Full-text searching is performed with the @code{MATCH()} function.
@example @example
mysql> CREATE TABLE articles ( mysql> CREATE TABLE articles (
@@ -35988,24 +35989,35 @@ mysql> SELECT * FROM articles
2 rows in set (0.00 sec) 2 rows in set (0.00 sec)
@end example @end example
The function @code{MATCH} matches a natural language (or boolean, The @code{MATCH()} function performs a natural language search for a string
see below) query in case-insensitive fashion @code{AGAINST} against a text collection (a set of of one or more columns included in
a text collection (which is simply the set of columns covered by a a @code{FULLTEXT} index). The search string is given as the argument to
@code{FULLTEXT} index). For every row in a table it returns relevance - @code{AGAINST()}. The search is performed in case-insensitive fashion.
a similarity measure between the text in that row (in the columns that are For every row in the table, @code{MATCH()} returns a relevance value,
part of the collection) and the query. When it is used in a @code{WHERE} that is, a similarity measure between the search string and the text in
clause (see example above) the rows returned are automatically sorted with that row in the columns named in the @code{MATCH()} list.
relevance decreasing. Relevance is a non-negative floating-point number.
Zero relevance means no similarity. Relevance is computed based on the
number of words in the row, the number of unique words in that row, the
total number of words in the collection, and the number of documents (rows)
that contain a particular word.
The above is a basic example of using @code{MATCH} function. Rows are When @code{MATCH()} is used in a @code{WHERE} clause (see example above)
returned with relevance decreasing. the rows returned are automatically sorted with highest relevance first.
Relevance values are non-negative floating-point numbers. Zero relevance
means no similarity. Relevance is computed based on the number of words
in the row, the number of unique words in that row, the total number of
words in the collection, and the number of documents (rows) that contain
a particular word.
It is also possible to perform a boolean mode search. This is explained
later in the section.
The preceding example is a basic illustration showing how to use the
@code{MATCH()} function. Rows are returned in order of decreasing
relevance.
The next example shows how to retrieve the relevance values explicitly.
As neither @code{WHERE} nor @code{ORDER BY} clauses are present, returned
rows are not ordered.
@example @example
mysql> SELECT id,MATCH title,body AGAINST ('Tutorial') FROM articles; mysql> SELECT id,MATCH (title,body) AGAINST ('Tutorial') FROM articles;
+----+-----------------------------------------+ +----+-----------------------------------------+
| id | MATCH (title,body) AGAINST ('Tutorial') | | id | MATCH (title,body) AGAINST ('Tutorial') |
+----+-----------------------------------------+ +----+-----------------------------------------+
@@ -36019,12 +36031,16 @@ mysql> SELECT id,MATCH title,body AGAINST ('Tutorial') FROM articles;
6 rows in set (0.00 sec) 6 rows in set (0.00 sec)
@end example @end example
This example shows how to retrieve the relevances. As neither @code{WHERE} The following example is more complex. The query returns the relevance
nor @code{ORDER BY} clauses are present, returned rows are not ordered. and still sorts the rows in order of decreasing relevance. To achieve
this result, you should specify @code{MATCH()} twice. This will cause no
additional overhead, because the MySQL optimiser will notice that the
two @code{MATCH()} calls are identical and invoke the full-text search
code only once.
@example @example
mysql> SELECT id, body, MATCH title,body AGAINST ( mysql> SELECT id, body, MATCH (title,body) AGAINST
-> 'Security implications of running MySQL as root') AS score -> ('Security implications of running MySQL as root') AS score
-> FROM articles WHERE MATCH (title,body) AGAINST -> FROM articles WHERE MATCH (title,body) AGAINST
-> ('Security implications of running MySQL as root'); -> ('Security implications of running MySQL as root');
+----+-------------------------------------+-----------------+ +----+-------------------------------------+-----------------+
@@ -36036,18 +36052,12 @@ mysql> SELECT id, body, MATCH title,body AGAINST (
2 rows in set (0.00 sec) 2 rows in set (0.00 sec)
@end example @end example
This is more complex example - the query returns the relevance and still MySQL uses a very simple parser to split text into words. A ``word''
sorts the rows with relevance decreasing. To achieve it one should specify is any sequence of characters consisting of letters, numbers, @samp{'},
@code{MATCH} twice. Note, that this will cause no additional overhead, as and @samp{_}. Any ``word'' that is present in the stopword list or is just
MySQL optimiser will notice that these two @code{MATCH} calls are too short (3 characters or less) is ignored.
identical and will call full-text search code only once.
MySQL uses a very simple parser to split text into words. A Every correct word in the collection and in the query is weighted
``word'' is any sequence of letters, numbers, @samp{'}, and @samp{_}. Any
``word'' that is present in the stopword list or just too short (3
characters or less) is ignored.
Every correct word in the collection and in the query is weighted,
according to its significance in the query or collection. This way, a according to its significance in the query or collection. This way, a
word that is present in many documents will have lower weight (and may word that is present in many documents will have lower weight (and may
even have a zero weight), because it has lower semantic value in this even have a zero weight), because it has lower semantic value in this
@@ -36057,28 +36067,28 @@ relevance of the row.
Such a technique works best with large collections (in fact, it was Such a technique works best with large collections (in fact, it was
carefully tuned this way). For very small tables, word distribution carefully tuned this way). For very small tables, word distribution
does not reflect adequately their semantical value, and this model does not reflect adequately their semantic value, and this model
may sometimes produce bisarre results. may sometimes produce bizarre results.
@example @example
mysql> SELECT * FROM articles WHERE MATCH (title,body) AGAINST ('MySQL'); mysql> SELECT * FROM articles WHERE MATCH (title,body) AGAINST ('MySQL');
Empty set (0.00 sec) Empty set (0.00 sec)
@end example @end example
Search for the word @code{MySQL} produces no results in the above example. The search for the word @code{MySQL} produces no results in the above
Word @code{MySQL} is present in more than half of rows, and as such, is example, because that word is present in more than half of rows. As such,
effectively treated as a stopword (that is, with semantical value zero). it is effectively treated as a stopword (that is, a word with zero semantic
It is, really, the desired behavior - a natural language query should not value). This is the most desirable behavior -- a natural language query
return every second row in 1GB table. should not return every second row from a 1GB table.
A word that matches half of rows in a table is less likely to locate relevant A word that matches half of rows in a table is less likely to locate relevant
documents. In fact, it will most likely find plenty of irrelevant documents. documents. In fact, it will most likely find plenty of irrelevant documents.
We all know this happens far too often when we are trying to find something on We all know this happens far too often when we are trying to find something on
the Internet with a search engine. It is with this reasoning that such rows the Internet with a search engine. It is with this reasoning that such rows
have been assigned a low semantical value in @strong{this particular dataset}. have been assigned a low semantic value in @strong{this particular dataset}.
Since version 4.0.1 MySQL can also perform boolean fulltext searches using As of Version 4.0.1, MySQL can also perform boolean full-text searches using
@code{IN BOOLEAN MODE} modifier. the @code{IN BOOLEAN MODE} modifier.
@example @example
mysql> SELECT * FROM articles WHERE MATCH (title,body) mysql> SELECT * FROM articles WHERE MATCH (title,body)
@@ -36095,38 +36105,44 @@ mysql> SELECT * FROM articles WHERE MATCH (title,body)
@end example @end example
This query retrieved all the rows that contain the word @code{MySQL} This query retrieved all the rows that contain the word @code{MySQL}
(note: 50% threshold is gone), but does @strong{not} contain the word (note: the 50% threshold is not used), but that do @strong{not} contain
@code{YourSQL}. Note, that it does not auto-magically sort rows in the word @code{YourSQL}. Note that a boolean mode search does not
decreasing relevance order (the last row has the highest relevance, auto-magically sort rows in order of decreasing relevance. You can
as it contains @code{MySQL} twice). Boolean fulltext search can also see this from result of the preceding query, where the row with the
work even without @code{FULLTEXT} index, but it would be @strong{slow}. highest relevance (the one that contains @code{MySQL} twice) is listed
last, not first. A boolean full-text search can also work even without
a @code{FULLTEXT} index, although it would be @strong{slow}.
Boolean fulltext search supports the following operators: The boolean full-text search capability supports the following operators:
@table @code @table @code
@item + @item +
A plus sign prepended to a word indicates that this word @strong{must be} A leading plus sign indicates that this word @strong{must be}
present in every row returned. present in every row returned.
@item - @item -
A minus sign prepended to a word indicates that this word @strong{must not} A leading minus sign indicates that this word @strong{must not be}
be present in the rows returned. present in any row returned.
@item @item
By default - without plus or minus - the word is optional, but the rows that By default (when neither plus nor minus is specified) the word is optional,
contain it will be rated higher. This mimicks the behaviour of but the rows that contain it will be rated higher. This mimicks the
@code{MATCH ... AGAINST()} without @code{IN BOOLEAN MODE} modifier. behaviour of @code{MATCH() ... AGAINST()} without the @code{IN BOOLEAN
MODE} modifier.
@item < > @item < >
These two operators are used to increase and decrease word's contribution These two operators are used to change a word's contribution to the
to the relevance value, assigned to a row. See an example below. relevance value that is assigned to a row. The @code{<} operator
decreases the contribution and the @code{>} operator increases it.
See the example below.
@item ( ) @item ( )
Parentheses are used - as usual - to group words into subexpressions. Parentheses are used to group words into subexpressions.
@item ~ @item ~
This is negation operator. It makes word's contribution to the row A leading tilde acts as a negation operator, causing the word's
relevance negative. It's useful for marking noise words. A row that has contribution to the row relevance to be negative. It's useful for marking
such a word will be rated lower than others, but will not be excluded noise words. A row that contains such a word will be rated lower than
altogether, as with @code{-} operator. others, but will not be excluded altogether, as it would be with the
@code{-} operator.
@item * @item *
This is truncation operator. Unlike others it should be @strong{appended} An asterisk is the truncation operator. Unlike the other operators, it
to the word, not prepended. should be @strong{appended} to the word, not prepended.
@end table @end table
And here are some examples: And here are some examples:
@@ -36148,25 +36164,25 @@ order), but rank ``apple pie'' higher than ``apple strudel''.
@end table @end table
@menu @menu
* Fulltext Restrictions:: Fulltext Restrictions * Fulltext Restrictions:: Full-text Restrictions
* Fulltext Fine-tuning:: Fine-tuning MySQL Full-text Search * Fulltext Fine-tuning:: Fine-tuning MySQL Full-text Search
* Fulltext TODO:: Full-text Search TODO * Fulltext TODO:: Full-text Search TODO
@end menu @end menu
@node Fulltext Restrictions, Fulltext Fine-tuning, Fulltext Search, Fulltext Search @node Fulltext Restrictions, Fulltext Fine-tuning, Fulltext Search, Fulltext Search
@subsection Fulltext Restrictions @subsection Full-text Restrictions
@itemize @bullet @itemize @bullet
@item @item
All parameters to the @code{MATCH} function must be columns from the All parameters to the @code{MATCH()} function must be columns from the
same table that is part of the same fulltext index, unless this same table that is part of the same @code{FULLTEXT} index, unless the
@code{MATCH} is @code{IN BOOLEAN MODE}. @code{MATCH()} is @code{IN BOOLEAN MODE}.
@item @item
Column list between @code{MATCH} and @code{AGAINST} must match exactly The @code{MATCH()} column list must exactly match the column list in some
a column list in the @code{FULLTEXT} index definition, unless this @code{FULLTEXT} index definition for the table, unless this @code{MATCH()}
@code{MATCH} is @code{IN BOOLEAN MODE}. is @code{IN BOOLEAN MODE}.
@item @item
The argument to @code{AGAINST} must be a constant string. The argument to @code{AGAINST()} must be a constant string.
@end itemize @end itemize
@@ -36176,7 +36192,7 @@ The argument to @code{AGAINST} must be a constant string.
Unfortunately, full-text search has few user-tunable parameters yet, Unfortunately, full-text search has few user-tunable parameters yet,
although adding some is very high on the TODO. If you have a although adding some is very high on the TODO. If you have a
MySQL source distribution (@pxref{Installing source}), you can MySQL source distribution (@pxref{Installing source}), you can
more control on the full-text search behavior. exert more control over full-text searching behavior.
Note that full-text search was carefully tuned for the best searching Note that full-text search was carefully tuned for the best searching
effectiveness. Modifying the default behavior will, in most cases, effectiveness. Modifying the default behavior will, in most cases,
@@ -36186,37 +36202,37 @@ unless you know what you are doing!
@itemize @bullet @itemize @bullet
@item @item
Minimal length of word to be indexed is defined by MySQL The minimum length of words to be indexed is defined by the MySQL
variable @code{ft_min_word_length}. @xref{SHOW VARIABLES}. variable @code{ft_min_word_length}. @xref{SHOW VARIABLES}.
Change it to the value you prefer, and rebuild Change it to the value you prefer, and rebuild
your @code{FULLTEXT} indexes. your @code{FULLTEXT} indexes.
@item @item
The stopword list is defined in @file{myisam/ft_static.c} The stopword list is defined in @file{myisam/ft_static.c}
Modify it to your taste, recompile MySQL and rebuild Modify it to your taste, recompile MySQL, and rebuild
your @code{FULLTEXT} indexes. your @code{FULLTEXT} indexes.
@item @item
The 50% threshold is caused by the particular weighting scheme chosen. To The 50% threshold is determined by the particular weighting scheme chosen.
disable it, change the following line in @file{myisam/ftdefs.h}: To disable it, change the following line in @file{myisam/ftdefs.h}:
@example @example
#define GWS_IN_USE GWS_PROB #define GWS_IN_USE GWS_PROB
@end example @end example
to To:
@example @example
#define GWS_IN_USE GWS_FREQ #define GWS_IN_USE GWS_FREQ
@end example @end example
and recompile MySQL. Then recompile MySQL.
There is no need to rebuild the indexes in this case. There is no need to rebuild the indexes in this case.
@strong{Note:} by doing this you @strong{severely} decrease MySQL ability @strong{Note:} by doing this you @strong{severely} decrease MySQL's ability
to provide adequate relevance values by @code{MATCH} function. to provide adequate relevance values for the @code{MATCH()} function.
It means, that if you really need to search for such a common words, If you really need to search for such common words, it would be better to
then you should rather search @code{IN BOOLEAN MODE}, which does not search using @code{IN BOOLEAN MODE} instead, which does not observe the 50%
has 50% threshold. threshold.
@item @item
Sometimes search engine maintaner would like to change operators used Sometimes the search engine maintainer would like to change the operators used
for boolean fulltext search. They are defined by a for boolean fulltext searches. These are defined by the
@code{ft_boolean_syntax} variable. @xref{SHOW VARIABLES}. @code{ft_boolean_syntax} variable. @xref{SHOW VARIABLES}.
Still, this variable is read-only, its value is set in Still, this variable is read-only, its value is set in
@file{myisam/ft_static.c}. @file{myisam/ft_static.c}.
@@ -36237,7 +36253,7 @@ the user wants to treat as words, examples are "C++", "AS/400", "TCP/IP", etc.
@item Support for multi-byte charsets. @item Support for multi-byte charsets.
@item Make stopword list to depend of the language of the data. @item Make stopword list to depend of the language of the data.
@item Stemming (dependent of the language of the data, of course). @item Stemming (dependent of the language of the data, of course).
@item Generic user-supplyable UDF (?) preparser. @item Generic user-suppliable UDF (?) preparser.
@item Make the model more flexible (by adding some adjustable @item Make the model more flexible (by adding some adjustable
parameters to @code{FULLTEXT} in @code{CREATE/ALTER TABLE}). parameters to @code{FULLTEXT} in @code{CREATE/ALTER TABLE}).
@end itemize @end itemize
@@ -49697,7 +49713,7 @@ Fixed bug with @code{LOCK TABLE} and BDB tables.
@itemize @bullet @itemize @bullet
@item @item
Fixed a bug when using @code{MATCH} in @code{HAVING} clause. Fixed a bug when using @code{MATCH()} in @code{HAVING} clause.
@item @item
Fixed a bug when using @code{HEAP} tables with @code{LIKE}. Fixed a bug when using @code{HEAP} tables with @code{LIKE}.
@item @item
@@ -50266,7 +50282,7 @@ that caused @code{mysql_install_db} to core dump on some Linux machines.
@item @item
Changed @code{mi_create()} to use less stack space. Changed @code{mi_create()} to use less stack space.
@item @item
Fixed bug with optimiser trying to over-optimise @code{MATCH} when used Fixed bug with optimiser trying to over-optimise @code{MATCH()} when used
with @code{UNIQUE} key. with @code{UNIQUE} key.
@item @item
Changed @code{crash-me} and the MySQL benchmarks to also work Changed @code{crash-me} and the MySQL benchmarks to also work
@@ -50722,7 +50738,7 @@ More variables in @code{SHOW SLAVE STATUS} and @code{SHOW MASTER STATUS}.
@item @item
@code{SLAVE STOP} now will not return until the slave thread actually exits. @code{SLAVE STOP} now will not return until the slave thread actually exits.
@item @item
Full text search via the @code{MATCH} function and @code{FULLTEXT} index type Full text search via the @code{MATCH()} function and @code{FULLTEXT} index type
(for MyISAM files). This makes @code{FULLTEXT} a reserved word. (for MyISAM files). This makes @code{FULLTEXT} a reserved word.
@end itemize @end itemize
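
The revised FULLTEXT text in this diff states the word-parsing rule in one sentence: a "word" is any sequence of letters, numbers, ' and _, and any word that is in the stopword list or is 3 characters or less is ignored. The following is a minimal Python sketch of that rule, for illustration only. It is not MySQL's parser. The names `index_words`, `STOPWORDS` and `MIN_WORD_LENGTH` are hypothetical, the stopword set is a tiny stand-in for the real list in myisam/ft_static.c, letters are assumed to be ASCII, and lowercasing is used here only to mimic the case-insensitive matching the text describes.

```python
import re

# Hypothetical stand-in; the real stopword list is compiled in from myisam/ft_static.c.
STOPWORDS = {"about", "from", "that", "this", "with"}

# Assumption taken from the manual text: words of 3 characters or less are ignored.
MIN_WORD_LENGTH = 4

# A "word" is a run of letters, digits, apostrophes and underscores (ASCII only here).
WORD_RE = re.compile(r"[A-Za-z0-9'_]+")


def index_words(text, stopwords=STOPWORDS, min_len=MIN_WORD_LENGTH):
    """Return the words from `text` that would survive the filtering rule."""
    words = []
    for match in WORD_RE.finditer(text):
        word = match.group(0).lower()  # mimic case-insensitive matching
        if len(word) < min_len or word in stopwords:
            continue  # too short, or a stopword
        words.append(word)
    return words


if __name__ == "__main__":
    print(index_words("How to use MySQL with C++ and TCP/IP"))
```

Under these assumptions, "C++" and "TCP/IP" are split at the punctuation and the fragments are then dropped as too short, which is consistent with the TODO item in the diff about characters that users want treated as part of words.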