NAME
    DBIx::TextIndex - Perl extension for full-text searching in SQL
    databases

SYNOPSIS
        use DBIx::TextIndex;

        my $index = DBIx::TextIndex->new({
            document_dbh      => $document_dbh,
            document_table    => 'document_table',
            document_fields   => ['column_1', 'column_2'],
            document_id_field => 'primary_key',
            index_dbh         => $index_dbh,
            collection        => 'collection_1',
        });

        $index->initialize;

        $index->add_document(\@document_ids);

        my $results = $index->search({
            column_1 => '"a phrase" +and -not or',
            column_2 => 'more words',
        });

        foreach my $document_id
            (sort { $$results{$b} <=> $$results{$a} } keys %$results)
        {
            print "DocumentID: $document_id Score: $$results{$document_id} \n";
        }

        $index->delete;

DESCRIPTION
    DBIx::TextIndex was developed for doing full-text searches on BLOB
    columns stored in a MySQL database. Almost any database with BLOB and
    DBI support should work with minor adjustments to SQL statements in the
    module.

    Implements a crude parser for tokenizing a user input string into
    phrases, can-include words, must-include words, and must-not-include
    words.

    The following methods are available:

    $index = DBIx::TextIndex->new(\%args)
        Constructor method. The first time an index is created, the
        following arguments must be passed to new():

            my $index = DBIx::TextIndex->new({
                document_dbh      => $document_dbh,
                document_table    => 'document_table',
                document_fields   => ['column_1', 'column_2'],
                document_id_field => 'primary_key',
                index_dbh         => $index_dbh,
                collection        => 'collection_1',
            });

        document_dbh
            DBI connection handle to database containing text documents

        document_table
            Name of database table containing text documents

        document_fields
            Reference to a list of column names to be indexed from
            document_table

        document_id_field
            Name of a unique integer key column in document_table

        index_dbh
            DBI connection handle to database containing TextIndex tables.
            I recommend using a separate database for your TextIndex,
            because the module creates and drops tables without warning.

        collection
            A name for the index.
            Should contain only alphanumeric characters or underscores
            [A-Za-z0-9_].

        After creating a new TextIndex for the first time, and after
        calling initialize(), only the index_dbh, document_dbh, and
        collection arguments are needed to create subsequent instances of
        a TextIndex.

    $index->initialize
        This method creates all the inverted tables for the TextIndex in
        the database specified by index_dbh. Call it only once, when
        creating a new index: it drops all the inverted tables before
        creating new ones.

        initialize() also stores the document_table, document_fields, and
        document_id_field attributes in a special table called
        "collection", so subsequent calls to new() for a given collection
        do not need those arguments.

    $index->add_document(\@document_ids)
        Adds all the @document_ids from document_id_field to the
        TextIndex. @document_ids must be sorted from lowest to highest.
        All further calls to add_document() must use @document_ids higher
        than those previously added to the index. Reindexing
        previously-indexed documents will yield unpredictable results!

    $index->search(\%search_args)
        search() returns $results, a reference to a hash. The keys of the
        hash are document ids, and the values are the relative scores of
        the documents. If an error occurred while searching, $results will
        be a plain scalar containing an error message instead of a
        reference.

            $results = $index->search({
                first_field  => '+andword -notword orword "phrase words"',
                second_field => ...
                ...
            });

            if (ref $results) {
                # Scores are keyed by document id
                foreach my $document_id (keys %$results) {
                    print "The score for $document_id is $results->{$document_id}\n";
                }
            } else {
                print "Error: $results\n";
            }

    $index->unscored_search(\%search_args)
        unscored_search() returns $document_ids, a reference to an array.
        Since the scoring algorithm is skipped, this method is much faster
        than search().

            $document_ids = $index->unscored_search({
                first_field  => '+andword -notword orword "phrase words"',
                second_field => ...
            });

            if (ref $document_ids) {
                print "Here's all the document ids:\n";
                print "$_\n" for @$document_ids;
            } else {
                print "Error: $document_ids\n";
            }

    $index->delete
        delete() removes the tables associated with a TextIndex from
        index_dbh.

CHANGES
    0.05
        Added unscored_search(), which returns a reference to an array of
        document_ids, without scores. Should be much faster than scored
        search.

        Added error handling in case _occurence() doesn't return a number.

    0.04
        Bug fix: add_document() will return if passed an empty array ref
        instead of producing an error.

        Changed _boolean_compare() and _phrase_search() so and_words and
        phrases behave better in multiple-field searches. The result set
        for each field is calculated first, then the union of all fields
        is taken for the final result set.

        Scores are scaled lower in _search().

    0.03
        Added example scripts in examples/.

    0.02
        Added or_mask_set.

    0.01
        Initial public release. Should be considered beta, and methods may
        be added or changed until the first stable release.

AUTHOR
    Daniel Koch, dkoch@bizjournals.com

COPYRIGHT
    Copyright 1997, 1998, 1999, 2000, 2001 by Daniel Koch. All rights
    reserved.

LICENSE
    This package is free software; you can redistribute it and/or modify
    it under the same terms as Perl itself, i.e., under the terms of the
    "Artistic License" or the "GNU General Public License".

DISCLAIMER
    This package is distributed in the hope that it will be useful, but
    WITHOUT ANY WARRANTY; without even the implied warranty of
    MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

    See the "GNU General Public License" for more details.

ACKNOWLEDGEMENTS
    Thanks to Ulrich Pfeifer for ideas and code from the Man::Index module
    in the "Information Retrieval, and What pack 'w' Is For" article from
    The Perl Journal vol. 2 no. 2.

    Thanks to Steffen Beyer for the Bit::Vector module, which enables fast
    set operations in this module. Version 5.3 or greater of Bit::Vector
    is required by DBIx::TextIndex.

BUGS
    Uses quite a bit of memory. MySQL-specific SQL is used.
    Parser is not very good. Documentation is not complete. Phrase
    searching relies on full-table scan. Any suggestions for adding
    word-proximity information to the index would be much appreciated.

    No facility for deleting documents from an index. Work-around: create
    a new index.

    Please feel free to email me (dkoch@bizjournals.com) with any
    questions or suggestions.

SEE ALSO
    perl(1).
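
EXAMPLE
    The query syntax accepted by search() and unscored_search() combines
    double-quoted phrases, +must-include words, -must-not-include words,
    and bare can-include words. The sketch below shows one way such a
    string could be split into those four groups. It is an illustration
    only: parse_query() and its hash keys (phrases, must, must_not, can)
    are hypothetical names that are not part of DBIx::TextIndex, and the
    module's own parser may behave differently.

```perl
use strict;
use warnings;

# Hypothetical helper (NOT part of DBIx::TextIndex): split a query string
# into phrases, must-include, must-not-include, and can-include words.
sub parse_query {
    my ($query) = @_;
    my %parsed = (phrases => [], must => [], must_not => [], can => []);

    # Pull out double-quoted phrases first, leaving a space in their place.
    while ($query =~ s/"([^"]*)"/ /) {
        push @{ $parsed{phrases} }, $1 if length $1;
    }

    # Classify the remaining whitespace-separated words by their prefix.
    foreach my $token (split ' ', $query) {
        if    ($token =~ /^\+(.+)/) { push @{ $parsed{must} },     $1 }
        elsif ($token =~ /^-(.+)/)  { push @{ $parsed{must_not} }, $1 }
        else                        { push @{ $parsed{can} },      $token }
    }

    return \%parsed;
}

# Usage: the query string from the SYNOPSIS.
my $parsed = parse_query('"a phrase" +and -not or');
foreach my $group (qw(phrases must must_not can)) {
    print "$group: ", join(', ', @{ $parsed->{$group} }), "\n";
}
```

    Removing the phrases before splitting on whitespace keeps the words
    inside a quoted phrase from being classified individually. A lone
    "+" or "-" with nothing after it is treated as an ordinary word in
    this sketch.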