Arbitrary Lucene queries can be run against this class; see Lucene Query Syntax as well as Query Parser Rules. Note that a Lucene query selects on the field names and their associated (indexed) tokenized terms, not on the original full text. The original text is not stored; it is discarded immediately after tokenization.
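The point about terms versus original text can be illustrated with a small stand-alone sketch. ToyTermIndex and its methods are hypothetical names invented for this illustration; they are not part of Lucene and do not reflect how this class is implemented. The sketch only shows the idea that, once a text has been tokenized, lookups see the field name and its terms, and nothing of the original string remains.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

// Hypothetical toy class for illustration only; not part of Lucene.
class ToyTermIndex {
    // field name -> set of lower-cased terms; the original text is never stored
    private final Map<String, Set<String>> termsByField = new HashMap<>();

    // Tokenize on runs of non-letter/non-digit characters and keep only the terms.
    void addField(String fieldName, String text) {
        Set<String> terms = termsByField.computeIfAbsent(fieldName, k -> new HashSet<>());
        for (String token : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!token.isEmpty()) {
                terms.add(token);
            }
        }
    }

    // True if the given field contains exactly this (already normalized) term.
    boolean hasTerm(String fieldName, String term) {
        Set<String> terms = termsByField.get(fieldName);
        return terms != null && terms.contains(term);
    }

    public static void main(String[] args) {
        ToyTermIndex index = new ToyTermIndex();
        index.addField("content", "Readings about Salmons and other select Alaska fishing Manuals");
        index.addField("author", "Tales of James");
        System.out.println(index.hasTerm("author", "james"));           // term exists in that field
        System.out.println(index.hasTerm("content", "james"));          // wrong field, so no match
        System.out.println(index.hasTerm("content", "Readings about")); // original text is gone
    }
}
```

A real analyzer and query parser do far more (stemming, fuzzy and wildcard matching, scoring); the sketch only mirrors the "field name plus term" selection model described above.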
For some interesting background information on search technology, see Bob Wyman's Prospective Search, Jim Gray's A Call to Arms - Custom subscriptions, and Tim Bray's On Search, the Series.
    Analyzer analyzer = PatternAnalyzer.DEFAULT_ANALYZER;
    //Analyzer analyzer = new SimpleAnalyzer();
    MemoryIndex index = new MemoryIndex();
    index.addField("content", "Readings about Salmons and other select Alaska fishing Manuals", analyzer);
    index.addField("author", "Tales of James", analyzer);
    QueryParser parser = new QueryParser("content", analyzer);
    float score = index.search(parser.parse("+author:james +salmon~ +fish* manual~"));
    if (score > 0.0f) {
        System.out.println("it's a match");
    } else {
        System.out.println("no match found");
    }
    System.out.println("indexData=" + index.toString());
    (: An XQuery that finds all books authored by James that have something to do with "salmon fishing manuals", sorted by relevance :)
    declare namespace lucene = "java:nux.xom.pool.FullTextUtil";
    declare variable $query := "+salmon~ +fish* manual~"; (: any arbitrary Lucene query can go here :)

    for $book in /books/book[author="James" and lucene:match(abstract, $query) > 0.0]
    let $score := lucene:match($book/abstract, $query)
    order by $score descending
    return $book
    MemoryIndex index = ...
    synchronized (index) {
        // read and/or write index (i.e. add fields and/or query)
    }
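The locking pattern above can be tried out in isolation with a stand-in object. In the following sketch a plain HashSet plays the role of the index, purely as an assumption for illustration: like the index, it is not safe for concurrent use on its own. SynchronizedIndexDemo and runWriters are hypothetical names. Every read and write goes through a synchronized block on the shared object.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical demo class; the HashSet stands in for a non-thread-safe index.
class SynchronizedIndexDemo {

    // Starts several writer threads and returns the number of entries seen afterwards.
    static int runWriters(int threadCount, int termsPerThread) {
        final Set<String> index = new HashSet<>(); // not thread-safe on its own
        Thread[] writers = new Thread[threadCount];
        for (int t = 0; t < threadCount; t++) {
            final int id = t;
            writers[t] = new Thread(() -> {
                for (int i = 0; i < termsPerThread; i++) {
                    // All access to the shared object happens under its monitor.
                    synchronized (index) {
                        index.add("term-" + id + "-" + i);
                    }
                }
            });
            writers[t].start();
        }
        for (Thread writer : writers) {
            try {
                writer.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException("interrupted while waiting for writers", e);
            }
        }
        synchronized (index) {
            return index.size();
        }
    }

    public static void main(String[] args) {
        // Each thread adds distinct terms, so no update should be lost.
        System.out.println(runWriters(4, 1000));
    }
}
```

Locking on the shared instance itself keeps the sketch close to the snippet above; a private lock object would work equally well if the instance is exposed to code that might also synchronize on it.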
This class performs very well for very small texts (e.g. 10 chars) as well as for large texts (e.g. 10 MB) and everything in between. Typically, it is about 10-100 times faster than RAMDirectory. Note that RAMDirectory has particularly large efficiency overheads for small to medium sized texts, both in time and space. Indexing a field with N tokens takes O(N) in the best case, and O(N log N) in the worst case. Memory consumption is probably larger than for RAMDirectory.
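One way such bounds can arise is sketched below. This is an assumption for illustration, not a description of this class's internals: collecting token positions in a hash map takes expected O(N) time for N tokens, and sorting the M distinct terms afterwards costs O(M log M), which approaches O(N log N) when nearly every token is distinct. ToyFieldIndexer and its methods are hypothetical names.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper for illustration only; not part of Lucene.
class ToyFieldIndexer {

    // Single pass over the tokens: expected O(N) with hashing.
    static Map<String, List<Integer>> collectPositions(String[] tokens) {
        Map<String, List<Integer>> positions = new HashMap<>();
        for (int pos = 0; pos < tokens.length; pos++) {
            positions.computeIfAbsent(tokens[pos], k -> new ArrayList<>()).add(pos);
        }
        return positions;
    }

    // Sorting the M distinct terms costs O(M log M), with M <= N.
    static List<String> sortedTerms(Map<String, List<Integer>> positions) {
        List<String> terms = new ArrayList<>(positions.keySet());
        Collections.sort(terms);
        return terms;
    }

    public static void main(String[] args) {
        String[] tokens = {"fish", "salmon", "fish", "alaska"};
        Map<String, List<Integer>> positions = collectPositions(tokens);
        System.out.println(positions.get("fish")); // positions of the repeated token
        System.out.println(sortedTerms(positions)); // distinct terms in sorted order
    }
}
```

With many repeated tokens, M stays small and the sort is cheap, so the total stays close to linear; with all-distinct tokens the sort dominates.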
Example throughput of many simple term queries over a single MemoryIndex: ~500000 queries/sec on a MacBook Pro, jdk 1.5.0_06, server VM. As always, your mileage may vary.
If you're curious about the whereabouts of bottlenecks, run Java 1.5 with the non-perturbing '-server -agentlib:hprof=cpu=samples,depth=10' flags, then study the trace log and correlate its hotspot trailer with its call stack headers (see hprof tracing).