Interface for implementing query-based traversal.
Query-based traversal is a scheme whereby a repository is traversed according to a query that visits each document in a natural order that is efficiently supported by the underlying repository and can be easily checkpointed and restarted.
A good use case is a repository that supports access to documents in last-modified-date order. In particular, suppose a repository supports a query analogous to the following SQL query (the repository need not support SQL; SQL is used here only as an example):
select documentid, lastmodifydate from documents where lastmodifydate > date-constant order by lastmodifydate
Such a repository can easily be traversed by lastmodifydate, and the state of the traversal is easily encapsulated in a single, small data item: the date of the last document processed. Increasing last-modified-date order is convenient because if a document is processed during traversal, but then later modified, then it will be picked up again later in the traversal process. Thus, this traversal is appropriate both for initial load and for incremental update.
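For such a repository, the implementor is urged to let the Connector Manager (the caller) maintain the traversal state. This is achieved by implementing the interface methods as follows:
- {@code startTraversal()} Run the query from the beginning of the date range, with no checkpoint, and return the first few documents.
- {@code resumeTraversal(String checkpoint)} Run the query with the date-constant taken from the supplied checkpoint and return the next few documents.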
Checkpoints are supplied by the {@link DocumentList#checkpoint()} method.
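The date-based scheme described above can be sketched roughly as follows. This is a minimal illustration, not the SPI itself: {@code Row}, {@code DateOrderedBatch} and {@code DateOrderedTraversal} are hypothetical stand-ins for the repository, the {@code DocumentList} and the traversal implementation, and the checkpoint format (epoch milliseconds as a String) is an assumption.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical repository row: a document id and its last-modified time (epoch millis). */
final class Row {
  final String id;
  final long lastModified;

  Row(String id, long lastModified) {
    this.id = id;
    this.lastModified = lastModified;
  }
}

/** Simplified stand-in for a DocumentList that tracks how far the caller has read. */
final class DateOrderedBatch {
  private final List<Row> rows;
  private final String startCheckpoint;
  private int consumed = 0;

  DateOrderedBatch(List<Row> rows, String startCheckpoint) {
    this.rows = rows;
    this.startCheckpoint = startCheckpoint;
  }

  /** Returns the next row, or null when the batch is exhausted. */
  Row nextDocument() {
    return consumed < rows.size() ? rows.get(consumed++) : null;
  }

  /** The date of the last row handed out, or the starting checkpoint if none were. */
  String checkpoint() {
    return consumed == 0
        ? startCheckpoint
        : Long.toString(rows.get(consumed - 1).lastModified);
  }
}

/** Traversal whose only state is the checkpoint String held by the caller. */
final class DateOrderedTraversal {
  private final List<Row> repository; // assumed sorted by lastModified, ascending
  private int batchHint = 10;         // assumed default batch size

  DateOrderedTraversal(List<Row> repository) {
    this.repository = repository;
  }

  void setBatchHint(int hint) {
    if (hint > 0) {
      batchHint = hint;
    }
  }

  DateOrderedBatch startTraversal() {
    return query(Long.MIN_VALUE);
  }

  DateOrderedBatch resumeTraversal(String checkpoint) {
    return query(Long.parseLong(checkpoint));
  }

  /** Analogue of the SQL query: rows modified after the given date, in date order. */
  private DateOrderedBatch query(long after) {
    List<Row> batch = new ArrayList<>();
    for (Row r : repository) {
      if (r.lastModified > after && batch.size() < batchHint) {
        batch.add(r);
      }
    }
    // null signals "completely up to date"; see the discussion of null results.
    return batch.isEmpty() ? null : new DateOrderedBatch(batch, Long.toString(after));
  }
}
```

Because the checkpoint reflects only the documents actually handed out, the caller may stop partway through a batch and still resume at the right place. Note that a strict "after" comparison can skip documents that share a timestamp with the checkpointed one; a real implementation would likely need a tie-breaker such as the document id.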
Please observe that the Connector Manager (the caller) makes no guarantee to consume the entire {@code DocumentList} returned by either the {@code startTraversal} or {@code resumeTraversal} calls. The Connector Manager will consume as many documents as it chooses, depending on load, schedule and other factors. The Connector Manager guarantees to call {@code checkpoint} after handling the last document it has successfully processed from the {@code DocumentList} it was using. Thus, the implementor is free to use a query that only returns a small number of results, if that gets better performance.
For example, to continue the SQL analogy, a query like this could be used:
select TOP 10 documentid, lastmodifydate from documents ...
The {@code setBatchHint} method is provided so that the Connector Manager can tell the implementation how many results it wants per call. This is only a hint; the implementation need not observe it. The implementation is free to return a DocumentList with fewer or more results. For example, the traversal may be completely up to date, so perhaps there are no results to return. Or, for internal reasons, the implementation may not want to return the full batchHint number of results. When returning more results than the hint, some or all of the extra documents may be ignored.
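One way an implementation might treat the hint is sketched below. The class name {@code BatchHintPolicy}, the default of 100 and the ceiling of 500 are all assumptions made for illustration; the interface itself only says that the hint may be ignored.

```java
import java.util.List;

/** Hypothetical helper that turns the caller's batch hint into an actual batch size. */
final class BatchHintPolicy {
  private static final int DEFAULT_BATCH = 100; // assumed default when no usable hint is given
  private static final int MAX_BATCH = 500;     // assumed internal ceiling

  private int batchHint = DEFAULT_BATCH;

  /** Remember the hint; non-positive values fall back to the default. */
  void setBatchHint(int hint) {
    batchHint = hint > 0 ? hint : DEFAULT_BATCH;
  }

  /** The number of results this implementation is willing to return per call. */
  int effectiveBatchSize() {
    return Math.min(batchHint, MAX_BATCH);
  }

  /** Returns a view of at most effectiveBatchSize() leading candidates. */
  <T> List<T> limit(List<T> candidates) {
    int n = Math.min(effectiveBatchSize(), candidates.size());
    return candidates.subList(0, n);
  }
}
```

Capping at the hint avoids fetching documents that the caller is likely to ignore, while the internal ceiling shows one "internal reason" for returning fewer results than requested.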
The Connector Manager makes a distinction between the return of a {@code null} DocumentList and an empty DocumentList (a DocumentList with zero entries). Returning a {@code null} DocumentList will have an impact on scheduling - the Connector Manager may choose to wait longer after receiving a {@code null} result before it calls again. Also, if a {@code null} result is returned, the Connector Manager will not (indeed, cannot) call {@code checkpoint} before calling start or resume traversal again. Returning a {@code null} DocumentList is suitable when a traversal is completely up to date, with no new documents available and no new checkpoint state.
Returning an empty DocumentList will probably not have an impact on scheduling. The Connector Manager will call {@code checkpoint}, and will likely call {@code resumeTraversal} again immediately. Returning an empty DocumentList is not appropriate if a traversal is completely up to date, as it would effectively induce a spin, constantly calling {@code resumeTraversal} when it has no work to do. Returning an empty DocumentList is a convenient way to indicate to the Connector Manager that, although no documents were provided in this batch, the Connector wishes to continue searching the repository for suitable content. The call to {@code checkpoint} allows the Connector to record its progress through the repository. This mechanism is suitable for cases when the search for suitable content may exceed the Connector Manager's timeout.
If the Connector returns a non-{@code null} {@code DocumentList}, even one with zero entries, the Connector Manager will nearly always call {@code checkpoint} when it has finished processing the DocumentList.
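The choice between a {@code null} result and an empty result can be reduced to a small decision, sketched here. A plain {@code List<String>} stands in for the {@code DocumentList}, and the {@code moreToScan} flag is a hypothetical signal from the implementation that it stopped searching early (for example, to stay within a time budget) rather than reaching the end of the repository.

```java
import java.util.Collections;
import java.util.List;

/** Hypothetical helper choosing between a batch, an empty result and a null result. */
final class TraversalResults {
  /**
   * @param found      documents found during this call (possibly none)
   * @param moreToScan true if the search stopped early and should continue promptly
   * @return the batch if non-empty; an empty list to request a checkpoint and a
   *         prompt follow-up call; null when the traversal is completely up to date
   */
  static List<String> toResult(List<String> found, boolean moreToScan) {
    if (!found.isEmpty()) {
      return found;
    }
    return moreToScan ? Collections.<String>emptyList() : null;
  }
}
```

Returning the empty list only when there is more to scan avoids the spin described above, since an up-to-date traversal yields {@code null} and lets the caller back off.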
An implementation need not let the Connector Manager store the traversal state; it may choose to store the state itself. Implementors are discouraged from using this technique unless necessary, because it makes transactionality more difficult and it introduces resource dependencies of which the Connector Manager is unaware. However, there may be repositories which have a natural traversal order, but the state of this traversal is not easily expressed in a small data item. For example, a repository may consist of a large number of named sub-repositories, each of which can be traversed in modify date order, but for which there is no convenient way of traversing them all in one query. In this case, the implementation may choose to maintain state itself, as a table of pairs: (sub-repository-name, per-repository-date-stamp). In such a case, the implementor may implement the interface methods as follows:
- {@code startTraversal()} Clear the internal state. Return the first few documents.
- {@code resumeTraversal(String checkpoint)} Resume traversal according to the internal state of the implementation. The Connector Manager will pass in whatever checkpoint String was returned by the last call to {@link DocumentList#checkpoint()} but the implementation is free to ignore this and use its internal state. However, even in this case, {@code checkpoint} must not return a {@code null} String.
The implementation must be careful about when and how it commits its internal state to external storage. Remember again that the Connector Manager makes no guarantee to consume the entire result set returned by a traversal call. If the Connector Manager does not call checkpoint, the implementation should not assume that the documents returned by {@link DocumentList#nextDocument} have been processed. The implementation should wait until the checkpoint call, and only commit the state up to the last document returned.
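The deferred commit described above might look like the following sketch for the table of (sub-repository-name, per-repository-date-stamp) pairs. The class {@code SubRepositoryState} and its method names are hypothetical, in-memory maps stand in for external storage, and the returned checkpoint String is a placeholder whose only purpose is to be non-null.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical internal traversal state: one date stamp per named sub-repository. */
final class SubRepositoryState {
  private final Map<String, Long> committed = new HashMap<>(); // stands in for external storage
  private final Map<String, Long> pending = new HashMap<>();   // progress not yet confirmed

  /** Called as each document is handed to the caller; nothing is committed yet. */
  void recordReturned(String subRepository, long lastModified) {
    pending.put(subRepository, lastModified);
  }

  /** Called from checkpoint(): commit progress up to the last document returned. */
  String checkpoint() {
    committed.putAll(pending);
    pending.clear();
    return "internal-state"; // placeholder; must not be null
  }

  /** Called when a new traversal call arrives without an intervening checkpoint. */
  void discardPending() {
    pending.clear();
  }

  /** The committed date stamp for a sub-repository, or Long.MIN_VALUE if none. */
  long committedDate(String subRepository) {
    Long date = committed.get(subRepository);
    return date == null ? Long.MIN_VALUE : date;
  }

  /** Called from startTraversal(): forget all progress. */
  void clear() {
    committed.clear();
    pending.clear();
  }
}
```

Keeping pending progress separate from committed progress means that documents handed out but never checkpointed are simply offered again on the next traversal call.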
Note on "Metadata and URL" feeds vs. Content feeds: Some repositories are fully web-enabled but are difficult or impossible for the Search Appliance to crawl, because they make heavy use of ASP or JSP, or they have a metadata model that is not conveniently accessible with the content in a single page. Such repositories are good candidates for connectors. However, a developer may choose not to implement authentication and authorization through a connector. It may be sufficient to use standard web mechanisms for these tasks.
The developer can achieve this by following these steps. In the document list returned by the traversal methods, specify the {@link SpiConstants#PROPNAME_SEARCHURL} property. The value should be a URL. If this property is specified, the Connector Manager will use a "URL Feed" rather than a "Content Feed" for that document. In this case, the implementor should not supply the content of the document. The Search Appliance will fetch the content from the specified URL. Also, this URL will be used to trigger normal authentication and authorization for that document. For more details, see the documentation on Metadata and URL Feeds.
Note on Documents returned by traversal calls: The {@code Document} objects returned by the queries defined here must contain special properties according to the following rules:
- {@link SpiConstants#PROPNAME_DOCID} This property must be present.
- {@link SpiConstants#PROPNAME_SEARCHURL} If present, this means that the Connector Manager will generate a Metadata and URL feed, with the specified URL. If this is present, then the {@link SpiConstants#PROPNAME_CONTENT} property should not be.
- {@link SpiConstants#PROPNAME_CONTENT} This property should hold the content of the document. If present, the connector framework will base-64 encode the value and present it to the Search Appliance as the primary content to be indexed. If this is present, then the {@link SpiConstants#PROPNAME_SEARCHURL} property should not be.
- {@link SpiConstants#PROPNAME_DISPLAYURL} If present, this will be used as the primary link on a results page. This should not be used with {@link SpiConstants#PROPNAME_SEARCHURL}.
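The rules above lend themselves to a simple pre-flight check before a document is returned. The sketch below is illustrative only: the property keys are hypothetical placeholders rather than the actual values of the {@code SpiConstants} fields, and a plain {@code Map} stands in for the {@code Document}.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

/** Hypothetical checker for the special-property rules listed above. */
final class DocumentPropertyCheck {
  // Placeholder keys; a real connector would use the SpiConstants fields instead.
  static final String DOCID = "docid";
  static final String SEARCHURL = "searchurl";
  static final String CONTENT = "content";
  static final String DISPLAYURL = "displayurl";

  /** Returns a description of each rule the given properties violate (empty if none). */
  static List<String> violations(Map<String, ?> properties) {
    List<String> problems = new ArrayList<>();
    if (!properties.containsKey(DOCID)) {
      problems.add("docid must be present");
    }
    if (properties.containsKey(SEARCHURL) && properties.containsKey(CONTENT)) {
      problems.add("searchurl and content should not both be present");
    }
    if (properties.containsKey(SEARCHURL) && properties.containsKey(DISPLAYURL)) {
      problems.add("displayurl should not be used with searchurl");
    }
    return problems;
  }
}
```

Running such a check in a connector's own tests can catch a document that mixes URL-feed and content-feed properties before it reaches the Connector Manager.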
@since 1.0