abacusai.document_retriever

Classes

DocumentRetriever

A vector store that stores embeddings for a list of document chunks.

Module Contents

class abacusai.document_retriever.DocumentRetriever(client, name=None, documentRetrieverId=None, createdAt=None, featureGroupId=None, featureGroupName=None, indexingRequired=None, latestDocumentRetrieverVersion={}, documentRetrieverConfig={})

Bases: abacusai.return_class.AbstractApiClass

A vector store that stores embeddings for a list of document chunks.

Parameters:
  • client (ApiClient) – An authenticated API Client instance

  • name (str) – The name of the document retriever.

  • documentRetrieverId (str) – The unique identifier of the vector store.

  • createdAt (str) – When the vector store was created.

  • featureGroupId (str) – The feature group id associated with the document retriever.

  • featureGroupName (str) – The feature group name associated with the document retriever.

  • indexingRequired (bool) – Whether the document retriever needs to be indexed because the underlying data has changed.

  • latestDocumentRetrieverVersion (DocumentRetrieverVersion) – The latest version of vector store.

  • documentRetrieverConfig (DocumentRetrieverConfig) – The config for vector store creation.

name = None
document_retriever_id = None
created_at = None
feature_group_id = None
feature_group_name = None
indexing_required = None
latest_document_retriever_version
document_retriever_config
deprecated_keys
__repr__()
to_dict()

Get a dict representation of the parameters in this class

Returns:

The dict value representation of the class parameters

Return type:

dict

rename(name)

Updates an existing document retriever.

Parameters:

name (str) – The name to update the document retriever with.

Returns:

The updated document retriever.

Return type:

DocumentRetriever

create_version(feature_group_id=None, document_retriever_config=None)

Creates a document retriever version from the latest version of the feature group that the document retriever is associated with.

Parameters:
  • feature_group_id (str) – The ID of the feature group to update the document retriever with.

  • document_retriever_config (VectorStoreConfig) – The configuration, including chunk_size and chunk_overlap_fraction, for document retrieval.

Returns:

The newly created document retriever version.

Return type:

DocumentRetrieverVersion
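
To give a feel for how chunk_size and chunk_overlap_fraction interact, here is a minimal sketch of overlapping chunking. It assumes chunks are measured in whitespace-separated words and that the overlap is chunk_size multiplied by chunk_overlap_fraction, rounded down. The source does not state either detail, so treat chunk_words as a hypothetical illustration, not the service's actual chunker.

```python
from typing import List


def chunk_words(text: str, chunk_size: int, chunk_overlap_fraction: float) -> List[str]:
    """Split text into overlapping chunks of at most chunk_size words.

    Assumption: sizes are counted in whitespace-separated words.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap_fraction < 1:
        raise ValueError("chunk_overlap_fraction must be in [0, 1)")

    words = text.split()
    overlap = int(chunk_size * chunk_overlap_fraction)
    step = chunk_size - overlap  # always >= 1 because the fraction is below 1

    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # the last chunk already reaches the end of the text
    return chunks
```

With chunk_size=4 and chunk_overlap_fraction=0.5, each chunk shares its last two words with the start of the next one. A larger fraction reduces the chance that a relevant passage is cut at a chunk boundary, at the cost of storing more chunks.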

refresh()

Calls describe and refreshes the current object’s fields

Returns:

The current object

Return type:

DocumentRetriever

describe()

Describe a Document Retriever.

Parameters:

None. The method uses the document_retriever_id of the current object.

Returns:

The document retriever object.

Return type:

DocumentRetriever

list_versions(limit=100, start_after_version=None)

List all the document retriever versions with a given ID.

Parameters:
  • limit (int) – The number of vector store versions to retrieve. The maximum value is 100.

  • start_after_version (str) – An offset parameter to exclude all document retriever versions up to this specified one.

Returns:

All the document retriever versions associated with the document retriever.

Return type:

list[DocumentRetrieverVersion]
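
Because limit is capped at 100, retrieving a longer history means paging with start_after_version. The sketch below shows one way to do that. iter_all_versions is a hypothetical helper, and the document_retriever_version attribute used as the cursor is an assumption about DocumentRetrieverVersion; pass a different version_key if the identifier lives elsewhere. FakeVersion and fake_list_versions are in-memory stand-ins so the example runs without an API key.

```python
from typing import Any, Callable, Iterator, List, Optional


def iter_all_versions(
    list_page: Callable[..., List[Any]],
    page_size: int = 100,
    version_key: Callable[[Any], str] = lambda v: v.document_retriever_version,
) -> Iterator[Any]:
    """Yield every version by calling list_page until a short or empty page.

    list_page is expected to accept limit and start_after_version keyword
    arguments, like DocumentRetriever.list_versions.
    """
    cursor: Optional[str] = None
    while True:
        page = list_page(limit=page_size, start_after_version=cursor)
        if not page:
            return
        yield from page
        if len(page) < page_size:
            return  # a short page means there is nothing left to fetch
        cursor = version_key(page[-1])


# Hypothetical stand-ins for illustration only.
class FakeVersion:
    def __init__(self, version_id: str) -> None:
        self.document_retriever_version = version_id


_ALL_VERSIONS = [FakeVersion(f"v{i}") for i in range(1, 6)]


def fake_list_versions(limit: int = 100, start_after_version: Optional[str] = None):
    ids = [v.document_retriever_version for v in _ALL_VERSIONS]
    start = ids.index(start_after_version) + 1 if start_after_version else 0
    return _ALL_VERSIONS[start:start + limit]
```

With a real object, the call would look like iter_all_versions(retriever.list_versions), where retriever is a DocumentRetriever instance.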

get_document_snippet(document_id, start_word_index=None, end_word_index=None)

Get a snippet from documents in the document retriever.

Parameters:
  • document_id (str) – The ID of the document to retrieve the snippet from.

  • start_word_index (int) – If provided, will start the snippet at the index (of words in the document) specified.

  • end_word_index (int) – If provided, will end the snippet at the index (of words in the document) specified.

Returns:

The document snippet found from the document retriever.

Return type:

DocumentRetrieverLookupResult

restart()

Restart the document retriever if it is stopped or has failed. This will start the deployment of the document retriever, but will not wait for it to be ready. You need to call wait_until_ready to wait until the deployment is ready.

Parameters:

None. The method uses the document_retriever_id of the current object.

wait_until_ready(timeout=3600)

Waits until the document retriever is ready. It restarts the document retriever if it is stopped.

Parameters:

timeout (int) – The maximum time, in seconds, to wait for the call to finish. If the call does not finish within this time, it times out. The default is 3600 seconds.

wait_until_deployment_ready(timeout=3600)

Waits until the document retriever deployment is ready to serve.

Parameters:

timeout (int) – The maximum time, in seconds, to wait for the call to finish. If the call does not finish within this time, it times out. The default is 3600 seconds.

get_status()

Gets the status of the document retriever. It represents the indexing status until indexing is complete, and the deployment status after that.

Returns:

A string describing the status of a document retriever (pending, indexing, complete, active, etc.).

Return type:

str
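
The waiting calls above can be pictured as a polling loop over get_status(). The following is a minimal sketch of that pattern, not the library's implementation. poll_until is a hypothetical helper, the "FAILED" state name and the 5-second interval are assumptions, and the comparison is case-insensitive because the exact casing of the returned status strings is not given here.

```python
import time
from typing import Callable, Iterable


def poll_until(
    get_status: Callable[[], str],
    done_states: Iterable[str] = ("COMPLETE", "ACTIVE"),
    failed_states: Iterable[str] = ("FAILED",),  # assumed state name
    timeout: float = 3600,
    interval: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> str:
    """Call get_status repeatedly until it reports a terminal state.

    Returns the upper-cased final status, raises RuntimeError on a failed
    state and TimeoutError once the timeout has elapsed.
    """
    done = {s.upper() for s in done_states}
    failed = {s.upper() for s in failed_states}
    deadline = clock() + timeout
    while True:
        status = get_status().upper()
        if status in done:
            return status
        if status in failed:
            raise RuntimeError(f"document retriever ended in state {status}")
        if clock() >= deadline:
            raise TimeoutError(f"still {status} after {timeout} seconds")
        sleep(interval)
```

With a real object this would be called as poll_until(retriever.get_status). The sleep and clock parameters exist only so the loop can be exercised without real waiting.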

get_deployment_status()

Gets the deployment status of the document retriever.

Returns:

A string describing the deployment status of the document retriever (pending, deploying, active, etc.).

Return type:

str

get_matching_documents(query, filters=None, limit=None, result_columns=None, max_words=None, num_retrieval_margin_words=None, max_words_per_chunk=None, score_multiplier_column=None, min_score=None, required_phrases=None, filter_clause=None, crowding_limits=None, include_text_search=False)

Looks up the deployed document retriever and returns the documents that match the given query.

Original documents are split into chunks and stored in the document retriever. This lookup function returns the relevant chunks from the document retriever. The returned chunks may be expanded to include more words from the original documents, and merged if they overlap, as permitted by the settings provided. The returned chunks are sorted by relevance.

Parameters:
  • query (str) – The query to search for.

  • filters (dict) – A dictionary mapping column names to a list of values to restrict the retrieved search results.

  • limit (int) – If provided, will limit the number of results to the value specified.

  • result_columns (list) – If provided, will limit the column properties present in each result to those specified in this list.

  • max_words (int) – If provided, will limit the total number of words in the results to the value specified.

  • num_retrieval_margin_words (int) – If provided, will add this number of words from left and right of the returned chunks.

  • max_words_per_chunk (int) – If provided, will limit the number of words in each chunk to the value specified. If the value provided is smaller than the actual size of chunk on disk, which is determined during document retriever creation, the actual size of chunk will be used. That is, chunks looked up from document retrievers will not be split into smaller chunks during lookup due to this setting.

  • score_multiplier_column (str) – If provided, will use the values in this column to modify the relevance score of the returned chunks. Values in this column must be numeric.

  • min_score (float) – If provided, will filter out the results with score lower than the value specified.

  • required_phrases (list) – If provided, each result will have at least one of the phrases.

  • filter_clause (str) – If provided, filters the results of the query using this SQL WHERE clause.

  • crowding_limits (dict) – A dictionary mapping metadata columns to the maximum number of results per unique value of the column. This is used to ensure diversity of metadata attribute values in the results. If a particular attribute value has already reached its maximum count, further results with that same attribute value will be excluded from the final result set.

  • include_text_search (bool) – If true, combine the ranking of results from a BM25 text search over the documents with the vector search using reciprocal rank fusion. It leverages both lexical and semantic matching for better overall results. It’s particularly valuable in professional, technical, or specialized fields where both precision in terminology and understanding of context are important.
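
For readers unfamiliar with reciprocal rank fusion, mentioned under include_text_search, here is a minimal sketch of the general technique. Each item scores the sum of 1 / (k + rank) over the rankings it appears in. The constant k = 60 is a commonly used default and is an assumption here; the value used by the service is not stated. reciprocal_rank_fusion is a hypothetical helper, not part of the abacusai client.

```python
from collections import defaultdict
from typing import Dict, Hashable, List, Sequence


def reciprocal_rank_fusion(
    rankings: Sequence[Sequence[Hashable]], k: int = 60
) -> List[Hashable]:
    """Fuse several ranked lists of item IDs into one list, best first."""
    scores: Dict[Hashable, float] = defaultdict(float)
    for ranking in rankings:
        for rank, item in enumerate(ranking, start=1):
            scores[item] += 1.0 / (k + rank)
    # Highest fused score first; ties keep first-seen order (stable sort).
    return sorted(scores, key=lambda item: scores[item], reverse=True)


vector_ranking = ["c1", "c2", "c3"]  # hypothetical vector-search order
bm25_ranking = ["c3", "c1", "c4"]    # hypothetical BM25 order
fused = reciprocal_rank_fusion([vector_ranking, bm25_ranking])
```

Chunks that rank well in both lists (c1 and c3 in this example) rise above chunks that appear in only one, which is why the combination captures both lexical and semantic matches.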

Returns:

The relevant document results found from the document retriever.

Return type:

list[DocumentRetrieverLookupResult]
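
The effect of the crowding_limits parameter described above can be sketched in a few lines. This is an illustration of the documented behavior under stated assumptions, not the service's code. Results are modeled as plain dictionaries already sorted by relevance, whereas the real method returns DocumentRetrieverLookupResult objects, and apply_crowding_limits is a hypothetical helper.

```python
from collections import Counter
from typing import Any, Dict, List, Mapping


def apply_crowding_limits(
    results: List[Mapping[str, Any]], crowding_limits: Mapping[str, int]
) -> List[Mapping[str, Any]]:
    """Drop results whose metadata value has already reached its limit.

    results must already be ordered by relevance, best first.
    """
    counts: Dict[str, Counter] = {column: Counter() for column in crowding_limits}
    kept = []
    for row in results:
        values = {column: row.get(column) for column in crowding_limits}
        # Exclude the row if any limited column is already at its maximum.
        if any(counts[c][v] >= crowding_limits[c] for c, v in values.items()):
            continue
        for c, v in values.items():
            counts[c][v] += 1
        kept.append(row)
    return kept


rows = [
    {"id": 1, "source": "a"},
    {"id": 2, "source": "a"},
    {"id": 3, "source": "a"},
    {"id": 4, "source": "b"},
]
diverse = apply_crowding_limits(rows, {"source": 2})
```

Here the third result from source "a" is excluded, so the lower-ranked result from source "b" still makes it into the final set.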