abacusai.document_data

Classes

DocumentData

Data extracted from a docstore document.

Module Contents

class abacusai.document_data.DocumentData(client, docId=None, mimeType=None, pageCount=None, totalPageCount=None, extractedText=None, embeddedText=None, pages=None, tokens=None, metadata=None, pageMarkdown=None, extractedPageText=None, augmentedPageText=None)

Bases: abacusai.return_class.AbstractApiClass

Data extracted from a docstore document.

Parameters:
  • client (ApiClient) – An authenticated API Client instance.

  • docId (str) – Unique Docstore string identifier for the document.

  • mimeType (str) – The mime type of the document.

  • pageCount (int) – The number of pages for which data is available. This is generally the same as the total number of pages, but may be smaller if only selected pages were processed.

  • totalPageCount (int) – The total number of pages in the document.

  • extractedText (str) – The extracted text in the document obtained from OCR.

  • embeddedText (str) – The embedded text in the document. Only available for digital documents.

  • pages (list) – List of embedded text for each page in the document. Only available for digital documents.

  • tokens (list) – List of extracted tokens in the document obtained from OCR.

  • metadata (list) – List of metadata for each page in the document.

  • pageMarkdown (list) – List of markdown text for each page in the document.

  • extractedPageText (list) – List of extracted text for each page in the document obtained from OCR. Available when the return_extracted_page_text parameter is set to True in the document data retrieval API.

  • augmentedPageText (list) – List of extracted text for each page in the document obtained from OCR, augmented with the links embedded in the document.
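
The relationship between pageCount and totalPageCount described above can be checked in a few lines. The following is a minimal sketch, not part of the SDK: coverage_summary is a hypothetical helper, and a SimpleNamespace stands in for a real DocumentData instance so the example runs without an API client. It assumes only the page_count and total_page_count attributes listed for this class.

```python
from types import SimpleNamespace


def coverage_summary(doc):
    """Describe how many pages of a document have data available.

    `doc` is any object exposing `page_count` and `total_page_count`
    (hypothetical helper; not provided by abacusai).
    """
    available = doc.page_count
    total = doc.total_page_count
    if available is None or total is None:
        return "page counts unavailable"
    if available < total:
        return f"partial: {available} of {total} pages processed"
    return f"complete: {total} pages processed"


# Stand-in for a DocumentData instance (assumption for illustration only).
doc = SimpleNamespace(page_count=3, total_page_count=10)
print(coverage_summary(doc))
```

Handling None explicitly matters here because every field other than client defaults to None in the constructor signature.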

doc_id
mime_type
page_count
total_page_count
extracted_text
embedded_text
pages
tokens
metadata
page_markdown
extracted_page_text
augmented_page_text
deprecated_keys
__repr__()
to_dict()

Get a dict representation of the parameters in this class.

Returns:

The dict value representation of the class parameters

Return type:

dict
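
Because to_dict() returns a plain dict, its result can be passed to standard tooling such as json. The sketch below is illustrative only: dump_document_dict is a hypothetical helper, and the sample dict with its key names (doc_id, page_count, extracted_text) is an assumption about what to_dict() might return, not confirmed output.

```python
import json


def dump_document_dict(doc_dict, max_text_chars=80):
    """Serialize a to_dict()-style result to JSON, truncating long strings.

    Hypothetical helper; not provided by abacusai.
    """
    trimmed = {}
    for key, value in doc_dict.items():
        # Long text fields (e.g. OCR output) are shortened for readability.
        if isinstance(value, str) and len(value) > max_text_chars:
            value = value[:max_text_chars] + "..."
        trimmed[key] = value
    # default=str guards against values json cannot encode natively.
    return json.dumps(trimmed, indent=2, sort_keys=True, default=str)


# Assumed shape of a to_dict() result, for illustration only.
sample = {"doc_id": "example-doc-id", "page_count": 2, "extracted_text": "x" * 200}
print(dump_document_dict(sample))
```

With a real instance, the call would presumably be dump_document_dict(doc.to_dict()).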