abacusai.document_data
Classes
DocumentData | Data extracted from a docstore document.
Module Contents
- class abacusai.document_data.DocumentData(client, docId=None, mimeType=None, pageCount=None, totalPageCount=None, extractedText=None, embeddedText=None, pages=None, tokens=None, metadata=None, pageMarkdown=None, extractedPageText=None, augmentedPageText=None)
Bases: abacusai.return_class.AbstractApiClass
Data extracted from a docstore document.
- Parameters:
client (ApiClient) – An authenticated API client instance.
docId (str) – Unique Docstore string identifier for the document.
mimeType (str) – The mime type of the document.
pageCount (int) – The number of pages for which data is available. This is generally the same as the total number of pages in the document, but may be smaller if only selected pages were processed.
totalPageCount (int) – The total number of pages in the document.
extractedText (str) – The extracted text in the document obtained from OCR.
embeddedText (str) – The embedded text in the document. Only available for digital documents.
pages (list) – List of embedded text for each page in the document. Only available for digital documents.
tokens (list) – List of extracted tokens in the document obtained from OCR.
metadata (list) – List of metadata for each page in the document.
pageMarkdown (list) – List of markdown text for each page in the document.
extractedPageText (list) – List of extracted text for each page in the document obtained from OCR. Available only when the return_extracted_page_text parameter is set to True in the document data retrieval API.
augmentedPageText (list) – List of extracted text for each page in the document obtained from OCR augmented with embedded links in the document.
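The relationship between pageCount and totalPageCount, and between the embedded and OCR per-page text fields, can be sketched as follows. This is a minimal illustration, not part of the abacusai package: the helpers coverage_summary and text_per_page are hypothetical, and a SimpleNamespace stands in for a real DocumentData instance so the snippet runs without an API client. It also assumes that the per-page lists hold one entry per processed page.

```python
from types import SimpleNamespace


def coverage_summary(doc):
    """Describe how many pages have data available (hypothetical helper)."""
    processed = doc.page_count
    total = doc.total_page_count
    if processed is None or total is None:
        return "page counts unavailable"
    if processed < total:
        # Processing was limited to selected pages.
        return f"partial: {processed} of {total} pages processed"
    return f"complete: {total} pages processed"


def text_per_page(doc):
    """Prefer embedded per-page text (digital documents only),
    falling back to OCR per-page text when it was requested."""
    if doc.pages:
        return list(doc.pages)
    if doc.extracted_page_text:
        return list(doc.extracted_page_text)
    return []


# Stand-in object exposing the same attribute names as DocumentData (assumption:
# a scanned document, so no embedded text, with OCR page text requested).
doc = SimpleNamespace(
    page_count=2,
    total_page_count=5,
    pages=None,
    extracted_page_text=["first page", "second page"],
)

print(coverage_summary(doc))
print(text_per_page(doc))
```

Any field may be None when the corresponding data was not produced, so the helpers check for missing values before using them.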
- doc_id
- mime_type
- page_count
- total_page_count
- extracted_text
- embedded_text
- pages
- tokens
- metadata
- page_markdown
- extracted_page_text
- augmented_page_text
- deprecated_keys
- __repr__()
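The constructor parameters above use camelCase names, while the attributes use snake_case. The correspondence can be sketched with a small conversion function. This is only an illustration of the naming pattern; camel_to_snake is a hypothetical helper and is not claimed to be how the library performs the mapping.

```python
import re


def camel_to_snake(name):
    """Convert a camelCase parameter name to its snake_case attribute name."""
    # Insert an underscore before every uppercase letter except at the start,
    # then lowercase the whole string.
    return re.sub(r"(?<!^)(?=[A-Z])", "_", name).lower()


params = ["docId", "mimeType", "totalPageCount", "extractedPageText"]
attributes = [camel_to_snake(p) for p in params]
print(attributes)
```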