abacusai.api_class.dataset
Classes
DatasetConfig | An abstract class for dataset configs
ParsingConfig | Custom config for dataset parsing.
DocumentProcessingConfig | Document processing configuration.
DatasetDocumentProcessingConfig | Document processing configuration for dataset imports.
IncrementalDatabaseConnectorConfig | Config information for incremental datasets from database connectors
AttachmentParsingConfig | Config information for parsing attachments
Module Contents
- class abacusai.api_class.dataset.DatasetConfig
Bases:
abacusai.api_class.abstract.ApiClass
An abstract class for dataset configs
- Parameters:
is_documentset (bool) – Whether the dataset is a document set
- class abacusai.api_class.dataset.ParsingConfig
Bases:
abacusai.api_class.abstract.ApiClass
Custom config for dataset parsing.
- Parameters:
- class abacusai.api_class.dataset.DocumentProcessingConfig
Bases:
abacusai.api_class.abstract.ApiClass
Document processing configuration.
- Parameters:
document_type (DocumentType) – Type of document. Can be one of Text, Tables and Forms, Embedded Images, etc. If not specified, the type is decided automatically.
highlight_relevant_text (bool) – Whether to extract bounding boxes and highlight relevant text in search results. Defaults to False.
extract_bounding_boxes (bool) – Whether to perform OCR and extract bounding boxes. If False, no OCR is performed and only the embedded text of digital documents is extracted. Defaults to False.
ocr_mode (OcrMode) – OCR mode. There are different OCR modes available for different kinds of documents and use cases. This option only takes effect when extract_bounding_boxes is True.
use_full_ocr (bool) – Whether to perform full OCR. If True, OCR will be performed on the full page. If False, OCR will be performed on the non-text regions only. By default, it will be decided automatically based on the OCR mode and the document type. This option only takes effect when extract_bounding_boxes is True.
remove_header_footer (bool) – Whether to remove headers and footers. Defaults to False. This option only takes effect when extract_bounding_boxes is True.
remove_watermarks (bool) – Whether to remove watermarks. By default, it will be decided automatically based on the OCR mode and the document type. This option only takes effect when extract_bounding_boxes is True.
convert_to_markdown (bool) – Whether to convert extracted text to markdown. Defaults to False. This option only takes effect when extract_bounding_boxes is True.
mask_pii (bool) – Whether to mask personally identifiable information (PII) in the document text/tokens. Defaults to False.
- document_type: abacusai.api_class.enums.DocumentType = None
- ocr_mode: abacusai.api_class.enums.OcrMode
- __post_init__()
- _detect_ocr_mode()
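Several parameters above take effect only when extract_bounding_boxes is True. The sketch below makes that dependency concrete. It uses a hypothetical stand-in dataclass (DocProcessingSketch) and a hypothetical helper (ignored_options); neither is part of abacusai, and the field defaults marked None simply stand for "decided automatically" as described in the parameter list.

```python
from dataclasses import asdict, dataclass
from typing import Optional

# Options documented as "only takes effect when extract_bounding_boxes is True".
OCR_DEPENDENT = (
    "ocr_mode",
    "use_full_ocr",
    "remove_header_footer",
    "remove_watermarks",
    "convert_to_markdown",
)


@dataclass
class DocProcessingSketch:
    """Hypothetical stand-in mirroring the documented parameters (NOT the abacusai class)."""
    document_type: Optional[str] = None      # None: decided automatically
    highlight_relevant_text: bool = False
    extract_bounding_boxes: bool = False
    ocr_mode: Optional[str] = None           # placeholder type; the real field is an OcrMode enum
    use_full_ocr: Optional[bool] = None      # None: decided automatically
    remove_header_footer: bool = False
    remove_watermarks: Optional[bool] = None  # None: decided automatically
    convert_to_markdown: bool = False
    mask_pii: bool = False


def ignored_options(cfg: DocProcessingSketch) -> list:
    """Return OCR-dependent options that are switched on but would have no effect."""
    if cfg.extract_bounding_boxes:
        return []
    values = asdict(cfg)
    # Values of None or False are treated as "not requested".
    return [name for name in OCR_DEPENDENT if values[name] not in (None, False)]


cfg = DocProcessingSketch(remove_header_footer=True, convert_to_markdown=True)
print(ignored_options(cfg))  # both options are reported because OCR is off
```

A check like this can be run before building the real config to catch settings that would be silently ignored.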
- class abacusai.api_class.dataset.DatasetDocumentProcessingConfig
Bases:
DocumentProcessingConfig
Document processing configuration for dataset imports.
- Parameters:
extract_bounding_boxes (bool) – Whether to perform OCR and extract bounding boxes. If False, no OCR is performed and only the embedded text of digital documents is extracted. Defaults to False.
ocr_mode (OcrMode) – OCR mode. There are different OCR modes available for different kinds of documents and use cases. This option only takes effect when extract_bounding_boxes is True.
use_full_ocr (bool) – Whether to perform full OCR. If True, OCR will be performed on the full page. If False, OCR will be performed on the non-text regions only. By default, it will be decided automatically based on the OCR mode and the document type. This option only takes effect when extract_bounding_boxes is True.
remove_header_footer (bool) – Whether to remove headers and footers. Defaults to False. This option only takes effect when extract_bounding_boxes is True.
remove_watermarks (bool) – Whether to remove watermarks. By default, it will be decided automatically based on the OCR mode and the document type. This option only takes effect when extract_bounding_boxes is True.
convert_to_markdown (bool) – Whether to convert extracted text to markdown. Defaults to False. This option only takes effect when extract_bounding_boxes is True.
page_text_column (str) – Name of the output column which contains the extracted text for each page. If not provided, no column will be created.
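To see how the parameter list documented here differs from the one documented for DocumentProcessingConfig, the two lists can be compared as plain sets. This is only a comparison of the names that appear on this page; it does not inspect the library, and parameters not listed for the subclass may still be inherited from the base class.

```python
# Parameter names as listed on this page for each class.
base_params = {
    "document_type", "highlight_relevant_text", "extract_bounding_boxes",
    "ocr_mode", "use_full_ocr", "remove_header_footer", "remove_watermarks",
    "convert_to_markdown", "mask_pii",
}
dataset_params = {
    "extract_bounding_boxes", "ocr_mode", "use_full_ocr", "remove_header_footer",
    "remove_watermarks", "convert_to_markdown", "page_text_column",
}

# Documented only for DatasetDocumentProcessingConfig.
added = sorted(dataset_params - base_params)
# Documented only for the base DocumentProcessingConfig.
not_relisted = sorted(base_params - dataset_params)

print(added)         # ['page_text_column']
print(not_relisted)  # ['document_type', 'highlight_relevant_text', 'mask_pii']
```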
- class abacusai.api_class.dataset.IncrementalDatabaseConnectorConfig
Bases:
abacusai.api_class.abstract.ApiClass
Config information for incremental datasets from database connectors
- Parameters:
timestamp_column (str) – Name of the required timestamp column when the dataset is incremental. This column must contain timestamps in descending order, which are used to determine the increments of the incremental dataset.
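Because the timestamp column is required to be in descending order, it can be worth checking a sample of the data before configuring the connector. The helper below is a minimal sketch with a hypothetical name (is_descending); it is not part of abacusai and assumes the column values have already been loaded as datetime objects.

```python
from datetime import datetime


def is_descending(timestamps):
    """Return True if every timestamp is no later than the one before it."""
    return all(
        current <= previous
        for previous, current in zip(timestamps, timestamps[1:])
    )


sample = [
    datetime(2024, 3, 3, 12, 0),
    datetime(2024, 3, 2, 9, 30),
    datetime(2024, 3, 1, 18, 45),
]
print(is_descending(sample))                  # True: newest row first
print(is_descending(list(reversed(sample))))  # False: ascending order
```

Equal adjacent timestamps are accepted here; whether the service tolerates ties is not stated in the parameter description.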
- class abacusai.api_class.dataset.AttachmentParsingConfig
Bases:
abacusai.api_class.abstract.ApiClass
Config information for parsing attachments
- Parameters: