abacusai.api_class.dataset ========================== .. py:module:: abacusai.api_class.dataset Classes ------- .. autoapisummary:: abacusai.api_class.dataset.DatasetConfig abacusai.api_class.dataset.ParsingConfig abacusai.api_class.dataset.DocumentProcessingConfig abacusai.api_class.dataset.DatasetDocumentProcessingConfig abacusai.api_class.dataset.IncrementalDatabaseConnectorConfig abacusai.api_class.dataset.AttachmentParsingConfig Module Contents --------------- .. py:class:: DatasetConfig Bases: :py:obj:`abacusai.api_class.abstract.ApiClass` An abstract class for dataset configs :param is_documentset: Whether the dataset is a document set :type is_documentset: bool .. py:attribute:: is_documentset :type: bool :value: None .. py:class:: ParsingConfig Bases: :py:obj:`abacusai.api_class.abstract.ApiClass` Custom config for dataset parsing. :param escape: Escape character for CSV files. Defaults to '"'. :type escape: str :param csv_delimiter: Delimiter for CSV files. Defaults to None. :type csv_delimiter: str :param file_path_with_schema: Path to the file with schema. Defaults to None. :type file_path_with_schema: str .. py:attribute:: escape :type: str :value: '"' .. py:attribute:: csv_delimiter :type: str :value: None .. py:attribute:: file_path_with_schema :type: str :value: None .. py:class:: DocumentProcessingConfig Bases: :py:obj:`abacusai.api_class.abstract.ApiClass` Document processing configuration. :param document_type: Type of document. Can be one of Text, Tables and Forms, Embedded Images, etc. If not specified, type will be decided automatically. :type document_type: DocumentType :param highlight_relevant_text: Whether to extract bounding boxes and highlight relevant text in search results. Defaults to False. :type highlight_relevant_text: bool :param extract_bounding_boxes: Whether to perform OCR and extract bounding boxes. If False, no OCR will be done but only the embedded text from digital documents will be extracted. Defaults to False. :type extract_bounding_boxes: bool :param ocr_mode: OCR mode. There are different OCR modes available for different kinds of documents and use cases. This option only takes effect when extract_bounding_boxes is True. :type ocr_mode: OcrMode :param use_full_ocr: Whether to perform full OCR. If True, OCR will be performed on the full page. If False, OCR will be performed on the non-text regions only. By default, it will be decided automatically based on the OCR mode and the document type. This option only takes effect when extract_bounding_boxes is True. :type use_full_ocr: bool :param remove_header_footer: Whether to remove headers and footers. Defaults to False. This option only takes effect when extract_bounding_boxes is True. :type remove_header_footer: bool :param remove_watermarks: Whether to remove watermarks. By default, it will be decided automatically based on the OCR mode and the document type. This option only takes effect when extract_bounding_boxes is True. :type remove_watermarks: bool :param convert_to_markdown: Whether to convert extracted text to markdown. Defaults to False. This option only takes effect when extract_bounding_boxes is True. :type convert_to_markdown: bool :param mask_pii: Whether to mask personally identifiable information (PII) in the document text/tokens. Defaults to False. :type mask_pii: bool :param extract_images: Whether to extract images from the document e.g. diagrams in a PDF page. Defaults to False. :type extract_images: bool .. py:attribute:: document_type :type: abacusai.api_class.enums.DocumentType :value: None .. py:attribute:: highlight_relevant_text :type: bool :value: None .. py:attribute:: extract_bounding_boxes :type: bool :value: False .. py:attribute:: ocr_mode :type: abacusai.api_class.enums.OcrMode .. py:attribute:: use_full_ocr :type: bool :value: None .. py:attribute:: remove_header_footer :type: bool :value: False .. py:attribute:: remove_watermarks :type: bool :value: True .. py:attribute:: convert_to_markdown :type: bool :value: False .. py:attribute:: mask_pii :type: bool :value: False .. py:attribute:: extract_images :type: bool :value: False .. py:method:: __post_init__() .. py:method:: _detect_ocr_mode() .. py:method:: _get_filtered_dict(config) :classmethod: Filters out default values from the config .. py:class:: DatasetDocumentProcessingConfig Bases: :py:obj:`DocumentProcessingConfig` Document processing configuration for dataset imports. :param extract_bounding_boxes: Whether to perform OCR and extract bounding boxes. If False, no OCR will be done but only the embedded text from digital documents will be extracted. Defaults to False. :type extract_bounding_boxes: bool :param ocr_mode: OCR mode. There are different OCR modes available for different kinds of documents and use cases. This option only takes effect when extract_bounding_boxes is True. :type ocr_mode: OcrMode :param use_full_ocr: Whether to perform full OCR. If True, OCR will be performed on the full page. If False, OCR will be performed on the non-text regions only. By default, it will be decided automatically based on the OCR mode and the document type. This option only takes effect when extract_bounding_boxes is True. :type use_full_ocr: bool :param remove_header_footer: Whether to remove headers and footers. Defaults to False. This option only takes effect when extract_bounding_boxes is True. :type remove_header_footer: bool :param remove_watermarks: Whether to remove watermarks. By default, it will be decided automatically based on the OCR mode and the document type. This option only takes effect when extract_bounding_boxes is True. :type remove_watermarks: bool :param convert_to_markdown: Whether to convert extracted text to markdown. Defaults to False. This option only takes effect when extract_bounding_boxes is True. :type convert_to_markdown: bool :param page_text_column: Name of the output column which contains the extracted text for each page. If not provided, no column will be created. :type page_text_column: str .. py:attribute:: page_text_column :type: str :value: None .. py:class:: IncrementalDatabaseConnectorConfig Bases: :py:obj:`abacusai.api_class.abstract.ApiClass` Config information for incremental datasets from database connectors :param timestamp_column: If dataset is incremental, this is the column name of the required column in the dataset. This column must contain timestamps in descending order which are used to determine the increments of the incremental dataset. :type timestamp_column: str .. py:attribute:: timestamp_column :type: str :value: None .. py:class:: AttachmentParsingConfig Bases: :py:obj:`abacusai.api_class.abstract.ApiClass` Config information for parsing attachments :param feature_group_name: feature group name :type feature_group_name: str :param column_name: column name :type column_name: str :param urls: list of urls :type urls: str .. py:attribute:: feature_group_name :type: str :value: None .. py:attribute:: column_name :type: str :value: None .. py:attribute:: urls :type: str :value: None