abacusai.dataset ================ .. py:module:: abacusai.dataset Classes ------- .. autoapisummary:: abacusai.dataset.Dataset Module Contents --------------- .. py:class:: Dataset(client, datasetId=None, sourceType=None, dataSource=None, createdAt=None, ignoreBefore=None, ephemeral=None, lookbackDays=None, databaseConnectorId=None, databaseConnectorConfig=None, connectorType=None, featureGroupTableName=None, applicationConnectorId=None, applicationConnectorConfig=None, incremental=None, isDocumentset=None, extractBoundingBoxes=None, mergeFileSchemas=None, referenceOnlyDocumentset=None, versionLimit=None, schema={}, refreshSchedules={}, latestDatasetVersion={}, parsingConfig={}, documentProcessingConfig={}, attachmentParsingConfig={}) Bases: :py:obj:`abacusai.return_class.AbstractApiClass` A dataset reference :param client: An authenticated API Client instance :type client: ApiClient :param datasetId: The unique identifier of the dataset. :type datasetId: str :param sourceType: The source of the Dataset. EXTERNAL_SERVICE, UPLOAD, or STREAMING. :type sourceType: str :param dataSource: Location of data. It may be a URI such as an s3 bucket or the database table. :type dataSource: str :param createdAt: The timestamp at which this dataset was created. :type createdAt: str :param ignoreBefore: The timestamp at which all previous events are ignored when training. :type ignoreBefore: str :param ephemeral: The dataset is ephemeral and not used for training. :type ephemeral: bool :param lookbackDays: Specific to streaming datasets, this specifies how many days worth of data to include when generating a snapshot. Value of 0 indicates leaves this selection to the system. :type lookbackDays: int :param databaseConnectorId: The Database Connector used. :type databaseConnectorId: str :param databaseConnectorConfig: The database connector query used to retrieve data. :type databaseConnectorConfig: dict :param connectorType: The type of connector used to get this dataset FILE or DATABASE. :type connectorType: str :param featureGroupTableName: The table name of the dataset's feature group :type featureGroupTableName: str :param applicationConnectorId: The Application Connector used. :type applicationConnectorId: str :param applicationConnectorConfig: The application connector query used to retrieve data. :type applicationConnectorConfig: dict :param incremental: If dataset is an incremental dataset. :type incremental: bool :param isDocumentset: If dataset is a documentset. :type isDocumentset: bool :param extractBoundingBoxes: Signifies whether to extract bounding boxes out of the documents. Only valid if is_documentset if True. :type extractBoundingBoxes: bool :param mergeFileSchemas: If the merge file schemas policy is enabled. :type mergeFileSchemas: bool :param referenceOnlyDocumentset: Signifies whether to save the data reference only. Only valid if is_documentset if True. :type referenceOnlyDocumentset: bool :param versionLimit: Version limit for the dataset. :type versionLimit: int :param latestDatasetVersion: The latest version of this dataset. :type latestDatasetVersion: DatasetVersion :param schema: List of resolved columns. :type schema: DatasetColumn :param refreshSchedules: List of schedules that determines when the next version of the dataset will be created. :type refreshSchedules: RefreshSchedule :param parsingConfig: The parsing config used for dataset. :type parsingConfig: ParsingConfig :param documentProcessingConfig: The document processing config used for dataset (when is_documentset is True). :type documentProcessingConfig: DocumentProcessingConfig :param attachmentParsingConfig: The attachment parsing config used for dataset (eg. for salesforce attachment parsing) :type attachmentParsingConfig: AttachmentParsingConfig .. py:attribute:: dataset_id :value: None .. py:attribute:: source_type :value: None .. py:attribute:: data_source :value: None .. py:attribute:: created_at :value: None .. py:attribute:: ignore_before :value: None .. py:attribute:: ephemeral :value: None .. py:attribute:: lookback_days :value: None .. py:attribute:: database_connector_id :value: None .. py:attribute:: database_connector_config :value: None .. py:attribute:: connector_type :value: None .. py:attribute:: feature_group_table_name :value: None .. py:attribute:: application_connector_id :value: None .. py:attribute:: application_connector_config :value: None .. py:attribute:: incremental :value: None .. py:attribute:: is_documentset :value: None .. py:attribute:: extract_bounding_boxes :value: None .. py:attribute:: merge_file_schemas :value: None .. py:attribute:: reference_only_documentset :value: None .. py:attribute:: version_limit :value: None .. py:attribute:: schema .. py:attribute:: refresh_schedules .. py:attribute:: latest_dataset_version .. py:attribute:: parsing_config .. py:attribute:: document_processing_config .. py:attribute:: attachment_parsing_config .. py:attribute:: deprecated_keys .. py:method:: __repr__() .. py:method:: to_dict() Get a dict representation of the parameters in this class :returns: The dict value representation of the class parameters :rtype: dict .. py:method:: create_version_from_file_connector(location = None, file_format = None, csv_delimiter = None, merge_file_schemas = None, parsing_config = None, sql_query = None) Creates a new version of the specified dataset. :param location: External URI to import the dataset from. If not specified, the last location will be used. :type location: str :param file_format: File format to be used. If not specified, the service will try to detect the file format. :type file_format: str :param csv_delimiter: If the file format is CSV, use a specific CSV delimiter. :type csv_delimiter: str :param merge_file_schemas: Signifies if the merge file schema policy is enabled. :type merge_file_schemas: bool :param parsing_config: Custom config for dataset parsing. :type parsing_config: ParsingConfig :param sql_query: The SQL query to use when fetching data from the specified location. Use `__TABLE__` as a placeholder for the table name. For example: "SELECT * FROM __TABLE__ WHERE event_date > '2021-01-01'". If not provided, the entire dataset from the specified location will be imported. :type sql_query: str :returns: The new Dataset Version created. :rtype: DatasetVersion .. py:method:: create_version_from_database_connector(object_name = None, columns = None, query_arguments = None, sql_query = None) Creates a new version of the specified dataset. :param object_name: The name/ID of the object in the service to query. If not specified, the last name will be used. :type object_name: str :param columns: The columns to query from the external service object. If not specified, the last columns will be used. :type columns: str :param query_arguments: Additional query arguments to filter the data. If not specified, the last arguments will be used. :type query_arguments: str :param sql_query: The full SQL query to use when fetching data. If present, this parameter will override object_name, columns, and query_arguments. :type sql_query: str :returns: The new Dataset Version created. :rtype: DatasetVersion .. py:method:: create_version_from_application_connector(dataset_config = None) Creates a new version of the specified dataset. :param dataset_config: Dataset config for the application connector. If any of the fields are not specified, the last values will be used. :type dataset_config: ApplicationConnectorDatasetConfig :returns: The new Dataset Version created. :rtype: DatasetVersion .. py:method:: create_version_from_upload(file_format = None) Creates a new version of the specified dataset using a local file upload. :param file_format: File format to be used. If not specified, the service will attempt to detect the file format. :type file_format: str :returns: Token to be used when uploading file parts. :rtype: Upload .. py:method:: create_version_from_document_reprocessing(document_processing_config = None) Creates a new dataset version for a source docstore dataset with the provided document processing configuration. This does not re-import the data but uses the same data which is imported in the latest dataset version and only performs document processing on it. :param document_processing_config: The document processing configuration to use for the new dataset version. If not specified, the document processing configuration from the source dataset will be used. :type document_processing_config: DatasetDocumentProcessingConfig :returns: The new dataset version created. :rtype: DatasetVersion .. py:method:: snapshot_streaming_data() Snapshots the current data in the streaming dataset. :param dataset_id: The unique ID associated with the dataset. :type dataset_id: str :returns: The new Dataset Version created by taking a snapshot of the current data in the streaming dataset. :rtype: DatasetVersion .. py:method:: set_column_data_type(column, data_type) Set a Dataset's column type. :param column: The name of the column. :type column: str :param data_type: The type of the data in the column. Note: Some ColumnMappings may restrict the options or explicitly set the DataType. :type data_type: DataType :returns: The dataset and schema after the data type has been set. :rtype: Dataset .. py:method:: set_streaming_retention_policy(retention_hours = None, retention_row_count = None, ignore_records_before_timestamp = None) Sets the streaming retention policy. :param retention_hours: Number of hours to retain streamed data in memory. :type retention_hours: int :param retention_row_count: Number of rows to retain streamed data in memory. :type retention_row_count: int :param ignore_records_before_timestamp: The Unix timestamp (in seconds) to use as a cutoff to ignore all entries sent before it :type ignore_records_before_timestamp: int .. py:method:: get_schema() Retrieves the column schema of a dataset. :param dataset_id: Unique string identifier of the dataset schema to look up. :type dataset_id: str :returns: List of column schema definitions. :rtype: list[DatasetColumn] .. py:method:: set_database_connector_config(database_connector_id, object_name = None, columns = None, query_arguments = None, sql_query = None) Sets database connector config for a dataset. This method is currently only supported for streaming datasets. :param database_connector_id: Unique String Identifier of the Database Connector to import the dataset from. :type database_connector_id: str :param object_name: If applicable, the name/ID of the object in the service to query. :type object_name: str :param columns: The columns to query from the external service object. :type columns: str :param query_arguments: Additional query arguments to filter the data. :type query_arguments: str :param sql_query: The full SQL query to use when fetching data. If present, this parameter will override `object_name`, `columns` and `query_arguments`. :type sql_query: str .. py:method:: update_version_limit(version_limit) Updates the version limit for the specified dataset. :param version_limit: The maximum number of versions permitted for the feature group. Once this limit is exceeded, the oldest versions will be purged in a First-In-First-Out (FIFO) order. :type version_limit: int :returns: The updated dataset. :rtype: Dataset .. py:method:: refresh() Calls describe and refreshes the current object's fields :returns: The current object :rtype: Dataset .. py:method:: describe() Retrieves a full description of the specified dataset, with attributes such as its ID, name, source type, etc. :param dataset_id: The unique ID associated with the dataset. :type dataset_id: str :returns: The dataset. :rtype: Dataset .. py:method:: list_versions(limit = 100, start_after_version = None) Retrieves a list of all dataset versions for the specified dataset. :param limit: The maximum length of the list of all dataset versions. :type limit: int :param start_after_version: The ID of the version after which the list starts. :type start_after_version: str :returns: A list of dataset versions. :rtype: list[DatasetVersion] .. py:method:: delete() Deletes the specified dataset from the organization. :param dataset_id: Unique string identifier of the dataset to delete. :type dataset_id: str .. py:method:: wait_for_import(timeout=900) A waiting call until dataset is imported. :param timeout: The waiting time given to the call to finish, if it doesn't finish by the allocated time, the call is said to be timed out. :type timeout: int .. py:method:: wait_for_inspection(timeout=None) A waiting call until dataset is completely inspected. :param timeout: The waiting time given to the call to finish, if it doesn't finish by the allocated time, the call is said to be timed out. :type timeout: int .. py:method:: get_status() Gets the status of the latest dataset version. :returns: A string describing the status of a dataset (importing, inspecting, complete, etc.). :rtype: str .. py:method:: describe_feature_group() Gets the feature group attached to the dataset. :returns: A feature group object. :rtype: FeatureGroup .. py:method:: create_refresh_policy(cron) To create a refresh policy for a dataset. :param cron: A cron style string to set the refresh time. :type cron: str :returns: The refresh policy object. :rtype: RefreshPolicy .. py:method:: list_refresh_policies() Gets the refresh policies in a list. :returns: A list of refresh policy objects. :rtype: List[RefreshPolicy]