abacusai.dataset
Classes
| Dataset | A dataset reference |
Module Contents
- class abacusai.dataset.Dataset(client, datasetId=None, sourceType=None, dataSource=None, createdAt=None, ignoreBefore=None, ephemeral=None, lookbackDays=None, databaseConnectorId=None, databaseConnectorConfig=None, connectorType=None, featureGroupTableName=None, applicationConnectorId=None, applicationConnectorConfig=None, incremental=None, isDocumentset=None, extractBoundingBoxes=None, mergeFileSchemas=None, referenceOnlyDocumentset=None, versionLimit=None, schema={}, refreshSchedules={}, latestDatasetVersion={}, parsingConfig={}, documentProcessingConfig={}, attachmentParsingConfig={})
- Bases: abacusai.return_class.AbstractApiClass
- A dataset reference
- Parameters:
- client (ApiClient) – An authenticated API Client instance 
- datasetId (str) – The unique identifier of the dataset. 
- sourceType (str) – The source of the Dataset. EXTERNAL_SERVICE, UPLOAD, or STREAMING. 
- dataSource (str) – Location of the data. It may be a URI, such as an S3 bucket, or a database table. 
- createdAt (str) – The timestamp at which this dataset was created. 
- ignoreBefore (str) – The timestamp before which all events are ignored when training. 
- ephemeral (bool) – The dataset is ephemeral and not used for training. 
- lookbackDays (int) – Specific to streaming datasets, this specifies how many days' worth of data to include when generating a snapshot. A value of 0 leaves this selection to the system. 
- databaseConnectorId (str) – The Database Connector used. 
- databaseConnectorConfig (dict) – The database connector query used to retrieve data. 
- connectorType (str) – The type of connector used to get this dataset: FILE or DATABASE. 
- featureGroupTableName (str) – The table name of the dataset’s feature group 
- applicationConnectorId (str) – The Application Connector used. 
- applicationConnectorConfig (dict) – The application connector query used to retrieve data. 
- incremental (bool) – Whether the dataset is an incremental dataset. 
- isDocumentset (bool) – Whether the dataset is a documentset. 
- extractBoundingBoxes (bool) – Signifies whether to extract bounding boxes out of the documents. Only valid if is_documentset is True. 
- mergeFileSchemas (bool) – Whether the merge file schemas policy is enabled. 
- referenceOnlyDocumentset (bool) – Signifies whether to save the data reference only. Only valid if is_documentset is True. 
- versionLimit (int) – Version limit for the dataset. 
- latestDatasetVersion (DatasetVersion) – The latest version of this dataset. 
- schema (DatasetColumn) – List of resolved columns. 
- refreshSchedules (RefreshSchedule) – List of schedules that determine when the next version of the dataset will be created. 
- parsingConfig (ParsingConfig) – The parsing config used for the dataset. 
- documentProcessingConfig (DocumentProcessingConfig) – The document processing config used for the dataset (when is_documentset is True). 
- attachmentParsingConfig (AttachmentParsingConfig) – The attachment parsing config used for the dataset (e.g., for Salesforce attachment parsing). 
 
 - dataset_id = None
 - source_type = None
 - data_source = None
 - created_at = None
 - ignore_before = None
 - ephemeral = None
 - lookback_days = None
 - database_connector_id = None
 - database_connector_config = None
 - connector_type = None
 - feature_group_table_name = None
 - application_connector_id = None
 - application_connector_config = None
 - incremental = None
 - is_documentset = None
 - extract_bounding_boxes = None
 - merge_file_schemas = None
 - reference_only_documentset = None
 - version_limit = None
 - schema
 - refresh_schedules
 - latest_dataset_version
 - parsing_config
 - document_processing_config
 - attachment_parsing_config
 - deprecated_keys
 - __repr__()
 - to_dict()
- Get a dict representation of the parameters in this class
- Returns:
- The dict value representation of the class parameters 
- Return type:
 
 - get_raw_data_from_realtime(check_permissions=False, start_time=None, end_time=None, column_filter=None)
- Returns raw data from a realtime dataset. Only Microsoft Teams datasets are supported currently due to data size constraints in realtime datasets.
- Parameters:
- check_permissions (bool) – If True, checks user permissions using session email. 
- start_time (str) – Start time filter (inclusive) for created_date_time_t in ISO 8601 format (e.g. 2025-05-13T08:25:11Z or 2025-05-13T08:25:11+00:00). 
- end_time (str) – End time filter (inclusive) for created_date_time_t in ISO 8601 format (e.g. 2025-05-13T08:25:11Z or 2025-05-13T08:25:11+00:00). 
- column_filter (dict) – Dictionary mapping column names to filter values. Only rows matching all column filters will be returned. 
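
As an illustration of the filter arguments above, the following sketch builds start_time and end_time strings in the ISO 8601 form shown in the parameter descriptions and assembles a column_filter dictionary. The helper names (iso8601_utc, build_realtime_query) and the channel_id column are hypothetical; only the keyword names come from the signature above, and the final call is left commented out because it needs an authenticated Dataset.

```python
from datetime import datetime, timezone


def iso8601_utc(dt):
    """Format a timezone-aware datetime as ISO 8601 with a +00:00 offset."""
    if dt.tzinfo is None:
        raise ValueError("expected a timezone-aware datetime")
    return dt.astimezone(timezone.utc).isoformat(timespec="seconds")


def build_realtime_query(start, end, column_filter=None):
    """Assemble keyword arguments for get_raw_data_from_realtime (hypothetical helper)."""
    if start > end:
        raise ValueError("start must not be after end")
    return {
        "start_time": iso8601_utc(start),   # inclusive lower bound
        "end_time": iso8601_utc(end),       # inclusive upper bound
        "column_filter": dict(column_filter or {}),
    }


kwargs = build_realtime_query(
    datetime(2025, 5, 13, 8, 25, 11, tzinfo=timezone.utc),
    datetime(2025, 5, 14, 8, 25, 11, tzinfo=timezone.utc),
    column_filter={"channel_id": "example-channel"},  # hypothetical column name
)
# dataset.get_raw_data_from_realtime(**kwargs)  # requires an authenticated Dataset
```

Because only rows matching all column filters are returned, each extra key in column_filter narrows the result further.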
 
 
 - create_version_from_file_connector(location=None, file_format=None, csv_delimiter=None, merge_file_schemas=None, parsing_config=None, sql_query=None)
- Creates a new version of the specified dataset.
- Parameters:
- location (str) – External URI to import the dataset from. If not specified, the last location will be used. 
- file_format (str) – File format to be used. If not specified, the service will try to detect the file format. 
- csv_delimiter (str) – If the file format is CSV, use a specific CSV delimiter. 
- merge_file_schemas (bool) – Signifies if the merge file schema policy is enabled. 
- parsing_config (ParsingConfig) – Custom config for dataset parsing. 
- sql_query (str) – The SQL query to use when fetching data from the specified location. Use __TABLE__ as a placeholder for the table name. For example: “SELECT * FROM __TABLE__ WHERE event_date > ‘2021-01-01’”. If not provided, the entire dataset from the specified location will be imported. 
 
- Returns:
- The new Dataset Version created. 
- Return type:
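
The sql_query parameter of create_version_from_file_connector can be sketched as follows. This is a minimal example, assuming the caller only wants to restrict rows with a WHERE clause; build_import_query, import_filtered and RecordingDataset are hypothetical names introduced here so the snippet runs without a live client. Only the method name, its keyword names and the __TABLE__ placeholder come from the description above.

```python
TABLE_PLACEHOLDER = "__TABLE__"  # placeholder named in the sql_query description


def build_import_query(where_clause):
    """Return a SELECT over the placeholder table, restricted by where_clause."""
    return f"SELECT * FROM {TABLE_PLACEHOLDER} WHERE {where_clause}"


def import_filtered(dataset, where_clause, location=None):
    """Create a new version that imports only rows matching where_clause.

    `dataset` is any object exposing create_version_from_file_connector,
    such as a Dataset instance. Leaving location as None reuses the last
    location, as documented above.
    """
    return dataset.create_version_from_file_connector(
        location=location,
        sql_query=build_import_query(where_clause),
    )


class RecordingDataset:
    """Hypothetical stand-in that records the arguments it receives."""

    def __init__(self):
        self.calls = []

    def create_version_from_file_connector(self, **kwargs):
        self.calls.append(kwargs)
        return "new-version"  # placeholder for a DatasetVersion


stub = RecordingDataset()
result = import_filtered(stub, "event_date > '2021-01-01'")
```

With a real Dataset, the same import_filtered call would be passed the Dataset object instead of the stub.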
 
 - create_version_from_database_connector(object_name=None, columns=None, query_arguments=None, sql_query=None)
- Creates a new version of the specified dataset.
- Parameters:
- object_name (str) – The name/ID of the object in the service to query. If not specified, the last name will be used. 
- columns (str) – The columns to query from the external service object. If not specified, the last columns will be used. 
- query_arguments (str) – Additional query arguments to filter the data. If not specified, the last arguments will be used. 
- sql_query (str) – The full SQL query to use when fetching data. If present, this parameter will override object_name, columns, and query_arguments. 
 
- Returns:
- The new Dataset Version created. 
- Return type:
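
Since sql_query overrides object_name, columns and query_arguments, a caller may prefer to send only the arguments that will take effect. The sketch below shows one way to do that on the caller's side; database_version_kwargs is a hypothetical helper, the orders table is made up, and the precedence rule it applies is taken from the parameter descriptions above rather than from the library's implementation.

```python
def database_version_kwargs(object_name=None, columns=None,
                            query_arguments=None, sql_query=None):
    """Choose keyword arguments for create_version_from_database_connector."""
    if sql_query is not None:
        # A full SQL query overrides the other three fields, so drop them.
        return {"sql_query": sql_query}
    fields = {
        "object_name": object_name,
        "columns": columns,
        "query_arguments": query_arguments,
    }
    # Omitted fields fall back to the last values used, so leave them out.
    return {name: value for name, value in fields.items() if value is not None}


full = database_version_kwargs(
    object_name="orders",            # hypothetical table name
    columns="id, total",
    sql_query="SELECT id FROM orders",
)
partial = database_version_kwargs(columns="id, total")
# dataset.create_version_from_database_connector(**partial)  # needs a real Dataset
```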
 
 - create_version_from_application_connector(dataset_config=None)
- Creates a new version of the specified dataset.
- Parameters:
- dataset_config (ApplicationConnectorDatasetConfig) – Dataset config for the application connector. If any of the fields are not specified, the last values will be used. 
- Returns:
- The new Dataset Version created. 
- Return type:
 
 - create_version_from_upload(file_format=None)
- Creates a new version of the specified dataset using a local file upload. 
 - create_version_from_document_reprocessing(document_processing_config=None)
- Creates a new dataset version for a source docstore dataset with the provided document processing configuration. This does not re-import the data; it reuses the data imported in the latest dataset version and only performs document processing on it.
- Parameters:
- document_processing_config (DatasetDocumentProcessingConfig) – The document processing configuration to use for the new dataset version. If not specified, the document processing configuration from the source dataset will be used. 
- Returns:
- The new dataset version created. 
- Return type:
 
 - snapshot_streaming_data()
- Snapshots the current data in the streaming dataset.
- Parameters:
- dataset_id (str) – The unique ID associated with the dataset. 
- Returns:
- The new Dataset Version created by taking a snapshot of the current data in the streaming dataset. 
- Return type:
 
 - set_column_data_type(column, data_type)
- Set a Dataset’s column type. 
 - set_streaming_retention_policy(retention_hours=None, retention_row_count=None, ignore_records_before_timestamp=None)
- Sets the streaming retention policy. 
 - get_schema()
- Retrieves the column schema of a dataset.
- Parameters:
- dataset_id (str) – Unique string identifier of the dataset schema to look up. 
- Returns:
- List of column schema definitions. 
- Return type:
 
 - set_database_connector_config(database_connector_id, object_name=None, columns=None, query_arguments=None, sql_query=None)
- Sets the database connector config for a dataset. This method is currently only supported for streaming datasets.
- Parameters:
- database_connector_id (str) – Unique String Identifier of the Database Connector to import the dataset from. 
- object_name (str) – If applicable, the name/ID of the object in the service to query. 
- columns (str) – The columns to query from the external service object. 
- query_arguments (str) – Additional query arguments to filter the data. 
- sql_query (str) – The full SQL query to use when fetching data. If present, this parameter will override object_name, columns and query_arguments. 
 
 
 - update_version_limit(version_limit)
- Updates the version limit for the specified dataset. 
 - refresh()
- Calls describe and refreshes the current object’s fields.
- Returns:
- The current object 
- Return type:
 
 - describe()
- Retrieves a full description of the specified dataset, with attributes such as its ID, name, source type, etc. 
 - list_versions(limit=100, start_after_version=None)
- Retrieves a list of all dataset versions for the specified dataset.
- Parameters:
- Returns:
- A list of dataset versions. 
- Return type:
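
The limit and start_after_version arguments of list_versions suggest cursor-style paging, although their semantics are not described here. The sketch below rests on three assumptions: start_after_version is an exclusive cursor, a page shorter than limit marks the end, and each returned version exposes its identifier as dataset_version. FakeDataset is a hypothetical in-memory stand-in so the loop can run without a client.

```python
from types import SimpleNamespace


def iter_versions(dataset, page_size=100):
    """Yield every version by following start_after_version page by page."""
    cursor = None
    while True:
        page = dataset.list_versions(limit=page_size, start_after_version=cursor)
        if not page:
            return
        yield from page
        if len(page) < page_size:
            return  # assumed: a short page means nothing is left
        cursor = page[-1].dataset_version  # assumed attribute name


class FakeDataset:
    """Hypothetical stand-in that pages through an in-memory list."""

    def __init__(self, ids):
        self._versions = [SimpleNamespace(dataset_version=i) for i in ids]

    def list_versions(self, limit=100, start_after_version=None):
        start = 0
        if start_after_version is not None:
            ids = [v.dataset_version for v in self._versions]
            start = ids.index(start_after_version) + 1
        return self._versions[start:start + limit]


fake = FakeDataset(["v5", "v4", "v3", "v2", "v1"])
seen = [v.dataset_version for v in iter_versions(fake, page_size=2)]
```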
 
 - delete()
- Deletes the specified dataset from the organization.
- Parameters:
- dataset_id (str) – Unique string identifier of the dataset to delete. 
 
 - wait_for_import(timeout=900)
- Blocks until the dataset has been imported.
- Parameters:
- timeout (int) – The time allowed for the call to finish. If the call does not finish within the allotted time, it times out. 
 
 - wait_for_inspection(timeout=None)
- Blocks until the dataset has been completely inspected.
- Parameters:
- timeout (int) – The time allowed for the call to finish. If the call does not finish within the allotted time, it times out. 
 
 - get_status()
- Gets the status of the latest dataset version.
- Returns:
- A string describing the status of a dataset (importing, inspecting, complete, etc.). 
- Return type:
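
The waiting calls and get_status can be pictured as a polling loop with a deadline. The sketch below is not the library's implementation. It assumes the in-progress states are the two named in the get_status description (importing, inspecting) and that any other status is final. wait_until_settled and its sleep and clock parameters are hypothetical, added so the loop can be exercised without a live dataset.

```python
import time

IN_PROGRESS = ("importing", "inspecting")  # states named in the description above


def wait_until_settled(get_status, timeout=900, interval=5.0,
                       sleep=time.sleep, clock=time.monotonic):
    """Poll get_status until it leaves the in-progress states or time runs out."""
    deadline = clock() + timeout
    while True:
        status = get_status()
        if status.lower() not in IN_PROGRESS:
            return status
        if clock() >= deadline:
            raise TimeoutError(f"still {status!r} after {timeout} seconds")
        sleep(interval)


# Simulated status sequence; a real caller would pass dataset.get_status.
statuses = iter(["importing", "inspecting", "complete"])
final = wait_until_settled(lambda: next(statuses), timeout=60, sleep=lambda _: None)
```

Passing the status getter as a function keeps the loop independent of the client, so the same helper could wrap dataset.get_status when an explicit polling interval is wanted.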
 
 - describe_feature_group()
- Gets the feature group attached to the dataset.
- Returns:
- A feature group object. 
- Return type:
 
 - create_refresh_policy(cron)
- Creates a refresh policy for the dataset.
- Parameters:
- cron (str) – A cron style string to set the refresh time. 
- Returns:
- The refresh policy object. 
- Return type:
 
 - list_refresh_policies()
- Gets the refresh policies in a list.
- Returns:
- A list of refresh policy objects. 
- Return type:
- List[RefreshPolicy]
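
For create_refresh_policy, the cron argument is described only as a cron style string. The sketch below assumes the conventional five-field layout (minute, hour, day of month, month, day of week). The time zone the service applies is not stated here, and daily_cron and schedule_daily_refresh are hypothetical helpers.

```python
def daily_cron(hour, minute=0):
    """Return a five-field cron expression that fires once a day."""
    if not 0 <= hour <= 23:
        raise ValueError("hour must be between 0 and 23")
    if not 0 <= minute <= 59:
        raise ValueError("minute must be between 0 and 59")
    return f"{minute} {hour} * * *"


def schedule_daily_refresh(dataset, hour, minute=0):
    """Create a daily refresh policy on any object exposing create_refresh_policy."""
    return dataset.create_refresh_policy(daily_cron(hour, minute))


expression = daily_cron(6, 30)
# schedule_daily_refresh(dataset, 6, 30)  # needs an authenticated Dataset
```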