abacusai.dataset

Classes

Dataset

A dataset reference

Module Contents

class abacusai.dataset.Dataset(client, datasetId=None, sourceType=None, dataSource=None, createdAt=None, ignoreBefore=None, ephemeral=None, lookbackDays=None, databaseConnectorId=None, databaseConnectorConfig=None, connectorType=None, featureGroupTableName=None, applicationConnectorId=None, applicationConnectorConfig=None, incremental=None, isDocumentset=None, extractBoundingBoxes=None, mergeFileSchemas=None, referenceOnlyDocumentset=None, versionLimit=None, schema={}, refreshSchedules={}, latestDatasetVersion={}, parsingConfig={}, documentProcessingConfig={}, attachmentParsingConfig={})

Bases: abacusai.return_class.AbstractApiClass

A dataset reference

Parameters:
  • client (ApiClient) – An authenticated API Client instance

  • datasetId (str) – The unique identifier of the dataset.

  • sourceType (str) – The source of the dataset: one of EXTERNAL_SERVICE, UPLOAD, or STREAMING.

  • dataSource (str) – Location of the data. It may be a URI, such as an S3 bucket, or a database table.

  • createdAt (str) – The timestamp at which this dataset was created.

  • ignoreBefore (str) – The timestamp at which all previous events are ignored when training.

  • ephemeral (bool) – The dataset is ephemeral and not used for training.

  • lookbackDays (int) – Specific to streaming datasets, this specifies how many days' worth of data to include when generating a snapshot. A value of 0 leaves this selection to the system.

  • databaseConnectorId (str) – The Database Connector used.

  • databaseConnectorConfig (dict) – The database connector query used to retrieve data.

  • connectorType (str) – The type of connector used to get this dataset: FILE or DATABASE.

  • featureGroupTableName (str) – The table name of the dataset’s feature group.

  • applicationConnectorId (str) – The Application Connector used.

  • applicationConnectorConfig (dict) – The application connector query used to retrieve data.

  • incremental (bool) – Whether the dataset is an incremental dataset.

  • isDocumentset (bool) – Whether the dataset is a documentset.

  • extractBoundingBoxes (bool) – Signifies whether to extract bounding boxes out of the documents. Only valid if is_documentset is True.

  • mergeFileSchemas (bool) – Whether the merge file schemas policy is enabled.

  • referenceOnlyDocumentset (bool) – Signifies whether to save the data reference only. Only valid if is_documentset is True.

  • versionLimit (int) – Version limit for the dataset.

  • latestDatasetVersion (DatasetVersion) – The latest version of this dataset.

  • schema (DatasetColumn) – List of resolved columns.

  • refreshSchedules (RefreshSchedule) – List of schedules that determine when the next version of the dataset will be created.

  • parsingConfig (ParsingConfig) – The parsing config used for the dataset.

  • documentProcessingConfig (DocumentProcessingConfig) – The document processing config used for the dataset (when is_documentset is True).

  • attachmentParsingConfig (AttachmentParsingConfig) – The attachment parsing config used for the dataset (e.g. for Salesforce attachment parsing).

dataset_id
source_type
data_source
created_at
ignore_before
ephemeral
lookback_days
database_connector_id
database_connector_config
connector_type
feature_group_table_name
application_connector_id
application_connector_config
incremental
is_documentset
extract_bounding_boxes
merge_file_schemas
reference_only_documentset
version_limit
schema
refresh_schedules
latest_dataset_version
parsing_config
document_processing_config
attachment_parsing_config
deprecated_keys
__repr__()
to_dict()

Get a dict representation of the parameters in this class

Returns:

The dict value representation of the class parameters

Return type:

dict
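As a small illustration of to_dict(), the sketch below serializes a dataset's fields to JSON, for example for logging. The helper name dataset_to_json is hypothetical, and passing default=str is an assumption made to cope with any values that the json module cannot encode on its own.

```python
import json


def dataset_to_json(dataset):
    """Serialize the fields reported by Dataset.to_dict() to a JSON string.

    `dataset` is expected to be an abacusai.dataset.Dataset instance, or any
    object exposing a compatible to_dict() method.
    """
    # default=str is a defensive assumption: values json cannot encode
    # directly are rendered with str() instead of raising TypeError.
    return json.dumps(dataset.to_dict(), indent=2, sort_keys=True, default=str)
```

A typical call would be print(dataset_to_json(dataset)) on a Dataset obtained from an authenticated client.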

create_version_from_file_connector(location=None, file_format=None, csv_delimiter=None, merge_file_schemas=None, parsing_config=None, sql_query=None)

Creates a new version of the specified dataset.

Parameters:
  • location (str) – External URI to import the dataset from. If not specified, the last location will be used.

  • file_format (str) – File format to be used. If not specified, the service will try to detect the file format.

  • csv_delimiter (str) – If the file format is CSV, use a specific CSV delimiter.

  • merge_file_schemas (bool) – Signifies if the merge file schema policy is enabled.

  • parsing_config (ParsingConfig) – Custom config for dataset parsing.

  • sql_query (str) – The SQL query to use when fetching data from the specified location. Use __TABLE__ as a placeholder for the table name. For example: "SELECT * FROM __TABLE__ WHERE event_date > '2021-01-01'". If not provided, the entire dataset from the specified location will be imported.

Returns:

The new Dataset Version created.

Return type:

DatasetVersion
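The sql_query parameter can be sketched as follows. The helper import_events_since is hypothetical, and event_date is only the illustrative column name from the parameter description above; the date is interpolated directly into the query, so it should come from a trusted source.

```python
def import_events_since(dataset, location, since_date):
    """Create a new dataset version that imports only rows after since_date.

    `dataset` is an abacusai.dataset.Dataset (or a compatible object).
    """
    # __TABLE__ is the placeholder that stands in for the table name.
    sql_query = f"SELECT * FROM __TABLE__ WHERE event_date > '{since_date}'"
    return dataset.create_version_from_file_connector(
        location=location,
        sql_query=sql_query,
    )
```

Leaving out file_format, csv_delimiter and the other optional arguments lets the service fall back to detection or to the previously used values, as described above.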

create_version_from_database_connector(object_name=None, columns=None, query_arguments=None, sql_query=None)

Creates a new version of the specified dataset.

Parameters:
  • object_name (str) – The name/ID of the object in the service to query. If not specified, the last name will be used.

  • columns (str) – The columns to query from the external service object. If not specified, the last columns will be used.

  • query_arguments (str) – Additional query arguments to filter the data. If not specified, the last arguments will be used.

  • sql_query (str) – The full SQL query to use when fetching data. If present, this parameter will override object_name, columns, and query_arguments.

Returns:

The new Dataset Version created.

Return type:

DatasetVersion
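Because sql_query overrides object_name, columns, and query_arguments, it can help to make that precedence explicit before calling the method. The helper below is a hypothetical sketch that builds the keyword arguments; it is not part of the abacusai client.

```python
def database_version_kwargs(object_name=None, columns=None,
                            query_arguments=None, sql_query=None):
    """Build keyword arguments for create_version_from_database_connector()."""
    if sql_query is not None:
        # sql_query overrides the other three, so send it on its own.
        return {"sql_query": sql_query}
    candidates = {
        "object_name": object_name,
        "columns": columns,
        "query_arguments": query_arguments,
    }
    # Values left as None are omitted so the last-used settings apply.
    return {key: value for key, value in candidates.items() if value is not None}
```

The result would then be passed as dataset.create_version_from_database_connector(**database_version_kwargs(object_name="orders")), where "orders" is a placeholder object name.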

create_version_from_application_connector(dataset_config=None)

Creates a new version of the specified dataset.

Parameters:

dataset_config (ApplicationConnectorDatasetConfig) – Dataset config for the application connector. If any of the fields are not specified, the last values will be used.

Returns:

The new Dataset Version created.

Return type:

DatasetVersion

create_version_from_upload(file_format=None)

Creates a new version of the specified dataset using a local file upload.

Parameters:

file_format (str) – File format to be used. If not specified, the service will attempt to detect the file format.

Returns:

Token to be used when uploading file parts.

Return type:

Upload
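A minimal sketch of an upload-based version follows. It assumes that the returned Upload object exposes an upload_file(file) method that accepts an open binary file; that method is documented on the Upload class rather than here, so check that reference before relying on it. The helper name upload_new_version is hypothetical.

```python
def upload_new_version(dataset, path, file_format=None):
    """Start an upload-based dataset version and send a local file.

    `dataset` is an abacusai.dataset.Dataset (or a compatible object).
    """
    upload = dataset.create_version_from_upload(file_format=file_format)
    with open(path, "rb") as handle:
        # Assumption: Upload.upload_file(file) takes a binary file object.
        return upload.upload_file(handle)
```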

create_version_from_document_reprocessing(document_processing_config=None)

Creates a new dataset version for a source docstore dataset with the provided document processing configuration. This does not re-import the data but uses the same data which is imported in the latest dataset version and only performs document processing on it.

Parameters:

document_processing_config (DatasetDocumentProcessingConfig) – The document processing configuration to use for the new dataset version. If not specified, the document processing configuration from the source dataset will be used.

Returns:

The new dataset version created.

Return type:

DatasetVersion

snapshot_streaming_data()

Snapshots the current data in the streaming dataset.

Parameters:

dataset_id (str) – The unique ID associated with the dataset. This is taken from the Dataset object; the method itself takes no arguments.

Returns:

The new Dataset Version created by taking a snapshot of the current data in the streaming dataset.

Return type:

DatasetVersion

set_column_data_type(column, data_type)

Set a Dataset’s column type.

Parameters:
  • column (str) – The name of the column.

  • data_type (DataType) – The type of the data in the column. Note: Some ColumnMappings may restrict the options or explicitly set the DataType.

Returns:

The dataset and schema after the data type has been set.

Return type:

Dataset

set_streaming_retention_policy(retention_hours=None, retention_row_count=None, ignore_records_before_timestamp=None)

Sets the streaming retention policy.

Parameters:
  • retention_hours (int) – Number of hours to retain streamed data in memory.

  • retention_row_count (int) – Number of rows of streamed data to retain in memory.

  • ignore_records_before_timestamp (int) – The Unix timestamp (in seconds) to use as a cutoff; all entries sent before it are ignored.
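Since ignore_records_before_timestamp is expressed in Unix seconds, a conversion from a datetime is usually needed. The sketch below uses two hypothetical helpers, to_unix_seconds and keep_last_week, and picks a seven-day retention window purely as an example.

```python
from datetime import datetime, timezone


def to_unix_seconds(moment):
    """Convert a timezone-aware datetime to whole Unix seconds."""
    if moment.tzinfo is None:
        # Naive datetimes would be interpreted in local time; reject them.
        raise ValueError("moment must be timezone-aware")
    return int(moment.timestamp())


def keep_last_week(dataset, cutoff):
    """Retain 7 days of streamed data and ignore records sent before cutoff."""
    dataset.set_streaming_retention_policy(
        retention_hours=7 * 24,  # 168 hours
        ignore_records_before_timestamp=to_unix_seconds(cutoff),
    )
```

For example, to_unix_seconds(datetime(2021, 1, 1, tzinfo=timezone.utc)) gives 1609459200.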

get_schema()

Retrieves the column schema of a dataset.

Parameters:

dataset_id (str) – Unique string identifier of the dataset schema to look up. This is taken from the Dataset object; the method itself takes no arguments.

Returns:

List of column schema definitions.

Return type:

list[DatasetColumn]
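The returned list can be reduced to a quick overview, as in the sketch below. The attribute names name and data_type are assumptions about DatasetColumn, which is documented separately; getattr with a default keeps the hypothetical helper schema_summary from failing if they differ.

```python
def schema_summary(dataset):
    """Return (column name, data type) pairs for a dataset's schema."""
    # Assumption: each DatasetColumn has `name` and `data_type` attributes.
    return [
        (getattr(column, "name", None), getattr(column, "data_type", None))
        for column in dataset.get_schema()
    ]
```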

set_database_connector_config(database_connector_id, object_name=None, columns=None, query_arguments=None, sql_query=None)

Sets database connector config for a dataset. This method is currently only supported for streaming datasets.

Parameters:
  • database_connector_id (str) – Unique String Identifier of the Database Connector to import the dataset from.

  • object_name (str) – If applicable, the name/ID of the object in the service to query.

  • columns (str) – The columns to query from the external service object.

  • query_arguments (str) – Additional query arguments to filter the data.

  • sql_query (str) – The full SQL query to use when fetching data. If present, this parameter will override object_name, columns and query_arguments.

update_version_limit(version_limit)

Updates the version limit for the specified dataset.

Parameters:

version_limit (int) – The maximum number of versions permitted for the dataset. Once this limit is exceeded, the oldest versions will be purged in a First-In-First-Out (FIFO) order.

Returns:

The updated dataset.

Return type:

Dataset
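The FIFO purge described for version_limit can be pictured with a purely local simulation. This is not how the service is implemented; simulate_version_limit is a hypothetical helper that only shows which version IDs would survive under the stated rule.

```python
from collections import deque


def simulate_version_limit(version_ids, version_limit):
    """Return the version IDs that survive a FIFO purge, oldest first."""
    # A deque with maxlen discards items from the opposite end once full,
    # which mirrors "the oldest version is purged first".
    retained = deque(maxlen=version_limit)
    for version_id in version_ids:  # assumed to be in creation order
        retained.append(version_id)
    return list(retained)
```

With four versions created in order and a limit of 3, simulate_version_limit(["v1", "v2", "v3", "v4"], 3) returns ["v2", "v3", "v4"].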

refresh()

Calls describe and refreshes the current object’s fields.

Returns:

The current object

Return type:

Dataset

describe()

Retrieves a full description of the specified dataset, with attributes such as its ID, name, source type, etc.

Parameters:

dataset_id (str) – The unique ID associated with the dataset. This is taken from the Dataset object; the method itself takes no arguments.

Returns:

The dataset.

Return type:

Dataset

list_versions(limit=100, start_after_version=None)

Retrieves a list of all dataset versions for the specified dataset.

Parameters:
  • limit (int) – The maximum length of the list of all dataset versions.

  • start_after_version (str) – The ID of the version after which the list starts.

Returns:

A list of dataset versions.

Return type:

list[DatasetVersion]
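The limit and start_after_version parameters allow paging through long version histories. The generator below is a sketch under one assumption: that each DatasetVersion exposes its ID as a dataset_version attribute (see the DatasetVersion reference). The name iter_all_versions is hypothetical.

```python
def iter_all_versions(dataset, page_size=100):
    """Yield every dataset version, fetching page_size versions per call."""
    start_after = None
    while True:
        page = dataset.list_versions(
            limit=page_size,
            start_after_version=start_after,
        )
        if not page:
            return
        yield from page
        if len(page) < page_size:
            # A short page means there is nothing left to fetch.
            return
        # Assumption: the version ID is stored in `dataset_version`.
        start_after = page[-1].dataset_version
```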

delete()

Deletes the specified dataset from the organization.

Parameters:

dataset_id (str) – Unique string identifier of the dataset to delete. This is taken from the Dataset object; the method itself takes no arguments.

wait_for_import(timeout=900)

Blocks until the dataset is imported.

Parameters:

timeout (int) – The time allowed for the call to finish. If it does not finish within the allotted time, the call times out.

wait_for_inspection(timeout=None)

Blocks until the dataset is completely inspected.

Parameters:

timeout (int) – The time allowed for the call to finish. If it does not finish within the allotted time, the call times out.

get_status()

Gets the status of the latest dataset version.

Returns:

A string describing the status of a dataset (importing, inspecting, complete, etc.).

Return type:

str
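wait_for_import() and get_status() are often used together after a new version is created. The sketch below assumes that a finished import is reported as the string "COMPLETE" (compared case-insensitively); the exact status values are not listed here, so treat that comparison as an assumption. The helper name wait_and_check is hypothetical.

```python
def wait_and_check(dataset, timeout=900):
    """Wait for the import to finish, then return (status, is_complete)."""
    # Blocks until the import finishes or the timeout is reached.
    dataset.wait_for_import(timeout=timeout)
    status = dataset.get_status()
    # Assumption: a successful import is reported as "COMPLETE".
    return status, status.upper() == "COMPLETE"
```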

describe_feature_group()

Gets the feature group attached to the dataset.

Returns:

A feature group object.

Return type:

FeatureGroup

create_refresh_policy(cron)

Creates a refresh policy for the dataset.

Parameters:

cron (str) – A cron-style string that sets the refresh time.

Returns:

The refresh policy object.

Return type:

RefreshPolicy
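A cron-style string for a once-a-day refresh can be built as sketched below. This assumes the conventional five-field cron layout (minute, hour, day of month, month, day of week); the time zone in which the service interprets the schedule is not stated here. The helper name daily_cron is hypothetical.

```python
def daily_cron(hour, minute=0):
    """Build a five-field cron expression that fires once a day."""
    if not (0 <= hour <= 23 and 0 <= minute <= 59):
        raise ValueError("hour must be 0-23 and minute must be 0-59")
    # Field order: minute hour day-of-month month day-of-week
    return f"{minute} {hour} * * *"
```

For instance, daily_cron(2, 30) produces "30 2 * * *", which could then be passed as dataset.create_refresh_policy(daily_cron(2, 30)).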

list_refresh_policies()

Gets the refresh policies in a list.

Returns:

A list of refresh policy objects.

Return type:

list[RefreshPolicy]