abacusai.api_class.feature_group

Classes

SamplingConfig

An abstract class for the sampling config of a feature group

NSamplingConfig

Samples a fixed number of distinct key-column values, or a fixed number of rows if no key columns are specified.

PercentSamplingConfig

Samples a fraction of the distinct values of the feature group.

_SamplingConfigFactory

Helper class that provides a standard way to create an ABC using inheritance.

MergeConfig

An abstract class for the merge config of a feature group

LastNMergeConfig

Merge the last N chunks/versions of an incremental dataset.

TimeWindowMergeConfig

Merge rows within a given time window of the most recent timestamp.

_MergeConfigFactory

Helper class that provides a standard way to create an ABC using inheritance.

OperatorConfig

Configuration for a template Feature Group Operation

UnpivotConfig

Unpivot Columns in a FeatureGroup.

MarkdownConfig

Transform an input column to a markdown column.

CrawlerTransformConfig

Transform an input column of URLs to HTML text.

ExtractDocumentDataConfig

Extracts data from documents.

DataGenerationConfig

Generate synthetic data using a model for finetuning an LLM.

UnionTransformConfig

Takes the union of the current feature group with one or more selected feature groups of the same type.

_OperatorConfigFactory

A class to select and return the correct type of Operator Config based on a serialized OperatorConfig instance.

Module Contents

class abacusai.api_class.feature_group.SamplingConfig

Bases: abacusai.api_class.abstract.ApiClass

An abstract class for the sampling config of a feature group

sampling_method: abacusai.api_class.enums.SamplingMethodType
classmethod _get_builder()
__post_init__()
class abacusai.api_class.feature_group.NSamplingConfig

Bases: SamplingConfig

Samples a fixed number of distinct key-column values, or a fixed number of rows if no key columns are specified.

Parameters:
  • sample_count (int) – The number of rows to include in the sample, or the number of distinct key values when key_columns is specified.

  • key_columns (List[str]) – The feature(s) to use as the key(s) when sampling

sample_count: int
key_columns: List[str]
__post_init__()
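
The sampling rule described for NSamplingConfig can be illustrated with a small standalone sketch. It is not the library's implementation: the function sample_n and the list-of-dicts row format are hypothetical, and because this reference does not say how rows or keys are chosen, the sketch simply takes them in first-seen order.

```python
from typing import Any, Dict, List, Optional

Row = Dict[str, Any]


def sample_n(rows: List[Row], sample_count: int,
             key_columns: Optional[List[str]] = None) -> List[Row]:
    """Hypothetical helper mirroring the documented meaning of sample_count."""
    if not key_columns:
        # No key columns: sample_count is a plain row count.
        return rows[:sample_count]

    # With key columns: keep every row whose key is among the first
    # sample_count distinct key values encountered (assumed ordering).
    kept_keys = {}
    for row in rows:
        key = tuple(row[c] for c in key_columns)
        if key not in kept_keys and len(kept_keys) < sample_count:
            kept_keys[key] = True
    return [r for r in rows if tuple(r[c] for c in key_columns) in kept_keys]


rows = [
    {"user": "a", "v": 1},
    {"user": "b", "v": 2},
    {"user": "a", "v": 3},
    {"user": "c", "v": 4},
]
by_key = sample_n(rows, 2, ["user"])   # all rows for users "a" and "b"
by_row = sample_n(rows, 2)             # the first two rows
```

Note that sampling by key can return more than sample_count rows, since every row sharing a selected key is kept.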
class abacusai.api_class.feature_group.PercentSamplingConfig

Bases: SamplingConfig

Samples a fraction of the distinct values of the feature group.

Parameters:
  • sample_percent (float) – The percentage of the rows to sample

  • key_columns (List[str]) – The feature(s) to use as the key(s) when sampling

sample_percent: float
key_columns: List[str]
__post_init__()
class abacusai.api_class.feature_group._SamplingConfigFactory

Bases: abacusai.api_class.abstract._ApiClassFactory

Helper class that provides a standard way to create an ABC using inheritance.

config_class_key = 'sampling_method'
config_abstract_class
config_class_map
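
To show how the three attributes above could work together, here is a minimal standalone sketch of key-based dispatch. It does not reproduce the internals of _ApiClassFactory: the Sketch* classes, the from_dict method shown here, and the string keys 'N_SAMPLING' and 'PERCENT_SAMPLING' are assumptions made for illustration only.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class SketchNSampling:
    # Stand-in for a concrete sampling config (hypothetical).
    sample_count: int
    key_columns: List[str] = field(default_factory=list)


@dataclass
class SketchPercentSampling:
    # Stand-in for a concrete sampling config (hypothetical).
    sample_percent: float
    key_columns: List[str] = field(default_factory=list)


class SketchSamplingFactory:
    # Name of the field in the serialized config that selects the class.
    config_class_key = 'sampling_method'
    # Maps the value of that field to a concrete class (assumed keys).
    config_class_map = {
        'N_SAMPLING': SketchNSampling,
        'PERCENT_SAMPLING': SketchPercentSampling,
    }

    @classmethod
    def from_dict(cls, config: dict):
        values = dict(config)  # copy so the caller's dict is not mutated
        method = values.pop(cls.config_class_key, None)
        config_class = cls.config_class_map.get(method)
        if config_class is None:
            raise ValueError(f'Unknown {cls.config_class_key}: {method!r}')
        return config_class(**values)


cfg = SketchSamplingFactory.from_dict(
    {'sampling_method': 'N_SAMPLING', 'sample_count': 100, 'key_columns': ['user_id']}
)
```

The same pattern would apply to the merge and operator factories, which differ only in the key they dispatch on.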
class abacusai.api_class.feature_group.MergeConfig

Bases: abacusai.api_class.abstract.ApiClass

An abstract class for the merge config of a feature group

merge_mode: abacusai.api_class.enums.MergeMode
classmethod _get_builder()
__post_init__()
class abacusai.api_class.feature_group.LastNMergeConfig

Bases: MergeConfig

Merge the last N chunks/versions of an incremental dataset.

Parameters:
  • num_versions (int) – The number of versions to merge. num_versions == 0 means merge all versions.

  • include_version_timestamp_column (bool) – If set, include a column with the creation timestamp of the source feature group versions.

num_versions: int
include_version_timestamp_column: bool
__post_init__()
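
The effect of num_versions, including the special case num_versions == 0, can be sketched in plain Python. This is an illustration under stated assumptions rather than the library's merge logic: merge_last_n is a hypothetical helper, versions are modelled as (creation timestamp, rows) pairs ordered oldest to newest, and the column name version_timestamp is invented for the example.

```python
from typing import Any, Dict, List, Tuple

Row = Dict[str, Any]
Version = Tuple[int, List[Row]]  # (creation timestamp, rows); assumed layout


def merge_last_n(versions: List[Version], num_versions: int,
                 include_version_timestamp_column: bool = False) -> List[Row]:
    """Hypothetical helper: concatenate rows from the last N versions."""
    # num_versions == 0 means merge all versions.
    selected = versions if num_versions == 0 else versions[-num_versions:]
    merged = []
    for created_at, rows in selected:
        for row in rows:
            out = dict(row)
            if include_version_timestamp_column:
                # Hypothetical column name, chosen for this sketch only.
                out["version_timestamp"] = created_at
            merged.append(out)
    return merged


versions = [(1, [{"x": 1}]), (2, [{"x": 2}]), (3, [{"x": 3}])]
last_two = merge_last_n(versions, 2)
everything = merge_last_n(versions, 0)
stamped = merge_last_n(versions, 1, include_version_timestamp_column=True)
```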
class abacusai.api_class.feature_group.TimeWindowMergeConfig

Bases: MergeConfig

Merge rows within a given time window of the most recent timestamp.

Parameters:
  • feature_name (str) – Time based column to index on

  • time_window_size_ms (int) – Range of merged rows will be [MAX_TIME - time_window_size_ms, MAX_TIME]

  • include_version_timestamp_column (bool) – If set, include a column with the creation timestamp of the source feature group versions.

feature_name: str
time_window_size_ms: int
include_version_timestamp_column: bool
__post_init__()
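
The range [MAX_TIME - time_window_size_ms, MAX_TIME] given for time_window_size_ms can be made concrete with a short standalone sketch. The helper merge_time_window is hypothetical, and the sketch assumes the time column holds integer epoch milliseconds; the column type the service actually expects is not stated in this reference.

```python
from typing import Any, Dict, List

Row = Dict[str, Any]


def merge_time_window(rows: List[Row], feature_name: str,
                      time_window_size_ms: int) -> List[Row]:
    """Hypothetical helper: keep rows inside the window ending at MAX_TIME."""
    if not rows:
        return []
    # MAX_TIME is the most recent timestamp in the time-based column.
    max_time = max(row[feature_name] for row in rows)
    lower_bound = max_time - time_window_size_ms
    # Both ends of the range are inclusive, matching the documented interval.
    return [row for row in rows if lower_bound <= row[feature_name] <= max_time]


events = [
    {"ts": 1000, "event": "old"},
    {"ts": 4000, "event": "recent"},
    {"ts": 5000, "event": "latest"},
]
# Window of 1500 ms ending at ts=5000 covers [3500, 5000].
recent = merge_time_window(events, "ts", 1500)
```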
class abacusai.api_class.feature_group._MergeConfigFactory

Bases: abacusai.api_class.abstract._ApiClassFactory

Helper class that provides a standard way to create an ABC using inheritance.

config_class_key = 'merge_mode'
config_abstract_class
config_class_map
class abacusai.api_class.feature_group.OperatorConfig

Bases: abacusai.api_class.abstract.ApiClass

Configuration for a template Feature Group Operation

operator_type: abacusai.api_class.enums.OperatorType
classmethod _get_builder()
__post_init__()
class abacusai.api_class.feature_group.UnpivotConfig

Bases: OperatorConfig

Unpivot Columns in a FeatureGroup.

Parameters:
  • columns (List[str]) – Which columns to unpivot.

  • index_column (str) – Name of the new column containing the unpivoted column names as its values.

  • value_column (str) – Name of new column containing the row values that were unpivoted.

  • exclude (bool) – If True, the unpivoted columns are all the columns EXCEPT the ones in the columns argument. Default is False.

columns: List[str]
index_column: str
value_column: str
exclude: bool
__post_init__()
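
How the four parameters interact is easiest to see on a single row. The following standalone sketch is an illustration only: unpivot is a hypothetical helper operating on a list of dicts, not the operation the service runs, and details such as output ordering are assumptions.

```python
from typing import Any, Dict, List

Row = Dict[str, Any]


def unpivot(rows: List[Row], columns: List[str], index_column: str,
            value_column: str, exclude: bool = False) -> List[Row]:
    """Hypothetical helper: turn selected columns into (name, value) rows."""
    result = []
    for row in rows:
        if exclude:
            # Unpivot every column EXCEPT the ones listed in `columns`.
            unpivoted = [c for c in row if c not in columns]
        else:
            unpivoted = [c for c in columns if c in row]
        # Columns that are not unpivoted are repeated on every output row.
        kept = {k: v for k, v in row.items() if k not in unpivoted}
        for name in unpivoted:
            result.append({**kept, index_column: name, value_column: row[name]})
    return result


sales = [{"id": 1, "q1": 10, "q2": 20}]
long_form = unpivot(sales, ["q1", "q2"], "quarter", "amount")
same_result = unpivot(sales, ["id"], "quarter", "amount", exclude=True)
```

Both calls produce one output row per unpivoted column; the second reaches the same result by naming the column to keep instead of the columns to unpivot.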
class abacusai.api_class.feature_group.MarkdownConfig

Bases: OperatorConfig

Transform an input column to a markdown column.

Parameters:
  • input_column (str) – Name of input column to transform.

  • output_column (str) – Name of output column to store transformed data.

  • input_column_type (MarkdownOperatorInputType) – Type of input column to transform.

input_column: str
output_column: str
input_column_type: abacusai.api_class.enums.MarkdownOperatorInputType
__post_init__()
class abacusai.api_class.feature_group.CrawlerTransformConfig

Bases: OperatorConfig

Transform an input column of URLs to HTML text.

Parameters:
  • input_column (str) – Name of input column to transform.

  • output_column (str) – Name of output column to store transformed data.

  • depth_column (str) – Name of the column containing the crawl depth. Increasing depth explores more links, capturing more content.

  • disable_host_restriction (bool) – If True, will not restrict crawling to the same host.

  • honour_website_rules (bool) – If True, will respect robots.txt rules.

  • user_agent (str) – If provided, will use this user agent instead of randomly selecting one.

input_column: str
output_column: str
depth_column: str
input_column_type: str
crawl_depth: int
disable_host_restriction: bool
honour_website_rules: bool
user_agent: str
__post_init__()
class abacusai.api_class.feature_group.ExtractDocumentDataConfig

Bases: OperatorConfig

Extracts data from documents.

Parameters:
  • doc_id_column (str) – Name of input document ID column.

  • document_column (str) – Name of the input document column which contains the page infos. This column will be transformed to include the document processing config in the output feature group.

  • document_processing_config (DocumentProcessingConfig) – Document processing configuration.

doc_id_column: str
document_column: str
document_processing_config: abacusai.api_class.dataset.DocumentProcessingConfig
__post_init__()
class abacusai.api_class.feature_group.DataGenerationConfig

Bases: OperatorConfig

Generate synthetic data using a model for finetuning an LLM.

Parameters:
  • prompt_col (str) – Name of the input prompt column.

  • completion_col (str) – Name of the output completion column.

  • description_col (str) – Name of the description column.

  • id_col (str) – Name of the identifier column.

  • generation_instructions (str) – Instructions for the data generation model.

  • temperature (float) – Sampling temperature for the model.

  • fewshot_examples (int) – Number of fewshot examples used to prompt the model.

  • concurrency (int) – Number of concurrent processes.

  • examples_per_target (int) – Number of examples per target.

  • subset_size (Optional[int]) – Size of the subset to use for generation.

  • verify_response (bool) – Whether to verify the response.

  • token_budget (int) – Token budget for generation.

  • oversample (bool) – Whether to oversample the data.

  • documentation_char_limit (int) – Character limit for documentation.

  • frequency_penalty (float) – Penalty for frequency of token appearance.

  • model (str) – Model to use for data generation.

  • seed (Optional[int]) – Seed for random number generation.

prompt_col: str
completion_col: str
description_col: str
id_col: str
generation_instructions: str
temperature: float
fewshot_examples: int
concurrency: int
examples_per_target: int
subset_size: int
verify_response: bool
token_budget: int
oversample: bool
documentation_char_limit: int
frequency_penalty: float
model: str
seed: int
__post_init__()
class abacusai.api_class.feature_group.UnionTransformConfig

Bases: OperatorConfig

Takes the union of the current feature group with one or more selected feature groups of the same type.

Parameters:
  • feature_group_ids (List[str]) – List of feature group IDs to union with the source feature group.

  • drop_non_intersecting_columns (bool) – If True, drops columns that are not present in all feature groups. If False, fills missing columns with nulls.

feature_group_ids: List[str]
drop_non_intersecting_columns: bool
__post_init__()
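
The two behaviours of drop_non_intersecting_columns can be compared with a short standalone sketch. It is a hedged illustration, not the service's implementation: union_rows is a hypothetical helper, feature groups are modelled as lists of dicts rather than looked up by ID, and None stands in for a null value.

```python
from typing import Any, Dict, List

Row = Dict[str, Any]


def union_rows(source: List[Row], others: List[List[Row]],
               drop_non_intersecting_columns: bool) -> List[Row]:
    """Hypothetical helper: stack rows from several row sets."""
    groups = [source] + list(others)
    # Column names seen in each group.
    column_sets = [set().union(*(row.keys() for row in group)) for group in groups]
    if drop_non_intersecting_columns:
        # Keep only columns present in every group.
        columns = set.intersection(*column_sets)
    else:
        # Keep all columns; missing values are filled with None below.
        columns = set.union(*column_sets)
    ordered = sorted(columns)
    return [{c: row.get(c) for c in ordered} for group in groups for row in group]


group_a = [{"id": 1, "x": "a"}]
group_b = [{"id": 2, "y": "b"}]
common_only = union_rows(group_a, [group_b], drop_non_intersecting_columns=True)
null_filled = union_rows(group_a, [group_b], drop_non_intersecting_columns=False)
```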
class abacusai.api_class.feature_group._OperatorConfigFactory

Bases: abacusai.api_class.abstract._ApiClassFactory

A class to select and return the correct type of Operator Config based on a serialized OperatorConfig instance.

config_abstract_class
config_class_key = 'operator_type'
config_class_map