
dataset.py


pycmtensor.dataset

The Dataset class converts a pandas DataFrame into an xarray dataset. It is initialized with the DataFrame and the name of the choice variable, and it provides methods to access and manipulate the data.

Dataset(df, choice, **kwargs)

Initialize the Dataset object with a pandas DataFrame and the name of the choice variable.

Parameters:

Name Type Description Default
df DataFrame

The pandas DataFrame object containing the dataset.

required
choice str

The name of the choice variable.

required
**kwargs optional

Additional keyword arguments to configure the dataset.

{}

Attributes:

Name Type Description
n int

The number of rows in the dataset.

x list[TensorVariable]

The list of input TensorVariable objects.

y TensorVariable

The output TensorVariable object.

scale dict

A dictionary of scaling factors for each variable.

choice str

The name of the choice variable.

ds dict

A dictionary of variable values.

split_frac float

The split fraction used to split the dataset.

idx_train list

The list of indices of the training dataset.

idx_valid list

The list of indices of the validation dataset.

n_train int

The size of the training dataset.

n_valid int

The size of the validation dataset.

Example

Example initialization of a Dataset object:

import pandas as pd
from pycmtensor.dataset import Dataset
ds = Dataset(df=pd.read_csv("datafile.csv", sep=","), choice="mode")
ds.split(frac=0.8)

Accessing attributes:

print(ds.choice)

Output:

mode

Raises:

Type Description
IndexError

If the choice variable is not found in the DataFrame columns.

__getitem__(key)

Returns the input or output variable(s) of the dataset object by their names.

Parameters:

Name Type Description Default
key str or list or tuple

The name(s) of the variable(s) to be accessed.

required

Returns:

Type Description
TensorVariable or list[TensorVariable]

The input or output variable(s) corresponding to the given name(s).

Raises:

Type Description
KeyError

If the given name(s) do not match any input or output variable.
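
The lookup behaviour described above can be pictured with a small stand-in class. This is only a sketch: `ToyDataset` and its placeholder strings are hypothetical and do not use pycmtensor or real TensorVariable objects. It shows a single name returning one item, a list or tuple of names returning a list in the same order, and an unknown name raising KeyError.

```python
class ToyDataset:
    """Hypothetical stand-in that mimics name-based lookup only."""

    def __init__(self, names):
        # Placeholder strings stand in for TensorVariable objects.
        self._vars = {name: f"<tensor {name}>" for name in names}

    def __getitem__(self, key):
        if isinstance(key, (list, tuple)):
            # A list or tuple of names returns a list, in the same order.
            return [self[k] for k in key]
        if key not in self._vars:
            raise KeyError(f"{key!r} is not a variable in the dataset")
        return self._vars[key]


toy = ToyDataset(["age", "location", "mode"])
single = toy["age"]
several = toy[["age", "location"]]
```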

drop(variables)

Drops variables from the dataset.

Parameters:

Name Type Description Default
variables list[str]

A list of the names of the variables to drop from the dataset.

required

Raises:

Type Description
KeyError

If any item in variables is not found in the dataset, or if an item is the choice variable.

Warning

Choice variable cannot be explicitly dropped.
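
The two error conditions can be illustrated with a plain list of column names. The helper `drop_columns` below is hypothetical and is not the library method: it returns a new list rather than modifying a Dataset, and it assumes both failures are reported as KeyError, as the Raises entry above states.

```python
def drop_columns(columns, variables, choice):
    """Hypothetical helper: return `columns` without `variables`."""
    for name in variables:
        if name == choice:
            # The choice variable cannot be dropped explicitly.
            raise KeyError(f"cannot drop the choice variable {name!r}")
        if name not in columns:
            raise KeyError(f"{name!r} not found in the dataset")
    return [c for c in columns if c not in variables]


columns = ["age", "location", "mode"]
remaining = drop_columns(columns, ["location"], choice="mode")
```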

scale_variable(variable, factor)

Multiply values of the variable by 1/factor.

Parameters:

Name Type Description Default
variable str or list[str]

The name of the variable, or a list of variable names.

required
factor float

The scaling factor.

required
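
As a quick check of the arithmetic, multiplying by 1/factor is the same as dividing by factor. The sketch below uses plain Python lists and made-up income values; the helper `scale_values` is hypothetical and only mirrors the calculation, not how the library stores or updates its data.

```python
def scale_values(values, factor):
    """Hypothetical helper: multiply each value by 1/factor."""
    return [v / factor for v in values]


income = [25000.0, 40000.0, 62500.0]  # made-up values for illustration
scaled = scale_values(income, 1000.0)
```

With a factor of 1000, an income recorded in single units ends up expressed in thousands.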

split(frac)

Method to split the dataset into training and validation subsets based on a given fraction.

Parameters:

Name Type Description Default
frac float

The fraction of the dataset to assign to the training set.

required

Returns:

Type Description
None

None

Notes
  • The actual splitting of the dataset is done during the training procedure or when invoking the train_dataset() or valid_dataset() methods.
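
To make the relationship between frac and the idx_train, idx_valid, n_train and n_valid attributes concrete, here is a minimal sketch. It assumes a sequential, unshuffled split that rounds the training size down; this page does not say whether the library shuffles or how it rounds, so `split_indices` is a hypothetical helper and not the library's procedure.

```python
def split_indices(n, frac):
    """Hypothetical helper: split range(n) into training and validation indices."""
    n_train = int(n * frac)  # assumption: round down
    idx = list(range(n))     # assumption: no shuffling
    return idx[:n_train], idx[n_train:]


idx_train, idx_valid = split_indices(10, 0.8)
n_train, n_valid = len(idx_train), len(idx_valid)
```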

train_dataset(variables, index=None, batch_size=None, shift=None)

Returns the training data arrays, either in full or as a slice, in the same order as the given list of variables.

Parameters:

Name Type Description Default
variables Union[list, str, TensorVariable]

A tensor, a label, or a list of tensors or labels.

required
index int

The start of the slice of the data array. If None, the full data array is returned.

None
batch_size int

The length of the slice. If None, the slice runs from index to N, where N is the length of the array.

None
shift int

The offset of the slice, between 0 and batch_size. If None, shift=0.

None

Returns:

Type Description
list

a list of array object(s) corresponding to the input variables

Example

How to retrieve data array from Dataset:

ds = Dataset(df, choice="choice")

# index the "age" and "location" data arrays using tensors
arrays = ds.train_dataset([ds["age"], ds["location"]])

# similar result, using variable names instead of tensors
arrays = ds.train_dataset(["age", "location"])
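
The interaction of index, batch_size and shift can be sketched on a plain list. This follows a literal reading of the parameter descriptions above and assumes the slice starts at index + shift. The page does not state how the library combines the two (it may, for instance, treat index as a batch number), so `take_slice` is a hypothetical helper and should be checked against the source before relying on it.

```python
def take_slice(data, index=None, batch_size=None, shift=None):
    """Hypothetical helper: literal reading of the documented parameters."""
    if index is None:
        return list(data)  # no index: return the full array
    if shift is None:
        shift = 0
    start = index + shift  # assumption: shift is added to the start
    if batch_size is None:
        return list(data[start:])  # from the start to the end of the array
    return list(data[start:start + batch_size])


data = list(range(10))
full = take_slice(data)
tail = take_slice(data, index=6)
batch = take_slice(data, index=2, batch_size=4)
shifted = take_slice(data, index=2, batch_size=4, shift=1)
```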

valid_dataset(variables, index=None, batch_size=None, shift=None)

Returns the validation data arrays, either in full or as a slice, in the same order as the given list of variables.

Parameters:

Name Type Description Default
variables Union[list, str, TensorVariable]

A tensor, a label, or a list of tensors or labels.

required
index int

The start of the slice of the data array. If None, the full data array is returned.

None
batch_size int

The length of the slice. If None, the slice runs from index to N, where N is the length of the array.

None
shift int

The offset of the slice, between 0 and batch_size. If None, shift=0.

None

Returns:

Type Description
list

a list of array object(s) corresponding to the input variables