dataset.py

`pycmtensor.dataset`

The `Dataset` class converts a pandas DataFrame into an xarray dataset. It is initialized with the DataFrame and the name of the choice variable, and it provides methods to access and manipulate the data.

`Dataset(df, choice, **kwargs)`

Initialize the Dataset object with a pandas DataFrame and the name of the choice variable.

Parameters:

Name	Type	Description	Default
`df`	`DataFrame`	The pandas DataFrame object containing the dataset.	required
`choice`	`str`	The name of the choice variable.	required
`**kwargs`	`optional`	Additional keyword arguments to configure the dataset.	`{}`

Attributes:

Name	Type	Description
`n`	`int`	The number of rows in the dataset.
`x`	`list[TensorVariable]`	The list of input TensorVariable objects.
`y`	`TensorVariable`	The output TensorVariable object.
`scale`	`dict`	A dictionary of scaling factors for each variable.
`choice`	`str`	The name of the choice variable.
`ds`	`dict`	A dictionary of variable values.
`split_frac`	`float`	The split fraction used to split the dataset.
`idx_train`	`list`	The list of indices of the training dataset.
`idx_valid`	`list`	The list of indices of the validation dataset.
`n_train`	`int`	The size of the training dataset.
`n_valid`	`int`	The size of the validation dataset.

Example

Example initialization of a Dataset object:

import pandas as pd
from pycmtensor.dataset import Dataset
ds = Dataset(df=pd.read_csv("datafile.csv", sep=","), choice="mode")
ds.split(frac=0.8)

Accessing attributes:

print(ds.choice)

Output:

mode

Raises:

Type	Description
`IndexError`	If the choice variable is not found in the DataFrame columns.

`__getitem__(key)`

Returns the input or output variable(s) of the dataset object by their names.

Parameters:

Name	Type	Description	Default
`key`	`str or list or tuple`	The name(s) of the variable(s) to be accessed.	required

Returns:

Type	Description
`TensorVariable or list[TensorVariable]`	The input or output variable(s) corresponding to the given name(s).

Raises:

Type	Description
`KeyError`	If the given name(s) do not match any input or output variable.
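
The lookup behaviour described above can be illustrated with a small stand-in class. This is only a sketch under stated assumptions: `ToyVariables` and its placeholder strings are hypothetical and are not part of pycmtensor, which returns `TensorVariable` objects instead.

```python
class ToyVariables:
    """Hypothetical stand-in for name-based lookup; not the pycmtensor code."""

    def __init__(self, names):
        # Map each variable name to a placeholder object.
        self._vars = {name: f"<tensor {name}>" for name in names}

    def __getitem__(self, key):
        # A list or tuple of names returns a list of variables.
        if isinstance(key, (list, tuple)):
            return [self[k] for k in key]
        # An unknown name raises KeyError, as in the table above.
        if key not in self._vars:
            raise KeyError(key)
        return self._vars[key]


toy = ToyVariables(["age", "location", "mode"])
single = toy["age"]
several = toy[["age", "location"]]
```

With a real `Dataset`, the equivalent calls would be `ds["age"]` and `ds[["age", "location"]]`.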

`drop(variables)`

Drops variables from the dataset.

Parameters:

Name	Type	Description	Default
`variables`	`list[str]`	The names of the variables to drop from the dataset.	required

Raises:

Type	Description
`KeyError`	If any item in `variables` is not found in the dataset, or if an item is the choice variable.

Warning

The choice variable cannot be explicitly dropped.
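
The two error conditions can be sketched with a plain list of column names. The helper `drop_columns` below is hypothetical and only mirrors the documented rules; it is not how pycmtensor implements `drop`.

```python
def drop_columns(columns, choice, variables):
    """Return `columns` without `variables` (hypothetical helper)."""
    for name in variables:
        # The choice variable may never be dropped.
        if name == choice:
            raise KeyError(f"cannot drop the choice variable: {name}")
        # Every requested name must exist.
        if name not in columns:
            raise KeyError(f"variable not found: {name}")
    return [c for c in columns if c not in variables]


columns = ["age", "location", "income", "mode"]
remaining = drop_columns(columns, choice="mode", variables=["income"])
```

Validating all names before removing anything keeps the column list unchanged when an error is raised; whether the library behaves the same way is an assumption.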

`scale_variable(variable, factor)`

Multiplies the values of the variable(s) by `1/factor`.

Parameters:

Name	Type	Description	Default
`variable`	`str or list[str]`	The name of the variable, or a list of variable names.	required
`factor`	`float`	The scaling factor.	required
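
Because the values are multiplied by `1/factor`, a factor of 10 turns 30 into 3. The arithmetic can be sketched on a plain list; `scale_values` and the sample data are hypothetical and stand in for a dataset column.

```python
def scale_values(values, factor):
    """Divide every value by `factor`, i.e. multiply by 1/factor."""
    return [v / factor for v in values]


# Hypothetical travel times in minutes.
travel_time = [30.0, 45.0, 120.0]
scaled = scale_values(travel_time, factor=10.0)
print(scaled)  # [3.0, 4.5, 12.0]
```

On a real `Dataset`, the corresponding call would be `ds.scale_variable("travel_time", 10.0)`, assuming a column with that name exists.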

`split(frac)`

Splits the dataset into training and validation subsets according to the given fraction.

Parameters:

Name	Type	Description	Default
`frac`	`float`	The fraction of the dataset assigned to the training set.	required

Returns:

Type	Description
`None`	None

Notes

The actual splitting of the dataset is done during the training procedure or when invoking the `train_dataset()` or `valid_dataset()` methods.
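
How a fraction relates to the `idx_train`, `idx_valid`, `n_train`, and `n_valid` attributes can be sketched as follows. This is a minimal illustration that assumes the first rows go to the training set and that the training size is rounded to the nearest integer; the library may select or shuffle indices differently. The function `split_indices` is hypothetical.

```python
def split_indices(n, frac):
    """Return (idx_train, idx_valid) for n rows; assumes a sequential split."""
    n_train = round(n * frac)
    idx_train = list(range(n_train))
    idx_valid = list(range(n_train, n))
    return idx_train, idx_valid


idx_train, idx_valid = split_indices(n=10, frac=0.8)
n_train, n_valid = len(idx_train), len(idx_valid)
print(n_train, n_valid)  # 8 2
```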

`train_dataset(variables, index=None, batch_size=None, shift=None)`

Returns the full training data array, or a slice of it, with the arrays ordered to match the list of variables.

Parameters:

Name	Type	Description	Default
`variables`	`Union[list, str, TensorVariable]`	A tensor, a label, a list of tensors, or a list of labels.	required
`index`	`int`	The start of the slice of the data array. If `None`, the full data array is returned.	`None`
`batch_size`	`int`	The length of the slice. If `None`, the slice runs from `index` to `N`, where `N` is the length of the array.	`None`
`shift`	`int`	The offset of the slice, between `0` and `batch_size`. If `None`, `shift=0`.	`None`

Returns:

Type	Description
`list`	A list of array object(s) corresponding to the input variables.

Example

Retrieving data arrays from a Dataset:

ds = Dataset(df, choice="choice")

# Retrieve the "age" and "location" training arrays using tensor variables
arrays = ds.train_dataset([ds["age"], ds["location"]])

# Equivalent call using variable labels
arrays = ds.train_dataset(["age", "location"])
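
The interaction of `index`, `batch_size`, and `shift` can be sketched on a plain list. This sketch reads the parameter table literally, treating `index` as the start position of the slice; the library may instead interpret `index` as a batch number, so treat the arithmetic as an assumption. The function `slice_rows` is hypothetical.

```python
def slice_rows(rows, index=None, batch_size=None, shift=None):
    """Slice `rows` following the documented defaults (hypothetical helper)."""
    # No index: return the full array.
    if index is None:
        return list(rows)
    # No shift: the offset defaults to 0.
    if shift is None:
        shift = 0
    start = index + shift
    # No batch size: run to the end of the array.
    end = len(rows) if batch_size is None else start + batch_size
    return list(rows[start:end])


rows = list(range(10))
full = slice_rows(rows)
batch = slice_rows(rows, index=2, batch_size=3)
shifted = slice_rows(rows, index=2, batch_size=3, shift=1)
tail = slice_rows(rows, index=6)
```

Here `batch` is `[2, 3, 4]`, `shifted` is `[3, 4, 5]`, and `tail` is `[6, 7, 8, 9]`.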

`valid_dataset(variables, index=None, batch_size=None, shift=None)`

Returns the full validation data array, or a slice of it, with the arrays ordered to match the list of variables.

Parameters:

Name	Type	Description	Default
`variables`	`Union[list, str, TensorVariable]`	A tensor, a label, a list of tensors, or a list of labels.	required
`index`	`int`	The start of the slice of the data array. If `None`, the full data array is returned.	`None`
`batch_size`	`int`	The length of the slice. If `None`, the slice runs from `index` to `N`, where `N` is the length of the array.	`None`
`shift`	`int`	The offset of the slice, between `0` and `batch_size`. If `None`, `shift=0`.	`None`

Returns:

Type	Description
`list`	A list of array object(s) corresponding to the input variables.
