dataset.py
pycmtensor.dataset
The Dataset class converts a pandas DataFrame into an xarray dataset. It is initialized with the DataFrame and the name of the choice variable, and provides methods to access and manipulate the data.
Dataset(df, choice, **kwargs)
Initialize the Dataset object with a pandas DataFrame and the name of the choice variable.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
df | DataFrame | The pandas DataFrame object containing the dataset. | required |
choice | str | The name of the choice variable. | required |
**kwargs | optional | Additional keyword arguments to configure the dataset. | {} |
Attributes:
Name | Type | Description |
---|---|---|
n | int | The number of rows in the dataset. |
x | list[TensorVariable] | The list of input TensorVariable objects. |
y | TensorVariable | The output TensorVariable object. |
scale | dict | A dictionary of scaling factors for each variable. |
choice | str | The name of the choice variable. |
ds | dict | A dictionary of variable values. |
split_frac | float | The split fraction used to split the dataset. |
idx_train | list | The list of indices of the training dataset. |
idx_valid | list | The list of indices of the validation dataset. |
n_train | int | The size of the training dataset. |
n_valid | int | The size of the validation dataset. |
Example
Example initialization of a Dataset object:

    import pandas as pd
    from pycmtensor.dataset import Dataset

    ds = Dataset(df=pd.read_csv("datafile.csv", sep=","), choice="mode")
    ds.split(frac=0.8)

Accessing attributes:

    print(ds.choice)

Output:

    mode
Raises:
Type | Description |
---|---|
IndexError | If the choice variable is not found in the DataFrame columns. |
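The check behind this error can be illustrated without the library. The following is a minimal sketch, not pycmtensor's implementation: `check_choice` is a hypothetical helper, and a plain list of column names stands in for the DataFrame columns.

```python
def check_choice(columns, choice):
    """Return `choice` if it names a column, otherwise raise IndexError."""
    if choice not in columns:
        # Mirrors the documented behaviour: a missing choice column is an IndexError.
        raise IndexError(f"{choice!r} not found in columns {list(columns)}")
    return choice


columns = ["age", "location", "mode"]  # hypothetical column names
print(check_choice(columns, "mode"))

try:
    check_choice(columns, "travel_mode")
except IndexError as err:
    print(f"IndexError: {err}")
```

Validating the choice column at construction time means a misspelled name fails immediately rather than during training.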
__getitem__(key)
Returns the input or output variable(s) of the dataset object by their names.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
key | str or list or tuple | The name(s) of the variable(s) to be accessed. | required |
Returns:
Type | Description |
---|---|
TensorVariable or list[TensorVariable] | The input or output variable(s) corresponding to the given name(s). |
Raises:
Type | Description |
---|---|
KeyError | If the given name(s) do not match any input or output variable. |
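The lookup described above, where a single name returns one variable and a list or tuple returns a list, can be sketched in plain Python. `VariableStore` is a hypothetical stand-in, not a pycmtensor class, and strings take the place of TensorVariable objects.

```python
class VariableStore:
    """Hypothetical name-based lookup; strings stand in for tensor variables."""

    def __init__(self, variables):
        self._variables = dict(variables)

    def __getitem__(self, key):
        # A list or tuple of names returns a list, in the order requested.
        if isinstance(key, (list, tuple)):
            return [self[name] for name in key]
        if key not in self._variables:
            raise KeyError(key)
        return self._variables[key]


store = VariableStore({"age": "<age tensor>", "location": "<location tensor>"})
single = store["age"]
several = store[["location", "age"]]
print(single)
print(several)
```

Delegating each element of a list back to the single-name path keeps the KeyError behaviour identical for both forms of `key`.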
drop(variables)
Drops the given variables from the dataset.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
variables | list[str] | list of variable names to drop from the dataset | required |
Raises:
Type | Description |
---|---|
KeyError | raises an error if any item in variables is not found in the dataset |
Warning
Choice variable cannot be explicitly dropped.
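A minimal sketch of these rules follows, using a plain dict in place of the dataset. `drop_variables` is a hypothetical helper. How the library reacts to an attempt to drop the choice variable is not stated here, so raising a ValueError is an assumption of this sketch.

```python
def drop_variables(data, variables, choice):
    """Return a copy of `data` without the named variables."""
    missing = [name for name in variables if name not in data]
    if missing:
        raise KeyError(f"variables not in dataset: {missing}")
    if choice in variables:
        # Assumption: the documentation only says the choice variable cannot
        # be dropped, so this sketch simply refuses with a ValueError.
        raise ValueError(f"cannot drop the choice variable {choice!r}")
    return {name: values for name, values in data.items() if name not in variables}


data = {"age": [21, 34], "location": [0, 1], "mode": [1, 2]}  # hypothetical data
trimmed = drop_variables(data, ["location"], choice="mode")
print(sorted(trimmed))  # ['age', 'mode']
```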
scale_variable(variable, factor)
Multiplies the values of the variable by 1/factor.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
variable | str or list[str] | the name of the variable or a list of variable names | required |
factor | float | the scaling factor | required |
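The arithmetic is simple enough to show directly. This is a sketch under stated assumptions rather than the library's code: `scale_values` is a hypothetical helper, a dict of lists stands in for the dataset, and keeping a running product of factors in a `scale` dict is a guess at how the `scale` attribute might be maintained.

```python
def scale_values(data, scale, variable, factor):
    """Divide each named variable by `factor` (that is, multiply by 1/factor)."""
    names = [variable] if isinstance(variable, str) else list(variable)
    for name in names:
        data[name] = [value / factor for value in data[name]]
        # Assumption: repeated scaling accumulates multiplicatively.
        scale[name] = scale.get(name, 1.0) * factor


data = {"cost": [100.0, 250.0], "time": [30.0, 45.0]}  # hypothetical values
scale = {}
scale_values(data, scale, "cost", 100.0)
print(data["cost"])  # [1.0, 2.5]
print(scale)         # {'cost': 100.0}
```

Note that a factor of 100 shrinks the values by a factor of 100; the factor is a divisor, not a multiplier.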
split(frac)
Splits the dataset into training and validation subsets based on a given fraction.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
frac | float | The fraction of the dataset assigned to the training set. | required |
Returns:
Type | Description |
---|---|
None | None |
Notes
- The actual splitting of the dataset is done during the training procedure or when invoking the train_dataset() or valid_dataset() methods.
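To make the relationship between `frac`, `n_train`, `n_valid`, `idx_train`, and `idx_valid` concrete, here is a minimal sketch. `split_indices` is a hypothetical helper, and the sequential, unshuffled split is an assumption; the library may order or shuffle the indices differently.

```python
def split_indices(n, frac):
    """Split range(n) into training and validation index lists."""
    n_train = round(n * frac)
    indices = list(range(n))
    # Assumption: the first n_train rows are training rows, the rest validation.
    return indices[:n_train], indices[n_train:]


idx_train, idx_valid = split_indices(10, 0.8)
print(len(idx_train), len(idx_valid))  # 8 2
print(idx_valid)                       # [8, 9]
```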
train_dataset(variables, index=None, batch_size=None, shift=None)
Returns a slice of the training data array, or the full array, with the sequence matching the list of variables.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
variables | Union[list, str, TensorVariable] | a tensor, label, or list of tensors or list of labels | required |
index | int | the start of the slice of the data array. If | None |
batch_size | int | length of the slice. If | None |
shift | int | the offset of the slice between | None |
Returns:
Type | Description |
---|---|
list | a list of array object(s) corresponding to the input variables |
Example
How to retrieve data arrays from a Dataset:

    ds = Dataset(df, choice="choice")
    # index "age" and "location" data arrays using tensor variables
    arrays = ds.train_dataset([ds["age"], ds["location"]])
    # similar result using labels
    arrays = ds.train_dataset(["age", "location"])
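The slicing parameters can be illustrated with plain lists. The sketch below follows one literal reading of the parameter descriptions (`index` as the start position, `batch_size` as the length, `shift` as an added offset) and assumes that omitting `index` or `batch_size` returns the full arrays. `take_slice` is a hypothetical helper, and the library may interpret these parameters differently, for example by treating `index` as a batch number.

```python
def take_slice(columns, index=None, batch_size=None, shift=None):
    """Return one list per column, sliced identically across all columns."""
    if index is None or batch_size is None:
        # Assumption: without slice arguments the full arrays are returned.
        return [list(column) for column in columns]
    start = index + (shift or 0)
    return [list(column[start:start + batch_size]) for column in columns]


age = [21, 34, 45, 52, 60, 71]  # hypothetical data
location = [0, 1, 1, 0, 2, 2]

full = take_slice([age, location])
batch = take_slice([age, location], index=2, batch_size=3)
shifted = take_slice([age, location], index=2, batch_size=3, shift=1)
print(batch)    # [[45, 52, 60], [1, 0, 2]]
print(shifted)  # [[52, 60, 71], [0, 2, 2]]
```

The returned list keeps the order of the requested columns, which matches the documented guarantee that the sequence follows the list of variables.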
valid_dataset(variables, index=None, batch_size=None, shift=None)
Returns a slice of the validation data array, or the full array, with the sequence matching the list of variables.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
variables | Union[list, str, TensorVariable] | a tensor, label, or list of tensors or list of labels | required |
index | int | the start of the slice of the data array. If | None |
batch_size | int | length of the slice. If | None |
shift | int | the offset of the slice between | None |
Returns:
Type | Description |
---|---|
list | a list of array object(s) corresponding to the input variables |