Data Splits
In most data science workflows, it's common to split data into different subsets for analysis and comparison. To support this, DataInterface subclasses let you define splits and divide your data according to the logic you provide in a DataSplit.
Split types
Column Name and Value
- Split data based on a column value.
- Supports inequality signs.
- Works with Pandas and Polars DataFrames.
Example
import polars as pl
from opsml import PolarsData, DataSplit, CardInfo

info = CardInfo(name="data", repository="mlops", contact="user@mlops.com")

df = pl.DataFrame(
    {
        "foo": [1, 2, 3, 4, 5, 6],
        "bar": ["a", "b", "c", "d", "e", "f"],
        "y": [1, 2, 3, 4, 5, 6],
    }
)

interface = PolarsData(
    info=info,
    data=df,
    data_splits=[
        # Rows where foo < 6 go to "train"
        DataSplit(label="train", column_name="foo", column_value=6, inequality="<"),
        # No inequality given: the row where foo is 6 goes to "test"
        DataSplit(label="test", column_name="foo", column_value=6),
    ],
)

splits = interface.split_data()
assert splits["train"].X.shape[0] == 5
assert splits["test"].X.shape[0] == 1
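The comparison logic behind a column-value split can be pictured with a small standard-library sketch. This is not opsml's implementation: the `OPERATORS` table and the `split_rows` helper are hypothetical names, and the fallback to equality when no inequality is given is an assumption that merely matches the behavior of the "test" split in the example above.

```python
import operator
from typing import Any, Callable, Dict, List, Optional

# Hypothetical lookup table; opsml's internal mapping may differ.
OPERATORS: Dict[str, Callable[[Any, Any], bool]] = {
    "<": operator.lt,
    "<=": operator.le,
    ">": operator.gt,
    ">=": operator.ge,
    "==": operator.eq,
}


def split_rows(
    rows: List[Dict[str, Any]],
    column_name: str,
    column_value: Any,
    inequality: Optional[str] = None,
) -> List[Dict[str, Any]]:
    """Keep the rows whose column compares true against column_value."""
    # Assumption: with no inequality sign, compare for equality.
    compare = OPERATORS[inequality] if inequality is not None else operator.eq
    return [row for row in rows if compare(row[column_name], column_value)]


# Same shape as the DataFrame in the example: foo runs from 1 to 6.
rows = [{"foo": i, "bar": ch} for i, ch in zip(range(1, 7), "abcdef")]
train_rows = split_rows(rows, "foo", 6, "<")
test_rows = split_rows(rows, "foo", 6)
```

Here `train_rows` holds the five rows with `foo` below 6 and `test_rows` holds the single row where `foo` equals 6, mirroring the two assertions in the example.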
Indices
- Split data based on pre-defined indices.
- Works with NDArray, pyarrow.Table, pandas.DataFrame and polars.DataFrame.
import numpy as np
from opsml import NumpyData, DataSplit, CardInfo

info = CardInfo(name="data", repository="mlops", contact="user@mlops.com")
data = np.random.rand(10, 10)

interface = NumpyData(
    info=info,
    data=data,
    data_splits=[
        DataSplit(label="train", indices=[0, 1, 5]),
    ],
)

splits = interface.split_data()
assert splits["train"].X.shape[0] == 3
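As a rough illustration of what an index-based split does, the sketch below selects rows by position from a plain list of lists. The `take_rows` helper is a hypothetical name for illustration only; how opsml performs the selection for each supported data type is not shown in this document.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def take_rows(rows: Sequence[T], indices: Sequence[int]) -> List[T]:
    """Return the rows at the given positions, in the order requested."""
    return [rows[i] for i in indices]


# A 10 x 10 grid standing in for the NumPy array in the example.
data = [[r * 10 + c for c in range(10)] for r in range(10)]
train_rows = take_rows(data, [0, 1, 5])
```

Three indices produce three rows, which is what the assertion in the example checks.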
Start and Stop Slicing
- Split data based on row slices with a start and stop index.
- Works with NDArray, pyarrow.Table, pandas.DataFrame and polars.DataFrame.
import numpy as np
from opsml import NumpyData, DataSplit, CardInfo

info = CardInfo(name="data", repository="mlops", contact="user@mlops.com")
data = np.random.rand(10, 10)

interface = NumpyData(
    info=info,
    data=data,
    data_splits=[
        DataSplit(label="train", start=0, stop=3),
    ],
)

splits = interface.split_data()
assert splits["train"].X.shape[0] == 3
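The assertion above (start=0, stop=3 yields three rows) is consistent with a half-open interval, as in ordinary Python slicing. The sketch below shows that reading with a plain list; `slice_rows` is a hypothetical helper, and treating `stop` as exclusive is an inference from the example rather than a documented guarantee.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def slice_rows(rows: Sequence[T], start: int, stop: int) -> List[T]:
    """Return rows in [start, stop): start is included, stop is excluded."""
    return list(rows[start:stop])


data = list(range(10))
train_rows = slice_rows(data, 0, 3)
holdout_rows = slice_rows(data, 3, 10)
```

Under this reading, adjacent splits can reuse a boundary value (3 here) without sharing a row.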
opsml.DataSplit
Bases: BaseModel
Creates a data split based on the provided logic.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
label | | Label for the split | required |
column_name | | Column name to split on | required |
column_value | | Column value to split on. Can be a string, float, int, or timestamp. | required |
inequality | | Inequality sign to split on | required |
start | | Start index to split on | required |
stop | | Stop index to split on | required |
indices | | List of indices to split on | required |
Source code in opsml/data/splitter.py
convert_to_list(value) classmethod
Pre-validator that converts indices to a list if they are not None.
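A minimal sketch of this kind of normalization, written as a plain function rather than a validator. The name `to_index_list` is hypothetical, and the set of input types the real method accepts is not stated here, so any iterable of integers is assumed.

```python
from typing import Iterable, List, Optional


def to_index_list(value: Optional[Iterable[int]]) -> Optional[List[int]]:
    """Convert any iterable of indices to a list; pass None through unchanged."""
    if value is None:
        return None
    return [int(i) for i in value]


from_tuple = to_index_list((0, 1, 5))
from_range = to_index_list(range(3))
```

Normalizing to a plain list of ints keeps the stored split definition independent of whichever container the caller happened to pass.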
Source code in opsml/data/splitter.py
serialize_column_value(column_value, _info)
Serializes a pd.Timestamp to str. This is used when saving the data split as a JSON file.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
column_value | Optional[Union[str, float, int, Timestamp]] | Column value to serialize | required |
Returns:
Type | Description |
---|---|
Optional[Union[str, float, int]] | Serialized column value |
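The reason for this step is that JSON has no timestamp type, so a timestamp has to become a string before the split can be written out. The sketch below uses the standard library's `datetime` as a stand-in for pd.Timestamp (which subclasses it). The `serialize_value` helper and the ISO 8601 output format are assumptions; the string format opsml actually produces is not shown in this document.

```python
import json
from datetime import datetime
from typing import Optional, Union

Serializable = Optional[Union[str, float, int]]


def serialize_value(column_value: Union[str, float, int, datetime, None]) -> Serializable:
    """Turn timestamps into strings; leave every other value untouched."""
    if isinstance(column_value, datetime):
        # Assumption: ISO 8601 text is an acceptable serialized form.
        return column_value.isoformat()
    return column_value


split = {"label": "train", "column_value": serialize_value(datetime(2024, 1, 1))}
payload = json.dumps(split)
```

Without the conversion, `json.dumps` would raise a `TypeError` on the raw `datetime` object.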
Source code in opsml/data/splitter.py
trim_whitespace(value) classmethod
Trims whitespace from inequality signs
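One plausible reason to trim is that a sign such as " <= " would otherwise fail an exact-match lookup. The sketch below illustrates that with a hypothetical `OPERATORS` table and a hypothetical `normalize_inequality` helper; it is not the library's code.

```python
import operator
from typing import Optional

# Hypothetical lookup table keyed by the exact sign string.
OPERATORS = {
    "<": operator.lt,
    "<=": operator.le,
    ">": operator.gt,
    ">=": operator.ge,
}


def normalize_inequality(value: Optional[str]) -> Optional[str]:
    """Strip surrounding whitespace so ' <= ' and '<=' map to the same key."""
    return value.strip() if value is not None else None


raw = " <= "
found_untrimmed = raw in OPERATORS  # the padded string is not a key
compare = OPERATORS[normalize_inequality(raw)]
result = compare(3, 3)
```

After trimming, the padded sign resolves to the same comparison as its clean form.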
Source code in opsml/data/splitter.py