Overview

DataCards store, version, and track data. Every DataCard requires a DataInterface; metadata is optional. See DataInterface for more information.

Creating a Card

# Data
from sklearn.datasets import load_linnerud
from sklearn.model_selection import train_test_split
import numpy as np

# Opsml
from opsml import CardInfo, DataCard, CardRegistry, DataSplit, PandasData

card_info = CardInfo(name="linnerrud", repository="opsml", contact="user@email.com")
data, target = load_linnerud(return_X_y=True, as_frame=True)
data["Pulse"] = target.Pulse

# Split indices
indices = np.arange(data.shape[0])

# usual train-test split
train_idx, test_idx = train_test_split(indices, test_size=0.2, train_size=None)

data_interface = PandasData(
    data=data,
    dependent_vars=["Pulse"],
    # define splits
    data_splits=[
        DataSplit(label="train", indices=train_idx),
        DataSplit(label="test", indices=test_idx),
    ],
)

data_card = DataCard(info=card_info, interface=data_interface)

# splits look good
splits = data_card.split_data()
print(splits["train"].X.head())

"""   
    Chins  Situps  Jumps
0    5.0   162.0   60.0
1    2.0   110.0   60.0
2   12.0   101.0  101.0
3   12.0   105.0   37.0
4   13.0   155.0   58.0
"""

data_registry = CardRegistry(registry_name="data")
data_registry.register_card(card=data_card)
print(data_card.version)
# > 1.0.0
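To make the index-based splitting in the example above more concrete, here is a minimal standard-library sketch of the idea: each split label maps to a list of row indices, and the dependent variables are separated from the features. `SplitResult` and `split_by_indices` are hypothetical names used only for illustration, and the Pulse values are illustrative; this is not how opsml implements `split_data`.

```python
from typing import Dict, List, NamedTuple, Sequence

Row = Dict[str, float]


class SplitResult(NamedTuple):
    """Hypothetical container mirroring the `.X` / `.y` access used above."""

    X: List[Row]
    y: List[Row]


def split_by_indices(
    rows: Sequence[Row],
    splits: Dict[str, Sequence[int]],
    dependent_vars: Sequence[str],
) -> Dict[str, SplitResult]:
    """Select rows per split label, then separate features from targets."""
    result: Dict[str, SplitResult] = {}
    for label, indices in splits.items():
        selected = [rows[i] for i in indices]
        # Features: every column that is not a dependent variable.
        X = [{k: v for k, v in row.items() if k not in dependent_vars} for row in selected]
        # Targets: only the dependent variables.
        y = [{k: row[k] for k in dependent_vars} for row in selected]
        result[label] = SplitResult(X=X, y=y)
    return result


# A few illustrative rows (Pulse values are made up for this sketch).
rows = [
    {"Chins": 5.0, "Situps": 162.0, "Jumps": 60.0, "Pulse": 50.0},
    {"Chins": 2.0, "Situps": 110.0, "Jumps": 60.0, "Pulse": 52.0},
    {"Chins": 12.0, "Situps": 101.0, "Jumps": 101.0, "Pulse": 58.0},
]
out = split_by_indices(rows, {"train": [0, 2], "test": [1]}, dependent_vars=["Pulse"])
print(out["train"].X)
print(out["test"].y)
```

Keeping the indices, rather than copies of the data, in the split definition means the same split can be reproduced from the stored data later.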

DataCard Args

name: str
Name for the data (Required)
repository: str
Repository the data belongs to (Required)
contact: str
Email to associate with the data (Required)
interface: DataInterface
DataInterface used to interact with the data (Required). See DataInterface for more information.
metadata: DataCardMetadata
Optional DataCardMetadata used to store metadata about the data. See DataCardMetadata for more information. If not provided, a default object is created. When a card is registered, the metadata is updated with the latest information.

Docs

opsml.DataCard

Bases: ArtifactCard

Create a DataCard from your data.

Parameters:

interface
Instance of DataInterface that contains data (required)
name
What to name the data (required, directly or through info)
repository
Repository that this data is associated with (required, directly or through info)
contact
Contact to associate with data card (required, directly or through info)
info
CardInfo object containing additional metadata. If provided, it will override any values provided for name, repository, contact, and version. Name, repository, and contact are required arguments for all cards. They can be provided directly or through a CardInfo object. (optional)
version
DataCard version (optional)
uid
Unique id assigned to the DataCard (optional)

Returns:

Type Description

DataCard
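
The parameter descriptions above say that name, repository, and contact may be passed directly or through a CardInfo object, and that values from CardInfo take precedence. The sketch below illustrates that precedence rule with standard-library stand-ins. `InfoSketch` and `resolve_card_args` are hypothetical names, not opsml classes or functions, and the choice that only non-None fields override is an assumption of this sketch.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class InfoSketch:
    """Hypothetical stand-in for CardInfo; not the opsml class."""

    name: Optional[str] = None
    repository: Optional[str] = None
    contact: Optional[str] = None
    version: Optional[str] = None


def resolve_card_args(
    info: Optional[InfoSketch] = None, **direct: Optional[str]
) -> Dict[str, Optional[str]]:
    """Merge directly passed values with an info object; info wins."""
    keys = ("name", "repository", "contact", "version")
    resolved: Dict[str, Optional[str]] = {key: direct.get(key) for key in keys}
    if info is not None:
        for key in keys:
            value = getattr(info, key)
            # Assumption: only fields that are actually set override.
            if value is not None:
                resolved[key] = value
    missing = [key for key in ("name", "repository", "contact") if not resolved[key]]
    if missing:
        raise ValueError(f"Missing required card arguments: {missing}")
    return resolved


info = InfoSketch(name="linnerrud", repository="opsml", contact="user@email.com")
resolved = resolve_card_args(info, name="ignored-name")
print(resolved["name"])
```

Here the directly passed name is replaced by the one from the info object, which matches the override behaviour described in the parameter list.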

Source code in opsml/cards/data.py
class DataCard(ArtifactCard):
    """Create a DataCard from your data.

    Args:
        interface:
            Instance of `DataInterface` that contains data
        name:
            What to name the data
        repository:
            Repository that this data is associated with
        contact:
            Contact to associate with data card
        info:
            `CardInfo` object containing additional metadata. If provided, it will override any
            values provided for `name`, `repository`, `contact`, and `version`.

            Name, repository, and contact are required arguments for all cards. They can be provided
            directly or through a `CardInfo` object.

        version:
            DataCard version
        uid:
            Unique id assigned to the DataCard

    Returns:
        DataCard

    """

    model_config = ConfigDict(extra="forbid")

    interface: SerializeAsAny[Union[DataInterface, Dataset]]
    metadata: DataCardMetadata = DataCardMetadata()

    def load_data(self, **kwargs: Union[str, int]) -> None:  # pylint: disable=differing-param-doc
        """
        Load data to interface

        Args:
            kwargs:
                Keyword arguments to pass to the data loader

            ---- Supported kwargs for ImageData and TextDataset ----

            split:
                Split to use for data. If not provided, then all data will be loaded.
                Only used for subclasses of `Dataset`.

            batch_size:
                What batch size to use when loading data. Only used for subclasses of `Dataset`.
                Defaults to 1000.

            chunk_size:
                How many files per batch to use when writing arrow back to local file.
                Defaults to 1000.

                Example:

                    - If batch_size=1000 and chunk_size=100, then the loaded batch will be split into
                    10 chunks to write in parallel. This is useful for large datasets.

        """
        from opsml.storage.card_loader import DataCardLoader

        DataCardLoader(self).load_data(**kwargs)

    def create_data_profile(self, bin_size: int = 20, compute_correlations: bool = False) -> Optional[DataProfile]:
        """
        Create data profile for the current data card

        Args:
            bin_size:
                Number of bins for histograms. Default is 20
            compute_correlations:
                Whether to compute correlations or not. Default is False
        """
        if isinstance(self.interface, DataInterface):
            return self.interface.create_data_profile(
                bin_size=bin_size,
                compute_correlations=compute_correlations,
            )

        logger.warning("Data profile is only supported for DataInterface subclasses. You have a Dataset subclass.")
        return None

    def load_data_profile(self) -> None:
        """
        Load data profile to interface
        """
        from opsml.storage.card_loader import DataCardLoader

        DataCardLoader(self).load_data_profile()

    def create_registry_record(self) -> Dict[str, Any]:
        """
        Creates required metadata for registering the current data card.
        Implemented with a DataRegistry object.

        Returns:
            Registry metadata
        """
        exclude_attr = {"data"}
        dumped_model = self.model_dump(exclude=exclude_attr)
        dumped_model["interface_type"] = self.interface.name()
        return dumped_model

    def add_info(self, info: Dict[str, Union[float, int, str]]) -> None:
        """
        Adds metadata to the existing DataCard metadata dictionary

        Args:
            info:
                Dictionary containing name (str) and value (float, int, str) pairs
                to add to the current metadata set
        """

        self.metadata.additional_info = {**info, **self.metadata.additional_info}

    def split_data(self) -> Dict[str, Data]:
        """Splits data interface according to data split logic"""

        assert isinstance(self.interface, DataInterface), "Splitting is only supported for DataInterface subclasses"
        if self.data is None:
            self.load_data()

        return self.interface.split_data()

    @property
    def data_splits(self) -> List[DataSplit]:
        """Returns data splits"""
        assert isinstance(self.interface, DataInterface), "Data splits are only supported for DataInterface subclasses"
        return self.interface.data_splits

    @property
    def data(self) -> Any:
        """Returns data"""
        assert isinstance(
            self.interface, DataInterface
        ), "Data attribute is only supported for DataInterface subclasses"
        return self.interface.data

    @property
    def data_profile(self) -> Any:
        """Returns data profile"""
        assert isinstance(self.interface, DataInterface), "Data profile is only supported for DataInterface subclasses"
        return self.interface.data_profile

    @property
    def card_type(self) -> str:
        return CardType.DATACARD.value
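
One detail of `add_info` in the listing above is easy to miss: it merges with `{**info, **self.metadata.additional_info}`, so when a key exists on both sides the value already stored in the metadata is kept. The snippet below reproduces only that dict-unpacking expression with plain dictionaries; the keys and values are made up for illustration.

```python
from typing import Dict, Union

InfoValue = Union[float, int, str]

# Stand-in for the card's existing additional_info (illustrative values).
existing: Dict[str, InfoValue] = {"source": "warehouse", "rows": 20}
# Stand-in for the `info` argument passed to add_info.
new_info: Dict[str, InfoValue] = {"rows": 25, "owner": "data-team"}

# Same expression shape as in add_info: the right-hand unpacking is applied
# last, so keys already present in the existing metadata win.
merged = {**new_info, **existing}
print(merged)
# {'rows': 20, 'owner': 'data-team', 'source': 'warehouse'}
```

In other words, the expression adds new keys but does not overwrite existing ones.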

card_type: str property

create_data_profile(bin_size=20, compute_correlations=False)

Create data profile for the current data card

Parameters:

bin_size: int
Number of bins for histograms. Default is 20.
compute_correlations: bool
Whether to compute correlations. Default is False.

Source code in opsml/cards/data.py
def create_data_profile(self, bin_size: int = 20, compute_correlations: bool = False) -> Optional[DataProfile]:
    """
    Create data profile for the current data card

    Args:
        bin_size:
            Number of bins for histograms. Default is 20
        compute_correlations:
            Whether to compute correlations or not. Default is False
    """
    if isinstance(self.interface, DataInterface):
        return self.interface.create_data_profile(
            bin_size=bin_size,
            compute_correlations=compute_correlations,
        )

    logger.warning("Data profile is only supported for DataInterface subclasses. You have a Dataset subclass.")
    return None
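
To show what a "number of bins" parameter such as `bin_size` controls, here is a small equal-width histogram sketch using only the standard library. `histogram_counts` is a hypothetical helper written for this illustration. It is an assumption about the general technique, not the profiling code that opsml runs.

```python
from typing import List, Sequence


def histogram_counts(values: Sequence[float], bin_size: int = 20) -> List[int]:
    """Count values in `bin_size` equal-width bins spanning min..max."""
    if bin_size < 1:
        raise ValueError("bin_size must be a positive integer")
    counts = [0] * bin_size
    if not values:
        return counts
    lo, hi = min(values), max(values)
    width = (hi - lo) / bin_size
    for value in values:
        if width == 0:
            index = 0  # all values identical: everything falls in the first bin
        else:
            # Clamp so the maximum value lands in the last bin.
            index = min(int((value - lo) / width), bin_size - 1)
        counts[index] += 1
    return counts


situps = [162.0, 110.0, 101.0, 105.0, 155.0]
counts = histogram_counts(situps, bin_size=4)
print(counts)
# [3, 0, 0, 2]
```

More bins give a finer view of the distribution at the cost of sparser counts per bin.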

split_data()

Splits data interface according to data split logic

Source code in opsml/cards/data.py
def split_data(self) -> Dict[str, Data]:
    """Splits data interface according to data split logic"""

    assert isinstance(self.interface, DataInterface), "Splitting is only supported for DataInterface subclasses"
    if self.data is None:
        self.load_data()

    return self.interface.split_data()

load_data(**kwargs)

Load data to interface

Parameters:

kwargs: Union[str, int]
Keyword arguments to pass to the data loader. Default is {}.

Supported kwargs for ImageData and TextDataset:

split
Split to use for data. If not provided, all data is loaded. Only used for subclasses of Dataset.
batch_size
Batch size to use when loading data. Only used for subclasses of Dataset. Defaults to 1000.
chunk_size
How many files per batch to use when writing arrow back to local file. Defaults to 1000.
Example: if batch_size=1000 and chunk_size=100, the loaded batch is split into 10 chunks that are written in parallel. This is useful for large datasets.

Source code in opsml/cards/data.py
def load_data(self, **kwargs: Union[str, int]) -> None:  # pylint: disable=differing-param-doc
    """
    Load data to interface

    Args:
        kwargs:
            Keyword arguments to pass to the data loader

        ---- Supported kwargs for ImageData and TextDataset ----

        split:
            Split to use for data. If not provided, then all data will be loaded.
            Only used for subclasses of `Dataset`.

        batch_size:
            What batch size to use when loading data. Only used for subclasses of `Dataset`.
            Defaults to 1000.

        chunk_size:
            How many files per batch to use when writing arrow back to local file.
            Defaults to 1000.

            Example:

                - If batch_size=1000 and chunk_size=100, then the loaded batch will be split into
                10 chunks to write in parallel. This is useful for large datasets.

    """
    from opsml.storage.card_loader import DataCardLoader

    DataCardLoader(self).load_data(**kwargs)
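
The batch_size and chunk_size example in the docstring above (a batch of 1000 split into 10 chunks of 100) comes down to simple slicing. The sketch below shows only that arithmetic with a hypothetical `iter_chunks` helper. It does not reproduce opsml's loader, and the parallel write step is left out.

```python
from typing import Iterator, List, Sequence, TypeVar

T = TypeVar("T")


def iter_chunks(batch: Sequence[T], chunk_size: int) -> Iterator[List[T]]:
    """Yield consecutive slices of `batch` with at most `chunk_size` items."""
    if chunk_size < 1:
        raise ValueError("chunk_size must be a positive integer")
    for start in range(0, len(batch), chunk_size):
        yield list(batch[start : start + chunk_size])


# One loaded batch of 1000 records, split into chunks of 100.
batch = list(range(1000))
chunks = list(iter_chunks(batch, chunk_size=100))
print(len(chunks), len(chunks[0]))
# 10 100
```

Each chunk could then be handed to a separate worker for writing, which is where a smaller chunk_size helps with large datasets.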