Overview¶

DataCards store, version, and track data. Every DataCard requires a DataInterface and accepts optional metadata. See DataInterface for more information.

Creating a Card¶

# Data
from sklearn.datasets import load_linnerud
from sklearn.model_selection import train_test_split
import numpy as np

# Opsml
from opsml import CardInfo, DataCard, CardRegistry, DataSplit, PandasData

card_info = CardInfo(name="linnerud", repository="opsml", contact="user@email.com")
data, target = load_linnerud(return_X_y=True, as_frame=True)
data["Pulse"] = target.Pulse

# Split indices
indices = np.arange(data.shape[0])

# standard train/test split on the row indices
train_idx, test_idx = train_test_split(indices, test_size=0.2, train_size=None)

data_interface = PandasData(
    data=data,
    dependent_vars=["Pulse"],
    # define splits
    data_splits=[
        DataSplit(label="train", indices=train_idx),
        DataSplit(label="test", indices=test_idx),
    ],
)

data_card = DataCard(info=card_info, interface=data_interface)

# splits look good
splits = data_card.split_data()
print(splits["train"].X.head())

"""   
    Chins  Situps  Jumps
0    5.0   162.0   60.0
1    2.0   110.0   60.0
2   12.0   101.0  101.0
3   12.0   105.0   37.0
4   13.0   155.0   58.0
"""

data_registry = CardRegistry(registry_name="data")
data_registry.register_card(card=data_card)
print(data_card.version)
# > 1.0.0

DataCard Args¶

name: str: Name for the data (Required)
repository: str: Repository the data belongs to (Required)
contact: str: Contact email to associate with the data (Required)
interface: DataInterface: DataInterface used to interact with the data. See DataInterface for more information.
metadata: DataCardMetadata: Optional DataCardMetadata used to store metadata about data. See DataCardMetadata for more information. If not provided, a default object is created. When registering a card, the metadata is updated with the latest information.
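
The parameter documentation further down states that a `CardInfo` object overrides any values passed directly for `name`, `repository`, and `contact`. The sketch below illustrates that precedence rule in plain Python. `InfoSketch` and `resolve_card_args` are hypothetical stand-ins written for this illustration, not opsml classes or functions, and the choice to skip unset (`None`) fields on the info object is an assumption.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class InfoSketch:
    """Hypothetical stand-in for CardInfo; not the opsml class."""

    name: Optional[str] = None
    repository: Optional[str] = None
    contact: Optional[str] = None


def resolve_card_args(
    name: Optional[str] = None,
    repository: Optional[str] = None,
    contact: Optional[str] = None,
    info: Optional[InfoSketch] = None,
) -> Dict[str, str]:
    """Merge direct arguments with an info object; values set on info win."""
    resolved = {"name": name, "repository": repository, "contact": contact}
    if info is not None:
        for key in resolved:
            value = getattr(info, key)
            # Assumption: an unset (None) field on info does not override.
            if value is not None:
                resolved[key] = value
    missing = [key for key, value in resolved.items() if value is None]
    if missing:
        raise ValueError(f"Missing required card arguments: {missing}")
    return resolved


args = resolve_card_args(
    name="draft-name",
    info=InfoSketch(name="linnerud", repository="opsml", contact="user@email.com"),
)
print(args["name"])  # the value from the info object replaces "draft-name"
```

In this sketch, omitting any of the three required values in both places raises a `ValueError`, which mirrors the statement that name, repository, and contact are required for all cards.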

Docs¶

`opsml.DataCard` ¶

Bases: ArtifactCard

Create a DataCard from your data.

Parameters:

Name	Description	Default
`interface`	Instance of `DataInterface` that contains data	required
`name`	What to name the data	required
`repository`	Repository that this data is associated with	required
`contact`	Contact to associate with data card	required
`info`	`CardInfo` object containing additional metadata. If provided, it will override any values provided for `name`, `repository`, `contact`, and `version`. Name, repository, and contact are required arguments for all cards. They can be provided directly or through a `CardInfo` object.	required
`version`	DataCard version	required
`uid`	Unique id assigned to the DataCard	required

Returns:

Type	Description
	DataCard

Source code in opsml/cards/data.py

class DataCard(ArtifactCard):
    """Create a DataCard from your data.

    Args:
        interface:
            Instance of `DataInterface` that contains data
        name:
            What to name the data
        repository:
            Repository that this data is associated with
        contact:
            Contact to associate with data card
        info:
            `CardInfo` object containing additional metadata. If provided, it will override any
            values provided for `name`, `repository`, `contact`, and `version`.

            Name, repository, and contact are required arguments for all cards. They can be provided
            directly or through a `CardInfo` object.

        version:
            DataCard version
        uid:
            Unique id assigned to the DataCard

    Returns:
        DataCard

    """

    model_config = ConfigDict(extra="forbid")

    interface: SerializeAsAny[Union[DataInterface, Dataset]]
    metadata: DataCardMetadata = DataCardMetadata()

    def load_data(self, **kwargs: Union[str, int]) -> None:  # pylint: disable=differing-param-doc
        """
        Load data to interface

        Args:
            kwargs:
                Keyword arguments to pass to the data loader

            ---- Supported kwargs for ImageData and TextDataset ----

            split:
                Split to use for data. If not provided, then all data will be loaded.
                Only used for subclasses of `Dataset`.

            batch_size:
                What batch size to use when loading data. Only used for subclasses of `Dataset`.
                Defaults to 1000.

            chunk_size:
                How many files per batch to use when writing arrow back to local file.
                Defaults to 1000.

                Example:

                    - If batch_size=1000 and chunk_size=100, then the loaded batch will be split into
                    10 chunks to write in parallel. This is useful for large datasets.

        """
        from opsml.storage.card_loader import DataCardLoader

        DataCardLoader(self).load_data(**kwargs)

    def create_data_profile(self, bin_size: int = 20, compute_correlations: bool = False) -> Optional[DataProfile]:
        """
        Create data profile for the current data card

        Args:
            bin_size:
                Number of bins for histograms. Default is 20
            compute_correlations:
                Whether to compute correlations or not. Default is False
        """
        if isinstance(self.interface, DataInterface):
            return self.interface.create_data_profile(
                bin_size=bin_size,
                compute_correlations=compute_correlations,
            )

        logger.warning("Data profile is only supported for DataInterface subclasses. You have a Dataset subclass.")
        return None

    def load_data_profile(self) -> None:
        """
        Load data profile to interface
        """
        from opsml.storage.card_loader import DataCardLoader

        DataCardLoader(self).load_data_profile()

    def create_registry_record(self) -> Dict[str, Any]:
        """
        Creates required metadata for registering the current data card.
        Implemented with a DataRegistry object.
            Returns:
            Registry metadata
        """
        exclude_attr = {"data"}
        dumped_model = self.model_dump(exclude=exclude_attr)
        dumped_model["interface_type"] = self.interface.name()
        return dumped_model

    def add_info(self, info: Dict[str, Union[float, int, str]]) -> None:
        """
        Adds metadata to the existing DataCard metadata dictionary

        Args:
            info:
                Dictionary containing name (str) and value (float, int, str) pairs
                to add to the current metadata set
        """

        self.metadata.additional_info = {**info, **self.metadata.additional_info}

    def split_data(self) -> Dict[str, Data]:
        """Splits data interface according to data split logic"""

        assert isinstance(self.interface, DataInterface), "Splitting is only support for DataInterface subclasses"
        if self.data is None:
            self.load_data()

        return self.interface.split_data()

    @property
    def data_splits(self) -> List[DataSplit]:
        """Returns data splits"""
        assert isinstance(self.interface, DataInterface), "Data splits are only supported for DataInterface subclasses"
        return self.interface.data_splits

    @property
    def data(self) -> Any:
        """Returns data"""
        assert isinstance(
            self.interface, DataInterface
        ), "Data attribute is only supported for DataInterface subclasses"
        return self.interface.data

    @property
    def data_profile(self) -> Any:
        """Returns data profile"""
        assert isinstance(self.interface, DataInterface), "Data profile is only supported for DataInterface subclasses"
        return self.interface.data_profile

    @property
    def card_type(self) -> str:
        return CardType.DATACARD.value
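
In the listing above, `add_info` merges with `{**info, **self.metadata.additional_info}`. With dict unpacking, later entries win, so a key that already exists in `additional_info` keeps its current value and only new keys are added. A minimal demonstration with plain dictionaries (the keys and values are illustrative):

```python
# Existing metadata already stored on the card (illustrative values).
existing = {"source": "sklearn", "rows": 20}

# New info passed to add_info; "rows" collides with an existing key.
new_info = {"rows": 999, "owner": "user@email.com"}

# Same merge expression as in add_info: the existing entries are unpacked
# last, so they take precedence over colliding keys in new_info.
merged = {**new_info, **existing}

print(merged["rows"])   # value from `existing` is kept
print(merged["owner"])  # new key is added
```

To change a value that is already present, the entry would need to be updated on `metadata.additional_info` directly rather than through this merge.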

`card_type: str` `property` ¶

`create_data_profile(bin_size=20, compute_correlations=False)` ¶

Create data profile for the current data card

Parameters:

Name	Type	Description	Default
`bin_size`	`int`	Number of bins for histograms. Default is 20	`20`
`compute_correlations`	`bool`	Whether to compute correlations or not. Default is False	`False`

Source code in opsml/cards/data.py

def create_data_profile(self, bin_size: int = 20, compute_correlations: bool = False) -> Optional[DataProfile]:
    """
    Create data profile for the current data card

    Args:
        bin_size:
            Number of bins for histograms. Default is 20
        compute_correlations:
            Whether to compute correlations or not. Default is False
    """
    if isinstance(self.interface, DataInterface):
        return self.interface.create_data_profile(
            bin_size=bin_size,
            compute_correlations=compute_correlations,
        )

    logger.warning("Data profile is only supported for DataInterface subclasses. You have a Dataset subclass.")
    return None
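
To make the `bin_size` parameter concrete, the sketch below counts values in a fixed number of equal-width bins. It is only an illustration of what "number of bins for histograms" means: `histogram_counts` is a hypothetical helper, the sample values are made up, and the profiling that opsml actually performs may compute bins differently.

```python
from typing import List


def histogram_counts(values: List[float], bin_size: int = 20) -> List[int]:
    """Count values in `bin_size` equal-width bins spanning min(values)..max(values)."""
    counts = [0] * bin_size
    if not values:
        return counts
    low, high = min(values), max(values)
    width = (high - low) / bin_size
    for value in values:
        if width == 0:
            # All values are identical, so everything lands in the first bin.
            index = 0
        else:
            # Clamp so the maximum value falls in the last bin.
            index = min(int((value - low) / width), bin_size - 1)
        counts[index] += 1
    return counts


pulse = [50.0, 52.0, 58.0, 62.0, 68.0, 60.0]  # hypothetical sample values
counts = histogram_counts(pulse, bin_size=3)
print(counts)  # [2, 2, 2]
```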

`split_data()` ¶

Splits data interface according to data split logic

Source code in opsml/cards/data.py

def split_data(self) -> Dict[str, Data]:
    """Splits data interface according to data split logic"""

    assert isinstance(self.interface, DataInterface), "Splitting is only support for DataInterface subclasses"
    if self.data is None:
        self.load_data()

    return self.interface.split_data()
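
The splitting logic itself lives in the interface, not in the card. As a rough mental model of the index-based splits used in "Creating a Card", the sketch below selects rows by index for each labeled split and separates the dependent variables from the features. `SplitSketch`, `SplitResult`, and `split_rows` are hypothetical names for this illustration, rows are plain dictionaries instead of a DataFrame, and the `Pulse` values are placeholders.

```python
from dataclasses import dataclass
from typing import Dict, List, Sequence


@dataclass
class SplitSketch:
    """Hypothetical stand-in for DataSplit(label=..., indices=...)."""

    label: str
    indices: Sequence[int]


@dataclass
class SplitResult:
    """Feature rows (X) and dependent-variable rows (y) for one split."""

    X: List[dict]
    y: List[dict]


def split_rows(
    rows: List[dict],
    dependent_vars: List[str],
    splits: List[SplitSketch],
) -> Dict[str, SplitResult]:
    """Return one SplitResult per split label, selecting rows by index."""
    result = {}
    for split in splits:
        selected = [rows[i] for i in split.indices]
        result[split.label] = SplitResult(
            X=[{k: v for k, v in row.items() if k not in dependent_vars} for row in selected],
            y=[{k: row[k] for k in dependent_vars} for row in selected],
        )
    return result


rows = [
    {"Chins": 5.0, "Situps": 162.0, "Jumps": 60.0, "Pulse": 1.0},
    {"Chins": 2.0, "Situps": 110.0, "Jumps": 60.0, "Pulse": 2.0},
    {"Chins": 12.0, "Situps": 101.0, "Jumps": 101.0, "Pulse": 3.0},
    {"Chins": 12.0, "Situps": 105.0, "Jumps": 37.0, "Pulse": 4.0},
]
splits = split_rows(
    rows,
    dependent_vars=["Pulse"],
    splits=[SplitSketch("train", [0, 1, 2]), SplitSketch("test", [3])],
)
print(splits["train"].X[0])  # feature columns only, without "Pulse"
```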

`load_data(**kwargs)` ¶

Load data to interface

Parameters:

Name	Type	Description	Default
`kwargs`	`Union[str, int]`	Keyword arguments to pass to the data loader	`{}`
`split`		Split to use for data. If not provided, all data is loaded. Only used for subclasses of `Dataset`.	optional
`batch_size`		Batch size to use when loading data. Only used for subclasses of `Dataset`.	`1000`
`chunk_size`		Number of files per batch to use when writing Arrow data back to a local file. For example, with `batch_size=1000` and `chunk_size=100`, the loaded batch is split into 10 chunks that are written in parallel, which is useful for large datasets.	`1000`

Source code in opsml/cards/data.py

def load_data(self, **kwargs: Union[str, int]) -> None:  # pylint: disable=differing-param-doc
    """
    Load data to interface

    Args:
        kwargs:
            Keyword arguments to pass to the data loader

        ---- Supported kwargs for ImageData and TextDataset ----

        split:
            Split to use for data. If not provided, then all data will be loaded.
            Only used for subclasses of `Dataset`.

        batch_size:
            What batch size to use when loading data. Only used for subclasses of `Dataset`.
            Defaults to 1000.

        chunk_size:
            How many files per batch to use when writing arrow back to local file.
            Defaults to 1000.

            Example:

                - If batch_size=1000 and chunk_size=100, then the loaded batch will be split into
                10 chunks to write in parallel. This is useful for large datasets.

    """
    from opsml.storage.card_loader import DataCardLoader

    DataCardLoader(self).load_data(**kwargs)
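
The relationship between `batch_size` and `chunk_size` described in the docstring above comes down to simple arithmetic: a batch of 1000 items with a chunk size of 100 yields 10 chunks. The sketch below shows only that chunking step. `iter_chunks` is a hypothetical helper, and the parallel write that the loader performs on each chunk is not modeled here.

```python
from typing import Iterator, List, Sequence


def iter_chunks(batch: Sequence[int], chunk_size: int) -> Iterator[List[int]]:
    """Yield consecutive slices of `batch` with at most `chunk_size` items each."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be a positive integer")
    for start in range(0, len(batch), chunk_size):
        yield list(batch[start:start + chunk_size])


# One loaded batch of 1000 items, standing in for batch_size=1000.
batch = list(range(1000))
chunks = list(iter_chunks(batch, chunk_size=100))
print(len(chunks), len(chunks[0]))  # 10 100
```

When the batch length is not a multiple of `chunk_size`, the final chunk in this sketch is simply shorter than the others.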
