Data Profile

Opsml DataInterfaces support ydata-profiling through an optional extra, which you can install with:

poetry add opsml[profiling]

To add a data profile to your interface, either supply a custom profile created with the ydata-profiling library or call the create_data_profile method after instantiating the DataInterface. You can also call create_data_profile from a DataCard after instantiation (see the example below). create_data_profile is optimized for performance and therefore omits certain analyses by default (interactions, character/word analysis, etc.). If you need more control over which analyses are run, create a custom report with ydata-profiling and pass it in through the data_profile argument.

Example of create_data_profile

# Data
from sklearn.datasets import load_linnerud

# Opsml
from opsml import CardInfo, DataCard, CardRegistry, PandasData

data, target = load_linnerud(return_X_y=True, as_frame=True)
data["Pulse"] = target.Pulse

interface = PandasData(data=data)

# create data profile from interface
interface.create_data_profile(sample_perc=0.5) # you can specify a sampling percentage between 0 and 1

card_info = CardInfo(name="linnerrud", repository="opsml", contact="user@email.com")
data_card = DataCard(info=card_info, data=data)

# this also works
data_card.create_data_profile(sample_perc=0.5) 

# if you'd like to view your report, you can export it to HTML or JSON
# Jupyter notebooks will render the HTML without needing to save (just call data_card.data_profile)
# data_card.data_profile.to_file("my_report.html")

# Registering card will automatically save the report and its html
data_registry = CardRegistry(registry_name="data")
data_registry.register_card(card=data_card)
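
To make the sample_perc argument more concrete, here is a minimal standard-library sketch of what drawing a fraction of rows before profiling could look like. The helper sample_rows and its seed parameter are hypothetical and are not part of Opsml; this page does not show how create_data_profile samples internally, so treat this only as an illustration of the idea.

```python
import random
from typing import List, Optional, Sequence, TypeVar

T = TypeVar("T")


def sample_rows(rows: Sequence[T], sample_perc: float, seed: Optional[int] = None) -> List[T]:
    """Return a random subset holding roughly `sample_perc` of the rows.

    Hypothetical helper for illustration only; not an Opsml function.
    """
    if not 0 < sample_perc <= 1:
        raise ValueError("sample_perc must be in the interval (0, 1]")
    if len(rows) == 0:
        return []
    # Keep at least one row so a non-empty input never yields an empty sample.
    n_keep = max(1, round(len(rows) * sample_perc))
    rng = random.Random(seed)
    # Sample without replacement, so no row is picked twice.
    return rng.sample(list(rows), n_keep)


rows = list(range(20))  # stand-in for the 20 rows of the Linnerud dataset
subset = sample_rows(rows, sample_perc=0.5, seed=0)
print(len(subset))
```

Sampling trades some precision in the reported statistics for a faster profile, which tends to matter more as the dataset grows.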

Example of providing your own custom data profile

from sklearn.datasets import load_linnerud
from ydata_profiling import ProfileReport
from opsml import PandasData

data, target = load_linnerud(return_X_y=True, as_frame=True)
data["Pulse"] = target.Pulse

data_profile = ProfileReport(data, title="Profiling Report")
interface = PandasData(data=data, data_profile=data_profile)

Comparing data profiles

You can also use Opsml's thin profiling wrapper to compare different data profiles.

from sklearn.datasets import load_linnerud
import numpy as np

# Opsml
from opsml import CardInfo, PandasData
from opsml.profile import DataProfiler

data, target = load_linnerud(return_X_y=True, as_frame=True)
data["Pulse"] = target.Pulse

# Simulate creating 1st DataCard
interface = PandasData(data=data)
interface.create_data_profile()

# Simulate creating 2nd DataCard
data2 = data * np.random.rand(data.shape[1])
card_info = CardInfo(name="linnerrud", repository="opsml", contact="user@email.com")
interface2 = PandasData(data=data2)
interface2.create_data_profile()
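
The example above only builds the two profiles. As a rough illustration of what comparing them means conceptually, the following standard-library sketch summarizes two small tables and reports the per-column differences. The functions summarize and compare_summaries and the sample values are hypothetical; this is not DataProfiler's comparison API, which is not shown in this page.

```python
import statistics
from typing import Dict, List

Table = Dict[str, List[float]]
Summary = Dict[str, Dict[str, float]]


def summarize(table: Table) -> Summary:
    """Compute a tiny per-column summary: mean and sample standard deviation."""
    return {
        column: {"mean": statistics.mean(values), "stdev": statistics.stdev(values)}
        for column, values in table.items()
    }


def compare_summaries(first: Summary, second: Summary) -> Summary:
    """Return second-minus-first differences for columns present in both summaries."""
    shared = first.keys() & second.keys()
    return {
        column: {stat: second[column][stat] - first[column][stat] for stat in first[column]}
        for column in sorted(shared)
    }


# Arbitrary values chosen for illustration only.
original = {"a": [1.0, 2.0, 3.0, 4.0], "b": [10.0, 20.0, 30.0, 40.0]}
scaled = {name: [value * 0.5 for value in values] for name, values in original.items()}

diff = compare_summaries(summarize(original), summarize(scaled))
print(diff["a"]["mean"])  # halving every value moves the mean from 2.5 to 1.25
```

A real profile comparison covers far more than two statistics, but the shape is the same: compute a summary per dataset, then inspect where the summaries diverge.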

Docs

opsml.profile.DataProfiler

Source code in opsml/profile/profile_data.py
class DataProfiler:
    @staticmethod
    def create_profile_report(
        data: Union[pd.DataFrame, pl.DataFrame],
        bin_size: int = 20,
        compute_correlations: bool = False,
    ) -> DataProfile:
        """
        Creates a `scouter` data profile report

        Args:
            data:
                data to profile
            bin_size:
                number of bins for histograms. Default is 20
            compute_correlations:
                whether to compute correlations. Default is False

        Returns:
            `DataProfile`
        """
        profiler = Profiler()

        return profiler.create_data_profile(
            data=data,
            bin_size=bin_size,
            compute_correlations=compute_correlations,
        )

    @staticmethod
    def load_profile(data: str) -> DataProfile:
        """Loads a `DataProfile` from a JSON string

        Args:
            data:
                `DataProfile` as json string

        Returns:
            `DataProfile`
        """

        return DataProfile.model_validate_json(data)

create_profile_report(data, bin_size=20, compute_correlations=False) staticmethod

Creates a scouter data profile report

Parameters:

Name                  Type                               Description                                        Default
data                  Union[pd.DataFrame, pl.DataFrame]  data to profile                                    required
bin_size              int                                number of bins for histograms. Default is 20       20
compute_correlations  bool                               whether to compute correlations. Default is False  False

Returns:

Type         Description
DataProfile  DataProfile
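
To illustrate what the bin_size parameter controls, here is a minimal standard-library sketch of equal-width histogram binning. The function equal_width_histogram is hypothetical, and equal-width binning is an assumption made for illustration; the binning strategy actually used by scouter is not described in this page.

```python
from typing import List, Sequence, Tuple


def equal_width_histogram(
    values: Sequence[float], bin_size: int = 20
) -> Tuple[List[float], List[int]]:
    """Return (bin_edges, counts) for `bin_size` equal-width bins.

    Hypothetical helper for illustration only; not part of Opsml or scouter.
    """
    if bin_size < 1:
        raise ValueError("bin_size must be at least 1")
    if len(values) == 0:
        raise ValueError("values must not be empty")
    low, high = min(values), max(values)
    # Fall back to a width of 1.0 when every value is identical.
    width = (high - low) / bin_size or 1.0
    edges = [low + i * width for i in range(bin_size + 1)]
    counts = [0] * bin_size
    for value in values:
        # Clamp the index so the maximum value lands in the last bin.
        index = min(int((value - low) / width), bin_size - 1)
        counts[index] += 1
    return edges, counts


edges, counts = equal_width_histogram([1.0, 2.0, 2.5, 3.0, 9.0], bin_size=4)
print(edges)   # [1.0, 3.0, 5.0, 7.0, 9.0]
print(counts)  # [3, 1, 0, 1]
```

A larger bin_size gives a finer view of the distribution at the cost of a larger profile and noisier counts per bin.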

Source code in opsml/profile/profile_data.py
@staticmethod
def create_profile_report(
    data: Union[pd.DataFrame, pl.DataFrame],
    bin_size: int = 20,
    compute_correlations: bool = False,
) -> DataProfile:
    """
    Creates a `scouter` data profile report

    Args:
        data:
            data to profile
        bin_size:
            number of bins for histograms. Default is 20
        compute_correlations:
            whether to compute correlations. Default is False

    Returns:
        `DataProfile`
    """
    profiler = Profiler()

    return profiler.create_data_profile(
        data=data,
        bin_size=bin_size,
        compute_correlations=compute_correlations,
    )