Datasets

In addition to DataInterface classes, OpsML also provides a Dataset class that is used when working with text or image data.

Required Arguments for all Datasets (examples below)

data_dir
Path to directory containing data. This should be the root directory that contains all of the data. If you wish to define splits, you can do so by creating sub-directories within the root directory. For example, if you have a train and test split, you can create a directory structure like this:
root
├── train        # this will be inferred as a split named `train`
│   ├── file1.txt
│   ├── file2.txt
│   ├── file3.txt
│   └── metadata.jsonl
└── test          # this will be inferred as a split named `test`
    ├── file4.txt
    ├── file5.txt
    ├── file6.txt
    └── metadata.jsonl
shard_size
Size of each shard. Defaults to 512MB

Optional Arguments

splits
Dictionary of splits. Defaults to {}. Splits are inferred automatically from the directory structure.
description
Description of dataset. Defaults to Description()
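
The split inference described above can be approximated with a short directory scan. This is a minimal sketch using only the standard library, not OpsML's implementation: `infer_splits` is a hypothetical helper, and returning an empty mapping for a root without sub-directories is an assumption based on the `{}` default.

```python
from pathlib import Path
from typing import Dict


def infer_splits(data_dir: Path) -> Dict[str, Path]:
    """Map each immediate sub-directory of data_dir to a split of the same name.

    Hypothetical helper: plain files in the root are ignored, and a root
    without sub-directories yields an empty dict.
    """
    return {
        child.name: child
        for child in sorted(data_dir.iterdir())
        if child.is_dir()
    }
```

For the directory tree shown earlier, this would return one entry for `test` and one for `train`.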

Dataset Saving and Loading

Datasets are saved and loaded with pyarrow readers and writers, which keeps both operations efficient. When saving, each Dataset split is written as parquet files sized according to the specified shard size. When loading, the batch_size argument sets the number of rows to load at a time, and the chunk_size argument splits each batch into n chunks. Together, these two arguments control memory usage during loading.
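
To make the roles of the shard size, batch_size, and chunk_size concrete, here is a small standard-library sketch. It does not use pyarrow and is not how OpsML implements saving or loading. The function names are hypothetical, and treating chunk_size as a number of chunks follows the wording above but remains an assumption.

```python
from typing import Iterator, List, Sequence, TypeVar

T = TypeVar("T")


def shard_by_size(sizes: Sequence[int], shard_size: int) -> List[List[int]]:
    """Greedily group record indices so each shard stays within shard_size bytes.

    A single record larger than shard_size ends up in a shard of its own.
    """
    shards: List[List[int]] = []
    current: List[int] = []
    current_bytes = 0
    for index, size in enumerate(sizes):
        # Start a new shard when adding this record would exceed the limit
        if current and current_bytes + size > shard_size:
            shards.append(current)
            current, current_bytes = [], 0
        current.append(index)
        current_bytes += size
    if current:
        shards.append(current)
    return shards


def iter_batches(rows: Sequence[T], batch_size: int) -> Iterator[Sequence[T]]:
    """Yield consecutive slices of at most batch_size rows."""
    for start in range(0, len(rows), batch_size):
        yield rows[start : start + batch_size]


def split_into_chunks(batch: Sequence[T], n_chunks: int) -> List[Sequence[T]]:
    """Split one batch into at most n_chunks pieces of near-equal length."""
    step = max(1, -(-len(batch) // n_chunks))  # ceiling division, never zero
    return [batch[i : i + step] for i in range(0, len(batch), step)]
```

Smaller batches and more chunks per batch both reduce how much data is held in memory at once, at the cost of more iterations.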

Metadata

The metadata.jsonl file is a JSON Lines file, with one JSON entry per line, that can be written and loaded via the dataset's Metadata class. The Metadata class is a pydantic model that is used to validate the metadata.jsonl file. Each metadata subclass accepts a list of FileRecords. For subclass-specific examples, please refer to the examples below.
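
The one-entry-per-line layout can be illustrated without OpsML. The sketch below uses a plain dataclass as a stand-in for a FileRecord rather than the pydantic models. `Record`, `write_metadata`, and `load_metadata` are hypothetical names, and the two fields are illustrative.

```python
import json
from dataclasses import asdict, dataclass
from pathlib import Path
from typing import List


@dataclass
class Record:
    """Stand-in for a FileRecord; the two fields are illustrative."""

    filepath: str
    size: int


def write_metadata(records: List[Record], filepath: Path) -> None:
    """Write one JSON object per line."""
    with filepath.open("w", encoding="utf-8") as file_:
        for record in records:
            file_.write(json.dumps(asdict(record)) + "\n")


def load_metadata(filepath: Path) -> List[Record]:
    """Read the file back, building one Record per non-empty line."""
    if filepath.name != "metadata.jsonl":
        raise ValueError("Filename must be metadata.jsonl")
    with filepath.open("r", encoding="utf-8") as file_:
        return [Record(**json.loads(line)) for line in file_ if line.strip()]
```

Because every line is an independent JSON document, the file can be read line by line without loading it whole.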

ImageDataset

ImageDataset is a subclass of Dataset that is used to load and save image data. Its design intentionally resembles HuggingFace datasets in order to maintain some level of parity.

Data Type: Directory of images
Save Format: parquet
Source: ImageDataset

ImageMetadata

ImageMetadata is the metadata subclass that is associated with ImageDataset.

Required Arguments

records
List of ImageRecords

opsml.ImageMetadata

Bases: Metadata

Create Image metadata from a list of ImageRecords

Args:

records:
    List of ImageRecords
Source code in opsml/data/interfaces/custom_data/image.py
class ImageMetadata(Metadata):
    """Create Image metadata from a list of ImageRecords

    Args:

        records:
            List of ImageRecords
    """

    records: List[ImageRecord]

    @classmethod
    def load_from_file(cls, filepath: Path) -> "ImageMetadata":
        """Load metadata from a file

        Args:
            filepath:
                Path to metadata file
        """
        assert filepath.name == "metadata.jsonl", "Filename must be metadata.jsonl"
        with filepath.open("r", encoding="utf-8") as file_:
            records = []
            for line in file_:
                records.append(ImageRecord(**json.loads(line)))
            return cls(records=records)
load_from_file(filepath) classmethod

Load metadata from a file

Parameters:

filepath (Path, required)
Path to metadata file
Source code in opsml/data/interfaces/custom_data/image.py
@classmethod
def load_from_file(cls, filepath: Path) -> "ImageMetadata":
    """Load metadata from a file

    Args:
        filepath:
            Path to metadata file
    """
    assert filepath.name == "metadata.jsonl", "Filename must be metadata.jsonl"
    with filepath.open("r", encoding="utf-8") as file_:
        records = []
        for line in file_:
            records.append(ImageRecord(**json.loads(line)))
        return cls(records=records)

ImageRecord

ImageRecord is the FileRecord subclass that is associated with ImageMetadata.

Required Arguments

filepath
Pathlike object to image file

Optional Arguments

caption
Caption for the image
categories
List of categories for the image
objects
Bounding box specifications for objects in the image. See BBox

opsml.ImageRecord

Bases: FileRecord

Image record to associate with image file

Parameters:

filepath (required)
Full path to the file
caption (Optional[str], default None)
Optional caption for image
categories (Optional[List[Union[str, int, float]]], default None)
Optional list of categories for image
objects (Optional[BBox], default None)
Optional BBox for the image
size
Size of the file. This is inferred automatically if filepath is provided.
Source code in opsml/data/interfaces/custom_data/image.py
class ImageRecord(FileRecord):
    """Image record to associate with image file

    Args:
        filepath:
            Full path to the file
        caption:
            Optional caption for image
        categories:
            Optional list of categories for image
        objects:
            Optional `BBox` for the image
        size:
            Size of the file. This is inferred automatically if filepath is provided.

    """

    caption: Optional[str] = None
    categories: Optional[List[Union[str, int, float]]] = None
    objects: Optional[BBox] = None

    def to_arrow(self, data_dir: Path, split_label: Optional[str] = None) -> Dict[str, Any]:
        """Saves data to arrow format

        Args:
            data_dir:
                Path to data directory
            split_label:
                Optional split label for data

        Returns:
            Dictionary of data to be saved to arrow
        """
        path = self.filepath.relative_to(data_dir)

        with Image.open(self.filepath) as img:
            stream_record = {
                "split_label": split_label,
                "path": path.as_posix(),
                "height": img.height,
                "width": img.width,
                "bytes": img.tobytes(),
                "mode": img.mode,
            }
        return stream_record
to_arrow(data_dir, split_label=None)

Saves data to arrow format

Parameters:

data_dir (Path, required)
Path to data directory
split_label (Optional[str], default None)
Optional split label for data

Returns:

Dict[str, Any]
Dictionary of data to be saved to arrow

Source code in opsml/data/interfaces/custom_data/image.py
def to_arrow(self, data_dir: Path, split_label: Optional[str] = None) -> Dict[str, Any]:
    """Saves data to arrow format

    Args:
        data_dir:
            Path to data directory
        split_label:
            Optional split label for data

    Returns:
        Dictionary of data to be saved to arrow
    """
    path = self.filepath.relative_to(data_dir)

    with Image.open(self.filepath) as img:
        stream_record = {
            "split_label": split_label,
            "path": path.as_posix(),
            "height": img.height,
            "width": img.width,
            "bytes": img.tobytes(),
            "mode": img.mode,
        }
    return stream_record
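
The record stores height, width, and mode next to the pixel buffer because `img.tobytes()` returns raw pixel data with no header. The sketch below shows how those fields describe the buffer layout. It is illustrative only: `split_rows` and the `BYTES_PER_PIXEL` table are hypothetical and cover just a few common modes.

```python
from typing import Any, Dict, List

# Bytes per pixel for a few common Pillow modes (not an exhaustive table).
BYTES_PER_PIXEL = {"L": 1, "RGB": 3, "RGBA": 4}


def split_rows(record: Dict[str, Any]) -> List[bytes]:
    """Slice a record's raw pixel buffer into one bytes object per image row."""
    stride = record["width"] * BYTES_PER_PIXEL[record["mode"]]
    expected = stride * record["height"]
    if len(record["bytes"]) != expected:
        raise ValueError(f"expected {expected} bytes, got {len(record['bytes'])}")
    return [
        record["bytes"][row * stride : (row + 1) * stride]
        for row in range(record["height"])
    ]
```

With Pillow installed, `Image.frombytes(record["mode"], (record["width"], record["height"]), record["bytes"])` rebuilds an image from the same fields.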

Example Writing Metadata

from pathlib import Path

from opsml import BBox, ImageMetadata, ImageRecord

# Build one record per image file
records = []
records.append(
    ImageRecord(
        filepath=Path("image_dir/opsml.jpg"),
        caption="This is a caption for the image",
        categories=[0],
        objects=BBox(
            bbox=[[302.0, 109.0, 73.0, 52.0]],
            categories=[0],
        ),
    )
)

# Write the records to metadata.jsonl inside the image directory
ImageMetadata(records=records).write_to_file(Path("image_dir/metadata.jsonl"))

Example Using ImageDataset

from pathlib import Path

from opsml import CardInfo, CardRegistry, DataCard, ImageDataset

info = CardInfo(name="data", repository="opsml", contact="user@email.com")
data_registry = CardRegistry("data")

# data_dir is the root directory that contains the images and metadata.jsonl
data = ImageDataset(data_dir=Path("image_dir"))

# Create and register datacard
datacard = DataCard(interface=data, info=info)
data_registry.register_card(card=datacard)

TextDataset

TextDataset is a subclass of Dataset that is used to load and save text data. Its design intentionally resembles HuggingFace datasets in order to maintain some level of parity.

Data Type: Directory of text files
Save Format: parquet
Source: TextDataset

TextMetadata

TextMetadata is the metadata subclass that is associated with TextDataset.

Required Arguments

records
List of TextRecords

opsml.TextMetadata

Bases: Metadata

Create Text metadata from a list of TextRecords

Args:

records:
    List of TextRecords
Source code in opsml/data/interfaces/custom_data/text.py
class TextMetadata(Metadata):
    """Create Text metadata from a list of TextRecords

    Args:

        records:
            List of TextRecords
    """

    records: List[TextRecord]

    @classmethod
    def load_from_file(cls, filepath: Path) -> "TextMetadata":
        """Load metadata from a file

        Args:
            filepath:
                Path to metadata file
        """
        assert filepath.name == "metadata.jsonl", "Filename must be metadata.jsonl"
        with filepath.open("r", encoding="utf-8") as file_:
            records = []
            for line in file_:
                records.append(TextRecord(**json.loads(line)))
            return cls(records=records)
load_from_file(filepath) classmethod

Load metadata from a file

Parameters:

filepath (Path, required)
Path to metadata file
Source code in opsml/data/interfaces/custom_data/text.py
@classmethod
def load_from_file(cls, filepath: Path) -> "TextMetadata":
    """Load metadata from a file

    Args:
        filepath:
            Path to metadata file
    """
    assert filepath.name == "metadata.jsonl", "Filename must be metadata.jsonl"
    with filepath.open("r", encoding="utf-8") as file_:
        records = []
        for line in file_:
            records.append(TextRecord(**json.loads(line)))
        return cls(records=records)

TextRecord

TextRecord is the FileRecord subclass that is associated with TextMetadata.

Required Arguments

filepath
Pathlike object to text file

opsml.TextRecord

Bases: FileRecord

Text record to associate with text file

Parameters:

filepath (required)
Full path to the file
size
Size of the file. This is inferred automatically if filepath is provided.
Source code in opsml/data/interfaces/custom_data/text.py
class TextRecord(FileRecord):
    """Text record to associate with text file

    Args:
        filepath:
            Full path to the file
        size:
            Size of the file. This is inferred automatically if filepath is provided.

    """

    def to_arrow(self, data_dir: Path, split_label: Optional[str] = None) -> Dict[str, Any]:
        """Saves data to arrow format

        Args:
            data_dir:
                Path to data directory
            split_label:
                Optional split label for data

        Returns:
            Dictionary of data to be saved to arrow
        """
        path = self.filepath.relative_to(data_dir)

        # read the file contents as raw bytes
        with open(self.filepath, "rb") as file_:
            stream_record = {
                "split_label": split_label,
                "path": path.as_posix(),
                "bytes": file_.read(),
            }
        return stream_record
to_arrow(data_dir, split_label=None)

Saves data to arrow format

Parameters:

data_dir (Path, required)
Path to data directory
split_label (Optional[str], default None)
Optional split label for data

Returns:

Dict[str, Any]
Dictionary of data to be saved to arrow

Source code in opsml/data/interfaces/custom_data/text.py
def to_arrow(self, data_dir: Path, split_label: Optional[str] = None) -> Dict[str, Any]:
    """Saves data to arrow format

    Args:
        data_dir:
            Path to data directory
        split_label:
            Optional split label for data

    Returns:
        Dictionary of data to be saved to arrow
    """
    path = self.filepath.relative_to(data_dir)

    # read the file contents as raw bytes
    with open(self.filepath, "rb") as file_:
        stream_record = {
            "split_label": split_label,
            "path": path.as_posix(),
            "bytes": file_.read(),
        }
    return stream_record
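
The same three keys can be produced and consumed with the standard library alone, which shows what the method does with paths and bytes. This is a sketch, not the library code: `text_stream_record` is a hypothetical free function, and decoding the bytes as UTF-8 afterwards is an assumption about the file contents.

```python
from pathlib import Path
from typing import Any, Dict, Optional


def text_stream_record(
    filepath: Path, data_dir: Path, split_label: Optional[str] = None
) -> Dict[str, Any]:
    """Build a dict with the same three keys as the to_arrow method above."""
    return {
        "split_label": split_label,
        # relative_to raises ValueError if filepath is not inside data_dir
        "path": filepath.relative_to(data_dir).as_posix(),
        # the file is read in binary mode, so no encoding is assumed here
        "bytes": filepath.read_bytes(),
    }
```

Since the content is stored as bytes, a consumer recovers the text with something like `record["bytes"].decode("utf-8")`, choosing whichever encoding the files were written in.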

Example Writing Metadata

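from pathlib import Path

from opsml import TextMetadata, TextRecord

# Build one record per text file
record = TextRecord(filepath=Path("text_dir/opsml.txt"))

# Write the records to metadata.jsonl inside the text directory
TextMetadata(records=[record]).write_to_file(Path("text_dir/metadata.jsonl"))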
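
When a directory holds many files, the record fields can be gathered in a loop instead of being written out by hand. The sketch below uses only the standard library and collects the two documented TextRecord fields, filepath and size, for every `.txt` file under a root. `collect_text_files` is a hypothetical helper, and the `.txt` filter is an assumption about how the files are named.

```python
from pathlib import Path
from typing import Dict, List, Union


def collect_text_files(data_dir: Path) -> List[Dict[str, Union[str, int]]]:
    """Return filepath and size for every .txt file under data_dir, sorted by path."""
    return [
        {"filepath": path.as_posix(), "size": path.stat().st_size}
        for path in sorted(data_dir.rglob("*.txt"))
        if path.is_file()
    ]
```

Each resulting dict could then be turned into a record, for example `TextRecord(filepath=Path(fields["filepath"]))`, before writing the metadata file.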

Example Using TextDataset

from pathlib import Path

from opsml import CardInfo, CardRegistry, DataCard, TextDataset

info = CardInfo(name="data", repository="opsml", contact="user@email.com")
data_registry = CardRegistry("data")

# data_dir is the root directory that contains the text files and metadata.jsonl
data = TextDataset(data_dir=Path("text_dir"))

# Create and register datacard
datacard = DataCard(interface=data, info=info)
data_registry.register_card(card=datacard)