Datasets¶
In addition to DataInterface classes, OpsML also provides a Dataset class that is used when working with text or image data.
Required Arguments for all Datasets (examples below)¶
data_dir
- Path to directory containing data. This should be the root directory that contains all of the data. If you wish to define splits, you can do so by creating sub-directories within the root directory. For example, if you have a train and a test split, you can create a directory structure like this:
root
├── train # this will be inferred as a split named `train`
│   ├── file1.txt
│   ├── file2.txt
│   ├── file3.txt
│   └── metadata.jsonl
└── test # this will be inferred as a split named `test`
    ├── file4.txt
    ├── file5.txt
    ├── file6.txt
    └── metadata.jsonl
shard_size
- Size of each shard. Defaults to 512MB
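As a rough picture of what a shard size limit does, the sketch below packs files into shards that stay under a size limit. It is not OpsML's implementation: `parse_size` and `group_into_shards` are hypothetical helpers, and the unit convention (1 MB = 1,000,000 bytes) is an assumption.

```python
import re
from typing import List, Tuple

# Assumption: decimal units (1 MB = 1,000,000 bytes). OpsML may interpret units differently.
_UNITS = {"KB": 10**3, "MB": 10**6, "GB": 10**9}


def parse_size(size: str) -> int:
    """Convert a string such as '512MB' into a number of bytes (hypothetical helper)."""
    match = re.fullmatch(r"\s*(\d+)\s*(KB|MB|GB)\s*", size.upper())
    if match is None:
        raise ValueError(f"Unrecognized size: {size!r}")
    return int(match.group(1)) * _UNITS[match.group(2)]


def group_into_shards(files: List[Tuple[str, int]], shard_size: str) -> List[List[str]]:
    """Greedily pack (name, size_in_bytes) pairs into shards no larger than shard_size.

    A single file larger than the limit ends up in a shard of its own.
    """
    limit = parse_size(shard_size)
    shards: List[List[str]] = []
    current: List[str] = []
    current_bytes = 0
    for name, size in files:
        # Start a new shard when adding this file would exceed the limit
        if current and current_bytes + size > limit:
            shards.append(current)
            current, current_bytes = [], 0
        current.append(name)
        current_bytes += size
    if current:
        shards.append(current)
    return shards


files = [("a.jpg", 300_000_000), ("b.jpg", 200_000_000), ("c.jpg", 100_000_000)]
shards = group_into_shards(files, "512MB")
print(shards)
```

A smaller limit produces more, smaller shards; a larger limit produces fewer, larger ones.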
Optional Arguments¶
splits
- Dictionary of splits. Defaults to {}. This is automatically inferred from the directory structure.
description
- Description of dataset. Defaults to Description()
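To illustrate how splits can be inferred from a directory structure, here is a minimal sketch that rebuilds the train/test layout shown under data_dir and maps each sub-directory to a split name. `infer_splits` and `build_example_tree` are hypothetical helpers written for this example; OpsML's own inference logic may differ.

```python
import tempfile
from pathlib import Path
from typing import Dict, List


def infer_splits(root: Path) -> Dict[str, List[Path]]:
    """Map each sub-directory of root to the data files it contains.

    Hypothetical helper for illustration; not part of OpsML.
    """
    splits: Dict[str, List[Path]] = {}
    for sub_dir in sorted(p for p in root.iterdir() if p.is_dir()):
        # metadata.jsonl describes the data files, so it is not counted as one
        files = sorted(
            f for f in sub_dir.iterdir() if f.is_file() and f.name != "metadata.jsonl"
        )
        splits[sub_dir.name] = files
    return splits


def build_example_tree(root: Path) -> None:
    """Create the train/test layout with empty placeholder files."""
    layout = {
        "train": ["file1.txt", "file2.txt", "file3.txt"],
        "test": ["file4.txt", "file5.txt", "file6.txt"],
    }
    for split, names in layout.items():
        split_dir = root / split
        split_dir.mkdir(parents=True)
        for name in names + ["metadata.jsonl"]:
            (split_dir / name).touch()


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp) / "root"
    build_example_tree(root)
    inferred = {
        name: [f.name for f in files] for name, files in infer_splits(root).items()
    }
    print(inferred)
```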
Dataset Saving and Loading¶
Datasets are saved via pyarrow readers and writers, which allows for efficient saving and loading. When saving, Dataset splits are written as parquet files based on the specified shard size. When loading, the dataset is read according to the batch_size and chunk_size arguments. The batch_size argument specifies the number of rows to load at a time. The chunk_size argument splits each batch into n chunks. Both arguments are used to control memory usage during loading.
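The relationship between the two loading arguments can be sketched in plain Python. This is a conceptual illustration only, not OpsML's loader: `iter_batches` and `split_batch` are hypothetical helpers, and the sketch assumes that chunk_size is the number of chunks each batch is divided into.

```python
from typing import Iterator, List, Sequence, TypeVar

T = TypeVar("T")


def iter_batches(rows: Sequence[T], batch_size: int) -> Iterator[List[T]]:
    """Yield consecutive batches of at most batch_size rows."""
    for start in range(0, len(rows), batch_size):
        yield list(rows[start:start + batch_size])


def split_batch(batch: List[T], n_chunks: int) -> List[List[T]]:
    """Split one batch into at most n_chunks contiguous, near-equal chunks."""
    base, extra = divmod(len(batch), n_chunks)
    chunks: List[List[T]] = []
    start = 0
    for i in range(n_chunks):
        # The first `extra` chunks take one additional row each
        size = base + (1 if i < extra else 0)
        if size == 0:
            break
        chunks.append(batch[start:start + size])
        start += size
    return chunks


rows = list(range(10))
batches = list(iter_batches(rows, batch_size=4))
chunked = [split_batch(batch, n_chunks=2) for batch in batches]
print(batches)
print(chunked)
```

Only one batch needs to be held in memory at a time, and each chunk is a smaller unit of work within that batch, which is why both settings affect memory usage.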
Metadata¶
The metadata.jsonl file is a jsonl file containing line-separated json entries that can be written and loaded via the dataset's Metadata class. The Metadata class is a pydantic model that is used to validate the metadata.jsonl file. Each Metadata subclass accepts a list of FileRecords. For subclass-specific examples, please refer to the examples below.
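The jsonl format itself is simple: one JSON object per line. The sketch below writes and reads such a file using only the standard library. `SimpleFileRecord`, `write_jsonl`, and `read_jsonl` are hypothetical stand-ins, not OpsML classes, and the exact fields OpsML writes to metadata.jsonl may differ from the two modeled here.

```python
import json
import tempfile
from dataclasses import asdict, dataclass
from pathlib import Path
from typing import List


@dataclass
class SimpleFileRecord:
    """Hypothetical stand-in for a FileRecord; only filepath and size are modeled."""

    filepath: str
    size: int


def write_jsonl(records: List[SimpleFileRecord], path: Path) -> None:
    """Write one JSON object per line."""
    with path.open("w", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(asdict(record)) + "\n")


def read_jsonl(path: Path) -> List[SimpleFileRecord]:
    """Read records back, skipping blank lines."""
    with path.open("r", encoding="utf-8") as f:
        return [SimpleFileRecord(**json.loads(line)) for line in f if line.strip()]


with tempfile.TemporaryDirectory() as tmp:
    metadata_path = Path(tmp) / "metadata.jsonl"
    original = [SimpleFileRecord("file1.txt", 12), SimpleFileRecord("file2.txt", 34)]
    write_jsonl(original, metadata_path)
    raw_lines = metadata_path.read_text(encoding="utf-8").splitlines()
    restored = read_jsonl(metadata_path)
    print(raw_lines)
```

In OpsML the Metadata class plays the role of these helpers and additionally validates each entry through pydantic.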
ImageDataset¶
ImageDataset is a subclass of Dataset that is used to load and save image data. It is intentionally similar to HuggingFace datasets in order to maintain some level of parity.
Data Type | Directory of images |
Save Format | parquet |
Source | ImageDataset |
ImageMetadata¶
ImageMetadata is the metadata subclass that is associated with ImageDataset.
Required Arguments¶
records
- List of ImageRecords
opsml.ImageMetadata¶
Bases: Metadata
Create Image metadata from a list of ImageRecords
Args:
records: List of ImageRecords
Source code in opsml/data/interfaces/custom_data/image.py
load_from_file(filepath) classmethod¶
Load metadata from a file
Parameters:
Name | Type | Description | Default |
---|---|---|---|
filepath | Path | Path to metadata file | required |
Source code in opsml/data/interfaces/custom_data/image.py
ImageRecord¶
ImageRecord is the FileRecord subclass that is associated with ImageMetadata.
Required Arguments¶
filepath
- Pathlike object to image file
Optional Arguments¶
caption
- Caption for the image
categories
- List of categories for the image
objects
- Bounding box specifications for objects in the image. See BBox
opsml.ImageRecord¶
Bases: FileRecord
Image record to associate with image file
Parameters:
Name | Type | Description | Default |
---|---|---|---|
filepath | | Full path to the file | required |
caption | | Optional caption for image | required |
categories | | Optional list of categories for image | required |
objects | | Optional | required |
size | | Size of the file. This is inferred automatically if filepath is provided. | required |
Source code in opsml/data/interfaces/custom_data/image.py
to_arrow(data_dir, split_label=None)¶
Saves data to arrow format
Parameters:
Name | Type | Description | Default |
---|---|---|---|
data_dir | Path | Path to data directory | required |
split_label | Optional[str] | Optional split label for data | None |
Returns:
Type | Description |
---|---|
Dict[str, Any] | Dictionary of data to be saved to arrow |
Source code in opsml/data/interfaces/custom_data/image.py
Example Writing Metadata¶
# create image records and write them to metadata.jsonl
from pathlib import Path

from opsml import ImageRecord, ImageMetadata, BBox

records = []
records.append(
    ImageRecord(
        filepath=Path("image_dir/opsml.jpg"),
        caption="This is a caption for the image",
        categories=[0],
        objects=BBox(
            bbox=[[302.0, 109.0, 73.0, 52.0]],
            categories=[0],
        ),
    )
)
ImageMetadata(records=records).write_to_file(Path("image_dir/metadata.jsonl"))
Example Using ImageDataset¶
from pathlib import Path

from opsml import ImageDataset, CardInfo, DataCard, CardRegistry

info = CardInfo(name="data", repository="opsml", contact="user@email.com")
data_registry = CardRegistry("data")
# data_dir is the root directory that contains the images and metadata.jsonl
data = ImageDataset(data_dir=Path("image_dir"))
# Create and register datacard
datacard = DataCard(interface=data, info=info)
data_registry.register_card(card=datacard)
TextDataset¶
TextDataset is a subclass of Dataset that is used to load and save text data. It is intentionally similar to HuggingFace datasets in order to maintain some level of parity.
Data Type | Directory of text files |
Save Format | parquet |
Source | TextDataset |
TextMetadata¶
TextMetadata is the metadata subclass that is associated with TextDataset.
Required Arguments¶
records
- List of TextRecords
opsml.TextMetadata¶
Bases: Metadata
Create Text metadata from a list of TextRecords
Args:
records: List of TextRecords
Source code in opsml/data/interfaces/custom_data/text.py
load_from_file(filepath) classmethod¶
Load metadata from a file
Parameters:
Name | Type | Description | Default |
---|---|---|---|
filepath | Path | Path to metadata file | required |
Source code in opsml/data/interfaces/custom_data/text.py
TextRecord¶
TextRecord is the FileRecord subclass that is associated with TextMetadata.
Required Arguments¶
filepath
- Pathlike object to text file
opsml.TextRecord¶
Bases: FileRecord
Text record to associate with text file
Parameters:
Name | Type | Description | Default |
---|---|---|---|
filepath | | Full path to the file | required |
size | | Size of the file. This is inferred automatically if filepath is provided. | required |
Source code in opsml/data/interfaces/custom_data/text.py
to_arrow(data_dir, split_label=None)¶
Saves data to arrow format
Parameters:
Name | Type | Description | Default |
---|---|---|---|
data_dir | Path | Path to data directory | required |
split_label | Optional[str] | Optional split label for data | None |
Returns:
Type | Description |
---|---|
Dict[str, Any] | Dictionary of data to be saved to arrow |
Source code in opsml/data/interfaces/custom_data/text.py
Example Writing Metadata¶
from pathlib import Path

from opsml import TextMetadata, TextRecord

# create a text record and write it to metadata.jsonl
record = TextRecord(filepath=Path("text_dir/opsml.txt"))
TextMetadata(records=[record]).write_to_file(Path("text_dir/metadata.jsonl"))
Example Using TextDataset¶
from pathlib import Path

from opsml import TextDataset, CardInfo, DataCard, CardRegistry

info = CardInfo(name="data", repository="opsml", contact="user@email.com")
data_registry = CardRegistry("data")
# data_dir is the root directory that contains the text files and metadata.jsonl
data = TextDataset(data_dir=Path("text_dir"))
# Create and register datacard
datacard = DataCard(interface=data, info=info)
data_registry.register_card(card=datacard)