Data Formats¶
Format classes standardize domain-specific datasets into column-oriented batches for ingestion. Use them when your data has a specific structure (COCO annotations, robotics episodes, etc.) that needs custom processing.
How formats work¶
A format class has up to four methods:
| Method | Required | Purpose |
|---|---|---|
| `normalize()` | Yes | Yields dict batches (`{column: [values]}`) |
| `schema()` | No | Declares Deeplake storage types (`"BINARY"`, `"TEXT"`) |
| `pg_schema()` | No | Declares PostgreSQL domain types (`"IMAGE"`, `"SEGMENT_MASK"`) |
| `image_columns()` | No | Lists columns for automatic thumbnail generation |
Pass a format instance to `client.ingest()`, as the examples below show.
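To illustrate the contract in isolation, here is a minimal sketch of what `client.ingest()` consumes from a format instance: the `normalize()` generator of equal-length column lists. `DummyFormat` and its columns are hypothetical, not part of the library:

```python
# Minimal sketch of the format contract (illustrative only):
# normalize() yields column-oriented batches of equal-length lists.
class DummyFormat:
    def normalize(self):
        yield {"text": ["hello", "world"], "length": [5, 5]}

batches = list(DummyFormat().normalize())
for batch in batches:
    # Every column in a batch must hold the same number of rows.
    assert len({len(values) for values in batch.values()}) == 1
```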
Built-in formats¶
CocoPanoptic¶
Processes COCO panoptic segmentation datasets. Generates columns for image data, segmentation masks (PNG), dimensions, and segment metadata as JSON.
```python
from deeplake.managed.formats import CocoPanoptic

fmt = CocoPanoptic(
    images_dir="coco_panoptic/images",
    masks_dir="coco_panoptic/masks",
    annotations="coco_panoptic/panoptic.json",
)
client.ingest("panoptic", {}, format=fmt)
```
Coco¶
Handles COCO detection and caption annotations. Generates combined segmentation masks, bounding boxes, captions, and category definitions.
```python
from deeplake.managed.formats import Coco

fmt = Coco(
    images_dir="coco_detection/images",
    instances_json="coco_detection/instances.json",
)
client.ingest("detection", {}, format=fmt)
```
LeRobot¶
Three-table design for robotic learning datasets, with one format class per table:
- `LeRobotTasks`: task indices and descriptions
- `LeRobotFrames`: per-frame state/action vectors with episode and task references
- `LeRobotEpisodes`: episode-level metadata with video streams from multiple cameras
```python
from deeplake.managed.formats.lerobot import LeRobotTasks, LeRobotFrames, LeRobotEpisodes

dataset_dir = "lerobot_dataset/"
client.ingest("robot_tasks", {}, format=LeRobotTasks(dataset_dir))
client.ingest("robot_frames", {}, format=LeRobotFrames(dataset_dir))
client.ingest("robot_episodes", {}, format=LeRobotEpisodes(dataset_dir))
```
See LeRobot Integration for a complete end-to-end example.
Custom formats¶
Create your own format class for domain-specific data:
```python
import csv

from deeplake.managed.formats import Format


class CsvWithImages(Format):
    def __init__(self, csv_path, image_dir):
        self.csv_path = csv_path
        self.image_dir = image_dir

    def schema(self):
        return {"image": "BINARY"}

    def pg_schema(self):
        return {"image": "IMAGE"}

    def image_columns(self):
        return ["image"]

    def normalize(self):
        with open(self.csv_path) as f:
            reader = csv.DictReader(f)
            batch = {"label": [], "image": []}
            for row in reader:
                with open(f"{self.image_dir}/{row['filename']}", "rb") as img:
                    batch["image"].append(img.read())
                batch["label"].append(row["label"])
                if len(batch["label"]) >= 100:
                    yield batch
                    batch = {"label": [], "image": []}
            if batch["label"]:
                yield batch


client.ingest("labeled_images", {}, format=CsvWithImages("labels.csv", "images/"))
```
Guidelines for normalize()¶
- Yield dicts with column names as keys and equal-length lists as values
- Keep consistent column names across all batches
- Batch 100-1000 rows per yield to manage memory
- Raise exceptions for critical errors, log warnings for recoverable ones
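The batching guideline can be sketched in isolation as plain Python, independent of any file I/O. Column names and the `batch_size` default here are hypothetical; the pattern is fixed-size yields plus a final flush of the partial batch:

```python
# Illustrative batching pattern for normalize() (hypothetical columns):
# accumulate rows, yield every `batch_size` rows, flush the remainder.
def batched_normalize(rows, batch_size=100):
    batch = {"label": [], "value": []}
    for label, value in rows:
        batch["label"].append(label)
        batch["value"].append(value)
        if len(batch["label"]) >= batch_size:
            yield batch
            batch = {"label": [], "value": []}
    if batch["label"]:  # flush the final partial batch
        yield batch

sizes = [len(b["label"]) for b in batched_normalize(("x", i) for i in range(250))]
# 250 rows with batch_size=100 -> batches of 100, 100, 50
```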
Next steps¶
- Data Types: full type reference and schema inference
- Ingestion: `client.ingest()` API and chunking strategies