Data Formats¶
Format classes standardize domain-specific datasets into column-oriented batches for ingestion. Use them when your data has a specific structure (COCO annotations, robotics episodes, etc.) that needs custom processing.
How formats work¶
A format class has up to four methods:
| Method | Required | Purpose |
|---|---|---|
| `normalize()` | Yes | Yields dict batches (`{column: [values]}`) |
| `schema()` | No | Declares Deeplake storage types (`"BINARY"`, `"TEXT"`) |
| `pg_schema()` | No | Declares PostgreSQL domain types (`"IMAGE"`, `"SEGMENT_MASK"`) |
| `image_columns()` | No | Lists columns for automatic thumbnail generation |
Pass a format instance to `client.ingest()`, as the examples below show.
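To illustrate the contract in isolation, here is a minimal sketch of what `client.ingest()` consumes from a format instance: the `normalize()` generator of equal-length column lists. `DummyFormat` and its columns are hypothetical, not part of the library:

```python
# Minimal sketch of the format contract (illustrative only):
# normalize() yields column-oriented batches of equal-length lists.
class DummyFormat:
    def normalize(self):
        yield {"text": ["hello", "world"], "length": [5, 5]}

batches = list(DummyFormat().normalize())
for batch in batches:
    # Every column in a batch must hold the same number of rows.
    assert len({len(values) for values in batch.values()}) == 1
```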
Built-in formats¶
CocoPanoptic¶
Processes COCO panoptic segmentation datasets. Generates columns for image data, segmentation masks (PNG), dimensions, and segment metadata as JSON.
```python
from deeplake.managed.formats import CocoPanoptic

fmt = CocoPanoptic(
    images_dir="coco_panoptic/images",
    masks_dir="coco_panoptic/masks",
    annotations="coco_panoptic/panoptic.json",
)
client.ingest("panoptic", {}, format=fmt)
```
Coco¶
Handles COCO detection and caption annotations. Generates combined segmentation masks, bounding boxes, captions, and category definitions.
```python
from deeplake.managed.formats import Coco

fmt = Coco(
    images_dir="coco_detection/images",
    instances_json="coco_detection/instances.json",
)
client.ingest("detection", {}, format=fmt)
```
LeRobot¶
Three-table design for robotic learning datasets, with one format class per table:
- `LeRobotTasks`: task indices and descriptions
- `LeRobotFrames`: per-frame state/action vectors with episode and task references
- `LeRobotEpisodes`: episode-level metadata with video streams from multiple cameras
```python
from deeplake.managed.formats.lerobot import LeRobotTasks, LeRobotFrames, LeRobotEpisodes

dataset_dir = "lerobot_dataset/"
client.ingest("robot_tasks", {}, format=LeRobotTasks(dataset_dir))
client.ingest("robot_frames", {}, format=LeRobotFrames(dataset_dir))
client.ingest("robot_episodes", {}, format=LeRobotEpisodes(dataset_dir))
```
See LeRobot Integration for a complete end-to-end example.
Custom formats¶
Create your own format class for domain-specific data:
```python
import csv

from deeplake.managed.formats import Format


class CsvWithImages(Format):
    def __init__(self, csv_path, image_dir):
        self.csv_path = csv_path
        self.image_dir = image_dir

    def schema(self):
        return {"image": "BINARY"}

    def pg_schema(self):
        return {"image": "IMAGE"}

    def image_columns(self):
        return ["image"]

    def normalize(self):
        with open(self.csv_path) as f:
            reader = csv.DictReader(f)
            batch = {"label": [], "image": []}
            for row in reader:
                with open(f"{self.image_dir}/{row['filename']}", "rb") as img:
                    batch["image"].append(img.read())
                batch["label"].append(row["label"])
                if len(batch["label"]) >= 100:
                    yield batch
                    batch = {"label": [], "image": []}
            if batch["label"]:
                yield batch


client.ingest("labeled_images", {}, format=CsvWithImages("labels.csv", "images/"))
```
Guidelines for normalize()¶
- Yield dicts with column names as keys and equal-length lists as values
- Keep consistent column names across all batches
- Batch 100-1000 rows per yield to manage memory
- Raise exceptions for critical errors, log warnings for recoverable ones
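The batching guideline can be sketched in isolation as plain Python, independent of any file I/O. Column names and the `batch_size` default here are hypothetical; the pattern is fixed-size yields plus a final flush of the partial batch:

```python
# Illustrative batching pattern for normalize() (hypothetical columns):
# accumulate rows, yield every `batch_size` rows, flush the remainder.
def batched_normalize(rows, batch_size=100):
    batch = {"label": [], "value": []}
    for label, value in rows:
        batch["label"].append(label)
        batch["value"].append(value)
        if len(batch["label"]) >= batch_size:
            yield batch
            batch = {"label": [], "value": []}
    if batch["label"]:  # flush the final partial batch
        yield batch

sizes = [len(b["label"]) for b in batched_normalize(("x", i) for i in range(250))]
# 250 rows with batch_size=100 -> batches of 100, 100, 50
```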
Next steps¶
- Data Types: full type reference and schema inference
- Ingestion: `client.ingest()` API and chunking strategies