Training Object Detection in PyTorch

Deeplake makes it easy to train object detection models by streaming annotation-rich datasets directly into your PyTorch training loop. This tutorial shows how to ingest COCO, build a detection-ready DataLoader, and fine-tune a Faster R-CNN model.

Objective

Ingest the COCO dataset into a Deeplake managed table and train a Faster R-CNN ResNet-50 FPN model using PyTorch, streaming data directly from the cloud.

Prerequisites

  • Deeplake SDK: pip install deeplake
  • PyTorch with torchvision: pip install torch torchvision
  • A Deeplake API token.
  • COCO 2017 dataset files (or use the HuggingFace shortcut below).

Set credentials first

export DEEPLAKE_API_KEY="your-token-here"
export DEEPLAKE_WORKSPACE="your-workspace"  # optional, defaults to "default"

Complete Code

import io
import json
import torch
from PIL import Image
from torchvision import transforms
from torchvision.models.detection import fasterrcnn_resnet50_fpn
from torchvision.models.detection.faster_rcnn import FastRCNNPredictor
from deeplake import Client

# 1. Setup
client = Client()

# From local COCO files (CocoPanoptic format)
from deeplake.managed.formats import CocoPanoptic

client.ingest(
    "coco_train",
    format=CocoPanoptic(
        images_dir="coco/train2017",
        masks_dir="coco/panoptic_train2017",
        annotations="coco/annotations/panoptic_train2017.json",
    ),
)

# Note: COCO requires the CocoPanoptic format above.
# The HuggingFace shortcut does not work for COCO because
# its nested "objects" column cannot be auto-converted.

# 2. Open the table and stream with ds.batches()
ds = client.open_table("coco_train")

to_tensor = transforms.ToTensor()

def parse_detection_batch(batch):
    """Parse a ds.batches() batch into (images, targets) for Faster R-CNN."""
    images = []
    targets = []
    for i in range(len(batch["image"])):
        img = Image.open(io.BytesIO(batch["image"][i])).convert("RGB")
        images.append(to_tensor(img))

        # segments_info is stored as a JSON string
        raw = batch["segments_info"][i]
        segs = json.loads(raw) if isinstance(raw, str) else raw
        boxes = []
        labels = []
        for seg in segs:
            bbox = seg["bbox"]  # [x, y, w, h]
            boxes.append([bbox[0], bbox[1], bbox[0] + bbox[2], bbox[1] + bbox[3]])
            labels.append(seg["category_id"])

        targets.append({
            "boxes": torch.tensor(boxes, dtype=torch.float32) if boxes else torch.zeros((0, 4)),
            "labels": torch.tensor(labels, dtype=torch.int64) if labels else torch.zeros(0, dtype=torch.int64),
        })
    return images, targets

# 3. Define the model
num_classes = 201  # panoptic category ids span things (1-90) and stuff (92-200); labels must lie in [0, num_classes), with 0 as background
model = fasterrcnn_resnet50_fpn(weights="DEFAULT")
in_features = model.roi_heads.box_predictor.cls_score.in_features
model.roi_heads.box_predictor = FastRCNNPredictor(in_features, num_classes)

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)

# 4. Training loop using ds.batches()
optimizer = torch.optim.SGD(
    model.parameters(), lr=0.005, momentum=0.9, weight_decay=0.0005
)

model.train()
for epoch in range(3):
    epoch_loss = 0.0
    n_batches = 0
    for batch in ds.batches(batch_size=8):
        images, targets = parse_detection_batch(batch)
        images = [img.to(device) for img in images]
        targets = [{k: v.to(device) for k, v in t.items()} for t in targets]

        loss_dict = model(images, targets)
        losses = sum(loss for loss in loss_dict.values())

        optimizer.zero_grad()
        losses.backward()
        optimizer.step()

        epoch_loss += losses.item()
        n_batches += 1
        if n_batches % 50 == 0:
            print(f"Epoch {epoch}, Batch {n_batches}, Loss: {losses.item():.4f}")

    print(f"Epoch {epoch} complete - Avg Loss: {epoch_loss / max(n_batches, 1):.4f}")

torch.save(model.state_dict(), "fasterrcnn_coco.pth")
print("Model saved.")

Step-by-Step Breakdown

1. Ingest COCO Dataset

Deeplake supports ingesting COCO directly via the CocoPanoptic format, which handles images, masks, and annotations in one call:

from deeplake.managed.formats import CocoPanoptic

client.ingest(
    "coco_train",
    format=CocoPanoptic(
        images_dir="coco/train2017",
        masks_dir="coco/panoptic_train2017",
        annotations="coco/annotations/panoptic_train2017.json",
    ),
)

COCO requires the CocoPanoptic format

The HuggingFace shortcut does not work for COCO because its nested objects column (containing bboxes, categories, and areas) cannot be auto-converted. Use the CocoPanoptic format class above instead.

2. Stream and Parse Detection Data

Object detection models expect a list of images and a list of per-image target dictionaries (each containing boxes, labels, etc.). Use ds.batches() for fast streaming, then parse each batch into the format Faster R-CNN expects:

def parse_detection_batch(batch):
    images = []
    targets = []
    for i in range(len(batch["image"])):
        img = Image.open(io.BytesIO(batch["image"][i])).convert("RGB")
        images.append(to_tensor(img))

        segs = json.loads(batch["segments_info"][i])
        boxes, labels = [], []
        for seg in segs:
            bbox = seg["bbox"]  # [x, y, w, h]
            boxes.append([bbox[0], bbox[1], bbox[0] + bbox[2], bbox[1] + bbox[3]])
            labels.append(seg["category_id"])

        # Guard against images with no annotations: Faster R-CNN requires
        # an empty (0, 4) boxes tensor, not a 1-D empty tensor.
        targets.append({
            "boxes": torch.tensor(boxes, dtype=torch.float32) if boxes else torch.zeros((0, 4)),
            "labels": torch.tensor(labels, dtype=torch.int64) if labels else torch.zeros(0, dtype=torch.int64),
        })
    return images, targets

The segments_info column is stored as a JSON string and needs json.loads() to parse the bounding boxes and category IDs.
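The box conversion from COCO's [x, y, w, h] format to the [x1, y1, x2, y2] corner format Faster R-CNN expects is easy to get wrong, so it is worth checking in isolation. A minimal sketch, using a hand-made segments_info payload (the values are illustrative, not taken from the dataset):

```python
import json

# Hypothetical segments_info payload, as it would arrive from the table.
raw = json.dumps([
    {"bbox": [10, 20, 30, 40], "category_id": 3},
    {"bbox": [0, 0, 5, 5], "category_id": 7},
])

segs = json.loads(raw)
# [x, y, w, h] -> [x1, y1, x2, y2]
boxes = [[b[0], b[1], b[0] + b[2], b[1] + b[3]] for b in (s["bbox"] for s in segs)]
labels = [s["category_id"] for s in segs]

print(boxes)   # [[10, 20, 40, 60], [0, 0, 5, 5]]
print(labels)  # [3, 7]
```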

3. Define the Model

We start from a pretrained Faster R-CNN with a ResNet-50 FPN backbone and replace the classification head to match the COCO category count:

model = fasterrcnn_resnet50_fpn(weights="DEFAULT")
in_features = model.roi_heads.box_predictor.cls_score.in_features
model.roi_heads.box_predictor = FastRCNNPredictor(in_features, num_classes)

The pretrained backbone provides strong feature extraction out of the box; only the box predictor head is re-initialized for your target classes.
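Rather than hard-coding the class count, you can derive it from the annotation file's categories list, which avoids off-by-one surprises when category ids are sparse. A sketch with made-up category entries (in the real file this is the categories key of the panoptic annotations JSON):

```python
# Hypothetical excerpt of the "categories" list from a COCO-style annotation file.
categories = [
    {"id": 1, "name": "person"},
    {"id": 90, "name": "toothbrush"},
    {"id": 200, "name": "wall-stone"},
]

# Labels index into [0, num_classes), with 0 reserved for background,
# so the head must cover the largest category id present.
num_classes = max(c["id"] for c in categories) + 1
print(num_classes)  # 201
```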

4. Training Loop

Faster R-CNN computes its own loss internally when given both images and targets. The model returns a dictionary of losses (loss_classifier, loss_box_reg, loss_objectness, loss_rpn_box_reg) which are summed for backpropagation:

loss_dict = model(images, targets)
losses = sum(loss for loss in loss_dict.values())

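The summation itself is plain Python: the four components collapse into one scalar that backward() can propagate through. With illustrative values (in the real loop each entry is a 0-dim tensor returned by the model):

```python
# Illustrative loss values; the model returns these as 0-dim tensors.
loss_dict = {
    "loss_classifier": 0.52,
    "loss_box_reg": 0.31,
    "loss_objectness": 0.08,
    "loss_rpn_box_reg": 0.04,
}

losses = sum(loss_dict.values())
print(round(losses, 2))  # 0.95
```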
What to try next