Training Object Detection in PyTorch¶
Deeplake makes it easy to train object detection models by streaming annotation-rich datasets directly into your PyTorch training loop. This tutorial shows how to ingest COCO, build a detection-ready DataLoader, and fine-tune a Faster R-CNN model.
Objective¶
Ingest the COCO dataset into a Deeplake managed table and train a Faster R-CNN ResNet-50 FPN model using PyTorch, streaming data directly from the cloud.
Prerequisites¶
- Deeplake SDK: pip install deeplake
- PyTorch with torchvision: pip install torch torchvision
- A Deeplake API token.
- COCO 2017 dataset files (images, panoptic masks, and the annotation JSON).
Set your Deeplake API credentials before running the code below.
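For example, the token is typically supplied through an environment variable before launching the training script (the variable name here is an assumption; check the Deeplake SDK documentation for the exact name your version reads):

```shell
# Assumed variable name; consult the Deeplake docs for your SDK version.
export DEEPLAKE_TOKEN="your-api-token"
```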
Complete Code¶
import io
import json
import torch
from PIL import Image
from torchvision import transforms
from torchvision.models.detection import fasterrcnn_resnet50_fpn
from torchvision.models.detection.faster_rcnn import FastRCNNPredictor
from deeplake import Client
# 1. Setup
client = Client()
# From local COCO files (CocoPanoptic format)
from deeplake.managed.formats import CocoPanoptic
client.ingest(
    "coco_train",
    format=CocoPanoptic(
        images_dir="coco/train2017",
        masks_dir="coco/panoptic_train2017",
        annotations="coco/annotations/panoptic_train2017.json",
    ),
)
# Note: COCO requires the CocoPanoptic format above.
# The HuggingFace shortcut does not work for COCO because
# its nested "objects" column cannot be auto-converted.
# 2. Open the table and stream with ds.batches()
ds = client.open_table("coco_train")
to_tensor = transforms.ToTensor()
def parse_detection_batch(batch):
    """Parse a ds.batches() batch into (images, targets) for Faster R-CNN."""
    images = []
    targets = []
    for i in range(len(batch["image"])):
        img = Image.open(io.BytesIO(batch["image"][i])).convert("RGB")
        images.append(to_tensor(img))
        # segments_info is stored as a JSON string
        segs = batch["segments_info"][i]
        if isinstance(segs, str):
            segs = json.loads(segs)
        boxes = []
        labels = []
        for seg in segs:
            bbox = seg["bbox"]  # [x, y, w, h]
            boxes.append([bbox[0], bbox[1], bbox[0] + bbox[2], bbox[1] + bbox[3]])
            labels.append(seg["category_id"])
        targets.append({
            "boxes": torch.tensor(boxes, dtype=torch.float32) if boxes else torch.zeros((0, 4)),
            "labels": torch.tensor(labels, dtype=torch.int64) if labels else torch.zeros(0, dtype=torch.int64),
        })
    return images, targets
# 3. Define the model
num_classes = 91  # COCO category ids run 1-90 (with gaps) plus background, so 91 classes total
model = fasterrcnn_resnet50_fpn(weights="DEFAULT")
in_features = model.roi_heads.box_predictor.cls_score.in_features
model.roi_heads.box_predictor = FastRCNNPredictor(in_features, num_classes)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)
# 4. Training loop using ds.batches()
optimizer = torch.optim.SGD(
    model.parameters(), lr=0.005, momentum=0.9, weight_decay=0.0005
)
model.train()
for epoch in range(3):
    epoch_loss = 0.0
    n_batches = 0
    for batch in ds.batches(batch_size=8):
        images, targets = parse_detection_batch(batch)
        images = [img.to(device) for img in images]
        targets = [{k: v.to(device) for k, v in t.items()} for t in targets]
        loss_dict = model(images, targets)
        losses = sum(loss for loss in loss_dict.values())
        optimizer.zero_grad()
        losses.backward()
        optimizer.step()
        epoch_loss += losses.item()
        n_batches += 1
        if n_batches % 50 == 0:
            print(f"Epoch {epoch}, Batch {n_batches}, Loss: {losses.item():.4f}")
    print(f"Epoch {epoch} complete - Avg Loss: {epoch_loss / max(n_batches, 1):.4f}")
torch.save(model.state_dict(), "fasterrcnn_coco.pth")
print("Model saved.")
Step-by-Step Breakdown¶
1. Ingest COCO Dataset¶
Deeplake supports ingesting COCO directly via the CocoPanoptic format, which handles images, masks, and annotations in one call:
from deeplake.managed.formats import CocoPanoptic
client.ingest(
    "coco_train",
    format=CocoPanoptic(
        images_dir="coco/train2017",
        masks_dir="coco/panoptic_train2017",
        annotations="coco/annotations/panoptic_train2017.json",
    ),
)
COCO requires the CocoPanoptic format
The HuggingFace shortcut does not work for COCO because its nested objects column (containing bboxes, categories, and areas) cannot be auto-converted. Use the CocoPanoptic format class above instead.
2. Stream and Parse Detection Data¶
Object detection models expect a list of images and a list of per-image target dictionaries (each containing boxes, labels, etc.). Use ds.batches() for fast streaming, then parse each batch into the format Faster R-CNN expects:
def parse_detection_batch(batch):
    images = []
    targets = []
    for i in range(len(batch["image"])):
        img = Image.open(io.BytesIO(batch["image"][i])).convert("RGB")
        images.append(to_tensor(img))
        segs = json.loads(batch["segments_info"][i])
        boxes, labels = [], []
        for seg in segs:
            bbox = seg["bbox"]  # [x, y, w, h]
            boxes.append([bbox[0], bbox[1], bbox[0] + bbox[2], bbox[1] + bbox[3]])
            labels.append(seg["category_id"])
        targets.append({
            "boxes": torch.tensor(boxes, dtype=torch.float32),
            "labels": torch.tensor(labels, dtype=torch.int64),
        })
    return images, targets
The segments_info column is stored as a JSON string and needs json.loads() to parse the bounding boxes and category IDs.
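As a minimal illustration of that parsing step, here is how a (made-up) single-object segments_info string is decoded and its [x, y, w, h] bbox converted to the [x1, y1, x2, y2] corner format Faster R-CNN expects:

```python
import json

# A hypothetical one-object segments_info payload for illustration only.
raw = '[{"bbox": [10.0, 20.0, 30.0, 40.0], "category_id": 3}]'
segs = json.loads(raw)

boxes, labels = [], []
for seg in segs:
    x, y, w, h = seg["bbox"]
    boxes.append([x, y, x + w, y + h])  # [x, y, w, h] -> [x1, y1, x2, y2]
    labels.append(seg["category_id"])

print(boxes)   # [[10.0, 20.0, 40.0, 60.0]]
print(labels)  # [3]
```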
3. Define the Model¶
We start from a pretrained Faster R-CNN with a ResNet-50 FPN backbone and replace the classification head to match the COCO category count:
model = fasterrcnn_resnet50_fpn(weights="DEFAULT")
in_features = model.roi_heads.box_predictor.cls_score.in_features
model.roi_heads.box_predictor = FastRCNNPredictor(in_features, num_classes)
The pretrained backbone provides strong feature extraction out of the box; only the box predictor head is re-initialized for your target classes.
4. Training Loop¶
Faster R-CNN computes its own loss internally when given both images and targets. In training mode the model returns a dictionary of losses (loss_classifier, loss_box_reg, loss_objectness, loss_rpn_box_reg), which are summed into a single scalar for backpropagation; see the training loop in the complete code above.
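As a minimal sketch of that aggregation, with made-up float values standing in for the real scalar loss tensors:

```python
# In training mode, Faster R-CNN returns a dict of scalar loss tensors.
# Illustrative float values stand in for the tensors here.
loss_dict = {
    "loss_classifier": 0.82,
    "loss_box_reg": 0.35,
    "loss_objectness": 0.11,
    "loss_rpn_box_reg": 0.07,
}

# All components are summed into one scalar before calling .backward().
losses = sum(loss_dict.values())
print(round(losses, 2))  # 1.35
```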
What to try next¶
- GPU-Streaming Pipeline: learn more about Deeplake's streaming architecture.
- Massive Ingestion: prepare large-scale datasets for training.
- MMDetection Integration: use Deeplake with the MMDetection framework.