Training with Data Lineage¶
Instead of downloading an entire dataset and writing scripts to filter it locally, you can use TQL queries directly on a managed table to create focused training subsets. The same query always produces the same view at a given commit, giving you reproducible data lineage for every experiment.
Objective¶
Ingest a COCO-style dataset into Deeplake, use TQL to select only the categories relevant to a driving scenario, and train a Faster R-CNN detector on that filtered subset, all without downloading or duplicating data.
Prerequisites¶
- Deeplake SDK: `pip install deeplake`
- Deep learning framework: `pip install torch torchvision`
- A Deeplake API token. Set credentials first.
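For example, credentials can be supplied through an environment variable before running any of the code below. The variable name and value here are placeholders; check your Deeplake account documentation for the exact mechanism.

```shell
# Hypothetical setup -- the exact variable name may differ for your
# Deeplake version; the token value is a placeholder.
export DEEPLAKE_TOKEN="your-api-token"
```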
Complete Code¶
import io
import json
import torch
import torchvision
from PIL import Image
from torchvision import transforms
from deeplake import Client
# 1. Setup
client = Client()
# 2. Ingest COCO (skip if already ingested)
from deeplake.managed.formats import CocoPanoptic
client.ingest("coco_train", format=CocoPanoptic(
    images_dir="coco/train2017",
    masks_dir="coco/panoptic_train2017",
    annotations="coco/annotations/panoptic_train2017.json",
))
# 3. Open the table and filter with TQL
ds = client.open_table("coco_train")
view = ds.query(
    "SELECT * WHERE category IN "
    "('car', 'truck', 'bus', 'motorcycle', 'bicycle', 'traffic light', 'stop sign')"
)
# 4. Verify the filtered subset
result = client.query(
    "SELECT COUNT(*) as count FROM coco_train "
    "WHERE category IN "
    "('car', 'truck', 'bus', 'motorcycle', 'bicycle', "
    "'traffic light', 'stop sign')"
)
print(f"Training on {result} samples")
# 5. Parse detection batches from ds.batches()
to_tensor = transforms.ToTensor()
def parse_detection_batch(batch):
    """Decode images and convert COCO-style segment annotations into the
    boxes/labels target dicts that torchvision detection models expect."""
    images = []
    targets = []
    for i in range(len(batch["image"])):
        img = Image.open(io.BytesIO(batch["image"][i])).convert("RGB")
        images.append(to_tensor(img))
        raw = batch["segments_info"][i]
        segs = json.loads(raw) if isinstance(raw, str) else raw
        boxes, labels = [], []
        for seg in segs:
            # COCO bboxes are [x, y, width, height]; convert to [x1, y1, x2, y2]
            bbox = seg["bbox"]
            boxes.append([bbox[0], bbox[1], bbox[0] + bbox[2], bbox[1] + bbox[3]])
            labels.append(seg["category_id"])
        targets.append({
            "boxes": torch.tensor(boxes, dtype=torch.float32) if boxes else torch.zeros((0, 4)),
            "labels": torch.tensor(labels, dtype=torch.int64) if labels else torch.zeros(0, dtype=torch.int64),
        })
    return images, targets
# 6. Train Faster R-CNN on the filtered subset
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = torchvision.models.detection.fasterrcnn_resnet50_fpn(weights="DEFAULT")
model.to(device)
model.train()
optimizer = torch.optim.SGD(model.parameters(), lr=0.005, momentum=0.9)
for epoch in range(3):
    n_batches = 0
    for batch in view.batches(batch_size=16):
        images, targets = parse_detection_batch(batch)
        images = [img.to(device) for img in images]
        targets = [{k: v.to(device) for k, v in t.items()} for t in targets]
        # In train mode, torchvision detection models return a dict of losses
        loss_dict = model(images, targets)
        losses = sum(loss for loss in loss_dict.values())
        optimizer.zero_grad()
        losses.backward()
        optimizer.step()
        n_batches += 1
        if n_batches % 50 == 0:
            print(f"Epoch {epoch}, Batch {n_batches}, Loss: {losses.item():.4f}")
Step-by-Step Breakdown¶
1. Ingest the Dataset¶
COCO requires the `CocoPanoptic` format class because its nested annotation structure (bboxes, categories, areas) cannot be auto-converted by the `huggingface` shortcut. Download COCO locally first, then ingest:
from deeplake.managed.formats import CocoPanoptic
client.ingest("coco_train", format=CocoPanoptic(
    images_dir="path/to/images",
    masks_dir="path/to/masks",
    annotations="path/to/annotations.json",
))
2. Query and Filter with TQL¶
Open the table and run a TQL query to create a filtered view. This does not copy or move any data. It returns a lightweight view that points to the matching rows.
ds = client.open_table("coco_train")
view = ds.query(
    "SELECT * WHERE category IN "
    "('car', 'truck', 'bus', 'motorcycle', 'bicycle', 'traffic light', 'stop sign')"
)
You can verify the size of your subset with a count query:
result = client.query(
    "SELECT COUNT(*) as count FROM coco_train "
    "WHERE category IN "
    "('car', 'truck', 'bus', 'motorcycle', 'bicycle', "
    "'traffic light', 'stop sign')"
)
print(result)
3. Train on the Filtered Subset¶
Pass the filtered view directly to a PyTorch DataLoader. Deeplake streams only the matching rows, so the training loop never touches irrelevant data.
from torch.utils.data import DataLoader
train_loader = DataLoader(
    view.pytorch(),
    batch_size=16,
    shuffle=True,
    num_workers=4,
)
From here, the training loop is standard PyTorch; nothing about it is Deeplake-specific.
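One detail worth noting for detection specifically: torchvision's detection models consume a list of variable-sized images and a parallel list of per-image target dicts, so the DataLoader needs a collate function that keeps samples as lists rather than stacking them into one tensor. A minimal sketch, assuming each sample yielded by the loader is an `(image, target)` pair; the `detection_collate` name is ours:

```python
import torch

def detection_collate(batch):
    """Keep variable-sized images and their target dicts as parallel lists,
    rather than stacking them into a single tensor (image sizes differ)."""
    images, targets = zip(*batch)
    return list(images), list(targets)

# Usage with two hypothetical (image, target) samples of different sizes:
samples = [
    (torch.zeros(3, 4, 4), {"boxes": torch.zeros((0, 4)), "labels": torch.zeros(0, dtype=torch.int64)}),
    (torch.zeros(3, 6, 6), {"boxes": torch.ones((1, 4)), "labels": torch.ones(1, dtype=torch.int64)}),
]
images, targets = detection_collate(samples)
```

Pass it as `collate_fn=detection_collate` to the DataLoader above; without it, the default collate would try to stack differently-sized images and fail.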
4. Track Lineage¶
Every TQL query is deterministic at a given dataset commit. This means:
- Reproducibility: re-running the same query on the same commit returns exactly the same rows.
- Auditability: you can log the query string alongside your model checkpoint to record which data was used.
- Iteration: changing the filter (e.g., adding `'pedestrian'`) creates a new view instantly, without re-downloading anything.
No external metadata store or DVC-style tracking is needed. The query is the lineage.
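To make this concrete, one lightweight convention is to write the query string and commit identifier into a small JSON file next to each checkpoint. This is a sketch of that convention, not a Deeplake API: the `write_lineage_record` helper is ours, and how you obtain the commit id depends on your versioning setup, so it is passed in explicitly.

```python
import json
import time

def write_lineage_record(path, query, commit_id, checkpoint):
    """Persist the exact TQL query and dataset commit used for a run.

    `commit_id` is whatever identifier your dataset versioning exposes;
    obtaining it is left to the caller since the exact API may vary.
    """
    record = {
        "tql_query": query,
        "commit_id": commit_id,
        "checkpoint": checkpoint,
        "logged_at": time.time(),
    }
    with open(path, "w") as f:
        json.dump(record, f, indent=2)
    return record

query = (
    "SELECT * WHERE category IN "
    "('car', 'truck', 'bus', 'motorcycle', 'bicycle', 'traffic light', 'stop sign')"
)
record = write_lineage_record("lineage.json", query, "abc123", "fasterrcnn_epoch3.pt")
```

Re-running the query from the record against the recorded commit reproduces the exact training subset.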
What to try next¶
- GPU-Streaming Pipeline: optimize throughput for large-scale training.
- Massive Ingestion: prepare large-scale datasets for training.
- Image Classification Training: a simpler single-label training example.