Retrieval to Training¶
Traditionally, building a training set involves complex ETL (Extract, Transform, Load) pipelines to move data from a database to a format usable by a model. Deeplake eliminates this step by allowing you to define your training set as a SQL query and stream matching rows directly to your GPU.
Objective¶
Retrieve the top-k most relevant clips from a massive video lake based on a training intent (e.g., "robotic picking") and prepare them for a fine-tuning loop.
Prerequisites¶
- Deeplake SDK: `pip install deeplake`
- A deep learning framework (PyTorch/TensorFlow).
- A Deeplake API token (set your credentials before running the examples below).
Complete Code¶
```python
from deeplake import Client
from torch.utils.data import DataLoader

# 1. Setup
client = Client()

# 2. Retrieve training candidates via vector search
# Find specific actions across 1M+ clips
training_intent = "robot arm picking up a glass bottle"
# query_emb = model.encode(training_intent).tolist()
query_emb = [0.12, 0.42, -0.1]  # Placeholder for intent embedding
emb_pg = "{" + ",".join(str(x) for x in query_emb) + "}"

matches = (
    client.table("video_catalog")
    .select("id", "file_id", "start_time", "end_time")
    .order_by(f"embedding <#> '{emb_pg}'::float4[] DESC")
    .limit(1000)
    .execute()
)
matched_ids = [m["id"] for m in matches]
print(f"Defined training set with {len(matched_ids)} relevant clips.")

# 3. Stream to PyTorch via a TQL view
# open_table returns a Dataset; query it with TQL to create a filtered view
ds = client.open_table("video_catalog")
id_list = ",".join(str(i) for i in matched_ids)
training_view = ds.query(f"SELECT * WHERE id IN ({id_list})")

# Stream the view directly to PyTorch
dataloader = DataLoader(training_view.pytorch(), batch_size=16, num_workers=4)
for batch in dataloader:
    print(f"Loading batch of {len(batch['id'])} retrieved video segments...")
    break
```
```bash
# Requires: export DEEPLAKE_API_KEY="..." (see quickstart)
# Requires: export DEEPLAKE_ORG_ID="your-org-id"
# Requires: export DEEPLAKE_WORKSPACE="your-workspace"
API_URL="https://api.deeplake.ai"

# Retrieve training candidate IDs via vector similarity search
curl -s -X POST "$API_URL/workspaces/$DEEPLAKE_WORKSPACE/tables/query" \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $DEEPLAKE_API_KEY" \
  -H "X-Activeloop-Org-Id: $DEEPLAKE_ORG_ID" \
  -d '{
    "query": "SELECT id FROM \"'"$DEEPLAKE_WORKSPACE"'\".\"video_catalog\" ORDER BY embedding <#> $1::float4[] DESC LIMIT 100",
    "params": ["{0.1,0.2,0.3}"]
  }'
```
Step-by-Step Breakdown¶
1. Training Set as a Query¶
In the Deeplake architecture, your training set is dynamic. If your model's accuracy is low on night scenes, you can simply add `WHERE is_night = True` to your SQL query and immediately restart your training loop on the new slice of data.
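To make this concrete, here is a minimal sketch (plain Python, no Deeplake dependency) of treating the training set as nothing more than a query string. The helper `training_set_query` and the `is_night` column are illustrative, borrowed from the example above; adapt both to your schema.

```python
def training_set_query(table, where="", limit=1000):
    """Build a TQL query string describing the current training slice."""
    clause = f" WHERE {where}" if where else ""
    return f"SELECT * FROM {table}{clause} LIMIT {limit}"

# Full slice
print(training_set_query("video_catalog"))
# Model is weak on night scenes? Narrow the slice and retrain:
print(training_set_query("video_catalog", where="is_night = True"))
```

Redefining the training set is a one-line change to the `where` argument; no data is moved or re-exported.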
2. Eliminating Data Redundancy¶
Because the training loop streams data directly from the managed lake, you don't need to maintain multiple copies of the data (e.g., one in a database and one as a .tar file for training). The lake is the source of truth for both metadata and tensors.
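The streaming idea can be sketched in plain Python (no Deeplake dependency): batches are materialized lazily from the source of truth and nothing is written to an intermediate file. Here `fetch_clip` is a hypothetical stand-in for reading one clip's tensors from the lake.

```python
def fetch_clip(clip_id):
    # Stand-in for reading one clip's tensors from the managed lake
    return {"id": clip_id, "frames": f"<frames for clip {clip_id}>"}

def stream_batches(clip_ids, batch_size):
    """Yield batches on demand; no .tar export, no second copy on disk."""
    batch = []
    for clip_id in clip_ids:
        batch.append(fetch_clip(clip_id))
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        yield batch  # final partial batch

batches = list(stream_batches(range(10), batch_size=4))
print([len(b) for b in batches])  # → [4, 4, 2]
```

A real training loop consumes batches the same way, except the generator is backed by the lake's streaming reader instead of `fetch_clip`.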
3. High-Performance Filtering¶
Deeplake's vector index (declared with `USING deeplake_index` on the table) keeps retrieval in the millisecond range even across millions of rows. This enables interactive, human-in-the-loop workflows where developers refine the training-set query in real time.
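For intuition, here is the brute-force scan the index replaces, as a toy sketch with made-up vectors: score every row by inner product against the query embedding and keep the top k. An approximate-nearest-neighbor index returns (roughly) the same ranking without touching every row.

```python
def inner_product(a, b):
    return sum(x * y for x, y in zip(a, b))

def top_k(rows, query_emb, k):
    """rows: list of (id, embedding); return the k most similar ids."""
    scored = sorted(rows, key=lambda r: inner_product(r[1], query_emb), reverse=True)
    return [row_id for row_id, _ in scored[:k]]

rows = [
    ("clip_a", [0.9, 0.1, 0.0]),
    ("clip_b", [0.1, 0.9, 0.0]),
    ("clip_c", [0.12, 0.42, -0.1]),
]
query = [0.12, 0.42, -0.1]
print(top_k(rows, query, k=2))  # → ['clip_b', 'clip_c']
```

This O(n) scan is what becomes prohibitive at millions of rows; the index is what makes the "retrieve, then train" loop interactive.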
What to try next¶
- GPU-Streaming Guide: technical details of the streaming kernels.
- Massive Ingestion: how to populate the lake with millions of candidates.
- Physical AI & Robotics: use case for robotics imitation learning.