Your Next Pit Crew: Enhancing RC Racing with Computer Vision

Anthony Casagrande
10 min read · Jan 9, 2025

I recently took on a computer vision passion project where I used Ultralytics YOLOv11 and Roboflow to detect and track off-road RC race cars in video footage. Beyond just identifying the cars, my goal was to create automatic highlight clips from these races and analyze racing lines to help drivers see the fastest way around the track. In this post, I’ll share the motivation behind the project, the steps I took to build it, and the lessons I learned along the way.

Race Line Tracking With Motion Stabilization

1. Project Overview & Motivation

Purpose

Races can be difficult to analyze manually, especially when cars move at high speeds, make quick direction changes, and occasionally disappear behind jumps or track obstacles. My main objectives:

Automatic Highlight Generation

  • Detect interesting, high-action moments in a race video, automatically clip them, and crop the clips for mobile viewing in tall (portrait) aspect ratios.

Racing Line Analysis

  • Track drivers’ paths around the track to identify which lines yield the fastest lap times.

Real-World Use Case

In the RC racing community, capturing highlights and performance data usually requires time-consuming, manual effort. Automating the detection of race events and analyzing racing lines can help drivers improve their techniques and offer spectators more engaging recaps.

2. Dataset Collection & Preparation

Data Sources

  • YouTube Videos: I gathered footage from my own races as well as other public RC race videos.
  • Google Search: Supplementary images were gathered to further diversify the dataset.

Annotation & Augmentation

I used Roboflow’s annotation tools to label the RC cars for each image in the dataset. Roboflow’s dataset augmentation features (like flipping, rotating, and color adjustments) helped expand and balance the data to improve model robustness.
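Roboflow applies these augmentations when you generate a dataset version, so no local code is needed. For readers who prefer to do it offline, a roughly equivalent pipeline with the Albumentations library might look like this (a sketch only, not part of this project's codebase):

# Hypothetical offline equivalent of the Roboflow augmentations (illustrative only)
import albumentations as A

augment = A.Compose(
    [
        A.HorizontalFlip(p=0.5),             # mirror the track layout
        A.Rotate(limit=15, p=0.5),           # small rotations for camera tilt
        A.HueSaturationValue(p=0.5),         # color shifts for different body shells
        A.RandomBrightnessContrast(p=0.5),   # indoor vs. outdoor lighting
    ],
    bbox_params=A.BboxParams(format="yolo", label_fields=["class_labels"]),
)

# augmented = augment(image=image, bboxes=yolo_boxes, class_labels=class_labels)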

Roboflow Universe Project Homepage
Example Annotation Data
More Example Annotation Data
Export Dataset with Augmentations

3. YOLOv11 Architecture & Training

Model Customizations

Image Size (imgsz)

  • Because RC cars are often small or blurry, I experimented with various imgsz values. Training and inference at the same resolution proved crucial for accuracy.
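As a quick illustration, the same resolution can be pinned at prediction time (a minimal sketch; the video path and confidence value are placeholders, and the weights path mirrors the one used later in this post):

# Sketch: run inference at the same imgsz used for training
from ultralytics import YOLO

model = YOLO("runs/detect/train29/weights/best.pt")
results = model.predict("race_clip.mp4", imgsz=1280, conf=0.25)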

Epochs & Patience

  • I adjusted the number of epochs and patience parameters to find a sweet spot where the model converged effectively without overtraining.

Batch Size & GPU Constraints

  • My GPU (RTX 3080 Ti with 12GB of VRAM) limited how large a batch size I could use, especially with the bigger model variants. Trial and error led me to an optimal batch size that fit in VRAM while still training efficiently.
# train.py
from ultralytics import YOLO, checks, hub

checks()
hub.login()

# Batch sizes by imgsz
batch_sizes = {
    1280: {
        'yolo11s.pt': 9,
    }
}

imgsz = 1280

# For fresh learning
model_file = 'yolo11s.pt'
model = YOLO(model_file)

batch_size = -1
if imgsz in batch_sizes and model_file in batch_sizes[imgsz]:
    batch_size = batch_sizes[imgsz][model_file]

results = model.train(
    data="dataset.v3.yaml",
    imgsz=imgsz,
    epochs=500,
    patience=100,
    batch=batch_size,
)

4. Workflow Integration: Roboflow & YOLOv11

I integrated Roboflow directly with YOLOv11 using the Roboflow API and CLI:

  • Dataset Download: Pulled the latest annotated dataset versions into my local environment.
  • Model Weights Upload: Stored trained weights in my Roboflow workspace for easy distribution or backups.
from roboflow import Roboflow

rf = Roboflow()
project = rf.workspace().project(PROJECT_NAME)
version = project.version(VERSION_NUMBER)
# Upload the locally trained weights to the Roboflow project version
version.deploy("yolov11", "weights", "runs/detect/train29/weights/best.pt")
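The dataset download side is similar. A minimal sketch with the Roboflow Python SDK (the API key is a placeholder, and the "yolov11" format string mirrors the deploy call above, so double-check the exact identifier for your SDK version):

# Sketch: pull the latest annotated dataset version for local training
from roboflow import Roboflow

rf = Roboflow(api_key="YOUR_API_KEY")
project = rf.workspace().project(PROJECT_NAME)
version = project.version(VERSION_NUMBER)
dataset = version.download("yolov11")  # writes images, labels, and a data.yaml locally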
  • Roboflow Supervision Library: Simplified data visualization, bounding box drawing, and overlays.
  • Model Preview: Test the model directly in the browser from the Roboflow Universe project page.

5. Deployment & Testing

Once the model was trained, I deployed and tested it locally on an Intel i9-12900K CPU with an Nvidia RTX 3080 Ti GPU. Here's how I set up the pipeline to process 5K 60 fps video footage:

Video Frame Loading

  • Used FFmpeg with NVIDIA hardware decoding/encoding (NVDEC/NVENC) for fast frame extraction and processing.
def _init_decode_process(self):
    """ffmpeg command to decode video frames with the NVIDIA HEVC decoder from start_time to end_time"""
    input_kwargs = {
        'vcodec': f'{self.info.codec}_cuvid' if self.use_cuda else self.info.codec
    }
    if self.start_time > 0:
        input_kwargs['ss'] = self.start_time
    if self.end_time > 0:
        input_kwargs['to'] = self.end_time
    if self.use_cuda:
        input_kwargs['hwaccel'] = 'cuda'

    self._decode_process = (
        ffmpeg
        .input(self.video_path, **input_kwargs)
        .output('pipe:',
                format='rawvideo',
                pix_fmt='bgr24',
                vf=f'scale={self.out_size[0]}:{self.out_size[1]}')
        .run_async(pipe_stdout=True)
    )

Real-Time Resizing

  • Resized each frame to the optimal imgsz for inference, then scaled predictions back to the original frame size for visualization. This proved much more performant than relying on the Ultralytics library to handle resizing automatically.
import cv2
import ffmpeg
import numpy as np
import supervision as sv

def process_video(self):
    # Get video metadata
    self._probe_video_metadata()

    # Set output size to the video size if it is missing
    if self.out_size is None or self.out_size > (self.info.width, self.info.height):
        self.out_size = (self.info.width, self.info.height)

    # Determine the scale factor based on the output size and the yolo_size (imgsz)
    self.scale_factor = (self.out_size[0] / self.yolo_size[0], self.out_size[1] / self.yolo_size[1])

    self._init_decode_process()

    # Each raw BGR frame is width * height * 3 bytes
    frame_size = self.out_size[0] * self.out_size[1] * 3
    frame_index = 0

    while True:
        # Read a frame from the ffmpeg process
        in_bytes = self._decode_process.stdout.read(frame_size)
        if not in_bytes:
            break

        frame_index += 1
        # Skip frames based on vid_stride
        if frame_index % self.vid_stride != 0:
            continue

        # Convert raw bytes to a numpy array
        frame = np.frombuffer(in_bytes, np.uint8).reshape([self.out_size[1], self.out_size[0], 3])

        # Downscale the frame for YOLO detection if required
        if self.yolo_size < self.out_size:
            scaled_frame = cv2.resize(
                frame, self.yolo_size, interpolation=cv2.INTER_CUBIC
            )
        else:
            scaled_frame = frame

        # Run YOLO inference + tracking on the scaled frame
        results = self._model.track(scaled_frame,
                                    persist=True,
                                    conf=self.conf,
                                    iou=self.iou,
                                    agnostic_nms=self.agnostic_nms,
                                    tracker='./trackers/botsort_cfg.yaml')

        # Convert detections for use with supervision, then scale them back up
        detections = sv.Detections.from_ultralytics(results[0])
        scale_detections(detections, self.scale_factor)

def scale_detections(detections, scale_factor):
    """Scale xyxy boxes from the YOLO input size back to the output frame size."""
    for box in detections.xyxy:
        box[0] *= scale_factor[0]
        box[1] *= scale_factor[1]
        box[2] *= scale_factor[0]
        box[3] *= scale_factor[1]
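To draw the rescaled detections back onto the full-resolution frame inside that loop, Supervision's annotators do most of the work. A short sketch (annotator names follow the supervision API; exact arguments can differ between versions):

# Sketch: overlay scaled detections and tracker IDs on the original frame
box_annotator = sv.BoxAnnotator()
label_annotator = sv.LabelAnnotator()

labels = [f"#{tracker_id}" for tracker_id in detections.tracker_id]
annotated_frame = box_annotator.annotate(scene=frame.copy(), detections=detections)
annotated_frame = label_annotator.annotate(scene=annotated_frame, detections=detections, labels=labels)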

Tracking and ID Management

  • Employed my own customized version of the BoT-SORT algorithm to track cars across frames. Tweaked settings to handle occlusions, sudden movements, and re-identifications.
# Ultralytics YOLO 🚀, AGPL-3.0 license
# Anthony Casagrande
# BirdRC tracker settings for BoT-SORT tracker https://github.com/NirAharon/BoT-SORT

tracker_type: botsort # tracker type, ['botsort', 'bytetrack']
track_high_thresh: 0.1 # threshold for the first association
track_low_thresh: 0.001 # threshold for the second association
new_track_thresh: 0.6 # threshold to initialize a new track if the detection does not match any existing tracks
track_buffer: 300 # buffer to calculate the time when to remove tracks
match_thresh: 0.99 # threshold for matching tracks (a higher number is more forgiving)
match_thresh_second: 0.7
match_thresh_third: 0.9
max_reactivation_distance: 20
fuse_score: True # whether to fuse confidence scores with the IoU distances before matching
# min_box_area: 10 # threshold for minimum box area (for tracker evaluation, not used for now)

# BoT-SORT settings
# 'orb', 'sift', 'ecc', 'sparseOptFlow', 'none'
gmc_method: none # sparseOptFlow # method of global motion compensation

# ReID model related thresh
proximity_thresh: 0.99
appearance_thresh: 0.25
with_reid: True

6. Challenges & Solutions

GPU Memory & Performance

  • Challenge: Larger models and higher imgsz values forced me to use smaller batch sizes, slowing down training and increasing inference time.
  • Solution: Found the “sweet spot” for model size and imgsz through careful testing, GPU usage monitoring, and accuracy comparisons. An imgsz of 1280 paired with the yolo11s.pt model proved best, along with batch sizes between 9 and 11 depending on desired VRAM usage.

Tracking Consistency

  • Challenge: The RC cars move quickly and get obscured by track jumps or poles, causing frequent ID switches.
  • Solution: Tweaked BoT-SORT settings to prioritize consistent IDs despite short occlusions. Integrated OpenAI’s CLIP Vision Transformer to generate appearance embeddings that improve re-identification after a track is lost. I decided to monkey-patch my implementation over the official one so that I could still use the built-in model.track(), avoiding code changes elsewhere.
# import the official ultralytics trackers
import ultralytics.trackers
from ultralytics import YOLO

# import my custom bot-sort implementation
import yolo_trackers

# monkey-patch the botsort implementation
ultralytics.trackers.BOTSORT = yolo_trackers.BOTSORT
ultralytics.trackers.track.TRACKER_MAP['botsort'] = yolo_trackers.BOTSORT

# yolo_trackers/clip.py

import torch
from transformers import CLIPProcessor, CLIPModel
import cv2
from PIL import Image

# Define the CLIP Encoder
class CLIPEncoder:
    def __init__(self, model_name="openai/clip-vit-base-patch32", device="cuda"):
        self.device = torch.device(device if torch.cuda.is_available() else "cpu")
        self.model = CLIPModel.from_pretrained(model_name).to(self.device)
        self.processor = CLIPProcessor.from_pretrained(model_name)
        self.image_size = 224  # Default CLIP image size

    def preprocess(self, image, bbox):
        """Preprocess a bounding box image for CLIP."""
        x, y, w, h = map(int, bbox[:4])
        cropped_image = image[y - h // 2:y + h // 2, x - w // 2:x + w // 2]
        cropped_image = cv2.cvtColor(cropped_image, cv2.COLOR_BGR2RGB)
        pil_image = Image.fromarray(cropped_image)
        inputs = self.processor(images=pil_image, return_tensors="pt", padding=True)
        return {key: value.to(self.device) for key, value in inputs.items()}

    def encode(self, image, bbox):
        """Encode a bounding box image to an embedding."""
        inputs = self.preprocess(image, bbox)
        with torch.no_grad():
            embeddings = self.model.get_image_features(**inputs)
        return embeddings.cpu().numpy()

# yolo_trackers/botsort.py

import numpy as np

from .byte_tracker import BYTETracker
from .clip import CLIPEncoder
# The remaining helpers come from Ultralytics (or local copies kept alongside byte_tracker)
from ultralytics.trackers.bot_sort import BOTrack
from ultralytics.trackers.utils import matching
from ultralytics.trackers.utils.gmc import GMC
from ultralytics.trackers.utils.kalman_filter import KalmanFilterXYWH


class BOTSORT(BYTETracker):
    def __init__(self, args, frame_rate=30):
        super().__init__(args, frame_rate)

        if self.args.with_reid:
            self.encoder = CLIPEncoder()
        self.gmc = GMC(method=self.args.gmc_method)

    def get_kalmanfilter(self):
        """Returns an instance of KalmanFilterXYWH for predicting and updating object states in the tracking process."""
        return KalmanFilterXYWH()

    def init_track(self, dets, scores, cls, img=None):
        """Initialize object tracks using detection bounding boxes, scores, class labels, and optional ReID features."""
        if len(dets) == 0:
            return []
        if self.args.with_reid and self.encoder is not None and img is not None:
            features_keep = [self.encoder.encode(img, det)[0] for det in dets]
            return [BOTrack(xyxy, s, c, f) for (xyxy, s, c, f) in zip(dets, scores, cls, features_keep)]  # detections
        else:
            return [BOTrack(xyxy, s, c) for (xyxy, s, c) in zip(dets, scores, cls)]  # detections

    def get_dists(self, tracks, detections):
        """Calculates distances between tracks and detections using IoU and optionally ReID embeddings."""
        dists = matching.iou_distance(tracks, detections)
        dists_mask = dists > self.args.proximity_thresh

        if self.args.fuse_score:
            dists = matching.fuse_score(dists, detections)

        if self.args.with_reid and self.encoder is not None:
            emb_dists = matching.embedding_distance(tracks, detections) / 2.0
            emb_dists[emb_dists > self.args.appearance_thresh] = 1.0
            emb_dists[dists_mask] = 1.0
            dists = np.minimum(dists, emb_dists)
        return dists

    def multi_predict(self, tracks):
        """Predicts the mean and covariance of multiple object tracks using a shared Kalman filter."""
        BOTrack.multi_predict(tracks)

    def reset(self):
        """Resets the BOTSORT tracker to its initial state, clearing all tracked objects and internal states."""
        super().reset()
        self.gmc.reset_params()

7. Results & Metrics

Using Ultralytics 8.3.38 on Python 3.11.10 with Torch 2.5.1+cu124, I achieved the following metrics against my test dataset (which was not used for training or validation):

val: Scanning datasets/roboflow.v3/test/labels.cache... 466 images, 3 backgrounds, 0 corrupt
Class Images Instances Box(P) R mAP50 mAP50-95
--------------------------------------------------------------------------------
all 466 2361 0.974 0.925 0.964 0.857
RC Car 341 1465 0.950 0.872 0.931 0.700
Short Course Truck 54 187 0.988 0.891 0.956 0.772
Stadium Truck 80 307 0.986 0.921 0.964 0.797
1/8 Scale 16 109 0.952 0.906 0.945 0.800
Exalt 93 93 0.981 0.968 0.992 0.980
ORCA Blue 81 83 0.980 0.940 0.972 0.963
ORCA Orange 117 117 0.978 0.974 0.991 0.982
Speed: 0.4ms preprocess, 5.7ms inference, 0.0ms loss, 0.4ms postprocess per image

You may notice extra classes in there, but I’ll save that for a future blog post 😉.

Model Evolution

My model improved over time as I added better annotations on a larger variety of images:

Early Model Accuracy Gains
  • Early Model: Lower mAP due to limited data and unoptimized hyperparameters.
  • Midway Model: Improved detection but still struggled with small or partially obscured cars.
  • Final Model: Achieved high precision and recall while keeping inference time reasonable.

8. Lessons Learned & Key Takeaways

Data Quantity & Quality

  • A large, well-annotated dataset was crucial to achieving robust detections. Once you have a decent-sized dataset and a trained model, you can use Roboflow’s Auto-Label to help annotate new images with your existing model. I also saved the video frame whenever a tracker went missing so those hard examples could be fed back into training (a sketch of the idea follows below).
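The lost-track capture is simple in principle. A minimal sketch of the idea (variable names are illustrative, not the project's actual code):

# Sketch: save a frame whenever a previously-seen tracker ID disappears
import cv2

previous_ids = set()  # carried across frames (initialize once before the loop)

# Inside the per-frame loop:
active_ids = set(detections.tracker_id) if detections.tracker_id is not None else set()
lost_ids = previous_ids - active_ids
if lost_ids:
    # The car was probably missed by the detector here, so keep the frame for re-annotation
    cv2.imwrite(f"hard_frames/frame_{frame_index:06d}.jpg", frame)
previous_ids = active_ids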

Optimization

  • Fine-tuning imgsz, batch size, and model selection had a massive impact on inference performance, training speed, and final model accuracy.

Tracking Complex Motion

  • Tracking off-road RC cars is a challenge due to unpredictable motions and visual occlusions. Adjusting algorithms like BoT-SORT and integrating additional re-identification (Vision Transformers) can help maintain track IDs.

9. Future Improvements

Going forward, I plan to:

Explore Advanced Models and Toolkits

  • Investigate other architectures or merge YOLO with stronger re-identification modules.
  • Integrate with SAHI (Slicing Aided Hyper Inference) for improved detection of smaller objects (see the sketch after this list).
  • Integrate with Nvidia DeepStream SDK and/or Triton Inference Server.
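As a rough idea of what the SAHI integration could look like (a sketch under the assumption that sahi's Ultralytics wrapper accepts YOLO11 weights; slice sizes and the input image are illustrative):

# Sketch: sliced inference with SAHI to catch small, far-away cars
from sahi import AutoDetectionModel
from sahi.predict import get_sliced_prediction

detection_model = AutoDetectionModel.from_pretrained(
    model_type="ultralytics",  # assumption: sahi's Ultralytics/YOLO wrapper
    model_path="runs/detect/train29/weights/best.pt",
    confidence_threshold=0.25,
    device="cuda:0",
)

result = get_sliced_prediction(
    "wide_track_shot.jpg",  # illustrative input image
    detection_model,
    slice_height=640,
    slice_width=640,
    overlap_height_ratio=0.2,
    overlap_width_ratio=0.2,
)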

Release Tools for the RC Community

  • Provide an easy interface so other racers can analyze their lap times and create highlight reels.

Expand to New Domains

  • Add more object classes such as On-Road cars, Mini-Z, Drift cars, etc.

10. Conclusion

Building an automated highlight reel generator and racing-line analyzer for RC cars was a fun and challenging project. YOLOv11 provided reliable detection, while Roboflow simplified dataset handling and visualization. Despite hurdles like GPU memory constraints and tricky tracking scenarios, I learned the immense value of thorough data prep, methodical tuning, and custom tracking algorithms.

I hope this post gives you insights into combining advanced object detection with robust tracking in a practical, real-world application. Feel free to reach out if you have questions or want to collaborate!

Check out the GitHub Repo for the full code!

Check out the Roboflow Universe Project to explore the dataset and try the model yourself!

Thank you for reading! If you found this useful, feel free to connect with me on LinkedIn or follow me on Medium for more updates on computer vision and machine learning projects!

About the Author

Anthony Casagrande is a results-driven Technical Lead and Staff Software Engineer with almost 15 years of expertise in designing scalable IoT, cloud, and edge computing systems. He has a proven track record of delivering high-performance solutions in AI-powered computer vision, cloud-native architectures, edge analytics, and open source development. He is recognized for optimizing complex systems, reducing operational costs, and leading technical teams to success.
