Presentation: How to Unlock Insights and Enable Discovery Within Petabytes of Autonomous Driving Data

Kyra Mozley: Let's kick things off with a video. As you watch it, I want you to think about how you might go about describing what's happening in the scene, and how you could implement something to find scenes similar to it in a large corpus. Maybe you're thinking you could use object detection methods to work out there's something in the road. It's a good approach, but what about this scene? How would you describe it? This scenario is what we would call an edge case. It's a rare, unexpected, or unusual scenario that falls outside our typical operating conditions. Most driving scenarios encountered by autonomous vehicles involve routine tasks such as lane following, maintaining safe following distances, and navigating predictable road conditions. Because these common scenarios dominate driving datasets, rare but critical edge cases like these often become underrepresented.

Finding these edge cases, such as a cyclist suddenly falling off their bike in front of us, is a bit like searching for a needle not just in one haystack, but in many haystacks. Despite their rarity, these scenarios pose significant safety risks and represent the toughest challenges for autonomous systems. Therefore, explicitly identifying, retrieving, and understanding these edge cases is essential, not just for enriching training data, but also for rigorously evaluating the robustness and safety of autonomous vehicles. By prioritizing edge case discovery, we ensure that our models can handle not only the predictable majority of cases, but also the unpredictable minority of events critical for real-world safety and reliability.

Before diving deeper into how we can discover these edge cases, let me first introduce myself. I'm Kyra. I have a background in computer science. I spent some time doing some research on deepfake video detection before ending up as a machine learning engineer for Wayve, who are an autonomous vehicle startup company based in London. At Wayve, we're building AV 2.0. It's an end-to-end approach to driving. This means we don't need fancy LiDAR sensor stacks or HD maps. We're learning how to drive just from raw input data. We've seen that this approach has enabled us to expand across multiple vehicle platforms and multiple geographies very quickly. My day-to-day focuses not on training models for the cars to drive themselves, but on finding useful data in our large corpus that we can then use for training and evaluation.

Here's what we'll be covering today. First, we'll start with the traditional computer vision approach, how we've historically solved perception tasks using task-specific models and annotated datasets. I'll walk through why this pipeline has worked well in the past, but also where it starts to fall apart at scale. Then, I'll introduce what I'm branding Perception 2.0, a new way of working with video data that's built on semantic embeddings. We'll explore how this approach allows us to move from raw video to structured understanding with foundation models without needing to train a new model for every task.

From there, we'll take a step back and look at the landscape of foundation models available today. I'll highlight what types of inputs they support, what kind of outputs they produce, and how to pick the right one or right combination for your workflow. Once we've set the foundation, we'll dig into the workflows you can build on top of embeddings, covering things like search, RAG-inspired techniques, and clustering. These methods let us retrieve edge cases, explore structure in our data, and organize unlabeled video at scale.

Then, we'll look at how we go from unstructured data to labels with auto-labeling. I'll show how we can use models like SAM and CLIP to automatically generate labels, both object-level and structured scene metadata, and how we ensure quality with techniques like consensus labeling. Finally, we'll talk about how we can go beyond zero-shot prompts and attach zero-to-few shot adapters, small task-specific heads trained on top of our embeddings for classification and regression. This lets us tune our perception system to specific needs with just a handful of labeled examples without having to retrain a whole model.

To understand where we're going, we need to look at where we've been. For years, we had these perception tools in our computer vision toolbox for understanding scenes. All of these techniques are powerful. They're the standard ways we use to figure out what was going on in an image or a video. Each one typically focuses on a specific task, so segmentation assigns a class to every pixel. Object detection localizes objects with bounding boxes. Optical flow estimates motion across frames. With some of these methods, such as segmentation or cuboids, you define a fixed set of classes you want to detect up front, like vehicle, pedestrian, bike, and the model learns to assign those classes to those pixels.

This predefined taxonomy gives you a structured and consistent way to interpret the scene, which is great for downstream processing and for creating metrics. These tools are also incredibly useful for generating structured signals from our data. With segmentation and cuboids, we can calculate following distances to other vehicles. We can measure the traffic density in the scene and estimate the speeds of vehicles around us. We can track the number of drivable lanes. These kinds of metrics are key for mining specific behaviors from the dataset, whether we're searching for merging scenarios or simply trying to balance our training data between different conditions.

Now that we've established we want to detect whether there are people in the scene using segmentation, or maybe we want to localize traffic signs with bounding boxes using object detection, how do we actually go about training and deploying these models? In computer vision, we've followed a fairly standard pipeline for years, and it looks something like this. Your first step, get your data. In the autonomous driving world, that means a fleet of vehicles driving around, capturing sensor data, typically multi-camera video feeds, sometimes LiDAR or radar.

Then we'll end up with countless hours of driving footage across different weather conditions, lighting conditions, and locations. From that raw stream, we can filter out segments that are irrelevant or unusable, say, runs with bad calibration or bad sensor data. We try to focus on scenes that are diverse across different locations, so motorway, urban, day, night, busy, empty. You want to get a real range of data to give the model you're about to train a best shot at generalizing.

Then comes what is often the most resource-intensive part. Traditionally, we rely on teams of human annotators to go frame by frame, drawing bounding boxes around cars, outlining lane markings, tagging traffic lights, pedestrians, and so on. Before they even begin the task of labeling, you often have to develop a detailed label specification that you will then give to the labeling providers. This spec defines exactly how every object should be interpreted. Writing this detailed document can be very time consuming, and you also normally have to develop a contract with an external labeling provider. Managing this relationship means more of your time, as you now have to have meetings with them to describe exactly what you want from the project. Labeling hundreds of thousands of frames can take weeks or months, and this step can also cost you hundreds of thousands of pounds annually. Unfortunately, this step is the most vital, as without ground truth, we can't train a good model.

Once the data is labeled, it feeds into the model training pipeline. Historically, this meant using CNNs, like ResNet or EfficientNet, or more recently, vision-based transformers. Training these large-scale models typically requires distributed GPU clusters, or TPUs, depending on your dataset size, and can also take days or weeks to complete. Once the model is trained, we evaluate it on a held-out validation set, tracking metrics like mean intersection over union, average precision, recall, accuracy, depending on what the task was. If it passes our internal benchmarks, we'll deploy it, perhaps for online inference, or as part of our offline data mining tools. The work doesn't stop there. A critical part of the lifecycle is ongoing monitoring, because once a model is deployed in the real world, data drift is inevitable. Let's say we've trained a traffic sign detection model, and it performs well in the UK, but now we've expanded our fleet to Germany, and suddenly we're seeing unfamiliar signs. Blue circular signs to indicate lane instructions, or new construction markers.

The model starts to struggle, and it might misclassify signs, or fail to detect them entirely. What do we do? We have to go right back to the start, collect the footage from the new region, create examples of unfamiliar signage, annotate thousands of frames again with the bounding boxes and sign types. We might have had to write a new labeling spec for this new German taxonomy, and then retrain the model, and reevaluate, redeploy. This entire process is costly, both in terms of time and annotation budget, and this was just for one task, just for traffic sign detection. While the traditional pipeline is effective at delivering high-performance perception models, it's also inflexible. Every time the world changes, whether that's a new country, or a new sensor stack, or a new behavior we care about, we're forced to loop through this whole pipeline again. If that wasn't complex enough, remember that you're rarely just solving one vision task in autonomous driving.

Typically, you need parallel pipelines for each distinct perception objective. Here we have one pipeline for semantic segmentation to get pixel-level environmental context, another dedicated to depth estimation to get accurate distance measurements, another for 3D cuboid detection for tracking and modeling moving objects like cars, pedestrians, cyclists. Each pipeline independently demands its own labeled dataset, distinct annotation specs, and specialized training procedures. Managing all of these pipelines in parallel quickly becomes expensive and operationally challenging, especially as datasets grow in size. Coordination becomes cumbersome, and scalability is a significant bottleneck.

We've touched on how these methods help us extract metrics about the scene. For example, we said that we could get following distance by combining cuboids with depth, and we could find out if there was heavy traffic density or a person on the road using segmentation. This was our attempt to make unstructured video data structured. Is this actually enough to capture all the diverse scenes we might see? How does it enable us to find the edge cases? Let's revisit our earlier examples to illustrate this point. Using semantic segmentation, we identified that there was something on the road, but table wasn't one of the original segmentation classes, so that's not technically right. These traditional methods generally cannot provide accurate semantic descriptions of such rare events; as we see, the person actually ends up being labeled as road furniture when she is on the floor, probably because people on the floor weren't in the training dataset for the segmentation model.

Similarly, we see that she loses the cuboid mask when she's on the floor. These traditional computer vision methods haven't actually helped us understand the scene, and if anything, they failed on the edge cases themselves. Let's take a moment to reflect on the downsides of this traditional perception pipeline. First, it's expensive. Getting high-quality human-labeled data is costly, both in terms of money and time. You often don't just label once. For every new perception task you want to support, you often need to start the annotation process again. That means more data, more labeling contracts, more budget. Second, it's time-consuming. Each loop through this pipeline, collecting new data, labeling it, evaluating, is long, and because it's a sequential process, any bottleneck will slow everything down. This delays iteration, innovation, and ultimately your ability to respond quickly to new business needs.

Third, it's limited with predefined labels. Models like segmentation and object detection rely on fixed class lists. That label set needs to be carefully thought through up front. What are you detecting? What aren't you? What level of granularity do you want? Finally, it doesn't capture complex scenarios. Even if you were to detect a person and a bicycle, the system won't infer that the person is in the middle of falling. Capturing that level of nuance requires action recognition models, adding more complexity, more modules, and still limited generalization. You will often end up with a collection of narrowly focused models that struggle to stitch together the full picture of what's really going on in the scene. In short, traditional computer vision can do a lot of the heavy lifting, but it only sees what it's been trained to see. It struggles with anything outside its known classes, and it fails to capture the narrative, the why, the how, that are so important for understanding data.

You've probably heard the phrase, a picture is worth a thousand words, and today I want to take that idea seriously and show how it can transform the way that we approach perception problems. Traditionally, we've reduced scenes to the simple outputs we've seen, like bounding boxes and segmentation masks, and we've seen how limited they are in what they can express. What if, instead, we could describe a scene the way a human might, in rich, expressive, natural language? Not just bike, person, road, but, a woman on a bicycle swerves into traffic and falls directly in front of the car. Better yet, what if we could turn those descriptions into something that a machine could understand and search across at scale? That's where embeddings come in.

If we embed images and text into the same space, then we can do some really powerful things. We can search video datasets using natural language. We can find similar moments without needing explicit labels. We can train classifiers with just handfuls of examples instead of thousands. Let's make this real. Here, I took the same cyclist video we saw earlier and gave it to our favorite, ChatGPT. We originally tried to interpret this with perception tools, but none of them could really tell us what was happening. Now we pass the video to GPT, and what do we get? A woman riding a bicycle suddenly swerves into traffic cones and falls off her bike directly in front of a moving car. That's not just detection, it's not just classification, that's narrative understanding.

The model identifies actions: swerves, falls. It understands relationships: in front of a moving car. It's these kinds of insights that are almost impossible to extract unless you've built very custom pipelines. Here, we get it all from a single general-purpose foundation model. No fine-tuning, no manual labeling, just pre-training at scale. This demonstrates an extraordinary shift. Instead of training models to detect predefined labels, we allow a powerful pre-trained language and vision model to generalize from its vast background knowledge, effectively zero-shot labeling this complex scene.

This brings us to what I call Perception 2.0, a new way of understanding visual data that's been made possible by the rise of foundation models and embeddings. These foundation models, such as CLIP or GPT-4 Vision, are trained on internet-scale data, often across multiple modalities like images and text. They've learned to build deep, general-purpose representations of the world. At the heart of this shift is a concept called an embedding. An embedding is a dense numerical vector, usually hundreds or thousands of dimensions. It captures the semantic meaning of an input, whether that's an image, a sentence, or even a short video. Embeddings let us represent data in a shared space where similar concepts live near each other.

One example you might have heard of is that king minus man plus woman equals queen. The model has learned the relationships between concepts, not just the labels. It knows the concept of gender, and it knows that king is to man as queen is to woman. Image models like CLIP extend this idea, except they're not just embedding words, they're embedding images and text into the same space. What does that mean? It means if I embed a photo of a cyclist and I embed the sentence, a person on a bicycle, these two vectors will land near each other in the shared space. This is the core of Perception 2.0. Instead of training separate models for every task, we use a foundation model to embed our data once. That embedding encodes rich semantic meaning, and we can build fast, flexible workflows on top, search, clustering, classification, without having to touch raw pixels or labels. In other words, a picture is worth a thousand words, and today it's also worth a high-dimensional vector.
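To make that concrete, here's a minimal sketch of embedding an image and two sentences into the same space with an open CLIP checkpoint from Hugging Face. The model name, image path, and prompts are illustrative placeholders, not Wayve's actual setup.

```python
# Minimal sketch: image and text land in the same embedding space (assumed setup).
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("cyclist_frame.jpg")  # hypothetical frame from the corpus
texts = ["a person on a bicycle", "an empty motorway at night"]

inputs = processor(text=texts, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# Normalise so cosine similarity is just a dot product.
img_emb = outputs.image_embeds / outputs.image_embeds.norm(dim=-1, keepdim=True)
txt_emb = outputs.text_embeds / outputs.text_embeds.norm(dim=-1, keepdim=True)
print(img_emb @ txt_emb.T)  # the cyclist frame should score higher on the first sentence
```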

There are a lot of foundation models out there, and it was hard to decide which ones should make the cut for this slide. I wanted to provide a quick overview of the landscape, and I went for the ones that were most relevant to the kind of work we've been doing: ones with good native semantic understanding that work well for retrieval and large-scale video analysis. There are plenty more out there. You can check Hugging Face for all the rankings. It will only be a matter of time before this slide's also out of date. When evaluating what model you want to use, there are a few key dimensions you can consider. First, access.

Some models are completely open source, meaning we can run them on our hardware at scale or fine-tune them if we wanted to. Others are API-only. These models tend to be more capable of reasoning and generation, but they often cost to call. It really depends on your workflow. Do you want full control and scale, or are you optimizing for rapid prototyping? Input modality is another key axis here. Most models work with images and text, but a few go further. InternVideo2 is designed specifically for video, so it's great for AV use cases where temporal understanding matters. ImageBind is another really interesting one, and it supports up to six input modalities, including audio, depth, and IMU. That opens up a lot of possibilities for cross-modal learning. You'll also see that most of these models output embeddings, which is critical for the kind of similarity search, clustering, and classifier training we'll be talking about. Some also output text, making them useful for structured scene descriptions or visual question answering, VQA.

If your goal is structured metadata extraction or even automating annotation, these might be worth exploring. You can also combine models. For example, we might use InternVideo2 to embed a 5-second video segment and then use CLIP or GPT-4 for text prompts and label generation. Rather than trying to pick the best one, you want to make sure you build a flexible, foundation-model-agnostic inference pipeline that lets you easily swap out components as the model landscape evolves (there's a small sketch of this idea below). The key takeaway here is that foundation models give us a flexible, modular toolkit for perception tasks, but the real power comes from understanding what each model is good at and how to compose them together for your specific problem. I mentioned that you have a choice between open-source and API-based models. Your choice will depend on whether you want to self-host.

It can potentially be cheaper at larger scale, and it gives you full control over the data, but you need in-house expertise and infrastructure to manage GPUs at scale and to handle versioning. However, if you use an API-based model, such as OpenAI's or Claude, there's minimal overhead, as you're not having to manage GPU clusters. They're often very easy to integrate into your own tools, but it can get expensive if you're processing millions of images. You might also have data privacy concerns. Can you send your data to these external services? That's something you'll have to check with your in-house legal teams before you do. If you're a startup, an API may get you value faster. If you're a larger org with dedicated infra, in-house will give you longer-term control.
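As a rough illustration of that foundation-model-agnostic pipeline, the sketch below defines a tiny embedder interface so the backend (CLIP, InternVideo2, an API model) can be swapped without touching the downstream search, clustering, or adapter code. The class and function names are hypothetical, not an existing library.

```python
# Hypothetical sketch of a foundation-model-agnostic embedding interface.
from typing import Protocol, Sequence

import numpy as np

class Embedder(Protocol):
    """Any backend (CLIP, InternVideo2, an API model) just needs these two methods."""
    dim: int
    def embed_images(self, frames: Sequence[np.ndarray]) -> np.ndarray: ...
    def embed_text(self, queries: Sequence[str]) -> np.ndarray: ...

def build_index_inputs(embedder: Embedder, frames, queries):
    # Downstream workflows only ever see vectors, so swapping the backend
    # is a configuration change rather than a pipeline rewrite.
    frame_vecs = embedder.embed_images(frames)
    query_vecs = embedder.embed_text(queries)
    return frame_vecs, query_vecs
```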

Let's walk through a concrete example of how embedding-based search works in practice. Say we have a dataset of 1 million video frames captured during highway driving. For each frame, we compute an embedding using a foundation model, like InternVideo, and that outputs a high-dimensional vector representation of the frame's semantic content. Once we've generated these embeddings, we store them in a vector database. These databases are optimized to support fast similarity search over high-dimensional spaces, even when you're working with tens or hundreds of millions of vectors. Suppose we want to find the scenes in our dataset where someone is falling off a bicycle. With traditional perception methods, we've seen this is really hard, but with embeddings, it's dramatically simpler.

Instead of designing a bespoke model, we treat it as a semantic search problem. We write the natural language query, a person falling off a bicycle. We then embed that query using the same foundation model, and this gives us a query vector in the same semantic space as our image embeddings. We can then perform a similarity search to find the closest vectors to our query. Under the hood, vector databases perform this search using similarity measures such as cosine similarity.

To make this fast at scale, they rely on approximate nearest neighbor algorithms. These nearest neighbor algorithms can retrieve the top K most semantically similar frames in milliseconds, even from a database with hundreds of millions of entries. The magic here is that we never trained a model to detect falling. We didn't even label falling. We didn't predefine a class, but the model still retrieves relevant frames. Why? Because the concept of falling is already encoded in the shared image text embedding space, thanks to the model's large-scale pre-training across internet data. The semantic relationships between falling, bike, person, and dangerous event are already there.
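Here's a minimal sketch of that retrieval flow using FAISS as a stand-in for a managed vector database. The embed_images and embed_text helpers are assumed wrappers around whatever foundation model you've chosen, and an exact inner-product index is used for brevity; at the scale described you'd swap in an approximate index such as HNSW or IVF.

```python
# Sketch of embedding-based retrieval with FAISS (stand-in for a vector database).
# embed_images/embed_text are assumed wrappers around the chosen foundation model.
import faiss
import numpy as np

frame_embeddings = embed_images(frames).astype(np.float32)  # e.g. (1_000_000, 1024)
faiss.normalize_L2(frame_embeddings)                        # cosine similarity via inner product

index = faiss.IndexFlatIP(frame_embeddings.shape[1])        # exact; use HNSW/IVF at scale
index.add(frame_embeddings)

query = embed_text(["a person falling off a bicycle"]).astype(np.float32)
faiss.normalize_L2(query)
scores, frame_ids = index.search(query, 10)                 # top-10 most similar frames
```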

We've seen how embedding-based search allows us to find semantically similar scenes by embedding a natural language query into the same space as our video or image embeddings, but like any tool, it has limitations, especially when working with real user queries and long-tail concepts. One of the biggest challenges I've found is something we deal with all the time in driving data, and that's positional language. Consider these two descriptions. A cyclist is in front of the car, and we are behind a cyclist. They mean exactly the same thing, but some vision language models don't handle these kinds of perspective shifts very well. Why? Because their text embedding space isn't spatially grounded. They're trained to align what's in the image, not always where it is or how it relates to the camera's position. This is where we can borrow ideas from RAG, and apply them to perception search. Here, we're not generating responses, but the RAG philosophy still holds. We can use external retrieval and model augmentation to improve results. In our case, we can apply this by rewriting, enriching, and reranking queries. Instead of embedding a single query and hoping it hits, we can generate multiple rephrasings.

For example, we take biker falls in traffic, and we can use a language model to generate variants of that phrase. These variants are embedded individually and searched in parallel. This helps us cover gaps in the model's understanding, especially for directional or causal phrases that embeddings don't always encode reliably. Once we've embedded all the variants, we can combine them in a few ways. We can either average their embeddings to create a single fused vector that we can then search for similarity to, or we can search each separately and then merge the top results. This is called multi-query fusion, and it increases recall, especially for vague or context-dependent queries.
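A rough sketch of that multi-query fusion step is below. rewrite_query stands in for a call to a language model that returns paraphrases, and embed_text and index are the assumed helpers from the search sketch above.

```python
# Sketch of multi-query fusion: rewrite, embed each variant, then fuse or merge.
import numpy as np

def multi_query_search(query, index, k=10):
    variants = [query] + rewrite_query(query)   # hypothetical LLM paraphrase call
    vecs = embed_text(variants).astype(np.float32)
    vecs /= np.linalg.norm(vecs, axis=1, keepdims=True)

    # Option A: average into one fused query vector.
    fused = vecs.mean(axis=0, keepdims=True)
    fused /= np.linalg.norm(fused)
    fused_scores, fused_ids = index.search(fused, k)

    # Option B: search each variant and keep each frame's best score.
    scores, ids = index.search(vecs, k)
    best = {}
    for row_scores, row_ids in zip(scores, ids):
        for s, i in zip(row_scores, row_ids):
            best[int(i)] = max(float(s), best.get(int(i), float("-inf")))
    merged = sorted(best.items(), key=lambda kv: -kv[1])[:k]
    return fused_ids[0], merged
```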

After retrieval, we often have a list of pretty good matches, but they're not always perfectly sorted, especially when we're working across phrasing or visual ambiguity. We can apply a reranking step: use a cross-modal model to rescore retrieved scenes based on alignment with the original query. For example, we could use a captioning model to describe the scenes and check, does this really show a cyclist falling in front of a vehicle? Reranking helps us prioritize precision after we've maximized recall. This technique works incredibly well for driving datasets, where users describe scenes from the ego perspective.

Positional reasoning, such as in front of, to the left, coming towards us, is crucial. By applying RAG style techniques to search, we can overcome limitations in the embedding model's linguistic grounding and get better coverage of the scenes we actually care about. Embedding-based search is powerful, but not perfect. When we augment it with query rewriting and reranking, we turn it into a robust tool for retrieving high-value data at scale.
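As a sketch, a reranking pass can be as simple as rescoring the retrieved candidates against the original query with a second, stronger cross-modal scorer. The scorer callable here is an assumption; it could be a larger CLIP variant or a captioning/VQA check.

```python
# Sketch of reranking: maximize recall with retrieval, then re-sort for precision.
def rerank(query, candidate_frames, scorer):
    # scorer(frame, query) -> float is assumed, e.g. a VQA model asked
    # "does this really show a cyclist falling in front of the vehicle?"
    scored = [(scorer(frame, query), frame) for frame in candidate_frames]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [frame for _, frame in scored]
```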

So far, we've talked about how embeddings allow us to perform powerful semantic search, but embeddings aren't just useful for retrieval. They also enable us to group similar data together, even when we don't know exactly what we're looking for. This is where clustering comes in. By running clustering algorithms on embedding vectors, we can automatically discover structure in our data and generate labels from the bottom up, without any supervision. Clustering algorithms allow us to find dense regions or natural groupings in the embedding space. We can think of this as unsupervised labeling. Instead of relying on handcrafted taxonomies, we let the data organize itself based on its semantic structure. Let's say we embed 1 million frames again from our driving dataset.

Now we have 1 million 1,024-dimensional vectors, and we can feed these into a clustering algorithm like k-means, which is the classical approach. You specify the number of clusters you want, and it will find cluster centroids that minimize the within-cluster distance. Or maybe you use DBSCAN. This is what vector databases will normally use. It will find the clusters of points that are densely packed together and label outliers as noise. These algorithms help us to partition our embedding space into meaningful clusters. What do we do with the clusters? This gives us a very powerful capability. We can assign cluster-based labels to data, even if we never defined those labels ahead of time.
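For illustration, here's a minimal sketch of both options on an assumed array of frame embeddings, using scikit-learn; the cluster count and DBSCAN parameters are placeholders you'd tune for your own data.

```python
# Sketch of unsupervised clustering over frame embeddings (assumed (n_frames, 1024) array).
from sklearn.cluster import DBSCAN, KMeans

kmeans = KMeans(n_clusters=50, random_state=0)        # you pick k up front
kmeans_labels = kmeans.fit_predict(embeddings)

dbscan = DBSCAN(eps=0.3, min_samples=20, metric="cosine")
dbscan_labels = dbscan.fit_predict(embeddings)         # label -1 marks outliers/noise
```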

Let me give you a couple of concrete examples from the autonomous driving domain. When we run clustering over a large dataset, we might find that one cluster contains frames where the vehicle is just following clear lanes on a straight road. These frames are semantically similar. They're low complexity, low interaction, and visually consistent. We can label this cluster as lane following and treat it as one behavioral competency.

If we want to balance our training data or avoid overfitting to common patterns, we might downsample from this cluster. Another cluster might contain frames that are visually very different from the rest. Dark images, overexposed scenes, or frames with raindrops or lens glare. These may represent bad data, frames with poor visual quality that we don't want in our training or evaluation sets. By clustering, we discover these outliers automatically. We don't need to write a rule to detect raindrop on lens. The model already knows what normal looks like, and clustering reveals the frames that don't belong. This approach works well for a few reasons. Embeddings from foundation models already encode a lot of semantic information.

Clustering is unsupervised, so we don't need manual labels to get started, and it automatically surfaces patterns or anomalies. Once we've identified interesting clusters, we can assign soft or hard labels and use those for filtering, training, or analysis. We can even build simple UIs that let humans quickly scan frames from each cluster and assign high-level labels in minutes, a task that used to take days of work. Instead of starting with labels and finding data, we start with the embeddings and let the data find the labels. In the era of petabyte-scale autonomy data, that's not just a nice-to-have, it's essential.

Earlier, I walked you through the traditional perception pipeline, and I highlighted one of the biggest pain points was manual data annotation. Clustering gives us one way to discover structure in our data. It helps us surface patterns and group similar scenes together.

The next question becomes, how could we create more structured data automatically at scale? That's where auto-labeling comes in. Auto-labeling is the process of generating labels automatically from raw data using pre-trained models or heuristics without involving human annotators. It's a fundamental shift from manual labeling to programmatic labeling, and foundation models have made it vastly more powerful and practical. These models give us tools that don't require custom training for every task. They don't need tens of thousands of examples per class, and they don't lock us into rigid label taxonomies. Let's walk through an example. Suppose we start with SAM, the Segment Anything Model. It's great for identifying objects in an image, even if it doesn't know what they are. It gives us segmentation masks, so here's one object, here's another.

Next, we can take each of these segments and pass them through CLIP to determine what they represent. We do this by comparing the embedding of the segment to the embedding of a set of natural language prompts, like an image of a car, a person, a road sign. In the shared embedding space, the segment will naturally cluster closest to the most semantically similar text prompt. In this case, road sign. That similarity gives us a label without any human input, training, or bounding box drawing. Alternatively, we could use models like Grounding DINO, which already directly align text with bounding boxes, so you can say a person crossing the road, and get a spatially grounded result straight from the model. This kind of pipeline gives us object-level labels at scale with no manual effort and no task-specific model training.
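A rough sketch of that SAM-plus-CLIP pipeline is below: SAM proposes masks, each crop is scored against a set of text prompts, and the closest prompt becomes the label. The checkpoint path, prompts, and image path are placeholders.

```python
# Sketch of SAM + CLIP auto-labeling: SAM segments, CLIP names each segment.
import numpy as np
import torch
from PIL import Image
from segment_anything import SamAutomaticMaskGenerator, sam_model_registry
from transformers import CLIPModel, CLIPProcessor

sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h.pth")   # placeholder checkpoint
mask_generator = SamAutomaticMaskGenerator(sam)
clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

prompts = ["an image of a car", "an image of a person", "an image of a road sign"]
image = np.array(Image.open("frame.jpg"))                       # HWC uint8 frame

for mask in mask_generator.generate(image):
    x, y, w, h = map(int, mask["bbox"])                         # XYWH box of the segment
    crop = Image.fromarray(image[y:y + h, x:x + w])
    inputs = processor(text=prompts, images=crop, return_tensors="pt", padding=True)
    with torch.no_grad():
        logits = clip(**inputs).logits_per_image                # crop vs. each prompt
    print(prompts[int(logits.argmax())], mask["bbox"])          # label with no human input
```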

We're not just limited to classifying objects. Foundation models can now generate structured scene descriptions, too. Let's say we wanted to label metadata, like scene features, such as weather or road conditions. Traditionally, you'd have to train separate classifiers for each of those fields, collect data, write specs, annotate thousands of examples. Now, we can just define a JSON schema like this and prompt a multimodal model, like GPT-4 Vision, to fill it in directly from an image or video segment. The model sees the whole scene, it understands the semantics, and it returns a structured output in a single pass. Auto-labeling can be incredibly powerful, but it's not perfect. Models can hallucinate, they can misinterpret scenes, or give inconsistent results depending on your phrasing. When we rely on foundation models to generate labels at scale, we need tools to validate and refine those outputs without throwing humans back into the loop full-time.
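As a sketch, the schema-driven metadata extraction might look like the following, using the OpenAI chat API with an inline image; the schema fields, model name, and file path are illustrative.

```python
# Sketch of schema-guided scene tagging with a multimodal API model.
import base64
import json

from openai import OpenAI

client = OpenAI()
schema = {
    "weather": "clear | rain | fog | snow",
    "time_of_day": "day | dusk | night",
    "road_type": "motorway | urban | rural",
    "lanes_visible": "integer",
}

with open("frame.jpg", "rb") as f:                      # placeholder frame
    image_b64 = base64.b64encode(f.read()).decode()

response = client.chat.completions.create(
    model="gpt-4o",                                     # illustrative model choice
    response_format={"type": "json_object"},
    messages=[{"role": "user", "content": [
        {"type": "text", "text": f"Fill in this JSON schema for the driving scene: {json.dumps(schema)}"},
        {"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"}},
    ]}],
)
metadata = json.loads(response.choices[0].message.content)
```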

One of the most effective techniques we use is called consensus labeling. The idea is simple. Instead of trusting a single model or a single prompt, we can generate multiple candidate labels and look for agreement across them. You might run the same prompt across multiple models, or run multiple prompts through the same model. Maybe you have one prompt that's more verbose, one that's concise, and one that's schema-based. You can then aggregate the outputs using voting, confidence thresholds, or even small ensemble rules to decide on what the final label should be. This helps filter out outliers, reduce hallucinations, and add stability to your labels. Some models even output token-level or field-level confidences, so you could use these to drop low-confidence labels or flag frames for manual review. These techniques make auto-labeling more than just fast, they make it reliable.
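A minimal sketch of consensus labeling: run several model-and-prompt combinations over the same frame, keep the majority label only when agreement clears a threshold, and flag everything else for review. The callables and threshold are assumptions.

```python
# Sketch of consensus labeling across multiple models/prompts.
from collections import Counter

def consensus_label(frame, label_fns, min_agreement=0.6):
    # label_fns: list of callables, each a (model, prompt) combination returning a label string.
    votes = Counter(fn(frame) for fn in label_fns)
    label, count = votes.most_common(1)[0]
    if count / len(label_fns) >= min_agreement:
        return label
    return None  # no consensus: drop the auto-label or flag the frame for manual review
```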

Let's revisit the original perception pipeline we looked at earlier. For many use cases, we can now actually cross out two of the most expensive and time-consuming stages. Instead, we now rely on pre-trained foundation models to extract the semantic meaning from raw data, and vector databases to store these embeddings and power search, clustering, and labeling workflows. Rather than building a new model every time we need a new perception output, we can prompt, embed, retrieve, and validate. Whereas previously, you'd spend weeks designing label schemas, months sending data out for annotation, and more time retraining models, now you can just prompt a model, label at scale, and generate insights across your video in hours, not weeks. Evaluation still matters, but it's evolving.

Previously, we might have evaluated segmentation models by checking if it drew a mask correctly over a pedestrian, but now, if we're using a VQA model or a captioning model, we'll need to ask, did the model say that a pedestrian was present? Did it understand the scene correctly? Are we surfacing the right results during search? While our metrics might shift from pixel-wise IOU to things like recall at 10, the underlying principle remains. We still need to verify that what we're getting is what we care about. Perception 2.0 doesn't throw the old pipeline away, it retools it. It makes it faster, more flexible, and more scalable.
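To make the shift in metrics concrete, here's a minimal sketch of recall@k for a retrieval query, assuming you have a small set of ground-truth relevant frame ids per query.

```python
# Sketch of a retrieval metric (recall@k) replacing pixel-wise IoU for search workflows.
def recall_at_k(retrieved_ids, relevant_ids, k=10):
    # retrieved_ids: ranked list of frame ids returned by the search pipeline
    # relevant_ids: ground-truth frame ids known to match the query
    hits = len(set(retrieved_ids[:k]) & set(relevant_ids))
    return hits / max(len(relevant_ids), 1)
```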

By now, we've seen how foundation models can help us auto-label massive datasets quickly, flexibly, and often with surprisingly good results. Here's the next natural question, what if we need more accuracy? Maybe you're building a safety critical evaluation set, maybe you're doing fleet-wide metrics on an edge case, or maybe you're training a downstream model that needs really high-quality labels. In those cases, it's not enough to rely entirely on auto-labels or zero-shot outputs, but that doesn't mean we go back to full-scale manual annotation or training a model from scratch. Instead, we take the embeddings we've already generated from foundation models and we train a small task-specific classifier on top. This approach gives us the best of both worlds, the semantic richness of large foundation models with the precision and control of custom classifiers tuned to our exact use case. The pipeline looks like this. Start with your raw inputs, whether that be images, video, text. Pass them through the foundation model to get your dense embedding vectors, and store those vectors in a vector database or keep them in memory for training.

Then, train a simple classification head using a small neural network on top of these embeddings. These models are incredibly lightweight, I'm talking several layers. We've found that they converge with just 10 to 30 labeled examples per class. Because the embeddings already encode so much context, they generalize really well with such tiny training sets. Let me give you an example. Let's say we wanted to detect scenes where we failed to complete a lane change. These are hard to capture with rules and heuristics, and even harder to generalize with standard classifiers trained from scratch, since we need a genuine understanding of the scene over time. With this approach, we could collect a handful of positive and negative examples of us failing to complete lane changes, embed them using our foundation model, and train a small classifier on top of these embeddings. Just like that, we have a high accuracy model for a niche perception task trained in minutes.
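Here's a rough sketch of such a head: a two-layer network trained on frozen embeddings with a handful of labels. The embedding array, labels, dimensions, and training loop are all illustrative.

```python
# Sketch of a lightweight classification head on top of frozen foundation-model embeddings.
import torch
import torch.nn as nn

class EmbeddingHead(nn.Module):
    def __init__(self, dim=1024, hidden=256, n_classes=2):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, hidden), nn.ReLU(), nn.Linear(hidden, n_classes))

    def forward(self, x):
        return self.net(x)

# X: (n_examples, 1024) embeddings; y: ~10-30 labels per class
# (e.g. failed vs. completed lane change) -- both assumed to exist already.
X = torch.tensor(embeddings, dtype=torch.float32)
y = torch.tensor(labels)

head = EmbeddingHead(dim=X.shape[1])
optimizer = torch.optim.Adam(head.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()
for _ in range(200):  # converges in seconds on CPU with such a tiny set
    optimizer.zero_grad()
    loss = loss_fn(head(X), y)
    loss.backward()
    optimizer.step()
```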

This approach also fits perfectly with everything we've built so far. We've already embedded the corpus. We've already used clustering or auto-labeling to explore and identify edge cases, and now we can refine those results using lightweight training on top. Because we now have a small labeled ground truth set, we can immediately run k-fold validation to get metrics like accuracy, precision, and recall, without needing to scale up annotation efforts. This allows us to evaluate the classifier quality with confidence before scaling further. Instead of retraining a massive model every time a new requirement appears, we can just add another small head and fine-tune to our needs. This is where Perception 2.0 becomes truly modular. You don't start from scratch, you build on top of a shared embedded understanding of the world, whether it's classification, filtering, or evaluation. Everything becomes faster and more composable.
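For example, a quick k-fold check on the same small labeled set could look like this, using a simple scikit-learn head over the embeddings (X and y as numpy arrays, binary labels assumed).

```python
# Sketch of k-fold validation on the small labeled set of embeddings.
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate

scores = cross_validate(LogisticRegression(max_iter=1000), X, y, cv=5,
                        scoring=["accuracy", "precision", "recall"])
print({name: vals.mean() for name, vals in scores.items() if name.startswith("test_")})
```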

It's not just classification that this approach is good for. Some perception tasks, for example, involve continuous values like distances, angles, speeds. This is where zero-to-few-shot regression comes in. Just like with classification, we can use the same embeddings generated by the foundation models, but instead of predicting a class, we can train regression heads to output a continuous value. Suppose we want to estimate the following distance between two vehicles, a metric we use frequently for safety evaluation and behavioral analysis.

Traditionally, this would require 3D cuboid annotations, but now with this approach, we can embed the frames using a foundation model and annotate a small set of examples with known following distances. Like we did before, we train lightweight regression heads on top of these embeddings. Because this embedding space already encodes information about objects and scene context, we can reach useful accuracy levels with very little data. Unfortunately, you might still need some manually labeled data to get this ground truth if you don't have access to LiDAR or radar to provide it for you. Now we're talking on the scale of a hundred samples rather than the hundreds of thousands we needed to train a cuboid model from scratch.
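A minimal sketch of that regression head, using ridge regression over an assumed small labeled set of embeddings and measured distances:

```python
# Sketch of a few-shot regression head: following distance (metres) from frame embeddings.
# X_small: (~100, 1024) embeddings; y_metres: measured following distances -- both assumed.
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

regressor = Ridge(alpha=1.0)
mae_per_fold = -cross_val_score(regressor, X_small, y_metres, cv=5,
                                scoring="neg_mean_absolute_error")
print("MAE per fold (m):", mae_per_fold)

regressor.fit(X_small, y_metres)
predicted_distance = regressor.predict(new_frame_embedding.reshape(1, -1))
```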

The same approach can be applied to many tasks, for example, distance to lane boundaries, gap estimation during merges, or time-to-collision approximations. The beauty of this method is that you don't need to run an expensive 3D perception pipeline to get useful metrics. You can extract meaningful, fine-grained insights directly from the embeddings, and do so quickly, flexibly, and at scale. Just like with classification, few-shot regression gives us a fast, low-friction way to extract continuous measurements from video data without custom models, heavy annotation, or complex geometry. That's what makes it a powerful tool in the Perception 2.0 toolkit, especially for large-scale evaluation pipelines, scene mining, and behavior tracking.

Now that brings us full circle. We've moved from traditional rigid pipelines to modular embedding-based workflows that scale with our data and adapt to new tasks quickly and stay responsive to change. Let's take a step back to recap the key ideas. Traditional computer vision as we know it is dead. We're no longer training bespoke models for every task. We're embedding once and building everything else on top. Once these embeddings are computed, they enable a wide range of downstream tasks such as search, clustering, classification, regression, with minimal additional effort. We're no longer bound to rigid taxonomies. With natural language queries and open vocabulary models, we can label and retrieve data without defining every class in advance. Auto-labeling changes the scale, but it still requires care. With techniques like consensus labeling and prompt engineering, we can improve label quality without having to fall back on manual annotation.

Lastly, instead of training end-to-end models for every task, we can build small heads on top of shared embeddings. These lightweight classifiers and regressors can match traditional performance with a fraction of the time, data, and cost.

I started this talk by showing you two unusual scenes, the table sitting in the middle of the motorway and a woman falling off her bike in front of our vehicle. I posed the question of how you would go about finding similar scenes in a large dataset. Everything we've walked through, embeddings, search, clustering, auto-labeling, and zero-shot adapters, was designed to solve exactly that kind of problem. I wanted to end by showing you two more clips, scenarios that we found using the exact methods that I've shared in this talk. We have a bin in the road. Lastly, this one's quite dark. Yes, you can see we managed to find another cyclist falling off their bike in front of us in the road.

Participant 1: Have I understood correctly that the new perception method collects the same representation data as the previous one, or you change the things that you collect? Because this data, you need to make a control after that, not only perception. What are the changes this method gives you in the whole pipeline of self-driving?

Kyra Mozley: Does the Perception 2.0 approach change the data we're collecting for our autonomous driving pipeline? This will depend on the AV company you ask. As I mentioned at the start, Wayve has always taken an end-to-end driving approach. We only take camera frame inputs in. We've never driven with LiDAR data or anything else. In terms of the way we approach driving, it hasn't changed it, because all the foundation models that we use work on video understanding. Video in is still very much important, both for retrieving the scenes and for the task of driving for us. I think it might make other companies that use very rigid AV 1.0 approaches reconsider perhaps the whole stack they collect.

Participant 2: After your reclassification of all the data you had, what were the most surprising or unexpected scenes which the computer showed you?

Kyra Mozley: After classifying all of our data, what were the most unexpected scenes that we retrieved? Some of them were visual camera errors that went unnoticed for a lot longer than they should have. We found that, yes, in some runs one camera was just purple for a while, which was quite bad. Now we have ways to detect camera failures automatically, which is good. I think, yes, just these interesting edge cases, like the cat causing the guy to fall off the bike, were interesting. There was another one where a dog was pooping in the road and we had to stop for it. That was pretty funny. A mix of serious and funny.

Luu: Real life is messy.

Participant 3: To me it seems that you can't quite deploy this to the vehicle yet because the foundational models are a little bit too slow. Would that be correct?

Kyra Mozley: Can we deploy these to the car at the moment? The answer is no, they're way too big currently. Was there going to be a follow-up of what we are going to do? We've definitely seen a shift now to more embedded systems. We're doing more edge filtering on the car; rather than sending back all the data that we get, maybe we only need to send back a sample. We can start looking at edge filtering techniques. Maybe we need to distill the foundation models so that they're smaller and can run on the car. Or, yes, a lot of quantization. I think distillation is the biggest future area.

Luu: Performance and tuning these models.

Participant 4: You talked about clustering and removing bad data from faulty sensors, for example. I'm curious, would that bad data ever be useful in a real-world example where a car might have a faulty sensor at some point? Would it be useful to explore that any further?

Kyra Mozley: I spoke about removing the bad data cluster earlier, but is there a world where we actually need that bad data? I think it depends on why it was bad. For example, when your lenses are covered; as I mentioned, we only drive on camera inputs, so if there is no visual input, is that really useful for the model? We'd rather have more fail-safe measures in place for the car to stop driving rather than trying to carry on driving if all of its cameras were covered, for example. If it's bad data in, say, rainy conditions, that could definitely be useful to put into the training data. It's more about discovering why the data is bad and then allowing the people training the driving models to make those calls on how they want to balance those different types of bad data, would be my answer.

 
