Clemens Viernickel: Helping software engineers to focus on algorithms rather than data handling

SiaSearch, a Berlin-based start-up, specialises in handling the high data volumes generated by ADAS sensors such as cameras, LiDARs and radar. This data is considered the new oil that will drive the vehicles of tomorrow and enable them to operate autonomously. SiaSearch has identified a need to structure such data and enable engineers to interact with it effectively in training and development scenarios. To expedite the development of automated driving and the commercial deployment of autonomous vehicles (AVs) at scale, the data-driven development process needs to be streamlined. We spoke to Clemens Viernickel, Co-Founder & CEO of SiaSearch, about the story behind SiaSearch and how the company is changing the way automakers work with raw sensor data.

Can you tell us a little about your company and its background?

We are a software start-up based in Berlin. SiaSearch started within Merantix, which is a leading company builder in the space of artificial intelligence. What this means is that Merantix hires founders who want to start a company in the space of AI. These founders then develop an idea and hire a core team, which is later spun out into its own venture. The whole operation is financed through a dedicated fund, which also invests in the new spin-out companies. We went through this process to start SiaSearch.

When it comes to the idea, this was very much inspired by the research of my co-founder Mark Pfeiffer. He completed his PhD at ETH Zurich in data-driven development for robotics and was heavily involved with autonomous driving research. In his work, he often got very frustrated by the difficulty of working with the data assets. There was simply no software that helped engineers like him deal with the mass of sensor data that are required to train and test self-driving robots. Mark and I met at Merantix, where I was already working with some German automotive OEMs on self-driving projects. At the time, I was surprised to see how slow and painful the development was at these huge companies. We then joined forces to change this with SiaSearch.

Can you explain the nature of the SiaSearch concept and how it stands out from other search tools?

SiaSearch means "Filter-Search". Sia is Icelandic for filtering and this is what we try to do.

Building robust, industrial AI, like automated driving, is extremely difficult. Everyone talks about algorithms, but up to 75% of engineering time is spent on low-skilled labour tasks, namely manually reviewing and selecting - hence filtering - masses of data instead of actually building the algorithms. This is because selecting the right data is the single most important thing to make AI work in a robust way. For frontier technologies like self-driving, most of this data is raw recordings from sensors. This data is inherently unstructured, which makes it impossible to access through any kind of metadata or based on its content. There are petabytes of it already and it is growing quickly. And yet at the same time, there are no tools to work efficiently with this data.

That is why we've developed SiaSearch, a data management platform that automatically extracts frame-level, contextual metadata and utilizes it for fast data exploration, selection, and evaluation. Automating these tasks with metadata can more than double engineering productivity and remove the bottleneck to building industrial AI. When we talk about metadata, it is important to note that we mean data that fully and comprehensively describes the content of the underlying recordings.

The metadata needs to be so rich that any given situation can be readily described using it, so that if the metadata is available, any kind of data can be easily found and accessed. In practice, an engineer could be looking for situations in, say, Germany, with rainy weather, that feature two pedestrians and two cars at an intersection, where the car is turning right and the pedestrian is crossing the street. SiaSearch enables engineers to search for and find this kind of specific data sequence in an effortless and instantaneous way.
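To make the example query concrete, here is a minimal Python sketch of what such a search could look like once rich metadata exists. The record fields (`country`, `weather`, `ego_maneuver` and so on), the sample records and the helper `find_scenes` are hypothetical and purely illustrative; they are not SiaSearch's actual schema or API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SceneMetadata:
    """Hypothetical metadata record for one recorded sequence."""
    sequence_id: str
    country: str
    weather: str
    num_pedestrians: int
    num_cars: int
    at_intersection: bool
    ego_maneuver: str
    pedestrian_crossing: bool


def find_scenes(records, **criteria):
    """Return the records whose attributes equal every given criterion."""
    return [
        record for record in records
        if all(getattr(record, key) == value for key, value in criteria.items())
    ]


# Made-up sample records, only to exercise the query.
records = [
    SceneMetadata("seq_001", "Germany", "rain", 2, 2, True, "turn_right", True),
    SceneMetadata("seq_002", "Germany", "clear", 2, 2, True, "turn_right", True),
    SceneMetadata("seq_003", "France", "rain", 1, 3, False, "straight", False),
]

# The situation described in the interview, expressed as metadata criteria.
matches = find_scenes(
    records,
    country="Germany",
    weather="rain",
    num_pedestrians=2,
    num_cars=2,
    at_intersection=True,
    ego_maneuver="turn_right",
    pedestrian_crossing=True,
)
print([match.sequence_id for match in matches])
```

The point of the sketch is only that, with metadata in place, a scenario description becomes a set of filter conditions rather than hours of manual review.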

So there's a huge time saving?

Yes, today the type of situation just described has to be identified manually, by sifting through thousands of hours of data by hand. SiaSearch replaces any type of manual data search, aggregation, selection or evaluation that is performed by thousands of automotive engineers today. Instead of physically looking through recordings, 100% of the stored data can be searched and queried with rich metadata just as easily as any other type of database.

This is huge, because of the massive amount of time engineers spend on these manual data handling tasks today, either themselves or through overwhelmed data engineering teams. By automating away the data engineering, we truly accelerate development and make engineers more productive, which can hopefully accelerate the path to fully self-driving vehicles.

How are the tags created that enable the data to be sorted according to the needs of the user and is there a limit to the number of tags?

The tags, or metadata, that SiaSearch extracts are produced by a collection of algorithms, which we run with a modern distributed processing pipeline on the customers' data storage. They can range from simple heuristics such as spikes in vehicle dynamics, geo-location or weather, to highly complex models that might detect complex driving manoeuvres performed by the ego vehicle or by other traffic agents.

In general, we have a good balance of pre-trained models and simpler heuristics to create the tags. What they all have in common is that the algorithms output only the start and end timestamps of the given event. In our metadata storage, we keep just the timestamps and a reference to where an event starts or stops. This enables us to store much less metadata because there are no empty frames. At the same time, it enables SiaSearch to query the metadata much faster than conventional databases, outperforming the query speed of common SQL and NoSQL solutions by three orders of magnitude.
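The idea of storing only start and end timestamps can be sketched in a few lines of Python. This is an illustrative assumption about how such interval-based metadata might work, not SiaSearch's implementation; the function names `frames_to_intervals` and `intersect` and the sample values are hypothetical.

```python
from typing import List, Sequence, Tuple

# (start_timestamp, end_timestamp) of one event, in seconds.
Interval = Tuple[float, float]


def frames_to_intervals(timestamps: Sequence[float],
                        flags: Sequence[bool]) -> List[Interval]:
    """Compress per-frame booleans into intervals where the flag is True."""
    intervals: List[Interval] = []
    start = None
    last = None
    for timestamp, flag in zip(timestamps, flags):
        if flag and start is None:
            start = timestamp                  # an event begins
        elif not flag and start is not None:
            intervals.append((start, last))    # event ended on the previous frame
            start = None
        last = timestamp
    if start is not None:                      # event still running at the end
        intervals.append((start, last))
    return intervals


def intersect(a: List[Interval], b: List[Interval]) -> List[Interval]:
    """Time spans covered by both sorted, non-overlapping interval lists."""
    result: List[Interval] = []
    i = j = 0
    while i < len(a) and j < len(b):
        start = max(a[i][0], b[j][0])
        end = min(a[i][1], b[j][1])
        if start <= end:
            result.append((start, end))
        # Advance whichever interval ends first.
        if a[i][1] < b[j][1]:
            i += 1
        else:
            j += 1
    return result


# Six frames of a hypothetical per-frame "rain" detector output.
timestamps = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
rain_flags = [False, True, True, False, True, True]
rain = frames_to_intervals(timestamps, rain_flags)

# A hypothetical manoeuvre detector that already outputs an interval.
turning_right = [(1.5, 4.5)]

# "Rain AND turning right" becomes an interval intersection.
print(intersect(rain, turning_right))
```

In this sketch, frames where nothing happens leave no trace in storage, and a combined query only has to walk two short sorted lists instead of scanning every frame.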

What is 'automatic metadata generation'?

Automatic here means that we create the metadata without any manual human labour involved. This is crucial, because manual labour will not scale to the growing masses of data. Generating metadata is a key element of the development process today as well, but there is not enough metadata and it covers far from all of the data, because it needs to be done by hand. We automate this and make it scale to really large datasets. Think of the internet, where search engines also work with metadata tags of individual websites. In the early days, creators of websites had to create all metadata themselves. As search engines like Google became smarter, they would index and generate some metadata based on the website content themselves, to make search more relevant. We do the same on sensor data: generate metadata automatically and make it super fast to search.

So, a lot obviously depends on the quality of the original datasets?

For the pre-trained algorithms in our portfolio, it is important that they are developed with high-quality data in the first place. We have some of our own data recorded in Berlin and were lucky to win a couple of super valuable and generous data partners early on, which helped us develop the first iteration of our models. However, our software also becomes smarter and better over time, as we see and work with more datasets. Many companies are okay with us using a subset of their data in order to make the metadata extraction more accurate and efficient, which is how our technology stack grows and becomes more defensible.

This database search capability will potentially speed up development times for autonomous vehicle technologies considerably. Can you give us an idea of how significant that could be?

We see a minimum efficiency increase of about 80-100% for key machine learning and validation engineers, even if SiaSearch is just used in its early version as it is today. The reason is that engineers spend so much time on data handling that even just automating parts of their search efforts can generate massive efficiencies. If we expand the time these engineers spend on actual software development from 30% to 60%, we are happy.

Our goal, however, is to make them spend as little as 5-10% of their time on data handling. To achieve this, we need to improve SiaSearch still further and add features that make the data search more intelligent.

Can you say a little bit about the tie-up with Motional? Who does what and what does SiaSearch bring to the party?

First of all, we are super happy that Motional agreed to publish their flagship public data set with SiaSearch.

With strong ties to academia, we strongly believe in the value that these datasets provide to progress research and development of highly automated vehicles, which is why we're excited that we have found a way to make this value more accessible to a larger number of engineers around the globe.

In our experience, and having spoken to many researchers, one of the key challenges with large driving datasets is that they are difficult to explore and access, especially if, like most, you're looking for highly specific situations to train or validate models. The datasets lack semantic searchability. We started SiaSearch to enable this searchability on very large sensor datasets through automatic scene and event detection.

First, we published the KITTI dataset on SiaSearch, which was already a huge step. Based on the great response, we thought, why not take more of these datasets and publish them on SiaSearch? We started to reach out to the authors of the datasets, and with Holger from nuScenes it immediately clicked. We started integrating nuScenes into SiaSearch and, after signing the official partnership agreement with Motional, were able to publish in October.

Motional provides the dataset and its user base, while we bring our indexing and search technology to the table, which enables nuScenes users to explore the data in a whole new way.

We hope that making nuScenes available in a fully searchable format will help researchers in the field, and also inspire many more engineers to work on automated driving.

What is the future for SiaSearch in terms of the applications of this technology?

We strongly believe that the problem we are solving is a growing pain point generally.

More engineers will be faced with the challenge that they need to work with raw sensor data to build products, but have no means to manage that data. This is a true market gap, which is hard to believe in a market flooded with tools, especially for AI and machine learning. However, there is virtually no data management technology for unstructured sensor data and we want SiaSearch to become this technology.

After vehicles, there are many exciting use cases that we are already discussing with some companies because the demand and interest are huge. Most prominent are robotics/computer vision applications with other moving objects like massive trucks, agriculture machines, drones, boats, or trains, but also the ones with stationary cameras like surveillance, retail, and sports.

In all of these fields, engineers are just starting to grasp the problem they are facing and we are confident that there will be many opportunities to structure this data with SiaSearch.

However, on the vehicle side, there are also some great opportunities ahead. Actually, our first customer in Germany was FSD, the company mandated by the government to create the certification standards for automotive. Regulators like FSD need to get a very detailed understanding of driving patterns and events in order to be able to effectively regulate automated driving and SiaSearch is proud to be an enabler of that. This shows that the utility of metadata extraction does not have to stop at developers.

We are on a mission to make many interesting datasets available through SiaSearch, and we already have some great partnerships in the making. We will be able to talk about them more in the near future. But there is a lot more to expect on this front, both for non-commercial research datasets and through a partnership we are building towards selling licenses to an ultra-diverse new automated driving dataset, which would be exclusively available on SiaSearch.