With regard to autonomous and connected driving (AVF), data must be handled with great care, as it is essential for training AI models. Typical problems arise when collecting real-world data: Do we have enough “critical” cases for training the models? Where and how are the collected data volumes stored and processed? How do we handle personal data? What are the terms of use for project-related data? Who is responsible for the data, and how should data transfer work?
A well-tested approach to circumventing these data protection issues is to work with simulation data. However, it is important to consider how effective working solely with simulation data is: real data add value through realism and variety and through authentic sensor characteristics such as sensor errors, noise, and environmental errors. Simulation data offer the advantage of controlled, repeatable environments and help to investigate specific scenarios systematically. Real training data, on the other hand, are essential to ensure that the developed systems function reliably and safely in the real, unpredictable, and complex world. It is therefore sensible to combine simulation data with the recorded real data.
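One simple way to picture such a combination is to keep every real sample and add simulated samples until they make up a chosen share of the training set. The sketch below is only an illustration under that assumption: the function name `mix_samples`, the parameter `sim_fraction`, and the idea of a fixed share are hypothetical and are not taken from the jbDATA project.

```python
import random


def mix_samples(real, simulated, sim_fraction, seed=0):
    """Keep all real samples and add simulated ones up to `sim_fraction` of the result.

    Hypothetical helper: the fixed-share strategy is an assumption for illustration.
    """
    if not 0.0 <= sim_fraction < 1.0:
        raise ValueError("sim_fraction must be in [0, 1)")
    # Solve n_sim / (n_real + n_sim) = sim_fraction for n_sim.
    n_sim = round(sim_fraction * len(real) / (1.0 - sim_fraction))
    # Never request more simulated samples than are available.
    n_sim = min(n_sim, len(simulated))
    rng = random.Random(seed)
    chosen = rng.sample(simulated, n_sim)
    # Tag each sample with its origin so it stays traceable after shuffling.
    mixed = [("real", s) for s in real] + [("sim", s) for s in chosen]
    rng.shuffle(mixed)
    return mixed
```

Tagging each sample with its origin keeps the two sources distinguishable later, for example when evaluating a model separately on real and simulated data.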
Some of these questions are addressed in jbDATA. The project works primarily with publicly available data. Before these data are used, it is important to clarify exactly which requirements must be fulfilled and which data are needed. For example, metadata, that is, additional information about the data, play a significant role, as does ensuring that the recorded scenarios match the cases to be tested. Once the requirements have been defined, a suitable dataset is sought that can be used for the scenarios. Public datasets can often be used in a research context, but it remains open whether this also applies to industrial research and whether the corresponding requirements are met. In jbDATA, an exchange on requirements between partner projects is planned, in which external and internal requirements are to be combined.
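The selection step described above, defining requirements first and then looking for a dataset that meets them, can be pictured as a check of each candidate's metadata against a requirements record. The following is a minimal sketch under stated assumptions: the classes `DatasetInfo` and `Requirements`, their fields, and the function `unmet_requirements` are hypothetical and do not describe an actual jbDATA tool or schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DatasetInfo:
    """Hypothetical description of a candidate public dataset."""
    name: str
    license_allows_industrial_use: bool
    scenarios: frozenset
    metadata_fields: frozenset


@dataclass(frozen=True)
class Requirements:
    """Hypothetical requirements, defined before any data is used."""
    needed_scenarios: frozenset
    needed_metadata: frozenset
    industrial_use: bool = True


def unmet_requirements(ds, req):
    """Return the reasons why `ds` fails `req`; an empty list means it fits."""
    problems = []
    missing_scenarios = req.needed_scenarios - ds.scenarios
    if missing_scenarios:
        problems.append("missing scenarios: " + ", ".join(sorted(missing_scenarios)))
    missing_metadata = req.needed_metadata - ds.metadata_fields
    if missing_metadata:
        problems.append("missing metadata: " + ", ".join(sorted(missing_metadata)))
    if req.industrial_use and not ds.license_allows_industrial_use:
        problems.append("license does not cover industrial use")
    return problems
```

Returning the list of unmet points, rather than a plain yes or no, makes it visible why a research-only dataset falls short for industrial research.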
Furthermore, in the jbDATA project, an exemplary dataset will be collected, which is intended to serve as the basis for an industrially usable dataset. Solutions to the common problems mentioned could involve ensuring data protection through anonymization procedures, removing or obscuring persons, or re-creating the images or scenes with actors who assign their data protection rights as part of a contract.
The project partners have set themselves the goal of making this dataset accessible for research outside the consortium. It is particularly important that the dataset, which will consist primarily of relevant corner cases, datasets, and scenarios, can be processed by any user for their specific purposes.
In jbDATA, the first step is to define scenarios and corner cases for the dataset. These are used to describe specific, challenging situations. The requirements for the corner cases were developed and defined in a workshop. Milestone report 1 (02/24) identifies the two specific use cases for jbDATA and defines them within the context of the project:
- VRU (Vulnerable Road Users): Wide range of applications for VRUs, including object detection, instance segmentation, motion prediction, and pose estimation.
- Corner case detection: Focus on unknown objects and behavior patterns.
The goal of just better DATA is to generate a distinctive, varied dataset that remains usable in the long term. Where technically feasible, requirements from other projects will also be integrated.
In the next step, at the project meeting in June 2024, the requirements for an industrially relevant dataset will be discussed and explored. The path to relevant data and datasets for AI training will be continued within the just better DATA project.