One of the most limiting factors of ML and AI for EO applications is the scarcity of suitable and accessible training datasets. Currently, the main barrier is that generating such datasets is a time-consuming and expensive process. Access to high-quality training datasets is typically very restricted; in some cases, domain experts or in-situ data annotation campaigns are necessary to generate the ground truth for remote sensing applications. Consequently, the field of AI/ML for EO lags behind other sectors, which hinders the development of new applications that can fully exploit AI capabilities.
The ESA Earth Observation Training Data Lab (EO-TDL) will address these key limitations by providing a cloud repository to create, share, and improve training datasets as well as ML/DL algorithms. The goals of EO-TDL are to:
- host, import and maintain a wide range of dataset types: training, validation, test, benchmark and reference datasets (in-situ data, product validation datasets)
- offer a set of integrated open-source tools compatible with the major ML/DL frameworks to develop and export processing pipelines for Extract, Transform, Load (ETL) operations, data ingestion, model training and inference
- enable the description, versioning and tracking of data using the SpatioTemporal Asset Catalog (STAC) specification to guarantee data discoverability and accountability
- allow data exploration to uncover biases, detect anomalies and verify assumptions, maximizing the understanding of the data (Exploratory Data Analysis, EDA)
- build a centralised Feature Store to access, search and create features derived from EO data and to serve them at training and inference time, thus increasing model efficiency
- enable automated data quality mechanisms through deterministic and non-deterministic testing
- deploy a containerized multi-GPU environment for distributed training processing
- provide interoperability with third-party platforms, such as Radiant Earth MLHub
- implement accessibility at multiple levels by means of user interfaces, web APIs, CLIs and Python libraries
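
To make the STAC and data-quality goals above more concrete, the following is a minimal sketch of how one training sample might be described as a STAC Item and then passed through a few deterministic checks. It is an illustration only and does not represent the EO-TDL API: the function names `make_item` and `check_item`, the sample identifiers and the `example.com` URLs are all hypothetical, and the checks are deliberately simplified (for instance, they ignore bounding boxes that cross the antimeridian and Items that use a start/end date range instead of a single datetime).

```python
from datetime import datetime

# Top-level fields expected on a STAC 1.0.0 Item with a non-null geometry.
REQUIRED_FIELDS = (
    "type", "stac_version", "id", "geometry", "bbox",
    "properties", "links", "assets",
)


def make_item(item_id, bbox, timestamp, image_href):
    """Build a minimal STAC Item (as a plain dict) for one training sample.

    Hypothetical helper for illustration; not part of any EO-TDL library.
    """
    west, south, east, north = bbox
    return {
        "type": "Feature",
        "stac_version": "1.0.0",
        "id": item_id,
        "bbox": [west, south, east, north],
        "geometry": {
            "type": "Polygon",
            "coordinates": [[
                [west, south], [east, south], [east, north],
                [west, north], [west, south],
            ]],
        },
        "properties": {"datetime": timestamp},
        "links": [],
        "assets": {
            "image": {
                "href": image_href,
                "type": "image/tiff; application=geotiff",
            }
        },
    }


def check_item(item):
    """Return a list of problems found by simple deterministic checks."""
    problems = [
        f"missing field: {name}" for name in REQUIRED_FIELDS if name not in item
    ]
    if problems:
        return problems

    # Simplification: boxes crossing the antimeridian are reported as inverted.
    west, south, east, north = item["bbox"]
    if not (-180 <= west <= east <= 180 and -90 <= south <= north <= 90):
        problems.append("bbox out of range or inverted")

    # Simplification: a single ISO 8601 datetime is required here.
    try:
        raw = item["properties"]["datetime"]
        datetime.fromisoformat(raw.replace("Z", "+00:00"))
    except (KeyError, AttributeError, ValueError):
        problems.append("datetime missing or not ISO 8601")

    if not item["assets"]:
        problems.append("item has no assets")
    return problems


good = make_item(
    "sample-0001", (2.0, 41.0, 2.5, 41.5),
    "2022-06-01T10:30:00Z", "https://example.com/sample-0001.tif",
)
bad = make_item(
    "sample-0002", (2.5, 41.0, 2.0, 41.5),
    "not-a-date", "https://example.com/sample-0002.tif",
)

print(check_item(good))
print(check_item(bad))
```

Checks of this kind are deterministic in the sense that the same Item always yields the same result; non-deterministic testing, such as statistical or model-based checks on the imagery itself, would sit on top of them.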
Moreover, community engagement will be incentivised through a reward mechanism to stimulate collaboration in dataset creation, enhancement and quality assurance. All the code will be hosted on GitHub, and a public Discord server will enable further discussion between members.
Within the first year of activity, the data population will comprise over 100 selected datasets covering a wide range of applications, from computer vision tasks (such as object detection) and super-resolution to bio/geophysical parameter estimation and 3D applications, drawing on different data sources (such as Sentinel-1 and Sentinel-2, Airbus SPOT and PLEIADES, UAV imagery or vector data).
Many users will benefit from this training data laboratory: the availability of quality training data will strengthen the capabilities of science and industry to exploit EO data as a whole, helping to accelerate EO market penetration. Researchers and engineers can use EO-TDL to build highly accurate models of the Earth system, such as Digital Twin Earth simulations.