Interview

“The bottleneck that all teams have to navigate”

Dagmar Kainmüller is the head of the “Integrative Imaging Data Sciences” research group at the Max Delbrück Center and one of the initiators of the Helmholtz Foundation Model Initiative. Photo: Pablo Castagnola/MDC

The Helmholtz Foundation Model Initiative (HFMI), launched two years ago, supports AI projects focused on processing enormous amounts of data. In this interview, AI expert Dagmar Kainmüller offers an interim assessment, including of her own HFMI project, AqQua.

In my view, the HFMI is a huge success. In a relatively short time, we have built a highly motivated interdisciplinary community with an extremely steep learning curve. The expertise gained in this way is incredibly valuable and will remain so even after the project period ends. The initiative has also gained significant international visibility.

One example is the Helmholtz-ELLIS Workshop, which we organized in Berlin in the spring of 2025. ELLIS stands for “European Laboratory for Learning and Intelligent Systems” and is, so to speak, the flagship among European AI research networks. The event was highly productive and brought together representatives from a wide range of scientific disciplines, leading AI researchers, and high-profile speakers from global players such as Meta and Microsoft Research. In addition, the European Commission has cited the HFMI as a case study and now uses it as a reference for its own activities in the field of “Artificial Intelligence (AI) in Science.” In June, we will jointly hold a workshop in Brussels with newly funded EU pilot projects. There is great interest in our experiences and insights.

The four projects funded from the outset have largely compiled their datasets and trained initial models on them, and they are currently testing and refining those models. A particularly advanced example is the “Human Radiome Project” (THRP), which is developing a foundation model for analyzing medical image data from MRI and CT scans. The team has now collected, curated, and harmonized approximately 3.7 million radiological 3D images and trained models on them. Compiling and processing the image data alone is a monumental achievement: the team has created the largest dataset of medical 3D image data ever used to train a foundation model. The model trained on this data already achieves highly competitive accuracy in image recognition. The team will soon publish its “flagship paper,” a compilation of its results, and then make the model available to the user community, which is already waiting for it.

The major challenge is very similar across most projects: collecting and processing vast amounts of data. Every foundation model first needs a very large, harmonized, AI-suitable dataset. Whether it’s medical image data as in THRP, weather data as in the HClimRep project, or plankton image data as in our AqQua project, creating AI-suitable datasets is the bottleneck that all teams have to navigate. That is another reason why the mutual exchange within our Synergy Unit is so crucial.

Simply put, the Synergy Unit makes the initiative as a whole greater and more effective than the individual projects on their own. Here, we have built a community of members from all projects who support one another because we all face similar challenges. Together, we develop new ideas that subsequently inform the individual projects. The atmosphere within the HFMI community is very positive. We work effectively, in a goal-oriented manner, and in a very down-to-earth way. In addition, new outreach activities are emerging from the Synergy Unit that showcase our expertise internationally, such as the Helmholtz-ELLIS Workshop. We’re currently planning a workshop for early 2027 in collaboration with CIFAR, the Canadian Institute for Advanced Research, on Self-Improving Discovery Systems in AI for Science. It’s going to be an extremely exciting event. And last but not least, three of the HFMI projects, together with the Synergy Unit, have secured a major Compute Grant. It provides them with computing time on Europe’s fastest supercomputer, Jupiter, at the Jülich Research Center, so that they can further develop AI methodologies based on a wide range of use cases from various scientific disciplines.

With AqQua, we aim to make billions of plankton images from a wide variety of sources accessible and analyze them using a foundation model to draw global conclusions about carbon transport, species composition, and the state of the plankton. First, we wrote to about 1,000 laboratories and institutes worldwide and asked them to collaborate. We had expected 3 billion images; we’re now already at 5 billion. These massive amounts of data first need to be transferred to us without the source infrastructure collapsing, which required several custom solutions. Only then does the real work begin. The images come from a wide variety of sources and follow different standards and metadata structures, resulting in a broad range of formats. We first had to define a uniform format, including a metadata standard, and then curate and harmonize the images. Numerous additional technical processing steps are then required before the data can genuinely be considered AI-ready and used effectively by the foundation model.

As of now, we have 3.8 billion AI-ready images. The remaining images are currently being processed and will soon complete the first full version of the final dataset. Training of the foundation model will then commence, while the corresponding training framework is already under development in parallel. At the same time, we are already building the infrastructure for our final product: a tool that continuously and globally collects plankton image data, generates automated profiles on biodiversity, condition, and carbon fluxes from it, and then extrapolates these onto global maps. Our goal is to work with the global plankton community and manufacturers of imaging equipment to create an international standard that enables much more accurate and detailed mapping of plankton than has been possible to date—freely accessible not only to the marine and climate research community, but also to policymakers in light of climate change.

Absolutely. Above all, of course, on a scientific level, because the HFMI’s mission is not only to build new AI systems but also to make them sustainably available to all researchers. But there is also significant added value for networking and collaboration within the Helmholtz Association, including on the organizational side. Let me give just two examples from this perhaps less obvious category. Because we have to use massive amounts of data from a wide variety of sources, legal questions also arise.

If the data is not publicly accessible, the parties must conclude a legally binding data-sharing agreement that requires approval from the respective legal departments of the participating institutions. We have succeeded in establishing a standardized template agreed upon by four Helmholtz Centers. Such a framework had not existed previously: because datasets were relatively small, projects typically relied on individually tailored contracts. That is hardly feasible on the scale of HFMI. Now there is a template and a process that can serve as a model for similar cases.

In these projects, several Helmholtz Centers collaborate on a major topic. For example, in our AqQua project, a core team of about 10 people develops cross-center solutions. The overall setup is highly efficient, and the group collaborates in line with industry standards. This collaborative approach differs considerably from the typical way PhD students or postdocs work in academia, where working relatively independently on one’s own project is still the norm. And in the end, there’s usually a publication where one reaps the rewards as the “sole first author.” For large-scale projects spanning multiple centers, this way of working is counterproductive. Here, we need new standards in science. I believe our core teams can serve as a good model for how Helmholtz can even better leverage its potential to create added value across the 18 individual centers. 
 
