
Metzler meets Fraunhofer – Interview - 25.9.2025

Is everything fake? Verifying data and information in the AI era

Jens Strüker, Fraunhofer-Institut für Angewandte Informationstechnik (FIT), and Pascal Spano, Metzler Capital Markets

Until now, data has mostly been shared internally or bilaterally between companies. In the current AI era, companies face the question: Are data verification and data sovereignty a business necessity or merely a moral imperative? Pascal Spano, Head of Research at Metzler Capital Markets, spoke with Professor Jens Strüker from the Fraunhofer Institute for Applied Information Technology (FIT) about how data can be verified in order to build confidence in the digital environment – confidence that is so vital to the economy and its business models.

 

Pascal Spano: Due to digitization and AI, the volume of data required and collected is growing almost exponentially. How can transparency regarding data origin and use be ensured in order to build confidence in the digital environment?

Jens Strüker: Data verification is one way to close the confidence gap in the input data for AI models and to reduce the risks associated with AI-generated results. Ultimately, information systems are always about protecting the goals of information security: confidentiality, integrity and availability. It’s no different when using large language models, or LLMs for short.

Pascal Spano: What does this mean specifically?

Jens Strüker: Looking at LLMs, I see a confidence gap in master and transaction data with regard to their origin and quality. The models are equally non-transparent when it comes to their training and test data and possible biases. This means the risk classification of their output remains rather nebulous, making it virtually impossible for companies to manage these risks in a commercially sensible way. Digital verification of data promises to remedy this.

Pascal Spano: In your opinion, how important is data verification for AI systems, data protection and information security?

Jens Strüker: In the age of LLMs, agentic AI, and soon perhaps general artificial intelligence (i.e. AI that learns and thinks like a human), digital verification of data and information is becoming the focus of all internet-based economic activity and commerce. Currently, due to market-dominant internet platforms like Microsoft/OpenAI, Meta and Google, data is more valuable than the AI models themselves. But business models on the internet are coming under pressure, as seen in the dramatic loss of income suffered by internet content providers due to changes in user search behavior. For example, people are increasingly searching with ChatGPT instead of Google Search. This increases the incentive to maintain control over data or to release it in a targeted way in exchange for payment. Blocking web crawlers from accessing websites is one example of this. In this way, data sovereignty can lead to the controlled provision of huge amounts of new data – and thus, in relative terms, make models more valuable than the data itself. This is certainly not just a European pipe dream; it is also being pursued by many start-ups in the USA. Economically, this would certainly be desirable.

Pascal Spano: Which technologies can be used to implement regulatory requirements for data security and validation?

Jens Strüker: Technological solutions like self-sovereign identities (SSIs), blockchains, zero-knowledge proofs (ZKPs) and data spaces promise to implement regulatory requirements securely and economically. These and other decentralized technologies are often grouped under the term Web3 technologies.

Pascal Spano: How can legislators help establish standards for data validation in the digital age?

Jens Strüker: Existing regulatory frameworks already provide guidelines for using sensitive data. For example, eIDAS 2.0 is a unique, Europe-wide digital trust framework for individuals, machines and organizations. Specifically, it can be used to create unique, cryptographically secured identities that are valid throughout Europe.

Pascal Spano: Can you give an example of how such a digital ID could be used?

Jens Strüker: Identity verification for data linking can be carried out digitally from end to end. This means it would be possible for a company based in Denmark to do business in Austria without having to set up a subsidiary there, which is currently the norm. Accordingly, a managing director would also be able to acquire real estate on behalf of the company in Austria. If the EU member states implement eIDAS 2.0 quickly enough, this could lead, in the medium term, to the creation of a dynamic identity ecosystem for verifiable master data from machines, companies and people – thus simplifying, among other things, the data-sovereign provision of company data for training AI models.
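To make this concrete, here is a minimal, purely illustrative Python sketch of such an end-to-end verifiable attestation (the company name, director and key handling are invented for the example; this is not the actual eIDAS 2.0 wallet API): a trusted issuer, such as a national business register, signs claims about a company, and a relying party in another member state verifies them against the issuer’s published public key.

```python
# Toy verifiable-attestation flow (hypothetical, not the eIDAS 2.0 wallet API):
# a trusted issuer signs claims about a company; any relying party can verify
# the attestation against the issuer's published public key.
import json
from cryptography.hazmat.primitives.asymmetric import ed25519
from cryptography.exceptions import InvalidSignature

issuer_key = ed25519.Ed25519PrivateKey.generate()   # e.g. a business register
issuer_public_key = issuer_key.public_key()         # published in a trust list

claims = {"company": "Example ApS", "country": "DK", "authorized_director": "J. Jensen"}
payload = json.dumps(claims, sort_keys=True).encode()
credential = {"claims": claims, "signature": issuer_key.sign(payload)}

def verify(credential, issuer_public_key) -> bool:
    """Relying party in another member state checks the attestation offline."""
    payload = json.dumps(credential["claims"], sort_keys=True).encode()
    try:
        issuer_public_key.verify(credential["signature"], payload)
        return True
    except InvalidSignature:
        return False

print(verify(credential, issuer_public_key))   # True
```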

Pascal Spano: As far as European regulation and legislation on data and AI is concerned, we must ask ourselves whether Europe is – once again – standing in the way of technological progress.

Jens Strüker: The European Union has the world's first comprehensive and, to date, best AI regulation. As with data protection, there is much to suggest that this regulatory artifact might become a model for the whole world. And a well-defined regulatory framework undoubtedly makes a major contribution to economic development and prosperity. That said, however, I believe the order is mixed up, as Europe lags far behind the USA and China in offering proprietary and open AI models. In this early phase, I believe it would have been more important to improve the environment for European AI providers who fear that comprehensive AI regulation may not help them, but rather burden them with requirements. Regulation could also have been developed over time as we learn more.

Pascal Spano: As “intelligence” increases, it should be easier for AI to recognize and detect errors and deception. Is the data verification problem perhaps only temporary? Or is there more reason to fear that errors will continue to spread due to AI trained with false data?

Jens Strüker: No, I think digital verification will become even more important over time. Images and voices can already be imitated so convincingly that people can’t tell the difference. Among other things, this will have a significant impact on identification procedures on the internet. Here, the digital signing of content with verifiable digital identities can help. Furthermore, in complex models like deep learning, the decision-making logic is fundamentally difficult to understand; this is known as the black box problem. Here, it will become more important to know where the training data came from, whether it has been modified, and whether one has interacted with an AI at all.
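As an illustration of what signing content with a digital identity could look like, here is a minimal sketch using the Python cryptography library (the key handling and the binding of the key to an identity are assumptions for the example, not a specific standard): the publisher signs a hash of the content, and anyone holding the publisher’s public identity key can check that the content is unmodified and really originates from that identity.

```python
# Minimal sketch: signing content with a key tied to a digital identity so a
# recipient can verify origin and integrity. How the public key is bound to a
# verifiable identity (e.g. via a credential or trust list) is assumed, not shown.
import hashlib
from cryptography.hazmat.primitives.asymmetric import ed25519
from cryptography.exceptions import InvalidSignature

publisher_key = ed25519.Ed25519PrivateKey.generate()   # held by the content creator
public_key = publisher_key.public_key()                # published with their identity

content = b"Original article text or raw image bytes"
digest = hashlib.sha256(content).digest()              # sign a hash of the content
signature = publisher_key.sign(digest)

# A consumer recomputes the hash and checks it against the publisher's identity key.
try:
    public_key.verify(signature, hashlib.sha256(content).digest())
    print("Content is authentic and unmodified.")
except InvalidSignature:
    print("Content was altered or does not originate from this identity.")
```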

Pascal Spano: How can companies avoid losing data sovereignty when large volumes of data are used for AI training purposes?

Jens Strüker: One promising way is to share data via so-called data spaces, i.e. federated clouds. Data spaces are digital infrastructures that enable the secure, decentralized and sovereign exchange of data between organizations. They are based on common standards, rules and technologies, and data remains at its source rather than being stored centrally. The aim is to preserve data sovereignty, build trust and promote cross-sector innovation. Data spaces are often domain-specific but can also federate to form larger data ecosystems. The EU Data Act has created a legal and technical framework for such systems, and there are now numerous implementations, for example the Catena-X data ecosystem for the automotive industry.
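A purely conceptual sketch of this principle (not a real Catena-X or IDS connector; the class and policy names are invented for illustration): the raw data stays with the provider, and a consumer only receives results that a negotiated usage policy permits.

```python
# Conceptual data-space sketch: data remains at its source; only results
# permitted by a negotiated usage policy cross the organizational boundary.
# Class and policy names are hypothetical, not a real connector API.
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class UsagePolicy:
    allowed_purpose: str          # e.g. "ai_training", agreed in a data contract

class DataProvider:
    def __init__(self, records: List[float], policy: UsagePolicy):
        self._records = records   # raw records never leave this provider
        self._policy = policy

    def query(self, purpose: str, aggregate: Callable[[List[float]], float]) -> float:
        if purpose != self._policy.allowed_purpose:
            raise PermissionError("purpose not covered by the usage contract")
        return aggregate(self._records)   # only the derived result is shared

provider = DataProvider([12.3, 11.8, 13.1], UsagePolicy(allowed_purpose="ai_training"))
print(provider.query("ai_training", aggregate=lambda r: sum(r) / len(r)))  # 12.4
```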

Pascal Spano: What other options are being explored?

Jens Strüker: Another approach to achieving greater data sovereignty involves training large language models not centrally, but in a decentralized manner across distributed data sources. Instead of collecting all data in a central data center, the data remains with the respective data owners (e.g. companies or institutions) and model training is coordinated through distributed data nodes. This reduces data protection risks and enables the use of sensitive data, for example in medicine or industry, without disclosing it. Zero-knowledge proofs (ZKPs) also play a key role here because they allow the correctness of calculations or data contributions to be verified without revealing the underlying data. This enables a participant to prove that they have contributed correctly to model training without disclosing any data. In decentralized training scenarios, ZKPs enable trustworthy collaboration, even between parties who do not know or trust each other. Following remarkable progress, the current focus is on transferring these concepts to scalable systems – for example, by combining ZKPs with distributed computing architectures and cryptographic protocols.
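The core idea of such decentralized training can be sketched in a few lines (a NumPy-only toy example in which a linear model stands in for an LLM; the secure aggregation and ZKP steps mentioned above are only indicated in comments): each owner computes a model update on data that never leaves its node, and only the updates are averaged by a coordinator.

```python
# Toy federated-averaging sketch (NumPy only, a linear model stands in for an LLM):
# each data owner trains locally on data that never leaves its node, and only the
# resulting model updates are shared and averaged by a coordinator. A production
# setup would add secure aggregation and ZKP-based attestation of each update.
import numpy as np

rng = np.random.default_rng(0)
true_w = np.array([2.0, -1.0])

# Three owners, each with a private local dataset (x, y).
owners = []
for _ in range(3):
    x = rng.normal(size=(50, 2))
    owners.append((x, x @ true_w + rng.normal(scale=0.1, size=50)))

w = np.zeros(2)                                    # global model at the coordinator
for _ in range(100):
    local_models = []
    for x, y in owners:                            # runs on each owner's own node
        grad = 2 * x.T @ (x @ w - y) / len(y)      # local gradient on private data
        local_models.append(w - 0.1 * grad)        # one local update step
    w = np.mean(local_models, axis=0)              # only updates are aggregated

print("recovered weights:", np.round(w, 2))        # close to [ 2. -1.]
```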

Pascal Spano: Thank you very much for the interview.
 


Jens Strüker is Professor of Business Informatics and Digital Energy Management at the University of Bayreuth in Germany, co-director of the Fraunhofer Blockchain Laboratory, and deputy director of the Business Informatics division of Fraunhofer FIT. A qualified business informatics expert and economist, he and his team conduct research into the use of Web3 technologies like blockchains, digital machine identities and data spaces for the decarbonization of the economy.

Pascal Spano joined Metzler in 2017 as Head of Research in Capital Markets, one of Metzler’s core business areas. Prior to joining Metzler, from 2013 to 2017, he was co-founder and Managing Director of the German FinTech start-up PASST Digital Services GmbH in Cologne, Germany. Before that, he headed the Cash Equities division at UniCredit Group in Munich and Frankfurt/Main, Germany, for two years. From 2007 to 2010, Mr. Spano was Head of German Research at Credit Suisse Ltd. Prior to that, he worked for ten years in Deutsche Bank’s Global Markets Research division and helped build up the research activities of ABN Amro in Frankfurt/Main and London, UK.
