NEDO's Strategic Analysis: The Untapped Potential of Proprietary Enterprise Data in AI

The rapid emergence of Generative AI represents a monumental technological leap, but the long-term trajectory of its advancement now faces a critical threat: an impending data exhaustion problem. The vast, publicly accessible datasets that fueled the current generation of models are finite and are being consumed at an unsustainable rate. The core thesis of this analysis is that the solution to this bottleneck lies not in finding more public data, but in unlocking the immense, untapped potential of proprietary enterprise data through a new infrastructure paradigm known as "data spaces." Drawing on a document published in early 2025 by Japan's New Energy and Industrial Technology Development Organization (NEDO), this post analyzes the data supply challenge, outlines the data space framework as the definitive solution, and quantifies the multi-trillion-yen market opportunity it is poised to create.

1. The Impending AI Bottleneck: The Data Exhaustion Problem

To fully appreciate the strategic importance of data spaces, one must first grasp the scale and nature of the data supply crisis confronting the AI industry. The current model of training AI on publicly available internet data is approaching a hard limit. This section quantifies the problem by examining the finite nature of this resource and projecting when this well of information will run dry.

1.1 Analysis of the Global DataSphere

The volume of data being created globally is expanding at an explosive rate. According to an IDC report, the "Global DataSphere"—the total amount of data generated, captured, or replicated worldwide in a given year—is projected to reach 175 zettabytes (ZB) by 2025, growing at a compound annual growth rate (CAGR) of over 27%.

However, not all of this data is useful for training novel AI models. A critical distinction must be made between two categories:

  • Unique Data: This is newly created or captured data, such as a new document, image, or sensor reading. This is the only type of data that provides new information for AI training.
  • Replicated Data: This consists of copies, caches, or streams of existing data.

Crucially, unique data constitutes only about 10% of the Global DataSphere. The remaining 90% is replicated data, which offers no new value for training foundational AI models.

1.2 The Limits of Public Data

A seminal 2022 analysis by Epoch AI projects that the stock of publicly available, human-generated unique data is being rapidly depleted. Its findings indicate the following exhaustion timelines:

  • High-Quality Language Data: This resource was projected to be exhausted around 2024.
  • Low-Quality Language Data: This larger pool is expected to be exhausted between the 2030s and 2040s.
  • Image Data: The supply of unique public images is projected to last longer, with exhaustion expected between the late 2030s and late 2040s.

It is vital to note a key limitation of this analysis: it only considers publicly available, human-derived unique data found on the internet. It does not account for the vast quantities of data that exist behind corporate firewalls.
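
The logic behind such exhaustion estimates can be made concrete with a small sketch: given a fixed stock of unique public data and training-data consumption that compounds annually, find the year in which cumulative consumption first exceeds the stock. The parameter values below are illustrative placeholders, not Epoch AI's actual estimates.

```python
def exhaustion_year(stock_zb, annual_demand_zb, demand_growth, start_year=2025):
    """First year in which cumulative training-data demand exceeds the stock."""
    consumed, year = 0.0, start_year
    while consumed < stock_zb:
        consumed += annual_demand_zb
        annual_demand_zb *= 1 + demand_growth  # demand compounds each year
        year += 1
    return year

# Placeholder inputs for illustration only (not Epoch AI's figures).
print(exhaustion_year(stock_zb=17.5, annual_demand_zb=0.5, demand_growth=0.5))  # -> 2033
```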

1.3 The Strategic Implication

The conclusion from these findings is unambiguous. The AI industry's reliance on public internet data is unsustainable, creating a fundamental bottleneck that will cap further advancements in AI model performance. As publicly available data becomes a commoditized and depleted resource, the new competitive frontier for AI leadership will be defined by access to unique, high-quality proprietary data streams. This transforms the challenge from a simple resource shortage into a strategic race for a new class of data assets.

While the reservoir of public data is finite, an even larger, almost entirely untapped ocean of high-value data exists within the world's enterprises.

2. The Untapped Frontier: The Strategic Value of Enterprise Data

Enterprise data is not merely an alternative to public data; it is a superior, high-value asset that holds the key to the next wave of specialized, high-impact AI applications. This section quantifies this untapped resource and identifies the primary barrier preventing its widespread utilization.

2.1 Quantifying the Enterprise Data Reservoir

The scale of enterprise data is staggering. Based on IDC projections, by 2025, enterprise-generated data will constitute 60% of the Global DataSphere, amounting to an estimated 104 zettabytes.

Further analysis reveals the profound extent to which this resource is underutilized. Of all the unique data suitable for training Large Language Models (LLMs), an estimated 92% (approximately 16 ZB) is currently unutilized by AI. The vast majority of this untapped potential is restricted-access enterprise data.
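
These figures are mutually consistent, as a quick back-of-the-envelope check shows; the constants below are the IDC and NEDO numbers quoted above, assuming the 92% applies to the full pool of unique data (which the quoted figures support):

```python
# Constants are the IDC/NEDO figures quoted in the text above.
GLOBAL_DATASPHERE_ZB = 175   # total DataSphere, 2025 projection
UNIQUE_SHARE = 0.10          # unique data is ~10% of the total
UNUTILIZED_SHARE = 0.92      # ~92% of unique LLM-suitable data is unused by AI

unique_zb = GLOBAL_DATASPHERE_ZB * UNIQUE_SHARE       # 17.5 ZB
unutilized_zb = unique_zb * UNUTILIZED_SHARE          # ~16.1 ZB

print(f"Unique data:      {unique_zb:.1f} ZB")
print(f"Unutilized by AI: {unutilized_zb:.1f} ZB (the ~16 ZB cited above)")
```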

2.2 Industry-Specific Data Concentrations

This enterprise data is not evenly distributed. Certain industries are massive generators of valuable data, presenting unique opportunities.

| Industry Sector | Projected Data Volume (2025, in ZB) | Projected CAGR |
| --- | --- | --- |
| Manufacturing | 25.7 | +30% |
| Infrastructure | 22.5 | +29% |
| Financial | 9.4 | +26% |
| Healthcare | 3.6 | +36% |

Analysis of these figures reveals critical insights. The Manufacturing sector is by far the largest generator of enterprise data, with Infrastructure and Financial services also representing massive data concentrations. Concurrently, the Healthcare sector, driven by high-resolution medical imaging and analytics, exhibits the highest growth rate (+36%), signaling a rapidly expanding data frontier.
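
To see how quickly those growth rates compound, the sketch below projects each sector forward from the table's 2025 baselines using volume_t = volume_2025 × (1 + CAGR)^t. Holding the CAGRs constant is an extrapolation made here for illustration, not a claim from the source.

```python
# 2025 baselines and CAGRs from the table above; constant CAGR is assumed.
SECTORS = {
    "Manufacturing":  (25.7, 0.30),
    "Infrastructure": (22.5, 0.29),
    "Financial":      (9.4,  0.26),
    "Healthcare":     (3.6,  0.36),
}

for target_year in (2025, 2030):
    t = target_year - 2025
    projected = {name: round(vol * (1 + cagr) ** t, 1)
                 for name, (vol, cagr) in SECTORS.items()}
    print(target_year, projected)
```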

2.3 The Sovereignty Barrier

The primary reason this vast resource remains locked away is the inadequacy of traditional data-sharing models. In a conventional platform-based approach, data owners must surrender their data to a central entity, thereby losing control over how it is used. This is a non-starter for enterprises, for whom data is a core strategic asset.

The fundamental challenge is the lack of data sovereignty—the ability for a data owner to retain control and self-determine how, when, and by whom their data is accessed and used.

To unlock the immense value of enterprise data, a new framework is needed that respects data sovereignty as its core principle.

3. The Solution Framework: Data Spaces and Retrieval-Augmented Generation (RAG)

Overcoming the "sovereignty barrier" requires a new technological and philosophical approach to data sharing. The solution is a powerful combination of a new data-sharing architecture (data spaces) and an enabling AI technology (Retrieval-Augmented Generation, or RAG).

3.1 Defining the Data Space Paradigm

A data space is a decentralized infrastructure that enables data sharing among participants while guaranteeing data sovereignty. In this model, data remains with its owner, who can set precise rules and policies governing its use. This is not a centralized data lake or platform; rather, it is a peer-to-peer system built on trust and control.

This paradigm is made possible by a "middleware infrastructure that ensures semantic interoperability" (i.e., it acts as a universal translator, allowing different systems to understand the meaning and context of exchanged data without a central intermediary).
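
To make the paradigm concrete, here is a minimal conceptual sketch in Python. The class and policy names are hypothetical and do not correspond to any actual data space specification (such as IDSA or Gaia-X connectors); the point is only the shape of the interaction: data stays with its owner, and every access is checked against owner-defined rules.

```python
from dataclasses import dataclass, field

@dataclass
class UsagePolicy:
    """Owner-defined rules for one dataset (hypothetical model, not a real spec)."""
    allowed_consumers: set
    allowed_purposes: set

@dataclass
class DataSpaceParticipant:
    """A sovereign peer: its data never moves to a central platform."""
    name: str
    _catalog: dict = field(default_factory=dict)

    def publish(self, dataset_id, data, policy):
        # Register data locally together with the owner's usage policy.
        self._catalog[dataset_id] = (data, policy)

    def request(self, consumer, dataset_id, purpose):
        # Every access is checked against the owner-defined policy.
        data, policy = self._catalog[dataset_id]
        if consumer not in policy.allowed_consumers:
            raise PermissionError(f"{consumer} is not an authorized consumer")
        if purpose not in policy.allowed_purposes:
            raise PermissionError(f"purpose '{purpose}' is not permitted")
        return data

# Usage: the owner keeps the data and sets the terms of access.
owner = DataSpaceParticipant("FactoryCo")
owner.publish("sensor-logs", "<telemetry records>",
              UsagePolicy(allowed_consumers={"AILabInc"},
                          allowed_purposes={"rag-retrieval"}))
print(owner.request("AILabInc", "sensor-logs", purpose="rag-retrieval"))
```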

3.2 The Role of Retrieval-Augmented Generation (RAG)

Retrieval-Augmented Generation (RAG) is the key AI technology that operationalizes the value of data within a data space. RAG is a technique that enhances the capabilities of large language models by allowing them to access and reference external, proprietary knowledge sources in real time.

In simple terms, RAG allows an AI model to learn from specific enterprise data—such as internal engineering specifications, proprietary customer service logs, or unique financial rules—at the moment of need, without having to be fully and expensively retrained on that data.
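
A minimal sketch of the RAG pattern, in pure Python: a toy bag-of-words retriever stands in for a real embedding model, and the document snippets are invented for illustration. A production system would swap in proper embeddings and an actual LLM call, but the retrieve-then-prompt shape is the same.

```python
import math
from collections import Counter

def embed(text):
    """Toy 'embedding': a bag-of-words term-frequency vector."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# Stand-in for a proprietary knowledge source; contents invented for illustration.
DOCUMENTS = [
    "Pump P-301 requires torque calibration every 500 operating hours.",
    "Customer refunds above 50,000 JPY require manager approval.",
    "Line 4 coating process runs at 180 degrees Celsius.",
]

def retrieve(query, k=1):
    """Return the k documents most similar to the query."""
    q = embed(query)
    return sorted(DOCUMENTS, key=lambda d: cosine(q, embed(d)), reverse=True)[:k]

def build_prompt(query):
    """Assemble the augmented prompt that would be sent to an LLM."""
    context = "\n".join(retrieve(query))
    return f"Answer using only this context:\n{context}\n\nQuestion: {query}"

# No retraining involved: the proprietary fact is injected at query time.
print(build_prompt("How often does pump P-301 need torque calibration?"))
```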

3.3 Synergistic Impact

Together, data spaces and RAG form a powerful, synergistic solution.

  1. The data space provides the secure, controlled, and sovereign access to high-value enterprise data.
  2. RAG provides the efficient mechanism for AI models to consume and utilize that data to perform specific, high-value tasks.

This combination directly enables the next generation of AI, unlocking widespread business process automation and expanding AI use cases deep into the core functions of the enterprise. This conceptual framework is not just a technical solution; it is the foundation of a massive new economic market.

4. Quantifying the Opportunity: Data Space Market Size and Projections (TAM & SAM)

To validate the strategic opportunity and guide investment, it is essential to quantify the potential market size. By forecasting the Total Addressable Market (TAM) and the Serviceable Addressable Market (SAM), we can understand the full economic potential that the data space paradigm unlocks.

4.1 Market Definitions

  • Total Addressable Market (TAM): The total market size of all applications and middleware that data space technologies can directly influence and potentially disrupt.
  • Serviceable Addressable Market (SAM): The portion of the TAM that can realistically be substituted by or newly created for companies deploying new data space-centric technologies.

4.2 Overall Market Projections to 2040

The economic shift driven by data spaces is projected to be immense, creating a market worth hundreds of trillions of yen by 2040.

| Market Layer | Global TAM (JPY Trillion) | Global SAM (JPY Trillion) | Domestic TAM (JPY Trillion) | Domestic SAM (JPY Trillion) |
| --- | --- | --- | --- | --- |
| Application | ~147 | ~21.7 | ~3.8 | ~0.55 |
| Middleware | ~198 | ~30.2 | ~7.6 | ~1.1 |

4.3 Analysis of Market Growth

The data reveals several crucial strategic insights. The combined global TAM across the Application and Middleware layers is forecast to reach roughly 345 trillion JPY by 2040 (147 trillion plus 198 trillion), signaling a transformative economic opportunity.

Notably, the Middleware market is projected to be larger than the Application market and to grow slightly faster (a global CAGR of 7.1% versus 6.6%). This trend is characteristic of a foundational technology shift; it reflects the market's need to build the underlying "rails" and infrastructure before the full potential of the application layer can be realized. This positions middleware not just as the larger market, but as the leading indicator of the entire ecosystem's development.
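
As a rough illustration of what those growth rates imply, the sketch below backs out the implied base-year market sizes. Note that the 2025 base year and 15-year horizon are assumptions made here for illustration; the source tables do not state them.

```python
# V_start = V_end / (1 + CAGR)^n; base year and horizon are assumed, not sourced.
def implied_start(v_end_trillion_jpy, cagr, years):
    return v_end_trillion_jpy / (1 + cagr) ** years

YEARS = 2040 - 2025  # assumed 15-year horizon
print(f"Middleware:  ~{implied_start(198, 0.071, YEARS):.0f} trillion JPY implied in 2025")
print(f"Application: ~{implied_start(147, 0.066, YEARS):.0f} trillion JPY implied in 2025")
```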

While the overall market is vast, a more granular analysis is needed to identify the most promising segments for strategic focus.

5. Strategic Focus Areas: Identifying High-Growth Market Segments

While the total market size is compelling, strategic success requires identifying and targeting the specific application and middleware segments with the highest potential for growth and value capture.

5.1 Analysis of the Application Layer Opportunity

In the application layer, industry-specific solutions are poised for significant growth as enterprises adopt AI to solve unique business problems.

| Application Segment | 2040 Global TAM (JPY Trillion) | 2040 Global SAM (JPY Trillion) |
| --- | --- | --- |
| Vertical SaaS | ~62.6 | ~7.6 |
| Customer Management | ~33.6 | ~4.8 |
| Production Management | ~14.4 | ~2.0 |
| Knowledge Management | ~13.6 | ~3.0 |

The analysis shows that Vertical SaaS—software tailored to the specific needs of an industry—represents the single largest market opportunity. With a global TAM of over 62 trillion JPY and a serviceable market of 7.6 trillion JPY, the potential rewards are enormous. However, this will also be a highly competitive domain as incumbents and new players race to build data-space-native applications.

5.2 Analysis of the Middleware Layer Opportunity

The middleware layer, which provides the foundational plumbing for the entire data space ecosystem, presents a different and arguably more strategic opportunity.

| Middleware Segment | 2040 Global TAM (JPY Trillion) | 2040 Global SAM (JPY Trillion) |
| --- | --- | --- |
| Identity Management | ~25.4 | ~4.5 |
| Integrated Data Management | ~38.0 | ~5.4 |
| Semantics Management | ~15.9 | ~3.5 |
| Cross-Platform | ~6.2 | ~0.7 |

The key strategic insight lies in the emerging, foundational segments. Unlike mature middleware categories, areas like Semantics Management (ensuring data is understandable across systems) and Identity Management (securely verifying participants) are comparatively new fields. These nascent middleware segments represent a rare opportunity to establish "de facto standards." Early leadership here is not just about capturing market share, but about influencing the entire architecture of the future data economy, creating a much more defensible long-term position.

The Global SAM for Semantics Management alone is projected to be ~3.5 trillion JPY by 2040. Because this is a new field, the obtainable share of this market is high (an estimated 22%), rewarding pioneers who can establish the standards for this foundational ecosystem component.
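
As a cross-check, the middleware table above can be reduced to SAM/TAM ratios, which is the sense in which the 22% figure is obtained; a short sketch:

```python
# 2040 middleware figures from the table above (JPY trillion): (TAM, SAM).
SEGMENTS = {
    "Identity Management":        (25.4, 4.5),
    "Integrated Data Management": (38.0, 5.4),
    "Semantics Management":       (15.9, 3.5),
    "Cross-Platform":             (6.2,  0.7),
}

for name, (tam, sam) in sorted(SEGMENTS.items(),
                               key=lambda kv: kv[1][1] / kv[1][0], reverse=True):
    print(f"{name:<28} SAM/TAM = {sam / tam:.0%}")
# Semantics Management tops the list at ~22%, matching the figure cited above.
```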

These market dynamics point toward a clear strategic imperative for businesses, technologists, and investors looking to capitalize on the next wave of AI.

6. Conclusion: The Strategic Imperative for the Data Space Ecosystem

The continued advancement of Artificial Intelligence is fundamentally contingent on solving the data exhaustion problem. This analysis has established that the definitive solution is to securely unlock the vast and high-value reservoir of proprietary enterprise data. The framework to achieve this is the data space, a decentralized infrastructure built on the principle of data sovereignty and enabled by AI technologies like Retrieval-Augmented Generation.

The scale of this shift is not merely technological but profoundly economic, creating a global addressable market projected to be worth 345 trillion JPY by 2040. While significant opportunities exist across the technology stack, the most strategic area for immediate focus is clear.

The final recommendation of this analysis is that strategic investment, research, and development must be directed toward the foundational middleware layer, particularly in the nascent and critical segments of Semantics Management and Identity Management. Leadership in these areas is more than a market opportunity; it is the critical path to building the essential infrastructure for the next generation of enterprise AI. For those who act decisively, this represents a generational opportunity to architect the core of the next-generation digital economy and secure an unassailable leadership position within it.

