Rethinking Test Coverage for AI-Enabled Systems: Beyond Line Coverage
Traditional test coverage metrics (line, branch, MC/DC) are inadequate for AI-enabled system components. This paper proposes a multi-layer coverage framework addressing data coverage, behavioral coverage, and distributional robustness coverage for ML-based components.
Test Coverage for AI Components: A New Framework
MC/DC coverage tells you nothing useful about whether a neural network is adequately tested. The field needs new metrics, and some are starting to emerge.
Data coverage: Are you testing across the full operational data distribution, including edge cases and distribution boundary conditions? A simple metric: what fraction of the operational design domain's (ODD's) variation is represented in your test data?
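A minimal sketch of such a metric, under two simplifying assumptions that real ODDs rarely satisfy: the ODD is a box in feature space, and uniform grid binning is an adequate discretization. Coverage is then the fraction of ODD grid cells hit by at least one test sample:

```python
import numpy as np

def odd_data_coverage(test_inputs, odd_bounds, bins_per_dim=10):
    """Fraction of ODD grid cells hit by at least one test input.

    test_inputs: (n, d) array of test samples.
    odd_bounds: list of (lo, hi) pairs, one per dimension, defining the ODD.
    Hypothetical sketch; real ODDs need domain-specific dimensions and binning.
    """
    test_inputs = np.asarray(test_inputs, dtype=float)
    d = test_inputs.shape[1]
    cell_ids = set()
    for x in test_inputs:
        idx = []
        in_odd = True
        for j, (lo, hi) in enumerate(odd_bounds):
            if not (lo <= x[j] <= hi):
                in_odd = False  # samples outside the ODD don't count toward coverage
                break
            # map the coordinate to a bin index, clamping the upper boundary
            b = min(int((x[j] - lo) / (hi - lo) * bins_per_dim), bins_per_dim - 1)
            idx.append(b)
        if in_odd:
            cell_ids.add(tuple(idx))
    return len(cell_ids) / bins_per_dim ** d
```

Note the curse of dimensionality: the cell count grows exponentially with ODD dimensions, so in practice the grid is applied to a handful of operationally meaningful axes, not raw input features.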
Behavioral coverage: For classification systems, are you testing all intended output classes? For regression systems, are you testing across the full output range with density proportional to operational importance?
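Both halves of this idea can be sketched in a few lines. The importance-weight profile below is a hypothetical input (it would come from operational analysis); the functions simply measure which output classes and which output-range bins the test suite actually exercises:

```python
import numpy as np

def class_coverage(tested_labels, intended_classes):
    """Fraction of intended output classes exercised by at least one test."""
    return len(set(tested_labels) & set(intended_classes)) / len(intended_classes)

def output_range_coverage(predictions, out_lo, out_hi, importance_weights):
    """Regression output coverage weighted by operational importance.

    importance_weights: per-bin weights summing to 1, reflecting how much
    each region of the output range matters operationally (assumed given).
    Returns the summed weight of bins hit by at least one prediction.
    """
    n_bins = len(importance_weights)
    hit = np.zeros(n_bins, dtype=bool)
    for y in predictions:
        if out_lo <= y <= out_hi:
            b = min(int((y - out_lo) / (out_hi - out_lo) * n_bins), n_bins - 1)
            hit[b] = True
    return float(np.sum(np.asarray(importance_weights)[hit]))
```

A test suite can then score 100% class coverage yet only, say, 60% importance-weighted range coverage, which is exactly the gap this metric is meant to expose.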
Distributional robustness coverage: How does performance degrade as inputs move toward and beyond the training distribution boundary? This is the SOTIF-relevant metric — what happens at the ODD edge?
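One way to operationalize this is a degradation profile: evaluate the model at increasing shift magnitudes and watch how accuracy falls off toward and past the ODD edge. The `shift_fn` hook below is an assumed, domain-specific ingredient (e.g. fog density for a perception model); the sketch only fixes the measurement loop:

```python
import numpy as np

def robustness_profile(model_fn, x, y, shift_fn, magnitudes):
    """Accuracy as a function of distribution-shift magnitude.

    model_fn: inputs -> predicted labels (any callable).
    shift_fn: (inputs, magnitude) -> inputs pushed toward/past the
              training-distribution boundary (hypothetical, domain-specific).
    Returns a list of (magnitude, accuracy) pairs; the shape of the
    curve near the ODD edge is the SOTIF-relevant signal.
    """
    profile = []
    for m in magnitudes:
        preds = model_fn(shift_fn(x, m))
        profile.append((m, float(np.mean(preds == y))))
    return profile
```

A gradual, monotone decline suggests graceful degradation; a cliff at some magnitude marks a boundary that the safety argument must address explicitly.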
Adversarial coverage: For safety-critical applications, are you testing against adversarial inputs that could fool the model? This is not about security attacks but about ensuring the model doesn't produce safety-critical errors under realistic input perturbations.
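A simple proxy for this metric is the fraction of test points whose prediction survives sampled perturbations within a small L-infinity ball. The random search below is a hypothetical stand-in for gradient-based methods such as FGSM/PGD, which find worst-case perturbations far more efficiently:

```python
import numpy as np

def perturbation_robust_fraction(model_fn, x, y, eps, n_trials=50, seed=0):
    """Fraction of test points whose prediction is unchanged by every
    sampled perturbation within an L-inf ball of radius eps.

    model_fn: (n, d) array -> (n,) predicted labels.
    Random search is a weak lower bound on vulnerability; a real
    adversarial-coverage measurement would use a gradient-based attack.
    """
    rng = np.random.default_rng(seed)
    robust = 0
    for xi, yi in zip(x, y):
        ok = True
        for _ in range(n_trials):
            delta = rng.uniform(-eps, eps, size=xi.shape)
            if model_fn((xi + delta)[None, :])[0] != yi:
                ok = False  # found a perturbation that flips the prediction
                break
        robust += ok
    return robust / len(x)
```

Points that fail here sit close to a decision boundary; for safety-critical classes, each such point is a candidate for targeted hardening or an ODD restriction.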
Integration with existing frameworks: The paper proposes integrating these metrics with DO-178C supplemental guidance (DO-330) and the software requirements of IEC 62061 to build a coherent coverage argument for mixed conventional/ML software architectures.