Project Details
Artificial resilience using learning-based test and debug for future intelligent systems
Applicant
Professor Mehdi B. Tahoori, Ph.D.
Subject Area
Computer Architecture, Embedded and Massively Parallel Systems
Electronic Semiconductors, Components and Circuits, Integrated Systems, Sensor Technology, Theoretical Electrical Engineering
Term
since 2022
Project identifier
Deutsche Forschungsgemeinschaft (DFG) - Project number 495168954
Today's and future complex computing systems consist of several hardware and software components with mixed degrees of criticality and with changing requirements (e.g., performance and functionality) and resources (e.g., energy). They face new and complex failure mechanisms caused by nanometer-scale fabrication technology, and they are also subject to complex design bugs and malicious attacks. For such systems, brute-force schemes with hardcoded redundancy in the design and fixed runtime recovery policies are no longer sufficient. Increasing design complexity, together with complex interactions between the runtime behavior of software and device-level parameters that are neither fully understood nor accurately modeled, causes Heisenbugs to escape pre-silicon validation. The interdependent fields of fault tolerance, manufacturing test, and post-silicon debug share a common set of complex and hard-to-model root causes of resiliency problems, yet they have diverged in the past, each addressing these problems with disjoint design infrastructures and solutions. This is partly because they are separate fields, pursued in different academic research communities and industry divisions. Each targets particular classes of failures and malfunctions, based on very rigid spatial and temporal redundancy hardcoded into the system design, which incurs exorbitant complexity and cost in terms of area, energy, and performance. However, because the problems are increasingly complex and non-deterministic, spanning hardware design, manufacturing, and in-field operation, traditional deterministic solutions, mostly based on analysis of the binary logic states of the circuit and system, are becoming increasingly inefficient. Moreover, adding separate design infrastructures and solutions to target these issues in isolation no longer scales.
The main purpose of this proposal is the pursuit of artificial resilience through a set of holistic approaches spanning the design, test, and operation lifecycles of computing systems, so that they can continue to provide the required functionality despite malfunctioning components, environmental disturbances, and design bugs. The main novelty of this project is our proposed artificial resilience paradigm, based on a sensor-rich architecture and learning-based analysis of sensor data. As a feasibility study and proof of concept, we will also investigate how this concept can be tailored to the specific challenges and problems of manufacturing test, post-silicon debug, and reliable system design.
DFG Programme
Research Grants