Project Details

System-Physician-on-a-Chip (SPOC): Chip Health-Monitoring Infrastructure IP and Run-Time Adaptation

Subject Area: Computer Architecture, Embedded and Massively Parallel Systems; Electronic Semiconductors, Components and Circuits, Integrated Systems, Sensor Technology, Theoretical Electrical Engineering
Term: 2015 to 2021
Project identifier: Deutsche Forschungsgemeinschaft (DFG), project number 269744693
 
Design-time solutions and guard-bands for resilience are no longer sufficient for nanoscale integrated circuits (ICs). Because of process variations, each chip is born with a unique personality (nature), and because of operating conditions, environment, and workload, it grows uniquely (nurture). This project is motivated by the need to guarantee that each system behaves acceptably despite its different nature and nurture (resilience). Resilience has been defined as the persistence of a performance level that can justifiably be trusted in the presence of change. Static solutions based on pre-determined adaptation strategies therefore cannot provide adequate resilience as systems evolve over time.

While today's ICs incorporate a large number of sensors (thermal, voltage, delay, etc.) for runtime monitoring, breakthroughs are needed to extract useful information from sensor data, perform real-time analysis, and make decisions about online adaptation. Appropriate reasoning methods are also needed to deal with inconsistent or contradictory sensor data caused by stress-, process-, and workload-induced spatiotemporal variations. It is important to predict the system state so that countermeasures can be taken before a failure occurs.

The proposed research focuses on data-driven techniques for guiding dynamic adaptation policies. This level of dynamic decision-making and prediction-based control is a significant step towards resilient systems. The intellectual merit lies in the advancement of data analytics solutions for reasoning about on-chip behavior, the integration of prediction-based adaptation, and the update of adaptation strategies based on the success, or lack thereof, of past adaptation decisions. Specific tasks include:

1. Monitoring of performance, power, and reliability: In addition to the use of on-chip sensors, the research will involve the reuse and redesign of built-in self-test (BIST) infrastructure to accurately monitor events such as voltage droop and aging by aggregating through compaction.
2. Reasoning under uncertainty: Dempster-Shafer theory will be used to aggregate sensor data and derive coherent conclusions about system behavior. The challenge lies in determining domain-specific belief functions. An evaluate-update procedure will be used to tune belief functions.
3. Performance/reliability prediction: Machine-learning techniques will be adapted for extracting useful features, detecting anomalies, and learning from sensor readouts. The goal is to predict system performance and incorporate incremental learning for continuous refinement.
4. Run-time adaptation: Feedback and evaluation will form the basis for proactive updates to adaptation policies (voltage/frequency scaling, task scheduling, etc.), for predicting the reliability trajectory, and for taking appropriate decisions.

This project will lead to a health-monitoring infrastructure IP that allows a system to respond to changes in behavior occurring at different time scales.
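
As an illustration of the reasoning-under-uncertainty task, the following is a minimal sketch of Dempster's rule of combination, the standard way in Dempster-Shafer theory to fuse two independent bodies of evidence. It is not taken from the project itself: the two-state frame of discernment (OK, DEGRADED), the sensor names, and all mass values are hypothetical and chosen only to show how partially contradictory sensor evidence can be merged into one belief assignment.

```python
from itertools import product

# Hypothetical frame of discernment: the chip is either healthy or degraded.
OK = frozenset({"ok"})
DEGRADED = frozenset({"degraded"})
UNKNOWN = OK | DEGRADED  # full frame: mass that is not committed to either state


def combine(m1, m2):
    """Fuse two mass functions with Dempster's rule of combination.

    Each mass function maps a frozenset of hypotheses to a mass in [0, 1],
    and the masses of one function sum to 1.
    """
    fused = {}
    conflict = 0.0
    for (a, mass_a), (b, mass_b) in product(m1.items(), m2.items()):
        common = a & b
        if common:
            # Evidence agrees on at least one hypothesis.
            fused[common] = fused.get(common, 0.0) + mass_a * mass_b
        else:
            # Contradictory evidence: its mass is removed by normalization.
            conflict += mass_a * mass_b
    norm = 1.0 - conflict
    if norm <= 1e-12:
        raise ValueError("sources are in total conflict; rule is undefined")
    return {hyp: mass / norm for hyp, mass in fused.items()}


# Assumed (made-up) belief assignments derived from two on-chip sensors.
thermal_sensor = {DEGRADED: 0.6, UNKNOWN: 0.4}
delay_sensor = {OK: 0.3, DEGRADED: 0.5, UNKNOWN: 0.2}

fused = combine(thermal_sensor, delay_sensor)
for hypothesis, mass in sorted(fused.items(), key=lambda kv: -kv[1]):
    print(sorted(hypothesis), round(mass, 3))
```

With these made-up inputs, the two sensors disagree on part of their evidence (mass 0.18 is in conflict), and after normalization the fused mass on the degraded state is roughly 0.76. The domain-specific belief functions that the project names as the main challenge correspond to the hand-picked dictionaries here; in practice they would have to be derived and tuned from sensor data.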
DFG Programme: Research Grants
 
 
