ACE Meeting on May 13th, 2015

Attendants: Behrooz, Fatemeh, Alexandra, Erik, Yannis, Per, Miquel, Jacob
Skype: Christos

Fatemeh and Behrooz presented their research directions/ideas for the project:

1) Testing the reliability of approximate architectures compared to the "accurate" architecture
I.e.: are faults in the approximate architecture less/equally/more severe than in the accurate architecture?
Behrooz and Fatemeh mention that one possibility is to use gem5 to test this.
Jacob points out that he has done some work on analyzing faults (it seems to be mainly at the netlist level).
There is also a tool based on gem5 for analyzing the impact of faults. The name of the tool is GemFI. See the DSN'14 paper: "GemFI: A Fault Injection Tool for Studying the Behavior of Applications on Unreliable Substrates"
Jacob mentions that he has built a static analysis tool that computes the impact of fault injection.

2) Approximate error handling mechanisms
The idea is to protect only the parts of the system that are more critical to the computation. For example, provide a recovery method only for severe failures.
They mention brake-by-wire as an example:
  1) Some critical errors can be caught via an exception mechanism, and a recovery procedure can be used for those.
  2) Other errors are silent (for example, when values are erroneous), and those can lead to catastrophe. The idea would be to protect against these.
Finally, there are non-critical errors that we do not need to care about.

The following question is raised: Is approximate computing interesting for safety-critical systems?
There is some discussion on application requirements. Some applications are really computationally intensive, and there are also power-limited scenarios.
These motivate the use of approximate computing in these environments.
Summarizing, the idea is not to waste time handling errors that are benign, but instead to focus only on critical errors.

Question: how can the algorithm designer communicate what is OK and what is not OK in terms of accuracy?
Miquel wonders whether a neural network or machine learning can be used for automated quality checking. Christos points out that this has been attempted in the past, but it works only in very limited situations.
In general, there is a question of adaptive (online learning) vs. non-adaptive error checking. We probably need the algorithm designer to specify a certain checking function.
Behrooz: how straightforward is it to design fault injection experiments for each application? This is a topic that needs to be studied.

3) Testing all possible (transient) faults is too expensive.
We want to use program analysis techniques to identify the liveness of registers and limit testing accordingly.
Yannis mentions "ACE bits" as related work on this. See reference: ACE bits, Todd Austin, MICRO 2003
For example, there is an opportunity in floating point: the exponent bits are more critical than the mantissa bits. Faults in the lower bits of the number are not critical, so testing fault injection there can be avoided. There is an opportunity to perform "format-aware fault analysis".
Another option is to detect whether critical errors have occurred by profiling the number of accesses to memory, registers, etc. and comparing the result with the profile of the correct execution.

As the next topic, Yannis shows and discusses the slides from the thematic session on error-aware systems, which took place during the HiPEAC CSW in Oslo.
An interesting topic is to formalize the definition of "approximate computing". We should also think about that (see the slides in Box concerning the WAPCO workshop).

During the meeting we also discussed which categories of fields are interesting for approximation.
Miquel points out the classification provided by Ceze et al.: 1) analog input, 2) analog output, 3) applications with no single correct answer (e.g., web search), and 4) iterative and convergent applications (see CACM 2015, No. 01).

Next, we discussed the meeting that we had with Marc Casas in Oslo. Marc worked on the resilience of Algebraic Multigrid (AMG) while he was at LLNL. AMG belongs to the category of iterative and convergent algorithms. Because of the properties of AMG, injecting faults into its registers results in a crash, in completion in the same time, or in a longer time to converge. Interestingly, a scenario of non-convergence does not occur. These properties can make it an interesting case study for the project. Marc has expressed his willingness to collaborate on this topic.

The next meeting will take place on May 26th at 10h. The next "entertainer" has not yet been decided.
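
Illustration for item 3 (format-aware fault analysis): the following is a minimal sketch, assuming IEEE 754 double-precision values and a single-bit-flip fault model, of why exponent bits are more critical than low mantissa bits. The helper names flip_bit and relative_error are hypothetical and are not part of gem5, GemFI, or any other tool mentioned in these notes.

```python
import struct


def flip_bit(value: float, bit: int) -> float:
    """Return `value` with one bit of its IEEE 754 binary64 encoding flipped.

    Bit numbering (an assumption of this sketch): bit 0 is the least
    significant mantissa bit, bits 52..62 are the exponent, bit 63 is the sign.
    """
    if not 0 <= bit <= 63:
        raise ValueError("bit must be in the range 0..63")
    # Reinterpret the double as a 64-bit unsigned integer.
    (bits,) = struct.unpack("<Q", struct.pack("<d", value))
    # Flip the selected bit and reinterpret the result as a double.
    (faulty,) = struct.unpack("<d", struct.pack("<Q", bits ^ (1 << bit)))
    return faulty


def relative_error(value: float, bit: int) -> float:
    """Relative error caused by flipping `bit` in `value` (value must be nonzero)."""
    faulty = flip_bit(value, bit)
    return abs(faulty - value) / abs(value)


if __name__ == "__main__":
    # Compare a few mantissa bits (0, 20, 51) with a few exponent bits (52, 55).
    for bit in (0, 20, 51, 52, 55):
        kind = "mantissa" if bit < 52 else "exponent"
        print(f"bit {bit:2d} ({kind}): relative error {relative_error(1.0, bit):.3e}")
```

Under these assumptions, flipping bit 0 of 1.0 changes the value by about 2.2e-16 in relative terms, whereas flipping bit 52 (the lowest exponent bit) halves it. A format-aware campaign could therefore skip the low mantissa bits and concentrate injections on the exponent and sign bits.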