Intelligent Measurement Data Processing for the Construction of Dependable IT Systems

Contact: 
István Majzik

Sponsors:

Hungarian-Portuguese Bilateral Scientific and Technology Development Cooperation Agreement

Period:

2004-2005

Participants:

CISUC - Center of Informatics and Systems, University of Coimbra, Portugal
Department of Measurement and Information Systems, Budapest University of Technology and Economics, Hungary

Motivation:

Dependability is by definition the complex ability to guarantee that a user can justifiably rely on the services delivered by IT systems. The factors influencing the reliability of microelectronics-based equipment have changed drastically during the last decade. Improved technologies have reduced the rate of occurrence of catastrophic faults originating from circuit defects. At the same time, transient faults caused by power and signal noise and by natural background radiation have become more and more dominant, as miniaturization reduces both the physical size and the energy levels in the circuitry.

A proper protection of the QoS (Quality of Service) against the effects of such faults necessitates the use of fault tolerance (FT) techniques. Many mission-critical systems use coarse-grained redundancy, such as modular replication and voting, to compensate for the effects of faults. However, a wide spectrum of applications, including the majority of embedded systems, cannot tolerate the large cost overheads resulting from a high level of redundancy. These systems have to exploit an FT scheme based on fine-grained redundancy.

The key factor in assuring proper fault coverage is a good fit between the FT measures and the fault occurrence profile, i.e., protecting the system against the most frequently occurring faults. This harmonization of faults and protective FT measures is usually achieved by using benchmarks that support the reuse of observations made in previous systems of an architecture and technology similar to those of the target system.

The trustworthiness of the fault model is a crucial factor in the design for dependability process. The main source of difficulty in the creation of a realistic fault model is the lack of direct observability of faults, as only their manifestation in the form of failures can be measured. Failures result from a complex, time-dependent interaction between the faults, the hardware and software architecture, the workload, etc. Accordingly, a faithful characterization of faults poses a highly demanding measurement problem due to the large number of factors and the complexity of the interactions:

  • A large number of observations is needed to deduce the origin and propagation mechanisms of faults in a statistically meaningful way, occasionally reaching the range of hundreds of thousands of patterns. Usual operational logs provide insufficient input for the analysis due to limitations in both the number and the detail of the observations. The best practice for obtaining a proper input data set is the artificial injection of faults under the supervision of a special measurement setup that records the reactions in detail.
  • Fault campaign log records contain the fault-relevant information hidden in long temporal sequences, mixed together from different sources due to the complexity of the fault manifestation mechanisms. This necessitates the identification of the relevant parts, the suppression of measurement noise, and the separation of qualitatively different manifestations corresponding to different fault propagation mechanisms (a minimal clustering sketch follows this list). Traditionally this was done by expert judgement using heuristic analysis, as simple statistics are infeasible for the analysis of such complex phenomena. However, non-dominant effects corresponding to relatively rare events are frequently overlooked this way, resulting in fault escapes in the target design that reduce the QoS.
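
As an illustration of such automated separation, the following minimal Python sketch clusters per-experiment records of a fault injection campaign into qualitatively different manifestation classes. It is only a sketch under assumed preprocessing: the feature encoding (detection latency, number of corrupted registers, crash flag) and all identifiers are hypothetical, not taken from the project.

import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

def cluster_manifestations(features: np.ndarray, n_clusters: int = 4):
    """Group per-experiment feature vectors; returns one cluster label per record."""
    scaled = StandardScaler().fit_transform(features)   # remove scale bias between features
    model = KMeans(n_clusters=n_clusters, n_init=10, random_state=0)
    return model.fit_predict(scaled)

# Synthetic stand-in for campaign records: [detection latency, corrupted registers, crash flag]
rng = np.random.default_rng(0)
records = np.vstack([
    rng.normal([5.0, 1.0, 0.0], 0.5, (200, 3)),   # dominant class: fast, benign detection
    rng.normal([50.0, 8.0, 1.0], 5.0, (10, 3)),   # rare class: crash after long latency
])
labels = cluster_manifestations(records, n_clusters=2)
# Small clusters flag the non-dominant effects that heuristic expert
# analysis tends to overlook.
print(np.bincount(labels))
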
Obviously, fault models have to be incorporated into the design and analysis process of IT systems. This can be done by simply publishing statistics for the system designer on the types, effects, and frequency of the faults to be anticipated. A more efficient way is to extend the design database of the system engineering tool with the fault model, thus supporting automated dependability analysis and the incorporation of FT measures.

Thus, an effective design for dependability technology requires the experimental assessment of the potential faults, the development of FT measures and their experimental validation, and the formulation of the fault models and FT measures in a form reusable in the design process.

Project aim:

The main objective of the research proposal is to explore the usefulness and efficiency of intelligent data processing methods in the field of fault modelling, with special emphasis on the comparison of heuristic and automatically generated models. The practical usefulness of the models will be demonstrated by applying and validating them in design for dependability pilot applications.

In order to assess the feasibility of the approach, a complete design for dependability roundtrip will be carried out on pilot examples:

  • In the first phase, the existing records from the DBENCH database will be analysed by means of data mining. Subsequently, the efficacy of automated model extraction will be evaluated by comparing the results with those of the OLAP-based analysis.
  • Selected FT measures will be implemented in a pilot application based on the fault model generated in the previous phase. The measures should represent typical paradigms, such as architecture-level solutions (e.g., control-flow checking by means of a watchdog processor, use of redundant data structures) and implementation-level techniques (e.g., self-checking code); a simplified sketch of signature-based control-flow checking follows this list.
  • The pilot system will undergo a fault injection campaign. The efficiency of the individual measures will be estimated by the combined methodology developed in the first phase.
  • Finally, the results will be generalized and published in a directly reusable form for academia and industry. The main candidate is to formulate the methodology as an analysis pattern over a standard XML-based workflow description language, and in the form of standard UML and CWM (Common Warehouse Metamodel) patterns. Similarly, architecture-level FT measures will be published as parametrizable UML 2.0 templates.
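
To make the control-flow checking paradigm concrete (as referenced in the list above), the following purely illustrative Python sketch checks run-time signatures against a reference control-flow model. The block names, signature values, and the Watchdog class are hypothetical stand-ins; in the project such checking would be realized by a watchdog processor observing signatures emitted by the monitored CPU.

# Reference model: the compile-time signature of each basic block and the
# legal successor blocks (hypothetical values, for illustration only).
EXPECTED = {"init": 0x1F, "read": 0x2A, "compute": 0x3C, "output": 0x4D}
VALID_NEXT = {"init": {"read"}, "read": {"compute"},
              "compute": {"read", "output"}, "output": set()}

class Watchdog:
    """Software stand-in for a watchdog processor checking run-time signatures."""
    def __init__(self):
        self.current = None

    def checkpoint(self, block, signature):
        # A transient fault corrupting the control flow manifests either as a
        # wrong signature or as an illegal block-to-block transition.
        if signature != EXPECTED[block]:
            raise RuntimeError(f"signature mismatch in block {block!r}")
        if self.current is not None and block not in VALID_NEXT[self.current]:
            raise RuntimeError(f"illegal transition {self.current!r} -> {block!r}")
        self.current = block

wd = Watchdog()
for block in ("init", "read", "compute", "output"):
    wd.checkpoint(block, EXPECTED[block])   # a fault-injected run would raise here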

Further information:

István Majzik, Ph.D.
András Pataricza, Ph.D.