RODES

Designing Robust Systems through Synthesis of Parametric Markov Chains

Radu Calinescu, Milan Ceska, Simos Gerasimou, Marta Kwiatkowska, Nicola Paoletti,

We present a method for the synthesis of system designs that satisfy strict quality requirements, are Pareto-optimal with respect to a set of quality optimisation criteria, and are robust to variations in the system parameters. To this end, we model the design space of the system under development as a parametric continuous-time Markov chain (pCTMC) with discrete and continuous parameters that correspond to alternative system architectures and to the ranges of possible values for configuration parameters, respectively. Given this pCTMC and required tolerance levels for the configuration parameters, our method produces a sensitivity-aware Pareto- optimal set of designs, which allows the modeller to inspect the ranges of quality attributes induced by these tolerances, thus enabling the effective selection of robust designs. Through application to two systems from different domains, we demonstrate the ability of our method to synthesise robust designs with a wide spectrum of useful trade-offs between quality attributes and sensitivity.

The main contributions of this work are:

The definitions of the parametric Markov chain synthesis problem and of the sensitivity-aware Pareto dominance relation.
An efficient method that combines probabilistic model synthesis and precise parameter synthesis to automate the generation of sensitivity-aware Pareto fronts for quality engineering.
Our Robust DEsign Synthesis (RODES) tool that implements our method for designing robust systems.
An extensive evaluation of our tool-supported method in two case studies: a replicated file system used by the Google’s search engine, and a cluster availability management system.

RODES: Robust Design Synthesis Method

System Modelling

A parametric continuous-time Markov chain (pCTMC) defines the design space of the system under development.

A candidate design C(P',q) defines a fixed discrete parameter valuation and continuous parameter values from a small region.

A quality attribute over pCTMCs-defined design spaces is defined using continuous stochastic logic (CSL).

A quality requirement of a system with design space given by a pCTMC C(P,Q) can be:

An objective function f_i, i ∈ I, that corresponds to a quality system attribute and defined in terms of a CSL path formula φ_i, i ∈ I, such that:

A boolean constraint c_j, j ∈ J, that corresponds to a quality system attribute and defined in terms of a CSL path formula φ_i, i ∈ I, and a threshold ~_jr_j such that:

Parametric Markov Chain Synthesis Problem

Consider a system with design space C(P,Q), quality requirements given by objective functions {f_i}_{i ∈ I} and constraints {c_j}_{j ∈ J}, and designer-specified tolerances {γ_k}_{k ∈ K} for the continuous parameters of the system.

The set of feasible designs F for the system, i.e., of candidate designs that meet the tolerances {γ_k}_{k
∈ K} and satisfy the constraints {c_j}_{j ∈ J} is given by:

The parametric Markov chain synthesis problem consists of finding the Pareto-optimal set PS of candidate designs (i.e. pCTMCs) with tolerances {γ_k}_{k ∈ K} that satisfy the constraints {c_j}_{j ∈ J} and are non-dominated with respect to the objective functions {f_i}_{i ∈ I}:

The sensitivity-aware dominance relation '≺' between two candidate designs is defined as follows.

Parametric CTMC Synthesis Method

Computing the Pareto-optimal design set PS is typically unfeasible, as the design space C(P,Q) is infinite due to its real-valued parameters. Moreover, every candidate design C(P′, q) corresponds to an infinite set of CTMCs that cannot all be analysed to establish its quality and sensitivity.

Our pCTMC synthesis method, whose high-level steps are shown in Algorithm 1 below, combines search-based software engineering (SBSE) techniques with techniques for the effective analysis of pCTMCs, and produces a close approximation of the Pareto-optimal design set.

Function AnalyseDesign

This function establishes the quality attributes and sensitivity of candidate designs by employing precise parameter synthesis techniques to compute safe enclosures of the satisfaction probability of CSL formulae over pCTMCs.

Given a pCTMC C(P′,q) and a CSL path formula φ these techniques provide a safe under-approximation ∧^q_min and a safe over-approximation ∧^q_max of the minimal and maximal probability that C(P′,q) satisfies φ:

This allows us to safely approximate the bounds {f_i^⊥, f_i^⊤}_{i ∈
I} of the objective functions, and the constraints {c_j}_{j ∈ J}.

Function GetCandidateDesigns

This function uses established multiobjective optimisation genetic algorithms (MOGAs), e.g., NSGA-II or SPEA2, and encodes a candidate design C(P′,q) that satisfies a fixed set of tolerances {γ_k}_{k ∈ K} by the gene tuple (p, q), where p ∈ P is the centre point of the continuous parameter region P′.

Given a population of candidate designs, several GetCandidateDesigns executions evolve the population into 'fitter' designs using MOGA selection, crossover and mutation. The evolution of the design population terminates after a fixed number of design evaluations or when a predetermined number of successive iterations generate populations with no significantly fitter designs.

Software Systems Used to Evaluate RODES

Google File System

The Google File System (GFS) represents the replicated file system used by Google’s search engine. GFS partitions files into chunks of equal size, and stores copies of each chunk on multiple chunk servers. A master server monitors the locations of these copies and the chunk servers, replicating the chunks as needed. During normal operation, GFS stores CMAX copies of each chunk. However, as servers fail and are repaired, the number c of copies for a chunk may vary from 0 to CMAX.

We adapted the CTMC model of the lifecycle of a GFS chunk by considering several continuous and discrete parameters that a designer of the system needs to decide. To evaluate our design synthesis method, we assume that GFS designers must select the hardware failure and repair rates cHardFail and cHardRepair of the chunk servers, and the maximum number of chunks NC stored on a chunk server within the designer-specified ranges. These parameters reflect the fact that designers can choose from a range of physical servers, can select different levels of service offered by a hardware repair workshop, and can decide a maximum workload for chunk servers.

The GFS model is encoded in the PRISM modelling language extended with the evolve constructs.

We formulate a pCTMC synthesis problem for quality requirements given by two maximising objective functions and one constraint:

ID	Description
f₁	φ₁1 = ¬SL1 U[10,60] SL1, where SL1 = M_up ∧ c > 0 holds in states where service level 1 (master up and at least one chunk copy available) is provided
f₂	φ₂ = C≤60, where a reward of 1 is assigned to the states where the number of running chunk servers is at least 0.5M.
c₁	ψ₁ = C≤60 with threshold ~₁ r₁ ≡ '≤ 5', where a transition reward of 1 is assigned to each chunk replication transition.

Workstation Cluster

We extend the CTMC models of a cluster availability management system. This CTMC models a system comprising two sub-clusters, each with N workstations and a switch that connects the workstations to a central backbone. For each component, we consider failure, inspection and repair rates (where repairs are initiated only after an inspection detects failures), and we assume that designers must decide these rates for workstations, i.e., the real-valued parameters wsFail, wsFail and wsRepair for our pCTMC, respectively. Additionally, we assume that designers must select the sub-cluster size N, and must choose between an expensive repair implementation (i.e., pCTMC module) with a 100% success probability and a cheaper repair module with 50% success probability.

The cluster model is encoded in the PRISM modelling language extended with the evolve constructs.

We formulate a pCTMC synthesis problem for quality requirements given by two maximising objective functions and one constraint:

ID	Description
f₁	φ₁ = ¬premium U[20,100] premium, where premium denotes a system service where at least 1.25N workstations are connected and operating
f₂	φ₂ = C≤100, where the reward 1 is assigned only to the states where the number of operation clusters is between 1.2N and 1.6N.
c₁	ψ₁ = C≤100 with threshold ~₁ r₁ ≡ '≤ 80', where transition rewards are associated with repair actions of the workstations, switches and backbone.

Evaluation

We used RODES to generate Pareto-optimal design sets for the Google File system and the cluster management system, with tolerances γ ∈ {0.01, 0.02, 0.05} for the evolvable parameters of the two pCTMCs, and parameters ε₁ = ε₂ = ε ∈ {0, 0.05, 0.1}.

Evaluating RODES on the Google File System

The figure below shows the Pareto fronts obtained using the "lower bound" definition for the objective functions f₁ (x-axis) and f₂ (y-axis) over candidate designs, and parameters ε₁ = ε₂ = ε ∈ {0, 0.05, 0.1}. Boxes represent quality-attribute regions, coloured by sensitivity (red: sensitive, blue: robust). Red-bordered boxes and arrows indicate the sub-optimal robust designs. For each front, we report mean sensitivity and mean volume.

Sensitivity-aware Pareto fronts for the GFS model over objectives f₁ (x-axis) and f₂ (y-axis).

The figure below shows the corresponding synthesised sensitivity-aware Pareto-optimal designs set. Rectangles in x-y plane correspond to the continuous parameter regions (cHWF: hardware failure rate; cHWR: hardware repair rate) induced by the tolerance γ. The box heights show the value of the discrete parameter NC. Boxes are coloured by sensitivity.

Synthesised Pareto-optimal designs for the GFS model (cHWF: hardware failure rate; cHWR: hardware repair rate; NC: number of chunks).

Statistical Analysis

We also analysed the performance of our NSGA-II-based RODES against a tool variant that uses random search (RS). For each tool variant and combination of γ ∈ {0.01, 0.02} and ε ∈ {0, 0.05, 0.1} we carried out 30 independent runs. Then, we compared the Pareto fronts achieved by each variant against this reference front by using the following metrics

M₁ = w x I_{ε_norm}+ (1−w) x sens_norm
M₂ = w x I_{IGD;_norm}+ (1−w) x sens_norm

which use a weight w ∈ [0, 1] to combine normalised versions of the normalised design sensitivity with the following established (but sensitivity-agnostic) Pareto-front quality indicators:

Unary additive epsilon (I_ε), which measures the minimum additive term by which the elements of the objective vectors from a Pareto front approximation must be adjusted so as to dominate the objective vectors from the reference front. Smaller values denote Pareto front approximations with better convergence.
Inverted Generational Distance (I_IGD), which provides an "error measure" as the Euclidean distance in the objective space between the reference front and the Pareto front approximation. Smaller values denote Pareto front approximations with better convergence and diversity.

The figure below compares RODES and RS across our ε, γ combinations using metrics M₁ and M₂ with w = 0.5.

RODES vs. random search (RS) comparison for combinations of γ ∈ {0.01, 0.02} and ε ∈ {0, 0.05, 0.1}, over 30 independent runs. For both metrics M₁(left) and M₂(right) smaller is better.

Evaluating RODES on the Cluster Availability Management System

As for GFS, the figure below shows the Pareto fronts obtained using the "lower bound" definition for the objective functions f1 (x-axis) and f2 (y- axis) over candidate designs, and parameters ε₁ = ε₂ = ε ∈ {0, 0.05, 0.1}. Boxes represent quality-attribute regions, coloured by sensitivity (red: sensitive, blue: robust). Red-bordered boxes and arrows indicate the sub- optimal robust designs. For each front, we report mean sensitivity and mean volume.

Sensitivity-aware Pareto fronts for the cluster model over objectives f₁ (x-axis) and f₂ (y-axis).

The figure below shows the corresponding synthesised sensitivity-aware Pareto-optimal designs set. Rectangles in x-y plane correspond to the continuous parameter regions (cHWF: hardware failure rate; cHWR: hardware repair rate) induced by the tolerance γ. The box heights show the value of the discrete parameter NC. Boxes are coloured by sensitivity.

The "projection" of the quality-attribute regions on the f₁-N plane corresponding to synthesised sensitivity-aware designs. Synthesised Pareto-optimal designs for the GFS model (N: number of workstation in each sub-cluster).

The table below shows the average design sensitivity for two variants of the synthesis problem corresponding to different ranges of possible values for the parameter N, for several combinations of $γ and ε. The results show that sensitivity-aware designs (i.e. where ε > 0) for N ∈ {10..15} have lower sensitivity than for N ∈ {11..14}, validating our observation that more robust designs are obtained for N=10 or N=15.