1. Introduction
Relational properties relate abstract entities that can be viewed as vertices in a graph, connected by edges that define the relations, e.g., the connectivity structure of a social or biological network, or of an object graph on the program heap. Such properties offer various benefits in the development of software systems (Jackson01AlloyAlpha; RumbaughETAL98UML; Spivey92Z). For example, they enable automated analyses of software requirements, designs, specifications, and implementations (10.1007/9783642243721_5). However, in most cases, the properties must be written manually before such analyses can be applied. For complex systems (ZaveChrod2012; Sangal:2005:UDM:1103845.1094824), writing them correctly is challenging, and faults in the properties' statements can lead to erroneous confidence in correctness.
Our motivation is to leverage advances in machine learning (ML) to enhance software analyses, thereby improving software quality and reducing the cost of software failures. Our focus is on training binary classifiers that characterize relational properties. Such classifiers, once trained to have high accuracy, have the potential to offer much value in automated software analysis. For example, an executable classifier can be used as a
runtime check, e.g., in an assertion, to validate that the program states at that point conform to the property represented by the classifier. Moreover, executable checks enable automated test input generation (57624; Boyapati:2002:KAT:566172.566191), static analysis (Garg:2016:LIU:2837614.2837664; Si:2018:LLI:3327757.3327873), error recovery (DBLP:conf/oopsla/DemskyR03; Ke:2015:RPS:2916135.2916260), and automated theorem proving (MouraKADR15).

In this paper, we introduce the MCML approach for empirically studying the learnability of a key class of relational properties that can be written in the well-known software design language Alloy (Jackson01AlloyAlpha). Our aim is not to formalize learnability (Blumer:1989:LVD:76359.76371) of relational properties; rather, we aim to perform controlled experiments and rigorously study various well-known properties over small relations and graphs in order to gain insights into the potential role of ML methods in this important domain. Specifically, we consider training binary classifiers with respect to relational properties such that the trained classifiers accurately represent the properties, e.g., training a decision tree classifier for the directed acyclic graph (DAG) property to accurately classify each input as a DAG or not a DAG.
A key novelty of MCML is that it allows quantifying the performance of trained decision tree models with respect to the entire input space (for a bounded universe of discourse) by utilizing given ground truth formulas, thereby enabling an evaluation of learnability that is not feasible using the commonly used ML approaches based on training and test datasets alone. Likewise, MCML allows quantifying differences among trained decision tree models for the entire (bounded) input space.
To quantify performance of and differences among trained models – irrespective of the datasets – MCML reduces the quantification problems to the classic complexity theory problem of model counting, which is to compute the number of solutions to a logical formula (Gomes08modelcounting). The formulas in our case represent semantic differences between the ground truth and the trained model, or between two trained models. Given ground truth φ and decision tree D, the false negative count for D is the model count of "φ ∧ D_0", where D_0 is D's logic that leads to the output "0": any solution to "φ ∧ D_0" conforms to the ground truth but leads the decision tree to output "0". Likewise, the false positive count for D is the model count of "¬φ ∧ D_1", where D_1 is D's logic that leads to the output "1". The true positive and true negative counts are defined similarly. From these counts, accuracy, precision, recall, and F1 score can be derived. Furthermore, given two decision trees D and D', their semantic difference, i.e., the number of inputs for which the tree outputs differ, is the sum of the model count of "D_0 ∧ D'_1" and the model count of "D_1 ∧ D'_0", where D_0 (D'_0) is D's (D''s) logic that outputs "0", and D_1 (D'_1) is the logic that outputs "1".
Model counting generalizes the propositional satisfiability checking (SAT) problem, and is #P-complete (Sahni:1976:PAP:321958.321975). Despite the theoretical complexity, recent technological and algorithmic advances have led to practical tools that can handle very large state spaces (4268160; 10.1007/3540460020_19). To embody MCML and quantify the performance and differences by means of model counting, we employ two state-of-the-art model counters: ApproxMC (SM19), which uses approximation techniques to estimate the number of solutions with high precision and provides formal guarantees on the counts; and ProjMC (ProjMC), which uses an effective disjunctive decomposition scheme to compute the exact number of solutions. As is common with model counters for propositional logic, ApproxMC and ProjMC take as input propositional formulas in conjunctive normal form (CNF) – the standard input format for SAT solvers. To create model counting problems in CNF, we define a translation from decision trees to CNF in the spirit of previous work (article1987). The translation creates succinct CNF formulas that are linear in the size of the input trees.

To train ML models, we use the standard practice of employing training datasets. To evaluate trained ML models, we use both the standard practice of employing test datasets and the MCML metrics. Our study has three key distinguishing characteristics: 1) for each property, we use bounded exhaustive sets of positive samples which contain every positive sample (up to partial symmetry breaking) within a bounded universe of discourse; 2) we leverage ground truth formulas to quantify the performance of trained decision tree models with respect to the entire input space, not only the given training/test datasets; and 3) we quantify semantic differences among different trained decision tree models with respect to the entire input space without the use of ground truth formulas. Moreover, we evaluate different strategies for splitting the datasets into training sets and test sets, including ratios where the amount of training data is much smaller than the amount of test data.
To create the datasets for learning, we rely on logical formulas that describe the relational properties in Alloy, which is a first-order language with transitive closure (Jackson01AlloyAlpha). For each property, we use the Alloy toolset's SAT-based backend to enumerate all the solutions, i.e., valuations that exhibit the property, up to Alloy's default symmetry breaking,
which heuristically removes many but not all isomorphic solutions. The solutions created by the Alloy analyzer serve as the samples for training and test/evaluation. The solution spaces are very large – even with small bounds on the number of entities in the relations. For nontrivial properties, the number of positive samples is far smaller than the number of negative samples, and exhaustive enumeration of all negative samples is intractable. To avoid erroneously biasing the ML models to simply predict false if the datasets overwhelmingly consist of negative samples, we create
balanced sets that contain the same number of positive and negative samples (doi:10.1162/evco.2009.17.3.275).

As subjects, we use six ML models, including decision trees, SVMs, and multi-layer perceptrons, and train them using datasets from 16 relational properties over small relations and graphs. We use the adjacency matrix representation for each data item in the training and test datasets; for example, for a relation over 7 entities, i.e., a graph with 7 vertices, we use 49 Boolean inputs for the binary classifier, which outputs true or false as the predicted value, and the space of all possible inputs has 2^49 elements.

The results show that relatively simple ML models can achieve surprisingly high performance (accuracy and F1 score) at learning relational properties when evaluated in the common setting of using training and test datasets – even when the training dataset is substantially smaller than the test dataset – indicating the seeming simplicity of learning these properties. However, the use of MCML metrics based on model counting shows that the performance can degrade substantially when tested against the whole (bounded) input space, indicating the high complexity of precisely learning these properties, and the usefulness of model counting in quantifying the true performance.
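To make the input representation concrete, the following minimal sketch (the function name is ours, not part of the paper's tooling) flattens an adjacency matrix into the bit vector that a binary classifier consumes:

```python
def adjacency_to_features(matrix):
    """Flatten an n x n adjacency matrix (row-major) into the n*n-bit
    feature vector used as input to a binary classifier."""
    return [bit for row in matrix for bit in row]

# 3-vertex example: a directed edge 0 -> 1 and a self-loop at vertex 2.
m = [[0, 1, 0],
     [0, 0, 0],
     [0, 0, 1]]
features = adjacency_to_features(m)
print(features)            # [0, 1, 0, 0, 0, 0, 0, 0, 1]
print(2 ** len(features))  # 512 candidate 3-vertex graphs; 2**49 for 7 vertices
```

For 7 vertices the same flattening yields the 49-bit vectors used in the study.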
This paper makes the following contributions:

Learning relational properties. We present a systematic study of learning 16 relational properties using 6 off-the-shelf machine learning models;

Model counting to quantify performance. We reduce the problem of evaluating the performance of trained decision tree models over the entire input space with respect to ground truth formulas to the problem of model counting, and employ cutting-edge approximate and exact counters to embody the reduction; and

Model counting to quantify semantic differences. We also introduce the use of model counting to quantify semantic differences between trained decision tree models over the entire input space – without the need for ground truth formulas or evaluation datasets.
We believe the use of model counting in learning is a promising research area that can help gain deeper insights into the trained models, which can inform practical decisions on how to best utilize the models. For example, if a trained model in a deployed system is to be upgraded to a more sophisticated model, model counting could be a metric that in part informs the decision to upgrade.
2. Related Work
To our knowledge, MCML is the first work to introduce the use of model counting to quantify performance of trained decision tree models with respect to ground truth formulas and to quantify semantic differences among different decision trees. In this section, we discuss the related work in five broad categories: 1) model counting applications in machine learning; 2) verification and testing of machine learning models; 3) analyzing learnability; 4) learning program properties; and 5) Alloy.
2.1. Model Counting Applications in ML
The use of model counting in machine learning has focused largely on probabilistic reasoning (Chavira2008OnPI; FierensETAL2012; GensDomingos2013; LiangETAL2017UAI). Recent work by Baluta (Baluta2019quantitative) introduced model counting for quantifying differences between binarized neural networks (hubara2016binarized) that admit a translation to SAT/CNF (NarodytskaKRSW18). This translation enables our MCML metrics to generalize beyond decision trees and become applicable to quantify the performance of binarized neural networks with respect to the entire input space. The MCML metrics also directly generalize to other techniques that use different solvers for model counting, e.g., techniques based on ODDs and OBDDs (ChanDarwiche2003; ShihETAL2018).

2.2. Verification and Testing of ML Models
Verification and testing of machine learning models is an active area of research, including work on novel decision procedures such as Reluplex (HuangKWW17; KaBaDiJuKo17Reluplex), which has been optimized for the analysis of neural networks with ReLU activation functions; testing trained models (PeiETAL17; TianETAL18DeepTest); applying symbolic execution (SunETAL2018ASE; GopinathETAL2018Arxiv); and inferring verifiable policies by mimicking deep reinforcement learning agents (BastaniETAL18VerifiableRL; VasicETAL19MOET). A key difference between MCML and previous work on verification of properties of trained models is that previous work focuses on either verifying properties of one trained model, or checking equivalence or implication between two models. In contrast, MCML metrics apply in a more general setting, even when two models are neither equivalent nor such that one implies the other.

For the translation from decision trees to CNF formulas, Håstad (article1987) introduced the idea that a decision tree can be represented as a Disjunctive Normal Form (DNF) formula. He also showed that these DNF formulas are convertible to Conjunctive Normal Form (CNF) formulas. MCML leverages this work and uses CNF formulas of trained decision trees to create the quantification problems. While the formulas MCML creates are, in the general case, optimal in terms of the size of the CNF formula with respect to the given input tree, an alternative approach is to use rewriting, also known as compilation (ChanDarwiche2003; ShihETAL2018), to create smaller CNF formulas in some specific cases. However, rewriting itself has a cost, which can be substantial for non-trivial formulas, e.g., with hundreds of variables (as for our subjects). Moreover, when compilation transforms a decision tree into a simpler decision tree, MCML works in tandem with compilation: first apply rewriting to reduce the tree, and then use MCML's translation to create a reduced CNF formula.
2.3. Analyzing Learnability
Analyzing learnability of a machine learning model is of fundamental importance for understanding the model's ability to generalize. Frameworks that enable such analysis include Probably Approximately Correct (PAC) learning (Valiant:1984:TL:1968.1972), Vapnik-Chervonenkis (VC) theory (article1971a), and the general learning setting (Vapnik:1995:NSL:211359; ShalevShwartz:2010:LSU:1756006.1953019). These frameworks provide formal analytical characterizations of the number of examples needed to train models for binary classification tasks, and may provide useful intuition and guidance about the design of algorithms for learning. Blumer (Blumer:1989:LVD:76359.76371) showed that finiteness of the Vapnik-Chervonenkis (VC) dimension is a basic requirement for distribution-free learning. Using VC dimensions, he analyzed properties in terms of closure and complexity and provided a detailed set of conditions for learnability. In contrast, our study provides insights based on empirical evidence from controlled experiments using various relational properties that are common in software systems.

2.4. Learning Program Properties
In the context of learning properties of code, specifically of dynamic data structures in Java programs, two recent projects (MolinaETAL2019ICSE; UsmanETAL2019SPIN) used a variety of machine learning models to show the effectiveness of off-the-shelf models. However, both projects used the traditional metrics with training and test datasets, and did not evaluate the performance of the trained models with respect to the entire (bounded) input spaces. More broadly, machine learning has enabled automated detection of program errors and repair of program faults (1317470; LongRinardPOPL2016).
2.5. Alloy
Alloy has been used in several projects: for design and modeling of software systems (JacksonS00; KhurshidJackson00ExploringDesignIntentional; ZaveChordTSE; BagheriETAL2018; WickersonETAL2017; ChongETAL2018); for software analyses, including deep static checking (JacksonVaziri00Bugs; TACOGaleottiETALTSE2013), systematic testing (MarinovKhurshid01TestEra), data structure repair (SamimiETALECOOP2010; ZaeemKhurshidECOOP2010), and automated debugging (GopinathETALTACAS2011); for analysis of hardware systems (CheckMateMICRO2018; CheckMateMicro2019; CheckMateGitHub); and testing and studying model counters (TestMC2019; StudySymmetry). MCML introduces the use of Alloy for creating training and test data for machine learning models and leverages Alloy’s backend for creating CNF formulas that represent the ground truth.
3. Example
Figure 1 shows an Alloy specification that declares a set (sig) S of atoms, a binary relation over S, and four predicates (i.e., formulas) that specify reflexive, symmetric, transitive, and equivalence relations. The keyword "all" denotes universal quantification, "in" denotes subset, "and" denotes conjunction, and "implies" denotes implication. The operator '->' is Cartesian product; for scalars a and b, a->b is the pair (a, b). The command "E4: run Equivalence for exactly 4 S" instructs the Alloy analyzer to solve the formula(s) that define an equivalence relation with respect to a scope, i.e., bound, of exactly 4 atoms in set S. The analyzer uses the bound to translate the Alloy specification to a propositional satisfiability formula, and uses off-the-shelf SAT solvers to solve it. The Alloy analyzer supports incremental solvers that can enumerate all solutions.
Executing the command E4 and enumerating all solutions creates the 5 solutions illustrated in Figure 2. Note that the solutions are pairwise non-isomorphic. The Alloy analyzer adds symmetry breaking predicates during the translation, which break several (but, in general, not all) symmetries (Shlyakhter01EffectiveSymmetryBreaking). The space of all candidate solutions for the command E4 has size 2^16 = 65,536, since each candidate is a 4x4 matrix of boolean variables. Note how quickly the state space grows as the number of vertices increases, e.g., for just 7 vertices, the state space has size 2^49, which is greater than 10^14.
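For intuition at this tiny scope, the counts can be checked by brute force. The following sketch (ours, independent of the Alloy toolchain) enumerates all 2^16 candidate 4x4 adjacency matrices and applies the standard definition of an equivalence relation; the 15 satisfying matrices fall into 5 isomorphism classes, matching the 5 solutions of Figure 2:

```python
from itertools import product

def is_equivalence(rel, n):
    """Check reflexivity, symmetry, and transitivity of a relation
    given as an n x n boolean adjacency matrix."""
    if not all(rel[i][i] for i in range(n)):
        return False
    for i in range(n):
        for j in range(n):
            if rel[i][j] and not rel[j][i]:
                return False
            for k in range(n):
                if rel[i][j] and rel[j][k] and not rel[i][k]:
                    return False
    return True

n = 4
total = count = 0
for bits in product([0, 1], repeat=n * n):
    total += 1
    rel = [list(bits[i * n:(i + 1) * n]) for i in range(n)]
    if is_equivalence(rel, n):
        count += 1

print(total, count)  # 65536 15
```

The 15 equivalence relations on a 4-element set correspond to the set partitions of 4 atoms; grouping them up to isomorphism leaves the 5 partition shapes.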
ApproxMC. To illustrate the use of ApproxMC, consider estimating the number of solutions for the Equivalence predicate with scope 20. The Alloy command "E20: run Equivalence for exactly 20 S" defines the constraint solving problem. The Alloy analyzer translates this problem to a CNF formula that has 18,666 variables (of which 400 are primary variables) and 27,202 clauses. ApproxMC solves this CNF formula in 17.8 seconds and reports an approximate model count of 11,264. The exact model count, which we calculate using the Alloy analyzer (which uses an enumerating SAT solver), is 10,946, i.e., the ApproxMC count is within 3% of the exact count.
ProjMC. To illustrate the use of ProjMC, consider computing the exact model count of the Equivalence predicate for scope 20. Given the CNF formula for the Alloy command E20, ProjMC reports the exact model count of 10,946 in 351.1 seconds. The count reported by ProjMC is, as expected, exactly the number of solutions we get using the Alloy analyzer.
4. MCML Approach
Our approach, called MCML, introduces the use of model counting for quantifying performance of decision tree models in machine learning (ML) and is embodied by two techniques: 1) Acc, which quantifies the accuracy of trained ML models with respect to ground truth formulas; and 2) Diff, which quantifies the semantic differences between two trained models. Both Acc and Diff compute the results with respect to the entire input space and do not require any datasets.
We view a decision tree as a set of paths; any input follows exactly one path, and each path is a conjunction of branch conditions such that each condition contains one input variable (doi:10.1080/00224065.1981.11978748). Since our focus is on relational properties over graphs that are represented using adjacency matrices, the input variables for the decision tree are all binary, and so is the output. Moreover, each branch condition on any path is simply of the form "v = 0", i.e., input v is "0" (false), or "v = 1", i.e., input v is "1" (true). Therefore, each branch condition is simply a literal, i.e., a variable or its negation. Figure 3 illustrates a decision tree with 2 inputs and 4 paths.
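To make the path view concrete, the following sketch (our own notation, not the paper's tooling) represents a decision tree over binary inputs as nested dicts and enumerates its path conditions:

```python
# A decision tree over binary features, as nested dicts:
# internal node: {"var": i, "lo": subtree_for_0, "hi": subtree_for_1}
# leaf:          {"label": 0 or 1}

def path_conditions(tree, lits=None):
    """Return the tree's paths as (literals, label) pairs, where literals
    is a list of (variable, value) pairs along the path."""
    if lits is None:
        lits = []
    if "label" in tree:
        return [(lits, tree["label"])]
    v = tree["var"]
    return (path_conditions(tree["lo"], lits + [(v, 0)]) +
            path_conditions(tree["hi"], lits + [(v, 1)]))

# A full tree over variables 0 and 1 computing XOR: 2 inputs, 4 paths.
xor_tree = {"var": 0,
            "lo": {"var": 1, "lo": {"label": 0}, "hi": {"label": 1}},
            "hi": {"var": 1, "lo": {"label": 1}, "hi": {"label": 0}}}

print(path_conditions(xor_tree))
# e.g. ([(0, 0), (1, 0)], 0) is the path "x0 = 0, x1 = 0 -> label 0"
```

Every assignment to the inputs satisfies the path condition of exactly one path, which is what makes the model-counting reduction below well defined.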
Acc: Quantifying model accuracy.
Assume property φ (ground truth) is learned by a decision tree D that is a binary classifier with n inputs, so the input space has size 2^n. Assume D has k unique paths (t_1, ..., t_k) that predict label true, and l unique paths (f_1, ..., f_l) that predict label false. For a path p, let pc(p) be the conjunction of branch conditions along p; we refer to pc(p) as the path condition for p (pathconditions).

We define the following four metrics based on model counting to generalize (for the entire input space) the traditional metrics of true positives (TP), false positives (FP), true negatives (TN), and false negatives (FN), where mc(ψ, b) represents the model count for the formula ψ with respect to bound b:

(1) TP = mc(φ ∧ (pc(t_1) ∨ ... ∨ pc(t_k)), b)
(2) FP = mc(¬φ ∧ (pc(t_1) ∨ ... ∨ pc(t_k)), b)
(3) TN = mc(¬φ ∧ (pc(f_1) ∨ ... ∨ pc(f_l)), b)
(4) FN = mc(φ ∧ (pc(f_1) ∨ ... ∨ pc(f_l)), b)
The formula pc(t_1) ∨ ... ∨ pc(t_k) characterizes the inputs that the decision tree classifies as true; it is a disjunction of the path conditions for the paths that predict true, and any input that is classified as true must be a solution to exactly one of these paths. For example, for the decision tree depicted in Figure 3, the path condition formula for label true is the disjunction of the path conditions of the paths that end in the label true. Likewise, the formula pc(f_1) ∨ ... ∨ pc(f_l) characterizes the inputs that the tree classifies as false.
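For tiny input spaces, the four counts can be computed by exhaustive enumeration instead of a model counter. The following sketch (ours; the actual MCML pipeline uses CNF encodings and ApproxMC/ProjMC, which scale far beyond this) illustrates what the metrics measure:

```python
from itertools import product

def mc_metrics(ground_truth, tree_predict, n):
    """Compute TP/FP/TN/FN over the ENTIRE space of 2^n binary inputs by
    exhaustive enumeration -- a brute-force stand-in for model counting
    on the formulas (1)-(4)."""
    tp = fp = tn = fn = 0
    for x in product([0, 1], repeat=n):
        truth, pred = ground_truth(x), tree_predict(x)
        if pred == 1:
            tp += (truth == 1)
            fp += (truth == 0)
        else:
            tn += (truth == 0)
            fn += (truth == 1)
    return tp, fp, tn, fn

# Toy example: ground truth is "x0 and x1"; a (deliberately wrong) model
# predicts "x0 or x1".
tp, fp, tn, fn = mc_metrics(lambda x: int(x[0] and x[1]),
                            lambda x: int(x[0] or x[1]), n=2)
print(tp, fp, tn, fn)  # 1 2 1 0
```

Enumeration is exponential in n; replacing it with a model counter over the CNF encodings is exactly what makes the metrics tractable for the state spaces in the study.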
While a direct application of our metrics requires the ground truth formula φ that characterizes the property of interest, the metrics are also applicable when φ is not known. Specifically, the metrics naturally generalize to allow quantifying differences between two trained models, again without requiring any datasets for quantification.
Diff: Quantifying differences in models.
Let D and D' be decision trees that are trained as binary classifiers (using the same or different datasets) with n inputs. Let D have k paths (t_1, ..., t_k) that predict true and l paths (f_1, ..., f_l) that predict false. Moreover, let D' have k' paths (t'_1, ..., t'_k') that predict true and l' paths (f'_1, ..., f'_l') that predict false. The following four metrics quantify the numbers of inputs (in the entire input space) for which D and D' make the same decision – tt, i.e., both predict true, or ff, i.e., both predict false – and the numbers of inputs for which the decisions differ – tf, i.e., D predicts true but D' predicts false, and ft, i.e., D predicts false but D' predicts true:

(5) tt = mc((pc(t_1) ∨ ... ∨ pc(t_k)) ∧ (pc(t'_1) ∨ ... ∨ pc(t'_k')), b)
(6) ff = mc((pc(f_1) ∨ ... ∨ pc(f_l)) ∧ (pc(f'_1) ∨ ... ∨ pc(f'_l')), b)
(7) tf = mc((pc(t_1) ∨ ... ∨ pc(t_k)) ∧ (pc(f'_1) ∨ ... ∨ pc(f'_l')), b)
(8) ft = mc((pc(f_1) ∨ ... ∨ pc(f_l)) ∧ (pc(t'_1) ∨ ... ∨ pc(t'_k')), b)
The semantic difference (diff) between D and D' is quantified as the ratio of the number of inputs for which the decisions differ between D and D' to the total number of inputs; and the semantic similarity (sim) between D and D' is quantified as the ratio of the number of inputs for which the predicted labels match between D and D' to the total number of inputs:

(9) diff(D, D') = (tf + ft) / 2^n
(10) sim(D, D') = (tt + ff) / 2^n
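As with Acc, the Diff metrics can be illustrated by exhaustive enumeration for tiny n (ours is a brute-force stand-in; MCML computes the same quantities via model counting on CNF encodings):

```python
from itertools import product

def diff_sim(model1, model2, n):
    """Quantify the semantic difference and similarity of two binary
    classifiers over the entire space of 2^n inputs, following the
    definitions of tt, ff, tf, ft, diff, and sim above."""
    tt = ff = tf = ft = 0
    for x in product([0, 1], repeat=n):
        a, b = model1(x), model2(x)
        tt += (a == 1 and b == 1)
        ff += (a == 0 and b == 0)
        tf += (a == 1 and b == 0)
        ft += (a == 0 and b == 1)
    total = 2 ** n
    return (tf + ft) / total, (tt + ff) / total  # diff, sim

d, s = diff_sim(lambda x: int(x[0] and x[1]),
                lambda x: int(x[0]), n=2)
print(d, s)  # 0.25 0.75 -- the models disagree only on input (1, 0)
```

Note that no ground truth appears anywhere: the comparison is purely between the two models, which is what makes Diff applicable when no specification is available.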
Framework. We embody the metrics in a prototype framework that translates decision trees to CNF and leverages off-the-shelf SAT-based technology to solve the quantification problems. Figure 4 illustrates the key steps to quantify the accuracy of a trained decision tree model D with respect to the ground truth φ written in Alloy. φ is translated by the Alloy analyzer with respect to bound b (e.g., 20 vertices) into a CNF formula. The module MCPGen takes as input this CNF formula and D, translates the relevant parts of D with respect to the desired metric (TP, FP, TN, or FN) into CNF using the submodule Tree2CNF, and outputs the CNF formula that defines the model counting problem. This CNF formula is input to the model counter (e.g., ApproxMC or ProjMC), which outputs the approximate or exact model count, respectively. Figure 4 also illustrates the key steps to quantifying the semantic differences between two trained decision tree models D and D'.
Translating decision tree logic to CNF. The goal of the submodule Tree2CNF is to translate a formula of the form pc_1 ∨ ... ∨ pc_m, which represents either all the paths that predict label true or all the paths that predict label false, to CNF. The formula is originally in disjunctive normal form and can be translated to CNF using various techniques. One standard technique is to apply propositional equivalences and De Morgan's laws (oldpaper); however, this technique can lead to a blowup in the size of the formula, which can negatively impact the backend solver's performance. Another standard technique is to apply the Tseitin transformation (Tseytin1966), which creates formulas linear in the size of the input; however, this technique uses auxiliary variables and creates an equisatisfiable (but not necessarily equivalent) formula, which can have a different model count than the original formula.
Our translation uses the following observation (article1987), which allows a translation that does not cause a blowup in the formula size and preserves the model counts: any input that does not get classified as true gets classified as false, and vice versa. Therefore, if each pc_i is a path condition for a path that leads to label true and pc_1 ∨ ... ∨ pc_m represents the decision tree logic that predicts true, the negation ¬(pc_1 ∨ ... ∨ pc_m) characterizes the decision tree logic that predicts false. This formula immediately simplifies to CNF because ¬(pc_1 ∨ ... ∨ pc_m) = ¬pc_1 ∧ ... ∧ ¬pc_m, and each pc_i is a conjunction of literals; therefore, each ¬pc_i is a disjunction of literals. For example, for the decision tree in Figure 3, the path condition formula for label false is the negation of the path condition formula for label true, which De Morgan's laws translate directly into a CNF formula with one clause per true-path.
Analysis. Our translation creates a compact formula directly in CNF without introducing any auxiliary variables – both for output label true and for label false. For a decision tree with m leaf nodes, the number of path conditions (PCs) is m, i.e., linear in the tree size. If there are n features, the size of the CNF formula is O(m * n): each PC has at most n conditions, each condition is a literal, and the formula is a conjunction of the negations of the PCs. Thus, in terms of the size of the formula for the backend solver, the translation is an optimal choice for CNF with no auxiliary variables in the general case. Our tool embodiment of MCML supports two state-of-the-art backends: 1) the exact model counter ProjMC (ProjMC); and 2) the approximate model counter ApproxMC (SM19).
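The negation step can be sketched in a few lines (ours, with DIMACS-style signed literals; the real Tree2CNF operates on the framework's internal tree representation). Negating each true-path's conjunction yields exactly one clause, so the CNF for label false has one clause per true-path and no auxiliary variables:

```python
def tree2cnf_false(true_paths):
    """Translate the 'label false' logic of a decision tree to CNF.
    true_paths: list of path conditions, each a list of (var, value)
    literals for a path predicting true. Negating their disjunction
    gives one clause per path (De Morgan). Clauses use DIMACS-style
    signed literals: +v means variable v is true, -v means it is false
    (variables numbered from 1)."""
    cnf = []
    for path in true_paths:
        # not(l1 and ... and lk) == (not l1) or ... or (not lk)
        clause = [-(v + 1) if val == 1 else (v + 1) for v, val in path]
        cnf.append(clause)
    return cnf

# True-paths of a tree computing XOR over variables 0 and 1:
true_paths = [[(0, 0), (1, 1)], [(0, 1), (1, 0)]]
print(tree2cnf_false(true_paths))  # [[1, -2], [-1, 2]]
```

The result, (x1 ∨ ¬x2) ∧ (¬x1 ∨ x2), is exactly x1 ↔ x2, i.e., the negation of XOR, and has the same model count as the original tree logic for label false.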
5. Study
Table 1. Solution counts for the 16 properties. A dash denotes a count that is not available.

Property | Scope | State Space | Valid-SymBr (Alloy) | Est-Valid-SymBr (ApproxMC) | Est-Valid-NoSymBr (ApproxMC) | Valid-SymBr (ProjMC) | Valid-NoSymBr (ProjMC)
Antisymmetric | 5 | 2^25 | 56723 | 55296 | 1998848 | 56723 | 1889568
Bijective | 14 | 2^196 | 25043 | 25088 | – | 25043 | –
Connex | 6 | 2^36 | 148884 | 147456 | 14680064 | 148884 | 14348907
Equivalence | 20 | 2^400 | 10946 | 11264 | – | 10946 | –
Function | 8 | 2^64 | 16531 | 16640 | 17563648 | 16531 | 16777216
Functional | 8 | 2^64 | 35017 | 35328 | 48234496 | 35017 | 43046721
Injective | 8 | 2^64 | 16531 | 16640 | 17563648 | 16531 | 16777216
Irreflexive | 5 | 2^25 | 35886 | 36352 | 2686976 | 35886 | 1048576
NonStrictOrder | 7 | 2^49 | 26387 | 26112 | 6422528 | 26387 | 6129859
PartialOrder | 6 | 2^36 | 82359 | 88064 | 8126464 | 82359 | 8321472
PreOrder | 7 | 2^49 | 43651 | 43008 | 9175040 | 43651 | 9535241
Reflexive | 5 | 2^25 | 35886 | 35840 | 1048576 | 35886 | 1048576
StrictOrder | 7 | 2^49 | 26387 | 29184 | 6815744 | 26387 | 6129859
Surjective | 14 | 2^196 | 25043 | 25088 | – | 25043 | –
TotalOrder | 13 | 2^169 | 15511 | 14848 | 5502926848 | 15511 | –
Transitive | 6 | 2^36 | 95564 | 102400 | 9306112 | 95564 | 9415189
This section describes our study methodology and summarizes the evaluation results for the 16 relational properties (Table 1) and the 6 machine learning models studied: decision trees (DT), random forest decision trees (RFT), Adaboost decision trees (ABT), gradient boosted decision trees (GBDT), support vector machines (SVM), and multi-layer perceptrons (MLP). The properties include: antisymmetric, bijective, connex, equivalence, function, functional, injective, irreflexive, non-strict order, partial order, pre-order, reflexive, strict order, surjective, total order, and transitive.

Generation of positive and negative samples. For each property, we used the Alloy toolset to generate two datasets: one with Alloy's default symmetry breaking and one without it. For each dataset, we generated positive samples, which satisfy the specified property, and negative samples, which negate it. Each data sample consists of a feature vector and a binary label, which is 1 for the positive class and 0 for the negative class. The features linearly represent the adjacency matrix. For example, if the scope of a property is 7, a 49-bit feature vector is used. We use all positive samples generated for that scope. Since the number of negative samples is much larger than the number of positive samples due to the complex nature of the properties, it is not feasible to exhaustively enumerate the negative samples. We create negative samples by creating a bit vector at random and checking it against the property using the Alloy evaluator to confirm that it is a negative sample. We create the same number of negative samples as positive samples for each dataset. For the dataset with symmetry breaking, we pick the smallest scope such that Alloy generates a sufficient number of positive samples using the default symmetry breaking. We generate the other dataset in the same way, except that we select the scope such that Alloy generates a sufficient number of positive samples without the default symmetry breaking.
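The balancing scheme can be sketched as follows (our simplification: a Python predicate stands in for the Alloy evaluator, and a toy reflexivity property over 2x2 matrices stands in for the study's properties):

```python
import random
from itertools import product

def make_balanced_dataset(positives, property_check, n, seed=0):
    """Pair the bounded-exhaustive positive samples with an equal number
    of randomly drawn negatives. Each sample is an n-bit tuple;
    property_check plays the role of the Alloy evaluator."""
    rng = random.Random(seed)
    negatives = set()
    while len(negatives) < len(positives):
        x = tuple(rng.randint(0, 1) for _ in range(n))
        if not property_check(x):   # confirmed negative sample
            negatives.add(x)
    data = [(x, 1) for x in positives] + [(x, 0) for x in negatives]
    rng.shuffle(data)
    return data

# Toy property over a 2x2 adjacency matrix (n = 4 bits): reflexivity,
# i.e., both diagonal bits (indices 0 and 3) are 1.
reflexive = lambda x: x[0] == 1 and x[3] == 1
positives = [x for x in product([0, 1], repeat=4) if reflexive(x)]
data = make_balanced_dataset(positives, reflexive, n=4)
print(len(positives), len(data))  # 4 8
```

Random rejection sampling works here because, for non-trivial properties, a random bit vector is overwhelmingly likely to be a negative sample.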
Table 1 tabulates, for each property, the scope (i.e., the number of atoms in set S), the size of the input space, the number of positive samples created by Alloy, the count of positive samples estimated by ApproxMC, the count of positive samples estimated by ApproxMC when symmetry breaking is turned off, the count of positive samples given by ProjMC, and the count of positive samples given by ProjMC when symmetry breaking is turned off.
Training models. A key decision in evaluating ML models is the selection of the training:test ratio. We use 5 different training:test ratios, specifically 75:25, 50:50, 25:75, 10:90, and 1:99, to evaluate a wide range of settings. There is no overlap between the training and test datasets. We performed the experiments using basic out-of-the-box models. We explored tuning the hyperparameters, but the results improved only marginally, and the small increase in accuracy was offset by a much larger increase in time spent on tuning. We therefore report the results obtained using the basic models without hyperparameter tuning. We used the Python programming language and the scikit-learn library to implement the machine learning models. We used a timeout of 5,000 seconds, which is common in the field of model counting.
Performance metrics. We used four standard metrics to determine the quality of classification results: accuracy, i.e., (TP + TN) / (TP + TN + FP + FN); precision, i.e., TP / (TP + FP); recall, i.e., TP / (TP + FN); and F1 score, i.e., 2 * (precision * recall) / (precision + recall).
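These are the standard textbook definitions; a minimal helper makes them explicit (the guards for empty denominators are our convention, not the paper's):

```python
def classification_metrics(tp, fp, tn, fn):
    """Accuracy, precision, recall, and F1 score from confusion counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return accuracy, precision, recall, f1

print(classification_metrics(tp=90, fp=10, tn=80, fn=20))
# accuracy 0.85, precision 0.9, recall ~0.818, F1 ~0.857
```

The same formulas apply whether the confusion counts come from a test set or, as in MCML, from model counts over the entire bounded input space.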
Due to space limitations, we show results for select properties and focus on ApproxMC. The supplementary material includes the complete results for all properties for ApproxMC and ProjMC with respect to all ratios, and also includes metadata such as the number of primary variables, the total number of variables, and the total number of clauses for each property. The Alloy specifications of the properties and the detailed results are also available in the supplementary material. All experiments were performed on Ubuntu 16.04 with an Intel Core i7-8750H CPU (2.20 GHz) and 16GB RAM. The training and test/evaluation datasets used in our study, together with detailed results, are publicly available at: https://github.com/muhammadusman93/mcmlstudy.
Research questions. We answer the following research questions in our empirical study.


RQ1. How effective are ML models in learning relational properties?

RQ2. How well do ML (decision tree) models generalize (outside of the test set)?

RQ3. How do symmetries in the dataset affect the performance of the ML models?

RQ4. How does a mismatch in the symmetries present in the training and evaluation sets affect the performance?

RQ5. What is the quantitative difference between two decision tree models?
5.1. Answers to the Research Questions
This section presents the results of the experimental evaluation and answers the research questions.
5.1.1. RQ1: How Effective are ML Models in Learning Relational Properties?
Ratio  Model  Accuracy  Precision  Recall  F1-Score 

75:25  DT  0.9996  0.9992  1.0000  0.9996 
RFT  0.9999  0.9998  1.0000  0.9999  
GBDT  0.9951  0.9905  0.9997  0.9951  
ABT  0.9412  0.9396  0.9425  0.9411  
SVM  0.9993  0.9985  1.0000  0.9993  
MLP  0.9996  0.9993  1.0000  0.9996  
25:75  DT  0.9983  0.9968  0.9999  0.9983 
RFT  0.9996  0.9993  1.0000  0.9996  
GBDT  0.9952  0.9906  0.9998  0.9952  
ABT  0.9419  0.9401  0.9441  0.9421  
SVM  0.9987  0.9973  1.0000  0.9987  
MLP  0.9991  0.9982  1.0000  0.9991  
1:99  DT  0.9798  0.9656  0.9949  0.9801 
RFT  0.9931  0.9937  0.9926  0.9931  
GBDT  0.9904  0.9846  0.9964  0.9904  
ABT  0.9387  0.9351  0.9428  0.9389  
SVM  0.9769  0.9767  0.9772  0.9770  
MLP  0.9909  0.9867  0.9953  0.9910 
Property  Accuracy (Test)  Precision (Test)  Recall (Test)  F1 (Test)  Accuracy (MCML)  Precision (MCML)  Recall (MCML)  F1 (MCML)  Time[s] 
Antisymmetric  0.9913  0.9871  0.9957  0.9914  0.9690  0.4576  0.9960  0.6271  0.9 
Bijective  0.9971  0.9942  1.0000  0.9971  0.9729  0.0000  1.0000  0.0000  219.3 
Connex  0.9996  0.9993  1.0000  0.9996  0.9992  0.0698  1.0000  0.1304  1.3 
Equivalence  0.9958  0.9920  0.9997  0.9958      0.9997     
Function  0.9954  0.9928  0.9980  0.9954  0.9924  0.0000  0.9982  0.0000  6.0 
Functional  0.9991  0.9983  0.9998  0.9991  0.9943  0.0000  0.9998  0.0000  6.3 
Injective  0.9979  0.9961  0.9998  0.9980  0.8889  0.0000  0.9998  0.0000  6.1 
Irreflexive  1.0000  1.0000  1.0000  1.0000  1.0000  1.0000  1.0000  1.0000  0.4 
NonStrictOrder  0.9994  0.9989  1.0000  0.9994  0.9944  0.0000  1.0000  0.0000  3.1 
PartialOrder  0.9963  0.9936  0.9990  0.9963  0.9675  0.0059  0.9991  0.0116  2.3 
PreOrder  0.9992  0.9985  0.9999  0.9992  0.9909  0.0000  0.9999  0.0000  3.5 
Reflexive  1.0000  1.0000  1.0000  1.0000  1.0000  1.0000  1.0000  1.0000  0.4 
StrictOrder  0.9991  0.9982  1.0000  0.9991  0.9915  0.0000  1.0000  0.0000  3.0
Surjective  0.9980  0.9961  1.0000  0.9980  0.9993  0.0000  1.0000  0.0000  208.1 
TotalOrder  0.9994  0.9988  1.0000  0.9994  0.9983  0.0000  1.0000  0.0000  95.7 
Transitive  0.9949  0.9910  0.9989  0.9949  0.9866  0.0030  0.9990  0.0059  2.1 
Table 4. Performance of ML models for the PartialOrder property (dataset generated without symmetry breaking), across training:test split ratios.
Ratio  Model  Accuracy  Precision  Recall  F1 Score

75:25  DT  0.9985  0.9970  1.0000  0.9985 
RFT  0.9981  0.9963  1.0000  0.9981  
GBDT  0.9788  0.9593  1.0000  0.9792  
ABT  0.8414  0.8636  0.8098  0.8359  
SVM  0.9940  0.9881  1.0000  0.9940  
MLP  0.9989  0.9978  1.0000  0.9989  
25:75  DT  0.9966  0.9935  0.9997  0.9966 
RFT  0.9963  0.9927  1.0000  0.9963  
GBDT  0.9780  0.9578  1.0000  0.9784  
ABT  0.8394  0.8613  0.8085  0.8341  
SVM  0.9901  0.9806  1.0000  0.9902  
MLP  0.9977  0.9954  1.0000  0.9977  
1:99  DT  0.9692  0.9511  0.9893  0.9698 
RFT  0.9821  0.9699  0.9950  0.9823  
GBDT  0.9734  0.9495  1.0000  0.9741  
ABT  0.8352  0.8409  0.8268  0.8338  
SVM  0.9635  0.9320  1.0000  0.9648  
MLP  0.9842  0.9704  0.9988  0.9844 
Table 5. Decision tree models trained without symmetry breaking, evaluated on the test set and on the whole input space without symmetry breaking.
Property  Accuracy (test)  Precision (test)  Recall (test)  F1 (test)  Accuracy (whole)  Precision (whole)  Recall (whole)  F1 (whole)  Time[s]
Antisymmetric  0.9997  0.9996  0.9998  0.9997  0.9996  0.9935  0.9998  0.9967  2.1 
Bijective  0.9991  0.9982  1.0000  0.9991  0.9981  0.0000  1.0000  0.0000  225.8 
Connex  0.9957  0.9933  0.9982  0.9957  0.9935  0.2258  0.9985  0.3683  0.8 
Equivalence  0.9997  0.9994  1.0000  0.9997  0.9995  0.0000  1.0000  0.0000  34.1 
Function  0.9946  0.9900  0.9993  0.9946  0.9899  0.0001  0.9993  0.0001  2.4 
Functional  0.9968  0.9940  0.9997  0.9969  0.9945  0.0003  0.9997  0.0006  2.6 
Injective  0.9968  0.9940  0.9997  0.9969  0.9877  0.0001  0.9989  0.0001  2.3 
Irreflexive  1.0000  1.0000  1.0000  1.0000  1.0000  1.0000  1.0000  1.0000  0.5 
NonStrictOrder  0.9990  0.9985  0.9994  0.9990  0.9983  0.0011  0.9995  0.0022  1.9 
PartialOrder  0.9934  0.9879  0.9991  0.9935  0.9864  0.2407  0.9992  0.3879  1.2 
PreOrder  0.9985  0.9974  0.9996  0.9985  0.9972  0.0012  0.9997  0.0024  2.0 
Reflexive  1.0000  1.0000  1.0000  1.0000  1.0000  1.0000  1.0000  1.0000  0.5 
StrictOrder  0.9988  0.9979  0.9997  0.9988  0.9979  0.0009  0.9998  0.0019  1.9 
Surjective  0.9988  0.9979  0.9997  0.9988  0.9984  0.0000  1.0000  0.0000  283.3 
TotalOrder  0.9999  0.9997  1.0000  0.9999  0.9997  0.0000  1.0000  0.0000  8.4 
Transitive  0.9999  0.9997  1.0000  0.9999  0.9760  0.1588  0.9902  0.2737  2.0 
Table 2 summarizes the performance of the ML classifiers for one selected property (PartialOrder). The dataset is generated using default Alloy symmetry breaking and is split into training and test sets using different ratios (75:25, 25:75, and 1:99) to measure how performance changes across splits. All models exhibit high accuracy as well as F1 score: across all ratios and models in Table 2, accuracy ranges from 0.9387 to 0.9999, precision from 0.9351 to 0.9998, recall from 0.9425 to 1.0000, and F1 from 0.9389 to 0.9999. Overall, all models achieve good performance and, surprisingly, do so even for a very small training:test ratio (1:99). We performed similar experiments for all other relational properties (shown in the Supplementary Materials), and the results mirror those for PartialOrder – accuracy and F1 remain high, with the lowest accuracy observed for the AntiSymmetric property. It is notable that on most of the properties, all models report high accuracy even when trained on only 1% of the dataset. These results demonstrate that relational properties are learnable in this setting, and that even simple ML models are effective in learning them.
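To make the learning setup concrete, the following sketch shows one plausible way such a dataset can be constructed and labeled: relations over a small set are encoded as flat bit vectors (an n×n adjacency matrix), and each input is labeled by a ground-truth check of the property. This is an illustration of the general setup, not MCML's actual data generation, which uses Alloy; all function names here are our own.

```python
from itertools import product

def decode(bits, n):
    # Decode a flat n*n bit tuple into an n x n adjacency matrix.
    return [[bits[i * n + j] for j in range(n)] for i in range(n)]

def is_reflexive(r):
    return all(r[i][i] for i in range(len(r)))

def is_antisymmetric(r):
    n = len(r)
    return all(not (r[i][j] and r[j][i])
               for i in range(n) for j in range(n) if i != j)

def is_transitive(r):
    n = len(r)
    return all(r[i][k]
               for i in range(n) for j in range(n) for k in range(n)
               if r[i][j] and r[j][k])

def is_partial_order(r):
    return is_reflexive(r) and is_antisymmetric(r) and is_transitive(r)

# Enumerate and label every binary relation over a 3-element set (2^9 = 512).
n = 3
dataset = [(bits, is_partial_order(decode(bits, n)))
           for bits in product((0, 1), repeat=n * n)]
positives = sum(label for _, label in dataset)
print(len(dataset), positives)  # 512 relations, 19 of which are partial orders
```

Note the class imbalance this exposes even at a tiny bound: only 19 of 512 inputs are positive, which foreshadows the precision problems discussed later.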
5.1.2. RQ2: How well do ML (Decision Tree) Models Generalize (Outside of the Test Set)?
Table 6. Decision tree models trained with symmetry breaking, evaluated on the test set and on the whole input space without symmetry breaking.
Property  Accuracy (test)  Precision (test)  Recall (test)  F1 (test)  Accuracy (whole)  Precision (whole)  Recall (whole)  F1 (whole)  Time[s]
Antisymmetric  0.9913  0.9871  0.9957  0.9914  0.9442  0.5098  0.2241  0.3114  1.3 
Bijective  0.9971  0.9942  1.0000  0.9971           
Connex  0.9996  0.9993  1.0000  0.9996  0.9992  0.0520  0.1731  0.0800  1.9 
Equivalence  0.9958  0.9920  0.9997  0.9958           
Function  0.9954  0.9928  0.9980  0.9954  0.9922  0.0000  0.0667  0.0000  86.6 
Functional  0.9991  0.9983  0.9998  0.9991  0.9980  0.0000  0.2907  0.0000  277.5 
Injective  0.9979  0.9961  0.9998  0.9980  0.9961  0.0000  0.2565  0.0000  79.6 
Irreflexive  1.0000  1.0000  1.0000  1.0000  1.0000  1.0000  1.0000  1.0000  0.5 
NonStrictOrder  0.9994  0.9989  1.0000  0.9994  0.9990  0.0000  0.6263  0.0000  8.9 
PartialOrder  0.9963  0.9936  0.9990  0.9963  0.9937  0.0068  0.3435  0.0134  2.9 
PreOrder  0.9992  0.9985  0.9999  0.9992  0.9987  0.0000  0.5180  0.0000  16.6 
Reflexive  1.0000  1.0000  1.0000  1.0000  1.0000  1.0000  1.0000  1.0000  0.6 
StrictOrder  0.9991  0.9982  1.0000  0.9991  0.9983  0.0000  0.4660  0.0000  8.3 
Surjective  0.9980  0.9961  1.0000  0.9980           
TotalOrder  0.9994  0.9988  1.0000  0.9994  0.9990  0.0000  0.4737  0.0000  3059.5 
Transitive  0.9949  0.9910  0.9989  0.9949  0.9914  0.0038  0.2394  0.0074  3.6 
In RQ1 we showed that ML models can exhibit high accuracy when learning relational properties. We now turn to the question of how well those models generalize outside of the test set. Our framework lets us answer this question for decision tree models, which are the focus of this section. Specifically, we examine how the previously trained decision tree models perform with respect to the whole input space. Table 3 compares the performance of the (same) decision tree models when evaluated: (1) on the test set, and (2) with respect to the whole input space. Decision trees are trained on a portion of the constructed dataset (used in the previous section), while the test set is the remaining part of the dataset. The time shown is the total time taken by MCML to compute all performance metrics, where MCML leverages model counting techniques to compute the numbers of true positives, true negatives, false positives, and false negatives. To evaluate with respect to the whole input space, we use the ground truth formula constrained with symmetry breaking conditions (symmetry breaking conditions are added so that the distribution of examples matches that of the training set; we later show an evaluation where this constraint is removed).
When the decision trees are evaluated on the test set, accuracy, recall, and F1 are at least 0.9913, 0.9957, and 0.9914 respectively, while precision is at least 0.9871. However, when the decision trees are evaluated against the ground truth, minimum accuracy decreases to 0.8889, minimum precision and F1 decrease to 0, while minimum recall remains above 0.99. Precision and F1 are spread over the whole range from 0 to 1, and are generally low. In fact, of the 15 properties that did not time out, precision is around 0 for 12. The cause is a high false positive rate: the decision trees classify many negative instances as positive. This indicates that the models are biased towards classifying an example as positive, likely learning patterns present in the training dataset that do not generalize – applying the models outside of the dataset will likely incur many false positives. For the properties Reflexive and Irreflexive, the models retain perfect performance, since establishing these properties requires only checking the diagonal. To summarize, while the previous section showed encouraging learnability results, evaluation with respect to the whole input space reveals concerning issues (with false positives) if the models are to be used in the wild.
In summary, MCML's ability to quantify performance with respect to the entire state space is of unique value, as it quantifies a model's generalizability and avoids the false sense of confidence given by traditional ML metrics. MCML is also time efficient: it reported results for 12 properties within 10 seconds and for 3 more properties within 220 seconds. It timed out on only one property (Equivalence), whose state space of possible solutions is by far the largest.
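For intuition about what MCML computes, the sketch below obtains the same four confusion-matrix counts (TP/FP/FN/TN) over the entire bounded input space by brute-force enumeration, which is feasible only at a tiny bound; MCML obtains these counts via model counting over CNF instead. The "learned model" here is a deliberately naive hypothetical stand-in for a trained classifier, invented for this example.

```python
from itertools import product

N = 3  # bound on the relation size: 2^(N*N) = 512 candidate inputs

def ground_truth(r):
    # Property being learned: reflexivity (all diagonal bits set).
    return all(r[i * N + i] for i in range(N))

def learned_model(r):
    # Deliberately naive stand-in for a trained classifier: it inspects
    # only the first diagonal bit, so it over-approximates the property.
    return bool(r[0])

tp = fp = fn = tn = 0
for r in product((0, 1), repeat=N * N):
    truth, pred = ground_truth(r), learned_model(r)
    if truth and pred: tp += 1
    elif pred: fp += 1
    elif truth: fn += 1
    else: tn += 1

precision = tp / (tp + fp)
recall = tp / (tp + fn)
print(tp, fp, fn, tn, precision, recall)  # recall is perfect, precision is not
```

Even this toy example reproduces the paper's qualitative finding: the over-approximating model achieves perfect recall over the whole space, but its precision collapses because of false positives.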
5.1.3. RQ3: How do Symmetries in the Dataset Affect Performance of the ML Models?
We now study the effect of symmetries in the dataset on the performance of the ML models. We perform experiments similar to those in RQ1 and RQ2, shown in Table 4 and Table 5, but without performing symmetry breaking on the training and evaluation sets.
Table 4 summarizes the performance of the ML classifiers for one selected property (PartialOrder); here the dataset is generated without performing symmetry breaking. Across all ratios and models in Table 4, accuracy ranges from 0.8352 to 0.9989, precision from 0.8409 to 0.9978, recall from 0.8085 to 1.0000, and F1 from 0.8338 to 0.9989. We have performed experiments for the other properties as well (shown in the Supplementary Material); across all 16 properties, all ratios, and all 6 ML models, we observed similar trends. As before, all models achieve good performance, and the performance is preserved even for a small training:test ratio. However, compared to the previous results (Table 2), there is a noticeable decrease in accuracy and F1 score, introduced by the symmetries. This shows that if symmetries are broken (in both the training and test sets) then models can learn the properties better, as they only use representatives of the distinct groups for learning. This is analogous to training a digit classifier where all digits are upright: introducing symmetries, e.g., digits at different orientations, makes it harder for the classifier to learn well.
Table 5 compares the results of the decision tree models on the test set and with respect to the whole input space. Unlike in RQ2, the training set is now generated without symmetry breaking, and the whole input space encoded by the formula is likewise not constrained with symmetry breaking. On the test set, accuracy, recall, and F1 are at least 0.9934, 0.9982, and 0.9935 respectively, and precision is at least 0.9879. However, with respect to the whole input space, minimum accuracy drops to 0.9760, minimum precision and F1 drop to 0, while minimum recall remains above 0.99. The outliers are the properties Reflexive and Irreflexive, where the models retain perfect performance. In summary, the results show a trend similar to before: even when the training set is enriched with symmetric examples, decision tree models still generalize poorly.
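The notion of symmetry used above can be sketched concretely: two relations are symmetric variants of each other if one is obtained from the other by permuting the vertices, and symmetry breaking keeps only one canonical representative per equivalence class. The following illustrative implementation (our own, not Alloy's actual symmetry breaking predicates) canonicalizes by taking the lexicographically smallest permuted image.

```python
from itertools import permutations, product

def canonical(r, n):
    # Lexicographically smallest image of relation r (a flat n*n bit tuple)
    # under all permutations of the n vertices; relations with the same
    # canonical form are isomorphic (symmetric variants of each other).
    return min(tuple(r[p[i] * n + p[j]] for i in range(n) for j in range(n))
               for p in permutations(range(n)))

n = 3
all_relations = list(product((0, 1), repeat=n * n))
representatives = {canonical(r, n) for r in all_relations}
print(len(all_relations), len(representatives))  # 512 relations, 104 classes
```

The 512 relations over a 3-element set collapse to 104 isomorphism classes, so a dataset with symmetry breaking is both smaller and free of redundant permuted copies – which is why models trained on it learn more easily, yet see a distribution that differs from the unconstrained input space.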
5.1.4. RQ4: How does Mismatch in Symmetries Present in the Training and Evaluation Sets Affect the Performance?
Table 7. Decision tree models trained without symmetry breaking, evaluated on the test set and on the whole input space with added symmetry breaking constraints.
Property  Accuracy (test)  Precision (test)  Recall (test)  F1 (test)  Accuracy (whole)  Precision (whole)  Recall (whole)  F1 (whole)  Time[s]
Antisymmetric  0.9997  0.9996  0.9998  0.9997  1.0000  1.0000  1.0000  1.0000  0.7 
Bijective  0.9991  0.9982  1.0000  0.9991  0.9992  0.0000  1.0000  0.0000  10.1 
Connex  0.9957  0.9933  0.9982  0.9957  0.9959  0.1510  1.0000  0.2624  0.6 
Equivalence  0.9997  0.9994  1.0000  0.9997  0.9991  0.0000  1.0000  0.0000  15.7 
Function  0.9946  0.9900  0.9993  0.9946  0.9922  0.0000  0.9912  0.0001  1.6 
Functional  0.9968  0.9940  0.9997  0.9969  0.9964  0.0003  1.0000  0.0005  1.5 
Injective  0.9968  0.9940  0.9997  0.9969  0.9914  0.0002  1.0000  0.0003  1.5 
Irreflexive  1.0000  1.0000  1.0000  1.0000  1.0000  1.0000  1.0000  1.0000  0.4 
NonStrictOrder  0.9990  0.9985  0.9994  0.9990  0.9972  0.0016  1.0000  0.0032  1.5 
PartialOrder  0.9934  0.9879  0.9991  0.9935  0.9862  0.2784  1.0000  0.4356  1.0
PreOrder  0.9985  0.9974  0.9996  0.9985  0.9971  0.0028  0.9997  0.0056  1.6 
Reflexive  1.0000  1.0000  1.0000  1.0000  1.0000  1.0000  1.0000  1.0000  0.5 
StrictOrder  0.9988  0.9979  0.9997  0.9988  0.9977  0.0021  1.0000  0.0041  1.5 
Surjective  0.9988  0.9979  0.9997  0.9988  0.9968  0.0000  1.0000  0.0000  10.9 
TotalOrder  0.9999  0.9997  1.0000  0.9999  0.9996  0.0000  1.0000  0.0000  9.4 
Transitive  0.9999  0.9997  1.0000  0.9999  0.9906  0.1673  0.9912  0.2863  0.8 
Table 8. Quantitative differences between two decision tree models trained per property with different hyperparameters, computed over the whole input space.
Subject  TT  TF  FT  FF  Diff [%]  Time [s]

Antisymmetric  7.86E+05  3.28E+04  3.17E+04  3.36E+07  0.19  1.1 
Bijective  9.19E+54  5.88E+56  3.86E+56  1.00E+59  0.96  117.8 
Connex  3.62E+07  1.15E+07  1.21E+07  6.87E+10  0.03  1.7 
Equivalence  7.88E+116  1.95E+118  3.91E+118  2.54E+120  2.25  852.8 
Function  6.31E+16  8.11E+16  8.11E+16  1.84E+19  0.87  8.1 
Functional  2.03E+16  1.58E+16  1.58E+16  1.84E+19  0.17  7.4 
Injective  7.21E+16  0  0  1.84E+19  0.00  4.2 
Irreflexive  1.05E+06  0  0  3.25E+07  0.00  0.5 
NonStrictOrder  2.92E+11  2.92E+11  2.92E+11  5.63E+14  0.10  4 
PartialOrder  3.02E+08  1.07E+08  1.17E+08  6.87E+10  0.32  2.4 
PreOrder  3.78E+11  3.78E+11  3.78E+11  5.63E+14  0.13  3.8 
Reflexive  1.05E+06  0  0  3.25E+07  0.00  0.5 
StrictOrder  6.18E+11  3.44E+11  3.44E+11  5.63E+14  0.12  3.8 
Surjective  2.45E+55  3.68E+56  3.68E+56  1.00E+59  0.73  119.2 
TotalOrder  3.65E+47  3.65E+47  3.65E+47  7.48E+50  0.10  79.5 
Transitive  4.36E+08  2.01E+08  1.93E+08  6.87E+10  0.57  2.4 
We now move to the next question: how does a mismatch in the presence of symmetries between the training and evaluation sets affect performance? Specifically, we consider two scenarios: (1) symmetries are not present in the training set but are present in the evaluation set, and (2) symmetries are present in the training set but not in the evaluation set.
Table 6 shows the results of decision tree models trained and tested on datasets with symmetry breaking, but evaluated on the whole input space (without symmetry breaking). On the test set, accuracy, recall, and F1 are at least 0.9913, 0.9957, and 0.9914 respectively, while precision is at least 0.9871. However, when the decision trees are evaluated against the ground truth without symmetry breaking constraints, minimum accuracy decreases to 0.9442, minimum precision and F1 decrease to 0, and minimum recall decreases to 0.0667. Since symmetries are present in the whole input space but not in the training set, performance decreases compared to the case where symmetries are also present in the training set (Table 5). These experiments show that decision trees perform worst when trained on a dataset without symmetries and evaluated with respect to the entire state space (with no symmetry breaking). This is an expected result: the trained model saw no symmetric instances in the training dataset and is likely to misclassify permutations of the same instances. This is similar to neural networks failing to generalize on a digit recognition task when shown digits at orientations different from those observed during training (nnLearnInvariants).
Table 7 shows the results of decision tree models trained and tested on datasets without symmetry breaking, and evaluated on the whole input space with added symmetry breaking constraints. On the test set, accuracy, recall, and F1 are at least 0.9934, 0.9982, and 0.9935 respectively, while precision is at least 0.9879. When the decision trees are evaluated against the ground truth with added symmetry breaking, minimum accuracy remains high at 0.9862, minimum precision and F1 decrease to 0, while minimum recall remains above 0.99. These results show that even when the training set is richer than the evaluation set (i.e., contains symmetries), ML models still fail to generalize with respect to the whole input space.
5.1.5. RQ5: What is the Quantitative Difference Between Two Decision Tree Models?
We next employ MCML to quantify the difference between two decision tree models. Our framework gives a rigorous measure of the differences between two models, i.e., a measure with respect to the whole input space (not just the train/test datasets). We trained two decision tree models for each property, using different hyperparameter values, and measured the difference between them.
Table 8 shows the results of quantifying the differences between the two trained models: the number of inputs on which both models predict true (TT), on which the first predicts true and the second false (TF), and the other combinations, denoted FT and FF. The Diff column shows the percentage of cases in which the two models make different predictions; in all cases, the difference is small (at most a few percent). MCML quantifies the differences between the two models with respect to the whole input space, which makes it a powerful technique for model evaluation. For example, one can take a smaller (compressed) model and rigorously compare it against the larger model to decide whether it can serve as a replacement. MCML computed the difference for 12 properties within 10 seconds, and for all properties within 1000 seconds.
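The TT/TF/FT/FF counts above can be illustrated at a tiny bound by exhaustive enumeration (MCML instead obtains them via model counting). Both "models" below are hypothetical stand-ins invented for this sketch: two classifiers for the Irreflexive property that differ in how many diagonal bits they check.

```python
from itertools import product

N = 3  # 2^(N*N) = 512 inputs in the bounded space

def model_a(r):
    # Hypothetical model 1: checks every diagonal bit (the correct check).
    return all(not r[i * N + i] for i in range(N))

def model_b(r):
    # Hypothetical model 2: checks only the first diagonal bit.
    return not r[0]

counts = {"TT": 0, "TF": 0, "FT": 0, "FF": 0}
for r in product((0, 1), repeat=N * N):
    key = ("T" if model_a(r) else "F") + ("T" if model_b(r) else "F")
    counts[key] += 1

# Percentage of inputs on which the two models disagree (the Diff column).
diff_pct = 100.0 * (counts["TF"] + counts["FT"]) / (2 ** (N * N))
print(counts, diff_pct)
```

Here the disagreement is concentrated in FT: the sloppier model accepts many inputs the stricter one rejects. A diff of 0% over the whole space, as for Injective, Reflexive, and Irreflexive in Table 8, certifies the two models are semantically identical up to the bound.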
5.2. Discussion
Tables 2 and 4 evaluate off-the-shelf models using traditional ML metrics based on training and test datasets. These metrics give a false sense of confidence, which our proposed MCML framework addresses. Tables 3, 5, 6 and 7 show that the majority of precision and F1 scores are low with respect to the whole state space. An important characteristic of the properties we consider is that the number of positive cases is far smaller than the number of negative cases. We believe new classifiers are needed to handle complex relational properties.
Alloy formulas are intuitively simple but semantically quite complex, and achieving high precision with respect to the entire state space is hard. In fact, in only 2 cases (Reflexive and Irreflexive) is the precision 1, and that is because those properties can be checked simply by inspecting the diagonal entries – the trained decision trees indeed do so. In all other cases, the precision is very low (<0.1). MCML provides a new tool to rigorously study properties that seem "easy" to learn in the traditional setting of training and test data, but are actually "hard" to learn when viewed in the context of the full state space. For example, for PartialOrder, precision was 0.9936 in the traditional setting, whereas MCML reported a precision of 0.0059 over the whole state space. This indicates that the model is biased towards classifying an example as a partial order, likely learning patterns present in the training dataset that do not generalize – when using data outside the dataset, one can anticipate many false positives. MCML's ability to quantify with respect to the entire state space is of unique value, as it quantifies a model's generalizability and avoids the false sense of confidence given by traditional ML metrics.
MCML uses a very efficient encoding of decision trees directly into CNF that is linear in the size of the tree and does not require any auxiliary variables. The resulting formulas are readily solved by off-the-shelf model counters.
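The intuition behind why a linear encoding suffices can be sketched without CNF at all: in a decision tree over boolean features, the root-to-leaf paths are mutually exclusive conjunctions of literals, so the number of inputs a tree classifies positive is just a sum of powers of two over its positive leaves. The sketch below (a simplified illustration, not MCML's actual CNF encoding or its model counters) counts in time linear in the tree size.

```python
NUM_FEATURES = 9  # e.g., a flattened 3x3 adjacency matrix

def count_positive(node, depth=0):
    # Leaves of a decision tree partition the input space, so the number of
    # inputs a boolean-feature tree classifies positive is the sum, over
    # positive leaves, of 2^(features left unfixed on that root-to-leaf path).
    # Assumes each path tests distinct features.
    if isinstance(node, bool):  # leaf: True = positive class, False = negative
        return (1 << (NUM_FEATURES - depth)) if node else 0
    _feature, low, high = node  # (tested feature, subtree if 0, subtree if 1)
    return count_positive(low, depth + 1) + count_positive(high, depth + 1)

# Hypothetical tree that labels a relation "reflexive" by testing the three
# diagonal bits (features 0, 4, 8) in turn.
tree = (0, False, (4, False, (8, False, True)))
print(count_positive(tree))  # 2^(9-3) = 64 of the 512 inputs are positive
```

Counting against a ground-truth formula (rather than within a single tree, as here) is what requires the CNF encoding and an external model counter, since the conjunction of the tree with an arbitrary formula no longer decomposes into disjoint paths.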
The primary limitation of MCML with respect to ground truth is the requirement for ; we expect property synthesis methods can help here, and even when is not available, an approximation of that is human understandable can be of much use.
6. Conclusion
This paper introduced the MCML approach for empirically studying the learnability of relational properties that can be expressed in the software design language Alloy. A key novelty of MCML is quantification of the performance of, and semantic differences among, trained machine learning (ML) models, specifically decision trees, with respect to entire input spaces (up to a bound on the input size), and not just for given training and test datasets (as is the common practice). MCML reduces the quantification problems to the classic complexity theory problem of model counting, and employs state-of-the-art approximate and exact model counters for high efficiency. The results show that relatively simple ML models can achieve surprisingly high performance (accuracy and F1 score) at learning relational properties when evaluated in the common setting of using training and test datasets – even when the training dataset is much smaller than the test dataset – indicating the seeming simplicity of learning these properties. However, the use of MCML metrics based on model counting shows that the performance can degrade substantially when tested against the whole (bounded) input space, indicating the high complexity of precisely learning these properties, and the usefulness of model counting in quantifying the true accuracy.
MCML offers exciting new directions for leveraging model counting in machine learning, e.g., (1) it provides quantitative answers to key questions like "did we train enough?", "how much did we overfit?", and "is this model basically the same as this other model (I have in mind)?"; and (2) it allows informed decision making for ML-based systems, e.g., "should a deployed model be replaced with another (newer) model?".