Skip to main content
SearchLoginLogin or Signup

How Interpretable Is “Interpretable” Machine Learning?

The modern machine learning (ML) revolution has led to ML systems deployed at scale for the first time, impacting society beyond computer science experts. It is critical that these systems are integrated appropriately with minimal negative side effects.

Published onFeb 27, 2024
How Interpretable Is “Interpretable” Machine Learning?


The modern machine learning (ML) revolution has led to ML systems deployed at scale for the first time, substantially impacting society beyond computer science experts. It is critical that these systems are integrated appropriately with minimal negative side effects, and an important component of this goal is identifying whether these tools are interpretable by humans. Formal methods is a computer science discipline that may address this problem via mathematically grounded techniques for specifying and verifying the desired operation of software and hardware systems. In the ML context, ML systems may be designed to output a set of formal specifications that describe precisely what they learned, which can then be translated into a set of natural language rules. A widespread belief in the formal methods community is that the use of their approaches in machine learning lends interpretability to systems, making their reasoning understandable by humans. We evaluate the claim that formal specifications are interpretable by humans via an experiment with sixty-two participants. We describe the format of the study, provide statistical analysis of the results, and recommend modifications to researchers’ claims and ergonomic improvements to data presentation and validation training.

🎧 Listen to an audio version of this case study.

Keywords: machine learning, model interpretability, human experiments, formal methods, robot behaviors, temporal logic

Ho Chit (Hosea) Siu
MIT Lincoln Laboratory

Kevin Leahy
MIT Lincoln Laboratory; now at Worcester Polytechnic Institute

Makai Mann
MIT Lincoln Laboratory; now at Anduril Industries

Author Disclosure: DISTRIBUTION STATEMENT A. Approved for public release. Distribution is unlimited.

This material is based upon work supported by the Under Secretary of Defense for Research and Engineering under Air Force Contract No. FA8702-15-D-0001, and is based on material presented in Ho Chit Siu, Kevin J. Leahy, and Makai Mann, “STL: Surprisingly tricky logic (for system validation),” 2023 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS) (2023): 8613-8620, Any opinions, findings, conclusions, or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the Under Secretary of Defense for Research and Engineering.

Learning Objectives

  • Understand the importance of interpretability in the validation of machine learning models

  • Identify recent claims of model interpretability in the formal methods space

  • Describe an experiment testing humans’ ability to interpret formal specifications

  • Understand the implications of interpretability claims that lack an empirical basis


Modern machine learning (ML) has made significant progress on a number of tasks previously thought to be solely in the realm of human reasoning, from diverse types of image recognition and segmentation,1 to complex games ranging from ancient (Go)2 to modern (StarCraft II),3 to driving4 and flying.5 Many of the artificial intelligence “agents” performing these tasks are now able to do so at a superhuman level, surpassing the performance of even human experts. In nearly every recent case that has expanded the state of the art for these tasks, the underlying architecture is some form of deep neural network (DNN), that operates on anywhere from thousands to billions of parameters, all trained to collectively complete a given task. For an interactive example of a deep neural network, see Appendix 1. These and other types of common machine learning models suffer from a lack of interpretability or transparency about how they make their decisions.

Our human ability to understand the reasons behind a decision is often a crucial component for deciding whether we trust a recommendation, determine criminal guilt, or decide whether a decision generalizes to other situations.6 Consider, for example, a model that determines that a bone fracture has occurred based on a dark line through an x-ray image of a bone (a likely relevant feature on a relevant modality) versus one that makes a determination based on a set of pixels on the same image that lie entirely outside of any part of the body. Or, consider the difference between burglary and criminal trespassing: both involve unlawful entry into a property, but burglary has the reason of committing a further crime inside the property as a motivation, and carries a heavier punishment.

The x-ray example is also a case in which model interpretability provides an avenue for humans to debug a model, since we know that the true indicator of a fracture would not be something found outside the body on the x-ray image. Beyond the general idea of model trust, interpretability is also helpful in understanding biases in ML models. Amazon, for example, discontinued a machine learning–based job applicant screening program after it was found to be biased against women, since it was trained on the male-heavy historical résumé pool that the company had access to.7 While interpretability itself does not alleviate such biases, being able to expose the models’ decision-making processes can help catch such issues.

Modern machine learning does not generally provide the underlying reasoning for its decision-making. Our inability to probe these models can (and has) led to severe consequences, “cases of people incorrectly denied parole, poor bail decisions leading to the release of dangerous criminals, ML-based pollution models stating that highly polluted air was safe to breath and generally poor use of limited valuable resources in criminal justice, medicine, energy reliability, finance, and other domains.”8 Various approaches have been used to provide such information, including text-based explanations, visualizations, highlighting properties that contribute to a decision (local explanations), and providing examples. Specifically for sequential or temporal decision-making processes, some approaches have considered interpretability methods such as providing weights on how much certain human-defined “prototypical behaviors” were considered at a given time9 or decomposing tasks into subtasks with explicit transitions.10

One area of computer science that has sought to tackle this problem is formal methods, a set of techniques that uses mathematical approaches to specify, analyze, and verify software and hardware systems. Formal specification is particularly interesting because it allows humans to write down a set of mathematically precise rules that a system must follow—rules that can always be turned directly into a natural language description of the system’s constraints.

Imagine a notional ML system that is designed to learn the properties of good highway driving by training on many examples of it in order to inform the behavior of an autonomous car. Suppose that the properties of each “instance” of driving are provided as a ten-minute time series of data, each example is labeled as an example of good or bad driving, and the goal is to determine if a new driving example was good or bad. The time series representing “good driving” might contain data that could be broadly described by such statements as “the driver kept within ten miles per hour of the speed limit, the driver kept a distance of more than fifty feet from the car in front,” and so on. In “black box” ML systems, like neural networks, the system goes directly from the time series to a classification of good or bad driving, without exposing the reasoning behind it. In contrast, a formal methods approach to ML would have the system first learn a set of logical descriptors of good and bad driving, much like the statements above, and then those descriptors are used collectively to label good or bad driving. The intermediate step of explicitly creating these “rules” with human-readable concepts makes it possible to examine what is driving the decision-making of the latter type of system, so in theory, humans could go in and check whether the system learned things appropriately or not.

The example above is an example of “temporal logic,” which is a subset of formal methods that describes events that evolve over time. Temporal logic is a popular choice for specifying robot behavior when any kind of formal specification is involved. Signal temporal logic (STL) is a particular type of temporal logic that uses statements with components that describe

  1. an equation that can be evaluated to true or false,

  2. a timeframe, and

  3. how the equation should hold, for example, for the entire duration of the timeframe (“always” or “globally”) or at least once during that timeframe (“finally” or “eventually”).

Returning to our car example, an STL statement might be something like “G[0,600](s<slim+10)G_{\lbrack 0,600\rbrack}(s < s_{\lim} + 10),” which would mean that in order to satisfy the statement, the signal ss (the speed of the car) must be less than slims_{\lim} (the speed limit) plus 10 (miles per hour) for the entire duration of time from 0 to 600 seconds (the entire ten-minute sample).

These statements can then be combined into more complex specifications with Boolean operators, such as AND (\land), OR11 (\vee), and NOT (¬\neg). We might specify that a system must obey the constraints G[0,600](s<slim+10)G[0,600](s>slim10)¬G[0,600](dfront50)G_{\lbrack 0,600\rbrack}\left( s < s_{\lim} + 10 \right) \land G_{\lbrack 0,600\rbrack}\left( s > s_{\lim} - 10 \right) \land \neg G_{\lbrack 0,600\rbrack}(d_{front} \leq 50). This formula can be translated into

  • the speed of the car must always be less than the speed limit plus 10 mph from time 0 to 600 seconds and

  • the speed of the car must always be greater than the speed limit minus 10 mph from time 0 to 600 seconds and

  • the distance between the car and the next closest car in front must not be less than or equal to 50 feet from 0 to 600 seconds.

In this case, the rules are all “global” or “always true” in the duration, and the duration is always the entire length of the sample, but STL can also refer to parts of the timeframe or properties that only need to occur once within a timeframe, among other types of properties.

The direct mapping from formula to language description is something that is not as easily done with other types of “black box” machine learning. The example above shows how signal temporal logic provides a type of text-based explanation of the model, a set of local explanations of relevant features under consideration, and even a chain of reasoning for the precise collection of features it considers when generating a trajectory (or deciding whether a trajectory met the specification). You could also imagine how such a model could use its constraints to generate a set of labeled examples—ones that it could say met or did not meet its requirements—as further evidence of its “reasoning.”

However, as you might be able to tell from the somewhat convoluted text explanation above (even for a fairly short set of rules), just because the specifications are machine-interpretable, does not necessarily mean that they are easily interpretable by humans! Imagine, for example, what may happen once we start considering issues such as lane changes, getting on or off the highway, traffic jams, and responses to police cars. You might guess at how properties such as how long the formula is, how “nested” it is (how many sets of parentheses are involved, and how deep they go inside each other), and even whether or not you can keep track of which numbers relate to which variables can strongly affect how easy it is for a human to understand.

Methods to explain temporal properties are a newer area of research compared to such methods for static properties (e.g., “Ford Model T cars are always black” as a distinguishing feature when classifying car models). In fact, interpretable machine learning methods for temporal or sequential properties is considered one of the grand challenges of explainable machine learning.12

Unfortunately, although claims that formal specifications are human-interpretable are fairly common in the formal methods literature, we found that they were poorly supported. We performed a literature search in the Web of Science, focused on temporal logic, using a query of (interpretable OR explainable OR transparency) “temporal logic,” which yielded thirty-eight results for the 2012 to 2022 timeframe. Of the thirty-eight results, twenty-five contained claims of formal specifications being interpretable by humans (see Appendix 2).

Of these twenty-five papers, only three cited any sources to support the notion that their methods would improve humans’ ability to interpret the models, whereas the other twenty-two did not. Even the three that did cite sources did not actually incorporate the methods they cited to improve human interpretability in their actual systems and experiments. Crucially, none of the twenty-five papers we found tested their methods with humans to support their claims of interpretability.

Although our search is certainly not exhaustive, it was designed to be easy to replicate, and we believe it represents a serious indictment of the state of interpretability claims in formal methods. The results (and methods) of this search mirror a broader issue in “explainable AI” illustrated by Miller,13 where the vast majority of studies claiming to build explainable AI neither support their methods with citations, nor perform any experiments to check that their methods are, in fact, explainable to any human.

We decided to put formal specifications in front of human interpreters and see how well they understood them. In this case study, we were particularly interested in a particular application: whether humans are able to use formal specifications to perform system validation on a mobile robot—that is, to check whether a robot will always do what the user wants it to. To validate the system from its specifications, a user must understand what the specifications mean for the behavior of the robot, and critically, must do so for any possible state of the robot across the entire specification. This task is harder than if a person were to be asked to just predict what a robot might do next, or point to the part of a specification that would drive certain behaviors, both of which are more “local” checks that one might do with a system. However, validating the entire system is essential to ensuring that it will do what the end user wants.

Validating Robot Specifications

We performed a human experiment in which we asked subjects to determine whether or not a given temporal logic formula would always guarantee a “robot” will win a game of capture the flag (CTF) from a given starting configuration. Subjects were given an introduction to our grid-based CTF game (Figure 1), where the goal was to have the blue team robot (the blue circle) go from its starting location to the flag location, and then return to any blue cell, while avoiding being within one square of the red team (any red diamond). In all cases, the red team was stationary, and the blue robot could only move one square per time step in the up/down/left/right directions.

The instructions gave an example for each of the three formats in which the temporal logic would be shown, and a set of definitions for all the symbols they would encounter. After the introduction, each subject saw thirty pairs of starting grids and associated formulas, where the latter was presented randomly in one of the three formats. For our presentation formats, we used 1) “raw” formulas in STL, 2) a format that used lines of English and mathematical equalities and inequalities, and 3) a format that used a decision tree representation. The English representation was used since formal specifications’ direct ties to language-based representations was one of the main reasons people thought it led to interpretability, and the decision tree representation was used because the machine learning community often assumes that models expressed as decision trees are somehow intrinsically interpretable, again, generally without evidence.14

Specifications represented constraints on the movement of the robot, and were approximately a paragraph of text at most, which is much less than what would typically be required to specify robot behaviors in real life.

Figure 1

Example validation workflow. For each validation, participants are presented with an initial game configuration and any additional agent behaviors, for example, red not moving (left), followed by a specification in one of three formats (center), and then two questions about validity and confidence (right).

For each pair of grids and specifications, we asked subjects to tell us whether a specification was 1) valid, 2) invalid because no plan could be generated with it, or 3) invalid because it is possible for a game-losing plan to be generated. To be valid, a specification must only allow the robot to move in a way such that it would reach the flag and return to its home (blue) region within a certain amount of time, without ever touching any red players, which are on some maps, but never move. In addition to telling us about the validity of the plan, subjects were also asked to rate their level of confidence in their answer from 1 (low) to 5 (high).

Sixty-two people completed our experiment, and included mostly members of the public without previous formal methods experience, along with thirteen people who had experience in formal methods, generally from graduate-level work in the field. This pool was chosen because end users who would likely be checking if their systems were programmed the way they wanted are unlikely to have formal methods experience, but we also wanted to see how well formal methods experts did in this task.

Humans Are Bad at Validating Formal Specifications

The results were quite poor. Validation accuracy was 13.6 ± 6.1 out of thirty possible points, or 45% ± 20%. Familiarity with formal methods helped a little, but there was still high variability in our expert pool. Participants were only correct about whether the robot’s rules were what they wanted about half the time. See Figure 2.

Figure 2

Self-reported familiarity with formal methods vs validation score. Horizontal jitter was added to visually separate points.

The format and the complexity of the specification did not appear to matter (i.e., were not statistically significant).

If we pause for a moment to consider what level of validation correctness we might expect or want in different domains, the impact of these results become even clearer. How accurate do you need a person to be when deciding whether to use the new software for a surgical robot? A plane’s autopilot? A ship’s safety system? Ninety percent? Ninety-five? Or 99.999 percent? Likely not 45 percent. Consider again that these real-world systems generally follow sets of rules that are far more complicated than what we presented to people here.

But there is more that we can glean from the data. Digging a little deeper, we found that validation accuracy differed significantly for specifications that were actually valid vs. ones that were not. Valid specifications were much more likely to be correctly marked valid than invalid, but specifications that were invalid due to possibly generating blue losses were also much more likely to be incorrectly marked valid than invalid. These trends show that subjects tended to just mark specifications as valid, regardless of whether they were. However, specifications that are mostly correct but allow for some unwanted behaviors are precisely the kind of specification that you would want human checkers to look out for—and those are the cases where humans do poorly.

In Figure 3, the horizontal bars on top of the box and whisker plots represent the results of statistical tests. A p-value of less than .05 is generally considered “statistically significant,” meaning a difference was detected between the two datasets that make up the boxes under the bar. The d values are effect sizes and tell us how large the difference is. More specifically, d is the ratio between the difference of the means of two groups against the standard deviation of all the data in both groups. Effect sizes with an absolute value greater than 0.8 are generally considered “large.”

Figure 3

Correct and incorrect validations by ground truth validity. There were an equal number of valid and invalid specifications, but the invalid specifications were split unevenly between no-plan and blue-loss cases.

Furthermore, when we split people into groups based on their familiarity with formal methods, we found that there was a small, but detectable difference in self-rated confidence when our unfamiliar group was correct vs. when they were incorrect (being slightly more confident when they were actually correct). But crucially, our expert group showed no such difference—they were always just very confident in their answers, with the median rating almost at the maximum value on the scale. See Figure 4 for the comparison.

Figure 4

Participant confidence in their answer when their answer was actually correct vs. incorrect, split by formal methods familiarity. Forty-nine participants had a familiarity rating < 4 (out of 5), while thirteen had a familiarity ≥ 4.

Implications for Formal Methods and Machine Learning

What we have essentially found is that there is a subdiscipline of computer science that has been enthusiastically applying their methods to the development of machine learning models under a largely unsupported assumption that their models enabled better human interpretability. Very few people cite any sources on human interpretability, and no one (at least in our search parameters) tested their claims with real humans. Our experiment showed that for the case of robot behavior validation, humans are quite bad at interpreting formal specifications to a meaningful degree.

“Interpretability” is a difficult goal that is frankly poorly defined in the machine learning space.15 It is likely true that being able to tie outputs to meaningful rules that can be expressed in natural language is better than the more “black box” methods out there, such as systems using mainly or entirely deep neural networks. Maybe it is tempting to develop methods that output a set of rules in English and call that sufficiently interpretable. But clearly, being able to state even a very small set of rules in English (or in math, or in a decision tree) is not enough to make it meaningfully interpretable to humans. And while interpretability may be difficult to pin down, it still does not excuse the large proportion of work that we found in formal methods, and that Miller16 found in “explainable AI” at large that claim interpretability, but do not support their claims. These claims have historically gotten ahead of the evidence in much of the machine learning world.

It is standard practice in machine learning to present performance metrics for new models and algorithms, typically with some values of model accuracy. The field has developed norms about what evidence is required to evaluate such performance, which vary slightly depending on the task being assessed. For example, a paper might call a new model “highly accurate” and use root mean squared error as a measure of accuracy, presenting some values that are then compared against other models.

However, it is still very common in machine learning literature to call an approach “interpretable” or "explainable” or “transparent” without defining what is meant by those terms and without evidence that the definitions are met. Nonetheless, when models are presented (or sold) to end users—everyone from hospitals to factories to militaries—these properties are typically touted as important differentiators for why one model is better than another.

But at a high level, how does the interpretability of formal specifications compare to that of other popular methods, such as deep neural nets? This study does not change the fact that formal specifications can be directly and meaningfully translated into natural language, which is much more than can be said for neural nets. Being able to tie specific pieces of a model directly to certain decisions that it makes is valuable.17 Nonetheless, it should be clear that if we hope to make any of these outputs useful for validation, much more attention needs to be paid to elements such as user interfaces, user experiences, human learning and training, and perhaps even our whole approach to “examining” models. There may be lessons to be learned from this kind of design work that can go beyond applications in formal specifications to broader use in how we as humans check both software we write ourselves and ML systems that learn models on their own.

This work contributes one example (of many) to an argument that Nancy

Leveson has been making for several decades, that checking whether or not systems meet their stated requirements is not sufficient to show that they are, in fact, safe.18 Rather, issues stemming from the engineering of complex systems often come about because the stated requirements were unsafe or otherwise insufficient, and checking that they are enough (validation) is not an easy thing to accomplish. Systems engineering approaches to safety have been developed to tackle this kind of challenge,19 but such methods do not appear to have meaningfully made their way to the AI/ML community. Clearly, it takes more than just the ability to translate specifications into natural language for systems to be easily checked. If future work can develop better ways to help people interpret specifications, perhaps we can mitigate some of the risk of the systems being misspecified.

Nancy Leveson’s Introduction to STAMP (Parts 1 and 2) from the 2020 MIT STAMP Workshop. For more resources related to STAMP (System-Theoretic Accident Model and Processes), please visit

Discussion Questions

  1. How can people performing research in interpretable machine learning be more responsible with their methods and claims?

  2. What do you think should be the responsibility of the various parties that might be interested in interpretable machine learning (e.g., researchers, product designers, users, managers, funders, and so on)?

  3. There appear to be many unsubstantiated claims about interpretability in the machine learning space. What backgrounds, biases, and motivations might exist among those conducting this type of research that make this issue a blind spot? What about the people reading and evaluating the research?

  4. Given how poorly even people with formal methods expertise were doing in our validation study, should we give up on formal methods as a path to interpretability? Is there anything worth trying along these lines on either the human side or the machine side?

  5. How does the setting for this experiment differ from what might be more typically seen in real-world robotics? How does “capture the flag” differ from a more real-world robot? How do our human subjects and their backgrounds compare to people who might actually be checking robots? Would certain biases or expectations exist in one setting or another?

  6. Are there any groups of people or types of training that might fit better for this kind of system validation? What type of training might a “robot validator” need to go through to check a robot that uses formal specifications? What about a robot that uses more black box machine learning methods like neural nets? Are there other professions that currently train for these types of “validation skills”?

  7. The notional capture-the-flag game described here is fairly inconsequential, but what are some other robotics use cases where we really need system validation? What are some cases where it may not matter as much? Are there things we can do other than validating the entire system if we are interested in understanding its properties or deciding whether or not to trust it?

Appendix 1

Supplemental Material: An interactive example of a deep neural network (DNN), available via

Appendix 2

Note: the above card is a downloadable xlsx file with information on the literature search from Web of Science for relevant papers on “temporal logic.”


Dastin, Jeffrey. “Insight - Amazon Scraps Secret AI Recruiting Tool That Showed Bias against Women.” Reuters, October 10, 2018.

Grigorescu, Sorin, Bogdan Trasnea, Tiberiu Cocias, and Gigel Macesanu. "A Survey of Deep Learning Techniques for Autonomous Driving." Journal of Field Robotics 37, no. 3 (2020): 362–86.

Kenny, Eoin M., Mycal Tucker, and Julie Shah. "Towards Interpretable Deep Reinforcement Learning with Human-Friendly Prototypes." In The Eleventh International Conference on Learning Representations. 2022. Available at

Leveson, Nancy. "Are You Sure Your Software Will Not Kill Anyone?" Communications of the ACM 63, no. 2 (2020): 25–28.

Leveson, Nancy. Engineering a Safer World: Systems Thinking Applied to Safety. Cambridge, MA: The MIT Press, 2016.

Lipton, Zachary C. "The Mythos of Model Interpretability: In Machine Learning, the Concept of Interpretability Is Both Important and Slippery." Queue 16, no. 3 (2018): 31–57.

Lyu, Daoming, Fangkai Yang, Hugh Kwon, Wen Dong, Levent Yilmaz, and Bo Liu. "TDM: Trustworthy Decision-Making via Interpretability Enhancement." IEEE Transactions on Emerging Topics in Computational Intelligence 6, no. 3 (2021): 450–61.

Miller, Tim, Piers Howe, and Liz Sonenberg. " Explainable AI: Beware of Inmates Running the Asylum or: How I Learnt to Stop Worrying and Love the Social and Behavioural Sciences." Preprint submitted December 2, 2017,

Minaee, Shervin, Yuri Boykov, Fatih Porikli, Antonio Plaza, Nasser Kehtarnavaz, and Demetri Terzopoulos. "Image Segmentation Using Deep Learning: A Survey." IEEE Transactions on Pattern Analysis and Machine Intelligence 44, no. 7 (2021): 3523–42.

Rudin, Cynthia. "Stop Explaining Black Box Machine Learning Models for High Stakes Decisions and Use Interpretable Models Instead." Nature Machine Intelligence 1, no. 5 (2019): 206–15.

Rudin, Cynthia, Chaofan Chen, Zhi Chen, Haiyang Huang, Lesia Semenova, and Chudi Zhong. "Interpretable Machine Learning: Fundamental Principles and 10 Grand Challenges." Statistic Surveys 16 (2022): 1–85. .

Silver, David, Aja Huang, Chris J. Maddison, Arthur Guez, Laurent Sifre, George Van Den Driessche et al. "Mastering the Game of Go with Deep Neural Networks and Tree Search." Nature 529, no. 7587 (2016): 484–89.

Tegler, Eric. “AI Just Won a Series of Simulated Dogfights against a Human F-16 Pilot, 5-0. What Does That Mean?” Forbes, August 20, 2023.

Vinyals, Oriol, Igor Babuschkin, Wojciech M. Czarnecki, Michaël Mathieu, Andrew Dudzik, Junyoung Chung, David H. Choi et al. "Grandmaster Level in StarCraft II Using Multi-Agent Reinforcement Learning." Nature 575, no. 7782 (2019): 350–54.

No comments here