Many courts use risk assessment algorithms to advise judges whether to release a defendant pretrial, and if so under what conditions. These algorithms are highly controversial, and have been criticized for perpetuating historical biases. On the other hand, they can remind judges that most defendants can be safely released, and help us think about how much risk—and risk of what—could justify detention. But they can only play this role if they are transparent, and if judges know what their outputs mean. Vague labels like “high risk” are not enough. We audited a widely used risk assessment algorithm for accuracy and fairness using a dataset of fifteen thousand defendants in Albuquerque, New Mexico. By digging deeper than previous studies, we learned that most crime is not pretrial crime, rearrest for high-level felonies is very rare, and that most people who “fail to appear” in court miss only one hearing. We also audited proposed state laws, treating them as algorithms, and showed that they would detain many people unnecessarily while only preventing a small fraction of crime. We close with some reasons that computer scientists should engage in studies like this, and how doing so can broaden your view of both algorithms and human systems.
Keywords: risk assessment; pretrial detention; algorithmic fairness; criminal justice reform
Learn how risk assessment algorithms are used in the pretrial justice system and their relationship with bail reform.
Learn how to question whether data accurately represents reality, such as the differences between arrest and crime.
Understand the role of transparency, including the ability of defendants and judges to understand what an algorithm depends on and what its outputs mean.
Understand how the meaning of an algorithm can vary from place to place, how labels like “high risk” can be misinterpreted, and how looking more closely at the data can help reform the system.
Learn how controversies about crime and pretrial detention play out in practice, and what role algorithms and data can play in legislative debates and the movement for criminal justice reform.
What does it mean about a defendant, a tenant, or a loan applicant if they are labeled “high risk”? Even assuming this label is accurate in some sense, how much risk, and risk of what, does it imply? And what would this risk have to be to justify a punitive action?
These questions are particularly pressing in pretrial detention. At this moment there are about 450,000 people in local jails across the United States who have been arrested but not yet been found guilty.1 Many wait more than a year for their case to be resolved. This raises profound issues in a society that considers defendants innocent until proven guilty.2 Jailing a defendant makes it impossible for them to maintain housing or employment or care for their families; it also makes it hard for them to mount a strong defense, and more likely to accept a plea deal.3 It makes it more likely that they will commit crimes in the future.4 Finally, it is very expensive, costing roughly one hundred dollars per inmate per day.5
At the same time, many policymakers decry a “revolving door” where dangerous defendants are released and commit additional crimes. Although they constitute a small fraction of crime overall, these cases are highly visible to the press and public, and create a sense that the system failed.6 The thought “this wouldn’t have happened if the judge hadn’t released them” is simple, powerful, and far more salient to many voters than the complex individual and social causes of crime.7
So, on what grounds is pretrial detention justified? In its 1987 decision in United States v. Salerno, the Supreme Court found it constitutional to detain a defendant if a judge finds that releasing them would pose an unacceptable danger to the public.8 But this is essentially a prediction—a claim that a defendant has a high risk of committing an additional crime if released—which may or may not be accurate.
Many jurisdictions have turned to risk assessment algorithms, also known as tools or instruments, in an effort to distinguish high-risk from low-risk defendants. As in other high-stakes domains like housing and credit, the use of algorithms is controversial. Critics argue that algorithms predict arrest rather than crime, fail to treat defendants as individuals, and perpetuate historical biases and systems of oppression.9 Others argue that algorithms can compensate for human biases, provide new forms of accountability, and clarify how we think about decision-making.10
We believe that, if they are sufficiently transparent, interpretable, and auditable, algorithms can play a positive role in society and in the movement for criminal justice reform. But their meaning is obscured by the usual ways that their results are presented, such as “high risk,” “5 out of 6,” or “orange.” By digging beyond the usual variables used to train them, we find, for instance, that even “high-risk” defendants are rarely arrested for high-level felonies—a fact that judges should know. The better all stakeholders understand what algorithmic labels mean, the better a democratic conversation we can have about what probability, of what event, could justify pretrial detention.
Rightly or wrongly, the move toward risk assessment algorithms is coupled with the current bail reform movement. In many states, “bail” mainly means financial bail, where a judge assigns a dollar amount that the defendant must deposit up front as collateral, and the defendant is jailed if they cannot afford it. A number of states, including California, Kentucky, New Jersey, New York, and Illinois, have moved or are moving away from this practice, due to the obvious concern that it discriminates against the poor.
New Mexico is another such state.11 In 2014, the New Mexico Supreme Court ruled that it is unconstitutional to detain someone solely because they cannot afford a bail bond. In 2016, the voters amended the state constitution accordingly. The same amendment states that any defendant can be detained if the prosecutor proves “by clear and convincing evidence,” in a hearing with defense present, “that no release conditions will reasonably protect the safety of any other person or the community.”12 Thus, detention should be based, not on the defendant’s financial circumstances, but on the risk that they would pose to the public if released. The question is how to estimate that risk.
The Public Safety Assessment, or PSA, was developed by the Arnold Foundation (now Arnold Ventures) and is currently used in twenty-two states. Since June 2017 it has been used in Albuquerque, or to be more precise, in Bernalillo County Metropolitan Court and the Second Judicial District Court, specifically for felony defendants. It has now been extended to ten additional jurisdictions in New Mexico. The PSA is based on the number of a defendant’s prior convictions or failures to appear in court, whether these prior convictions are for felonies or misdemeanors and for violent or nonviolent crimes, whether there is a pending charge at the time of arrest, and whether the source charge (for which the defendant was originally arrested) is violent. It does not use juvenile records or “environmental” questions like housing, education, family, or employment. The only demographic variable it uses is age.
The PSA combines these factors with a simple point system with integer weights that are publicly known. Thus, it is interpretable in the sense that defendants and judges can easily understand how the scores were generated.13 This distinguishes it from opaque, proprietary algorithms like COMPAS, which we feel have no place in the justice system.14
The PSA produces two risk scores on a 6-point scale. One is for Failure to Appear (FTA), namely, missing a court hearing. The other is for New Criminal Activity (NCA), which means being rearrested for a new charge during the pretrial period. It also produces a flag for New Violent Criminal Activity (NVCA), meaning rearrest for a new violent charge.
When a jurisdiction adopts the PSA, local stakeholders construct a “release conditions matrix” that translates risk scores into recommendations. These typically include requiring check-ins with pretrial supervision officers, drug testing, or tracking with a GPS device. The PSA was originally proposed as a way to help judges decide whether or not to release a defendant, and if so under what conditions. However, the current position of Arnold Ventures is that the PSA should only be used to recommend conditions of release, and not for the decision to release or detain. Some jurisdictions still use it for both purposes, but this is in flux. In 2023, the Albuquerque matrix was amended to remove the red “detain or release with max conditions” category shown in Figure 1.
There are multiple gaps here between science and policy. The quantitative meaning of a PSA score is the average rate of FTA, NCA, and NVCA among the defendants in a given category. Abstract 6-point scores already obscure these rates to some extent, and as we discuss below, the rearrest rates associated with a given score vary strongly from place to place. But the matrix that translates these scores into recommended actions—what level of supervision a defendant should be under, and what conditions should apply at each level—is entirely a matter of policy.
Thus, by the time the data behind the PSA has turned into a recommendation like “release with a GPS anklet,” we have crossed from the statistics of rearrest and court appearance to much harder policy questions of what interventions are effective and justified. Our hope is that, by looking more closely at what PSA scores mean, we can help clarify this policy discussion, and encourage decision-makers to think clearly about how much risk, and risk of what, would justify pretrial detention or other punitive interventions.
Just as a bank only knows whether an applicant will repay a loan if they are approved, we only know how a defendant will behave if they are released. The New Mexico court system provided us with a rich and growing data set of such defendants. Our results are based on the 15,134 felony defendants arrested in the four-year period from July 2017 through June 2021 whose cases were closed by the end of this period and who were released pretrial. These constitute 70 percent of the closed felony cases during that time.15
For each of these defendants, our data includes:
the source charge originally filed by the district attorney against the defendant
past convictions and past failures to appear
the scores and recommendations produced by the PSA
whether the prosecutor brought a motion to detain
new charges (NCA or NVCA) filed from rearrests during pretrial supervision, if any
failures to appear (FTA) during pretrial supervision, if any.
As in many domains, it is vital to question whether this data is measuring what we think it is.16 Rearrest is not the same as crime. Not all arrests correspond to actual crimes, and conversely, not all crime is reported and leads to an arrest. Based on the National Crime Victimization Survey, the estimated fraction of crimes reported to police is 52 percent for aggravated assault, 47 percent for robbery, and 34 percent for rape and sexual assault.17 Homicide clearance rates—the fraction of cases where an arrest is made—are especially low in low-income communities of color.18
We know of very few studies that look at other signals of pretrial crime such as conviction. A study of the PSA in Los Angeles found that less than one-third of pretrial rearrests led to new convictions.19 Of course, like arrest, conviction is a noisy and biased signal of crime, even if our legal system regards it as the final say.
While we should be alert to the difference between rearrest and crime, this data already tells us a great deal. For instance, of all the violent felony charges filed during our study period, only 5 percent were filed against people who were pretrial defendants at the time, and 3 percent were filed against defendants charged with violent felonies. This suggests that even jailing all felony defendants—which no one is proposing—would only reduce violent felonies by about 5 percent. Jailing all defendants accused of violent felonies would reduce violent felonies by about 3 percent. Despite the large role it plays in the public debate, most crime is not pretrial crime.
In prior work, the current authors and De La Cerda carried out an audit or “revalidation” of the PSA in Albuquerque.20 As one measure of accuracy, we measured the AUC. Formally, the AUC is the area under the receiver operating characteristic (ROC) curve. More concretely, it is the probability that a random pair of people, one who is rearrested and one who is not, are ranked in the “correct” order: namely, that the one who is rearrested is given a higher risk score.21 One reason to use the AUC is that it does not depend on the threshold we use to, say, recommend detention. But this feature is also a bug, since the accuracy and fairness of a policy depends on this threshold.22
For both FTA and NCA we found an AUC of 0.64, similar to other studies of the PSA. To many readers, this number will seem quite low. In many AI classification tasks, AUCs well above 0.95 are common. In medicine, diagnostic tests are not considered acceptable unless their AUC is at least 0.8 or 0.9.23 It is easier to measure biological ground truth than it is to predict human behavior. However, in criminal justice an AUC of 0.64 is considered “good” because of how the AUC relates to other measures of accuracy and correlation.24 For instance, suppose that one-sixth of the population falls into each category, and that rearrest rates increase linearly from 5 percent to 30 percent as the score goes from 1 to 6. Many would consider this a strong relationship between score and risk, but the corresponding AUC is just 0.668. As discussed in this series, a similar accuracy can be achieved by nonexpert humans given a brief summary of the defendant’s case, age, and gender,25 although see also Lin et al.26
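The hypothetical above can be checked directly. The following sketch (our own illustration, not code from the study) computes the AUC for a population split evenly across six score categories with rearrest rates rising linearly from 5 percent to 30 percent, counting ties as half:

```python
# AUC for a hypothetical 6-category score: one-sixth of defendants in each
# category, rearrest rates rising linearly from 5% to 30%.
# AUC = P(rearrested person outscores non-rearrested) + 0.5 * P(tied scores).

rates = [0.05, 0.10, 0.15, 0.20, 0.25, 0.30]  # rearrest rate for scores 1..6
weight = 1 / 6                                # population share per score

# Probability mass of rearrested ("positive") and not-rearrested
# ("negative") defendants in each score category.
pos = [weight * r for r in rates]
neg = [weight * (1 - r) for r in rates]

num = 0.0
for i, p in enumerate(pos):
    for j, n in enumerate(neg):
        if i > j:
            num += p * n        # positive correctly ranked above negative
        elif i == j:
            num += 0.5 * p * n  # tied scores count as half

auc = num / (sum(pos) * sum(neg))
print(round(auc, 3))  # → 0.668
```

Despite what looks like a strong monotone relationship between score and risk, the AUC is only 0.668, which is why AUCs in this range are read differently in criminal justice than in, say, image classification.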
Of course, overall accuracy is not enough. The US justice system has a long history of disparities between gender and racial groups, including in bail.27 To our knowledge, Ferguson et al.28 is the first study of pretrial justice to include a significant Native American population, as well as Black, Hispanic, and White defendants. We found that the PSA is racially fair in the sense that NCA and FTA rates are roughly equal for defendants of different races if their risk scores are the same. For gender, we found that NCA rates are somewhat lower for women than for men with the same risk score.29
This definition of fairness is called calibration. If an algorithm is not well-calibrated, it differentially overestimates the risk of defendants in some groups, and underestimates risk in others. For instance, if Black defendants were rearrested less often than White defendants with the same risk score, this would suggest that the scores given to Black defendants are too high.
Since the PSA is well-calibrated across racial groups, one can argue that it is not, in itself, a source of racial discrimination: that its scores have the same meaning for defendants of different races. However, calibration does not ensure other notions of fairness, and can conflict with them. For instance, since different groups have different distributions of scores, a well-calibrated algorithm typically has different false positive rates between groups.30 For a fascinating debate about this in the criminal justice context, see an article by Angwin et al.31 and the rejoinder by Flores, Bechtel, and Lowenkamp.32
Nor does calibration address systemic inequalities. Many studies have shown that Black people are more likely to be arrested and convicted due to bias in policing, prosecution, and the legal process, and are more likely to be rearrested pretrial for the same reasons.33 In that case, Black defendants will have systematically higher PSA scores, and the PSA will “correctly” predict that they have high rearrest rates. As Eckhouse et al. put it, “The model is trained on data generated by past police bias, and we are asking the model to predict events that are dependent on future police bias” creating “a perfect circularity to the model building and assessment.”34
So, should we use the PSA at all? Like Arnold Ventures, we do not support using the PSA to justify detention. We believe that New Mexico’s current system, where the prosecutor must present individualized evidence that a defendant is dangerous, is the right approach from an ethical and constitutional point of view. We should not deny people their liberty on statistical grounds.
Nevertheless, we believe the PSA can be a useful part of the system if it is interpreted properly. It cannot compute the probability that an individual will commit a crime. But it can give judges a baseline by telling them the average rearrest rate of defendants with similar criminal records. Even though rearrest is a noisy and biased signal of crime, data about it can aid the cause of reform. If the rate of rearrest for serious crimes is low for most defendants—as we show below is the case—this may help convince judges and policymakers that most defendants do not need to be detained.
But as we discuss next, this only works if judges know what PSA scores mean.
When the results of a risk assessment algorithm are given to a judge, parole board, or other decision-maker, they are typically presented as an abstract score or category: a number on a 6- or 10-point scale, a phrase like “high risk,” or a color like orange. Figure 1 shows the report format used until recently for the PSA in Albuquerque. It tells the judge the defendant’s FTA and NCA scores, the input factors used to produce those scores, a red NVCA flag, and a color-coded matrix translating those scores into recommendations.
What do these scores mean, and how are they interpreted by decision-makers? At one level, the “meaning” of a score consists of its input factors—how many past convictions the defendant has and so on. But judges already have access to these. If a risk assessment is worth using, it should tell judges something about risk, including the probability that a defendant will be rearrested if released. Abstract scores and labels don’t convey these probabilities, even approximately. Thus, even if an algorithm is accurate and fair, we question whether judges know what its output means in a quantitative, probabilistic sense.
This raises important issues in psychology and how humans think about risk. Human decision-makers are not very good at translating phrases like “high risk” or “low risk” into quantitative probabilities.35 We tend to overestimate probabilities in “high-risk” categories, and our estimates are dependent on our feelings about crime.36 Like the rest of us, judges are risk-averse, and well aware that the public will blame them if they release someone who commits a terrible crime. Thus, given abstract labels, judges are likely to overestimate how dangerous defendants are.
In this quantitative sense, the meaning of the PSA score differs widely from place to place. Figure 2 compares rearrest rates as a function of NCA score in seven jurisdictions.37 These rates vary by a factor of 2 or 3, especially in the higher-risk categories: an NCA score of 5, for instance, doesn’t mean the same thing in San Francisco as it does in Kentucky. Moreover, these rates also differ significantly from the training data used to develop the PSA, a good lesson that training and test data are often distributed very differently.
This variation between jurisdictions could be due to a vast number of different factors. Certainly, human behavior and crime rates vary from place to place. But so do arrest rates, police resources, the lenience or strictness of pretrial release decisions, the effect of pretrial supervision programs, and the length of time between arrest and trial. For instance, it is possible that San Francisco’s high rearrest rates are due to delays in the court system (creating long pretrial periods) and arrests for crimes of drug addiction and homelessness, or alternatively that arrests are focused on repeat offenders. But in the absence of more detailed data, this is just speculation.
The wide range of rearrest rates in Figure 2 drives home the need for periodic local revalidations in each jurisdiction where a risk assessment is used.38 An algorithm that is accurate and fair in one place and time might not be in another. Even in a single jurisdiction, the validity of an algorithm may change over time as new pretrial services or interventions become available.39 California state law SB36 calls for pretrial risk assessment algorithms to be revalidated in each jurisdiction every three years. Other states should follow suit, and similar policies should apply for algorithms used in other high-stakes domains.
Some jurisdictions are experimenting with reports that give the judge rearrest rates, or equivalently success rates for avoiding new charges. Figure 3 shows such a report from El Paso. However, even these rates are not as informative as we would like. Let us look closer.
To make their decisions in a principled way, judges and other stakeholders need to understand how much risk, and risk of what, defendants pose to the public. Even the US Bail Reform Act of 1984 directs judges to consider the “nature and seriousness of the danger to any person or the community that would be posed by the person’s release.”40 Slobogin writes “policymakers must identify much more precisely than they have to date what likelihood, of what outcome, during what period of time, justifies incarceration as opposed to some lesser intervention.”41
Unfortunately, almost all studies of pretrial rearrest, including those shown in Figure 2, lump very different kinds of crime together. It is standard to separate violent and nonviolent charges (NCA and NVCA) and algorithms typically make separate predictions for these two types of rearrest. But we know of very few studies that separate charges by severity, for example, whether someone is rearrested for a felony or a misdemeanor. Using the Colorado Pretrial Assessment Tool (CPAT) as an example, Schnacke42 notes that the “high-risk” label “could mean risk of a new filing for anything from a petty or traffic offense all the way to homicide… But the risk of an armed robber committing another armed robbery is far different from the risk of that robber trespassing. It would help to know the distinction.”
Put yourself in the shoes of a judge. A defendant stands before you accused of, say, a second-degree violent felony. An algorithm tells you that they have a 20 percent chance of rearrest if released. It would be understandable if you assumed that this refers to the chance that they will commit another crime of similar type (violent) and similar severity (a high-level felony). But is that true?
In Moore, Ferguson, and Guerin43 we break down that 20 percent, and find that rearrest for high-level felonies is extremely rare. For first-degree felonies, it is 0.1 percent, one out of a thousand; for second-degree felonies, it is about 1 percent. The largest category of new charges is fourth-degree felonies, and about one-third of rearrests are for misdemeanors or petty misdemeanors. This breakdown holds roughly across all NCA score categories (see Figure 4) and also for rearrest for violent crimes (Figure 5).
We are not trying to minimize the impact that misdemeanors or low-level felonies have on their victims. However, without this breakdown, we question whether judges understand what risk scores mean. The risk that a defendant will commit a petty misdemeanor should carry less weight than the risk that they will commit a high-level felony. These should not be lumped together.
So, what kind of information should judges be given? Should risk assessment reports be like El Paso’s (Figure 3) but more elaborate, giving a table of probabilities that a defendant will be rearrested for violent or nonviolent crimes at each level of severity? Going too far in this direction would surely lead to information overload. Moreover, as Figure 5 shows, dividing rearrests into smaller subgroups creates larger error bars, so that the probabilities we can report to judges are themselves somewhat uncertain. Clearly there is a danger that judges will ignore the algorithm entirely and rely instead on their intuitions.
We believe that the question of what report or interface would be most useful to judges, and most likely to make the pretrial system fairer, is itself an important area of research. That said, we believe that simple categories like “high risk” allow judges’ stereotypes to distort the underlying probabilities. Perhaps a good compromise would distinguish between felonies and misdemeanors, with a note reminding judges that rearrests for first- and second-degree felonies are rare.
Just as we did for rearrest, let us look closer at failure to appear (FTA). In all the studies we know of, FTA is defined as a single bit: 1 if a defendant fails to appear at one or more of their court hearings, and 0 if they appear at all of them. This one-bit definition makes it impossible to distinguish defendants who try to evade the system from those who miss an initial hearing and then voluntarily attend their remaining hearings, of which there can easily be five or more in a typical case.
We found that about 86 percent of defendants with FTAs only miss one hearing. These defendants are not “flight risks” who are trying to escape justice: they miss hearings due to inadequate notification, inflexible job schedules, lack of transportation or child care, or a lack of knowledge of the consequences of nonappearance. Rather than detaining them, we can provide reminders, night and weekend courts, and transportation vouchers. (One jurisdiction in New Mexico is even using gift cards as an incentive for appearance.) Well-designed and well-timed text messages that remind individuals of upcoming hearings, ask them to make a plan to get there, and alert them that failing to appear could lead to arrest can reduce FTA rates by 26 percent, as Cooke et al.44 found. Improving court appearance rates is fairer to everyone, including victims and witnesses, who often come from the same vulnerable populations as defendants.
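The contrast between the one-bit FTA definition and a richer signal is easy to make concrete. In this sketch (hearing logs are invented for illustration), both defendants receive the same FTA flag even though their behavior is very different:

```python
# Contrast the standard one-bit FTA flag with a count of missed hearings.
# Hearing logs are illustrative: True = appeared, False = missed.
defendants = {
    "missed_one": [False, True, True, True, True],  # missed only the first hearing
    "evading":    [False, False, False, False],     # missed every hearing
}

for name, hearings in defendants.items():
    fta_flag = any(not appeared for appeared in hearings)  # one-bit definition
    n_missed = sum(not appeared for appeared in hearings)  # richer signal
    print(f"{name}: FTA flag = {fta_flag}, hearings missed = {n_missed}")
```

Under the one-bit definition both defendants are identical "failures to appear"; counting missed hearings separates the defendant who returned voluntarily from the one trying to evade the system.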
While we were carrying out our audit of the PSA, a debate was brewing in the New Mexico State Legislature. A number of officeholders, including the district attorney and the governor (both Democrats), argued that too many dangerous defendants were being released.
This debate is not about algorithms jailing people unfairly.45 In fact, it is largely the reverse. The district attorney complained that the PSA gives low scores to some defendants who should be detained, and presented examples of defendants who were released and then convicted of serious new crimes, including homicide. The DA objected to the fact that the PSA mainly uses past convictions, and argued that it fails to take other kinds of information into account—for instance, if a defendant was likely guilty of a prior offense, but acquitted because a victim was unwilling to testify. Even though prosecutors can present this kind of evidence as part of a motion to detain, judges in Albuquerque rule against these motions about 40 percent of the time.46 While the PSA is only a recommendation, the DA felt that judges relied on it too much.
In response, a bipartisan group of legislators proposed state laws that would instruct judges to consider certain sets of defendants dangerous, and therefore detain them by default. While judges would still have the final say, these laws—called rebuttable presumptions—would shift the burden of proof from the prosecution to the defense. A typical proposal recommended detention for defendants charged with “serious violent offenses,” who used or brandished a firearm, or who have a past history of failing to appear.
In essence, these laws predict that certain sets of defendants have a high risk of rearrest. We realized that we can test these claims with the same data that we used to audit the PSA, treating these laws as algorithms by another name. After all, a policy like “if [condition] then recommend detention” might as well be a line of code, and it should be examined for accuracy and fairness just as an algorithm should.
For each proposed law, we performed a kind of natural experiment.47 We went back to see how many of the released defendants in our data set would have been detained if the law had been in place, and how many of those were in fact rearrested. Thus, we estimated how many people the law would detain, and how many crimes it would prevent.
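The logic of this natural experiment can be sketched in a few lines. The records and field names below are synthetic illustrations, not the study's data, and the detention condition is one hypothetical rebuttable presumption:

```python
# Treating a proposed law as an algorithm: "if [condition] then detain."
# We replay it against defendants who were in fact released, to estimate
# how many it would have detained and how many rearrests that would
# have prevented. Records are synthetic; field names are our own.

defendants = [
    {"charge_violent": True,  "fta_history": False, "rearrested_violent": False},
    {"charge_violent": True,  "fta_history": True,  "rearrested_violent": True},
    {"charge_violent": False, "fta_history": True,  "rearrested_violent": False},
    {"charge_violent": False, "fta_history": False, "rearrested_violent": False},
]

def presumes_detention(d):
    # Hypothetical presumption: violent charge or a history of FTAs.
    return d["charge_violent"] or d["fta_history"]

flagged = [d for d in defendants if presumes_detention(d)]
prevented = sum(d["rearrested_violent"] for d in flagged)
print(f"would detain {len(flagged)} to prevent {prevented} violent rearrest(s)")
```

The ratio of defendants detained to rearrests prevented is exactly the quantity the legislature needs in order to decide whether a presumption is justified.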
While their proponents argue that these laws “zero in” on the most dangerous defendants, we found that this is far from the case. Among the defendants identified by a typical proposal, the rate of rearrest for violent felonies is 4–5 percent. For comparison, among all 15,134 released defendants in our data set, this rate is 3 percent; it is 6–8 percent among those with NCA scores of 6 and the violence flag. The broadest proposed laws performed little better than chance, and would jail thousands of defendants per year unnecessarily.
In our presentations to the legislature, we urged them to articulate what risk, of what kind of crime, would justify pretrial detention. Leaving constitutional issues aside, these laws might be justified if, say, a 5 percent chance of rearrest for a violent felony is enough—that is, jailing twenty defendants for every violent felony prevented. But if we set the threshold at 3 percent, it would justify detaining all felony defendants. It is not up to us as researchers to define this threshold, but we can measure what the effects of different thresholds would be.
Of course, not all legislators agree that this kind of quantitative thinking is relevant. For many of them, the issue is not one of statistics, but of right and wrong. A few released defendants do commit terrible crimes—including homicides, where the victims’ mothers testified in favor of these bills—and we should prevent these crimes if we can. Our (very respectful) response was that serious crime is difficult to predict, and that detention also has a human cost. Casting a net wide enough to catch these few defendants would, as far as we can tell, also detain many others who do not pose the same danger. While there are defendants who should be detained based on individualized evidence—say, because they have threatened a witness—it seems very hard to define a group of defendants who are highly dangerous based on factors like their criminal record.
For one of us (CM), this work has been an incredible education. Interacting with judges, court administrators, prosecutors, public defenders, and legislators has given me a much deeper sense of how the justice system works, how debates about it play out in the real world, and what role—positive or negative—algorithms might play in criminal justice reform.
Here are three lessons that my fellow computer scientists may benefit from.
Seek out opportunities to engage in studies like these. This collaboration came out of a long series of meetings and conversations. I cold-called legislators, lawyers, and civil rights organizations. Some did not respond; others did and recommended further contacts. Eventually I met with a former state Supreme Court Justice, Charles Daniels, who was the main architect of New Mexico’s bail reform. He introduced me to the administrators of the state’s pretrial supervision program, who in turn introduced me to my collaborators. All these people tolerated my naive questions about the justice system, and were grateful that I wanted to get involved.
In many real-world settings, what we need is not better algorithms, but better data. In most of computer science, we take the data as given; we have no say over the inputs our algorithms receive. While detailed proposals have been made to assess the quality and provenance of data sets,48 most of the discussion in the algorithmic fairness community has focused on whether the data is representative of all subpopulations.
Here we have made a different point. Digging deeper into the system and going beyond the standard variables, like the one-bit definitions of New Criminal Activity and Failure To Appear—distinguishing charges by severity, and asking which court hearings people miss—can help clarify what is going on, and what policies might help. Embrace complexity. There is no reason to look at reality through a low-bandwidth channel if we can do better.
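As an illustration of the difference between one-bit and richer outcome coding (the field names and categories below are hypothetical, not the court data's actual schema):

```python
from dataclasses import dataclass, field

# One-bit coding: very different outcomes collapse into the same two flags.
one_bit = {"new_criminal_activity": True, "failure_to_appear": True}

# Richer coding (hypothetical schema): keep the charge severity and
# which hearings were missed, the distinctions the text argues matter.
@dataclass
class PretrialOutcome:
    rearrest_charge: str = ""                      # e.g. "misdemeanor", "felony_1"
    hearings_missed: list = field(default_factory=list)

outcome = PretrialOutcome(rearrest_charge="misdemeanor",
                          hearings_missed=["pretrial conference"])

# Under the one-bit view, this defendant is indistinguishable from one
# rearrested for a first-degree felony who missed every hearing.
assert one_bit["new_criminal_activity"] and len(outcome.hearings_missed) == 1
```

The point is not the data structure itself but what it refuses to throw away: with severity and hearing-level detail preserved, questions like "how many rearrests are for high-level felonies?" become answerable at all.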
If you want to help the real world, dive into it. We often approach interdisciplinary work as a kind of transaction. A collaborator from another field gives us data, and we apply our techniques to it. We each stay in our lane: they are the domain experts, and we are “domain agnostic,” staying aloof from what the data is actually about.
But deep interdisciplinary work means crossing these lanes. If you care about the domain, you should learn about it. If you care about housing and credit, study the history of segregation;49 if you care about justice, find out how the justice system works.50 Approach your work with your whole brain and heart. Your life will be more intellectually satisfying, you will grow as a scientist, and your work will be more likely to be useful to the world.
How should a risk assessment algorithm’s results be presented to a judge? What should the judge know about its inputs and outputs, and what they mean?
Algorithms like the PSA give statistical predictions based on past defendants with similar criminal records. How do you think a judge should combine this information with evidence about an individual defendant? Imagine arguing for both the prosecution and the defense that a defendant is an exception.51
We have argued that better data is often more important than better algorithms—for instance, by revealing that most rearrests are for lower-level crimes. But more detailed data can also come at a cost to privacy. How would you balance these concerns?
By design, the PSA is a very simple algorithm—just add up the points. Suppose a more sophisticated algorithm is slightly more accurate, but harder to understand. Would you use it? How does the relative importance of accuracy vs. interpretability vary between domains like justice, health, lending, and so on?
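For concreteness, an additive score in the spirit of "just add up the points" can be sketched as follows; the factors and point values are invented for illustration and are not the PSA's actual scoring rules.

```python
# Invented example of an additive point score; these factors and weights
# are illustrative only, not the PSA's actual scoring rules.
def risk_points(pending_charge: bool, prior_convictions: int,
                prior_failures_to_appear: int) -> int:
    points = 0
    if pending_charge:
        points += 1
    points += min(prior_convictions, 2)         # capped contribution
    points += min(prior_failures_to_appear, 2)  # capped contribution
    return points

# A judge or defendant can audit this by hand, which is the point:
assert risk_points(pending_charge=True, prior_convictions=3,
                   prior_failures_to_appear=0) == 3
```

A more accurate but opaque model would trade away exactly this hand-auditability, which is the tension the question asks you to weigh.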
This case study focused mostly on the overall accuracy of algorithms, and only secondarily on disparities between racial, ethnic, and gender groups. What do you think of this? What are the tensions, and the points of positive contact, between these two focuses?
We have tried to put machine learning and human-generated “algorithms” on the same footing, arguing that legislative proposals should be audited for fairness and accuracy with the same tools we use for computer code. Where might this analogy break down?
How can algorithms and data facilitate a democratic discussion of pretrial detention and the criminal justice system in general? Are algorithms ultimately allies or enemies in criminal justice reform? How does this depend on how they are designed, deployed, and used?
Is this kind of work something you see as part of your profession or your education?
We thank David Ball, Tina Eliassi-Rad, Stephanie Forrest, Mirta Galesic, Joshua Garland, Sandra Mayson, Brandon Ogbunu, Christopher Slobogin, and Tim Schnacke for comments and feedback on a draft of this article; they may or may not agree with our conclusions. CM is funded in part by the National Science Foundation through grant BIGDATA-1838251.
Abebe, Rediet, Solon Barocas, Jon Kleinberg, Karen Levy, Manish Raghavan, and David G. Robinson. “Roles for Computing in Social Change.” In FAT* 20: Proceedings of the 2020 ACM Conference on Fairness, Accountability, and Transparency, ed. M. Hildebrant et al., 252–60. New York: ACM, 2020.
American Bar Association. Standards for Criminal Justice: Pretrial Release. 3d ed. Chicago: ABA, 2007. https://www.americanbar.org/content/dam/aba/publications/criminal_justice_standards/pretrial_release.pdf.
Angwin, Julia, Jeff Larson, Surya Mattu, and Lauren Kirchner. “Machine Bias: There’s Software Used Across the Country to Predict Future Criminals. And It’s Biased Against Blacks.” ProPublica, May 23, 2016. https://www.propublica.org/article/machine-bias-risk-assessments-in-criminal-sentencing.
Arnold, David, Will Dobbie, and Crystal S. Yang. “Racial Bias in Bail Decisions.” Quarterly Journal of Economics 133 (2018): 1885–1932.
Ball, W. David. “The Peter Parker Problem.” NYU Law Review 95, no. 4 (2020): 879–917.
Barabas, Chelsea, Colin Doyle, J. B. Rubinovitz, and Karthik Dinakar. “Studying Up: Reorienting the Study of Algorithmic Fairness Around Issues of Power.” In FAT* '20: Proceedings of the 2020 Conference on Fairness, Accountability, and Transparency, ed. M. Hildebrant et al., 167–76. New York: ACM, 2020.
Bell, Kristen, Jenny Hong, Nick McKeown, and Catalin Voss. “Recon Approach: A New Direction for Machine Learning in Criminal Law.” Berkeley Technology Law Journal 37 (2020).
Benjamin, Ruha. Race After Technology: Abolitionist Tools for the New Jim Code. New York: Wiley, 2019.
Berk, Richard A., Susan B. Sorenson, and Geoffrey Barnes. “Forecasting Domestic Violence: A Machine Learning Approach to Help Inform Arraignment Decisions.” Journal of Empirical Legal Studies 13, no. 1 (2016): 94–115.
Cen, Sarah H., and Manish Raghavan. “The Right to Be an Exception to a Data-Driven Rule,” MIT Case Studies in Social and Ethical Responsibilities of Computing (Winter 2023). https://mit-serc.pubpub.org/pub/right-to-be-exception/release/2.
Chouldechova, Alexandra. “Fair Prediction with Disparate Impact: A Study of Bias in Recidivism Prediction Instruments.” Big Data 5, no. 2 (2017): 153–63.
Cooke, Brice, Binta Zahra Diop, Alissa Fishbane, Jonathan Hayes, Aurelie Ouss, and Anuj Shah. Using Behavioral Science to Improve Criminal Justice Outcomes: Preventing Failures to Appear in Court. Chicago: Ideas42 and University of Chicago Crime Lab, 2018. https://www.ideas42.org/wp-content/uploads/2018/03/Using-Behavioral-Science-to-Improve-Criminal-Justice-Outcomes.pdf.
County of Los Angeles. Pretrial Risk Assessment Tool Validation, 2021. http://file.lacounty.gov/SDSInter/probation/1109503_LosAngeles-UPDATEDCountyvalidation-FINAL7-1-21.pdf.
DeMichele, Matthew, Peter Baumgartner, Michael Wenger, Kelle Barrick, Megan Comfort, and Shilpi Misra. The Public Safety Assessment: A Re-Validation and Assessment of Predictive Utility and Differential Prediction by Race and Gender in Kentucky. SSRN, April 30, 2018. http://dx.doi.org/10.2139/ssrn.3168452.
DeMichele, Matthew, Ian A. Silver, and Ryan M. Labrecque. Locked Up and Awaiting Trial: A Natural Experiment Testing the Criminogenic and Punitive Effects of Spending a Week or More in Pretrial Detention. SSRN, June 5, 2023. https://papers.ssrn.com/abstract=4467619.
Denman, Kristine, Ella Siegrist, Joel Robinson, Ashleigh Maus, and Jenna Dole. Bail Reform: Motions for Pretrial Detention and their Outcomes. Albuquerque: New Mexico Statistical Analysis Center, 2021. http://isr.unm.edu/reports/2021/bail-reform-motions-for-pretrial-detention-and-their-outcomes.pdf.
Desmarais, Sarah L., and Jay P. Singh. Risk Assessment Instruments Validated and Implemented in Correctional Settings in the United States. Lexington, KY: Council of State Governments, 2013. https://csgjusticecenter.org/wp-content/uploads/2020/02/Risk-Assessment-Instruments-Validated-and-Implemented-in-Correctional-Settings-in-the-United-States.pdf.
Digard, Leon, and Elizabeth Swavola. Justice Denied: The Harmful and Lasting Effects of Pretrial Detention. New York: Vera Institute of Justice, 2019.
Dressel, Julia, and Hany Farid. “The Dangers of Risk Prediction in the Criminal Justice System,” MIT Case Studies in Social and Ethical Responsibilities of Computing (Winter 2021). https://mit-serc.pubpub.org/pub/risk-prediction-in-cj/release/2.
Dressel, Julia, and Hany Farid. “The Accuracy, Fairness, and Limits of Predicting Recidivism.” Science Advances 4, no. 1 (2018): eaao5580.
Durham, Rebekah. “Innocent Until Suspected Guilty.” University of Cincinnati Law Review 90 (2021). https://scholarship.law.uc.edu/uclr/vol90/iss2/7.
Eckhouse, Laurel, Kristian Lum, Cynthia Conti-Cook, and Julie Ciccolini. “Layers of Bias: A Unified Approach for Understanding Problems with Risk Assessment.” Criminal Justice and Behavior 46, no. 2 (2019): 185–209.
Ferguson, Elise, Helen De La Cerda, Paul Guerin, and Cristopher Moore. Bernalillo County Public Safety Assessment Validation Study. Albuquerque: Institute for Social Research, University of New Mexico, 2021. https://isr.unm.edu/reports/2021/bernalillo-county-public-safety-assessment-validation-study.pdf.
Fishbane, Alissa, Aurelie Ouss, and Anuj K. Shah. “Behavioral Nudges Reduce Failure to Appear for Court.” Science 370 (2020): eabb6591. https://science.sciencemag.org/content/370/6517/eabb6591.
Flores, Anthony W., Kristin Bechtel, and Christopher T. Lowenkamp. “False Positives, False Negatives, and False Analyses: A Rejoinder to ‘Machine Bias: There's Software Used Across the Country to Predict Future Criminals. And It's Biased Against Blacks.’” Federal Probation Journal 80, no. 2 (2016): 38–46.
Gebru, Timnit, Jamie Morgenstern, Briana Vecchione, Jennifer Wortman Vaughan, Hanna Wallach, Hal Daumé III, and Kate Crawford. “Datasheets for Datasets.” Communications of the ACM 64, no. 12 (2021): 86–92.
Goel, Sharad, Ravi Shroff, Jennifer L. Skeem, and Christopher Slobogin. “The Accuracy, Equity, and Jurisprudence of Criminal Risk Assessment.” In Research Handbook on Big Data Law, ed. R. Vogl, 9–28. Northampton, MA: Edward Elgar Publishing, 2021.
Greiner, D. James, Matthew Stubenberg, and Ryan Halen. Validation of the PSA in Harris County, TX. Cambridge, MA: Access to Justice Lab, Harvard Law School, 2020. https://a2jlab.org/wp-content/uploads/2020/11/Validation-of-the-PSA-in-Harris-County-TX.pdf.
Greiner, D. James, Matthew Stubenberg, and Ryan Halen. Validation of the PSA in Kane County, IL. Cambridge, MA: Access to Justice Lab, Harvard Law School, 2021. https://a2jlab.org/wp-content/uploads/2021/03/Validation-of-the-PSA-in-Kane-County-IL.pdf.
Hegreness, Matthew J. “America’s Fundamental and Vanishing Right to Bail.” Arizona Law Review 55, no. 4 (2013): 909.
Hinton, Elizabeth, LeShae Henderson, and Cindy Reed. An Unjust Burden: The Disparate Treatment of Black Americans in the Criminal Justice System. New York: Vera Institute of Justice, 2018. https://www.vera.org/downloads/publications/for-the-record-unjust-burden-racial-disparities.pdf.
Kearns, Michael, and Aaron Roth. The Ethical Algorithm: The Science of Socially Aware Algorithm Design. New York: Oxford University Press, 2019.
Klein, Brennan, C. Brandon Ogbunugafor, Benjamin J. Schafer, Zarana Bhadricha, Preeti Kori, Jim Sheldon, Nitish Kaza, Arush Sharma, Emily A. Wang, Tina Eliassi-Rad, Samuel V. Scarpino, and Elizabeth Hinton. “COVID-19 Amplified Racial Disparities in the US Criminal Legal System.” Nature 617 (2023): 344–50.
Kleinberg, Jon, Himabindu Lakkaraju, Jure Leskovec, Jens Ludwig, and Sendhil Mullainathan. “Human Decisions and Machine Predictions.” Quarterly Journal of Economics 133, no. 1 (2017): 237–93.
Kleinberg, Jon, Jens Ludwig, Sendhil Mullainathan, and Cass R. Sunstein. “Algorithms as Discrimination Detectors.” Proceedings of the National Academy of Sciences 117, no. 48 (2020): 30096–100.
Kleinberg, Jon, Sendhil Mullainathan, and Manish Raghavan. “Inherent Trade-Offs in the Fair Determination of Risk Scores.” Proceedings of the 8th Conference on Innovations in Theoretical Computer Science (2017): 1–23. https://arxiv.org/abs/1609.05807.
Koepke, John Logan, and David G. Robinson. “Danger Ahead: Risk Assessment and the Future of Bail Reform.” Washington Law Review 93 (2018). https://ssrn.com/abstract=3041622.
Krauss, Daniel A., Gabriel I. Cook, and Lukas Klapatch. “Risk Assessment Communication Difficulties: An Empirical Examination of the Effects of Categorical Versus Probabilistic Risk Communication in Sexually Violent Predator Decisions.” Behavioral Sciences and the Law 36, no. 5 (2018): 532–53.
Kwegyir-Aggrey, Kweku, Marissa Gerchick, Malika Mohan, Aaron Horowitz, and Suresh Venkatasubramanian. “The Misuse of AUC: What High Impact Risk Assessment Gets Wrong.” In Proceedings of the 2023 ACM Conference on Fairness, Accountability, and Transparency, ed. S. Fox et al., 1570–83. New York: ACM, 2023.
Leadership Conference on Civil and Human Rights. “The Use of Pretrial ‘Risk Assessment’ Instruments: A Shared Statement of Civil Rights Concerns.” 2018. http://civilrightsdocs.info/pdf/criminal-justice/Pretrial-Risk-Assessment-Full.pdf.
Lin, Zhiyuan “Jerry,” Jongbin Jung, Sharad Goel, and Jennifer Skeem. “The Limits of Human Predictions of Recidivism.” Science Advances 6, no. 7 (2020): eaaz0652.
Lowenkamp, Christopher, Matthew DeMichele, and Lauren Klein Warren. Replication and Extension of the Lucas County PSA Project. SSRN, November 9, 2020. https://ssrn.com/abstract=3727443.
Mayson, Sandra G. “Dangerous Defendants.” Yale Law Journal 127 (2018): 490–567.
Mayson, Sandra G. “Bias In, Bias Out.” Yale Law Journal 128 (2019): 2218–300.
McIntyre, Frank, and Shima Baradaran. “Race, Prediction, and Pretrial Detention.” Journal of Empirical Legal Studies 10, no. 4 (2013): 741–70.
Mizel, Matthew. A Plea for Justice: Racial Bias in Pretrial Decision Making (PhD thesis, UCLA, 2018).
Moore, Cristopher, Elise Ferguson, and Paul Guerin. “How Accurate are Rebuttable Presumptions of Pretrial Dangerousness? A Natural Experiment from New Mexico.” Journal of Empirical Legal Studies 20, no. 2 (2023a): 377–408.
Moore, Cristopher, Elise Ferguson, and Paul Guerin. How Much Risk, and Risk of What? A Closer Look at Pretrial Rearrest and Risk Assessment. SSRN, February 10, 2023b. https://ssrn.com/abstract=4352036.
Nahm, Francis Sahngun. “Receiver Operating Characteristic Curve: Overview and Practical Use for Clinicians.” Korean Journal of Anesthesiology 75, no. 1 (2022): 25–36.
National Association of Pretrial Services Agencies. Standards on Pretrial Release. Washington, DC: NAPSA, 2020. napsa.org.
Overbeck, Ann M. “Detention for the Dangerous: The Bail Reform Act of 1984.” University of Cincinnati Law Review 55 (1986): 153–99.
Partnership on AI. Report on Algorithmic Risk Assessment Tools in the US Criminal Justice System. 2019. https://partnershiponai.org/paper/report-on-machine-learning-in-risk-assessment-tools-in-the-u-s-criminal-justice-system/.
Pew Research Center. What the Data Says (And Doesn’t Say) About Crime in the United States. November 20, 2020. https://www.pewresearch.org/fact-tank/2020/11/20/facts-about-crime-in-the-u-s/.
Rabuy, Bernadette. Pretrial Detention Costs $13.6 Billion Each Year. Northampton, MA: Prison Policy Initiative, 2017. https://www.prisonpolicy.org/blog/2017/02/07/pretrial_cost/.
Rakoff, Jed S. Why the Innocent Plead Guilty and the Guilty Go Free: And Other Paradoxes of Our Broken Legal System. New York: Farrar, Straus and Giroux, 2021.
Rice, Marnie E., and Grant T. Harris. “Comparing Effect Sizes in Follow-Up Studies: ROC Area, Cohen's d, and R.” Law and Human Behavior 29, no. 5 (2005): 615–20.
Rothstein, Richard. The Color of Law: A Forgotten History of How Our Government Segregated America. New York: Liveright Publishing, 2017.
Rudin, Cynthia. “Stop Explaining Black Box Machine Learning Models for High Stakes Decisions and Use Interpretable Models Instead.” Nature Machine Intelligence 1 (2019), 206–15.
Sawyer, Wendy, and Peter Wagner. Mass Incarceration: The Whole Pie. Northampton, MA: Prison Policy Initiative, 2023. https://www.prisonpolicy.org/reports/pie2023.html.
Schlesinger, Traci. “Racial and Ethnic Disparity in Pretrial Criminal Processing.” Justice Quarterly 22, no. 2 (2005): 170–92.
Schnacke, Timothy R. “Model” Bail Laws: Re-Drawing the Line Between Pretrial Release and Detention. Cleveland, OH: Center for Legal and Evidence-Based Practices, 2017. http://www.clebp.org/images/04-18-2017_Model_Bail_Laws_CLEPB_.pdf.
Schnacke, Timothy R., Michael R. Jones, and Claire M. Brooker. The History of Bail and Pretrial Release. Baltimore: Pretrial Justice Institute, 2010.
Scurich, Nicholas. “The Case Against Categorical Risk Estimates.” Behavioral Sciences and the Law 36, no. 5 (2018): 554–64.
Siegrist, Ella J., Jenna L. Dole, Kristine Denman, Ashleigh Maus, Joel Robinson, Callie Dorsey, and Graham White. Implementing Bail Reform in New Mexico. Albuquerque: Institute for Social Research, University of New Mexico, 2020. http://isr.unm.edu/reports/2020/implementing-bail-reform-in-new-mexico.pdf.
Skog, Alissa, and Johanna Lacoe. Validation of the PSA in San Francisco. Los Angeles: California Policy Lab, 2021. https://www.capolicylab.org/wp-content/uploads/2021/08/Validation-of-the-PSA-in-San-Francisco.pdf.
Slobogin, Christopher. Just Algorithms. Cambridge: Cambridge University Press, 2021.
Wagner, Claudia, Markus Strohmaier, Alexandra Olteanu, Emre Kıcıman, Noshir Contractor, and Tina Eliassi-Rad. “Measuring Algorithmically Infused Societies.” Nature 595 (2021): 197–204.
“Where Murders Go Unsolved.” Washington Post, June 6, 2018. https://www.washingtonpost.com/graphics/2018/investigations/where-murders-go-unsolved/.
Advancing Pretrial Policy and Research. About the Public Safety Assessment. 2022. https://advancingpretrial.org/psa/factors/.
Angel, Arthur R., Eric D. Green, Henry R. Kaufman, and Eric E. Van Loon. “Preventive Detention: An Empirical Analysis.” Harvard Civil Rights-Civil Liberties Law Review 6 (1971): 300–96, with a Foreword by Sam J. Ervin Jr., “Preventive Detention—A Step Backward for Criminal Justice.”
Berger, Marty. “The Constitutional Case for Clear and Convincing Evidence in Bail Hearings.” Stanford Law Review 75 (forthcoming).
Berk, Richard, Hoda Heidari, Shahin Jabbari, Michael Kearns, and Aaron Roth. “Fairness in Criminal Justice Risk Assessments: The State of the Art.” Sociological Methods & Research 50, no. 1 (2021): 3–44.
Fagan, Jeffrey, and Martin Guggenheim. “Preventive Detention and the Judicial Prediction of Dangerousness for Juveniles: A Natural Experiment.” Journal of Criminal Law and Criminology 86 (1996): 415–48.
Ferguson, Elise, Helen De La Cerda, and Paul Guerin. Failure to Appear and New Criminal Activity: Outcome Measures for Preventive Detention and Public Safety Assessments. Albuquerque: Institute for Social Research, University of New Mexico, 2020. https://www.nmcourts.gov/wp-content/uploads/2020/12/failure_to_appear_and_new_criminal_activity_outcome_measures_for_preventive_detention_and_public_safety_assessments.pdf.
Goldkamp, John S. “Danger and Detention: A Second Generation of Bail Reform.” Journal of Criminal Law and Criminology 76, no. 1 (1985): 1–74.
Himsell, Scott D. “Preventive Detention: A Constitutional but Ineffective Means of Fighting Pretrial Crime.” Journal of Criminal Law and Criminology 77 (1986): 439.
Kingsnorth, Rodney, Carole Barnes, Cynthia Davis, Tina Hodgins, and Camille Nicholes. “Preventive Detention: The Impact of the 1984 Bail Reform Act in the Eastern Federal District of California.” Criminal Justice Policy Review 2, no. 2 (1987): 149–72.
Lessig, Lawrence. Code 2.0. LESSIG, 2006. http://codev2.cc/download+remix/.
Pleiss, Geoff, Manish Raghavan, Felix Wu, Jon Kleinberg, and Kilian Q. Weinberger. “On Fairness and Calibration.” Advances in Neural Information Processing Systems 30 (2017).
Stevenson, Megan. “Assessing Risk Assessment in Action.” Minnesota Law Review 103 (2018): 303.
Viljoen, Jodi L., Melissa R. Jonnson, Dana M. Cochrane, Lee M. Vargen, and Gina M. Vincent. “Impact of Risk Assessment Instruments on Rates of Pretrial Detention, Postconviction Placements, and Release: A Systematic Review and Meta-analysis.” Law and Human Behavior 43, no. 5 (2019): 397–420.