Algorithmic Fairness in Chest X-ray Diagnosis: A Case Study

Published on Feb 27, 2023

Machine learning models are being increasingly deployed in real-world clinical environments. However, these models often exhibit disparate performance between population groups, potentially leading to inequitable and discriminatory predictions. In this case study, we use several distinct concepts of algorithmic fairness to analyze a deep learning model that predicts from their chest X-ray whether someone has a disease. After observing disparities in the false positive rate and false negative rate between groups from several protected classes, we apply algorithmic fairness methods to remove such disparities. However, we find that such algorithmic interventions can have serious unintended consequences. Finally, we question what the appropriate definition of fairness is in the clinical context, and advocate for an investigation of bias in the data whenever possible, as opposed to blindly applying algorithmic interventions.

Keywords: algorithmic fairness, deep learning, medical imaging, machine learning for health care

Haoran Zhang
Department of Electrical Engineering and Computer Science, Massachusetts Institute of Technology

Thomas Hartvigsen
Department of Electrical Engineering and Computer Science, Massachusetts Institute of Technology

Marzyeh Ghassemi
Department of Electrical Engineering and Computer Science & Institute for Medical Engineering and Science, Massachusetts Institute of Technology

Authors’ Note: This SERC case study incorporates some material that was originally presented in Laleh Seyyed-Kalantari, Haoran Zhang, Matthew B. A. McDermott, Irene Y. Chen, and Marzyeh Ghassemi, “Underdiagnosis Bias of Artificial Intelligence Algorithms Applied to Chest Radiographs in Under-Served Patient Populations,” Nature Medicine 27, no. 12 (2021): 2176–82; and in Haoran Zhang, Natalie Dullerud, Karsten Roth, Lauren Oakden-Rayner, Stephen Pfohl, and Marzyeh Ghassemi, “Improving the Fairness of Chest X-ray Classifiers,” Proceedings of the Conference on Health, Inference, and Learning, PMLR 174 (2022): 204–33.

Learning Objectives:

  • Understand commonly used definitions in algorithmic fairness.

  • Learn about how algorithmic fairness is applied to machine learning models in the health care setting.

  • Recognize potential sources of bias that may explain performance disparities observed in clinical machine learning models.

  • Identify various best practices for applying algorithmic fairness in the clinical setting.

  • Recognize the limitations and trade-offs of algorithmic approaches for achieving fairness.


As machine learning models reach and surpass human-level performance on many health care tasks, they are increasingly deployed in real-world hospitals, laboratories, and intensive care units.1 However, machine learning models have been shown to frequently exhibit unfair behavior in the form of disparate performance across protected groups.2 (“Protected groups” are defined within US antidiscrimination law as groups of people who share common characteristics, such as age, race or ethnicity, national origin, or sex.3) For example, commercial facial analysis systems have been shown to exhibit worse performance on Black women, and models that classify occupations from biographies often conform to existing gender stereotypes.4

Fairness is an important consideration in the health care setting for several reasons.5 First, health care is a particularly high-stakes application area, where incorrect model predictions can be the difference between life and death. Second, there is a long history of systematic treatment disparities between protected groups in medical practice.6 Moving forward, machine learning systems must be constructed without encoding such biases.7

In this work, we study machine learning algorithms through the lens of algorithmic fairness. Algorithmic fairness provides a quantitative definition of fairness based on the properties of the machine learning model.8 While this approach allows us to easily compare the fairness of different models, it disregards other key components of the decision-making pipeline.9 For example, algorithmic fairness ignores biases that may be introduced in the downstream use of the model, or biases in the upstream data collection process.10 We will discuss how these additional factors may influence the algorithmic fairness analysis at the end of this case study.

There exist many prior works that evaluate clinical machine learning models using algorithmic fairness.11 For example, prior work has shown biases in models that predict in-hospital mortality from clinical notes;12 in models that diagnose disease from skin lesions,13 chest X-rays,14 and magnetic resonance imaging (MRIs);15 and in medical image segmentation models.16

In this work, we present a case study of algorithmic fairness in the clinical setting. In particular, we focus on an application of machine learning in medical imaging—diagnosing disease from chest X-rays—based on some of our recent research.17 Given that deep learning algorithms have been shown to match specialist performance in this setting, combined with a global shortage of radiologists, there is a clear incentive for the real-world deployment of such tools.18

We start by reviewing how machine learning models are evaluated, and defining what it means for a machine learning model to be “fair.” With a metric established, we evaluate the fairness of a state-of-the-art chest X-ray classifier. Then, we use algorithmic methods to try to achieve fairness, and investigate why the model may have been unfair in the first place. Finally, we conclude by discussing key takeaways for practical application of algorithmic fairness in the clinical setting.

We find that algorithmic approaches to achieve equal predictive performance between demographic groups can have many serious side effects. We advocate instead for an alternate definition of fairness focused on maximizing performance of the worst-case group. Finally, we emphasize the importance of understanding the biases present in the data over blindly applying algorithmic interventions.

Evaluating Performance of Machine Learning Models

In this work, we focus on the problem setting of binary classification, where the label that we are trying to predict can take one of two possible values. Here, we will evaluate classifiers along three dimensions: performance, calibration, and fairness.

Machine learning models built for binary classification typically output a predicted probability (i.e., a value between 0 and 1). The most basic performance metrics first compare this probability against a fixed threshold between 0 and 1 to create a binary prediction. Metrics can then be computed by examining how the binary prediction matches the binary label across all evaluation samples. Such metrics include accuracy (the percentage of correct predictions), the false positive rate (FPR, the percentage of negative samples that are incorrectly predicted as positive), and the false negative rate (FNR, the percentage of positive samples that are incorrectly predicted as negative).
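A minimal Python sketch of these three metrics, using made-up scores and labels. The function name and the convention that a probability at or above the threshold counts as a positive prediction are assumptions made for illustration.

```python
def binary_metrics(probs, labels, threshold=0.5):
    """Accuracy, FPR, and FNR of predicted probabilities at a fixed threshold."""
    # Assumption: a score at or above the threshold is a positive prediction.
    preds = [1 if p >= threshold else 0 for p in probs]
    tp = sum(1 for yhat, y in zip(preds, labels) if yhat == 1 and y == 1)
    tn = sum(1 for yhat, y in zip(preds, labels) if yhat == 0 and y == 0)
    fp = sum(1 for yhat, y in zip(preds, labels) if yhat == 1 and y == 0)
    fn = sum(1 for yhat, y in zip(preds, labels) if yhat == 0 and y == 1)
    return {
        "accuracy": (tp + tn) / len(labels),
        # FPR is computed over negative samples only, FNR over positive samples only.
        "fpr": fp / (fp + tn) if (fp + tn) else float("nan"),
        "fnr": fn / (fn + tp) if (fn + tp) else float("nan"),
    }


# Toy data (not from MIMIC-CXR): three positive and three negative samples.
toy_probs = [0.9, 0.8, 0.4, 0.3, 0.7, 0.2]
toy_labels = [1, 1, 1, 0, 0, 0]
print(binary_metrics(toy_probs, toy_labels, threshold=0.5))
```

Changing `threshold` in the call trades one error type for the other: a higher threshold lowers the FPR and raises the FNR on the same scores.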

Though such metrics are easy to evaluate and easy to understand, we often do not know what the value of the threshold will be prior to deployment. The setting of the decision threshold is highly dependent on the particular deployment setting and the preference of the physician making the decision, and could even change over time. In particular, the threshold is often selected by considering the relative cost of a false positive and a false negative (see Appendix A). Absent such information, it would be more appropriate for the model developer to evaluate the performance of a classifier over all possible thresholds. Here, we consider one such metric: the Area Under the Receiver Operating Characteristic curve (AUROC), which can be intuitively interpreted as the probability that a randomly selected sample from the positive class will have a higher predicted probability than a randomly selected sample from the negative class.
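The probabilistic interpretation of AUROC can be checked directly by comparing every positive sample with every negative sample. The sketch below does exactly that on toy data. The function name is hypothetical, and counting ties as half a win is the usual convention rather than something stated in the text.

```python
def auroc_pairwise(probs, labels):
    """AUROC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [p for p, y in zip(probs, labels) if y == 1]
    neg = [p for p, y in zip(probs, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("AUROC needs at least one positive and one negative sample")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # Assumption: a tie counts as half a correct ranking.
    return wins / (len(pos) * len(neg))


# Toy data (not from MIMIC-CXR).
toy_probs = [0.9, 0.8, 0.4, 0.3, 0.7, 0.2]
toy_labels = [1, 1, 1, 0, 0, 0]
print(auroc_pairwise(toy_probs, toy_labels))
```

This quadratic-time loop is only meant to make the definition concrete. For real evaluation sets, a sorting-based implementation from an established library is the more practical choice.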

Group Fairness

US anti-discrimination law defines several protected classes (e.g., race, sex, age) for which an algorithmic decision should not have a disparate impact.19 Group Fairness encodes this notion by prescribing that a machine learning model should perform equally well for all groups within a protected class, based on some metric.

For simplicity, we first consider group fairness with respect to a binary model prediction, which is also the setting considered in almost all of the prior work. Commonly used fairness definitions, along with the binary metric that each prescribes should be equalized, are shown in Table 1.

Table 1

Commonly used group fairness definitions in the binary classification setting for a binary prediction.

| Type of Fairness | Fairness Principle | Equalized Metric |
| --- | --- | --- |
| Demographic Parity | | Predicted prevalence |
| Equality of Odds | | FPR, FNR |
| Equality of Opportunity (Positive Class) | | FNR |
| Equality of Opportunity (Negative Class) | | FPR |
| Predictive Parity | | PPV |

Note. FPR, FNR, and PPV are the false positive rate, false negative rate, and positive predictive value, respectively. For more formal definitions, see Appendix B.

In this case study, we focus on the equality of odds metric, which specifies that the false positive rate and false negative rate should be the same between all groups.20 Note that each fairness definition implies a corresponding definition of bias—for example, under equality of odds, the degree of bias can be measured by the absolute difference in FPR and FNR between groups.
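As a rough illustration of measuring bias under equality of odds, the sketch below computes the absolute FPR and FNR gaps between two groups from binary predictions. The function names, group codes, and data are all hypothetical.

```python
def group_rates(preds, labels):
    """Return (FPR, FNR) for one group's binary predictions and labels."""
    fp = sum(1 for yhat, y in zip(preds, labels) if yhat == 1 and y == 0)
    tn = sum(1 for yhat, y in zip(preds, labels) if yhat == 0 and y == 0)
    fn = sum(1 for yhat, y in zip(preds, labels) if yhat == 0 and y == 1)
    tp = sum(1 for yhat, y in zip(preds, labels) if yhat == 1 and y == 1)
    fpr = fp / (fp + tn) if (fp + tn) else float("nan")
    fnr = fn / (fn + tp) if (fn + tp) else float("nan")
    return fpr, fnr


def equalized_odds_gaps(preds, labels, groups, group_a, group_b):
    """Absolute FPR and FNR differences between two groups."""
    def subset(g):
        pairs = [(yhat, y) for yhat, y, grp in zip(preds, labels, groups) if grp == g]
        return [yhat for yhat, _ in pairs], [y for _, y in pairs]

    fpr_a, fnr_a = group_rates(*subset(group_a))
    fpr_b, fnr_b = group_rates(*subset(group_b))
    return {"fpr_gap": abs(fpr_a - fpr_b), "fnr_gap": abs(fnr_a - fnr_b)}


# Toy data: hypothetical group codes "A" and "B", four samples each.
toy_labels = [1, 1, 0, 0, 1, 1, 0, 0]
toy_preds = [1, 0, 0, 0, 1, 1, 1, 0]
toy_groups = ["A", "A", "A", "A", "B", "B", "B", "B"]
print(equalized_odds_gaps(toy_preds, toy_labels, toy_groups, "A", "B"))
```

A classifier satisfies equality of odds on a given evaluation set when both gaps are zero, up to statistical uncertainty.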

Are Chest X-Ray Models Group Fair?

In our study on “Underdiagnosis Bias of Artificial Intelligence Algorithms Applied to Chest Radiographs in Under-Served Patient Populations,” we examine group fairness of a chest X-ray model with respect to a particular task: predicting “No Finding.”21 Here, a positive label for No Finding indicates the absence of any pathology. A false positive for No Finding then means that the model falsely predicts a patient is healthy when they are actually ill. This error is severe because it may lead practitioners to erroneously withhold treatment from their patients, also known as underdiagnosis. Thus, we refer to the false positive rate as the underdiagnosis rate. Conversely, a model may predict a patient to be ill when they are not, also known as a false negative, which is less serious in this case.

In our study, we use the MIMIC-CXR data set, which contains over 300,000 chest X-ray images taken at the Beth Israel Deaconess Medical Center in Boston, Massachusetts, between the years of 2011 and 2016.22 We train a deep convolutional neural network to predict No Finding, and analyze its FPR and FNR for a variety of groups across sex, age group, race/ethnicity, and insurance type. (For technical details, see our research article.23)

We ultimately find that there exist significant disparities in underdiagnosis rate between groups, as shown in Figure 1. Therefore, this classifier fails to satisfy equal odds. For example, female patients, younger patients, Black patients, and those with Medicaid insurance tend to have a higher false positive rate for No Finding, indicating that they are more likely to incorrectly receive no treatment. We also evaluate the performance of groups at the intersection of two attributes (e.g., Black females), finding that this reveals even larger disparities in underdiagnosis. Similar results have been previously shown in other chest X-ray tasks.24

Figure 1

Predictive performance of a model trained to predict No Finding from MIMIC-CXR data for each group, using 0.5 as the threshold, as measured by the false positive rate (FPR, underdiagnosis) and the false negative rate (FNR). Error bars shown are 95% confidence intervals from five model retrainings with different random seeds.

On Calibration

To dig deeper into this result, we make the important connection that the machine learning models that we train do not natively output a binary prediction, but instead a score: a predicted probability that the patient has No Finding. The binary prediction is then obtained by comparing the predicted probability to some threshold. One important metric for evaluating risk scores is calibration.25 Intuitively, a model is calibrated if its probability output is reflective of the real-world risk. For example, out of the samples for which the model outputs a 35 percent chance of No Finding, roughly 35 percent should actually have No Finding. Calibration can be measured using a metric called the expected calibration error (ECE).26
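One common way to estimate ECE is to bin samples by predicted probability and compare, within each bin, the mean predicted probability with the observed fraction of positives. The sketch below follows that recipe with equal-width bins. The bin count, the binning scheme, and the function name are assumptions, and other ECE variants exist.

```python
def expected_calibration_error(probs, labels, n_bins=10):
    """Binned ECE: weighted mean of |observed positive rate - mean predicted probability|."""
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, labels):
        # Assumption: equal-width bins on [0, 1]; p == 1.0 goes into the last bin.
        idx = min(int(p * n_bins), n_bins - 1)
        bins[idx].append((p, y))
    n = len(probs)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_prob = sum(p for p, _ in members) / len(members)
        frac_pos = sum(y for _, y in members) / len(members)
        ece += (len(members) / n) * abs(frac_pos - mean_prob)
    return ece


# Toy check of the 35 percent example: 20 samples scored 0.35, 7 of them positive.
toy_probs = [0.35] * 20
toy_labels = [1] * 7 + [0] * 13
print(expected_calibration_error(toy_probs, toy_labels))
```

Running the same function separately on each group's samples gives the per-group calibration errors that a calibration-based fairness comparison would use.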

Similar to prior group fairness metrics that define fairness in terms of disparities in FPR and FNR, we can also define fairness in terms of calibration error. Specifically, this metric would imply that a model is biased if there are large disparities in the per-group calibration error, which is closely related to the concept of sufficiency of a risk score.27 Since the decision threshold for binary classification is rarely known in advance (as previously discussed), we emphasize that differences in model calibration between groups are a significant source of disparity once these models are deployed.28 For example, if a model is well-calibrated for one group and poorly calibrated for another, making decisions based on the model’s output at a fixed threshold could result in differing implied thresholds on the true risk between protected groups.

Achieving Group Fairness

One natural way to achieve equal odds is to select a different operating threshold for each group. Indeed, this is an effective debiasing method for equalized odds that can be applied to an existing model.29 We demonstrate its usage on the No Finding prediction model, selecting White and Black patients as protected groups. Note that this method is applicable to any combination of groups in general.
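To make the idea of per-group thresholds concrete, the sketch below keeps one group at a reference threshold and grid-searches a threshold for the other group that brings its FPR and FNR as close as possible to the reference group's. This simple grid search is an illustrative stand-in and not necessarily the procedure used in the cited work. The group names, scores, and candidate grid are made up.

```python
def fpr_fnr(probs, labels, threshold):
    """Return (FPR, FNR) of scores thresholded at `threshold` (score >= threshold is positive)."""
    preds = [1 if p >= threshold else 0 for p in probs]
    fp = sum(1 for yhat, y in zip(preds, labels) if yhat == 1 and y == 0)
    fn = sum(1 for yhat, y in zip(preds, labels) if yhat == 0 and y == 1)
    n_neg = sum(1 for y in labels if y == 0)
    n_pos = sum(1 for y in labels if y == 1)
    return fp / n_neg, fn / n_pos


def match_threshold(reference_rates, probs, labels, candidates):
    """Pick the candidate threshold whose (FPR, FNR) is closest to the reference rates."""
    best_t, best_gap = None, float("inf")
    for t in candidates:
        fpr, fnr = fpr_fnr(probs, labels, t)
        gap = abs(fpr - reference_rates[0]) + abs(fnr - reference_rates[1])
        if gap < best_gap:
            best_t, best_gap = t, gap
    return best_t, best_gap


# Toy scores for two hypothetical groups; group B's scores are shifted upward.
labels_a = [0, 0, 0, 1, 1, 1]
probs_a = [0.1, 0.3, 0.6, 0.4, 0.7, 0.9]
labels_b = [0, 0, 0, 1, 1, 1]
probs_b = [0.2, 0.4, 0.7, 0.5, 0.8, 0.95]

reference = fpr_fnr(probs_a, labels_a, threshold=0.5)  # group A stays at 0.5
best_t, best_gap = match_threshold(reference, probs_b, labels_b, [0.3, 0.4, 0.5, 0.6, 0.7])
print(reference, best_t, best_gap)
```

On real data the two error rates usually cannot both be matched exactly with a single deterministic threshold per group, so the remaining gap returned by the search is worth inspecting.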

We find, interestingly, that despite the differences in FPR and FNR at a fixed threshold, the AUROC and ECE are not significantly different between the two groups, and the classifier is fairly well-calibrated for both groups. By adjusting the threshold independently for each group, we are easily able to find a setting that eliminates the FNR and FPR gaps between the two groups completely. For the additional analysis, see Appendix C.

However, there are several major drawbacks of this procedure. First, by using a separate threshold for each group, we implicitly assume a different ratio of the costs of false negatives and false positives. In the example in Appendix C (Figure C.1), White patients (thresholded at 0.50) are assumed to have equal costs of false negatives and false positives, while for Black patients (thresholded at 0.63) the threshold rule in Appendix A implies that false positives are roughly 1.7 times as costly as false negatives. This clearly implies a disparity in how decisions are made between groups, and violates the disparate treatment component of nondiscrimination law as well as potential clinical practice guidelines.30 Second, in order to deploy a model under this policy, group membership must be known for all patients. Patients may identify with a different sex or race/ethnicity over time, and may also be unable or unwilling to provide this information during admission to a hospital.31 This could result in vastly inconsistent and inaccurate treatment policies.

A natural question to ask is whether it is even possible to achieve equal odds at the same threshold while maintaining model calibration. Surprisingly, several theoretical works have shown that this is not possible for a probabilistic version of equal odds, and that there is an inherent trade-off between calibration and fairness of the risk score.32 In our research on “Improving the Fairness of Chest X-ray Classifiers,” we attempted to achieve equal odds using algorithmic methods that alter the loss function of the machine learning model.33 (See appendix Figure C.2.) We observed that such methods are successful in enforcing group fairness, but this comes at the cost of worse performance and calibration for all: no group achieves better performance when group fairness is enforced.

This finding, that enforcing group fairness constraints reduces model performance for all groups, has also been observed in clinical tabular data,34 and in general machine learning data sets.35 If a group-fair classifier performs no better than the original classifier for any group, then its use in a clinical setting offers little benefit.

Alternatives to Group Fairness

Group fairness produces classifiers that worsen performance for all groups (essentially trading off overall welfare for fairness), so are there alternative fairness definitions that are better suited for clinical settings? One recent and promising definition is Minimax Pareto Fairness, which is connected with political philosopher John Rawls’s Maximin Principle.36 Here, we only consider the set of models that are Pareto optimal, for which it is impossible to improve model performance for any group without making another group worse. Minimax Pareto fairness then states that we should select the Pareto optimal model with the best worst-group performance.
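The selection rule can be sketched over a finite set of candidate models, each summarized by its per-group error rates. The model names and error values below are invented for illustration. Pareto optimality here is judged only relative to the listed candidates, not relative to all possible models.

```python
def dominates(errors_a, errors_b):
    """True if A is no worse than B for every group and strictly better for at least one."""
    no_worse = all(a <= b for a, b in zip(errors_a, errors_b))
    strictly_better = any(a < b for a, b in zip(errors_a, errors_b))
    return no_worse and strictly_better


def pareto_front(models):
    """Keep the models that no other candidate dominates."""
    return {
        name: errs
        for name, errs in models.items()
        if not any(dominates(other, errs) for other_name, other in models.items() if other_name != name)
    }


def minimax_pareto(models):
    """Among Pareto-optimal candidates, pick the one with the lowest worst-group error."""
    front = pareto_front(models)
    return min(front, key=lambda name: max(front[name]))


# Hypothetical per-group error rates (group 1, group 2) for four candidate models.
candidates = {
    "baseline": (0.10, 0.20),
    "balanced": (0.12, 0.16),
    "group_fair": (0.22, 0.22),  # equal errors, but worse than "baseline" for both groups
    "other": (0.08, 0.25),
}
print(sorted(pareto_front(candidates)))
print(minimax_pareto(candidates))
```

In this toy setup the model with exactly equal group errors is excluded because another candidate is better for both groups, which mirrors the concern raised above about group-fair classifiers.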

As we have observed above, group-fair classifiers are rarely Pareto optimal. By selecting only from the set of Pareto optimal models, minimax Pareto fairness ensures that no model performance is left unattained, which is an issue with group fairness, especially when groups have different base error rates. Additionally, any Pareto optimal classifier can be easily converted to a group-fair classifier through randomization, which involves setting a subset of the predictions for a particular group to a random value.37 This procedure of systematically worsening predictions for a particular group to achieve equal performance would clearly raise ethical issues in a clinical setting.

However, for real-world problems, it is difficult to know whether a model is Pareto optimal, or what the minimax error is. Therefore, unlike group fairness, minimax Pareto fairness is typically used as a relative definition of fairness, where two models are compared based on their worst-group and per-group performances.38 As such, it is difficult to know when you have attained the minimax Pareto fair model, though in practice, we have found that simply balancing the data by the group attribute is a fairly competitive baseline,39 and recent work has found that models that minimize overall error may already be minimax Pareto fair.40

Finally, there are many other fairness definitions that have been proposed in the literature. For example, individual fairness requires that similar individuals be treated similarly, though it is often difficult to establish a similarity metric.41 There also exist many definitions of fairness that utilize the causal structure of the data,42 though recent works have found that such definitions often suffer from the same drawbacks that we have observed in group fairness.43 Despite almost all work in clinical fairness focusing on group fairness, we encourage practitioners to select other fairness definitions that may be better catered to their problem setting.

Potential Sources of Bias in Deployed Machine Learning Systems

Though the results presented above may be discouraging to some readers, we emphasize that there is much more to the artificial intelligence (AI) fairness field than simply correcting biases from an algorithmic perspective. In particular, there are several key components in the machine learning deployment pipeline that are not considered in an algorithmic fairness analysis. For example, if a model is to be used for clinical decision support, any biases in how the clinician might selectively choose to follow model recommendations would be a source of bias in the system, even if the model deployed is perfectly fair.44 Similarly, algorithmic fairness also assumes that the data distribution we observe during evaluation is identical to what will be seen during deployment. However, we know that clinical data can often suffer from selection bias, in which subsets of particular groups are more likely to be observed within the data set, due to factors such as inequality in health care access.45 For example, prior work has shown that White patients are significantly more likely to receive diagnostic imaging than non-White patients, even after adjusting for a variety of confounding factors.46 If this selection bias is not present in the deployment setting, then we would expect the actual performance and fairness metrics of the model to be vastly different from what we have observed on the “biased” data set.

Here, we conduct a preliminary study of another potential source of data set bias: label bias. Label bias occurs when the observed labels differ from the ground truth at different rates for different groups.47 Not only would label bias affect the models that we train, it would also invalidate the performance and fairness metrics that we have calculated, as our model predictions would be compared against incorrect labels.

To study label bias, a first question is whether there is any mislabeling in the observed labels at all, as there would be no label bias if all labels were correct. We present the pipeline for how labels are derived in the MIMIC-CXR data set in Figure 2. The labels within the MIMIC-CXR data set are obtained from free-text radiology notes using the CheXpert labeler, a simple rule-based natural language processing system that extracts entities based on a list of predefined phrases.48 Prior work found that the CheXpert labeler-derived labels have mediocre accuracy compared to manual annotations from a radiologist looking at only the note, indicating a large amount of mislabeling.49

Figure 2

Labels within the MIMIC-CXR data set are obtained using a rule-based natural language extractor from free-text radiology notes, which are written by clinicians looking at the chest X-rays.

Another compounding factor in label errors is that radiologists may make mistakes when writing the radiology note, or may not include all relevant information. A prior work found a significant degree of disagreement in diagnosis between a radiologist looking at the X-ray and a radiologist looking only at the note, demonstrating that radiology notes frequently omit important information.50 This would then lead to errors in the labels extracted from the notes.

However, does the degree of mislabeling differ between groups? In our study, we selected twelve hundred radiology reports in MIMIC-CXR that were labeled as No Finding by the automatic labeler, corresponding to roughly two hundred samples each from the intersections of sex and ethnicity. We recruited a board-certified radiologist to verify whether each report actually indicates No Finding using only the free-form text, without access to the underlying chest X-ray or any other patient information.51

In Figure 3 we report the accuracy of the CheXpert labeler for each protected group and intersectional subgroup, assuming that the radiologist labels are the gold standard. Each cell in the heatmap corresponds to the probability that a group has No Finding, given that the CheXpert labeler labels it as so (i.e., a positive predictive value).

Figure 3

Accuracy of the CheXpert labeler on 1200 radiology reports in MIMIC-CXR that it labels as No Finding, relative to the radiologist gold standard, for each protected group and intersectional subgroup.
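The per-group quantity shown in each heatmap cell is a positive predictive value: among reports the automatic labeler marked as No Finding, the fraction the radiologist confirmed. A sketch of that computation follows. The record layout, function name, group codes, and values are hypothetical and are not the study's data.

```python
from collections import defaultdict


def labeler_ppv_by_group(records):
    """PPV of the automatic labeler per group.

    `records` holds (group, radiologist_confirms_no_finding) pairs, restricted to
    reports that the automatic labeler already marked as No Finding.
    """
    confirmed = defaultdict(int)
    total = defaultdict(int)
    for group, is_confirmed in records:
        confirmed[group] += int(is_confirmed)
        total[group] += 1
    return {group: confirmed[group] / total[group] for group in total}


# Made-up records for two hypothetical age groups.
toy_records = [
    ("18-40", True), ("18-40", True), ("18-40", True), ("18-40", False),
    ("80+", True), ("80+", False), ("80+", False), ("80+", False),
]
print(labeler_ppv_by_group(toy_records))
```

Intersectional subgroups can be handled the same way by using a tuple such as `(sex, ethnicity)` as the group key.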

We find that the quality of the CheXpert labeler is poor across the board. Overall, when the labeler labels a report as No Finding, it is only correct 64.1 percent of the time. Looking at the accuracy for each group, we find no significant differences in the label quality between sexes and ethnicities, or their intersections. However, there are significant disparities between age groups: specifically, those in the “80 and above” group have the worst-quality labels, and those in the “18–40” group have the best. Interestingly, this also correlates with the performance disparities observed between age groups for this task, for which the model achieves much better performance for younger populations.52 This may be explained by the fact that greater label corruption increases the Bayes error (error rate of the optimal classifier) for a particular group, which would lead to disparities in accuracy.53 We note that these results may be affected by age-related comorbidities in older patients, which increase the labeling complexity for both the automatic and radiologist labelers.

We conclude that label bias, and likely poor label quality in general, is one potential contributing factor for the observed performance disparities. We encourage further analyses of the severity of label bias, as well as other types of data set biases, in clinical data sets.

Concluding Remarks: Best Practices for Fairness in Clinical Settings

Given our findings, we provide the following best practices for the use of algorithmic fairness in clinical machine learning models.

Evaluate comprehensively. First, we recommend evaluating per-group performance over a wide range of metrics.54 Examining a large set of metrics across operating thresholds gives a holistic view of where gaps between protected groups lie. We specifically emphasize calibration error as an evaluation metric that is important for clinical risk scores, but is relatively underexplored in the clinical fairness literature.55 Differing calibration curves between protected groups mean that deployment at any fixed operating threshold would result in differing implied thresholds on the true risk between protected groups.56 Where possible, a model's performance on the worst-case group, its average performance, and the gap in performance between groups should all be examined.

Consider sources of bias in the data. Even after a comprehensive evaluation, performance metrics are only as valid as the data set on which they were computed. The old adage “garbage in, garbage out” applies, both to the models trained and the metrics used to evaluate them. It is crucial to consider any potential biases in the data generation mechanism and how they may differ during real-world deployment. Where possible, steps should be taken to correct for such biases. Fairness should be prioritized throughout data collection.

Not all gaps need to be corrected. When observing disparities in performance, it is critical to consider the data-generating process to determine whether the gaps are clinically justified. For example, the task could be inherently harder for some groups (for example, older populations due to comorbidities). Blindly equalizing performance in such cases can lead to worse welfare for all.57 We note that debiasing a model to satisfy equal odds typically also results in a change in the effective decision threshold applied to one or more groups, as a result of explicit threshold adjustment induced by postprocessing or through the miscalibration induced by algorithmic procedures.58 As a consequence, decisions made on the basis of models debiased for equal odds are unlikely to be made at thresholds that are concordant with clinical practice guidelines or at thresholds that were selected based on the effectiveness of the clinical intervention associated with the model.59

Ultimately, algorithmic approaches alone are not enough to ensure that clinical machine learning models are equitable.60 Mitigating potential harms requires reasoning about all sources of health disparities and carefully auditing any interventions inspired by machine learning models.

Discussion Questions

  1. What governmental regulations related to fairness (if any) should clinical machine learning models satisfy before they are approved for use in the real world?

  2. Think about a particular machine learning problem from data collection all the way to deployment (for example, data collection, labeling, model training, model evaluation, deployment). How could one evaluate and encourage fairness in each stage?

  3. In many real-world clinical data sets, demographic variables may not be available at all or only sporadically throughout the data set. How could one evaluate the fairness of a model in this setting?

  4. Recent work has demonstrated the surprising ability of machine learning models to identify demographics (e.g., race) from medical images and deidentified clinical text.61 How does this finding relate to or impact the analyses in this case study?

  5. Recent advances in machine learning have brought forth deep generative models that can synthesize images (e.g., DALL·E 2, Stable Diffusion) or text (e.g., GPT-3, ChatGPT). Such models have started to be adapted to clinical tasks. What might fairness mean in this context, and can you think of some concrete definitions?

Appendix A

On Threshold Selection

Given that we have to select a threshold t in order to convert a model’s predicted probability to a binary value, one approach to select t seeks to maximize utility and minimize cost.62 Specifically, if we assume that a false negative is c times as costly as a false positive for all groups, and that the model is perfectly calibrated, then, to minimize total cost, we should select the threshold

$t^* = 1/(1+c),$

with any other thresholding rule incurring a higher overall cost.
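One way to see where this threshold comes from is to compare the expected cost of each decision at a calibrated score $p$. The derivation below is a sketch under the stated assumptions. The symbols $C_{+}$ and $C_{-}$ and the normalization of the false positive cost to 1 are introduced here for illustration.

```latex
\begin{align*}
C_{+}(p) &= (1 - p) \cdot 1
  && \text{expected cost of predicting positive (a false positive costs 1)} \\
C_{-}(p) &= p \cdot c
  && \text{expected cost of predicting negative (a false negative costs } c\text{)} \\
C_{+}(p) \le C_{-}(p)
  &\iff 1 - p \le c\,p
  \iff p \ge \frac{1}{1 + c} = t^*
\end{align*}
```

Predicting positive is therefore the cheaper decision exactly when $p \ge t^*$. For instance, $c = 1$ gives $t^* = 0.5$, and a larger $c$ (more costly false negatives) lowers the threshold.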

Appendix B

Formal Definitions

Let $Y$ be a random variable representing the label we are trying to predict. Here we focus on the binary classification setting, so $Y \in \{0, 1\}$. Let $\widehat{Y} \in \{0, 1\}$ be the prediction made by our machine learning model, and $G$ be a random variable representing the group membership for some specific protected class. Group fairness definitions are then specified as conditional independence statements between these random variables, as shown in Table B.1.63

Table B.1. Commonly used group fairness definitions, the conditional independence statements that they entail, and the metric that they equalize in the binary classification setting.

| Type of Fairness | Independence Statement | Equalized Metric |
| --- | --- | --- |
| Demographic Parity | $\widehat{Y} \perp G$ | Predicted prevalence |
| Equality of Odds | $\widehat{Y} \perp G \mid Y$ | TPR, FPR |
| Equality of Opportunity (Positive Class) | $\widehat{Y} \perp G \mid Y = 1$ | TPR |
| Equality of Opportunity (Negative Class) | $\widehat{Y} \perp G \mid Y = 0$ | FPR |
| Predictive Parity | $Y \perp G \mid \widehat{Y} = 1$ | PPV |

Note. $Y, \widehat{Y} \in \{0, 1\}$, and TPR, FPR, and PPV are the true positive rate, false positive rate, and positive predictive value, respectively.

Formally, we say that a model $f_\theta$ is calibrated if

$\forall p \in [0, 1]: \mathbb{P}(Y = 1 \mid f_\theta(X) = p) = p,$

where $\mathbb{P}(Y = 1 \mid f_\theta(X) = p)$ is the conditional probability of the positive outcome given a predicted probability of $p$.

Appendix C

Additional Experimental Results

C.1. Per-Group Thresholding





0.256 (0.248, 0.264) | 0.167 (0.156, 0.178) | 0.089 (0.083, 0.094)
0.171 (0.162, 0.180) | 0.269 (0.260, 0.278) | -0.098 (-0.102, -0.092)
0.863 (0.860, 0.866) | 0.860 (0.857, 0.864) | 0.003 (-0.000, 0.005)
0.018 (0.013, 0.023) | 0.025 (0.017, 0.033) | -0.007 (-0.013, -0.002)

(a) Both groups have the same threshold (~ 0.35).

0.353 (0.329, 0.376) | 0.362 (0.330, 0.393) | -0.009 (-0.022, 0.000)
0.111 (0.102, 0.121) | 0.119 (0.111, 0.127) | -0.008 (-0.013, -0.005)

(b) Threshold White patients at 0.50, and threshold Black patients at 0.63.

Figure C.1. We experiment with using per-group thresholding in order to achieve equal odds. (a) When using the same threshold for both White and Black patients, we observe disparities in the FPR (false positive rate) and FNR (false negative rate). (b) When we use a per-group thresholding policy, the disparities disappear. Error bounds shown are 95% confidence intervals from five model retrainings with different random seeds. AUROC = area under the receiver operating characteristic curve; ECE = expected calibration error; TPR = true positive rate.

C.2. Debiasing during Training

Here we present an empirical study that attempts to achieve equal odds using algorithmic methods.64 Broadly, existing approaches add a term to the loss function that penalizes the distance between the predicted score distributions of the two groups, either directly, using a distance metric such as the Maximum Mean Discrepancy (MMDMatch65) or the difference between the means (MeanMatch66), or indirectly, by worsening the performance of an adversary that aims to predict the group attribute (Adversarial67). The strength of this additional term is controlled by a hyperparameter λ.
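To make the role of λ concrete, here is a minimal sketch of a mean-matching penalty added to a binary cross-entropy loss. It only evaluates the objective in plain Python; actual training would compute the same quantity inside an automatic differentiation framework. Conditioning the penalty on the true label, so that it targets equal odds, is an assumption of this sketch, and all function names are hypothetical.

```python
import math


def binary_cross_entropy(y_true, y_prob, eps=1e-12):
    """Mean binary cross-entropy, with probabilities clipped for safety."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1.0 - eps)
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(y_true)


def mean_match_penalty(y_true, y_prob, groups, group_a, group_b):
    """Sum over labels of |mean score in group_a - mean score in group_b|."""
    penalty = 0.0
    for label in (0, 1):
        a = [p for y, p, g in zip(y_true, y_prob, groups)
             if y == label and g == group_a]
        b = [p for y, p, g in zip(y_true, y_prob, groups)
             if y == label and g == group_b]
        # Labels that are missing from either group are skipped.
        if a and b:
            penalty += abs(sum(a) / len(a) - sum(b) / len(b))
    return penalty


def penalized_loss(y_true, y_prob, groups, group_a, group_b, lam):
    """Task loss plus lam times the group-distance penalty."""
    task = binary_cross_entropy(y_true, y_prob)
    fair = mean_match_penalty(y_true, y_prob, groups, group_a, group_b)
    return task + lam * fair


# Toy data, invented for illustration.
y_true = [1, 1, 0, 0]
y_prob = [0.8, 0.6, 0.3, 0.1]
groups = ["A", "B", "A", "B"]
task_only = penalized_loss(y_true, y_prob, groups, "A", "B", lam=0.0)
with_penalty = penalized_loss(y_true, y_prob, groups, "A", "B", lam=2.0)
```

Setting λ to zero recovers the unconstrained task loss. Larger values of λ make differences between the groups' score means progressively more expensive relative to predictive accuracy.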

Our results are shown in Figure C.2, where we observe that, for all three methods, increasing λ fails to improve any performance metric for any group. Each method does achieve equal odds for large values of λ, but at the cost of worse performance and calibration for all groups.

Figure C.2. Comparison of models that predict No Finding in MIMIC-CXR while trying to achieve equal odds between ethnicities, plotted as a function of λ, the weighting of the additional loss term. Error bars indicate 95% confidence intervals from 250 bootstrapping iterations.


Abbasi-Sureshjani, Samaneh, Ralf Raumanns, Britt E. J. Michels, Gerard Schouten, and Veronika Cheplygina. “Risk of Training Diagnostic Algorithms on Data with Demographic Bias.” In Interpretable and Annotation-Efficient Learning for Medical Image Computing: Lecture Notes in Computer Science, vol. 12446, edited by Jaime Cardoso, Hien Van Nguyen, Nicholas Heller, Pedro Henriques Abreu, Ivana Isgum, Wilson Silva, Ricardo Cruz et al., 183–92. Berlin: Springer, 2020.

Adam, Hammaad, Ming Ying Yang, Kenrick Cato, Ioana Baldini, Charles Senteio, Leo Anthony Celi, Jiaming Zeng, Moninder Singh, and Marzyeh Ghassemi. “Write It Like You See It: Detectable Differences in Clinical Notes by Race Lead to Differential Model Recommendations.” Preprint submitted November 1, 2022.

Ahmad, Muhammad Aurangzeb, Arpit Patel, Carly Eckert, Vikas Kumar, and Ankur Teredesai. “Fairness in Machine Learning for Healthcare.” In KDD’20: Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, edited by Rajesh Gupta and Yan Liu, 3529–30. New York: ACM, 2020.

Alba, Ana Carolina, Thomas Agoritsas, Michael Walsh, Steven Hanna, Alfonso Iorio, P. J. Devereaux, Thomas McGinn, and Gordon Guyatt. “Discrimination and Calibration of Clinical Prediction Models: Users’ Guides to the Medical Literature.” Journal of the American Medical Association 318, no. 14 (2017): 1377–84.

Bakalar, Chloé, Renata Barreto, Stevie Bergman, Miranda Bogen, Bobbie Chern, Sam Corbett-Davies, Melissa Hall et al. “Fairness on the Ground: Applying Algorithmic Fairness Approaches to Production Systems.” Preprint submitted March 24, 2021.

Banerjee, Imon, Ananth Reddy Bhimireddy, John L. Burns, Leo Anthony Celi, Li-Ching Chen, Ramon Correa, Natalie Dullerud et al. “Reading Race: AI Recognises Patient’s Racial Identity in Medical Images.” Preprint submitted July 21, 2021.

Barocas, Solon, and Andrew D. Selbst. “Big Data’s Disparate Impact.” California Law Review 104 (2016): 671–732.

Buolamwini, Joy, and Timnit Gebru. “Gender Shades: Intersectional Accuracy Disparities in Commercial Gender Classification.” Proceedings of Machine Learning Research 81 (2018): 1–15.

Busby, Lindsay P., Jesse L. Courtier, and Christine M. Glastonbury. “Bias in Radiology: The How and Why of Misses and Misinterpretations.” Radiographics 38, no. 1 (2018): 236.

Chen, Irene, Fredrik D. Johansson, and David Sontag. “Why Is My Classifier Discriminatory?” In NIPS’18: Proceedings of the 32nd International Conference on Neural Information Processing Systems, edited by Samy Bengio, Hanna M. Wallach, Hugo Larochelle, Kristen Grauman, and Nicolò Cesa-Bianchi, 3543–54. Red Hook, NY: Curran Associates, 2018.

Chen, Irene Y., Emma Pierson, Sherri Rose, Shalmali Joshi, Kadija Ferryman, and Marzyeh Ghassemi. “Ethical Machine Learning in Healthcare.” Annual Review of Biomedical Data Science 4 (2021): 123–44.

Chen, Irene Y., Peter Szolovits, and Marzyeh Ghassemi. “Can AI Help Reduce Disparities in General Medical and Mental Health Care?” AMA Journal of Ethics 21, no. 2 (2019): 167–79.

Corbett-Davies, Sam, and Sharad Goel. “The Measure and Mismeasure of Fairness: A Critical Review of Fair Machine Learning.” Preprint submitted August 14, 2018.

Dai, Jessica, and Sarah M. Brown. “Label Bias, Label Shift: Fair Machine Learning with Unreliable Labels.” In NeurIPS 2020 Workshop on Consequential Decision Making in Dynamic Environments 12 (2020).

De-Arteaga, Maria, Alexey Romanov, Hanna Wallach, Jennifer Chayes, Christian Borgs, Alexandra Chouldechova, Sahin Geyik, Krishnaram Kenthapadi, and Adam Tauman Kalai. “Bias in Bios: A Case Study of Semantic Representation Bias in a High-Stakes Setting.” In FAT* ’19: Proceedings of the Conference on Fairness, Accountability, and Transparency, 120–28. New York: ACM, 2019.

Dickman, Samuel L., David U. Himmelstein, and Steffie Woolhandler. “Inequality and the Health-Care System in the USA.” The Lancet 389 (2017): 1431–41. 10.1016/S0140-6736(17)30398-7.

Dwork, Cynthia, Moritz Hardt, Toniann Pitassi, Omer Reingold, and Rich Zemel. “Fairness through Awareness.” Preprint submitted November 29, 2011.

Foryciarz, Agata, Stephen R. Pfohl, Birju Patel, and Nigam Shah. “Evaluating Algorithmic Fairness in the Presence of Clinical Guidelines: The Case of Atherosclerotic Cardiovascular Disease Risk Estimation.” BMJ Health & Care Informatics 29, no. 1 (2022): e100460.

Fu, Runshan, Manmohan Aseri, Param Vir Singh, and Kannan Srinivasan. “‘Un’fair Machine Learning Algorithms.” Management Science 68, no. 6 (2022): 4173–95.

Ganz, Melanie, Sune H. Holm, and Aasa Feragen. “Assessing Bias in Medical AI.” In Workshop on Interpretable ML in Healthcare at International Conference on Machine Learning (ICML), 2021.

Gichoya, Judy Wawira, Liam G. McCoy, Leo Anthony Celi, and Marzyeh Ghassemi. “Equity in Essence: A Call for Operationalising Fairness in Machine Learning for Healthcare.” BMJ Health & Care Informatics 28, no. 1 (2021): e100289.

Glocker, Ben, Charles Jones, Melanie Bernhardt, and Stefan Winzeck. “Risk of Bias in Chest X-ray Foundation Models.” Preprint submitted December 19, 2022.

Gretton, Arthur, Karsten M. Borgwardt, Malte J. Rasch, Bernhard Schölkopf, and Alexander Smola. “A Kernel Two-Sample Test.” The Journal of Machine Learning Research 13, no. 1 (2012): 723–73.

Hardt, Moritz, Eric Price, and Nati Srebro. “Equality of Opportunity in Supervised Learning.” In NIPS’16: Proceedings of the 30th International Conference on Neural Information Processing Systems, edited by Daniel D. Lee, Ulrike von Luxburg, Roman Garnett, Masashi Sugiyama, and Isabelle Guyon, 3323–31. Red Hook, NY: Curran Associates, 2016.

Hu, Lily, and Yiling Chen. “Fair Classification and Social Welfare.” In FAT*’20: Proceedings of the 2020 Conference on Fairness, Accountability, and Transparency, edited by Mireille Hildebrandt and Carlos Castillo, 535–45. New York: ACM, 2020.

Irvin, Jeremy, Pranav Rajpurkar, Michael Ko, Yifan Yu, Silviana Ciurea-Ilcus, Chris Chute, Henrik Marklund, Behzad Haghgoo et al. “CheXpert: A Large Chest Radiograph Dataset with Uncertainty Labels and Expert Comparison.” Proceedings of the AAAI Conference on Artificial Intelligence 33, no. 1 (2019): 590–97.

Jain, Saahil, Akshay Smit, Steven QH Truong, Chanh DT Nguyen, Minh-Thanh Huynh, Mudit Jain, Victoria A. Young, Andrew Y. Ng, Matthew P. Lungren, and Pranav Rajpurkar. “VisualCheXpert: Addressing the Discrepancy between Radiology Report Labels and Image Labels.” In CHIL’21: Proceedings of the Conference on Health, Inference, and Learning, edited by Marzyeh Ghassemi, 105–15. New York: ACM, 2021.

Johnson, Alistair E. W., Tom J. Pollard, Nathaniel R. Greenbaum, Matthew P. Lungren, Chih-ying Deng, Yifan Peng, Zhiyong Lu, Roger G. Mark, Seth J. Berkowitz, and Steven Horng. “MIMIC-CXR-jpg, a Large Publicly Available Database of Labeled Chest Radiographs.” Preprint submitted November 14, 2019.

Kleinberg, Jon, Sendhil Mullainathan, and Manish Raghavan. “Inherent Trade-offs in the Fair Determination of Risk Scores.” Preprint submitted November 17, 2016.

Kusner, Matt J., Joshua Loftus, Chris Russell, and Ricardo Silva. “Counterfactual Fairness.” In NIPS’17: Proceedings of the 31st International Conference on Neural Information Processing Systems, edited by Ulrike von Luxburg, Isabelle Guyon, Samy Bengio, Hanna Wallach, and Rob Fergus, 4069–79. Red Hook, NY: Curran Associates, 2017.

Lee, Tiarna, Esther Puyol-Anton, Bram Ruijsink, Miaojing Shi, and Andrew P. King. “A Systematic Study of Race and Sex Bias in CNN-Based Cardiac MR Segmentation.” Preprint submitted September 4, 2022.

Liu, Lydia T., Max Simchowitz, and Moritz Hardt. “The Implicit Fairness Criterion of Unconstrained Learning.” Preprint submitted January 25, 2019.

Liu, Xiaoxuan, Livia Faes, Aditya U. Kale, Siegfried K. Wagner, Dun Jack Fu, Alice Bruynseels, Thushika Mahendiran et al. “A Comparison of Deep Learning Performance against Health-Care Professionals in Detecting Diseases from Medical Imaging: A Systematic Review and Meta-analysis.” The Lancet Digital Health 1, no. 6 (2019): e271–97.

Mamary, A. James, Jeffery I. Stewart, Gregory L. Kinney, John E. Hokanson, Kartik Shenoy, Mark T. Dransfield, Marilyn G. Foreman, Gwendolyn B. Vance, and Gerard J. Criner for the COPDGene® Investigators. “Race and Gender Disparities Are Evident in COPD Underdiagnoses across All Severities of Measured Airflow Obstruction.” Chronic Obstructive Pulmonary Diseases: Journal of the COPD Foundation 5, no. 3 (2018): 177–84. 10.15326/jcopdf.5.3.2017.0145.

Martinez, Natalia, Martin Bertran, and Guillermo Sapiro. “Minimax Pareto Fairness: A Multi-Objective Perspective.” Preprint submitted November 3, 2020.

McCradden, Melissa D., Shalmali Joshi, Mjaye Mazwi, and James A. Anderson. “Ethical Limitations of Algorithmic Fairness Solutions in Health Care Machine Learning.” The Lancet Digital Health 2, no. 5 (2020): e221–23.

Mehrabi, Ninareh, Fred Morstatter, Nripsuta Saxena, Kristina Lerman, and Aram Galstyan. “A Survey on Bias and Fairness in Machine Learning.” ACM Computing Surveys (CSUR) 54, no. 6 (2021): 1–35.

Mitchell, Shira, Eric Potash, Solon Barocas, Alexander D'Amour, and Kristian Lum. “Algorithmic Fairness: Choices, Assumptions, and Definitions.” Annual Review of Statistics and Its Application 8 (2021): 141–63.

Morning, Ann. “The Racial Self-identification of South Asians in the United States.” Journal of Ethnic and Migration Studies 27, no. 1 (2001): 61–79.

Nilforoshan, Hamed, Johann D. Gaebler, Ravi Shroff, and Sharad Goel. “Causal Conceptions of Fairness and Their Consequences.” Proceedings of the 39th International Conference on Machine Learning. PMLR 162 (2022).

Nixon, Jeremy, Michael W. Dusenberry, Linchuan Zhang, Ghassen Jerfel, and Dustin Tran. “Measuring Calibration in Deep Learning.” Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 38–41. New York: Computer Vision Foundation, 2019.

Obermeyer, Ziad, Brian Powers, Christine Vogeli, and Sendhil Mullainathan. “Dissecting Racial Bias in an Algorithm Used to Manage Health of Populations.” Science 366, no. 6464 (2019): 447–53.

Petersen, Eike, Aasa Feragen, Maria Luise da Costa Zemsch, Anders Henriksen, Oskar Eiler Wiese Christensen, and Melanie Ganz. “Feature Robustness and Sex Differences in Medical Imaging: A Case Study in MRI-based Alzheimer’s Disease Detection.” Preprint submitted July 14, 2022.

Pfohl, Stephen R., Agata Foryciarz, and Nigam H. Shah. “An Empirical Characterization of Fair Machine Learning for Clinical Risk Prediction.” Journal of Biomedical Informatics 113 (2021): 103621.

Pfohl, Stephen, Yizhe Xu, Agata Foryciarz, Nikolaos Ignatiadis, Julian Genkins, and Nigam Shah. “Net Benefit, Calibration, Threshold Selection, and Training Objectives for Algorithmic Fairness in Healthcare.” In FAccT’22: 2022 ACM Conference on Fairness, Accountability, and Transparency, 1039–52. New York: ACM, 2022.

Pfohl, Stephen R., Haoran Zhang, Yizhe Xu, Agata Foryciarz, Marzyeh Ghassemi, and Nigam H. Shah. “A Comparison of Approaches to Improve Worst-Case Predictive Model Performance over Patient Subpopulations.” Scientific Reports 12, no. 1 (2022): 1–13.

Pleiss, Geoff, Manish Raghavan, Felix Wu, Jon Kleinberg, and Kilian Q. Weinberger. “On Fairness and Calibration.” Preprint submitted November 3, 2017.

Rajkomar, Alvin, Michaela Hardt, Michael D. Howell, Greg Corrado, and Marshall H. Chin. “Ensuring Fairness in Machine Learning to Advance Health Equity.” Annals of Internal Medicine 169, no. 12 (2018): 866–72.

Rajpurkar, Pranav, Jeremy Irvin, Kaylie Zhu, Brandon Yang, Hershel Mehta, Tony Duan, Daisy Ding et al. “CheXNet: Radiologist-Level Pneumonia Detection on Chest X-rays with Deep Learning.” Preprint submitted December 25, 2017.

Rawls, John. Justice as Fairness: A Restatement. Cambridge, MA: Harvard University Press, 2001.

Ross, Andrew B., Vivek Kalia, Brian Y. Chan, and Geng Li. “The Influence of Patient Race on the Use of Diagnostic Imaging in United States Emergency Departments: Data from the National Hospital Ambulatory Medical Care Survey.” BMC Health Services Research 20, no. 1 (2020): 1–10.

Sendak, Mark P., Joshua D’Arcy, Sehj Kashyap, Michael Gao, Marshall Nichols, Kristin Corey, William Ratliff, and Suresh Balu. “A Path for Translation of Machine Learning Products into Healthcare Delivery.” EMJ Innovation 10 (2020): 19–172.

Seyyed-Kalantari, Laleh, Guanxiong Liu, Matthew McDermott, Irene Y. Chen, and Marzyeh Ghassemi. “CheXclusion: Fairness Gaps in Deep Chest X-ray Classifiers.” Biocomputing 2021 (2020): 232–43.

Seyyed-Kalantari, Laleh, Haoran Zhang, Matthew B. A. McDermott, Irene Y. Chen, and Marzyeh Ghassemi. “Underdiagnosis Bias of Artificial Intelligence Algorithms Applied to Chest Radiographs in Under-Served Patient Populations.” Nature Medicine 27, no. 12 (2021): 2176–82.

Smit, Akshay, Saahil Jain, Pranav Rajpurkar, Anuj Pareek, Andrew Y. Ng, and Matthew P. Lungren. “CheXbert: Combining Automatic Labelers and Expert Annotations for Accurate Radiology Report Labeling Using BERT.” Preprint submitted October 18, 2020.

Spencer, Christine S., Darrell J. Gaskin, and Eric T. Roberts. “The Quality of Care Delivered to Patients within the Same Hospital Varies by Insurance Type.” Health Affairs 32, no. 10 (2013): 1731–39.

Sun, Tony Y., Oliver J. Bear Don't Walk IV, Jennifer L. Chen, Harry Reyes Nieva, and Noémie Elhadad. “Exploring Gender Disparities in Time to Diagnosis.” Preprint submitted November 15, 2020.

Suresh, Harini, and John Guttag. “Understanding Potential Sources of Harm throughout the Machine Learning Life Cycle.” MIT Case Studies Series in Social and Ethical Responsibilities of Computing (Summer 2021).

Vayena, Effy, Alessandro Blasimme, and I. Glenn Cohen. “Machine Learning in Medicine: Addressing Ethical Challenges.” PLoS Medicine 15, no. 1 (2018): e1002689.

Verma, Sahil, and Julia Rubin. “Fairness Definitions Explained.” In FairWare’18: Proceedings of the International Workshop on Software Fairness, edited by Yuriy Brun, Brittany Johnson, and Alexandra Meliou, 1–7. New York: ACM, 2018.

Vyas, Darshali A., Leo G. Eisenstein, and David S. Jones. “Hidden in Plain Sight: Reconsidering the Use of Race Correction in Clinical Algorithms.” New England Journal of Medicine 383 (2020): 874–82.

Wadsworth, Christina, Francesca Vera, and Chris Piech. “Achieving Fairness through Adversarial Learning: An Application to Recidivism Prediction.” Preprint submitted June 30, 2018.

Wick, Michael, Swetasudha Panda, and Jean-Baptiste Tristan. “Unlocking Fairness: A Trade-Off Revisited.” In NIPS’19: Proceedings of the 33rd International Conference on Neural Information Processing Systems, edited by Hanna M. Wallach, Hugo Larochelle, Alina Beygelzimer, Florence d'Alché-Buc, and Emily B. Fox, 8783–92. Red Hook, NY: Curran Associates, 2019.

Wiens, Jenna, Suchi Saria, Mark Sendak, Marzyeh Ghassemi, Vincent X. Liu, Finale Doshi-Velez, Kenneth Jung et al. “Do No Harm: A Roadmap for Responsible Machine Learning for Health Care.” Nature Medicine 25, no. 9 (2019): 1337–40.

Wynants, Laure, Maarten van Smeden, David J. McLernon, Dirk Timmerman, Ewout W. Steyerberg, and Ben Van Calster. “Three Myths about Risk Thresholds for Prediction Models.” BMC Medicine 17, no. 1 (2019): 1–7.

Zhang, Haoran, Amy X. Lu, Mohamed Abdalla, Matthew McDermott, and Marzyeh Ghassemi. “Hurtful Words: Quantifying Biases in Clinical Contextual Word Embeddings.” In CHIL’20: Proceedings of the ACM Conference on Health, Inference, and Learning, edited by Marzyeh Ghassemi, 110–20. New York: ACM, 2020.

Zhang, Haoran, Natalie Dullerud, Karsten Roth, Lauren Oakden-Rayner, Stephen Pfohl, and Marzyeh Ghassemi. “Improving the Fairness of Chest X-ray Classifiers.” Proceedings of the Conference on Health, Inference, and Learning. PMLR 174 (2022): 204–33.

Zietlow, Dominik, Michael Lohaus, Guha Balakrishnan, Matthäus Kleindessner, Francesco Locatello, Bernhard Schölkopf, and Chris Russell. “Leveling Down in Computer Vision: Pareto Inefficiencies in Fair Deep Classifiers.” Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 10410–21. New York: Computer Vision Foundation, 2022.

Zong, Yongshuo, Yongxin Yang, and Timothy Hospedales. “Medfair: Benchmarking Fairness for Medical Imaging.” Preprint submitted October 4, 2022.
