The use of CXR to screen for TB is a practice that goes back several decades. CXRs are also routinely used to triage patients presenting to care with signs, symptoms or risk factors for TB, in order to determine the most appropriate clinical pathway for evaluation. However, in many settings, the use of CXR for TB screening and triage is limited by the unavailability of trained health personnel to interpret radiography images and by substantial intra- and inter-reader variability in the accuracy with which abnormalities associated with TB are detected (70–72).
Numerous software packages that provide CAD, that is, automated interpretation of digital CXR images for the express purpose of determining the likelihood of TB disease, have been developed. They offer a potential technological answer to the many implementation challenges inherent in human interpretation of CXRs.
The GDG considered the performance of CAD software separately for the screening and triage use cases. For this guideline, triaging is defined as the process of deciding the diagnostic and care pathways for people based on their symptoms, signs, risk markers and test results. Triaging involves assessing the likelihood of various differential diagnoses as a basis for making clinical decisions (73). It can follow more- or less-standardized protocols and algorithms, and it may be done in multiple steps (68). A triage test for TB is one that can be rapidly conducted among people presenting to a health facility to differentiate those who should have further diagnostic evaluation for TB (those whose TB triage test is positive or abnormal) from those who should undergo further investigation for non-TB diagnoses (those whose TB triage test is negative or normal) (74). While there may be overlap between triaging and screening, there are several reasons to distinguish screening from triage when evaluating the performance of CAD software:
- The disease presentation may differ between the two populations: CXR findings of earlier TB are more likely to be encountered in screening populations than in triage populations. Therefore, the same sensitivity and specificity point may not be achieved, or may be achieved only with a different threshold score.
- TB prevalence will typically be much lower in screening populations (< 5%) than in triage populations (10–20%), which will impact a test’s predictive values and the numbers of individuals correctly and incorrectly diagnosed.
- The ethical consequences of not detecting TB, or of not detecting CXR findings that are unrelated to TB but clinically relevant and require follow-up examination, are different for populations that do not seek care than for those that do (11).
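The effect of prevalence on predictive values noted in the list above can be illustrated with a short calculation. This is a minimal sketch only: the sensitivity of 90% matches the target used in the evaluations, but the specificity of 70% and the two prevalence values (2% and 15%, chosen from within the ranges quoted above) are assumptions for illustration, not results from the evaluations. The function name `predictive_values` is likewise hypothetical.

```python
def predictive_values(prevalence, sensitivity, specificity):
    """Return (PPV, NPV) for a test applied at a given disease prevalence."""
    true_pos = prevalence * sensitivity
    false_neg = prevalence * (1 - sensitivity)
    false_pos = (1 - prevalence) * (1 - specificity)
    true_neg = (1 - prevalence) * specificity
    ppv = true_pos / (true_pos + false_pos)  # share of positive results that are TB
    npv = true_neg / (true_neg + false_neg)  # share of negative results that are not TB
    return ppv, npv


# Assumed inputs, for illustration only (not taken from the evaluations).
SENSITIVITY = 0.90
SPECIFICITY = 0.70  # hypothetical value
SCENARIOS = [("screening", 0.02), ("triage", 0.15)]  # hypothetical prevalences

for label, prevalence in SCENARIOS:
    ppv, npv = predictive_values(prevalence, SENSITIVITY, SPECIFICITY)
    print(f"{label}: prevalence={prevalence:.0%} PPV={ppv:.1%} NPV={npv:.1%}")
```

Under these assumed inputs, the same test yields a positive predictive value of about 6% in the screening scenario and about 35% in the triage scenario, which is why the number of individuals incorrectly referred differs so much between the two use cases.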
A previous assessment of using CAD for automated interpretation of digital CXRs for TB by WHO determined that in order to adequately assess diagnostic accuracy, it was necessary to evaluate CAD software using a standard panel of CXR files with associated demographic and clinical data, including TB diagnosis, drawn from a representative population for the corresponding use case for the technology. It was deemed essential that such evaluations ensure that CXR libraries used in an evaluation not be made available for CAD software development, training or evaluation (68). For this GDG meeting, a scoping review for independent evaluations that met these criteria was conducted. Three independent evaluations for both the screening use case and the triage use case that assessed the performance of three distinct CAD programmes were identified and presented to the GDG, and they included all products that had received a CE mark (for Conformité Européenne, indicating a product’s conformity with the European Economic Area’s directives or standards) by January 2020.² The GDG was blinded to the brand names of the software programmes. A separate quality assessment of the evaluations was conducted and results presented to the GDG.
CAD programmes produce a numerical abnormality score for each digital image read that can then be compared to a threshold defined by the user to indicate if the patient is to be referred for further TB diagnostic evaluation. Because the abnormality scores produced are continuous, the sensitivity and specificity can vary from 0 to 100%, depending on where the threshold is set. For evaluation for the GDG, each software programme was set to a threshold that corresponded to 90% sensitivity for detecting pulmonary TB disease based on a microbiological reference standard. The resulting accompanying specificity for the software at that threshold was then reported and compared with the diagnostic accuracy of human readers interpreting CXRs in the same studies.
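The threshold-setting procedure described above can be sketched as follows. This is an illustrative sketch, not the method used in the evaluations: the scores and reference labels are invented, the function names are hypothetical, and the convention that a score at or above the threshold means "refer for further evaluation" is an assumption.

```python
def sensitivity(scores, labels, threshold):
    """Fraction of reference-positive images scoring at or above the threshold."""
    positives = [s for s, y in zip(scores, labels) if y == 1]
    return sum(s >= threshold for s in positives) / len(positives)


def specificity(scores, labels, threshold):
    """Fraction of reference-negative images scoring below the threshold."""
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    return sum(s < threshold for s in negatives) / len(negatives)


def threshold_for_sensitivity(scores, labels, target):
    """Highest observed score that, used as threshold, reaches the target sensitivity."""
    # Walk candidate thresholds from high to low; the first one that reaches
    # the target keeps specificity as high as possible.
    for candidate in sorted(set(scores), reverse=True):
        if sensitivity(scores, labels, candidate) >= target:
            return candidate
    raise ValueError("target sensitivity cannot be reached")


# Invented abnormality scores (0-100) with a microbiological reference label:
# 1 = TB confirmed, 0 = TB not confirmed. Hypothetical data only.
scores = [95, 91, 88, 84, 80, 76, 71, 65, 58, 30,
          82, 60, 55, 47, 40, 35, 28, 22, 15, 9]
labels = [1] * 10 + [0] * 10

threshold = threshold_for_sensitivity(scores, labels, target=0.90)
print("threshold:", threshold)
print("sensitivity:", sensitivity(scores, labels, threshold))
print("specificity:", specificity(scores, labels, threshold))
```

Fixing sensitivity first and then reading off the resulting specificity gives a single comparable operating point for each programme, even though each programme's raw score scale may differ.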
Due to specific methodological challenges, the estimates of CAD diagnostic accuracy could not be pooled across software programmes or across evaluations. Thus, the performances of CAD programmes and human readers from the included evaluations were presented as ranges (see Table 4). The three included evaluations assessed each programme’s performance in different populations and in different settings (see Web Annex B, Tables 11 and 12, and Web Annex C, Tables 4 and 5).
The results showed that the performance of both human readers and CAD software programmes varied across different settings and populations. When the range of accuracy of CAD was compared with that of human readers interpreting CXRs, and given the variability of readers and the substantial overlap between the two ranges, the data suggested that there is little difference between the two. Therefore, the GDG judged that CAD software programmes can be considered accurate when compared with human readers.
Desirable effects beyond the accuracy of the technologies would likely include the possibility of scaling up and thus increasing access to chest radiography, given the scarcity of radiologists in many settings. In addition, GDG members noted that in many settings, general practitioners or other providers without specific training in radiology are often tasked with interpreting chest radiographs, and they may not be as highly skilled as the readers used for comparison in the evaluations considered. The comparisons presented here may therefore underestimate the true comparative accuracy of CAD software for detecting TB.
A drawback of using CAD interpretation in place of human readers for chest radiographs is that it cannot detect lung pathologies other than TB. The capacity of CAD technologies to simultaneously screen for multiple pulmonary or thoracic conditions could be attractive for programmes, but no data on the performance of CAD for differential diagnosis were available to be assessed by the GDG.
CAD technologies have the potential to increase equity in the reach of TB screening interventions and in access to TB care if they facilitate the scale-up of radiography for TB screening and triage and improve the interpretation of images.
The recommendation applies to software brands that upon external validation demonstrate a performance that is not inferior to the products reviewed by the GDG in 2020. The analysis for this recommendation was restricted to bacteriologically confirmed TB and, thus, the recommendation may not necessarily apply to other forms of TB (e.g. exclusively extrapulmonary TB, clinically diagnosed TB). This recommendation is specific to adults and adolescents aged 15 years and older. The recommendation applies only to the interpretation of anteroposterior or posteroanterior views of digital plain CXRs for pulmonary TB: it does not apply to the interpretation of lateral or oblique views, and its applicability to the interpretation of analogue CXRs is unknown.
² The three technologies that had received a CE mark by January 2020 and were included in all the evaluations are CAD4TB v6, Delft Imaging; Lunit Insight CXR, Lunit Insight; and qXR v2, Qure.ai.
CAD: computer-aided detection; CXR: chest X-ray.