Annex 1: Guideline development methods

Methods used to develop World Health Organization guidelines

To develop new guidelines or update existing guidelines on methods and tools to diagnose tuberculosis (TB), the World Health Organization (WHO) Global TB Programme commissions systematic reviews on the performance or use of the tool or method in question. A systematic review provides a summary of the current literature on diagnostic accuracy or user aspects for the diagnosis of TB or the detection of anti-TB drug resistance in adults or children (or both) with signs and symptoms of TB.

The certainty of the evidence is assessed consistently using the Grading of Recommendations Assessment, Development and Evaluation (GRADE) approach. GRADE produces an overall assessment of the quality (or certainty) of the evidence and a framework for translating evidence into recommendations. The certainty of the evidence is rated as high, moderate, low or very low. These four categories imply a gradient of confidence in the estimates. Even if a diagnostic accuracy study is of observational design, it is initially considered high-quality evidence in the GRADE approach (1).

In addition, the WHO Global TB Programme commissions systematic reviews to collect evidence in the field of resource use (i.e. cost and cost–effectiveness), as well as end-user perspectives on particular diagnostic tests or interventions. This evidence-to-recommendation process will inform domains such as feasibility, accessibility, equity and end-user values.

If systematic review evidence is unavailable or scarce, the potential subsequent effects can be modelled for both diagnostic accuracy and cost and cost–effectiveness. For instance, the prevalence of the disease in question, combined with the sensitivity and specificity of a certain test, can be used to estimate the number of false positives and false negatives in a population. Similarly, expenditures and cost–effectiveness ratios can be estimated and modelled, based on economic and epidemiological data. Finally, qualitative evidence on the end-user perspective of using a particular test may be generated through end-user interviews if data are scarce in the public domain.
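As an illustration of the diagnostic accuracy modelling described above, the expected numbers of true and false results can be derived from prevalence, sensitivity and specificity. The following is a minimal sketch only: the function name and the input values are hypothetical and are not taken from any systematic review or WHO model.

```python
def expected_test_outcomes(population, prevalence, sensitivity, specificity):
    """Estimate expected test results in a tested population.

    All proportions are fractions between 0 and 1. The name and
    structure of this helper are illustrative assumptions.
    """
    for name, value in (("prevalence", prevalence),
                        ("sensitivity", sensitivity),
                        ("specificity", specificity)):
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be between 0 and 1")

    with_disease = population * prevalence
    without_disease = population - with_disease

    # People with the disease: detected (true positive) or missed (false negative).
    true_positives = with_disease * sensitivity
    false_negatives = with_disease - true_positives

    # People without the disease: correctly ruled out (true negative)
    # or wrongly flagged (false positive).
    true_negatives = without_disease * specificity
    false_positives = without_disease - true_negatives

    return {
        "true_positives": true_positives,
        "false_negatives": false_negatives,
        "true_negatives": true_negatives,
        "false_positives": false_positives,
    }


# Hypothetical inputs: 1000 people tested, 10% prevalence,
# 90% sensitivity, 98% specificity.
outcomes = expected_test_outcomes(1000, 0.10, 0.90, 0.98)
for label, count in outcomes.items():
    print(f"{label}: {count:.0f}")
```

With these hypothetical inputs, about 10 of the 100 people with the disease would be missed, and about 18 of the 900 people without the disease would receive a false positive result. Repeating the calculation at different prevalence levels shows how the balance of false positives and false negatives shifts between settings.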

Following a systematic review, the WHO Global TB Programme convenes a Guideline Development Group (GDG) meeting to review the collected evidence. The GDG is made up of external experts whose central task is to develop evidence-based recommendations. The GDG also performs the important task of finalizing the scope and key questions of the guideline in PICO (i.e. population, intervention, comparator and outcomes) format.

This group should be established early in the guideline development process, once the Steering Group has defined the guideline’s general scope and target audience, and has begun drafting the key questions. The GDG should be composed of relevant technical experts; end-users, such as programme managers and health professionals, who will adopt, adapt and implement the guideline; representatives of groups most affected by the guideline’s recommendations, such as service users and representatives of disadvantaged groups; experts in assessing evidence and developing guidelines informed by evidence; and other technical experts as required (e.g. a health economist or an expert on equity, human rights and gender).

Recommendations are developed based on consensus among GDG members, where possible. When it is not possible to reach consensus, a vote is taken. When a draft guideline is developed by a WHO steering committee, it is reviewed initially by GDG members and subsequently by an External Review Group (ERG). The ERG is made up of individuals interested in the subject, and may include the same categories of specialists as the GDG. When the ERG reviews the final guideline, its role is to identify any errors or missing data, and to comment on clarity, setting, specific issues and implications for implementation – not to change the recommendations formulated by the GDG (2).

Formulation of the recommendations

Evidence is synthesized and presented in GRADE evidence tables. The evidence to decision (EtD) framework is used subsequently to facilitate consideration of the evidence and development of recommendations in a structured and transparent manner. Finally, recommendations are developed based on consensus among GDG members where possible. If it is not possible to reach consensus, then voting takes place. Decisions on the direction and strength of the recommendations are also made using the EtD framework.

Factors that influenced the direction and strength of a recommendation in this guideline were:

  • priority of a problem;
  • test accuracy;
  • balance between desirable and undesirable effects;
  • certainty of:
    • evidence of test accuracy;
    • evidence on direct benefits and harms from the test;
    • management guided by the test results;
    • link between test results and management;
  • confidence in values and preferences and their variability;
  • resource requirements;
  • cost–effectiveness;
  • equity;
  • acceptability; and
  • feasibility.

These factors are discussed below.

Priority of a problem

The GDG considers whether the overall consequences of a problem (e.g. increased morbidity, mortality and economic effects) are serious and urgent. The global situation is considered and available data reviewed. In most cases, the problem must be serious and urgent to be considered by a GDG.

Test accuracy

The pooled sensitivity and specificity presented in the GRADE evidence profile are assessed. Preferably, and if available, the review includes studies with both microbiological reference standards (culture) and composite reference standards (e.g. in children and in patients with extrapulmonary TB).

Balance between desirable and undesirable effects

Under this component, GDG members are asked to judge the anticipated benefits and harms from the test in question, including direct effects of the test (e.g. benefits such as faster diagnosis, and harms such as adverse effects from administration of the test). In addition, the possible subsequent effects of the test must be included; for instance, effects of treatment after a positive diagnosis (cure or decrease in mortality), and the effect of no treatment or further testing after a negative test result. Evidence, ideally retrieved from systematic reviews of randomized controlled trials (RCTs) of the test, should inform the GDG of these downstream effects. If evidence from RCTs is not available, diagnostic accuracy studies can be used. In the latter, true positive and true negative diagnosed cases are taken as benefits, whereas false positive and false negative cases are taken as harms.

Certainty of the evidence

Certainty of the evidence of test accuracy is judged and scored on a scale from very low, via low and moderate, to high. Certainty of the evidence on direct benefits and harms from the test is assessed and scored in a similar way.

Certainty of management

For certainty of patient management being guided by the test results, the GDG focuses on whether the management would be any different, should it be guided by the test results.

For certainty of the link between test results and management, the panel assesses how quickly and effectively test results can transfer to management decisions.

Confidence in values and preferences and their variability

The value of the test in improving diagnosis and its impact on patient care are evaluated and scored with the help of evidence from qualitative research. The ability of the test to increase case notification is also evaluated and scored, taking into account the entire diagnostic cascade, including, for example, issues related to feasibility of implementation, rate of use, staff’s confidence in test results and turnaround time of results.

Resource requirements

In relation to resource requirements, the following questions are answered:

  • How large are the resource requirements for test implementation?
  • What is the certainty of the evidence about resource requirements?
  • Does the cost–effectiveness of the intervention favour the intervention or the comparison?

Available evidence on cost–effectiveness is evaluated and scored.


Equity

GDG members consider whether implementing the tool or method will positively or negatively affect access to health care (e.g. will it be possible to implement the test at distinct levels of health care or through self-administration, or are there other ways to make the tool or method available to all levels of the health care system?).


Acceptability

In terms of acceptability, the panel considers whether the tool or method will be acceptable to all relevant stakeholders, such as health workers, health managers and patients.


Feasibility

The GDG considers how feasible it is to implement a tool or method in various settings. Aspects such as training and refresher training needs, hands-on time, biosafety requirements, time to results, service and maintenance, calibration, and effect on diagnostic algorithms are all taken into account in the final score.

For more details on the transition from evidence to recommendations, see Web Annex 3: Evidence to decision tables.

References for Annex 1

  1. Schünemann HJ, Oxman AD, Brozek J, Glasziou P, Jaeschke R, Vist GE et al. Grading quality of evidence and strength of recommendations for diagnostic tests and strategies. BMJ. 2008;336:1106–10. doi:
  2. Handbook for guideline development 2nd ed. Geneva: World Health Organization; 2014 (
