Journal of the ICRU Vol 9 No 1 (2009) Report 81 Oxford University Press

doi:10.1093/jicru/ndp006

3. Quality and Performance Measures in Bone Densitometry

3.1 Introduction

Bone densitometry plays a central role in the diagnosis of osteoporosis, in fracture prediction, and in monitoring treatment. However, there is still ongoing debate on the strengths and weaknesses of the various techniques. In analogy to other devices that measure physically well-defined quantities, the performance of techniques used to determine bone mineral density (BMD) can be described by a few simple performance measures. For example, in measuring the length of an object, two parameters, trueness¹ and precision, suffice to characterize the performance of the measuring process. This is of course also true for quantities measured in bone densitometry. However, the clinical application requires further considerations, since the evaluation of osteoporosis cannot be reduced to an assessment of the physical quantities measured by bone densitometry. A clinician is not primarily interested in a BMD or a broad-band ultrasound attenuation (BUA) value itself but rather in its usefulness to diagnose osteoporosis, to predict fractures, to monitor disease progression or treatment-related changes, and, ultimately, to make decisions about treatment and further diagnostic procedures. Thus the performance measures of the physical measurement, trueness and precision, are of secondary interest clinically. More important are performance measures that characterize the three clinical tasks: diagnosis, prediction, and monitoring. Obviously, physical and clinical performance measures are not unrelated. In comparing two densitometric methods that measure identical quantities, the method with superior trueness and precision will result in superior clinical performance. However, this is not necessarily true if the two methods measure different quantities, such as BMD and SOS at the same site, or BMD at different sites.

¹ Formerly, trueness was denoted as accuracy. However, with the adoption of the ISO 5725-1 standard (ISO 5725-1, 1994), the definition changed so that accuracy now includes both trueness and precision.

Therefore, in order to assess the clinical value of bone densitometry, it is necessary to understand the technical and physical limitations of the measuring apparatus. In addition, a concept of clinical performance measures is required to compare the various densitometric methods that measure different quantities at different body locations using different physical principles. This involves reference data and requires the definition of disease and disease progression and the establishment of criteria on how to assess disease and its progression in individual patients. It is the aim of the present section to review, to categorize, and to outline limitations of the clinical performance measures that have been developed for these tasks. The present section will summarize the statistical concepts that are used to compare densitometric techniques but will not compare the techniques themselves. Active research is ongoing, and there is no agreement yet on which densitometric variable(s) should be measured, which skeletal site(s) should be investigated, and what technique(s) should be used for a given clinical task.

Throughout, the present Report will refer to a densitometric method as modality, technique, or implementation.

† The modality characterizes the general physical principle of the densitometric method. Thus DXA, QCT, and QUS are different modalities. Each modality measures one or more variables, e.g., BMD, BUA, or SOS.

† The technique denotes the specific physical principle or the specific application of a modality. Different techniques of a given modality measure different physical or anatomic quantities. For example, transverse-transmission and axial-transmission methods, which will be defined and described in detail in Section 7, are different techniques of QUS because, due to different physical principles, they yield different types of velocity results; even if all are denoted as SOS, they may represent different properties of the bones evaluated. For QCT, single- and dual-energy methods would similarly represent two different modalities because the physical principle is different, whereas spine and forearm BMD measurements are different techniques because they are two specific applications of DXA.

† The implementation of a technique denotes the technical realization. In contrast to techniques, different implementations measure the same physical and anatomic quantity and should in principle give identical results. For example, DXA measurements of the vertebral bodies L1 to L4 using machines from different manufacturers, or using fan-beam and pencil-beam scanners, are different implementations of the same technique. Similarly, gel- and water-based BUA measurements of the heel are different implementations of QUS. From a technical perspective, in fan-beam DXA systems the magnification effect must be considered in order to give BMD results identical to those from pencil-beam DXA systems. Similarly, the added water path must be considered in water-based QUS systems when compared with gel-based systems. These are implementation-specific details that should not change the outcome of the measurement.

3.2 Physical Performance Measures

This section defines the two basic physical performance measures: trueness and precision. These are fundamental concepts. Depending on a given clinical task, additional performance measures such as sensitivity and standardized errors must be defined. These will be introduced in the sections on diagnosis (Section 3.3), fracture risk assessment (Section 3.4), and monitoring (Section 3.5). This section will discuss neither modality-, technique-, nor implementation-specific sources of error; these are introduced in detail in the chapters covering the individual modalities. The definition of the physical performance measures follows the international standard ISO 5725-1 (1994), according to which the accuracy of a measurement method includes trueness, describing the systematic part of the overall error, and precision, describing the random part. Thus, contrary to former and still widespread use in bone densitometry and elsewhere, accuracy no longer denotes the systematic component, now named trueness, but shall be used to denote the total displacement of a result from a true or reference value. This change in terminology was adopted by ISO in 1994 (ISO 5725-1, 1994).

3.2.1 Trueness and Bias

Trueness is defined as the closeness of agreement between the average value of a quantity q obtained from a large series of test results and an accepted reference value m. It is usually expressed in terms of bias,² which is the difference between the expectation of the test results x̄ and an accepted reference value m (ISO 5725-1, 1994):

$$\text{bias} = \bar{x} - m. \qquad (3.1)$$

The bias is given in units of q. Alternatively, a relative bias given in % can be calculated:

$$\text{relative bias}\,(\%) = \frac{\bar{x} - m}{m} \times 100. \qquad (3.2)$$

Bias is the total systematic error. It is a theoretical value, unknown in reality. Further details are given in the ISO 5725-1 (1994) documentation. For the purposes of bone densitometry, the bias can be split into a constant and a variable part:

$$\text{bias} = \text{bias}_c + \text{bias}_v. \qquad (3.3)$$

The constant part bias_c is patient-independent; it can, for example, be caused by a calibration offset. It can be corrected by normalizing x̄ to the corresponding value of a reference population (Section 3.3.4) investigated with the same scanner. Of much larger concern is the variable part bias_v, which is typically patient-dependent and originates from erroneous or simplifying assumptions in the underlying physical principles or in the applied image-processing routines. bias_v has a large impact on diagnosis.

The true value of quantities such as BMD or BUA cannot be determined in vivo. Therefore, trueness is typically determined in vitro, preferably employing measurements in situ, i.e., using cadavers instead of excised bones, or by using phantoms to provide reference values. In order to determine the trueness of a method, the average bias of a number of different in situ measurements should be determined. Ashing techniques, which determine the bone mineral mass of a specimen or part of it, are recognized as the gold standard for assessing calcium content in bone densitometry. Also, a variety of compounds, mostly based on hydroxyapatite (or, in earlier times, K2HPO4), have been developed that can be used in phantoms to represent cortical and trabecular bone (Kalender et al., 1988). For QUS, no consensus on gold standards for ultrasonic variables in specimens has been identified.

² Formerly, bias was often denoted as accuracy error.


Trueness of SOS measurements can be evaluated relatively easily using phantoms, because SOS values of materials similar to those encountered in vivo are known and tabulated (Duck, 1990). In principle, the trueness of BUA could be determined in a similar fashion by using tabulated attenuation values in the frequency range between 100 kHz and 1000 kHz. However, the linear relation between attenuation and frequency of the phantom materials should be comparable to the relation observed for trabecular bone, and information about such materials is limited. Therefore, to date it remains difficult to evaluate the trueness of BUA measurements.

3.2.2 Precision

Precision is defined as the closeness of agreement between independent test results obtained under stipulated conditions. It depends only on the distribution of random errors and does not relate to the true or agreed reference value of q. The measure of precision is usually expressed in terms of imprecision³ and computed as a standard deviation (SD) (ISO 5725-1, 1994). Quantitative precision results depend on the stipulated conditions. According to the ISO standard, the terms repeatability and reproducibility should be used for two extreme conditions.

† Repeatability is used for conditions where independent test results are obtained with the same method on identical test items in the same laboratory by the same operator using the same equipment within short intervals of time.

† Reproducibility is used for conditions where test results are obtained with the same method on identical test items in different laboratories with different operators using different equipment.

In bone densitometry, conditions in between these two extremes are typically used. One must differentiate between phantom precision, in vitro precision of cadavers, and in vivo precision of subjects, all of which can be measured with or without repositioning the phantom, cadaver, or subject between repeat measurements. As a second differentiator, precision is reported as short- or long-term imprecision, where

† short-term imprecision is determined from a number of consecutive scans, usually with interim repositioning of the patient, cadaver, or phantom; thus, depending on technique, short-term refers to a time interval of minutes to hours;

† long-term imprecision should refer to a time interval comparable to typical clinical follow-up times, i.e., preferably a year or longer.

³ In the literature, the term precision error is often used for imprecision.

3.2.2.1 Short-term Imprecision. The short-term imprecision from n_j consecutive measurements of the quantity q in a phantom or in an individual subject j is given by the SD of q:

$$\text{short-term imprecision}_j = \text{SD}_j = \sqrt{\sum_{i=1}^{n_j} \frac{(x_{ij} - \bar{x}_j)^2}{n_j - 1}}, \qquad (3.4)$$

where x̄_j is the mean value of the individual measurements x_1, ..., x_{n_j} of subject j. In the above formula, the imprecision is given in the units of q. Alternatively, a relative imprecision can be defined:

$$\text{relative short-term imprecision}_j\,(\%) = \text{CV}_j = \frac{\text{SD}_j}{\bar{x}_j} \times 100, \qquad (3.5)$$

where CV denotes the coefficient of variation. Consecutive phantom measurements without repositioning yield the short-term phantom imprecision of the densitometric device. If carried out by the same operator, this is the phantom repeatability without repositioning. However, for diagnostic purposes, the short-term imprecision obtained in patients in vivo is more relevant. This includes errors due to patient repositioning and movement. Short-term imprecision in vivo is typically not determined in a single subject but by n repeat measurements in j = 1, ..., m subjects, each resulting in an individual SD_j given by Eq. (3.4). To obtain the combined imprecision for the technique, the arithmetic average of the variances, not the arithmetic average of the SDs, has to be calculated:

$$\text{short-term imprecision} = \text{SD} = \sqrt{\sum_{j=1}^{m} \frac{\text{SD}_j^2}{m}}. \qquad (3.6)$$

Thus the imprecision of a group of subjects is given by the root-mean-square (RMS) average of the imprecisions of the individuals of the group. In Eq. (3.6), it is assumed that the number of measurements is identical for each subject. To calculate the imprecision with sufficiently small confidence intervals, one should scan at least 27 subjects twice (or, equivalently, 14 subjects three times each). Further details are given in Glüer et al. (1995).
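As an illustration, the computation of Eqs. (3.4)-(3.6) can be sketched in a few lines of Python. The repeat-scan data and all variable names below are hypothetical, and the sketch assumes an equal number of repeat measurements per subject.

```python
import numpy as np

def short_term_imprecision(repeat_scans):
    """Combined short-term imprecision, Eqs. (3.4)-(3.6).

    repeat_scans: list of 1-D arrays; each array holds the n_j repeat
    measurements x_ij of one subject j (equal n_j assumed).
    Returns the absolute RMS imprecision (units of q) and the RMS CV in %.
    """
    # Per-subject SD with n_j - 1 degrees of freedom, Eq. (3.4)
    sds = np.array([np.std(x, ddof=1) for x in repeat_scans])
    # Per-subject relative imprecision, Eq. (3.5)
    cvs = np.array([100.0 * np.std(x, ddof=1) / np.mean(x) for x in repeat_scans])
    # Combine via the RMS average of the variances, Eq. (3.6),
    # not via the arithmetic mean of the SDs
    return np.sqrt(np.mean(sds ** 2)), np.sqrt(np.mean(cvs ** 2))

# Hypothetical spine BMD repeat scans (g/cm^2) of three subjects
scans = [np.array([1.02, 1.04, 1.01]),
         np.array([0.87, 0.85, 0.88]),
         np.array([0.95, 0.97, 0.96])]
sd, cv = short_term_imprecision(scans)
print(f"imprecision = {sd:.4f} g/cm^2, CV = {cv:.2f} %")
```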


3.2.2.2 Long-term Imprecision. Long-term in vitro precision of a densitometric system can be assessed with repeated phantom measurements. The determination of long-term precision in vivo is more complicated because the parameter of interest, for example BMD, may change within a subject during the course of the measurements. One strategy to separate this effect from machine imprecision is to calculate the standard error of the estimate (SEE) of separate regressions (either linear or higher order, if appropriate) of long-term data from a number of individuals and to take the RMS average. For each individual subject j, the SEE_j of a linear regression from n_j measurements can be calculated as

$$\text{long-term imprecision}_j = \text{SEE}_j = \sqrt{\sum_{i=1}^{n_j} \frac{(x_{ij} - \hat{x}_{ij})^2}{n_j - 2}}, \qquad (3.7)$$

where x̂_ij denotes the value predicted by the regression. The denominator (n_j − 2) accounts for two degrees of freedom, which in the case of a linear regression are slope and intercept. Again, a relative individual long-term imprecision can be defined:

$$\text{relative long-term imprecision}_j\,(\%) = \text{CV}_j = \frac{\text{SEE}_j}{\bar{x}_j} \times 100, \qquad (3.8)$$

where x̄_j is the mean value of all measurements x_1, ..., x_{n_j} of subject j. Analogous to the short-term case, Eq. (3.6), the long-term imprecision of n measurements in j = 1, ..., m subjects, which is also called the long-term imprecision of the technique, is given by the RMS average of the SEEs of the individuals defined in Eq. (3.7).
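A corresponding sketch for Eq. (3.7), again with hypothetical follow-up data: each subject's series is detrended by its own linear regression, and the resulting SEEs are combined by their RMS average, as described above.

```python
import numpy as np

def long_term_imprecision(follow_up):
    """RMS long-term imprecision from per-subject linear fits, Eq. (3.7).

    follow_up: list of (t, x) tuples, one per subject, with measurement
    times t (e.g., years) and measured values x (hypothetical data).
    """
    sees = []
    for t, x in follow_up:
        slope, intercept = np.polyfit(t, x, 1)    # linear trend of subject j
        residuals = x - (slope * t + intercept)   # x_ij - x_hat_ij
        # n_j - 2 degrees of freedom (slope and intercept), Eq. (3.7)
        sees.append(np.sqrt(np.sum(residuals ** 2) / (len(x) - 2)))
    return np.sqrt(np.mean(np.array(sees) ** 2))  # RMS average over subjects

# Hypothetical BMD follow-up (g/cm^2) of two subjects over three years
data = [(np.array([0.0, 1.0, 2.0, 3.0]), np.array([1.00, 0.99, 0.97, 0.96])),
        (np.array([0.0, 1.0, 2.0, 3.0]), np.array([0.90, 0.91, 0.89, 0.88]))]
print(f"long-term imprecision = {long_term_imprecision(data):.4f} g/cm^2")
```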

3.2.3 Clinical Limitations of Physical Performance Measures

Trueness and precision per se are only useful to compare different implementations, but not different modalities or different techniques, such as QCT and DXA measurements of the spine. QCT typically determines the BMD of a trabecular volume of interest inside the vertebral bodies L2 and L3, whereas DXA determines the areal BMDa of the total vertebral bodies L1 to L4. The same argument applies, for example, to transverse- versus axial-transmission measurements of the finger phalanges, which are different QUS techniques measuring different physical parameters of the same bone. Here the standardized errors introduced in Section 3.3.3.1.2 must be used to compare techniques.

The use of relative errors (CVs instead of SDs or SEEs) is tempting because an error given in percent may be easier to comprehend than an error given in absolute values. However, as is obvious from Eqs. (3.2), (3.5), and (3.8), relative errors depend on the value used for normalization. For example, assume that the absolute imprecision of a technique measuring BMD is constant. Then the relative imprecision automatically increases in elderly people because they have a lower BMD than younger people. Because for most BMD techniques the absolute imprecision actually increases in older subjects, the age dependence of the relative imprecision is further increased. The use of relative errors also falsely implies that relative errors of different modalities or techniques can be compared directly. In vivo errors further depend on the type of subject investigated. For example, and in view of the aforementioned age dependence, it may be misleading to base in vivo imprecision solely on young healthy subjects, since repositioning errors are larger in elderly and diseased patients. A representative subject group has to be selected to derive clinically relevant physical performance measures.

3.3 Diagnostic Performance Measures

It is the aim of this section to review criteria suitable for evaluating the performance of a given diagnostic method and to address the question of which method is best suited for a given diagnostic purpose. However, before proceeding, it is necessary to be more specific about the relation between physical measurement and diagnosis. A diagnosis should primarily answer the question of whether a subject is healthy, diseased, or perhaps in an intermediate state. This requires a definition of the disease, for which diagnostic criteria must be specified. Afterwards, a diagnostic method must be selected which, according to the diagnostic criteria, assigns the severity of the disease of a subject to a quantitative level. Ideally, treatment decisions based on this level should be possible. The diagnostic criteria in osteoporosis will be discussed in the next two sections.

3.3.1 Conceptual Definition of Osteoporosis

Osteoporosis is a “disease characterized by low bone mass and microarchitectural deterioration of bone tissue leading to enhanced bone fragility and a consequent increase in fracture risk” (Consensus Development Conference, 1993). This statement has several problems, principally because it is vague in several regards. Most important for the goal of the present Report, the description does not provide explicit diagnostic criteria that allow a decision as to whether an individual is osteoporotic or not. This is because the description specifies neither the reasons nor the mechanisms of action leading to low bone mass or a poor microarchitectural state. For example, low BMD may be caused by diseases other than osteoporosis. Deterioration could be due to inactivity, aging, or metabolic problems. In addition, the definition does not say anything about the material properties of the bone tissue. It should be noted that more recently another conceptual definition of osteoporosis, which puts more emphasis on bone quality, has been proposed at an NIH-sponsored workshop (Marwick, 2000). However, this definition has not been formally accepted by any of the world-wide bone societies and thus will not be discussed further.

3.3.2 Diagnostic Criteria for the Individual Patient

3.3.2.1 Fracture Status. It is accepted that a patient who has suffered a fracture without adequate trauma is likely to have osteoporosis. However, this criterion is not a suitable definition of the disease for several reasons. Most importantly, fractures are a late outcome of the disease, i.e., a consequence of low BMD and poor architecture. Restricting osteoporosis to a definition based on fractures defeats the purpose of an early diagnosis to begin prevention or treatment. Even without prevalent fractures, the fracture risk can be high; prevalent fractures just increase the fracture risk even further. Secondly, it is not easy to define an osteoporotic fracture. Low trauma is one aspect, but this is often difficult to ascertain. All fractures are caused by an impacting force. Therefore, a definition based on fracture status would unavoidably include aspects of force and the likelihood of a force that does not depend on bone status. Moreover, in the spine, various radiological signs are sometimes used to differentiate osteoporotic from non-osteoporotic vertebral deformities; however, no consensus has been developed yet on this issue. As a consequence, it is commonly accepted that the fracture criterion is a misleading concept for the definition of osteoporosis, and therefore this report will not include performance measures for methods that determine fracture status.

3.3.2.2 Low Bone Mineral Density. According to the conceptual definition of osteoporosis, low bone mass and microarchitectural deterioration represent two fundamental defining criteria of osteoporosis. Indeed, these two measures are strong risk factors for fracture. Unfortunately, with current methods, the microarchitectural status as well as bone fragility cannot be adequately measured in vivo. Therefore, BMD has been selected as the key measure to define osteoporosis (WHO, 1994). Concrete diagnostic criteria solely based on BMD have been provided by the operational definition of osteoporosis developed by a working group of the WHO. It is based on the T-score concept discussed in more detail below. The WHO (1994) definition uses areal BMD (BMDa)⁴ as measured by DXA to categorize a subject into one of four groups:

Normal:                      BMDa T-score ≥ −1.0
Low bone mass (osteopenia):  −1.0 > BMDa T-score > −2.5
Osteoporosis:                −2.5 ≥ BMDa T-score
Established osteoporosis:    −2.5 ≥ BMDa T-score and at least one osteoporotic fracture

⁴ The ICRU report endorses the use of BMDa to distinguish areal from volumetric BMD measurements. BMD refers only to volumetric density as measured by QCT, and BMDa denotes areal BMD as measured by DXA.

Although the WHO definition leads to characteristic difficulties when used as a diagnostic criterion for osteoporosis, as will be further discussed in Section 3.3.2.3, it at least specifies criteria that can be used to evaluate and compare the diagnostic performance of densitometric methods. Before introducing these performance measures, some other widely used diagnostic criteria will be defined here.

† Absolute Values. The easiest way to diagnose a patient as osteoporotic is the use of an absolute threshold value of a given measurement, independent of a reference group. For example, in spinal QCT, a BMD value below 120 mg·cm⁻³ is often considered osteopenic and a BMD value below 80 mg·cm⁻³ osteoporotic (Felsenberg and Gowin, 1999). Absolute values can be used for identical implementations of a given technique. In theory, a spinal BMD-based diagnostic criterion using QCT should be independent of the type of CT scanner and analysis software. This of course assumes identical trueness. In practice, however, trueness is not identical. Therefore, methods for standardizing absolute units across different implementations of a given technique are required.

† Percent Decrements. In order to reduce the effect of different systematic errors on different devices, the measurement of an individual can be expressed as a percent decrement relative to the mean of the reference population (see Section 3.3.4) of the same age. However, percent decrements do not incorporate the variance of the normal population, and they depend on the scaling of the quantity q that is measured. SOS, for example, always ranges around and above 1500 m·s⁻¹, and thus percent decrements are much smaller compared with QCT variables. Thus percent decrements cannot be used across techniques. Even across different implementations of a given technique, equal variances of the reference populations and equal imprecision would be required.

† T-score and Z-score. The variance of the reference data is considered in two important standardized measures, the Z- and T-scores. The Z-score denotes the difference between a measured value x of a variable q in an individual subject j and the age-matched mean reference value x̄_r, normalized by the age-matched SD, SD_r, of the population:

$$Z = \frac{x - \bar{x}_r(\text{age} = \text{age}_j)}{\text{SD}_r(\text{age} = \text{age}_j)}. \qquad (3.9)$$

The T-score is defined similarly, but instead of age-matched values, data from the young reference population (denoted by the index Y) are used (Faulkner, 2005):

$$T = \frac{x - \bar{x}_r(\text{age} = \text{age}_Y)}{\text{SD}_r(\text{age} = \text{age}_Y)}. \qquad (3.10)$$

Thus both T- and Z-scores are dimensionless numbers that denote a difference in SDs of q. Their use reduces bias: for example, a constant calibration offset is eliminated, and patient-dependent systematic errors, e.g., the fat error in QCT or DXA, are reduced. This is one of the reasons why the WHO operational definition of osteoporosis uses T-scores. As can be seen from Eq. (3.10), the definition of the reference data (Section 3.3.4) has an impact on the diagnosis of osteoporosis.
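The following minimal sketch evaluates Eqs. (3.9) and (3.10) and applies the WHO categories from Section 3.3.2.2. The reference values (young mean and SD) are assumed example numbers, not endorsed reference data.

```python
def z_score(x, mean_age_matched, sd_age_matched):
    """Z-score, Eq. (3.9): deviation from the age-matched reference."""
    return (x - mean_age_matched) / sd_age_matched

def t_score(x, mean_young, sd_young):
    """T-score, Eq. (3.10): deviation from the young reference population."""
    return (x - mean_young) / sd_young

def who_category(t):
    """WHO (1994) category from the BMDa T-score; fracture status, needed
    for 'established osteoporosis', is not considered here."""
    if t >= -1.0:
        return "normal"
    if t > -2.5:
        return "low bone mass (osteopenia)"
    return "osteoporosis"

# Hypothetical spine BMDa of 0.82 g/cm^2 against assumed reference values
t = t_score(0.82, mean_young=1.05, sd_young=0.12)
print(f"T-score = {t:.2f} -> {who_category(t)}")  # T ~ -1.92 -> osteopenia
```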



3.3.2.3 Clinical Limitations of the WHO Definition of Osteoporosis. The previous section outlined several diagnostic criteria used for individual patient diagnosis. The most important criterion is the one defined by the working group of the WHO, but a number of restrictions apply to this definition.

† It is based on epidemiological data (Kanis, 1997). A measurement of BMDa at the spine, proximal femur, and forearm would classify approximately 30 % of white postmenopausal women as osteoporotic if the lowest T-score of the three sites were used as the diagnostic criterion. At the age of 50, the lifetime fracture risk of white postmenopausal women is approximately 40 % (Melton et al., 1992). Thus, from an epidemiological view, the WHO classifies approximately the same number of women as osteoporotic as will actually sustain a fracture during their remaining lifetime. However, with respect to the individual diagnosis, many subjects classified as osteoporotic will not sustain a fracture, and vice versa.

† The lifetime fracture risk in other populations, men or non-white women, differs from that of white women. The WHO definition applies to white postmenopausal women only. There is ongoing debate about which criterion should be used for white men. Some studies support the use of a threshold at the same absolute BMD for men and women because fracture risk then seems to be the same (De Laet et al., 1997). Other authors argue that the available information is still inconclusive (Orwoll, 2000); these authors still recommend the use of separate male reference data, such as those currently implemented on commercial devices. Currently, there is no consensus regarding this topic.

† There is consensus that the WHO definition should not be applied to modalities other than DXA. The original WHO definition promoted the lowest T-score of DXA of the spine, hip, and forearm. However, with increasing evidence of site-specific differences, restrictions to central measurement sites (spine or hip) or even to the hip exclusively have been advocated (Kanis and Gluer, 2000). This has implications for the number of subjects diagnosed with osteoporosis.

† Sub-regions within the proximal femur or individual vertebrae should not be used for diagnostic purposes. If this were done and if the same threshold of −2.5 SD were used, the apparent prevalence of osteoporosis would increase substantially, beyond what is reasonable based on epidemiological studies.

† There is poor concordance of measurements made at one body site with those made at other body sites. The concordance can be as low as 25 %, that is, 75 % of the subjects are not consistently classified (Faulkner et al., 1999; Greenspan et al., 1996). Using different threshold criteria, such as a T-score of −1 relative to age 65, may reduce but not eliminate the inconsistencies (Lu et al., 2001a; 2001b).

† The diagnosis depends on the BMD value of the young normal mean and the SD of the reference data (see Section 3.3.4).

3.3.3 Diagnostic Performance of Techniques

3.3.3.1 Concepts of Diagnostic Performance

3.3.3.1.1 Diagnostic Response and Diagnostic Power. In order to put the trueness of a quantity q, as introduced in Section 3.2.1, into clinical perspective, it must be related to the differences to be determined. This difference will be called the diagnostic response and is defined as the difference of q measured in two populations, where typically one is healthy and the other diseased. It is given in units of q:

$$\text{response}_{\text{diag}} = \bar{x}_{\text{healthy}} - \bar{x}_{\text{diseased}}. \qquad (3.11)$$

It is usually impossible to compare the diagnostic responses of different techniques, particularly if they are expressed in different units. Therefore, a concept that adjusts for such differences and makes the results of different techniques comparable needs to be introduced. This concept is termed diagnostic power and describes the ability of a technique to detect changes. It is given as a unitless ratio of the response and the corresponding variable part of the bias of an accepted reference:

$$\text{power}_{\text{diag}} = \frac{\text{response}_{\text{diag}}}{\text{bias}_v}. \qquad (3.12)$$

Diagnostic power expresses the difference between two subject groups as multiples of the bias. The higher its diagnostic power, the better the ability of a technique to separate healthy and diseased subjects. Obviously, multiplicative or additive calibration offsets do not change the diagnostic power. It should be pointed out that there is no consensus about the best method for characterizing diagnostic power. While it is clear that power_diag as defined in Eq. (3.12) provides insight into this issue and will adjust for differences in calibration and units, further research is required to determine how best to characterize a technique's ability to yield accurate diagnostic results. For example, the intra-group variability also affects the ability of a technique to discriminate between subject groups, and it remains to be investigated whether this is due to lack of trueness or to other causes.

3.3.3.1.2 Standardized Bias. The quantity diagnostic power introduced above is a helpful performance measure, but results expressed as multiples of a bias are perhaps difficult to interpret. To provide a figure of merit that is more intuitively interpretable, the concept of standardized errors has been used (Glüer, 1999). Standardized errors easily allow for the comparison of different modalities or techniques. Basically, a reference technique (ref) is chosen for which the bias is well known. The technique being investigated (invest) is then compared with this reference technique by weighting the variable part of the bias of the technique investigated with the diagnostic response ratio of the two techniques:

$$\text{standardized bias}_v\big|_{\text{invest}} = \frac{\text{response}_{\text{diag}}\big|_{\text{ref}}}{\text{response}_{\text{diag}}\big|_{\text{invest}}}\,\text{bias}_v\big|_{\text{invest}}. \qquad (3.13)$$

Thus the standardized bias of a given technique is expressed as a bias in units of the reference technique. The above equation applies to absolute as well as percentage errors and can be used for short- and long-term errors. In the equation, only the variable bias parts are considered. It should be noted that responses typically depend on disease severity, age, menopausal status, etc. Thus the declaration of standardized errors requires an exact description of the population investigated.

Sensitivity–Specificity Analysis. In the simplest case, the result of a diagnostic method categorizes a patient as healthy or diseased. The performance of this method can be quantified using the established concept of a sensitivity and specificity analysis, as described in standard statistics textbooks. Sensitivity is defined as the probability of a positive test if the disease is present:

$$\text{sensitivity} = \frac{\text{number with disease who test positive}}{\text{total number with disease}}. \qquad (3.14)$$

Specificity is defined as the probability of a negative test if the disease is absent:

$$\text{specificity} = \frac{\text{number without disease who test negative}}{\text{total number without disease}}. \qquad (3.15)$$

Obviously, sensitivity and specificity both depend on a binary criterion by which the healthy and diseased groups are classified. In contrast to many other diagnostic tests, a test for osteoporosis, e.g., a BMD measurement, usually results in a continuous variable. Selecting thresholds to serve as binary criteria for a sensitivity and specificity analysis, such as the WHO criteria discussed above, artificially reduces the amount of information present in the measurement.
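For a densitometric test in which low values count as positive, Eqs. (3.14) and (3.15) translate directly into code. The threshold and data below are hypothetical.

```python
import numpy as np

def sensitivity_specificity(values, diseased, threshold):
    """Sensitivity and specificity, Eqs. (3.14) and (3.15), for a test
    where values at or below the threshold are considered positive.

    values: measured quantity (e.g., T-scores); diseased: boolean array
    with the true disease status of each subject.
    """
    positive = values <= threshold
    sensitivity = np.sum(positive & diseased) / np.sum(diseased)
    specificity = np.sum(~positive & ~diseased) / np.sum(~diseased)
    return sensitivity, specificity

# Hypothetical T-scores and disease status; threshold at the WHO cut-off
t_scores = np.array([-3.1, -2.7, -1.2, -0.4, -2.9, -1.8, 0.3, -2.6])
status = np.array([True, True, False, False, True, True, False, False])
print(sensitivity_specificity(t_scores, status, threshold=-2.5))  # (0.75, 0.75)
```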

Analysis of the Area Under the Receiver Operator Curve. A performance measure better suited for continuous data is the area under the receiver operator curve (ROC) (ICRU, 2008). In this case, the binary criterion is varied so that sensitivity and specificity take a number of values between 0 and 1. The ROC curve is then obtained by plotting the sensitivity as a function of 1 − specificity. The area under the ROC curve, Area_ROC, is defined as

$$\text{Area}_{\text{ROC}} = \int_{x=0}^{x=1} \text{sensitivity}(x)\,dx, \qquad (3.16)$$

where x = 1 − specificity. Area_ROC can be used as a comparative performance measure. The technique with the greatest area under the curve is the one that, on average, is associated with the best performance. However, one limitation should be noted: for clinical decision-making, specific cut-points, that is, dedicated levels of specificity, are required, whereas the ROC analysis provides a global criterion over all possible specificities. A dedicated cut-point for specificity limits the upper integration border in Eq. (3.16) to x < 1. Thus the comparison of two methods may differ depending on whether the global criterion or a certain specificity cut-point is used. However, as long as the ROC curves of two tests do not intersect, their ranking is not affected.
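Eq. (3.16) can be approximated numerically by sweeping the binary threshold over the observed values and integrating with the trapezoidal rule. The data and function names below are hypothetical; low values are again treated as test-positive.

```python
import numpy as np

def roc_auc(values, diseased):
    """Area under the ROC curve, Eq. (3.16), by threshold sweeping.

    Each threshold yields one (1 - specificity, sensitivity) point;
    the area is accumulated with the trapezoidal rule.
    """
    # Thresholds from 'below all values' (nothing positive) up to the
    # maximum (everything positive); both curves then run from 0 to 1.
    thresholds = np.concatenate(([values.min() - 1.0],
                                 np.sort(np.unique(values))))
    sens = np.array([np.mean(values[diseased] <= t) for t in thresholds])
    fpr = np.array([np.mean(values[~diseased] <= t) for t in thresholds])
    return float(np.sum(np.diff(fpr) * (sens[1:] + sens[:-1]) / 2.0))

t_scores = np.array([-3.1, -2.7, -1.2, -0.4, -2.9, -1.8, 0.3, -2.6])
status = np.array([True, True, False, False, True, True, False, False])
print(f"AUC = {roc_auc(t_scores, status):.3f}")  # 0.938 for this toy data
```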

3.3.4 Reference Data

In bone densitometry, reference data for a quantity q refer to a collection of data from a large number of subjects. Reference data play an essential role in the diagnostic decision process: results of patients are compared with reference ranges from the healthy population. Because of the typical phases of skeletal development, reference data can be split into three age phases: the growing skeleton (age 0 to approximately 20), the peak bone mass range with only minor increases or decreases (approximately age 20 to 40), and the period of bone loss beginning around age 40, for women specifically around the menopause. Normalizing measurements of q to the corresponding reference data reduces systematic errors. However, different manufacturers typically use reference data from different populations; thus, even on the implementation level, normalized measurements of q most often cannot be compared. In DXA, standardization efforts have been successfully completed for the spine (Genant et al., 1994), the femur (Hanson, 1997), and the forearm (Shepherd et al., 2002). This allows a mathematical transformation of reference data acquired on one scanner to the specific calibration of another scanner.

Collection of reference data is not an easy task. Two approaches have commonly been pursued:

† gathering of population-based reference data without any exclusion criteria, or
† convenience samples of healthy individuals, in which subjects with any identified bone-related disorder are excluded.

In the field of bone densitometry, the first method is preferred. Specifically for DXA measurements of the proximal femur, the manufacturers have agreed to use population-based reference data gathered in the context of the American NHANES III study (Looker et al., 1998). Convenience samples with specific exclusion rules are difficult to standardize because of the decision where to draw the line between healthy and diseased subjects. Moreover, they tend to result in a “super healthy” population with an artificially small population variance (Ahmed et al., 1997; Melton, 1997). This leads to errors if SDs from epidemiological studies on risk assessment, which frequently are based on population samples, differ from the SDs reflecting the population variance of the reference data. For example, SDs from the Study of Osteoporotic Fractures, a very influential epidemiological study on osteoporosis (Cummings et al., 1993; 1995), are substantially larger than the SDs given in the manufacturers' reference data. Thus the use of a super healthy reference population leads to an overestimation of fracture probability. While exclusion of subjects according to medical criteria is not advisable, exclusion of subjects is warranted if the measurements are technically inaccurate. For bone densitometry, this is of particular relevance for posterior–anterior DXA measurements of the spine, because above the age of 65 a large fraction of individuals will suffer from degenerative changes in the lumbar spine region. As DXA is a projectional technique, these degenerative changes often cannot be separated from the vertebra in the resulting image, which results in an artificial enhancement of the measured BMD values (Rand et al., 1997; von der Recke et al., 1996; Yu et al., 1995). It is advisable to exclude such subjects from the reference data.

Reference data have to be collected separately for different races (at least for Caucasians, Blacks, and Orientals) because of different fracture rates. Regional differences for the same race and gender have been reported, e.g., for BMD (Lunt et al., 1997), but there is no agreement on whether, for example, US reference data for white females can be used for European female populations or whether local reference data have to be collected. Unless high-quality local reference data are available and published, it is preferable to use published population-based American reference data. Also, one has to consider that differences in reference data do not necessarily reflect differences in fracture risk. Even if they do, a given level of, e.g., BMD may be associated with the same level of risk across different populations with different reference data. Thus the question of whether local or universal reference data should be chosen depends on the purpose of the classification. A classification of patients according to absolute levels of risk would require universal reference data (perhaps even across genders and ethnic groups), whereas a classification chosen to define the high-risk population of a given country would call for local reference data. It is necessary to know these relationships and to define the purpose of the reference data before any decision for or against the use of local reference databases can be made.

Reference data are expressed as a function of age and are reported as mean x̄_r(age) and population variance or SD, SD_r(age). Typically these two parameters are not given as continuous functions of age but as pooled values for consecutive 5- to 10-year intervals extending from 20 to 85 years (Kleerekoper, 1999). Sample sizes in excess of 100 subjects per age group can be considered fairly robust: the 95 % confidence limit of the estimated mean should then be within ±20 % of one SD, and the corresponding 95 % confidence interval of the estimated SD is no more than 12 % below or 16 % above the true SD. Fitting the data versus age is not recommended, because the data will vary for certain age groups, particularly the extremes, depending on the assumptions of the fit model. The dependence of reference data on variables other than age is usually neglected. Physiological conditions such as the onset of menopause, which has a significant impact on bone loss in women but varies with age among individuals, are not considered when assembling reference data. Also, height and weight have an impact on BMD that may become relevant in extreme cases.

3.3.5 Clinical Limitations of Bone Mineral Density as a Diagnostic Criterion

Fundamentally, two different approaches for assessing diagnostic performance in osteoporosis can be chosen.

If low BMD is selected as the decisive criterion for osteoporosis, then diagnostic power as introduced in Section 3.3.3.1.1 could be a figure of merit. It is based on trueness and can be determined in situ or in a subject group that includes all stages of the disease, from healthy to severely diseased (i.e., with multiple fractures, but with fracture status not used as a grouping criterion). Instead of the diagnostic response defined in Eq. (3.11), the variance of bias_v of this subject group would be used in Eq. (3.12). Standardized bias or diagnostic power could equally well be used as performance measures. The problem of this approach is its inability to separate osteoporotic from normal age-related variability. Its advantage is its applicability to early stages of the disease.

Alternatively, one could limit the assessment of diagnostic performance to later stages of the disease, when fractures are present that can unambiguously be classified as being of osteoporotic origin. In this situation, two groups of individuals can be defined, healthy and osteoporotic, and various methods for assessing group differences, such as the sensitivity–specificity or ROC analyses introduced above, can be utilized. However, the application of these methods for diagnostic purposes has fundamental limitations.

† They are limited to later stages of the disease. At early disease stages, the performance of diagnostic techniques, or even the ranking of different techniques, may be different.

† Diagnostic performance measures for BMD that are derived from the analysis of group differences due to the presence or absence of an osteoporotic fracture imply the risk of misinterpretation. Densitometric quantities such as BMD or QUS variables must not be mistaken as a method to diagnose fractures, despite the fact that the diagnostic performance measures, e.g., an ROC analysis, rely on statistical concepts also used in fracture diagnosis. Misinterpretation along these lines is a permanent phenomenon in the osteoporosis literature when the performance of bone densitometry is described. As a consequence, diagnostic performance measures such as a sensitivity–specificity or an ROC analysis should at best be used to establish a ranking among techniques and should not be interpreted as absolute diagnostic performance measures.

All BMD-based diagnostic criteria outlined in Section 3.3.2.2 artificially reduce the amount of information inherent in the measurement by reducing a continuous variable (in this case, BMD or BMDa, respectively) to a discrete discrimination of being osteoporotic or not. Thus small differences in a measurement may result in different classifications. All BMD measurements show poor concordance of measurements made at one body site with those made at other body sites, and one has to accept that site-specific differences in BMD status exist, even after standardization, which make it impossible to achieve perfect diagnostic agreement. Thus one may elect to restrict the evaluation to those sites that have the highest significance in terms of medical burden: the proximal femur and the spine.

The definition of a diagnostic criterion based solely on BMD faces further limitations. It cannot be used to derive treatment decisions because it does not differentiate between the underlying causes of low BMD, many of which require different treatment options. This is an inherent difficulty of the current conceptual definition of osteoporosis, which disregards all other clinical information available to the diagnosing physician. Therefore, the BMD result should never be used in isolation but in conjunction with information on other clinical risk factors and results from physical examination, patient history, basic laboratory tests, and, if indicated, radiographs or bone biopsies. Only if all information is viewed and interpreted together can a responsible diagnosis, resulting in appropriate decisions on prevention and treatment, be made.

3.4 Performance Measures to Assess Fracture Risk

3.4.1 Relevance of Fracture Risk Assessment

The determination of fracture risk is a central concept in osteoporosis because fractures represent the most severe outcome of osteoporosis, leading to substantial morbidity and mortality. Thus the primary treatment aim is the prevention of fractures and therefore the reduction of fracture risk. Also, there is increasing consensus to base treatment and prevention decisions on estimates of absolute fracture risk. What are the respective roles of diagnosis and risk assessment? Briefly, with diagnosis we try to identify the underlying causes of the disease, whereas risk assessment aims at its outcome, that is, fractures. Both tasks are related, as many factors used for diagnosis, such as low BMD, are also predictors of fracture risk. However, in contrast to relative fracture risk, absolute fracture risk, which is the decisive factor for predicting future fractures in the individual, so far cannot be determined from a diagnosis. In this sense, fracture risk is a more comprehensive figure, and therefore there is a trend to use estimates of absolute fracture risk for treatment decisions. Diagnosis is still relevant because fracture risk cannot be used to differentiate the underlying causes of the disease. However, the ultimate aim is not just the decision whether to treat or not but the optimal selection from different treatment options. Further, fracture risk is a continuous measure; thus diagnostic criteria are required in order to decide whether treatment is indicated and at what level.

In order to assess fracture risk, a variety of statistical approaches, including cohort and case–control studies, has been developed. Cohort studies are prospective studies: patients are selected based on an exposure assumed to be relevant for the outcome and are observed over time; the outcome in this case is fracture status. In contrast, case–control studies are retrospective studies: patients are selected based on the outcome variable, and cases and controls are then compared with respect to the exposure variable(s). After summarizing basic definitions, the most important statistical approaches used for fracture risk assessment will be reviewed for each of these study types. This will be followed by an introduction to performance measures that should be helpful to assess and compare densitometric methods in an objective fashion. The nomenclature developed above will be used here; e.g., densitometric methods are differentiated as modality, technique, or implementation.

3.4.2 Basic Definitions

Before describing methods to determine fracture risk, it is necessary to define fracture prevalence and incidence.

† The prevalence P of a disease (or, in this instance, of fractures) is defined as the number of cases at a given point in time.

† The incidence I is defined as the number of new cases of a disease (fractures) that occur during a specified period of time t.

† The prevalence rate PR is the ratio of the prevalence and the number n of persons at risk at the time when prevalence is determined: PR = P/n.

† The incidence rate IR relates the incidence to the total population time at risk: IR = I/(n·t). Usually t is measured in years and n is the number of persons who are without fracture prior to the time period and are followed during the period. Then IR is given as a number per person-years.

Table 3.1. Outcome of disease by group exposure

                 Disease
                 Yes    No     Total     Incidence rate    Odds
Exposure         A      B      A + B     A/(A + B)         A/B
No exposure      C      D      C + D     C/(C + D)         C/D

For a dichotomous risk factor, that is, a factor which is either present or absent, such as previous fractures, the relative risk RR is defined as the ratio of the incidence rates of two groups, of which one is exposed to the risk factor and the other is not (Table 3.1):

$$RR = \frac{A/(A+B)}{C/(C+D)}. \qquad (3.17)$$

RR is a dimensionless quantity that evaluates how much more or less likely it is that the disease occurs in the group with exposure to the risk factor compared with the group without exposure. For a calculation of RRs, the incidence rates must be known; therefore, RRs can only be determined in cohort studies. In case–control studies, in which incidence rates are unknown but fracture status is known, only the so-called odds ratio OR can be determined. The odds of disease are defined as the probability P of having the disease divided by the probability of not having the disease: odds = P/(1 − P) (Kleinbaum et al., 1982). OR is defined as the ratio of the odds of disease in the exposed subjects relative to the unexposed subjects and can be calculated from the following equation:

$$OR = \frac{A/B}{C/D} = \frac{AD}{BC}. \qquad (3.18)$$

In a case–control study, the proportion of exposure among fractured and non-fractured persons, and therefore the ORs, can be estimated. For rare diseases, the odds of disease are very similar to the probability of disease; similarly, the OR approximates the RR if the disease incidence is low. With respect to fracture risk, this report will adopt the following terminology.

† Odds ratios are used to characterize the risk of prevalent fractures. This can be done in case–control studies. ORs are also appropriate measures for incident fractures if the exact time point of fracture is not known while the time interval of its occurrence is known (e.g., vertebral fractures). Thus, in this context, the concept can also be applied in cohort studies.

† RRs are used to characterize the risk of incident fractures, such as hip fractures. Here cohort studies are required.
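A minimal sketch of Eqs. (3.17) and (3.18) using the cell counts of Table 3.1; the cohort numbers below are hypothetical.

```python
def relative_risk(a, b, c, d):
    """RR, Eq. (3.17): ratio of the incidence rates of the exposed
    (a with, b without disease) and unexposed (c, d) groups."""
    return (a / (a + b)) / (c / (c + d))

def odds_ratio(a, b, c, d):
    """OR, Eq. (3.18): (A/B)/(C/D) = AD/BC."""
    return (a * d) / (b * c)

# Hypothetical cohort: 30 of 100 exposed and 10 of 100 unexposed fractured
print(relative_risk(30, 70, 10, 90))  # 3.0
print(odds_ratio(30, 70, 10, 90))     # ~3.86; exceeds RR as outcome is not rare
```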


3.4.3 Risk Factors and Fracture Risk

Diagnostic variables determined in the field of bone densitometry, such as BMD or BUA, are important predictors of fracture risk, although the fracture event cannot be predicted with adequate certainty, since other factors also contribute. In other words, if two groups of subjects, one with and the other without prevalent vertebral fractures, are compared, for example, with regard to their BMD status, substantial overlap in their BMD values will be observed. Considering other factors such as age enhances the predictive accuracy, but an overlap remains (Fig. 3.1).

[Figure 3.1. Age-related BMD changes in women with and without vertebral fractures as measured by QCT, showing the large overlap (Engelke and Kalender, 1998).]

The uncertainty, characterized by the magnitude of the overlap between the BMD values of the two groups, is a measure of the discriminatory power of the technique or, in other words, a measure of the technique's ability to separate subjects with and without prevalent fractures. Interestingly, this ability usually closely resembles the technique's ability to predict future fractures. Therefore, statistical methods that allow characterization of a technique's ability to discriminate fractured and non-fractured patients and to predict prevalent fractures or future fracture risk are required; they will be reviewed and compared in the following section. The relationship between risk factors such as BMD, BUA, or age and fracture risk can be characterized by a continuous relationship, the exponential gradient of risk:

$$P = \exp(b_0 + b_1 x_1 + b_2 x_2 + \cdots + b_n x_n), \qquad (3.19)$$

where P is the probability of fracture, the x_i are the risk factors obtained as results of diagnostic tests or questionnaires, and the b_i are the coefficients describing the gradient of risk for the respective risk factor x_i, adjusted for the impact of all other risk factors in the model. The x_i can be continuous or categorical variables. For continuous variables such as BMD, it is obvious that there is no fracture threshold that would allow a clear separation of individuals at high and low risk.
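As a numerical illustration of Eq. (3.19), the sketch below evaluates the gradient-of-risk model for one subject. The intercept and coefficients are invented for illustration and, because the model is exponential, it is meaningful only where the resulting P is small.

```python
import math

def fracture_probability(b0, coefficients, risk_factors):
    """Exponential gradient of risk, Eq. (3.19): P = exp(b0 + sum b_i x_i).

    b0 and coefficients are hypothetical model parameters; in practice the
    b_i are estimated from cohort data, each adjusted for the other factors.
    """
    return math.exp(b0 + sum(b * x for b, x in zip(coefficients, risk_factors)))

# Hypothetical model: intercept, age (per year), spine BMD T-score (per SD)
p = fracture_probability(-7.0, [0.06, -0.5], [70.0, -2.0])
print(f"P = {p:.3f}")  # exp(-7.0 + 4.2 + 1.0) ~ 0.165
```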

3.4.4 Statistical Methods for Characterizing Fracture Risk

Several statistical methods are available to analyze the performance of a diagnostic test with regard to fracture assessment:

† Student's t-test: here, the univariate separation of mean BMD between the fractured and healthy groups is analyzed using mean differences relative to the SDs.

† Multivariate methods such as discriminant or logistic regression analysis: the goal of these methods is the prediction of group membership from a measured set of risk factors.

† Prospective models such as Poisson regression or the Cox proportional hazards model, which are suited to model fracture incidence observed in cohort studies.

3.4.4.1 Student's t-test Analysis. The independent t-test is a statistical procedure to compare the means of a parameter q measured in two independent groups. It is suited to test the performance of diagnostic tests in case–control studies. The test assumes that q is normally distributed in both groups and that the variances of the populations from which the two groups are drawn are equal. The test gives a dimensionless t-value:

$$t = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{\left[\dfrac{(n_1 - 1)\,\text{SD}_1^2 + (n_2 - 1)\,\text{SD}_2^2}{n_1 + n_2 - 2}\right]\left[\dfrac{1}{n_1} + \dfrac{1}{n_2}\right]}}, \qquad (3.20)$$

where x̄ denotes the mean value, SD the standard deviation of the parameter q, and n the number of subjects in groups 1 and 2, respectively. t depends linearly on the difference x̄_1 − x̄_2 of the group means relative to the pooled SD: the larger the difference, the larger is t. However, as is obvious from the equation above, t also depends on the sample sizes n_1 and n_2. In the simplest case of two groups with identical SD (SD_1 = SD_2) and the same sample size (n_1 = n_2), t is equal to the difference of the means of the two groups divided by $\sqrt{2}\,\text{SD}/\sqrt{n}$. The test outcome is the probability (often also called the significance level) P describing the likelihood that on average the two groups do not differ with regard to the diagnostic test variable, i.e., that they are two samples from an underlying population that does not differ with regard to the diagnostic test. The value of P can be calculated from the test statistic, in this case from the t-distribution, using the t-value and the degrees of freedom df = n_1 + n_2 − 2 (Sheskin, 1997). Equation (3.20) is valid if the variances of the two underlying populations are equal; otherwise, adjusted estimates must be used. If the variances are different, one can replace Eq. (3.20) by

$$t = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{\dfrac{\text{SD}_1^2}{n_1} + \dfrac{\text{SD}_2^2}{n_2}}}. \qquad (3.21)$$

The probability P again is calculated from the t-distribution, using the following expression for df:

$$df = \frac{\left[\dfrac{\text{SD}_1^2}{n_1} + \dfrac{\text{SD}_2^2}{n_2}\right]^2}{\dfrac{1}{n_1 - 1}\left(\dfrac{\text{SD}_1^2}{n_1}\right)^2 + \dfrac{1}{n_2 - 1}\left(\dfrac{\text{SD}_2^2}{n_2}\right)^2}. \qquad (3.22)$$

Performance Measures. Two performance measures that result from the t-test are the t- and P-values. However, these measures have limited value because they depend on sample size. In principle, the performance measure of a diagnostic variable that separates fractured and unfractured subjects should be independent of the sample size. For example, statistical significance indicated by the P-value reflects a difference of group means that is unlikely to be caused by chance. However, because P depends on the sample size, in a large sample a very tiny difference between groups may be significant, while in a small sample a much larger group difference may easily be missed. A sample-size-independent performance measure that can be derived from a two-independent-samples t-test is the so-called ω² (Sheskin, 1997):

$$\omega^2 = \frac{t^2 - 1}{t^2 + n_1 + n_2 - 1}. \qquad (3.23)$$
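The quantities of Eqs. (3.21)-(3.23) can be computed as follows; the two groups below are hypothetical BMD samples of non-fractured and fractured subjects.

```python
import numpy as np

def t_test_performance(x1, x2):
    """t-value for unequal variances, Eq. (3.21), the corresponding
    degrees of freedom, Eq. (3.22), and omega-squared, Eq. (3.23)."""
    n1, n2 = len(x1), len(x2)
    v1, v2 = np.var(x1, ddof=1), np.var(x2, ddof=1)
    t = (np.mean(x1) - np.mean(x2)) / np.sqrt(v1 / n1 + v2 / n2)
    df = (v1 / n1 + v2 / n2) ** 2 / ((v1 / n1) ** 2 / (n1 - 1)
                                     + (v2 / n2) ** 2 / (n2 - 1))
    omega2 = (t ** 2 - 1) / (t ** 2 + n1 + n2 - 1)
    return t, df, omega2

# Hypothetical spine BMD (g/cm^2): non-fractured versus fractured group
healthy = np.array([1.02, 0.98, 1.05, 0.95, 1.10, 1.01])
fractured = np.array([0.84, 0.90, 0.79, 0.88, 0.86])
t, df, omega2 = t_test_performance(healthy, fractured)
print(f"t = {t:.2f}, df = {df:.1f}, omega^2 = {omega2:.2f}")
```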

ω² estimates the proportion of variability of the dependent variable that is associated with the independent variable in the underlying population. The closer the absolute value of ω² is to 1, the stronger is the association. In the behavioral sciences, the following scheme for the interpretation of ω² is used: small effect, 0.01 ≤ ω² < 0.06; medium effect, 0.06 ≤ ω² < 0.15; large effect, ω² ≥ 0.15 (Cohen, 1988). Another performance measure of a diagnostic test is the AUC, the area under an ROC curve (Metz et al., 1973; 1998), which will be discussed in more detail in the following sections.

For a comparison of the diagnostic performance of two different diagnostic tests, t-values, Eq. (3.20), or the corresponding P-values are appropriate if the same numbers of fractured and unfractured subjects are used for both tests. If the sample sizes are different, it is more general to compare the ratio of the mean differences over the pooled SD according to Eq. (3.21). Again, an AUC comparison is always an appropriate indicator.

The t-test or AUC analysis is restricted to the discrimination of two groups. Analysis of variance (ANOVA) methods allow one to analyze mean differences of a diagnostic test, such as bone densitometry, among multiple groups, e.g., combinations of menopausal and fracture status (Johnson and Wichern, 1988). Further, even for two groups, the t-test can only be used if the test performance can be evaluated without adjustment for confounding variables; if this is not appropriate, multivariate methods must be employed instead. For comparing multiple diagnostic tests for differences among multiple groups, multivariate ANOVA (MANOVA) can be used, which compares joint mean differences of the risk factors among multiple groups via tests of linear contrasts. Usually, however, the ability of techniques to discriminate between two groups of subjects (most commonly fractured versus non-fractured) is to be evaluated in the presence of confounding variables that are continuous, ordinal, or dichotomous. For such evaluations, the following methods are well suited.

3.4.4.2 Linear Discriminant Analysis. ANOVA, MANOVA, and linear discriminant analysis (LDA) are all applicable to case–control studies. Different from MANOVA, which tests mean differences of multiple risk factors, LDA is a statistical method to find the best linear function of multiple risk factors for determining the group memberships of subjects. A linear discriminant function to predict fracture status consists of a linear combination of multiple risk factors xᵢ (continuous, ordinal, or dichotomous variables) that provides an index for group-membership classification (e.g., fractured versus non-fractured):

Z_{\mathrm{LDA}} = b_1 x_1 + b_2 x_2 + \cdots + b_n x_n ,   (3.24)

and a cut-off value beyond which a subject is assigned as fractured or normal. The linear coefficients bᵢ are determined by S⁻¹d, where S⁻¹ is the inverse of the pooled group variance–covariance matrix of the xᵢ within fractured and non-fractured subjects, and d is the vector of mean differences between fractured and non-fractured subjects (Johnson and Wichern, 1988). If the risk factors follow multivariate normal distributions, Z_LDA is the best risk-factor combination, resulting in the maximum AUC in an ROC analysis in which Z_LDA is used to assign group-membership status (Su and Liu, 1993). LDA is also quite robust for non-normally distributed data (Metz et al., 1973).

3.4.4.2.1 Performance Measures. An LDA does not by itself provide a performance measure to characterize a technique's discriminatory power. A performance measure can be generated as follows.
† First, an ROC analysis can be performed by comparing the group membership predicted by the discriminant function Z_LDA with the true group-membership status. The AUC of this ROC curve can then be used as a performance measure. Statistical tools for testing whether two ROC curves from the same subjects differ significantly have been published, and a public-domain program is available for the calculation (Metz et al., 1998). One should note, however, that this method is not very powerful: differences in area must be quite large before they become significant, and thus smaller differences may be missed.
† As another alternative, it is possible to test whether the performance of two tests differs significantly at a given point of the ROC curve, i.e., at a given level of sensitivity or specificity (ICRU, 2008).
† A third, more traditional approach is to compare the misclassification error using the leave-one-out (LOO) technique (Lachenbruch and Mickey, 1968). LOO, also called the cross-validation approach, means that the LDA is performed several times. Each time, a different subject (or a different group of subjects) is excluded from the study data, and Z_LDA is calculated from the remaining cases. The trueness of the linear discriminant classification is then tested by comparing the predicted classification with the true group membership. This procedure is repeated many times with different subjects excluded. All scenarios combined provide the misclassification probability, an outcome representative of the true test situation in which new, unknown cases are classified. The misclassification probability can be used as a performance measure. The performance of two diagnostic tests, or of two sets of different diagnostic variables, can be compared by generating a 2 × 2 table of correct and incorrect classifications. A McNemar test can then be used to test for significant differences in the classification performance of the two procedures (Kleinbaum et al., 1982).
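A sketch of the ROC and LOO procedures might look as follows. The data are hypothetical; scikit-learn's LinearDiscriminantAnalysis is assumed here, which by default estimates the pooled-covariance discriminant described above.

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import LeaveOneOut, cross_val_predict
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(1)
# Hypothetical risk factors (BMD, age) for fractured (y=1) and non-fractured (y=0)
X = np.vstack([rng.normal([0.80, 74], [0.12, 6], (60, 2)),
               rng.normal([0.95, 68], [0.12, 6], (80, 2))])
y = np.array([1] * 60 + [0] * 80)

lda = LinearDiscriminantAnalysis()            # pooled covariance by default
scores = lda.fit(X, y).decision_function(X)   # Z_LDA values, Eq. (3.24)
print("apparent AUC:", roc_auc_score(y, scores))

# Leave-one-out (Lachenbruch-Mickey) misclassification probability
pred = cross_val_predict(lda, X, y, cv=LeaveOneOut())
print("LOO misclassification probability:", np.mean(pred != y))
```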

3.4.4.3 Logistic Regression Analysis. In logistic regression, the logarithm of the odds of disease (e.g., fracture), which is equivalent to the logit transformation ln[P/(1 − P)] of the probability P of the presence of the disease, is estimated from n independent predictor variables:

Z_{\mathrm{LGR}} = \ln[P/(1-P)] = b_0 + b_1 x_1 + b_2 x_2 + \cdots + b_n x_n .   (3.25)

The probability of fracture is thus assumed to follow a binomial distribution with

P(Z_{\mathrm{LGR}}) = \frac{1}{1 + e^{-Z_{\mathrm{LGR}}}} .   (3.26)

The logistic regression coefficient bᵢ represents the logarithm of the OR for a one-unit increase in variable xᵢ:

\mathrm{OR}_{x_i} = \frac{\mathrm{odds}(x_i+1)}{\mathrm{odds}(x_i)} = \frac{P(x_1,\ldots,x_i+1,\ldots,x_n)/[1-P(x_1,\ldots,x_i+1,\ldots,x_n)]}{P(x_1,\ldots,x_i,\ldots,x_n)/[1-P(x_1,\ldots,x_i,\ldots,x_n)]} = e^{b_i} ,   (3.27)

b_i = \ln(\mathrm{OR\ for\ unit\ increase\ in\ } x_i) .   (3.28)

As the predictor variables xᵢ are typically biological risk factors of the disease, the logistic regression coefficients bᵢ have biological meanings, which is not the case for the coefficients of the LDA. In the case of a categorical predictor variable, b is the logarithm of the OR for the presence versus absence of this risk factor. Logistic regression analysis is mainly used for the analysis of case–control studies. Under the assumption that differences in follow-up time or changes in covariate values over time have no effect on study outcome, it can also be applied to cohort studies. In both study types, the model shows the same dependence on the risk factors x₁, ..., xₙ; therefore, the regression coefficients can be estimated using cohort or case–control studies. For cohort studies, the RR for parameter xᵢ can also be derived (McNutt et al., 2003; Zhang and Yu, 1998). The RR depends on b₀, which is a function of the prevalence of the disease in the study population. The applicability of logistic regression to the analysis of cohort data will be compared with the generally preferred Poisson and Cox regression models in Section 3.4.5.

3.4.4.3.1 Performance Measures. For categorical risk factors, the corresponding OR can be used as a performance measure describing the predictive power: the higher the OR, the stronger the respective risk factor. However, if the risk factor is a continuous variable resulting from a diagnostic test, such as a BMD measurement, the OR cannot be used directly as a performance measure because it is unit-dependent. A unit of change in different variables is usually not directly comparable; for example, a change of 1 mg·cm⁻³ is not equivalent to a change of 1 mg·cm⁻², and not even a 1 % change in one variable is necessarily equivalent to a 1 % change in a different variable. These difficulties can be prevented by using standardized increments in risk factors as performance measures. A common way of standardization is to express the OR of a risk factor xᵢ as a multiple of the SD of the population variance of xᵢ, typically after adjusting for relevant confounders such as age-related changes. In other words, standardized ORs (sORs) are obtained by multiplying the logistic regression coefficient bᵢ with the SEE calculated from a fit of population data of xᵢ versus age. sORs are dimensionless numbers that denote the OR for a change of one SD in xᵢ:

\mathrm{sOR}_i = \exp(b_i \cdot \mathrm{SEE}_i) .   (3.29)

For the calculation of the SEE, reference data for ages above 40 are often used, which are available for all densitometry and ultrasound variables. If other cross-sectional data are to be employed for this purpose, the SEE of xᵢ versus age can only be used if fractured cases are included, which increases the SEE in a fashion that depends on the sampling scheme. In case–control studies without knowledge of reference data, the within-group (in contrast to the between-group) SD could be used instead of the SEE, but this introduces some degree of bias. The performance of two techniques with regard to risk assessment can be compared by comparing their sOR values. Unfortunately, no closed-form analytical formula exists to evaluate whether differences are significant. One option is the use of bootstrap algorithms (Davison and Hinckley, 1997; Hui et al., 1995). Here, many random versions of different samples of the entire dataset (with replacement) are created. For each sample, the sOR is calculated for both techniques. Then, over all samples, the distribution of the difference of the two sORs is analyzed. If the distribution is not centered on zero, with fewer than 5 % or 2.5 % of the samples showing differences of opposite sign compared with the remaining ones, the difference in sOR, and hence in risk-assessment performance, is significant with 95 % confidence, one-sided or two-sided, respectively. This approach is powerful; however, it requires access to computer programs that allow sampling data with replacement from the study population to create the datasets required for the bootstrapping approach. The recommended number of trials is about 2000 or more (Davison and Hinckley, 1997; Westfall and Young, 1993). Alternatively, the fracture probability can be calculated for each patient for both of the techniques to be compared. These data can be used for an ROC analysis comparing predicted fracture probability with true fracture status. As described in Section 3.4.4.2.1, the AUC or the performance of the two tests at a given point of the ROC can be compared.
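A hedged sketch of Eq. (3.29) and the bootstrap comparison follows. It assumes statsmodels for the logistic fit; the arrays `bmd_a`, `bmd_b`, `age`, and `fractured` are hypothetical study data, and, unlike the procedure preferred above, the SEE is here estimated from the study sample itself rather than from published reference data.

```python
import numpy as np
import statsmodels.api as sm

def standardized_or(risk_factor, age, fractured):
    """sOR = exp(b * SEE), Eq. (3.29): b from a logistic fit of fracture status
    on the risk factor; SEE from a linear fit of the risk factor versus age."""
    b = sm.Logit(fractured, sm.add_constant(risk_factor)).fit(disp=0).params[1]
    age_fit = sm.OLS(risk_factor, sm.add_constant(age)).fit()
    see = np.sqrt(age_fit.scale)          # residual SD of risk factor vs age
    return np.exp(b * see)

def bootstrap_sor_difference(bmd_a, bmd_b, age, fractured, n_boot=2000, seed=0):
    """Distribution of sOR differences over bootstrap resamples (with replacement)."""
    rng, n = np.random.default_rng(seed), len(fractured)
    diffs = []
    for _ in range(n_boot):
        i = rng.integers(0, n, n)         # resample subjects with replacement
        diffs.append(standardized_or(bmd_a[i], age[i], fractured[i])
                     - standardized_or(bmd_b[i], age[i], fractured[i]))
    diffs = np.asarray(diffs)
    # Two-sided significance at 95 %: fewer than 2.5 % of differences change sign
    significant = min((diffs > 0).mean(), (diffs < 0).mean()) < 0.025
    return diffs, significant
```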

3.4.4.4 Poisson Regression Model. Another regression model used recently (Ensrud et al., 2000; Hochberg et al., 2002; Kanis et al., 2001; Lunt et al., 2003) to associate fracture risk with the prospective outcome of diagnostic tests is the log-linear or Poisson regression model (Vermunt, 1997). It assumes that

Z_{\mathrm{LLR}} = \ln(P) = b_0 + b_1 x_1 + b_2 x_2 + \cdots + b_n x_n ,   (3.30)

where, in equivalence to the logistic model, P is the fracture probability estimated from n independent risk factors x₁, ..., xₙ. With

\ln(\mathrm{IR}) = \ln(P) - \ln(t) = b_0' + b_1 x_1 + b_2 x_2 + \cdots + b_n x_n = Z'_{\mathrm{LLR}} ,   (3.31)

the fracture incidence rate IR can be written as

\mathrm{IR}(Z'_{\mathrm{LLR}}) = e^{Z'_{\mathrm{LLR}}} .   (3.32)

Because IR = I/(n·t), the expected number of new incidences I in n people followed over time is characterized by a binomial distribution with the binomial probability P = IR·t, where t is the average length of exposure. Similar to the logistic regression model, the coefficients bᵢ of the Poisson regression have a direct biological interpretation:

\mathrm{RR}_{x_i} = \frac{\mathrm{IR}(x_i+1)}{\mathrm{IR}(x_i)} = e^{b_i} , \quad b_i = \ln(\mathrm{RR\ for\ unit\ increase\ in\ } x_i) ,   (3.33)

and again, in the case of a categorical predictor variable xᵢ, b is the logarithm of the RR for the presence versus absence of this risk factor. In contrast to the logistic model, in Poisson regression P is modeled by an exponential function and thus is not automatically limited to the range between 0 and 1. Therefore, constraints must be imposed on the regression coefficients bᵢ to keep P within 0 and 1. For rare events, this is usually not a problem; in other cases, it is taken care of automatically by the statistical software packages. For this reason, a log-linear regression model is best suited for categorical independent variables, and also for multilevel categorical dependent variables, but not for diagnostic tests with continuous outcome. Continuous independent variables may be included after categorization, for example, by using tertiles or quartiles.

3.4.4.4.1 Performance Measures. Here we consider only categorical independent data. Thus the RRs can be used directly as a performance measure because they are dimensionless: the higher the RR, the stronger the respective risk factor. If two diagnostic models are compared using Poisson regression analysis, it is also possible to compare their RRs or, as an alternative, the magnitude of the regression coefficients bᵢ, which directly correspond to the probability of fracture versus no fracture.

3.4.4.5 Cox Proportional Hazard Model. The Cox proportional hazard model is a method for modeling time-to-event data in the presence of censored cases. In other words, it is applicable to model the risk of sustaining those fractures for which the time of the fracture event is known (e.g., hip fractures, but not generally vertebral fractures) in the presence of censoring (loss to follow-up, etc.). Time to fracture is the outcome variable. Instead of the disease probability P derived in logistic or log-linear regression as explained above, in the Cox model the so-called hazard function λ(t) is estimated. λ(t) in our case is the rate of new fractures at time t among all subjects followed to time t who have not yet fractured. Thus λ(t) represents a fracture incidence rate. The Cox proportional hazard model assumes that

\lambda(t) = \lambda_0(t) \cdot e^{Z_C} \quad \mathrm{with} \quad Z_C = b_1 x_1 + b_2 x_2 + \cdots + b_n x_n ,   (3.34)

where, as before, the xᵢ are independent diagnostic test variables or risk factors. λ₀(t), the baseline hazard function, describes the risk of subjects with xᵢ = 0. The primary aim of a Cox model is to evaluate the relative hazard of the covariates bᵢ from information about time to fracture, and not the estimation of the survival distribution (in our case, of the time until fracture), although there are also special procedures to estimate the survival distribution under proportional-model assumptions (Kalbfleisch and Prentice, 1980). When comparing groups, e.g., one group with a given risk factor absent and one group with the same risk factor present, only the ratio of the hazard functions, the so-called hazard ratio (HR), is evaluated. If λ₀(t) is the hazard function for group 1, then the hazard function for group 2 is λ₀(t)·e^Z, where Z is time-independent. In analogy to the log-linear regression model, the estimated regression coefficient bᵢ represents the logarithm of the RR, in this case the HR, for a one-unit increase in variable xᵢ, which is also independent of time t:

b_i = \ln(\mathrm{HR\ for\ unit\ increase\ in\ } x_i) .   (3.35)

This is also the logarithm of the HR for the presence versus absence of a categorical risk factor. Thus the Cox proportional hazard model is used to estimate the RR of fracture associated with a number of risk factors xᵢ. For calculating the absolute risk of fracture during a given time interval (e.g., the 10-year risk of fracture), procedures are available to estimate the baseline hazard function λ₀(t) of the Cox model (Kalbfleisch and Prentice, 1980).

3.4.4.5.1 Performance Measures. Similar to the sORs in logistic regression, standardized RRs (sRRs) can be used here as a performance measure:

\mathrm{sRR}_i = \exp(b_i \cdot \mathrm{SEE}_i) .   (3.36)

The performance of two techniques can then be compared using an ROC analysis for fracture within a given time window or at a given time point (Heagerty et al., 2000). As an alternative, bootstrap methods as explained in Section 3.4.4.3.1 can be employed to compare RRs.
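As an illustration of Eqs. (3.31)–(3.33), the following sketch fits a person-time Poisson model with statsmodels. The grouped cohort data and variable names are hypothetical; exposure (person-years) enters the log-linear model as an offset, and the coefficient of the dichotomous risk factor exponentiates to the RR.

```python
import numpy as np
import statsmodels.api as sm

# Hypothetical grouped cohort: incident fractures I and person-years n*t
# in two strata of a dichotomous risk factor (e.g., low BMD: no = 0, yes = 1)
low_bmd      = np.array([0, 1])
fractures    = np.array([30, 45])         # incident fractures I per stratum
person_years = np.array([12000, 6000])    # exposure n*t per stratum

X = sm.add_constant(low_bmd)
# ln(IR) = b0' + b1*x, Eq. (3.31); person-time enters via the exposure offset
fit = sm.GLM(fractures, X, family=sm.families.Poisson(),
             exposure=person_years).fit()
rr = np.exp(fit.params[1])                # RR = e^b, Eq. (3.33); here 3.0
print(f"incidence-rate ratio (RR) for low BMD: {rr:.2f}")
```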

3.4.5 Comparison of Statistical Methods

3.4.5.1 Discriminant Analysis versus Logistic Regression. Logistic regression uses a different approach from discriminant analysis to answer similar clinical questions. In discriminant analysis, it is assumed that the classification is given; random variations are assigned to the linear combination of risk factors. In addition, a normal distribution of the risk factors is usually assumed. In contrast, in a logistic regression it is assumed that the risk factors are given, and random variations are assigned to the binary classifications. While a logistic regression does not require the linear combination of risk factors to be distributed normally, it requires a binomial distribution of the dichotomous classification and a logit-transformed linear relationship between the binomial probability and the linear combination of the risk factors. It is a major advantage of logistic regression over discriminant analysis that its regression coefficients for the predictive variables do not depend on the prospective or retrospective study design. Because the model is determined by the predictive variables, it does not require assumptions of normal distributions and equal variances between cases and controls for these variables. If the assumptions of multivariate normality and equal variance matrices are met, the regression equations resulting from a logistic regression and an LDA are usually equivalent, but not identical. Nevertheless, logistic and log-linear regressions differ fundamentally from LDA, which is based on given subject fracture status and compares the distributions of risk factors in the two groups. The LDA assigns subjects to groups based on the distribution properties of the risk factors; therefore, the distribution properties matter. This is not the case in logistic or log-linear regression: here subjects are not classified into different groups; rather, the intention is to evaluate changes in fracture risk as functions of the risk factors. Logistic as well as log-linear regression directly model the fracture probabilities for given risk-factor values and thus require no distribution assumptions about these factors. Instead, both logistic and log-linear regression analyses require specific relationships between fracture (or non-fracture) probability and risk factors. As another important difference, discriminant analysis assigns subjects to different groups based on their current risk-factor values, whereas logistic and log-linear regression analyses evaluate fracture probability for a given length of time (period) and values of risk factors, most often at the beginning of the period (baseline). There are several benefits of using logistic regression analysis to analyze the probability of fractures. First, logistic regression analysis is naturally linked to binomial probability: the regression coefficients can be arbitrary real numbers, and the resulting estimated fracture probability is still within 0 to 1. Second, logistic regression analysis can utilize data from retrospective case–control studies as well as prospective cohort studies. The regression coefficients of logistic regressions remain the same for both study designs; this important property supports the use of case–control studies, which are much more economical than prospective studies. Third, the logistic regression coefficients have biological interpretations, whereas those from discriminant analysis have not.

3.4.5.2 Comparison of Regression and Survival Models. Logistic, Cox, and Poisson methods are all regression models, although the Cox and Poisson methods are often termed survival models. One of the important differences between logistic regression on the one hand and Poisson and Cox regression on the other is the consideration of time, e.g., the time until a fracture occurs. When logistic regression is used for cohort analysis, the follow-up time is implicitly assumed constant for all patients; in addition, potential time dependencies of the covariates are not considered. The Cox proportional hazard model estimates the time-dependent hazard function [see Eq. (3.34)]. It models fracture rates as a log-linear function of predictors (independent variables and covariates). The model separates the effect of time from the effect of time-independent predictors; that is, the effect of the predictors is assumed to be the same at all times. Extended models that allow for time-dependent predictors and even time-dependent effects exist for both logistic and Cox (Collet, 1994) regression models, but their discussion is beyond the scope of this report. One important assumption of the Cox regression model is that the censoring of subjects does not influence the survival distribution. Thus, for those subjects censored at duration t, the time without fracture exceeds t. Statistically, this is called noninformative censoring. If the censored subjects have a different fracture risk compared with noncensored subjects, most techniques in survival analysis, such as Cox regression, will fail. Poisson regression, a term often used in osteoporosis research, is more appropriately referred to as the exponential survival model, which is a special case of the Cox model in which the baseline hazard function is a constant: λ₀(t) = λ₀. Because of the constant baseline hazard, such data can be summarized as incidence per number of person-years, and the absolute risk of fracture can readily be derived. The constant baseline hazard function seems to be a strong assumption, but the model is actually quite flexible. For example, it can easily include time-dependent covariates in the regression model. Some applications have even included the time-in-study as a covariate in the model, although then the interpretation of the effect is problematic: according to the model assumptions the baseline hazard function is time-independent, whereas a relative HR of time-in-study imposes an exponential effect of time. More general parametric survival models, such as Weibull distributions, have been developed and can potentially be used to compare diagnostic utilities (Collet, 1994). Log-linear models usually deal with contingency tables created by categorical risk factors that have multiple category levels. Logistic models use dichotomous dependent variables, although an extended technique called multinomial logistic regression is also suited for more than two categories of outcome. Another difference is that logistic regression is based on proportional data: both the numerator P (number of fractured) and the denominator 1 − P (number of unfractured with the same risk) must be known, whereas in the Poisson model only P must be known. For example, one can have strata based on diagnostic variables and observe only fractured subjects within each stratum of a cohort study; for a stratum without unfractured subjects, the number of observations is zero. With Poisson regression, these data can be analyzed, but not with logistic regression. To our knowledge, a direct comparison of proportional hazards, Poisson, and logistic regression models has not been performed in the field of bone densitometry. Therefore, the following summary and recommendations have been extracted from a study of occupational cohort data (Callas et al., 1998); a small worked example of the OR–RR divergence is given at the end of this subsection.
† The differences between the Poisson and Cox regression models are small.
† RRs estimated from Cox regression are very similar to those estimated by Poisson regression, except for small sample sizes (n = 600) or rare outcomes (<5 %).
† The OR estimated by logistic regression differs from the RR of the Cox model, in particular when the outcome is common or the risk is high. However, it was not clear whether the overestimation of the OR was due to the missing account of time in the logistic regression or to the principal difference between ORs and RRs.
† Logistic and Cox regression also show larger differences for short (5 years) and long (30 years) follow-up times.
† The authors conclude that logistic regression should only be used for cohort data analysis if the outcome is below 5 % and the RR is below 2; even then, precision and trueness were lower than for the other models.

The application of Cox proportional hazard models is problematic for the assessment of vertebral fractures, since the exact date of the fracture is typically not known: incidence rates are determined from a comparison of two radiographs, typically taken at about the same time interval for all subjects examined. This is known as interval censoring. The Cox model assumes that the observation of the exact fracture time is possible. For appendicular fractures, such as hip fractures, this is not a problem, but the method cannot be applied to vertebral fractures. Here, logistic regression models, specifically pooled logistic regression (D'Agostino et al., 1990) or altered logistic HR regression (Sun, 1997), are more appropriate than Cox models. The impact of interval censoring on the estimate of the HR has not been evaluated in the literature. Intuitively, one would expect the HR to be underestimated because the times to event are postponed to the end of the observation intervals.
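As a worked illustration of the OR–RR divergence noted above (the probabilities are chosen for this example and are not from Callas et al.), consider a common outcome with fracture probabilities P₁ = 0.4 (risk factor present) and P₂ = 0.2 (risk factor absent):

\mathrm{RR} = \frac{P_1}{P_2} = \frac{0.4}{0.2} = 2.0 , \qquad \mathrm{OR} = \frac{P_1/(1-P_1)}{P_2/(1-P_2)} = \frac{0.4/0.6}{0.2/0.8} \approx 2.67 ,

whereas for a rare outcome (P₁ = 0.04, P₂ = 0.02) the two measures nearly coincide:

\mathrm{RR} = 2.0 , \qquad \mathrm{OR} = \frac{0.04/0.96}{0.02/0.98} \approx 2.04 .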

3.4.6 Performance Measures for Risk Prediction

Obviously, a more detailed discussion and comparison of all the facets of the various study designs and analysis techniques that can be applied to the assessment of fracture risk is beyond the limits of the present Report. The main intention here is the definition of performance measures that can be used to characterize the discriminatory and predictive powers of a diagnostic test applied to assess fracture risk. It has been shown that standardized measures of risk (ORs and RRs) are the preferred performance measures and that an ROC comparison or bootstrap procedures can be used to compare two diagnostic tests. ORs often overestimate RRs. If the predictive power of a new technique is to be determined, a cohort study is the method of choice. However, such studies are costly and time-consuming, and many promising new techniques may never be accepted because they are not included in cohort studies. As an alternative, the new technique and a reference technique, such as DXA of the spine or hip, can be measured in the same population. This allows a direct face-to-face comparison of the predictive power of the new technique. If the predictive power of the reference technique obtained in this case–control setting is similar to its predictive power previously determined from cohort studies, one can have better confidence that the estimates of predictive power for the new technique represent a very close approximation to a prospectively determined standardized relative fracture risk.

3.4.7 Combination of Risk Factors

In the literature, a multitude of risk factors for fracture has been discussed, but a unique set of independent factors has not been identified so far. Therefore, when the performance of a certain risk factor is to be investigated, it is important to adjust for covariates such as age. Adjustment for multiple risk factors generally reduces sORs and sRRs. The multivariate methods introduced in Section 3.4.4 adjust for dependencies between risk factors. Various automated variable-selection schemes, such as forward selection, backward elimination, stepwise selection, and best-subset selection, are available to optimize a model, i.e., to eliminate risk factors that become insignificant after adjustment for other risk factors included in a given model. However, it must be cautioned that "automated" variable-selection methods result in models that are unstable and not reproducible: the variables selected as independent predictors are sensitive to random fluctuations in the data (Austin and Tu, 2004).

3.4.8 The Individual Patient

3.4.8.1 Assessment of Fracture Risk. The performance measures to assess fracture risk presented above are based on the concepts of ORs and RRs applied to groups of patients. However, from the clinical perspective, the assessment of the individual patient is more important. In order to determine a patient's fracture risk, the following information is required: the BMD results, information on other relevant risk factors such as age and prevalent fractures, and the relationship between the risk factors and the probability of fracture. The latter can be obtained from prospective studies employing analysis methods as described in Section 3.4.3. For making treatment decisions, risk estimates for the next 5 or 10 years may be most meaningful. Data have been published for selected fracture types and populations (Kanis et al., 2001). While ideally computerized programs should become available that link the information of all the various risk factors with BMD and age information, simple graphs such as Fig. 3.2, published from data of the Rotterdam study, may be helpful.

[Figure 3.2. One-year hip fracture rates in women, depending on BMD and age (De Laet et al., 1998).]

The figure shows the dependence of 1-year hip fracture rates in women on age and BMD. Both of these variables have a strong impact on fracture risk. At a BMDa value of 0.7 g·cm⁻², a woman of age 58 has only one-tenth of the hip fracture risk (0.1 % instead of 1 % per year) of a woman of age 80 with the same BMDa. Similarly, the risk is smaller by the same factor of 10 if the BMD is higher by 2.5 SDs (a Z-score difference of 2.5). If information from additional risk factors is to be incorporated, additional charts must be generated, for example, one for women with and one for women without prevalent vertebral fractures. It is obvious that a computerized assessment method is required to cope with the large variety of risk factors already known today.

3.4.8.2 Treatment Decisions. The question that still cannot be answered with all of these charts or programs concerns the criteria of when to treat and how to treat. This depends on additional patient aspects and on the health-economic setting; not everything that may be reasonable will be doable, but this subject is clearly beyond the scope of this report. However, it should be acknowledged that a BMD measurement provides not only an estimate of fracture risk but also an estimate of whether the patient is likely to respond to treatment. At least for one group of anti-resorptive treatments, the bisphosphonates, it has been shown that patients with normal BMD will not benefit from treatment. The lower the BMD, the stronger is the treatment effect, i.e., the reduction in the number of fractures (Cummings et al., 1998). This is understandable and re-emphasizes the relevance of the diagnosis of low BMD: only if the patient's problem is osteoporosis, i.e., low BMD, will a bisphosphonate be effective. If the patient has a high risk of fracture for other reasons (e.g., a high propensity to fall), an osteoporosis drug is of lesser value. This has been confirmed in another study using bisphosphonates, in which treatment was not effective for a group of individuals selected solely on the basis of risk factors other than low BMD (Cummings et al., 1998). More data are required to investigate this issue in more detail; specifically for ultrasound, such data would be extremely valuable to allow the use of QUS methods for making treatment decisions.

3.5 Performance Measures to Assess Monitoring

Monitoring describes the capability of a technique to longitudinally measure changes of a parameter q. Changes may be age-, disease-, or treatment-related. The present section will first define quantities that can be used to quantify changes. Later, follow-up intervals for individual patients are discussed, and finally performance measures to compare densitometric techniques with respect to monitoring are presented.

3.5.1 Measures of Change

3.5.1.1 Longitudinal Response and Response Rates. In analogy to the diagnostic response (see Section 3.3.3.1.1), the longitudinal response is defined as the change of a variable q between two subsequent measurements qₙ and qₙ₊₁. It is given in the same units as q:

\mathrm{response}_{\mathrm{long}} = q_{n+1} - q_n .   (3.37)

We may also define a relative response:

\text{relative response}_{\mathrm{long}} = \frac{q_{n+1} - q_n}{q_n} \times 100 .   (3.38)

For longitudinal measurements it is more convenient to use response rates:

\text{response rate}_{\mathrm{long}} = \frac{q_{n+1} - q_n}{t} ,   (3.39)

where t denotes the time between the two measurements and is typically given in units of years. Thus the longitudinal response rate is given in units of q per annum. Relative response rates are given in %/year and are calculated by dividing the relative response by time t.
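A minimal helper for Eqs. (3.37)–(3.39); the function name and example values are hypothetical, with q in any densitometric unit:

```python
def longitudinal_change(q_n, q_n1, years):
    """Longitudinal response (units of q), relative response (%),
    and relative response rate (%/year)."""
    response = q_n1 - q_n                         # Eq. (3.37)
    relative = (q_n1 - q_n) / q_n * 100.0         # Eq. (3.38)
    return response, relative, relative / years   # rate per Eq. (3.39)

# Example: spine BMDa falling from 0.95 to 0.93 g/cm^2 over 2 years
print(longitudinal_change(0.95, 0.93, 2.0))       # ~(-0.02, -2.1 %, -1.05 %/year)
```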

3.5.1.2 Significant Changes and Monitoring Intervals. The least significant change (LSC) of a densitometric technique measuring q is defined as the change that can be measured with 95 % confidence (Glüer, 1999). Assuming Gaussian characteristics (two-tailed test), then for a two-point measurement

\mathrm{LSC} = 1.96 \cdot \sqrt{2} \cdot \mathrm{CV} \approx 2.8 \cdot \mathrm{CV} ,   (3.40)

where CV denotes the long-term coefficient of variation of the technique (see Section 3.2.2.2). The value of the LSC is given in percent. If instead of a 95 % confidence level an 80 % confidence level is used, we obtain the so-called trend assessment margin (TAM) (Glüer, 1999):

\mathrm{TAM} = 1.8 \cdot \mathrm{CV} .   (3.41)

The monitoring time interval (MTI) is defined as the time required between two measurements so that the change in q is equal to the LSC. It is calculated using the median longitudinal response rate of a cohort of subjects:

\mathrm{MTI} = \frac{\mathrm{LSC}}{\text{median relative response rate}_{\mathrm{long}}} .   (3.42)

Using the TAM instead of the LSC, a so-called trend assessment interval (TAI) can be defined:

\mathrm{TAI} = \frac{\mathrm{TAM}}{\text{median relative response rate}_{\mathrm{long}}} .   (3.43)

Both MTI and TAI are measured in years. They are estimates of the time period after which half of the patients show a measured change exceeding the LSC and the TAM, respectively.

3.5.2 Monitoring Performance of Techniques

Both the MTI and the TAI are measures appropriate to characterize a technique's ability to monitor skeletal changes: the shorter the MTI and the TAI, the better the monitoring performance. Alternatively, we could use concepts analogous to those introduced for determining diagnostic performance (see Sections 3.3.3.1.1 and 3.3.3.1.2).

3.5.2.1 Longitudinal Sensitivity. The longitudinal sensitivity of a variable q expresses observed changes over time (typically per year) as multiples of the imprecision:

\text{sensitivity}_{\mathrm{long}} = \frac{\text{response rate}_{\mathrm{long}}}{\text{imprecision}} .   (3.44)

3.5.2.2 Standardized Imprecision. Monitoring performance can also be expressed by determinations of imprecision if these are corrected for differences in response rates. Such a standardization procedure permits use of the familiar concept of imprecision instead of the TAI and MTI. As in the case of diagnosis (Section 3.3.3.1.2), the technique being investigated (invest) is compared with a reference technique (ref) by weighting the imprecision with the ratio of the longitudinal response rates of the two techniques:

\text{standardized imprecision}_{\mathrm{invest}} = \frac{\text{response rate}_{\mathrm{long}}|_{\mathrm{ref}}}{\text{response rate}_{\mathrm{long}}|_{\mathrm{invest}}} \cdot \text{imprecision}_{\mathrm{invest}} .   (3.45)

This method of standardization transforms the imprecision of a given technique to the scaling of the reference technique. The multiplication by the ratio of the response rates makes standardized imprecision truly comparable across techniques. The above equation applies to absolute as well as percentage errors. Table 3.2 illustrates a number of hypothetical examples of techniques measuring BMDa at the spine and the hip, BMD at the spine, and SOS and BUA at the calcaneus in a given population. BMDa at the spine is used as the reference technique relative to which the standardized imprecision values are calculated. As can be seen from the equation above, the imprecision, given here as a coefficient of variation, and the relative response rate are the two independent parameters from which the other parameters are calculated. It should be re-emphasized that both parameters are population-specific. Note that the technique with the smallest imprecision does not deliver the best clinical performance: the standardized imprecision is smallest for BMD of the spine, whereas the CV is smallest for SOS.

Table 3.2. Hypothetical comparison of performance measures of five techniques illustrating the parameters explained in this section

Technique       CV (%)   Relative response rate (%/year)   LSC (%)   MTI (y)   Standardized imprecision (%)
BMDa (spine)*   1.0      0.9                                2.8       3.1       1.0
BMDa (hip)      1.5      0.6                                4.2       7.0       2.3
BMD (spine)     2.0      2.0                                5.6       2.8       0.9
SOS (calc)      0.16     0.07                               0.45      6.4       2.1
BUA (calc)      1.2      0.3                                3.4       11.2      3.6

* BMDa at the spine is selected as the reference technique for the calculation of the standardized imprecision.
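The quantities in Table 3.2 follow directly from Eqs. (3.40), (3.42), and (3.45). The short sketch below reproduces the table rows; the reference response rate of 0.9 %/year (spine BMDa) and the rounded factor 2.8 are taken from the table and Eq. (3.40), and the function name is invented for this example.

```python
def monitoring_measures(cv, rate, ref_rate):
    """LSC (Eq. 3.40), MTI (Eq. 3.42), standardized imprecision (Eq. 3.45).

    cv: long-term CV (%); rate: median relative response rate (%/year);
    ref_rate: response rate of the reference technique (here spine BMDa)."""
    lsc = 2.8 * cv                  # rounded factor of Eq. (3.40)
    mti = lsc / abs(rate)           # years between measurements
    std_imprecision = cv * ref_rate / rate
    return lsc, mti, std_imprecision

# Rows of Table 3.2: (technique, CV %, relative response rate %/year)
techniques = [("BMDa spine", 1.0, 0.9), ("BMDa hip", 1.5, 0.6),
              ("BMD spine", 2.0, 2.0), ("SOS calc", 0.16, 0.07),
              ("BUA calc", 1.2, 0.3)]
for name, cv, rate in techniques:
    lsc, mti, si = monitoring_measures(cv, rate, ref_rate=0.9)
    print(f"{name}: LSC={lsc:.2f} %, MTI={mti:.1f} y, std. imprecision={si:.1f} %")
```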

3.5.3 Monitoring of Individual Patients

For clinical decisions, it is important to know whether a measured change reflects a true change of a parameter q. For two-point measurements over time, according to Eq. (3.40), only changes exceeding 2.8 times the imprecision of the technique used for monitoring can be considered true changes (95 % confidence). The corresponding measure is the LSC. On the other hand, clinicians have to balance the desire to attain statistical certainty against the patient's need to be treated as quickly as possible if there is a valid indication to do so. In order not to withhold potentially important medication, the clinician may be satisfied with confidence levels lower than 95 % (Genant et al., 1989). It is conceivable that different confidence limits may be appropriate in different clinical situations. For example, when identifying someone who has indeed responded to therapy in a situation where response is expected, the required confidence may be somewhat less. On the other hand, in a situation where a change of therapy is being considered, the clinician may require 95 % confidence in order to actually change the intervention. Statistically, intervals for any level of confidence can be defined. To avoid a plethora of different confidence levels, one additional, less stringent change criterion (80 % confidence level), the TAM, Eq. (3.41), has been introduced (Glüer, 1999). In order to calculate the minimum time between two measurements, the changes considered to be significant have to be related to the relative response rates of the techniques. Again, one will

face the dilemma of having to settle either for a quick answer at an early follow-up visit, associated with greater statistical uncertainty when estimating the true change from the measured change using the trend assessment interval, or for a more solid answer at a later visit using the MTI, with the risk of substantial bone loss and fractures in the meantime. A number of comments apply to the quantities discussed above.
† To be consistent, the imprecision and the response rate must be derived from the same cohort. The healthy postmenopausal population is most relevant for the calculation of TAM, LSC, TAI, and MTI; therefore, the fact that published imprecision data are often based on young healthy volunteers is a potential pitfall.
† The quantities LSC and TAM should be calculated using the in vivo long-term, not the short-term, imprecision.
† Response rates depend on many parameters, such as age, race, and gender. Of course, they are influenced by treatment and, in addition, they are technique-specific. Thus TAI and MTI must be used with caution, and the context in which they are used must always be specified.
