Original Article

Probability State Modeling Theory

C. Bruce Bagwell,1* Benjamin C. Hunsberger,1 Donald J. Herbert,1 Mark E. Munson,1 Beth L. Hill,1 Chris M. Bray,1 Frederic I. Preffer2

1 Verity Software House, Topsham, Maine

2 Department of Pathology, Massachusetts General Hospital, Boston, Massachusetts 02114

Received 19 September 2014; Revised 9 April 2015; Accepted 16 April 2015
Additional Supporting Information may be found in the online version of this article.
*Correspondence to: C. Bruce Bagwell, Verity Software House, PO Box 247, Topsham, Maine 04086. E-mail: [email protected]
C. B. B., B. C. H., D. J. H., M. E. M., and B. L. H. are employed by Verity Software House, manufacturer of the probability state modeling software product, GemStone™.
Published online 25 May 2015 in Wiley Online Library (wileyonlinelibrary.com)
DOI: 10.1002/cyto.a.22687

Abstract

As the technology of cytometry matures, there is mounting pressure to address two major issues with data analyses. The first issue is to develop new analysis methods for high-dimensional data that can directly reveal and quantify important characteristics associated with complex cellular biology. The other issue is to replace subjective and inaccurate gating with automated methods that objectively define subpopulations and account for population overlap due to measurement uncertainty. Probability state modeling (PSM) is a technique that addresses both of these issues. The theory and important algorithms associated with PSM are presented along with simple examples and general strategies for autonomous analyses. PSM is leveraged to better understand B-cell ontogeny in bone marrow in a companion Cytometry Part B manuscript. Three short relevant videos are available in the online supporting information for both of these papers. PSM avoids the dimensionality barrier normally associated with high-dimensionality modeling by using broadened quantile functions instead of frequency functions to represent the modulation of cellular epitopes as cells differentiate. Since modeling programs ultimately minimize or maximize one or more objective functions, they are particularly amenable to automation and, therefore, represent a viable alternative to subjective and inaccurate gating approaches. © 2015 International Society for Advancement of Cytometry

Key terms: high-dimensional modeling; broadened quantile function modeling; cytometry; polychromatic; high-dimensional visualization


INTRODUCTION

In many ways, the cytometry analysis paradigm has changed little since the 1980s. Analysis begins with collecting correlated measurements from populations of suitably stained cells. Samples are normally reacted with reporter molecules or stains that confer information about the quantity of specific types of cellular structures or molecules either on or in the cell. A cocktail of stains along with the cell's intrinsic properties allows different cell types to be partially or completely delineated from each other in subsequent analyses. Stains can include one or more fluorescent-labeled monoclonal antibodies, nucleic acid-specific fluorescent components, fluorescent lipophilic molecules, viable cell dyes, etc. The intrinsic properties of the cells include, but are not limited to, light scatter and electronic cell volume. Cells are then processed by a cytometer. Depending on the staining cocktail, intrinsic cellular properties, and the instrument's capability, various signals are detected, digitized, and stored. This stored data structure is often referred to as a listmode file, where cellular measurements are columns and events are rows in this database of event measurements. The number of events is usually in the range of 10–20,000 but can be in the millions for rare-event types of analyses. Listmode-correlated measurements are normally displayed as one-parameter (1P), two-parameter (2P), or three-parameter (3P) histograms. Using techniques such as principal components, higher numbers of correlated parameters can also be displayed (1).

Analysis of the cytometric data generally involves classifying the data into relevant populations or cell types by using some set of hierarchical or refinement gates. A gate is either a one-dimensional range or a two-dimensional boundary where events can either be inside or outside the boundary. Gates can also be combined into Boolean algebraic expressions. Events that satisfy these gates can then be displayed in other histograms defined by other measurements. This gate refinement process is carried out until the populations of interest are displayed on some set of histograms. Statistical analyses of these gated events along with all or some of the graphics are generally the ultimate output of conventional cytometry analysis.

This conventional approach to analysis of cytometric data has many limitations, including measurement scalability, gating errors, and multiple-sample averaging and visualization. The number of 2P histograms necessary to represent m-dimensional data encoded in a listmode file is m(m − 1)/2 and, thus, scales approximately to the square of m measurements. For example, examining a sample with a cytometer that generates 10 correlated measurements requires 45 separate 2P histograms. A sample with 11 measurements requires 55 2P histograms. With this many histograms, a visual understanding of implicit relationships between measurements is very difficult, if not impossible. Thus, conventional approaches to cytometry analyses do not scale well with an increasing number of correlated measurements.

The conventional approach is also limited by compounding gating errors. A gate is a boundary that attempts to define one or more cellular populations. However, populations are actually m-dimensional probability distributions. Attempting to separate these distributions with simple boundaries can result in false positives contained in the gate and false negatives excluded by the gate. Since the output of one gate is typically used for the definition of another, these errors are compounded as the number of measurements increases. Gating errors are further exacerbated by the often subjective placement of gates.

The third major issue with conventional cytometry analyses is the difficulty of integrating the cytometric data from related samples into a form that is compact and understandable. Cytometry reports often contain numerous histograms and statistical results, making it difficult to appreciate a clear picture of the status of a specimen or a group of specimens.

Other analysis methods have been published that examine and visualize the phenotypic changes in cells as they differentiate and mature in the marrow and other tissues (1–3). Some of these methods have reported reproducibility issues and nonbiological branches (1,2). One recently published method (3) generates plots that resemble probability state modeling (PSM) expression profiles (EPs), but these plots are, for the most part, not consistent with the biology as shown in the companion B-cell manuscript (4). The reasons for this disparity are not self-evident and deserve further study. In particular, the downregulation of CD10 in Bendall et al. does not appear to be coordinated with CD38 downregulation, whereas in the PSM B-cell study (4), the downregulation of the two is exquisitely coordinated.

PSM was invented in 2007 (5) primarily to perform high-dimensional modeling of cytometry data. Different aspects of PSM have been described elsewhere (6–8). Most recently, Inokuma et al. (9) employed PSM to model memory CD8+ T-cell differentiation stages. PSM either mitigates or eliminates the major problems associated with conventional analysis strategies and avoids many of the pitfalls of other strategies. As each additional measurement is added to a probability state model, the information about the measured population typically becomes clearer and more accurate. The visual representation of the correlated data is easily understood by those not intimately familiar with cytometry, even when representing a large number of correlated parameters. There are no gates in a PSM and, therefore, no compounding gating errors or subjective placement of gates. In most instances, the PSM analyses run unattended and, therefore, are reproducible with multiple operators. Many samples can be combined into a single average model to represent a cohort of patients (4), and with appropriate statistical analysis, these data can reveal important coordinated modulations.

The purpose of this manuscript is to describe the theory of PSM and provide simple step-by-step examples of how it works. A general strategy for designing PSM models is also presented, which directly relates to a companion article (4) where PSM determines the observable stages in progenitor and B-cell ontogeny. There are three short videos available in the online supporting information that describe the relevant techniques used to model the ontogeny of B cells in bone marrow. The video entitled "Automate B Cell Selection" describes the technique used to select for B cells. The video "High-Dimensional Modeling" describes how stratifying markers such as CD10, CD34, and CD45 were added to stage B cells. The video "Advanced B Cell Modeling" describes how zero-transition modeling enhanced the accuracy of locating B-cell marker modulations and how models were averaged and statistically analyzed for a group of samples.

MATERIALS AND METHODS

PSM was performed using the GemStone™ software program, version 1.0.118 (Verity Software House, Topsham, ME).

RESULTS

In brief, a probability state model represents a set of cellular progressions or "cell types." Each cell type contains a set of broadened Q functions called expression profiles (EPs), which correspond to correlated measurements. In order to understand how PSM can model any number of correlated measurements, it is important to first appreciate the properties and relationships among histograms, P functions, and Q functions. A P function is also known as a cumulative distribution function or CDF. A Q function is also known as a quantile function and is the inverse of the P function. These functions will be designated as P and Q throughout this manuscript.

Histograms, P Functions, and Q Functions

Figure 1. Histograms, P, and Q functions. In order to understand PSM theory, it is important to appreciate the properties and relations among histograms, P functions, and Q functions. Panel A illustrates a simulation of n = 5,000 measurements, x, randomly drawn from some distribution. The collection of all measurements is represented as X. Panel B shows a 100-bin histogram, h_X, of the set of measurements, X, shown in Panel A. Panel C shows a probability histogram, p_X, where each bin's frequency is divided by the total number of measurements, n. Panel D shows the cumulative probability distribution, P_X, of p_X, where each successive bin's probability is the sum of the preceding bins' probabilities. The P_X distribution can also be posed as a function, commonly known as a P function. The P function has the property that if each measurement, x, is evaluated by P_X, the resultant values form a set of uniform random numbers, U_X, which have a relatively uniform frequency distribution shown in Panel E. Panel F is the inverse cumulative probability distribution, Q_X, which can also be posed as a function, commonly known as a Q function. The Q function has the property that if another set of n uniform random numbers, U, is evaluated by the Q function, a set of equivalent measurements, Y, is produced as shown in Panels G and H. As demonstrated in Panel H, equivalency between these two data sets can be appreciated as similar histograms h_X and h_Y, where their differences are only due to counting error.

Figure 1A illustrates a simulation of n = 5,000 measurements, x, randomly drawn from some distribution. The collection of all measurements is represented as X. Figure 1B shows a 100-bin frequency histogram, h_X, of this set of measurements, X. Figure 1C shows the associated probability histogram, p_X, where each bin's frequency is divided by the total number of measurements, n. Figure 1D shows the cumulative probability distribution, P_X, of p_X, where each successive bin's probability is the sum of the preceding bins' probabilities. The P_X distribution can also be posed as a function, commonly known as a P function. The P function has the property that if each measurement, x, is evaluated by P_X, the resultant values form a set of uniform random numbers, U_X, which have a relatively uniform frequency distribution as illustrated in Figure 1E. Figure 1F is the inverse cumulative probability distribution, Q_X, which can also be posed as a function, commonly known as a Q function. The Q function has the property that if another set of n uniform random numbers, U_Y, is evaluated by the Q function, a set of values, Y, is produced as shown in Figures 1G and 1H. As demonstrated in Figure 1H, equivalency between the X and Y data sets can be appreciated by inspection of their generated histograms h_X and h_Y, respectively, where the differences are only due to counting error.

There are a few important concepts depicted in this figure. The first is that Figures 1C, 1D, and 1F are simply different yet equivalent ways of representing the same information. The data in the figure were created by defining p_X and deriving Q_X. Alternatively, the data could have been created by defining Q_X and deriving p_X. There are a number of very important advantages in defining a model with Q functions rather than frequency distributions. The main advantage is that it side-steps the dimensionality barrier that is normally associated with high-dimensional modeling by relating the coordinated changes in measurements to a single parametric variable.

Another important concept demonstrated in Figure 1 is that if one has a set of probabilities as shown in Figure 1C, an inverse cumulative probability distribution function can be created from these probabilities as shown in Figure 1F. If a uniform random number is chosen and evaluated by this function, an element in the set can be appropriately chosen such that the aggregate of all selections results in an equivalent frequency distribution as demonstrated in Figure 1H. This method of selecting elements from a set of probabilities is referred to as stochastic selection and plays a pivotal role in Monte Carlo simulations (10) and PSM. The pseudo code for an algorithm that does this stochastic selection given some weight vector is shown below:

    SS(W) :=
        n ← rows(W)
        u ← rnd(1)                         // uniform random number between 0 and 1
        total ← Σ W
        s ← 0;  cP ← 0
        while cP ≤ u
            cP ← cP + W_{s+1} / total
            s ← s + 1
        return s + rnd(1)                  // dithered for graphics; ⌊ ⌋ recovers the integer index

(Code 1)
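For readers who prefer executable code, the following is a minimal Python rendering of the stochastic-selection logic of Code 1. The function name ss, the NumPy dependency, and the example weight vector are our own illustrative choices and are not part of the GemStone implementation.

    import numpy as np

    def ss(weights, rng=np.random.default_rng()):
        """Stochastic selection: pick a 1-based element index from a weight vector.

        A uniform random number u is compared with the running cumulative probability
        cP; the first element whose cumulative probability exceeds u is chosen.  A
        second uniform draw dithers the result for plotting; floor() removes it.
        """
        w = np.asarray(weights, dtype=float)
        total = w.sum()
        u = rng.random()
        cP, s = 0.0, 0
        while cP <= u and s < len(w):
            cP += w[s] / total
            s += 1
        return s + rng.random()

    # Aggregate frequencies of many selections track the weights (cf. Figure 1H).
    rng = np.random.default_rng(0)
    draws = [int(np.floor(ss([0.3, 0.7], rng))) for _ in range(10_000)]
    print(np.bincount(draws, minlength=3)[1:] / len(draws))   # approximately [0.3, 0.7]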


The pseudo code (Code 1) represents the basic logic of stochastic selection and should not be considered as in its most efficient and robust form. The function obtains a uniform random number, u, ranging between 0 and 1. It then finds the s element in W that is associated with the cumulative probability, cP, being just greater than u. This part of the code is equivalent to evaluating the Q function formed by a weight vector, W, with a uniform random number. At the end, it dithers the result for the purpose of graphics representation. If necessary, this random component can be removed by the floor function, which is mathematically represented with the brackets ⌊ ⌋. It is important to note that the cumulative probability axis shown in Figure 1F is expressed more conveniently in PSM as a percent, which will be designated as s throughout the article.

The important relationships shown in Figure 1 can be represented compactly using random variables. When the symbol U is written, it is understood that it means an infinite set of uniform random numbers ranging between 0 and 1. The data shown in Figure 1A were synthesized by evaluating a subset of uniform random numbers, U_X (Fig. 1E), with the inverse function, P_X^{-1} (Fig. 1F), which is also represented as Q_X:

X = P_X^{-1}(U_X) = Q_X(U_X).

(1)

This process can be reversed (Fig. 1E) to produce the original set of uniform random numbers, U_X, from the generated X:

U_X = P_X(X) = Q_X^{-1}(X).

(2)
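The equivalencies in Eqs. (1) and (2) are easy to check numerically. The short NumPy sketch below is our own illustration; the lognormal source distribution and variable names are arbitrary stand-ins for the data of Figure 1.

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.lognormal(mean=1.0, sigma=0.4, size=5_000)       # stand-in for the Figure 1A data

    # Eq. (2): evaluating each x by the empirical P function gives ~uniform values U_X.
    U_X = (np.argsort(np.argsort(X)) + 1) / len(X)
    print(np.histogram(U_X, bins=5, range=(0, 1))[0])        # roughly equal bin counts

    # Eq. (1): evaluating fresh uniforms U_Y with the empirical Q function gives a data
    # set Y that has the same distribution as X, differing only by counting error.
    U_Y = rng.random(len(X))
    Y = np.quantile(X, U_Y)
    print(np.round(np.percentile(X, [10, 50, 90]), 2))
    print(np.round(np.percentile(Y, [10, 50, 90]), 2))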

If another set of random numbers, U_Y, is used to produce Y (Fig. 1G), then X will be consistent with Y (compare Figs. 1A and 1G):

Y = Q_X(U_Y) ≐ X.

(3)

By consistent, we mean that the probability distributions will be the same despite the order of the set elements being different. A corollary to Eq. (3) is that if a set of parameters, C, is chosen such that

U ≐ Q_X^{-1}(X; C),

(4)

then the function, Q_X^{-1}(X; C), can be assumed to be consistent with the data, X. From a modeling perspective, the set of C parameters can be chosen to approach this equality. PSM generalizes these concepts to include multiple broadened Q functions.

Frequency-Based Versus Probability-Based Modeling

A frequency-based kernel is usually defined in traditional modeling packages, which represents some parent distribution without error. In Figure 2A, a frequency-based kernel is shown that represents two populations of events similar to those depicted in Figure 1. The lower-intensity population is given a green color and the higher-intensity population, a blue color. In Figure 2B, the kernel is then broadened to simulate measurement uncertainty. This uncertainty can be due to measurement error and/or heterogeneity. Broadening can be either a brute force method or a closed-form function, that is, the mathematical convolution with an error distribution such as the normal distribution. Minimization methods such as Levenberg–Marquardt (11) can then find the "best" model parameters given a suitable response function such as the summation of χ²s between the model and observed histogram data. Figure 2C shows the result of this minimization method applied to the data generated in Figure 1.

PSM can be divided into similar modeling steps. PSM begins with a similar kernel shown in Figure 2D. In this case, the measurement, x, is on the y-axis and cumulative percent is on the x-axis. When the kernel is presented in this manner, instead of two equal frequency spikes (Fig. 2A), there are two equal cumulative percent intervals (Fig. 2D). Modeling data with these types of Q functions is a well-accepted and statistically valid approach to modeling (12). The novel aspect of PSM is that the Q functions can also be broadened to depict measurement uncertainty (Figs. 2E and 2F). In PSM, a broadened Q function is referred to as an EP. Minimization methods find the optimum parameters of the probability-based model that best fit the data (Fig. 2F). Details of these response functions are described in the following theoretical section.

Main Advantages of PSM

Probability-based modeling has a number of advantages over its frequency-based counterpart. The most obvious difference is that the PSM kernel has an intrinsic direction. In Figure 2D, the high-intensity population (blue) could have been defined first and the low-intensity population (green) second without negatively affecting the fit. Since both cumulative percent and time share the characteristic that they are both nondecreasing functions, cumulative percent can act as a surrogate for time and place events in a natural progression order. A second advantage is that the cumulative percent axis can be shared by other EPs that also modulate with a progression. In mathematics, a line can be defined in any number of dimensions by creating a set of equations, each with a common parameter such as t. As t varies, each equation defines how the function changes in each dimension. Since all EPs have the same common parameter, cumulative percent, they can also define how a progression modulates in any number of dimensions.

PSM Theory

Cytometry data can be represented as a matrix, L, where the rows are events and the columns are the correlated measurements that have been properly compensated to eliminate any signal crossover (13). The linear data in L are normally mapped by a set of transform functions, Z, to a new matrix, X, where the rows are still events but the column values are now in transformed units. The transforms are designed to stabilize cluster variances and to set a uniform maximum scale for all transformed values (14). Note that all vector and matrix indices will be presented as one-based integers.


X_{i,j} = Z_j(L_{i,j}),    i ∈ 1, 2, ..., n;  j ∈ 1, 2, ..., m,        (5)

where sup{X ∈ ℝ^m : X_{i,j} < x_max} = x_max;
L_{i,j}: linear compensated datum from the ith event and jth measurement;
Z_j: jth transform function;
x_max: maximum transformed measurement, usually set to 100.
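The article does not prescribe a particular Z_j; hyperbolic-arcsine (arcsinh) and logicle-style transforms are common choices for cytometry data. The sketch below is a hypothetical transform in the spirit of Eq. (5): an arcsinh compression rescaled so the largest expected linear value maps to x_max = 100. The cofactor and linear range are illustrative placeholders, not values from the paper.

    import numpy as np

    def make_transform(cofactor=150.0, linear_max=262_144, x_max=100.0):
        """Return an illustrative Z_j: arcsinh compression rescaled onto [0, x_max]."""
        top = np.arcsinh(linear_max / cofactor)
        def z(linear_values):
            return x_max * np.arcsinh(np.asarray(linear_values, dtype=float) / cofactor) / top
        return z

    z = make_transform()
    L_col = np.array([0.0, 500.0, 5_000.0, 50_000.0, 262_144.0])   # compensated linear data
    print(np.round(z(L_col), 1))    # monotone mapping of the column onto the 0-100 scale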

A single process or cell type is defined as the function, p_i(s) = T(s, X_{i,·}; C), with the state of a progression, s, and the ith event's set of transformed measurements, X_{i,·}, as independent variables. There can be any number of cell types in a model. Most of the mathematics that follows is for a one-cell-type system; however, at the end, we will consider models with multiple cell types. The function is controlled by a set of parameters, C, referred to as control definition points or CDPs. The function returns the probability that event i is associated with the state of progression s, p_i(s):

p_i(s) = T(s, X_{i,·}; C) := ∏_j EP_j(s, X_{i,j}; C_j) / Σ_s ∏_j EP_j(s, X_{i,j}; C_j),    0 ≤ p_i(s) < 1,

EP_j(s, X_{i,j}; C_j) = N(X_{i,j}; Q_j(s, C_j), σ_j(s, C_j)),    0 ≤ EP_j(...) < 1,

i ∈ 1, 2, ..., n;  j ∈ 1, 2, ..., m;  s ∈ U(0, s_max),        (6)

where
s: state of the progression (not to be confused with time);
p_i(s): probability that event i is at progression state s;
X_{i,·}: all transformed measurement values for event i;
EP_j(s, X_{i,j}; C_j): jth expression profile, evaluated at s and the ith event's jth measurement value, with control parameters C_j;
Q_j(s, C_j): model function for the jth transformed measurement evaluated at s, defined by parameters contained in the matrix C_j;
σ_j(s, C_j): standard deviation function or line-spread for the jth transformed measurement evaluated at s, also defined by matrix C_j;
s_max: maximum range of the state of progression, usually set to 100;
m: number of correlated measurements in the model;
n: total number of observations or events in process T.

PSM refers to this progression function, T(s, X_{i,·}; C), as a cell type. The function Q_j(s, C_j) is equivalent to the PSM kernel shown in Figure 2D. Pseudo code for this function is given as

    Q(s, C) :=
        nCDPs ← rows(C)
        for r ∈ 1 .. nCDPs
            return C_{r,2}                                                          if s = C_{r,1}
            return C_{r−1,2} + ((s − C_{r−1,1}) / (C_{r,1} − C_{r−1,1})) · (C_{r,2} − C_{r−1,2})    if C_{r,1} > s

(Code 2)

The line-spread function, σ_j(s, C_j), describes how to broaden the PSM kernel as a function of s (Figure 2E), and its pseudo code is given as

    σ(s, C) :=
        nCDPs ← rows(C)
        for r ∈ 1 .. nCDPs
            return C_{r,3}                                                          if s = C_{r,1}
            return C_{r−1,3} + ((s − C_{r−1,1}) / (C_{r,1} − C_{r−1,1})) · (C_{r,3} − C_{r−1,3})    if C_{r,1} > s

(Code 3)
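Codes 2 and 3 are simply piecewise-linear interpolations of the second and third CDP columns. A compact Python equivalent, together with the normal-density evaluation that appears below in Eq. (7), is sketched here; the function and variable names are our own, and the example matrix is the step-down EP described in the worked example (intensity 60 to 30 between s = 20 and s = 40, line-spread 5).

    import numpy as np

    def interp_cdp(s, C, col):
        """Piecewise-linear interpolation of CDP column `col` (2 = intensity, 3 = line-spread)
        at progression state s.  Rows of C are (state, intensity, line-spread)."""
        C = np.asarray(C, dtype=float)
        return float(np.interp(s, C[:, 0], C[:, col - 1]))

    def q(s, C):      # Code 2: modeled intensity at state s
        return interp_cdp(s, C, col=2)

    def sigma(s, C):  # Code 3: line-spread (standard deviation) at state s
        return interp_cdp(s, C, col=3)

    def ep(s, x, C):  # Eq. (7): broadened Q function evaluated at (s, x)
        mu, sd = q(s, C), sigma(s, C)
        return np.exp(-0.5 * ((x - mu) / sd) ** 2) / (sd * np.sqrt(2.0 * np.pi))

    C2 = [[0, 60, 5], [20, 60, 5], [40, 30, 5], [100, 30, 5]]
    print(q(30, C2), sigma(30, C2))     # 45.0 5.0, midway through the step-down transition
    print(round(ep(30, 45.0, C2), 4))   # density peaks on the modeled intensity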


Figure 2. Frequency-based versus probability-based modeling. In traditional modeling packages, a frequency-based kernel is defined that is assumed to represent the parent distribution without error. In Panel A, a frequency-based kernel is shown that represents two populations of events similar to those depicted in Figure 1. The lower-intensity population is given a green color and the high-intensity population, a blue color. The heights and positions are initially in arbitrary positions. As shown in Panel B, the kernel is then broadened to simulate the error in the measurement and the heterogeneity of the population's intensities. Broadening can be either a brute force method or can be a closed-form function, that is, the mathematical convolution with an error distribution such as the normal distribution. Minimization methods such as Levenberg–Marquardt can find the "best" model parameters given a suitable response function such as the summation of χ²s between the model and observed histogram data. Panel C shows the result of this minimization method applied to the data generated in Figure 1. PSM can be divided into similar modeling steps. PSM begins with a different but largely equivalent kernel shown in Panel D. In this case, the measurement, x, is on the y-axis and cumulative percent is on the x-axis. When the kernel is presented in this manner, instead of two equal frequency spikes (Panel A), there are two equal width cumulative percent intervals (Panel D). Cumulative percent is just cumulative probability times 100; therefore, all the equivalencies shown in Figure 1 apply. A Q function can also be broadened to depict either error in the measurement or heterogeneity of the populations (Panel E). Minimization methods then find the optimum parameters of the model that best fit the data. Details of the response functions are described in the theoretical section. [Color figure can be viewed in the online issue, which is available at wileyonlinelibrary.com.]

An example of an EP is shown in Figure 2F. It is a surface probability distribution with progression, s, as the first independent variable and the transformed measurement value as the second independent variable. The returned probability for the ith event and jth-transformed measurement value and EP is given as

EP_j(s, X_{i,j}; C_j) = N(X_{i,j}; Q_j(s, C_j), σ_j(s, C_j)).

(7)

The EP function's probability is drawn from the normal distribution, N, where the mean and standard deviation are returned by the Q_j and σ_j functions. If an EP is consistent with the data, then the following equivalency will be true:

U_j = EP_j(s, X_{·,j}; C_j).

(8)

If all the jth-transformed values are evaluated by the jth EP, then the result will be a set of uniform random values [see Eq. (2) for an analogous relationship]. In addition, if all the jth EPs are consistent with all transformed measurements, then the following equivalency is also true:

U_s(0, s_max) = T(s, X_{·,·}; C).

(9)

If all the data are consistent with progression T, then it will return a set of uniform random values ranging between 0 and s_max. These equivalencies will be demonstrated below with a simple example.

Example

The matrices, C_j, are the parameters that define the model, not to be confused with the correlated measurements, which are often misnamed as "parameters." PSM refers to the rows of these matrices as CDPs. The following simple example demonstrates how these CDPs define the modulation of three correlated measurements:

C_1 = ( 0    70  10
        100  70  10 ),

C_2 = ( 0    60  5
        20   60  5
        40   30  5
        100  30  5 ),

C_3 = ( 0    40  5
        50   40  5
        55   70  7
        100  70  7 ).        (10)

The abovementioned example matrices have three columns of parameters. The first column represents the state of progression, the second, the transformed measurement intensity, and the third, the associated standard deviation of the transformed measurement uncertainty or line-spread. The first and last rows define the beginning and ending of a progression. The first matrix, C_1, represents modeling parameters associated with a "constant" EP. Constant EPs are generally used to select for events of interest. For the B-cell models presented in the companion article (4), SSC is a constant EP that refines the selection of CD19+ lymphocytes. The matrix, C_2, is an example of a "step-down" type of EP that represents the downregulation of a measurement from an intensity of 60 to 30, starting at s = 20 and ending at s = 40. The matrix, C_3, is an example of a "step-up" type of EP where the step up begins at s = 50, which is after the downregulation defined by C_2. Also note that these matrices define a line-spread as a function of s. The line-spread is used to model both measurement uncertainty and biologic diversity.

The functions Q_j(s, C_j) and σ_j(s, C_j) are instantiated internally as the matrices E_j during data synthesis and modeling. The integral binning equation that converts these functions to their corresponding w × r matrix elements is given as

(E_j)_{v,s} = (r/100) ∫_{100(v−0.5)/w}^{100(v+0.5)/w} ∫_{100(s−1)/r}^{100s/r} N(x; Q_j(s, C_j), σ_j(s, C_j)) ds dx,        (11)

where N(x; μ, σ) = (1/√(2πσ²)) e^{−(x−μ)²/(2σ²)};  v ∈ 1, 2, ..., w;  s ∈ 1, 2, ..., r.

Although Eq. (11) compactly shows how the matrices are produced, more efficient algorithms are normally required to minimize computation time. The rows of E are the levels of expression for the transformed measurement, and the columns are individual states or bins along the state-of-progression axis. Normally, there are r = 100 states and w = 100 intensity levels in a PSM model, but in some cases, it may be desirable to adjust the number of states. As the system is building the E matrix, it also builds a χ² matrix, χ², using similar binning equations. The χ² matrix is mainly used to select for the events of interest and to reduce the number of computations during the modeling process. The abovementioned binning equation assumes that the line-spread is normally distributed for transformed data.

In order to make this binning step as clear as possible, assume that the EP matrices only have five states (r = 5) and five levels of intensity (w = 5). The first column of these matrices represents the beginning of the progression, if there is one, and the last column, the end. The rows represent the five levels of intensity. With these constraints, the C matrices defined earlier become the following three matrices after binning:

E_1 = ( 0     0     0     0     0
        0.02  0.02  0.02  0.02  0.02
        0.48  0.48  0.48  0.48  0.48
        0.48  0.48  0.48  0.48  0.48
        0.02  0.02  0.02  0.02  0.02 ),

E_2 = ( 0     0.07  0.50  0.50  0.50
        0.02  0.60  0.50  0.50  0.50
        0.96  0.33  0     0     0
        0.02  0     0     0     0
        0     0     0     0     0 ),

E_3 = ( 0.02  0.02  0.01  0     0
        0.96  0.96  0.56  0     0
        0.02  0.02  0.28  0.50  0.50
        0     0     0.15  0.50  0.50
        0     0     0     0     0 ).        (12)

The first matrix shows the constant EP probability matrix after C_1 is appropriately binned using Eq. (11). Only two decimal digits are being shown above, which means that a zero may represent a value that is less than 0.01. The second matrix, derived from C_2, shows the step-down type of EP with a constant line-spread of 5. The third matrix, derived from C_3, shows the step-up type of EP with its variable line-spread. Note that the beginning intensity level is row 1, which vertically inverts the step-up and step-down matrix patterns. There are two different forms of these matrices depending on whether the model is synthesizing data or analyzing data. When synthesizing, the matrices are as shown earlier, where each column is a probability vector with elements summing to unity.
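For readers who want to reproduce matrices like those in Eq. (12), the sketch below approximates the double integral of Eq. (11) with the normal cumulative distribution over each intensity bin and a simple average over sub-sampled states. It is our own illustration (SciPy supplies the normal CDF), not the optimized GemStone implementation.

    import numpy as np
    from scipy.stats import norm

    def bin_ep(C, w=5, r=5, substates=50):
        """Approximate the binning integral of Eq. (11).

        Rows are intensity levels v (bin v spans 100(v-0.5)/w to 100(v+0.5)/w) and
        columns are states s (bin s spans 100(s-1)/r to 100s/r).  The integral over
        s is approximated by averaging over `substates` points.
        """
        C = np.asarray(C, dtype=float)
        E = np.zeros((w, r))
        for s in range(1, r + 1):
            ss = np.linspace(100.0 * (s - 1) / r, 100.0 * s / r, substates)
            mu = np.interp(ss, C[:, 0], C[:, 1])       # Q_j(s, C_j)
            sd = np.interp(ss, C[:, 0], C[:, 2])       # sigma_j(s, C_j)
            for v in range(1, w + 1):
                lo, hi = 100.0 * (v - 0.5) / w, 100.0 * (v + 0.5) / w
                E[v - 1, s - 1] = np.mean(norm.cdf(hi, mu, sd) - norm.cdf(lo, mu, sd))
        return E

    # Step-down EP from Eq. (10); the output approximates the second matrix of Eq. (12).
    C2 = [[0, 60, 5], [20, 60, 5], [40, 30, 5], [100, 30, 5]]
    print(np.round(bin_ep(C2), 2))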

Synthesizing Data

The procedure for synthesizing events from a probability state model is shown below:

S_i ∈ U(0, r),    i ∈ 1, 2, ..., n;

X_{i,j} = SS((E_j)_{·,⌊S_i⌋}),    j ∈ 1, 2, ..., m.        (13)

Once an integer state is drawn from the uniform random variable, the synthesized correlated measurement values for the event are stochastically selected using the appropriate column E vector. If these stochastic steps are repeated n = 500 times for the example model, a list of m = 3 correlated event values that are consistent with the model is generated.
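A self-contained Python sketch of this synthesis step follows. The 5 × 5 step-down matrix of Eq. (12) is typed in directly, and np.random.choice plays the role of the SS function (without the graphics dither for the level draw); everything else is illustrative.

    import numpy as np

    rng = np.random.default_rng(1)

    # Step-down EP probability matrix from Eq. (12): rows = intensity levels, cols = states.
    E2 = np.array([[0.00, 0.07, 0.50, 0.50, 0.50],
                   [0.02, 0.60, 0.50, 0.50, 0.50],
                   [0.96, 0.33, 0.00, 0.00, 0.00],
                   [0.02, 0.00, 0.00, 0.00, 0.00],
                   [0.00, 0.00, 0.00, 0.00, 0.00]])
    E2 /= E2.sum(axis=0)            # guard against rounding so every column sums to exactly 1

    n, r = 500, E2.shape[1]
    S = rng.uniform(0, r, size=n)                   # Eq. (13): S_i drawn from U(0, r)
    levels = np.arange(1, E2.shape[0] + 1)          # 1-based intensity levels
    X2 = np.array([rng.choice(levels, p=E2[:, int(np.floor(s))]) + rng.random() for s in S])

    # State occupancy should be roughly uniform, about n/r events per state (cf. Eq. (17)).
    print(np.bincount(np.floor(S).astype(int), minlength=r))
    print(np.round(X2[:5], 2))                      # first few synthesized, dithered values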

X = ( 4.35   3.174  2.304
      5.119  2.532  2.166
      4.52   2.956  2.462
      ...    ...    ...   ),        S^T = ( 1.193  1.147  3.057  ... ).        (14)

Plotting S versus one of the columns of X results in a dot plot similar to those found in Figures 4 and 7 of the companion B-cell article (4). In addition, the following important frequency distributions were generated from this process:

ES_1 = ( 0   0   0   0   0
         3   3   2   1   1
         50  36  55  48  52
         36  46  54  43  59
         3   0   3   1   4 ),

ES_2 = ( 0   3   48  45  51
         2   54  66  48  65
         86  28  0   0   0
         4   0   0   0   0
         0   0   0   0   0 ),

ES_3 = ( 6   5   1   0   0
         82  76  62  0   0
         4   4   33  47  61
         0   0   18  46  53
         0   0   0   0   2 ).        (15)

These summation matrices are formed by calculating the frequency distributions from all the selected columns and rows for each of the three example E matrices. For this example, there are four important marginal frequency vectors: three frequency row projections, Y_j, and the single column or state frequency projection, F:

Y_1^T = ( 0  10  241  238  11 ),   Y_2^T = ( 147  235  114  4  0 ),   Y_3^T = ( 12  220  149  117  2 ),
F^T = ( 92  85  114  93  116 ).        (16)

Each of these matrix and vector summations can be reexpressed as expectation values and Poisson distributions that reflect simple counting error,

(ES̄_j)_{v,s} = (n/r) (E_j)_{v,s},    (Ȳ_j)_v = (n/r) Σ_{s=1}^{r} (E_j)_{v,s},    F̄_s = n/r;

(ES_j)_{v,s} = Pois((ES̄_j)_{v,s}),    (Y_j)_v = Pois((Ȳ_j)_v),    F_s = Pois(F̄_s).        (17)

The most important expectation and distribution is the last pair, F̄ and F, as will be commented on later. If n = 500 events were synthesized for this example, our expectation is that on the average n/r or 100 events will be in each element of F and they will be Poisson distributed.

Analyzing Data

Given the above-synthesized X data, is it possible to reverse this selection process and appropriately distribute the events in the ES_j matrices as well as in the Y_j and F vectors? This process is best understood by first renormalizing the model's E matrices as shown below:

E′_1 = ( 0.2  0.2  0.2  0.2  0.2
         0.2  0.2  0.2  0.2  0.2
         0.2  0.2  0.2  0.2  0.2
         0.2  0.2  0.2  0.2  0.2
         0.2  0.2  0.2  0.2  0.2 ),

E′_2 = ( 0     0.04  0.32  0.32  0.32
         0.01  0.27  0.24  0.24  0.24
         0.74  0.26  0     0     0
         0.94  0.06  0     0     0
         0.97  0.03  0     0     0 ),

E′_3 = ( 0.4   0.4   0.2   0     0
         0.39  0.39  0.22  0     0
         0.02  0.02  0.20  0.38  0.38
         0     0     0.12  0.44  0.44
         0     0     0.12  0.44  0.44 ).        (18)

Each row is normalized such that if there were any positive probabilities in a row, the sum of the row elements is unity. These matrices encode how to stochastically distribute each event's synthesized measurement into each state. As an example, the first event [Eq. (14)] has the following three synthesized measurement values:

X_{1,·} = ( 4.35  3.174  2.304 ).

(19)

These values map to the probability vectors in the fourth row in E′_1, the third row in E′_2, and the second row in E′_3. The three selected probability vectors for each measurement are shown below:

W_1^T = ( 0.20  0.20  0.20  0.20  0.20 ),
W_2^T = ( 0.74  0.26  0.00  0.00  0.00 ),
W_3^T = ( 0.39  0.39  0.22  0.00  0.00 ).        (20)

Since the stochastic selections for each measurement were independent of each other, the three probability vectors shown above need to be multiplied element-by-element to obtain the final weight vector:

( W_1 ∘ W_2 ∘ W_3 )^T = Ξ^T = ( 0.0578  0.0203  0  0  0 ).        (21)

This weight vector can then be evaluated by the SS function to select for the event's assigned state. Before showing the results of these selections, there is an important technical issue to fix with the above procedure. If E′ is defined with row probability vectors as shown earlier, then the multiplications of the weight vectors shown in Eq. (20) can result in the element values in Eq. (21) becoming too small to be stored in digital form when numerous EPs are present in the model. To avoid this issue, each row in the E′ matrices is normalized such that the maximum element is unity. The example matrices shown earlier are actually encoded as shown below:

E′_1 = ( 1  1  1  1  1
         1  1  1  1  1
         1  1  1  1  1
         1  1  1  1  1
         1  1  1  1  1 ),

E′_2 = ( 0     0.13  1     1     1
         0.04  1     0.84  0.84  0.84
         1     0.35  0     0     0
         1     0.06  0     0     0
         1     0.03  0     0     0 ),

E′_3 = ( 1     1     0.52  0     0
         1     1     0.59  0     0
         0.05  0.05  0.56  1     1
         0     0     0.29  1     1
         0     0     0.27  1     1 ).        (22)

The entire analysis process for n events and m correlated measurements is summarized as

(E′_j)_{v,s} = { (E_j)_{v,s} / max((E_j)_{v,·})    for max((E_j)_{v,·}) > 0
               { 0                                  otherwise,

Ξ_i = ∏_j (E′_j)_{⌊X_{i,j}⌋,·}  (element-by-element product over the m measurements),

S_i = SS(Ξ_i),

v ∈ 1, 2, ..., w;  s ∈ 1, 2, ..., r;  j ∈ 1, 2, ..., m;  i ∈ 1, 2, ..., n.        (23)
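The complete classification step of Eq. (23) — max-normalize each E matrix by row, look up the row for each event's binned measurement, multiply the rows element-by-element, and stochastically select a state from the resulting weight vector — can be sketched as follows. np.random.choice again stands in for SS, the function names are ours, and the matrices and the example event come from the worked example.

    import numpy as np

    rng = np.random.default_rng(3)

    def max_normalize_rows(E):
        """Eq. (23): scale each row of E so its largest element is 1 (all-zero rows stay zero)."""
        E = np.asarray(E, dtype=float)
        m = E.max(axis=1, keepdims=True)
        return np.divide(E, m, out=np.zeros_like(E), where=m > 0)

    def classify_event(x, E_list):
        """Assign a 1-based state to one event with binned measurement values x (one per EP)."""
        weights = np.ones(E_list[0].shape[1])
        for x_j, E in zip(x, E_list):
            weights *= max_normalize_rows(E)[int(np.floor(x_j)) - 1]   # element-wise row product
        if weights.sum() == 0:
            return None                                                # unclassifiable event
        return rng.choice(np.arange(1, len(weights) + 1), p=weights / weights.sum())

    # Constant and step-down EP matrices from Eq. (12), 5 intensity levels x 5 states.
    E1 = np.tile([[0.0], [0.02], [0.48], [0.48], [0.02]], (1, 5))
    E2 = np.array([[0.00, 0.07, 0.50, 0.50, 0.50],
                   [0.02, 0.60, 0.50, 0.50, 0.50],
                   [0.96, 0.33, 0.00, 0.00, 0.00],
                   [0.02, 0.00, 0.00, 0.00, 0.00],
                   [0.00, 0.00, 0.00, 0.00, 0.00]])

    print(classify_event([4.35, 3.174], [E1, E2]))   # most likely lands in an early state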

Given the model matrices, E, and correlated data, X, the above equations ultimately categorize the events into specific states, S. When the general selection process shown in Eq. (23) was applied to the example model and data, the following important summation matrices and vectors were generated:

ES_1 = ( 0   0   0   0   0
         1   4   3   1   1
         46  40  46  57  52
         55  42  41  49  51
         1   3   3   2   2 ),

ES_2 = ( 15  12  28  41  51
         40  34  40  67  54
         45  43  24  1   1
         3   0   1   0   0
         0   0   0   0   0 ),

ES_3 = ( 6   4   2   0   0
         92  82  46  0   0
         5   3   35  54  52
         0   0   10  53  54
         0   0   0   2   0 ),

Y_1^T = ( 0  10  241  238  11 ),   Y_2^T = ( 147  235  114  4  0 ),   Y_3^T = ( 12  220  149  117  2 ),
F^T = ( 103  89  93  109  106 ).        (24)


What can be appreciated in Eqs. (24) is that this analysis procedure results in summation matrices and vectors that are also described by the mean and Poisson distributions shown in Eqs. (17). This result supports the conclusion that if a data set is consistent with the probability state model as described earlier, the distributions of the resultant summation equations will be described by Eqs. (17), and most importantly, the F vector will have a uniform frequency distribution within counting error. The analysis procedure discussed above can be summarized as

C, X → F̄, F;  ES̄_j, ES_j;  Ȳ_j, Y_j.        (25)

This equation states that given a model defined by C and correlated data, X, the set of summation matrices, ES̄_j, ES_j, and vectors, F̄, F, Ȳ_j, Y_j, can be calculated as described earlier. If the model is consistent with the data, their expectations and Poisson distributions will be described by Eqs. (17).

Probability of Exclusion

Another important aspect to this theory is how the system decides whether an event is to be included in the process or cell type, T(s, X_{i,·}; C). Since the example events, X, were generated from the model, all synthesized events are considered part of the process in the example. However, with cytometry data, only a subset of all events in a data file is normally considered for modeling. In the case of the B-cell models described in the companion article (4), only around 2% of all the sample events are B cells. Once the cell type within a model is defined, a critical χ² value, χ²_c, is evaluated from the following formulas:

χ²_c = F^{-1}(1 − p_ex, m),  where  F(χ², m) = γ(m/2, χ²/2) / Γ(m/2),        (26)

χ²: is some χ² value;
m: is the number of expression profiles in the process;
p_ex: is a defined probability of exclusion value; and
χ²_c: is a critical χ² value.

An event is deemed part of the model process for a specific cell type if it satisfies the following logical expression:

N_i = Σ_j (χ²_j)_{⌊X_{i,j}⌋,s},        N_i ≤ χ²_c.        (27)

The summation of the χ²s, N_i, for an event is a kind of statistical distance metric that expresses how far an event is from the model's m-dimensional probability distribution. It can also be used to detect very low-frequency aberrant B cells that are present in bone marrow samples, which has merit for minimum residual disease detection.

Multiple Cell Types

Each kth cell type in a model generates a χ² summation value, N_{k,i}, for each ith event. The formulas below show how PSM chooses a cell type to associate with an event:

p_{k,i} = { ∫_{N_{k,i}}^{∞} f(x; m_k) dx    for N_{k,i} ≤ χ²_c
          { 0                               otherwise,

t_i = SS(p_{·,i}),        (28)

where f(x; m_k) is the χ² probability density with m_k degrees of freedom.

Once each cell type has assigned an event to a particular state [Eq. (23)], the system can then calculate a probability value associated with each cell type [Eq. (28)]. The probabilities are the upper tails of the m_k degree of freedom χ² distribution. The system then uses stochastic selection again to assign the event to a particular cell type. Therefore, there are two levels of stochastic selection involved in assigning an event to a specific cell type and progression location. The first level defines a state location along the progression axis for each cell type and the second level defines a specific cell type. If an event has zeros for all the probability members, the event is deemed unclassified and is ignored by the modeling system.

Model Objective Functions

Equations (17) lead to the following objective functions that quantify how close a model is to describing observed data:

H_1(ES, ES̄) = (1/(mwr)) Σ_j Σ_v Σ_s ((ES_j)_{v,s} − (ES̄_j)_{v,s})² / Min(1, (ES̄_j)_{v,s}),

H_2(Y, Ȳ) = (1/(mw)) Σ_j Σ_v ((Y_j)_v − (Ȳ_j)_v)² / Min(1, (Ȳ_j)_v),

H_3(F, F̄) = (1/r) Σ_s (F_s − F̄_s)² / Min(1, F̄_s).        (29)
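Returning to the exclusion and cell-type assignment steps of Eqs. (26)–(28), these reduce to standard χ² tail computations. The sketch below uses scipy.stats.chi2 for the critical value and the upper-tail probabilities; p_ex, the per-event distances, and the cell-type count are illustrative numbers, not values from the paper.

    import numpy as np
    from scipy.stats import chi2

    p_ex = 0.001                  # illustrative probability of exclusion
    m = 10                        # number of expression profiles in the cell type
    chi2_crit = chi2.ppf(1.0 - p_ex, df=m)          # Eq. (26): critical chi-square value

    # Eq. (27): an event with summed chi-square distance N_i is kept if N_i <= chi2_crit.
    N_i = np.array([6.2, 18.9, 41.3])               # hypothetical per-event distances
    print(round(chi2_crit, 2), N_i <= chi2_crit)

    # Eq. (28): upper-tail probabilities per candidate cell type, then stochastic selection.
    rng = np.random.default_rng(4)
    N_ki = np.array([8.5, 14.0])                    # event's distance to two cell types (m_k = 10 each)
    p_ki = np.where(N_ki <= chi2_crit, chi2.sf(N_ki, df=m), 0.0)
    cell_type = rng.choice(len(p_ki), p=p_ki / p_ki.sum()) if p_ki.sum() > 0 else None
    print(np.round(p_ki, 3), "assigned cell type index:", cell_type)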
