Statistic Corner

A practical overview on probability distributions Andrea Viti1, Alberto Terzi2, Luca Bertolaccini2 1

Thoracic Surgery Unit, S. Croce Carle Hospital, Cuneo, Italy; 2Thoracic Surgery Unit, Sacro Cuore Research Hospital, Negrar Verona, Italy

Correspondence to: Andrea Viti, MD, PhD. Thoracic Surgery Unit, S. Croce e Carle Hospital, Via Michele Coppino 26, 12100 Cuneo, Italy. Email: [email protected].

Abstract: Aim of this paper is a general definition of probability, of its main mathematical features and the features it presents under particular circumstances. The behavior of probability is linked to the features of the phenomenon we would predict. This link can be defined probability distribution. Given the characteristics of phenomena (that we can also define variables), there are defined probability distribution. For categorical (or discrete) variables, the probability can be described by a binomial or Poisson distribution in the majority of cases. For continuous variables, the probability can be described by the most important distribution in statistics, the normal distribution. Distributions of probability are briefly described together with some examples for their possible application. Keywords: Probability distributions; discrete variables; continuous variables Submitted Nov 09, 2014. Accepted for publication Dec 17, 2014. doi: 10.3978/j.issn.2072-1439.2015.01.37 View this article at: http://dx.doi.org/10.3978/j.issn.2072-1439.2015.01.37

A short definition of probability

(II) The sum of the probabilities of E = P(E1) + (E2) + … + P (En) is 100%; (III) If E1 and E3 are two possible events, the probability P  X x  E P  E   P  E    P  En   1 thatone or the other could happen P (E1 or E3) 1is equal 2to the sum of the probability of E1 and the probability of E3 (Eq. [2]):

We can define the probability of a given event by evaluating, in previous observations, the incidence of the same event under circumstances that are as similar as possible to the circumstances we are observing [this is the frequentistic definition of probability, and is based on the relative frequency of an observed event, observed in previous circumstances (1)]. In other words, probability describes the possibility of an event to occur given a series of circumstances (or under a series of pre-event factors). It is a form of inference, a way to predict what may happen, based on what happened before under the same (never exactly the same) circumstances. Probability can vary from 0 (our expected event was never observed, and should never happen) to 1 (or 100%, the event is almost sure). It is described by the following formula: if X = probability of a given x event (Eq. [1]):

P  E1 or E2  P  E1   P  E3 

P  p, p, p , q, q  pppqq  [2] p3q 2

Probability could be described by a formula, a graph, in which each event is linked to its probability. This kind of n! x n x f  x   nC nCx  xp q description of probability is called probability distribution. x! n  x !





Binomial f  x   1 distribution

 f  4  10  C4  0.3  0.7  0.2001 4

6

 1 A classic of probability distribution is the binomial  f  xexample distribution. It is the representation of the probability when only two events may happen, that are mutually exclusive.  x e  2.753 e2.75  f x 3    X You  can only 0.221 The typical example is when you toss P a coin. x! 3! have two results. In this case, the probability is 50% for both  E P  E1   P  E2  [1] P  Enevents. However, binomial distribution may describe also   P  X x 1 x two events that are mutually exclusive but are This is one of the three axioms of probability, as Z not equally  x     1 possible baby will be left- z described by Kolmogorov (2):  f  x  (for instance e 2 , that   xa newborn  P  E1 or E2  P  E1   P  E3  P  p, p, p , q, q  pppqq  p 3q 2 1 2  handed or right-handed). The probability that (I) If under some circumstances, a given number of  f x zindividuals e 2 ,   z    2 present a given characteristic, p, that is mutually exclusive events (E) could verify (E1, E2, E3, …, En), the probability (P) of another one, called q, depends on the possible number of any E is always more than zero; n! x n x 2

2

2

nCx 

x ! n  x !

f  x   nCx p q



z1

z0

© Journal of Thoracic Disease. All rights reserved.

f  x  1

 f  x  1

1 z2 e dz 2 2

www.jthoracdis.com

 f  4  10  C4  0.3  0.7  0.2001 4

z

65  70  1.67 3

J Thorac Dis 2015;7(3):E7-E10

6



1.67



1

e

1.672 2

P    z  1.67   0.0475

 P  X x

 E P  E1   P  E2  

1

E8

 P  En 

Viti et al. Probability distributions

P  E1 or E2  P  E1   P  E3 

of combinations of x individuals within the population, called C. If my population is composed of five5 individuals, that can be p or q, I have ten possible combinations of, for instance, three individuals with p is (Eq. [3]):  E P  E1   P  E2    P  En  pppqq, pqqpp, ppqpq, ppqqp, pqpqpq, qpppq, qpqpp, qppqp, qqppp

P  p, p, p , q, q  pppqq  p 3q 2

when the period of observation is longer. To predict the probability, I must know how the events behave (thisn !data comes from previous, or historical, f  x   nC p x q n x nCx  x ! nof  xthe observations ! same event before the time xI am trying to perform my analysis). This parameter, that is a mean of the events in a given interval, as derived from previous f  x  1 4 6 3 2 observations, is called λ.  f  4  10  C4  0.3  0.7  0.2001 P  E3  P  p, p, p , q, q  pppqq  pq [3] x  1  f Poisson The distribution follows the following formula 3 2 P X  x  1  E P E  P E   P E Then p q will be multiplied for the number of          11 2 n  (Eq. [8]): P X  x  E P E  P   1   E2    P  En    combinations (ten times). x n x  x e  2.753 e2.75 f  x   nCx p q f  x  P  X 3 [8] 0.221 If, in experimental population, I had a big number of  x! 3! P  E1 or E2  P  E1   P  E3  P  p, p, p , q, q  pppqq  p 3q 2 3 2 individuals (n), the number of combinations individuals P  E1 or E2  ofPx E  P E P p , p , p  , q , q pppqq  p q      1 3 where the number e is an important mathematical constant within the population will be (Eq. [4]): that is the base of the natural logarithm. It is approximately 4 6 x  f  4  10  Cn4! 0.3  0.7  0.2001 Z x n x  x    equal to 2.71828. f  x n! nCx p q nCx  x n  x  [4] ff xx  nC1x p qe 2 ,    x    E P xE!1nPxE ! 2    P  En  nCx   x ! n  x ! For example, the distribution of major thoracic traumas    E P  E1   P  E2    P  En  1 2z 2  P  X x 1 f  z  a month e ,   z  Therefore, the probability that a group of x individuals needing intensive care unit (ICU) recovery during 2 3 2.75 2.75 e 3 n within the presents4 the 6 in the last three years in a Third Level Trauma Center f  px,Xp , 1p P  E3  ,3q , q  pppqq  of p0.221 q 2 individuals  P population f  x  f 14  10 3 2 4 6  pppqq  0.7  p0.2001 3! P E or  E P E  P E P p , p , p  ,Cq4, q 0.3  qfollows         characteristic q, will be described by the Poisson 1 2 p, that 1 excludes 3  f  4  a 10  C4  0.3 distribution,  0.7  0.2001were λ=2.75. In a future  f  x  1 2 f x  1 z   1  z 65  70  following formula (Eq. [5]): period of one what is the probability to have three z 1.67 e month, dz  z 2major thoracic trauma in ICU? 3(Eq. [9]):  2 x n  x patients with f  x   nCxnp! xq  [5] x n  x 3 2.75 f  xx  nCx  Zx e   2.75 e   nCx p q  e 2.753 e2.75     f x P X 3     x ! n  x !      x   that describes distribution. It follows3!the 0.221 P  X 3  0.221 f  x  [9] 1.67 z x ! the 1binomial x! 3! 1.67 1  f  z rules (Eq. e 2 [6]): ,   z   2 Kolmogorow’s  2 e the probability is 22.1%.P    z  1.67   0.0475 Therefore, 6 PE   E P  E1   P  E242  n  ff  4x 10 C4  0.3  0.7  0.2001 1 6 x  4  f  4  10  x   distribution refers only to discrete  x    [6] 0.2001The binomial ZC4  0.3  0.7 Z  present a limited number of values within f x  1     x     1 variables (that 2     0.9082 1 ,   3x 2   f x65  70 e P    z  1.33 P  0.9082  0.0475  0.8607 86.07% zP p, p, p , q1.67  f  x e 2 , 1  xz   a given P  E3  , qpopulation, pppqq 30% p q of the interval). However, in nature, many variables may 2 In a given people are left-handed. 3 2 1 2z 2   f z e    z   ,   3 2.75 e ,    zof  values, within a given 2.75 e  z 2 2.75 present an finfinite   from P select 0.221 this population, what If we  X 3xten  2distribution e  individuals 2.75is3 ethe 3! 65  70 74  70  0.221 f  x  P  X 3    interval. P  65  These x  74   are P  called continuous z 1.33  P    z  1.33  P    1.67  z (3). probability that   Pvariables x ! four out of ten individuals are left handed? 3! 2

2

2

1

0

2

2

2

2

2

2

2

2

3

3

f  xcan  nC x p qthe binomial distribution, since we suppose We P z    0.0475 1 zapply z2 1.67 65  70 z 1z   z 2 65  70  1.67 e dz x   thatza person may be either left-handed or z  of 1.67continuous variables e 3dz x   2 Z 2 z 2 right-handed. Distributions 3 2 our Z  Se we can use formula (Eq. [7]):  x      x    1 z 241  An example of continuous variable is the systolic blood  e 26 0.2001 x   4x0.9082  082  Pff   0.0475 e,0.8607 10  0.3  0.7 f2C  z   ,  86.07%   4z  1.67 1 [7]2z 1.67 1 1.67 2  e ,  pressure.  z   Within a given cohort of systolic blood pressure 1.67 P1  2 zf  z1.67 2  0.0475  2 e 2  z  1.67as 0.0475 e  presented   in canPbe Figure 1. Each single histogram  2 74  70  z P  1.67  z  1.33  P    z  1.33  P    z  1.67  length represents an interval of the measure of interest   Poisson 3  65  distribution 70 2.753 e2.75 zPz 1X  two intervals on the x-axis, while the histogram  0.221 3 2 1.67  70  0.0475  0.8607 between P  65 0.9082 86.07%  3ez z1.33  0.9082 1.67 dz  distribution 3!0.9082 Pprobability z z  1.33 P  0.9082  0.0475 0.8607 of 86.07%    Pz  2important Another of is the Poisson height represents the number measured values within the 3 2 distribution. It is useful to describe the probability that a interval. When the number of observation becomes very 74 a70given 65  70 within  given event can happen period (for instance, x  P  65  x  74  P  z   P  1.67  z  1.33  P   z  1.33  P  (tends  z  1.67to 65  70 74  70     z   infinite) and the length of the histogram    large  3 0.0475 3 P 65  x  74   P  P1.67  1 zZ1.67 1.67     P  1.67  z  1.33  P    z  1.33  P    z  1.67  3 3 how many thoracic traumas could need the involvement    becomes narrower (tends to 0), the above representation P    z  1.67   0.0475 e 2   x    thoracic z 2 of the surgeon in a day, or a week, etc.). The becomes more similar to a curved line (Figure 2). This 1  f  z e 2 ,   z   events that may be 2described by this distribution have the curve describes the distribution of probability, f (density  082 P  0.9082  0.0475  0.8607 86.07%  following characteristics: probability) for any given value of x, the continuous P    z  1.33  0.9082 P  0.9082  0.0475  0.8607 of 86.07%  (I) The events are independent from one another; variable. The area under the curve is equal to 1 (100% 65  70 74  70  z  Within  agiven 1.67 interval the event may present from 0 (II) of probability). We can now assume that the value of our z z  1.33  P    z  1.33  P    z  1.67    P  1.67  3 74  70   65  70 3  to infinite P  65  x times; 74   P  z  P    z  1.67  variable X depends on a very large number   P  1.67  z  1.33  P    z  1.33continuous 3   3 of other factors (in many cases beyond our possibility of (III) The probability of an event to happen increases

082

x



n x

1



1

0

0

2

2

2

2

2

2

1

0

2

2

P    z  1.67   0.0475 © Journal of Thoracic Disease. All rights reserved.

P  0.9082  0.0475  0.8607 86.07% 

www.jthoracdis.com

J Thorac Dis 2015;7(3):E7-E10

P  E1 or E2  P  E1   P  E3 

Journal of Thoracic Disease, Vol 7, No 3 March 2015

 P  X x

Number of patients

60 50 40 30 20 10

n! x ! n  x !

 E Pdispersion).  E1   P  E2  

1

f  x   nCx p x q n x

E9

 P  En 

(I) The main characteristics of this distribution are: f  x  1 4 6  C4  0.3  0.7  P 0.2001 (II) is the µ;    E10 PIt X  symmetric x Ef  4P En   1 3 around  1   P  E2   2 f x  1    P  E1 or E2  P  E1   P  E3  P  p, p, p  , q, qThe  p q the curve is 1;  pppqq (III) area under If we consider the area under the curve between µ ± σ, Parea  P  pvalues , p, p , qof , q 2.75  p 3q 2 3 2.75  E1 orwill  P  E68% this of Eall X, pppqq while 1  P 3  the possible Ex2ecover e n! x n  x     f x 3 0.221 P X     f  x the  nC nCx  x p qbetween µ ± 2σ, it will cover 95% of all the values. area x! 3! x ! n  x ! The two parameters of the distribution are linked in the n! f  x   nCx p x q n x nCx (Eq. [10]): formula x ! n  x ! x f  x  1 Z 4 6   f  4  10  C4  0.3 10.7   x 0.2001   f  x e 2 ,    x   [10]  f  x  1 1 2z 2 f  x  1  f  z e 6 ,   z  4  C4  0.3 20.7  0.2001  4 10 Forf µ x= 0,1 and σ = 1, the curve is f called standardized 110 120 130 140 Systolic Blood  119 129 139 149  x e P x  E Pdistribution.  the  P possible En  2.75  1 (mmHg)  E1 3 eP2.75 E2  All    X Pressure normal normal distributions of 2

2

2

f  x 

x! Figure 1 Graphical description of the distribution of systolic blood pressure in a given population.

N. of pts 40

nCx 

P  p, p, p , q, q  pppqq  p 3q 2

 110 120 f  x130

P  E1 or E2  P  E1   P  E3   x   

2

1 2 2 ,   x   140e 150 n! 2  nCx  x ! n  x !

P

3  X  P  X

x  1 0.221

 E PE   PE  

 PE



1 2 n 3! x may by defining a derived called z be 1 “normalized” z2 65 variable 70 x   dz 3 2.75 z    1.67 e fz x[11]): e 2.75 e z. (Eq. , p   0.221 P  X3 3  P  p,2p , q2, q  pppqq  p 3q 2 3!  Ex2 ! P  E1   P  E3  P  E1xor P  p, p, p , q, q  pppqq  p 3q 2 Z P x  E P  E1   P  E2    P  En   X 1.67  1  1.67 [11] 1 z 2x n x 1    0.0475 P e    zZ x1.67 2 f zx  2nC  z 2nx !peq ,x   x n x  1

0

2

2

2

f  x   nCx p q nCx  1 2  fP xE 1 or x !E n2 xe!P 2E1 ,P E3x   P  p, p, p , q, q  pppqq 2 p3q 2 1 2z 2 To calculate the probability that our variable falls within  f  z e ,   z  P   z  1.33  instance 0.9082 z0 and z1P  0.9082  0.0475 2 0.8607 86.07% , we should calculate z1 1 fzx2   1 65a given 70 interval, for 4 6 30 z  10 Cn4definite 4x 1.67 !0.3  0.7  0.2001 fffollowing 1 z0 2 e 2 f dz the integral calculus (Eq. [12]): x n4x 6 3 x  1   nC q   0.7  0.2001 nC   f f 4 C4x p0.3   x  10 x x ! n  x !   f x  1 2   74   z  P  65  70  z  74  70   P  1.6765 z 701.33  P    z  1.33  P   P z165 1  xe z  1.67 [12] dz 3 3   z 0 3 2 2 2 20 3 2.75 1.67 1 1.672  x e  2.75 e P    Fortunately, zP  0.0475  0.221normalized distribution X f x1.67  2fe x   x ! 1x3e for the standard z 6 2.7534 of e2.75 3!   C 0.2001  10  0.221 f  x  Pf  4 X 3    4  0.3  0.7 2 every possible interval has been tabulated. 1.67 ! x 3! f x  1   1.67 1 10 P    1.67 e 2population of adult men,   0.0475 In the zmean weight  a given  2 x   P    z  1.33  0.9082 P  0.9082  0.0475  0.8607 86.07%   is 70 kg, with is the 2 Z x  a standard deviation of 3 kg. What  x    x 2.75  3 e2.75  e a xrandomly 1 2 2 Z 3from selected individual     f  x   that P X  f  x e 2 ,    x   probability    this  0.221  z2 1x ! 2 120.9082 2 Pf    zwould  1.33 P  0.9082  0.04753! 0.8607 ,  x  e    x    86.07% Systolic blood (mmHg) 2 2weight 70 74  70  population have a of 65 kg or less?  65Pressure f z z  1.33 e P   z ,  z 1.67  P  65  x  74   P  z  1 2z   P  1.67  z  1.33  P 2  2 3 3    f z e   z ,   To “normalize” our distribution, we should calculate the Figure 2 Graphical description of the normal distribution. 2 x value of z (Eq. [13]): 74  70   65  70 Z   P    z  1.33  P   P  65  x  74   P   x   2  z    P  1.67  z  1.33 2 32 3    1 z1 1 z 65  70 e 2 ,    x   z zf1  x1  z2 1.67 dz X becomes  [13]  z 2 z0 2 e 2 of 65  70 direct analysis), the probability distribution 3e 2dz z   z1.671 e 2 ,    z  f  z0 3 similar to a particular form of distribution, called normal 2 we2 should calculate the area under 2 Then, the curve

distribution or Gauss distribution. The aforementioned (Eq. [14]): 1.67 1 1.672 The normal concept is the famous Central Limit Theorem. P1.67 11 z 1.67 e z  2 distribution of z21.67   0.0475 65  70 distribution represents a very important Pz  z  1.67  0.0475 1.67  [14] z 22ee 22 dz 3 probability because f, that is the distribution of probability The value of our interval has been already calculated of our variables, can be represented by only two parameters: P    z  1.33  0.9082 P  0.9082  0.0475  0.8607 86.07%  andPtabulated [the tables can be easily any text 1.67 • µ = mean; P found 0.9082 in  0.0475  0.8607 86.07%  1.67 1z  1.33  0.9082 P    z   e 2in the web (4)]. Our probability   0.0475 of statistics is1.67 0.0475 σ = standard deviation.  2or • 74  70  (4.75%). We may also calculate the probability to find,  65  70 The mean is a so-called measure P  65of  xcentral  74   P tendency  z (it  z  1.33  P    z  1.67      P  1.67  z  1.33  P65  70 74  70  3   3 P  65 the x  74 z  z  1.33weight   P  population,   P   isz  1.33  P     P  1.67whose within same represents the more central value of our curve), while the 3 someone  3  P   z  1.33  0.9082 P  0.9082  0.0475  0.8607 86.07   standard deviation represents how dispersed are the values between 65 and 74 kg. This probability can be seen as the of probability around the central value (is a measure of difference of distribution between those whose weight is 74 kg 2

2

1

0

2

74  70   65  70 P  65  x  74   P  z   P  1.67  z  1.33  P    z  1.33  P   3   3

© Journal of Thoracic Disease. All rights reserved.

www.jthoracdis.com

J Thorac Dis 2015;7(3):E7-E10

3

P

 X x3!

1.67

P

1

 X 3e 2

3!

 0.221

1.672 3! 3 2.75 2.75 e 2

 0.221

P    z  1.67   0.0475

x Z x x     1 2 Z P    z  1.67 f  x  0.0475 e 2 ,    x   2 1 2z 2x     x    z2  f z e ,   z   E10 Viti et al. Probability distributions   1  0.9082 2 P   Zfz P  0.9082 2  0.0475  0.8607 86.07%   z 1.33   e ,    z   2   x    z2 1 or less and those whose weight is 65 kg or less: (Eq. [15]): category will allow a proper application of a model (for P  0.9082  0.0475  0.8607 f  2z  86.07% e 2 ,   z   z1 265  70 1 z 65  70 instance, the standardized normal distribution) that would 74  70z   1.67 P  xe70  74dz z  z  1.33  P    z  1.67   P zz06565   P 3 1.67  z  1.33  P   2 2  1.67 3 easily predict the probability of a given event. 3  3 1.67  z  1.33  P 65 70z  1.33  P    z  1.67  [15] z  21.67 We 1.67already 13 1.672know that (Eq. [16]): Acknowledgements P    z  1.67   0.0475 P  2ze −1.67 1.67   0.0475 [16] 

3!

2

In the table we can find also the value for (Eq. [17]):

082

082 z z

Disclosure: The authors declare no conflict of interest.

P    z  1.67   0.0475 P    z  1.33  0.9082 [17] P  0.9082  0.0475  0.8607 86.07%  References P  0.9082  0.0475  0.8607 86.07% 

Our probability is (Eq. [18]):

1. Daniel WW. eds. Biostatistics: a foundation for analysis in 700.8607 74  70   65    0.9082  [18] PP 65  x  74  0.0475 P  z  86.07%   P  1.67  z  1.33  P    z  1.33  P    z  1.67  74  70  3  z  1.333 P   z  1.67  P   the health sciences. New York: John Wiley & Sons, 1995.  P  1.67  z  1.33     3  74  70  P  1.67  z  1.33  P    z  1.33  P    z  1.67    Conclusions 3 

The probability distributions are a common way to describe, and possibly predict, the probability of an event. The main point is to define the character of the variables whose behaviour we are trying to describe, trough probability (discrete or continuous). The identification of the right

2. Kolmogorov AN. eds. Foundations of Theory of Probability. Oxford: Chelsea Publishing, 1950.

3. Lim E. Basic statistics (the fundamental concepts). J Thorac Dis 2014;6:1875-8. 4. Standard Normal Distribution Table. Available online: http://www.mathsisfun.com/data/standard-normaldistribution-table.html

Cite this article as: Viti A, Terzi A, Bertolaccini L. A practical overview on probability distributions. J Thorac Dis 2015;7(3):E7-E10. doi: 10.3978/j.issn.2072-1439.2015.01.37.

© Journal of Thoracic Disease. All rights reserved.

www.jthoracdis.com

J Thorac Dis 2015;7(3):E7-E10

A practical overview on probability distributions.

Aim of this paper is a general definition of probability, of its main mathematical features and the features it presents under particular circumstance...
571KB Sizes 1 Downloads 13 Views