Statistic Corner
A practical overview on probability distributions
Andrea Viti1, Alberto Terzi2, Luca Bertolaccini2
1Thoracic Surgery Unit, S. Croce e Carle Hospital, Cuneo, Italy; 2Thoracic Surgery Unit, Sacro Cuore Research Hospital, Negrar Verona, Italy
Correspondence to: Andrea Viti, MD, PhD. Thoracic Surgery Unit, S. Croce e Carle Hospital, Via Michele Coppino 26, 12100 Cuneo, Italy. Email: [email protected].
Abstract: The aim of this paper is to give a general definition of probability, of its main mathematical features, and of the features it presents under particular circumstances. The behavior of probability is linked to the features of the phenomenon we would like to predict. This link can be defined as a probability distribution. Given the characteristics of the phenomena (which we can also define as variables), there are defined probability distributions. For categorical (or discrete) variables, the probability can be described by a binomial or Poisson distribution in the majority of cases. For continuous variables, the probability can be described by the most important distribution in statistics, the normal distribution. Distributions of probability are briefly described together with some examples of their possible application.

Keywords: Probability distributions; discrete variables; continuous variables

Submitted Nov 09, 2014. Accepted for publication Dec 17, 2014.
doi: 10.3978/j.issn.2072-1439.2015.01.37
View this article at: http://dx.doi.org/10.3978/j.issn.2072-1439.2015.01.37
A short definition of probability
We can define the probability of a given event by evaluating, in previous observations, the incidence of the same event under circumstances that are as similar as possible to the circumstances we are observing [this is the frequentist definition of probability, based on the relative frequency of an event observed in previous circumstances (1)]. In other words, probability describes the possibility of an event to occur given a series of circumstances (or under a series of pre-event factors). It is a form of inference, a way to predict what may happen, based on what happened before under the same (never exactly the same) circumstances. Probability can vary from 0 (our expected event was never observed, and should never happen) to 1 (or 100%, the event is almost sure). It is described by the following formula, where P(X = x) is the probability of a given event x (Eq. [1]):

$$\sum_{x} P(X = x) = P(E_1) + P(E_2) + \dots + P(E_n) = 1 \qquad [1]$$

This is one of the three axioms of probability, as described by Kolmogorov (2):
(I) If, under some circumstances, a given number of mutually exclusive events (E) could occur (E1, E2, E3, …, En), the probability (P) of any E is never less than zero;
(II) The sum of the probabilities of E, P(E1) + P(E2) + … + P(En), is 100%;
(III) If E1 and E3 are two of these mutually exclusive events, the probability that one or the other could happen, P(E1 or E3), is equal to the sum of the probability of E1 and the probability of E3 (Eq. [2]):

$$P(E_1 \text{ or } E_3) = P(E_1) + P(E_3) \qquad [2]$$

Probability could be described by a formula or a graph in which each event is linked to its probability. This kind of description of probability is called a probability distribution.

Binomial distribution

A classic example of a probability distribution is the binomial distribution. It represents the probability when only two mutually exclusive events may happen. The typical example is the toss of a coin: only two results are possible and, in this case, the probability is 50% for each of them. However, the binomial distribution may also describe two events that are mutually exclusive but not equally likely (for instance, that a newborn baby will be left-handed or right-handed). The probability that a given number of individuals present a given characteristic, p, that is mutually exclusive of another one, called q, depends on the number of possible combinations of x individuals within the population, called C.
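As a rough illustration of the frequentist idea used in this article, the relative frequency of an event over repeated two-outcome trials can be simulated. The sketch below is not part of the original analysis: the function name `relative_frequency`, the fixed seed, and the probabilities 0.5 (a fair coin) and 0.3 (an assumed, unequal two-outcome event) are illustrative choices.

```python
import random


def relative_frequency(p_event, n_trials, seed=0):
    """Simulate n_trials two-outcome trials; return the observed share of events."""
    rng = random.Random(seed)  # fixed seed so that the sketch is reproducible
    hits = sum(1 for _ in range(n_trials) if rng.random() < p_event)
    return hits / n_trials


if __name__ == "__main__":
    # 0.5 mimics a fair coin; 0.3 is an assumed probability for an unequal event.
    for p_event in (0.5, 0.3):
        for n_trials in (100, 10_000, 100_000):
            freq = relative_frequency(p_event, n_trials)
            print(f"p={p_event}  n={n_trials:>6}  relative frequency={freq:.4f}")
```

With more trials the observed frequency tends to settle near the underlying probability, which is what the frequentist definition relies on.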
© Journal of Thoracic Disease. All rights reserved.
www.jthoracdis.com
J Thorac Dis 2015;7(3):E7-E10
Viti et al. Probability distributions
If my population is composed of five individuals, each of whom can be p or q, there are ten possible combinations of, for instance, three individuals with p:

pppqq, ppqpq, ppqqp, pqppq, pqpqp, pqqpp, qpppq, qppqp, qpqpp, qqppp

The probability of any one of these arrangements is (Eq. [3]):

$$P(p, p, p, q, q) = pppqq = p^3 q^2 \qquad [3]$$
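The count of ten arrangements can be checked by brute force. The following minimal sketch lists every sequence of five individuals containing exactly three p and adds up p^3 q^2 over them. The helper names and the value p = 0.3 are assumptions made only for illustration.

```python
from itertools import product
from math import comb


def arrangements(n, x):
    """All sequences of length n over {'p', 'q'} that contain exactly x 'p'."""
    return ["".join(s) for s in product("pq", repeat=n) if s.count("p") == x]


def prob_exactly(n, x, p):
    """Brute force: add p^x * q^(n-x) once for each arrangement (q = 1 - p)."""
    q = 1.0 - p
    return sum(p ** x * q ** (n - x) for _ in arrangements(n, x))


if __name__ == "__main__":
    seqs = arrangements(5, 3)
    print(len(seqs), seqs)          # number of arrangements and the list itself
    print(comb(5, 3))               # the same count from the factorial formula
    print(prob_exactly(5, 3, 0.3))  # p = 0.3 is an assumed illustrative value
```

Listing the sequences is only practical for small n; the closed formula for the number of combinations avoids the enumeration.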
Then p^3 q^2 will be multiplied by the number of combinations (ten).

If, in an experimental population, I had a large number of individuals (n), the number of combinations of x individuals within the population would be (Eq. [4]):

$${}_nC_x = \frac{n!}{x!\,(n-x)!} \qquad [4]$$

Therefore, the probability that a group of x individuals within the population of n individuals presents the characteristic p, which excludes q, will be described by the following formula (Eq. [5]):

$$f(x) = {}_nC_x\, p^x q^{n-x} \qquad [5]$$

which describes the binomial distribution. It follows Kolmogorov's rules (Eq. [6]):

$$\sum f(x) = 1 \qquad [6]$$

In a given population, 30% of the people are left-handed. If we select ten individuals from this population, what is the probability that four out of ten individuals are left-handed?

We can apply the binomial distribution, since we suppose that a person may be either left-handed or right-handed. So we can use our formula (Eq. [7]):

$$f(4) = {}_{10}C_4 \cdot 0.3^4 \cdot 0.7^6 = 0.2001 \qquad [7]$$

Poisson distribution

Another important distribution of probability is the Poisson distribution. It is useful to describe the probability that a given event can happen within a given period (for instance, how many thoracic traumas could need the involvement of the thoracic surgeon in a day, or a week, etc.). The events that may be described by this distribution have the following characteristics:
(I) The events are independent from one another;
(II) Within a given interval the event may present from 0 to infinite times;
(III) The probability of an event to happen increases when the period of observation is longer.

To predict the probability, I must know how the events behave (this data comes from previous, or historical, observations of the same event before the time I am trying to perform my analysis). This parameter, which is the mean number of events in a given interval as derived from previous observations, is called λ.

The Poisson distribution follows this formula (Eq. [8]):

$$f(x) = P(X = x) = \frac{e^{-\lambda} \lambda^x}{x!} \qquad [8]$$

where the number e is an important mathematical constant that is the base of the natural logarithm. It is approximately equal to 2.71828.

For example, the distribution of major thoracic traumas needing intensive care unit (ICU) recovery during a month in the last three years in a Third Level Trauma Center follows a Poisson distribution, where λ = 2.75. In a future period of one month, what is the probability to have three patients with major thoracic trauma in ICU? (Eq. [9]):

$$f(3) = P(X = 3) = \frac{e^{-2.75} \cdot 2.75^3}{3!} = 0.221 \qquad [9]$$

Therefore, the probability is 22.1%.

The distributions described so far refer only to discrete variables (that present a limited number of values within a given interval). However, in nature, many variables may present an infinite distribution of values within a given interval. These are called continuous variables (3).

Distributions of continuous variables

An example of a continuous variable is the systolic blood pressure. Within a given cohort, the systolic blood pressure can be presented as in Figure 1. The width of each bar of the histogram represents an interval of the measure of interest between two values on the x-axis, while the height of the bar represents the number of measured values within the interval. When the number of observations becomes very large (tends to infinity) and the width of the bars becomes narrower (tends to 0), the above representation becomes more similar to a curved line (Figure 2). This curve describes the distribution of probability, f (the density of probability), for any given value of x, the continuous variable. The area under the curve is equal to 1 (100% of probability). If we now assume that the value of our continuous variable X depends on a very large number of other factors (in many cases beyond our possibility of direct analysis), the probability distribution of X becomes similar to a particular form of distribution, called the normal distribution or Gauss distribution.
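The statement that a variable driven by many independent factors tends toward a normal shape can be illustrated with a small simulation. This sketch rests on assumptions not found in the article (30 uniform random "factors" per individual, 20,000 simulated individuals, a fixed seed): each simulated value is the sum of its factors, and we check what share of the values lies within one and two standard deviations of the mean. For a normal curve these shares are about 68% and 95%.

```python
import random
import statistics


def simulate_sums(n_factors, n_samples, seed=0):
    """Each sample is the sum of n_factors independent uniform(0, 1) 'factors'."""
    rng = random.Random(seed)  # fixed seed so that the sketch is reproducible
    return [sum(rng.random() for _ in range(n_factors)) for _ in range(n_samples)]


def share_within(values, k):
    """Fraction of values lying within k standard deviations of their mean."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return sum(1 for v in values if abs(v - mean) <= k * sd) / len(values)


if __name__ == "__main__":
    sums = simulate_sums(n_factors=30, n_samples=20_000)
    print(f"within 1 sd: {share_within(sums, 1):.3f}")
    print(f"within 2 sd: {share_within(sums, 2):.3f}")
```

The uniform factors are an arbitrary choice; other factor distributions with finite variance would be expected to behave similarly once enough of them are summed.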
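For readers who wish to check the two discrete worked examples given earlier (Eqs. [7] and [9]), the binomial and Poisson formulas (Eqs. [5] and [8]) translate directly into code. This is only a minimal sketch, and the function names are illustrative.

```python
from math import comb, exp, factorial


def binomial_pmf(x, n, p):
    """Eq. [5]: nCx * p^x * q^(n-x), with q = 1 - p."""
    return comb(n, x) * p ** x * (1.0 - p) ** (n - x)


def poisson_pmf(x, lam):
    """Eq. [8]: e^(-lambda) * lambda^x / x!"""
    return exp(-lam) * lam ** x / factorial(x)


if __name__ == "__main__":
    # Four left-handed individuals out of ten, with 30% left-handed overall.
    print(f"{binomial_pmf(4, 10, 0.3):.4f}")
    # Three ICU admissions in a month, with a historical mean of 2.75 per month.
    print(f"{poisson_pmf(3, 2.75):.4f}")
    # Eq. [6]: the binomial probabilities over x = 0..10 should add up to 1.
    print(sum(binomial_pmf(x, 10, 0.3) for x in range(11)))
```

The binomial value agrees with the 0.2001 of Eq. [7]. The Poisson value comes out near 0.2216, which the text reports as 0.221.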
Journal of Thoracic Disease, Vol 7, No 3 March 2015
Figure 1 Graphical description of the distribution of systolic blood pressure in a given population (x-axis: systolic blood pressure, mmHg; y-axis: number of patients).

Figure 2 Graphical description of the normal distribution (x-axis: systolic blood pressure, mmHg; y-axis: number of patients).

The aforementioned concept (a variable influenced by a very large number of factors tends to follow the normal, or Gauss, distribution) is the famous Central Limit Theorem. The normal distribution represents a very important distribution of probability because f, that is the distribution of probability of our variable, can be represented by only two parameters:
• µ = mean;
• σ = standard deviation.
The mean is a so-called measure of central tendency (it represents the most central value of our curve), while the standard deviation represents how dispersed the values are around the central value (it is a measure of dispersion).

The main characteristics of this distribution are:
(I) It is symmetric around µ;
(II) The area under the curve is 1;
(III) The area under the curve between µ ± σ covers about 68% of all the possible values of X, while the area between µ ± 2σ covers about 95% of all the values.

The two parameters of the distribution are linked in the formula (Eq. [10]):

$$f(x) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{(x-\mu)^2}{2\sigma^2}}, \quad -\infty < x < \infty \qquad [10]$$

For µ = 0 and σ = 1, the curve is called the standardized normal distribution. All the possible normal distributions of x may be "normalized" by defining a derived variable called z (Eq. [11]):

$$z = \frac{x-\mu}{\sigma}, \qquad f(z) = \frac{1}{\sqrt{2\pi}}\, e^{-\frac{z^2}{2}}, \quad -\infty < z < \infty \qquad [11]$$

To calculate the probability that our variable falls within a given interval, for instance between z0 and z1, we should calculate the following definite integral (Eq. [12]):

$$\int_{z_0}^{z_1} \frac{1}{\sqrt{2\pi}}\, e^{-\frac{z^2}{2}}\, dz \qquad [12]$$

Fortunately, for the standardized normal distribution every possible interval has been tabulated.

In a given population of adult men, the mean weight is 70 kg, with a standard deviation of 3 kg. What is the probability that a randomly selected individual from this population would have a weight of 65 kg or less?

To "normalize" our distribution, we should calculate the value of z (Eq. [13]):

$$z = \frac{65-70}{3} = -1.67 \qquad [13]$$

Then, we should calculate the area under the curve (Eq. [14]):

$$P(z \le -1.67) = \int_{-\infty}^{-1.67} \frac{1}{\sqrt{2\pi}}\, e^{-\frac{z^2}{2}}\, dz \qquad [14]$$

The value of our interval has been already calculated and tabulated [the tables can be easily found in any text of statistics or on the web (4)]. Our probability is 0.0475 (4.75%). We may also calculate the probability to find, within the same population, someone whose weight is between 65 and 74 kg. This probability can be seen as the difference between the probability of a weight of 74 kg or less and the probability of a weight of 65 kg or less.
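The weight example can be reproduced without a printed table. The sketch below assumes that the cumulative probability of the standardized normal distribution is obtained from the error function (`math.erf`), and it rounds z to two decimals to mimic a table lookup. The function names are illustrative.

```python
from math import erf, sqrt


def standard_normal_cdf(z):
    """P(Z <= z) for the standardized normal distribution, via the error function."""
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))


def z_score(x, mu, sigma, digits=2):
    """Eq. [11], rounded to `digits` decimals to mimic reading a printed z table."""
    return round((x - mu) / sigma, digits)


if __name__ == "__main__":
    mu, sigma = 70.0, 3.0  # mean and standard deviation of weight, in kg
    z_low = z_score(65.0, mu, sigma)
    z_high = z_score(74.0, mu, sigma)
    p_low = standard_normal_cdf(z_low)
    p_between = standard_normal_cdf(z_high) - p_low
    print(f"z = {z_low}, P(X <= 65 kg) = {p_low:.4f}")
    print(f"P(65 kg <= X <= 74 kg) = {p_between:.4f}")
```

Run as a script, this should give a value close to the tabulated 0.0475 for 65 kg or less, and about 0.86 for the interval from 65 to 74 kg. Without rounding z, the first probability comes out slightly larger (about 0.048), which reflects the two-decimal precision of the table.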
Written in terms of z, this difference is (Eq. [15]):

$$P(65 \le x \le 74) = P\left(\frac{65-70}{3} \le z \le \frac{74-70}{3}\right) = P(-1.67 \le z \le 1.33) = P(z \le 1.33) - P(z \le -1.67) \qquad [15]$$

We already know that (Eq. [16]):

$$P(z \le -1.67) = 0.0475 \qquad [16]$$

In the table we can also find the value for (Eq. [17]):

$$P(z \le 1.33) = 0.9082 \qquad [17]$$

Our probability is (Eq. [18]):

$$P = 0.9082 - 0.0475 = 0.8607 = 86.07\% \qquad [18]$$

Conclusions

The probability distributions are a common way to describe, and possibly predict, the probability of an event. The main point is to define the character of the variables whose behaviour we are trying to describe through probability (discrete or continuous). The identification of the right category will allow a proper application of a model (for instance, the standardized normal distribution) that would easily predict the probability of a given event.

Acknowledgements

Disclosure: The authors declare no conflict of interest.

References

1. Daniel WW. eds. Biostatistics: a foundation for analysis in the health sciences. New York: John Wiley & Sons, 1995.
2. Kolmogorov AN. eds. Foundations of Theory of Probability. Oxford: Chelsea Publishing, 1950.
3. Lim E. Basic statistics (the fundamental concepts). J Thorac Dis 2014;6:1875-8.
4. Standard Normal Distribution Table. Available online: http://www.mathsisfun.com/data/standard-normal-distribution-table.html
Cite this article as: Viti A, Terzi A, Bertolaccini L. A practical overview on probability distributions. J Thorac Dis 2015;7(3):E7-E10. doi: 10.3978/j.issn.2072-1439.2015.01.37.