Lecture notes: Probability & Statistics - Lecture 3: Numerical Summary - Bùi Dương Hải

Data Measurements
• Location:
   - Minimum, Maximum
   - Central Tendency: Mean, Median, Mode
   - Quantile: Quartile, Percentile
• Variability:
   - Range
   - Variance (Var)
   - Standard Deviation (SD)
   - Coefficient of Variation (CV)
   - Interquartile Range (IQR)

Lecture 3. NUMERICAL SUMMARY
• Data Measurements
• Locations
• Variability Measures
• Shape
References: [1] Chapter 3, pp. 99 - 162; [3] Chapter 2

Comparison
• Profit of two projects, A and B (percentage at each profit level)

Profit (million)     1     2     3     4     5     6
Project A           5%   10%   15%   20%   30%   20%
Project B          20%   30%   20%   15%   10%    5%

Comparison
• Profit of two projects, C and D (percentage at each profit level)

Profit (million)     1     2     3     4     5     6     7     8     9
Project C           2%    5%    8%   15%   20%   30%   20%    0%    0%
Project D           0%    0%   20%   30%   20%   15%    8%    5%    2%

Comparison
• Profit of two projects, E and F (percentage at each profit level)

Profit (million)    -1     0     1     2     3     4     5     6
Project E           5%   10%   15%   20%   20%   15%   10%    5%
Project F           0%    0%   10%   40%   40%   10%    0%    0%


3.1. Mean (arithmetic mean)
• Applies to scale variables only
• Mean = (sum of all values) / (number of values)
• Has the same unit as the original data

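• Population: data $\{x_1, x_2, \dots, x_N\}$, mean $\mu = \dfrac{x_1 + x_2 + \cdots + x_N}{N}$
• Sample: data $\{x_1, x_2, \dots, x_n\}$, mean $\bar{x} = \dfrac{x_1 + x_2 + \cdots + x_n}{n}$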

Weighted mean
• Prices (\$) in Quarters 1, 2, 3, 4 are 10, 12, 18, 14, respectively:
  $\bar{x} = \dfrac{10 + 12 + 18 + 14}{4} = 13.5$
• Is there any difference if the sales volumes in Quarters 1, 2, 3, 4 are 70, 90, 110, 130?

                          Q1    Q2    Q3    Q4
Price  (value $x_i$)      10    12    18    14
Volume (weight $w_i$)     70    90   110   130

PROBABILITY & STATISTICS – Bui Duong Hai – NEU – www.mfe.edu.vn/buiduonghai

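Weighted Mean
• In general, for grouped data:
  $\bar{x} = \dfrac{w_1 x_1 + w_2 x_2 + \cdots + w_k x_k}{w_1 + w_2 + \cdots + w_k} = \dfrac{\sum w_i x_i}{\sum w_i}$
• For the price example:
  $\bar{x} = \dfrac{70 \times 10 + 90 \times 12 + 110 \times 18 + 130 \times 14}{70 + 90 + 110 + 130} = \dfrac{5580}{400} = 13.95$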
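The arithmetic of the price example can be checked with a minimal Python sketch. The function name `weighted_mean` and the variable names are illustrative assumptions, not part of the lecture.

```python
def weighted_mean(values, weights):
    """Return sum(w_i * x_i) / sum(w_i)."""
    if len(values) != len(weights):
        raise ValueError("values and weights must have the same length")
    total_weight = sum(weights)
    if total_weight == 0:
        raise ValueError("weights must not sum to zero")
    return sum(w * x for x, w in zip(values, weights)) / total_weight


prices = [10, 12, 18, 14]       # price ($) in Quarters 1..4
volumes = [70, 90, 110, 130]    # sales volume in Quarters 1..4, used as weights

simple_mean = sum(prices) / len(prices)
volume_weighted_mean = weighted_mean(prices, volumes)

print(simple_mean)              # 13.5
print(volume_weighted_mean)     # 13.95
```

The weighted mean is higher than the simple mean because the quarters with larger volumes (Q3, Q4) also have the higher prices.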

Mean of Grouped data
• Frequency, Proportion, Percent table

Wage (\$)                           7      8      9
Number of workers (Frequency)       4     10      6
Proportion (Relative frequency)   0.2    0.5    0.3
Percent                           20%    50%    30%
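As a sketch of how the table is used, the mean wage can be computed either with frequencies or with proportions as weights; both give the same result up to floating-point rounding. The variable names below are assumptions made for illustration.

```python
import math

wages = [7, 8, 9]            # wage ($) values from the table
frequencies = [4, 10, 6]     # number of workers at each wage

n = sum(frequencies)         # total number of workers (20)

# Mean weighted by frequency: sum(f_i * x_i) / n
mean_by_frequency = sum(f * x for x, f in zip(wages, frequencies)) / n

# Mean weighted by proportion (relative frequency); the proportions sum to 1
proportions = [f / n for f in frequencies]
mean_by_proportion = sum(p * x for x, p in zip(wages, proportions))

print(mean_by_frequency)     # 8.1
print(math.isclose(mean_by_frequency, mean_by_proportion))  # True
```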

Compare the Mean
• Compare the means of the following data:
   - Data 1: {10, 10, 11, 12, 12}
   - Data 2: {5, 5, 6, 6, 100}
• The mean is easily affected by extreme values (outliers)
• This may lead to a biased comparison, so other measures should be used
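A short Python sketch makes the sensitivity concrete: it compares the two means and then recomputes the mean of Data 2 without its extreme value. Dropping the value 100 is done here only to illustrate its influence; it is not a procedure recommended by the lecture.

```python
from statistics import mean

data_1 = [10, 10, 11, 12, 12]
data_2 = [5, 5, 6, 6, 100]

mean_1 = mean(data_1)
mean_2 = mean(data_2)

# Same data without the extreme value 100 (illustration only)
mean_2_without_extreme = mean(data_2[:-1])

print(mean_1)                  # 11
print(mean_2)                  # 24.4
print(mean_2_without_extreme)  # 5.5
```

Four of the five values in Data 2 are smaller than every value in Data 1, yet the single value 100 makes its mean more than twice as large.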

3.2. Median
• The median, denoted by $m_e$, is the midpoint of the ordered list of values
• The median can also be applied to ordinal variables
Ex.
• Data: {5, 6, 9, 5, 6}; ordered data: {5, 5, 6, 6, 9}: Median = 6
• Ordered data: {6, 6, 7, 8, 9, 11}: Median = (7 + 8)/2 = 7.5
• Data: {XXS, XS, S, S, S, M, L, XL, XXL}: Median = S
Median
• The median is the 'cutoff point' between the lower 50% and the upper 50% of the data
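The three median examples can be reproduced with a minimal Python sketch. The helper names `median_numeric` and `median_ordinal` are hypothetical. For ordinal data with an even number of observations the two middle categories cannot be averaged, so this sketch returns the lower of the two, which is one possible convention and an assumption of the example.

```python
def median_numeric(data):
    """Midpoint of the ordered values; average of the two middle values if n is even."""
    ordered = sorted(data)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2


def median_ordinal(data, order):
    """Middle category of ordinal data, ranked by the given category order.

    For even n the lower middle category is returned (assumed convention).
    """
    rank = {category: i for i, category in enumerate(order)}
    ordered = sorted(data, key=lambda category: rank[category])
    return ordered[(len(ordered) - 1) // 2]


SIZE_ORDER = ["XXS", "XS", "S", "M", "L", "XL", "XXL"]

print(median_numeric([5, 6, 9, 5, 6]))       # 6
print(median_numeric([6, 6, 7, 8, 9, 11]))   # 7.5
print(median_ordinal(
    ["XXS", "XS", "S", "S", "S", "M", "L", "XL", "XXL"], SIZE_ORDER))  # S
```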

[Figure: Discrete vs Continuous data; in both cases the median separates the lower 50% from the upper 50%]

3.3. Mode
• The mode, denoted by $m_0$, is the value that occurs most often: the frequency of ($X = m_0$) is the largest.
• There may be no mode or several modes.
• The mode can also be applied to nominal variables.
• Example: What are the modes?
   - Data 1: {5, 6, 6, 7, 7, 7, 9}
   - Data 2: {5, 6, 7, 8, 9}
   - Data 3: {5, 6, 9, 5, 6}
   - Data 4: {Yellow, Yellow, Red, Blue, Green}

Mean, Median, Mode
[Figure: four dot plots on a 0 to 10 axis comparing the mean, median, and mode of small data sets, including one case where Mean = Median = Mode = 5 and one case with no mode]
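The mode examples from Section 3.3 can be checked with a short Python sketch. The function name `modes` is hypothetical, and the rule used for "no mode" (every distinct value occurs equally often) is an assumed convention; textbooks differ on this point.

```python
from collections import Counter


def modes(data):
    """Return the list of modes in first-seen order, or [] if there is no mode.

    Assumed convention: when all distinct values share the same frequency
    (and there is more than one distinct value), the data has no mode.
    """
    counts = Counter(data)
    if not counts:
        return []
    top = max(counts.values())
    if len(counts) > 1 and all(c == top for c in counts.values()):
        return []
    return [value for value, c in counts.items() if c == top]


print(modes([5, 6, 6, 7, 7, 7, 9]))                          # [7]
print(modes([5, 6, 7, 8, 9]))                                # []
print(modes([5, 6, 9, 5, 6]))                                # [5, 6]
print(modes(["Yellow", "Yellow", "Red", "Blue", "Green"]))   # ['Yellow']
```

Data 4 shows that the mode needs only counting, which is why it also works for nominal variables.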

Mean, Median, Mode
• Symmetric: Mean = Median = Mode
• Right skewed: Mode < Median < Mean
• Left skewed: Mean < Median < Mode

Grouped data
• Customer's waiting time
• The median is in the group [5, 10)
• Modal group: [5, 10)
• Mean: use the middle value (midpoint) of each group

Waiting time    0 – 5    5 – 10    10 – 15    15 – 20    20 +
Midpoint         2.5      7.5       12.5       17.5      22.5
Frequency         15       20          8          5         2
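The three grouped-data summaries for the waiting-time table can be sketched in Python as follows. The midpoints are taken from the table; representing the open group "20 +" by the interval (20, 25) is an assumption chosen only because it matches the midpoint 22.5 shown there.

```python
# (lower bound, upper bound) of each group; "20 +" is closed at 25 by assumption
groups = [(0, 5), (5, 10), (10, 15), (15, 20), (20, 25)]
midpoints = [2.5, 7.5, 12.5, 17.5, 22.5]
frequencies = [15, 20, 8, 5, 2]

n = sum(frequencies)  # 50 customers

# Mean: each observation is replaced by the midpoint of its group
grouped_mean = sum(m * f for m, f in zip(midpoints, frequencies)) / n

# Modal group: the group with the largest frequency
modal_group = groups[frequencies.index(max(frequencies))]

# Median group: the first group whose cumulative frequency reaches n / 2
cumulative = 0
median_group = None
for group, f in zip(groups, frequencies):
    cumulative += f
    if cumulative >= n / 2:
        median_group = group
        break

print(grouped_mean)   # 8.4
print(modal_group)    # (5, 10)
print(median_group)   # (5, 10)
```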

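3.4. Quartile
• Divide the data into 4 equal parts by 3 cutoff points: the 3 quartiles $Q_1, Q_2, Q_3$
• 2nd quartile: $Q_2 = m_e$ (the median)
[Figure: ordered data split into four parts of 25% each by $Q_1$, $Q_2$, $Q_3$]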

Quantile
• Divide into 5 equal parts by 4 cutoff points: 4 quintiles
• Divide into 10 equal parts by 9 cutoff points: 9 deciles
• 100 equal parts: 99 percentiles
• 10th percentile = 1st decile
• 20th percentile = 2nd decile = 1st quintile
• 25th percentile = 1st quartile
• 50th percentile = 2nd quartile = median
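The relationships in this list can be illustrated with `statistics.quantiles` from the Python standard library (Python 3.8 or later). The data set below is hypothetical: 21 ordered values, chosen so that every cut point used here falls exactly on an observation. The `inclusive` method is one of several quantile conventions and is an assumption of this sketch.

```python
from statistics import median, quantiles

# Hypothetical ordered sample of 21 values (illustration only)
data = [2, 3, 5, 7, 8, 10, 12, 13, 15, 18, 20,
        21, 24, 25, 27, 30, 31, 33, 36, 38, 40]

quartiles = quantiles(data, n=4, method="inclusive")      # 3 cutoff points
quintiles = quantiles(data, n=5, method="inclusive")      # 4 cutoff points
deciles = quantiles(data, n=10, method="inclusive")       # 9 cutoff points
percentiles = quantiles(data, n=100, method="inclusive")  # 99 cutoff points

# percentiles[k - 1] is the k-th percentile
print(percentiles[9] == deciles[0])                       # True: P10 = D1
print(percentiles[19] == deciles[1] == quintiles[0])      # True: P20 = D2 = 1st quintile
print(percentiles[24] == quartiles[0])                    # True: P25 = Q1
print(percentiles[49] == quartiles[1] == median(data))    # True: P50 = Q2 = median
print(quartiles)                                          # [10.0, 20.0, 30.0]
```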

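Microsoft Excel Functions

Measure                              Command / Function
Mean                                 = AVERAGE(data)
Median                               = MEDIAN(data)
Mode                                 = MODE(data)
Quartile k (k = 1, 2, 3)             = QUARTILE(data, k)
Percentile k (k = 1, 2, ..., 99)     = PERCENTILE(data, k/100)

Note: PERCENTILE expects its second argument as a fraction between 0 and 1, so the k-th percentile is entered as k/100.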

Variability
• Central tendency alone may not provide sufficient information about the data.
• Data sets may have the same mean and median but differ in variability (dispersion, spread).
[Figure: four dot plots on a 0 to 9 axis, all with Mean = Median = 5 but with different spreads]

3.5. Range
• Range = largest value – smallest value = $x_{max} - x_{min}$
• The simplest measure, but it carries the least information.
[Figure: two dot plots on a 0 to 10 axis, one with Range = 7 and one with Range = 6]

3.6. Variance & Standard Deviation
• Sample data: $x_1, x_2, \dots, x_n$ with mean $\bar{x}$
• Deviation: $x_i - \bar{x}$, which may be positive (+), negative (–), or zero
• Sum of Squares: $SS = \sum (x_i - \bar{x})^2$
• Variance: $s^2 = \dfrac{SS}{n - 1} = \dfrac{\sum (x_i - \bar{x})^2}{n - 1}$
• The unit of the variance is the squared unit of $x$
• Standard deviation: $s = \sqrt{s^2}$, in the same unit as $x$
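As a worked illustration, the sketch below computes the sample variance and standard deviation step by step (deviations, sum of squares, division by n − 1) for Data 1 and Data 2 from the "Compare the Mean" slide. The function names are assumptions; `statistics.variance` from the standard library is used only as a cross-check.

```python
import math
import statistics


def sample_variance(data):
    """Sample variance: sum of squared deviations from the mean, divided by n - 1."""
    n = len(data)
    if n < 2:
        raise ValueError("at least two observations are required")
    mean = sum(data) / n
    sum_of_squares = sum((x - mean) ** 2 for x in data)
    return sum_of_squares / (n - 1)


def sample_sd(data):
    """Sample standard deviation: square root of the sample variance."""
    return math.sqrt(sample_variance(data))


data_1 = [10, 10, 11, 12, 12]
data_2 = [5, 5, 6, 6, 100]

var_1, sd_1 = sample_variance(data_1), sample_sd(data_1)
var_2, sd_2 = sample_variance(data_2), sample_sd(data_2)

print(var_1, sd_1)                      # 1.0 1.0
print(round(var_2, 1), round(sd_2, 2))  # 1786.3 42.26

# Cross-check against the standard library
print(math.isclose(var_2, statistics.variance(data_2)))  # True
```

The single extreme value in Data 2 dominates its sum of squares, which is why its standard deviation is about 42 times that of Data 1.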

Outlier
• The data have a Lower Limit (LL) and an Upper Limit (UL)
• Observations smaller than LL or greater than UL are outliers
• By quartiles, with $IQR = Q_3 - Q_1$:
   - Lower Limit: $LL = Q_1 - 1.5 \times IQR$
   - Upper Limit: $UL = Q_3 + 1.5 \times IQR$

Key-point and Boxplot
• Find the 5 key points and the outliers

Salary            10   11   12   13   14   15   16   17   18
No. of Workers    10   16   30   19   14   10    0    0    1

• Boxplot
[Figure: boxplot of Salary, with a distance of 1.5 × IQR marked on each side of the box]

Table, Histogram, Boxplot

Value   Freq.
10      10
11      16
12      30
13      19
14      14
15      10
16       0
17       0
18       1

[Figure: histogram of Salary (values 10 to 18, frequency axis 0 to 35)]
[Figure: boxplot of Salary with key values 10, 11, 12, 13.5, 18]
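The five key points and the outlier for the salary data can be reproduced with the Python sketch below. Quartile conventions differ between textbooks and software; this sketch takes Q1 and Q3 as the medians of the lower and upper halves of the ordered data, an assumption chosen because it reproduces the key values 10, 11, 12, 13.5, 18 shown on the slide. The function names are hypothetical.

```python
def median_of_sorted(values):
    """Median of an already sorted list."""
    n = len(values)
    mid = n // 2
    if n % 2 == 1:
        return values[mid]
    return (values[mid - 1] + values[mid]) / 2


def five_key_points(data):
    """Return (min, Q1, Q2, Q3, max) using the median-of-halves convention."""
    ordered = sorted(data)
    n = len(ordered)
    half = n // 2
    lower = ordered[:half]             # lower half
    upper = ordered[half + n % 2:]     # upper half (middle value skipped if n is odd)
    return (ordered[0], median_of_sorted(lower), median_of_sorted(ordered),
            median_of_sorted(upper), ordered[-1])


# Frequency table from the slide: salary -> number of workers
salary_freq = {10: 10, 11: 16, 12: 30, 13: 19, 14: 14, 15: 10, 16: 0, 17: 0, 18: 1}
salaries = [value for value, freq in salary_freq.items() for _ in range(freq)]

minimum, q1, q2, q3, maximum = five_key_points(salaries)
iqr = q3 - q1
lower_limit = q1 - 1.5 * iqr
upper_limit = q3 + 1.5 * iqr
outliers = [x for x in salaries if x < lower_limit or x > upper_limit]

print(minimum, q1, q2, q3, maximum)   # 10 11.0 12.0 13.5 18
print(lower_limit, upper_limit)       # 7.25 17.25
print(outliers)                       # [18]
```

With these limits the whisker on the upper side stops at 15, the largest salary not exceeding 17.25, and the salary of 18 is plotted as an outlier.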

Boxplot: Key values and Whiskers

                     A      B      C      D      E      F
Max                  6      6      7      9      6      4
Q3                   5      4      6      6      4      3
Q2                 4.5    2.5    5.5    4.5    2.5    2.5
Q1                   3      2      4      4      1      2
Min                  1      1      1      3     -1      1
Mean ($\bar{x}$)   4.2    2.8   5.16   4.84    2.5    2.5
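The key values in this table can be recomputed from the profit distributions of Projects A to F. In the sketch below the distributions are read from the bar charts in the "Comparison" slides (profit in millions mapped to its percentage, zero-percent levels omitted). The quantile rule used, which averages two neighboring values when the cumulative percentage hits the target exactly, is an assumption that happens to reproduce the table; the function names are hypothetical.

```python
# Profit (million) -> percentage, as read from the bar charts
projects = {
    "A": {1: 5, 2: 10, 3: 15, 4: 20, 5: 30, 6: 20},
    "B": {1: 20, 2: 30, 3: 20, 4: 15, 5: 10, 6: 5},
    "C": {1: 2, 2: 5, 3: 8, 4: 15, 5: 20, 6: 30, 7: 20},
    "D": {3: 20, 4: 30, 5: 20, 6: 15, 7: 8, 8: 5, 9: 2},
    "E": {-1: 5, 0: 10, 1: 15, 2: 20, 3: 20, 4: 15, 5: 10, 6: 5},
    "F": {1: 10, 2: 40, 3: 40, 4: 10},
}


def dist_mean(dist):
    """Mean of a distribution given as {value: percent}."""
    return sum(value * pct for value, pct in dist.items()) / 100


def dist_quantile(dist, p):
    """Value at cumulative percentage p (0 < p < 100) of {value: percent}.

    If the cumulative percentage equals p exactly at some value, the result
    is the average of that value and the next one (assumed convention).
    """
    values = sorted(dist)
    cumulative = 0
    for i, value in enumerate(values):
        cumulative += dist[value]
        if cumulative > p:
            return float(value)
        if cumulative == p:
            return (value + values[i + 1]) / 2
    raise ValueError("percentages must sum to 100 and p must be below 100")


def key_values(dist):
    return {
        "Max": max(dist), "Q3": dist_quantile(dist, 75),
        "Q2": dist_quantile(dist, 50), "Q1": dist_quantile(dist, 25),
        "Min": min(dist), "Mean": dist_mean(dist),
    }


summary = {name: key_values(dist) for name, dist in projects.items()}
for name, row in summary.items():
    print(name, row)
```

Projects E and F share the same mean and median (2.5) but differ in spread: E ranges from -1 to 6, F only from 1 to 4.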

[Figure: boxplots by year (2014, 2015, 2016, 2017), each marking Max, Q3, Q2, Q1, Min, and Mean]

3.9. Skewness (Sk)
• Sk = 0: two-tail
• Sk = 0.3: right short tail
• Sk = – 0.3: left short tail
• Sk = 1.3: right long tail
• Sk = – 1.3: left long tail

3.10. Covariance & Correlation
• Covariance: the combined variability of $X$ and $Y$; in a sample:
  $Cov(X, Y) = s_{XY} = \dfrac{\sum (x_i - \bar{x})(y_i - \bar{y})}{n - 1}$

[Figure: two scatter plots divided by lines at the mean of X and the mean of Y, one showing positive covariance and one showing negative covariance]

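Correlation Coefficient
  $r_{XY} = Corr(X, Y) = \dfrac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum (x_i - \bar{x})^2 \sum (y_i - \bar{y})^2}}$
• $-1 \le r_{XY} \le 1$, no unit
• $r_{XY}$ measures the linear relationship between $X$ and $Y$
• $r_{XY} = -1$: perfect negative linear relationship
• $-1 < r_{XY} < 0$: negatively correlated
• $r_{XY} = 0$: uncorrelated (no linear relationship)
• $0 < r_{XY} < 1$: positively correlated
• $r_{XY} = 1$: perfect positive linear relationship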

Correlation
• Graph and Correlation Coefficient ($r$)
[Figure: scatter plots for r = 0.5, r = – 0.5, r = 0.8, and r = 0, labeled as positively or negatively correlated, weak or strong, and not correlated]

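Correlation Coefficient
• X: Advertising; Y: Sales

Month   $x_i$   $y_i$   $x_i - \bar{x}$   $y_i - \bar{y}$   $(x_i - \bar{x})^2$   $(y_i - \bar{y})^2$   $(x_i - \bar{x})(y_i - \bar{y})$
Jan       5      10
Feb       6      15
Mar       8      10
Apr       9      18
May      12      32
Sum
Mean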
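The worksheet for the advertising and sales data can be filled in with a short Python sketch that follows the same columns: deviations, squared deviations, and cross-products. The function names are assumptions made for illustration.

```python
import math

advertising = [5, 6, 8, 9, 12]    # X, January to May
sales = [10, 15, 10, 18, 32]      # Y, January to May


def deviations(data):
    """Deviations of each value from the sample mean."""
    mean = sum(data) / len(data)
    return [x - mean for x in data]


def sample_covariance(xs, ys):
    """Sum of cross-products of deviations, divided by n - 1."""
    dx, dy = deviations(xs), deviations(ys)
    return sum(a * b for a, b in zip(dx, dy)) / (len(xs) - 1)


def correlation(xs, ys):
    """Sum of cross-products over the square root of the product of sums of squares."""
    dx, dy = deviations(xs), deviations(ys)
    cross = sum(a * b for a, b in zip(dx, dy))
    return cross / math.sqrt(sum(a * a for a in dx) * sum(b * b for b in dy))


cov_xy = sample_covariance(advertising, sales)
r_xy = correlation(advertising, sales)

print(deviations(advertising))   # [-3.0, -2.0, 0.0, 1.0, 4.0]
print(deviations(sales))         # [-7.0, -2.0, -7.0, 1.0, 15.0]
print(cov_xy)                    # 21.5
print(round(r_xy, 3))            # 0.867
```

The sums of squares are 30 for X and 328 for Y, and the sum of cross-products is 86, so r = 86 / sqrt(30 × 328), a fairly strong positive linear relationship.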

Interquartile Range
• The interquartile range (IQR) is the range between the 3rd quartile and the 1st quartile
• $IQR = Q_3 - Q_1$
• The IQR is the width of the middle 50% of the data
3.11. Standardized value
• The z-score of one value in the data has no unit:
  $z = \dfrac{x - \bar{x}}{s}$
Ex. Compare the Microeconomics and Macroeconomics scores of one student in one class if:
• Micro score = 7.5; Macro score = 9
• Mean of Micro in class = 6; Mean of Macro = 7
• S.D. of Micro = 1; S.D. of Macro = 2

Excel: Statistic Functions

Statistic                                   Function
Sum                                         = SUM(array)
Mean $\bar{X}$                              = AVERAGE(array)
Median                                      = MEDIAN(array)
Kth Quartile (Q1, Q2, Q3)                   = QUARTILE(array, k)
Sample variance ($S^2$)                     = VAR(array)
Sample S.D. ($S$)                           = STDEV(array)
Covariance Cov(X, Y)                        = COVARIANCE.S(array1, array2)
Correlation $r_{XY}$                        = CORREL(array1, array2)
$X \sim N(\mu, \sigma^2)$; $P(X < b)$       = NORMDIST(b, µ, σ, 1)

Note: the older function COVAR(array1, array2) divides by n and therefore returns the population covariance; COVARIANCE.S divides by n − 1, matching the sample formula in Section 3.10.

Exercise
[1] Chapter 3:
• (p110) 2, 3, 6, 7, 11, 13
• (p120) 26, 27, 29, 33
• (p133) 49, 50, 52
• (p143) 56, 58, 59
• (p152) 62, 63, 70
• Case Problem 1, 4
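For the standardized-value example in Section 3.11, a minimal Python sketch of the z-score comparison is given below. The function name `z_score` is an assumption made for illustration.

```python
def z_score(value, mean, sd):
    """Standardized value: distance from the mean measured in standard deviations."""
    if sd <= 0:
        raise ValueError("standard deviation must be positive")
    return (value - mean) / sd


z_micro = z_score(7.5, mean=6, sd=1)   # Microeconomics
z_macro = z_score(9, mean=7, sd=2)     # Macroeconomics

print(z_micro)   # 1.5
print(z_macro)   # 1.0
```

Although the raw Macro score (9) is higher, the Micro score lies 1.5 standard deviations above its class mean against 1.0 for Macro, so relative to the class the student did better in Microeconomics.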