When you want to make sense of the data you encounter every day, it helps to know how and when to use common data analysis techniques and statistical formulas.
Understanding formulas for common statistics
The most common descriptive statistics are listed below, along with their formulas and a short description of what each one measures.
1. Population
The entire group one desires information about.
2. Sample
A subset of the population selected according to some scheme.
3. Mean
A measure of centre; affected by outliers.
Population mean | Sample mean |
\(\mu=\frac{1}{N}\sum_{i=1}^{N}x_i\) | \(\bar{x}=\frac{1}{n}\sum_{i=1}^{n}x_i\) |
4. Variance
A measure of variation (spread) around the mean.
Population variance | Sample variance |
\(\sigma^2=\frac{\sum(x-\mu)^2}{N}\) | \(s^2=\frac{\sum(x-\bar{x})^2}{n-1}\) |
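The mean and variance formulas above can be checked with a few lines of Python. This is a minimal sketch using only the standard library; the `data` values are hypothetical and are treated first as a whole population and then as a sample, purely to show the effect of dividing by N versus n-1.

```python
import statistics

# Hypothetical data, treated both as a full population and as a sample.
data = [2, 4, 4, 4, 5, 5, 7, 9]

n = len(data)
mean = sum(data) / n                               # same arithmetic for mu and x-bar
sum_sq_dev = sum((x - mean) ** 2 for x in data)    # sum of squared deviations

population_variance = sum_sq_dev / n               # divide by N
sample_variance = sum_sq_dev / (n - 1)             # divide by n - 1

# Cross-check the hand-written formulas against the standard library.
assert abs(population_variance - statistics.pvariance(data)) < 1e-12
assert abs(sample_variance - statistics.variance(data)) < 1e-12

print(mean, population_variance, sample_variance)
```

The sample variance is always the larger of the two for the same data, because the sum of squared deviations is divided by a smaller number.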
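5. Correlation coefficient
Pearson correlation (r): measures the strength and direction of the linear relationship between two variables; it ranges from -1 to 1.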
\(r=\frac{1}{n-1}\frac{\sum(x-\bar{x})(y-\bar{y})}{s_x s_y}\)
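As an illustration, the Pearson formula can be written out directly. The sketch below assumes two equal-length lists of numbers; the function name `pearson_r` and the `hours`/`scores` data are hypothetical.

```python
import statistics

def pearson_r(x, y):
    """Pearson correlation using sample standard deviations (n - 1 divisor)."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    n = len(x)
    x_bar, y_bar = statistics.mean(x), statistics.mean(y)
    s_x, s_y = statistics.stdev(x), statistics.stdev(y)
    # Sum of cross-products of deviations from each mean.
    cross = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    return cross / ((n - 1) * s_x * s_y)

# Hypothetical paired observations.
hours = [1, 2, 3, 4, 5]
scores = [2, 4, 5, 4, 5]
r = pearson_r(hours, scores)
print(round(r, 4))
```

A value of r close to 1 or -1 indicates a strong linear relationship; a value near 0 indicates little or no linear relationship.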
Handling statistical hypothesis tests
1. Z-statistic
Used when the population standard deviation (\(\sigma\)) is known. The sample must be drawn from a normal distribution or have a sample size (n) of at least 30.
\(z=\frac{\bar{x}-\mu}{\sigma/\sqrt{n}}\), where \(\mu\) = population mean (either known or hypothesized under \(H_{0}\))
Confidence interval: the interval of values within which a hypothesized \(\mu\) may be considered tenable. Common confidence levels are 90%, 95% and 99%.
\(1-\alpha\) confidence interval for \(\mu\):
\(\bar{x}-z_{\alpha/2}(\sigma/\sqrt{n})\leq\mu\leq\bar{x}+z_{\alpha/2}(\sigma/\sqrt{n})\)
where \(\sigma\) is the known population standard deviation and \(z_{\alpha/2}\) is the value of the standard normal variable z that leaves a probability of \(\alpha/2\) in each tail of the distribution.
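The z-statistic and its confidence interval can be sketched as follows. The function names and the numbers in the usage lines (a sample mean of 52, a hypothesized mean of 50, a known standard deviation of 6, n = 36) are hypothetical; the critical value comes from `statistics.NormalDist`, which is available in Python 3.8 and later.

```python
import math
from statistics import NormalDist

def z_statistic(x_bar, mu_0, sigma, n):
    """z = (x_bar - mu_0) / (sigma / sqrt(n)), with sigma known."""
    return (x_bar - mu_0) / (sigma / math.sqrt(n))

def z_confidence_interval(x_bar, sigma, n, confidence=0.95):
    """(1 - alpha) confidence interval for the population mean."""
    alpha = 1 - confidence
    # Critical value that leaves alpha / 2 in the upper tail.
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    margin = z_crit * sigma / math.sqrt(n)
    return x_bar - margin, x_bar + margin

# Hypothetical inputs.
z = z_statistic(x_bar=52, mu_0=50, sigma=6, n=36)
lower, upper = z_confidence_interval(x_bar=52, sigma=6, n=36)
print(z, round(lower, 3), round(upper, 3))
```

In this example the hypothesized mean of 50 falls just outside the 95% interval, which agrees with a z-statistic slightly above the two-sided critical value of about 1.96.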
2. t-statistic
Used when the population standard deviation is unknown and is estimated by the sample standard deviation s. The test statistic follows Student's t distribution.
One sample: tests whether the mean of a normally distributed population is different from a specified value.
\(t=\frac{\bar{x}-\mu}{s/\sqrt{n}}\), where \(\mu\) is the specified value and degrees of freedom (df) = n-1
Two samples: tests whether the means of two populations are significantly different from one another.
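\(t=\frac{\bar{x}_1-\bar{x}_2}{\sqrt{\frac{s_1^2}{n_1}+\frac{s_2^2}{n_2}}}\), where degrees of freedom (df) = \(n_1+n_2-2\) if the two populations are assumed to have equal variances (otherwise df is obtained from the Welch-Satterthwaite approximation)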
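Both t-statistics can be computed directly from their formulas. The sketch below uses only the standard library, so it returns the statistic and its degrees of freedom but not a p-value; that would have to be looked up in a t table or obtained from a statistics library. The function names and sample data are hypothetical, and the two-sample df follows the \(n_1+n_2-2\) rule stated above.

```python
import math
import statistics

def one_sample_t(sample, mu_0):
    """t = (x_bar - mu_0) / (s / sqrt(n)), df = n - 1."""
    n = len(sample)
    x_bar = statistics.mean(sample)
    s = statistics.stdev(sample)          # sample standard deviation
    t = (x_bar - mu_0) / (s / math.sqrt(n))
    return t, n - 1

def two_sample_t(sample_1, sample_2):
    """t from the two-sample formula above, df = n_1 + n_2 - 2."""
    n_1, n_2 = len(sample_1), len(sample_2)
    var_1 = statistics.variance(sample_1)  # s_1 squared
    var_2 = statistics.variance(sample_2)  # s_2 squared
    standard_error = math.sqrt(var_1 / n_1 + var_2 / n_2)
    t = (statistics.mean(sample_1) - statistics.mean(sample_2)) / standard_error
    return t, n_1 + n_2 - 2

# Hypothetical samples.
group_a = [4, 6, 8, 10, 12]
group_b = [1, 3, 5, 7, 9]

t_one, df_one = one_sample_t(group_a, mu_0=6)
t_two, df_two = two_sample_t(group_a, group_b)
print(round(t_one, 4), df_one, round(t_two, 4), df_two)
```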
3. Analysis of variance (ANOVA)
\(H_{0}:\mu_1=\mu_2=\mu_3\)
\(H_{1}:\) at least one pair of population means differs.
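Grand mean \(\bar{x}_G\) (the mean of the sample means when all K groups have the same size)
\(\bar{x}_G=\frac{\bar{x}_1+\dots+\bar{x}_K}{K}\)
With unequal group sizes, weight each group mean by its sample size: \(\bar{x}_G=\frac{\sum n_k\bar{x}_k}{\sum n_k}\), which is the mean of all observations.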
Between-Group Variance: reflects the magnitude of differences among the group means.
\(BGV=\frac{\sum n_k(\bar{x}_k-\bar{x}_G)^2}{K-1}\), where \(n_k\) is the sample size of group k and \(K\) is the number of groups.
Within-Group Variance: reflects the dispersion within each group.
\(WGV=\frac{\sum\sum(x_i-\bar{x}_k)^2}{N-K}\), where \(N\) is the total number of observations across all groups and \(K\) is the number of groups.
Use the F-ratio: \(F=\frac{BGV}{WGV}\), with K-1 and N-K degrees of freedom.
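The ANOVA steps above can be sketched as a single function. This is an illustrative implementation under stated assumptions: `groups` is a list of lists of numbers, the function name `one_way_anova_f` and the data are hypothetical, and the grand mean is computed over all observations (which coincides with the mean of the group means when the groups are the same size).

```python
def one_way_anova_f(groups):
    """Return the F-ratio and its (K - 1, N - K) degrees of freedom."""
    k = len(groups)                              # number of groups, K
    n_total = sum(len(g) for g in groups)        # total observations, N
    group_means = [sum(g) / len(g) for g in groups]
    grand_mean = sum(sum(g) for g in groups) / n_total

    # Between-group variance: size-weighted spread of the group means.
    between = sum(
        len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, group_means)
    ) / (k - 1)

    # Within-group variance: spread of observations around their own group mean.
    within = sum(
        (x - m) ** 2 for g, m in zip(groups, group_means) for x in g
    ) / (n_total - k)

    return between / within, (k - 1, n_total - k)

# Hypothetical data: three groups of three observations each.
groups = [[1, 2, 3], [2, 3, 4], [5, 6, 7]]
f_ratio, dfs = one_way_anova_f(groups)
print(round(f_ratio, 4), dfs)
```

A large F-ratio means the group means differ by more than the spread within the groups would suggest. The sketch does not guard against a within-group variance of zero, which would occur if every group contained identical values.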
4. Chi-squared test
For goodness of fit: checks whether an observed pattern of data fits a given distribution.
\(\chi^2=\sum\frac{(O-E)^2}{E}\), where O is the observed value and E is the expected value
Degrees of freedom = number of categories in the distribution -1
For independence: checks whether two categorical variables are related.
\(\chi^2=\sum\frac{(O-E)^2}{E}\), where O is the observed value and E is the expected value of each cell, computed as (row total × column total) / grand total.
Degrees of freedom = (r-1)(c-1), where r is the number of rows and c is the number of columns.
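Both chi-squared statistics share the same sum and differ only in how the expected counts are obtained. The sketch below is a minimal illustration: the function names and data are hypothetical, the goodness-of-fit version takes the expected counts as given, and the independence version takes each expected count as row total × column total / grand total.

```python
def chi_squared_goodness_of_fit(observed, expected):
    """Return the chi-squared statistic and df = categories - 1."""
    if len(observed) != len(expected):
        raise ValueError("observed and expected must have the same length")
    stat = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    return stat, len(observed) - 1

def chi_squared_independence(table):
    """Return the chi-squared statistic and df = (r - 1)(c - 1)."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count if the row and column variables were independent.
            expected = row_totals[i] * col_totals[j] / grand_total
            stat += (observed - expected) ** 2 / expected
    return stat, (len(row_totals) - 1) * (len(col_totals) - 1)

# Hypothetical data: 120 rolls of a die tested against a uniform distribution.
rolls = [18, 22, 20, 25, 15, 20]
gof_stat, gof_df = chi_squared_goodness_of_fit(rolls, [20] * 6)

# Hypothetical 2 x 2 contingency table.
ind_stat, ind_df = chi_squared_independence([[20, 30], [30, 20]])
print(round(gof_stat, 4), gof_df, round(ind_stat, 4), ind_df)
```

Each statistic is then compared with the chi-squared distribution for its degrees of freedom to decide whether to reject the null hypothesis.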