Central Limit Theorem (CLT) and Law of Large Numbers (LLN)
If you’re here, you’re likely diving into my article on Mathematics for Data Science Part 2 or seeking a detailed understanding of two cornerstone concepts in statistics: Central Limit Theorem (CLT) and Law of Large Numbers (LLN). Either way, you’ve arrived at the right place!
CLT and LLN are pivotal ideas that form the foundation of statistical reasoning. Whether you’re conducting hypothesis tests, building predictive models, or analyzing patterns in data, these principles underpin many of the methods we rely on. While they are often mentioned together, each concept addresses a unique statistical phenomenon, and a deeper understanding of both is critical for applying them effectively in real-world scenarios.
In this article, I’ve set aside dedicated sections to break down these two topics comprehensively. To make the learning curve smoother, we’ll begin with some essential prerequisites like z-scores and the normal distribution. With these fundamentals in place, we’ll explore CLT and LLN in depth, supported by real-world examples, applications, and a side-by-side comparison.
Let’s get started!
Z-score:
The z-score is a statistical measurement that describes a data point’s position relative to the mean of a dataset. It tells you how many standard deviations a specific value (data point) is away from the mean. Z-scores are particularly useful for comparing data points from different distributions and for standardizing scores. It is calculated as:

Z = (X − μ) / σ

Where:
- Z: The z-score
- X: The raw data point
- μ: The mean of the dataset
- σ: The standard deviation of the dataset
The magnitude of the z-score represents how far the data point is from the mean in terms of standard deviations:
- Z = 0: The data point is exactly at the mean.
- Z = 2: The data point is 2 standard deviations above the mean.
- Z = −1.5: The data point is 1.5 standard deviations below the mean.
Example:
Suppose the mean score of an exam is 70 with a standard deviation of 10. A student scores 85. The z-score is:

Z = (85 − 70) / 10 = 1.5

Using a standard normal table or a z-score calculator, Z = 1.5 corresponds to a cumulative probability of about 0.9332. This means the student performed better than roughly 93.32% of their peers who took the exam, and only about 6.68% of students scored higher.
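The calculation above can be sketched in a few lines of Python using only the standard library; the cumulative probability comes from the standard normal CDF, expressed via the error function (the exam numbers are the ones from the example):

```python
from math import erf, sqrt

def z_score(x, mu, sigma):
    """How many standard deviations x lies from the mean."""
    return (x - mu) / sigma

def normal_cdf(z):
    """P(Z <= z) for the standard normal, via the error function."""
    return 0.5 * (1 + erf(z / sqrt(2)))

z = z_score(85, 70, 10)       # exam example: score 85, mean 70, sd 10
p = normal_cdf(z)
print(z)                      # 1.5
print(round(p * 100, 2))      # ≈ 93.32
```

For real work, `scipy.stats.norm.cdf` gives the same quantity directly.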
Normal Distribution:
The Normal Distribution is a probability distribution that is symmetric around the mean. It describes how data is distributed and how most of the data points are clustered around the central value (the mean).
- Bell-shaped curve: The distribution is often referred to as a “bell curve.”
- Symmetry: It is symmetric about the mean, meaning that the left and right sides are mirror images.
- Defined by two parameters: the mean (μ), which sets the center of the distribution, and the standard deviation (σ), which sets its spread. The larger the standard deviation, the more spread out the distribution is.
- 68–95–99.7 Rule (Empirical Rule): For a normal distribution, about 68% of the data falls within 1 standard deviation of the mean, about 95% within 2 standard deviations, and about 99.7% within 3 standard deviations.
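The empirical rule can be checked with a quick simulation: draw a large normal sample and count the share of points within 1, 2, and 3 standard deviations. The mean of 70 and standard deviation of 10 are just illustrative values:

```python
import random

random.seed(42)
mu, sigma, n = 70, 10, 100_000
data = [random.gauss(mu, sigma) for _ in range(n)]

def share_within(k):
    """Fraction of the sample within k standard deviations of the mean."""
    return sum(mu - k * sigma <= x <= mu + k * sigma for x in data) / n

for k in (1, 2, 3):
    print(f"within {k} sigma: {share_within(k):.3f}")
```

The printed fractions come out close to 0.683, 0.954, and 0.997, matching the rule.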
Standard Normal Distribution
The Standard Normal Distribution is a special case of the normal distribution where the mean (μ) is 0 and the standard deviation (σ) is 1. It is denoted as Z ~ N(0, 1), where Z represents the standardized values or z-scores.
Why Do We Need the Standard Normal Distribution?
- To make calculations and comparisons easier, we can standardize any normal distribution to a Standard Normal Distribution using z-scores.
- The Standard Normal Distribution provides a common reference framework, making probability calculations and comparisons easier.
- Standardization ensures all normal distributions are expressed on the same scale, irrespective of their original mean and standard deviation.
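Standardization can be verified empirically: subtracting the mean and dividing by the standard deviation leaves any dataset with mean 0 and standard deviation 1. A small sketch (the data here is simulated for illustration):

```python
import random, statistics

random.seed(0)
# Any normal data; here, exam-like scores with mean 70 and sd 10
scores = [random.gauss(70, 10) for _ in range(10_000)]

m, s = statistics.mean(scores), statistics.stdev(scores)
z = [(x - m) / s for x in scores]     # standardize each point

print(round(statistics.mean(z), 4))   # 0.0
print(round(statistics.stdev(z), 4))  # 1.0
```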
Central Limit Theorem (CLT)
The Central Limit Theorem (CLT) is a fundamental concept in probability and statistics. It states that the distribution of the sample mean approaches a normal distribution as the sample size increases, regardless of the shape of the population distribution. This sampling distribution has the same mean as the original population (μ), and its standard deviation shrinks to σ/√n as n grows. The approximation holds for sufficiently large sample sizes, typically n ≥ 30.
Example:
Please refer to this video for a clear understanding of CLT with an example.
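The theorem can also be demonstrated with a short simulation: draw many samples of size 30 from a clearly non-normal (exponential) population and look at the mean and spread of the resulting sample means. The population mean of 5 is an arbitrary choice for illustration:

```python
import random, statistics

random.seed(1)
pop_mean = 5.0  # exponential population: skewed, not normal

def sample_mean(n):
    """Mean of one random sample of n exponential draws."""
    return statistics.mean(random.expovariate(1 / pop_mean) for _ in range(n))

# Sampling distribution of the mean, built from 10,000 samples of size 30
means = [sample_mean(30) for _ in range(10_000)]

print(round(statistics.mean(means), 2))   # close to the population mean, 5.0
print(round(statistics.stdev(means), 2))  # close to sigma/sqrt(n) = 5/sqrt(30)
```

A histogram of `means` would look bell-shaped even though the underlying population is heavily skewed, which is exactly the CLT at work.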
Application of CLT:
Confidence Intervals: CLT is used to construct confidence intervals for population parameters (like the population mean). Since the sampling distribution of the sample mean is approximately normal (for sufficiently large sample sizes), we can estimate the range within which the true population mean is likely to lie with a certain level of confidence.
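A 95% confidence interval for a population mean can be sketched from a single sample, relying on the CLT for approximate normality; the data here is simulated and 1.96 is the standard normal critical value for 95% coverage:

```python
import math, random, statistics

random.seed(2)
# Simulated sample of 100 call durations (n >= 30, so the CLT applies)
sample = [random.expovariate(1 / 5.0) for _ in range(100)]

xbar = statistics.mean(sample)
se = statistics.stdev(sample) / math.sqrt(len(sample))  # standard error
z = 1.96  # 95% critical value of the standard normal

ci = (xbar - z * se, xbar + z * se)
print(f"95% CI for the mean: ({ci[0]:.2f}, {ci[1]:.2f})")
```

In repeated sampling, intervals built this way would cover the true mean about 95% of the time.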
Hypothesis Testing: CLT enables hypothesis testing for population parameters (e.g., testing if the population mean is equal to a certain value). Since the sample means will follow a normal distribution, hypothesis tests such as z-tests and t-tests rely on this property to assess the probability of observing a sample mean under the null hypothesis.
Quality Control and Process Monitoring: In manufacturing and quality control, CLT is used to monitor the quality of products by analyzing sample means. If the sample means of multiple batches of products follow a normal distribution, process deviations can be detected early.
Law of Large Numbers
The Law of Large Numbers (LLN) is a fundamental principle in probability and statistics. It states that as the size of a sample increases, the sample mean (or another sample statistic) gets closer to the true population mean (or the corresponding population parameter). In other words, the larger the sample size, the more accurate the estimate of the population parameter becomes.
This principle is crucial because it gives us confidence that with enough data, the sample mean (or any statistic derived from the sample) will closely approximate the population mean.
Example:
Suppose you’re working in a contact center and want to estimate the average call duration (population mean) based on a random sample of calls.
1. Population Mean: The true average call duration for the entire population of calls is μ = 5 minutes.
2. Small Sample: Initially, you take a small sample of 5 calls and find that the average call duration is X̅₅ = 6 minutes. This is quite far from the true population mean of 5 minutes.
3. Increasing Sample Size: As you increase the sample size:
- With n = 10, you might get X̅₁₀ = 5.2 minutes.
- With n = 30, you might get X̅₃₀ = 5.05 minutes.
- With n = 100, you might get X̅₁₀₀ = 5.01 minutes.
As the sample size increases, the sample mean gets closer to the true population mean of 5 minutes.
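The convergence described above can be simulated. The numbers below are random draws, so individual runs will bounce around, but the largest sample reliably lands very close to the true mean of 5:

```python
import random

random.seed(3)
true_mean = 5.0  # true average call duration, in minutes

def running_mean(n):
    """Sample mean of n simulated call durations (exponential model)."""
    return sum(random.expovariate(1 / true_mean) for _ in range(n)) / n

for n in (5, 10, 30, 100, 100_000):
    print(f"n = {n:>6}: sample mean = {running_mean(n):.2f}")
```

Small samples can miss by a full minute or more; by n = 100,000 the estimate is essentially pinned to 5.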
Interpretation:
- Initially, your estimate of the average call duration might be far off due to the small sample size.
- As you collect more data (increase the sample size), your estimate becomes more accurate, converging toward the true population mean of 5 minutes.
Application of LLN:
Estimating Population Parameters: The Law of Large Numbers ensures that as we collect more data, our estimates (like means and variances) become more reliable. This is crucial in data science when making predictions or inferring properties of a population.
Data Sampling: In large datasets, sampling allows data scientists to make inferences about the entire population based on a smaller sample. As the sample size grows, the sample statistics become more representative of the population, reducing errors due to small sample sizes.
Model Accuracy: The more data you use to train a model, the more accurate it tends to become. With larger datasets, machine learning models generally perform better because they can generalize to the broader population of data.
LLN (Law of Large Numbers) vs. CLT (Central Limit Theorem)
The Law of Large Numbers (LLN) and the Central Limit Theorem (CLT) are two foundational concepts in probability and statistics that deal with the behaviour of sample statistics as the sample size increases. Although they are related, they focus on different aspects of statistics and have distinct purposes.
Here’s a comparison of LLN and CLT:
- Focus: LLN describes the convergence of the sample mean toward the population mean; CLT describes the shape of the sampling distribution of the mean.
- What it tells you: LLN says your estimates become more accurate as you collect more data; CLT says sample means become approximately normally distributed, regardless of the population’s shape.
- Typical use: LLN justifies trusting large samples for reliable estimates; CLT justifies normal-based tools such as confidence intervals and hypothesis tests.
Conclusion:
- LLN is about convergence: It tells you that as the sample size increases, the sample mean will get closer to the population mean.
- CLT is about distribution: It tells you that as the sample size increases, the distribution of the sample mean will approach a normal distribution, centered at the population mean (μ) with standard deviation σ/√n.
The Law of Large Numbers (LLN) and the Central Limit Theorem (CLT) are more than just theoretical concepts; they are the backbone of practical data science and statistical analysis. LLN ensures that with enough data, our estimates (like the sample mean) converge to the true population values, giving us confidence in making reliable predictions. On the other hand, CLT allows us to assume normality for sample means, enabling powerful tools like confidence intervals and hypothesis testing even when the population distribution isn’t normal.
If you found this article helpful, I’d love to hear your thoughts! Feel free to leave a comment below with your feedback or any questions you have. Don’t forget to share this article with others who might find it useful. Your support means a lot — thank you for reading!