Mathematics for Data Science Part 2- Probability & Statistics

Kiran Nagarkoti

image by author

If you have read Part 1 of Mathematics for Data Science, you already know Probability and Statistics was one of my favourite subjects. I was intrigued by their ability to provide insights into uncertainty and randomness. Little did I know how foundational they would become in my career as a Data Scientist. If Linear Algebra serves as the backbone of data transformations and algorithms, Probability and Statistics are the lenses through which we understand uncertainty and derive insights from data.

In this article, we’ll explore key topics in Probability and Statistics — Descriptive Statistics, Probability, Distributions, Central Limit Theorem (CLT), Law of Large Numbers and Hypothesis Testing, breaking down each concept and linking it to real-world applications in data science. By the end, you’ll see how these mathematical tools are indispensable for data-driven decision-making.

Stay with me, as this is going to be a long one!

1. Descriptive Statistics

Descriptive statistics are tools used to summarize and describe the essential features of a dataset. They provide a quick overview of the data’s central tendency (mean, median, mode), variability (range, variance, standard deviation), and distribution. You will likely use descriptive statistics most often in your day-to-day work.

  • Mean: The average value of a dataset.
  • Median: The middle value when data is sorted.
  • Mode: The most frequently occurring value(s).
  • Variance and Standard Deviation: Measures of data spread or variability.
  • Percentiles and Quartiles: Indicators of data distribution that divide the data into equal parts to analyze specific segments.

Example: Suppose you are analyzing the historical revenue data of an apparel brand over the past 5 years to predict daily revenue.

Here’s the revenue data (in $) from the last 7 days of one store:

[3000, 3200, 3100, 2800, 2900, 3100, 3000]

Mean (Average Revenue):

The average daily revenue is 21,100 / 7 ≈ $3014.29.

Median (Middle Value):

Sort the data:

[2800,2900,3000,3000,3100,3100,3200]

The middle value (the 4th of the 7 sorted values) is the median: $3000.

Mode (Most Frequent Revenue):

Both 3000 and 3100 occur twice, more often than any other value, so the data is bimodal: the modes are $3000 and $3100.

Variance and Standard Deviation (Spread of Revenue):

The variance is about 15,510 (in squared dollars), and the population standard deviation is about $124.54.

Percentiles (Revenue Distribution):

  • 25th Percentile (Q1): The value below which 25% of the data lies is $2900.
  • 50th Percentile (Q2/Median): The value below which 50% of the data lies is $3000.
  • 75th Percentile (Q3): The value below which 75% of the data lies is $3100.

Inference:

  • The mean of $3014.29 tells us that, on average, the store generates approximately $3014.29 in revenue daily.
  • The median of $3000 means half of the days had revenue above $3000, while the other half had revenue below this value.
  • The modes ($3000 and $3100) indicate that the store most often generates daily revenue at these two values.
  • A standard deviation of about $124.54 shows that daily revenues typically vary by roughly $125 from the mean. A low standard deviation indicates relatively consistent sales performance, and this stability is desirable for forecasting and planning. A high standard deviation (e.g., $500 or more) would indicate fluctuating sales, which could make planning inventory, staffing, or promotions more challenging.
  • Percentiles divide the data into equal parts to analyze specific segments. The 75th percentile (Q3 = $3100) means 75% of days had revenue below $3100, so the top 25% of days had revenue higher than this. Days in this range represent high-performing periods.
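
If you want to reproduce these summary numbers yourself, here is a minimal sketch using only Python’s standard library (it assumes the 7-day revenue sample shown above):

```python
# Descriptive statistics for the 7-day revenue sample, standard library only.
import statistics

revenue = [3000, 3200, 3100, 2800, 2900, 3100, 3000]

mean = statistics.mean(revenue)              # ~3014.29
median = statistics.median(revenue)          # 3000
modes = statistics.multimode(revenue)        # [3000, 3100] -> the sample is bimodal
std_dev = statistics.pstdev(revenue)         # ~124.54 (population standard deviation)
q1, q2, q3 = statistics.quantiles(revenue, n=4)  # 2900.0, 3000.0, 3100.0

print(f"Mean: {mean:.2f}, Median: {median}, Modes: {modes}")
print(f"Population std dev: {std_dev:.2f}")
print(f"Q1: {q1}, Q2: {q2}, Q3: {q3}")
```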

2. Probability

Probability is a fundamental concept in statistics and data science, used to quantify the uncertainty or likelihood of an event occurring. It ranges from 0 (impossible event) to 1 (certain event). Probability is crucial in predictive modeling, decision-making, and risk assessment.

  • Sample Space (S): The set of all possible outcomes of an experiment or event.
  • Event: A subset of the sample space, representing an outcome or a set of outcomes.

Probability of an Event (P(E)): The likelihood that a specific event will occur. It is calculated as:

P(E) = (Number of favorable outcomes) / (Total number of possible outcomes)

Conditional Probability: The probability of an event occurring given that another event has already occurred:

P(A|B) = P(A∩B) / P(B)

Where:

  • P(A|B) is the probability of A occurring given B has occurred.
  • P(A∩B) is the probability that both events A and B occur together.
  • P(B) is the probability of event B occurring.

Bayes’ Theorem: A way to update the probability of an event based on new evidence or information:

P(A|B) = P(B|A) × P(A) / P(B)

Where:

  • P(A|B) is the updated probability of event A given event B,
  • P(B|A) is the probability of event B given event A,
  • P(A) is the initial probability of A,
  • P(B) is the probability of B.

For a detailed explanation, refer to my article on Conditional Probability and Bayes’ Theorem.
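
To make the update concrete, here is a minimal sketch with made-up (hypothetical) numbers: it revises the probability that a customer churns (event A) after we observe that they filed a complaint (event B).

```python
# Bayes' theorem with hypothetical numbers: P(A) is the prior probability of
# churn, P(B|A) and P(B|not A) describe how likely a complaint is in each case.
p_a = 0.10               # P(A): prior probability of churn
p_b_given_a = 0.60       # P(B|A): probability of a complaint given churn
p_b_given_not_a = 0.20   # P(B|not A): probability of a complaint given no churn

# Total probability of observing a complaint, P(B)
p_b = p_b_given_a * p_a + p_b_given_not_a * (1 - p_a)   # 0.24

# Bayes' theorem: P(A|B) = P(B|A) * P(A) / P(B)
p_a_given_b = p_b_given_a * p_a / p_b
print(f"P(churn | complaint) = {p_a_given_b:.2f}")       # 0.25
```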

3. Distributions

A probability distribution describes how the values of a random variable are spread or distributed. It provides a model for how data behaves and helps in making predictions or drawing inferences. There are several types of probability distributions, each applicable to different types of data and events.

Normal Distribution: Also known as the Gaussian distribution, it is symmetric around the mean, meaning that most data points cluster around the center, with fewer data points appearing as you move further from the mean. This is the distribution you will encounter most often!

  • It is defined by two parameters: mean (μ) and standard deviation (σ).
  • The area under the curve is equal to 1, meaning it accounts for all probabilities.
  • For all normal distributions, about 68.2% of observations fall within ±σ of the mean, 95.4% within ±2σ, and 99.7% within ±3σ; this is known as the empirical rule.

Binomial Distribution: Models the number of successes in a fixed number of independent trials, each with the same probability of success.

  • Two possible outcomes: success or failure (e.g., call answered or not)
  • Defined by two parameters: the number of trials (n) and the probability of success (p).

Poisson Distribution: Models the number of occurrences of a rare event in a fixed interval of time or space.

  • Typically used for events that happen rarely and independently.
  • Defined by the rate (λ), which is the average number of events in a fixed interval.

Uniform Distribution: A distribution where all outcomes are equally likely within a given range. It is used when there is no preference for any particular outcome.

  • All values in the range have the same probability.
  • Defined by two parameters: the minimum (a) and maximum (b) values of the distribution.

Example: Let’s say you are working with contact center staffing data and need to forecast staffing requirements based on different types of distributions. Here are examples of how you can apply each distribution to forecasting staffing needs:

Normal Distribution: You are analyzing the call handle time (average duration of a call) for the past month to predict future staffing needs in a contact center. You collect data for the average duration of customer support calls, and it is normally distributed.

  • Mean (μ) = 8 minutes
  • Standard Deviation (σ) = 1.5 minutes

Since the call durations are likely to be symmetrically distributed around the mean, you can use the Normal Distribution to predict how many calls will take longer than 10 minutes or shorter than 5 minutes, which helps in estimating the number of agents needed at different times. For instance, you can use the normal distribution to calculate the probability of a call taking longer than 10 minutes, helping to plan for peak times.
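
As a rough sketch (assuming SciPy is available), the two tail probabilities mentioned above can be computed directly from the Normal(μ = 8, σ = 1.5) model:

```python
# Probability that a call runs longer than 10 minutes (or shorter than
# 5 minutes) under a Normal(mu=8, sigma=1.5) handle-time model.
from scipy.stats import norm

mu, sigma = 8, 1.5
p_over_10 = norm.sf(10, loc=mu, scale=sigma)   # survival function = 1 - CDF, ~0.091
p_under_5 = norm.cdf(5, loc=mu, scale=sigma)   # ~0.023

print(f"P(duration > 10 min) = {p_over_10:.3f}")
print(f"P(duration < 5 min)  = {p_under_5:.3f}")
```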

Binomial Distribution: You want to forecast the number of calls answered successfully (as opposed to missed or abandoned calls) during a given period in the contact center. Assume there are 100 calls, and the probability of answering each call successfully is 80%.

  • Number of Trials (n) = 100 calls
  • Probability of Success (p) = 0.8 (80% chance of a call being answered)

The Binomial Distribution is suitable here because you’re dealing with fixed trials (calls), and each trial has two possible outcomes (answered or not). By using the binomial distribution, you can estimate the probability of answering a specific number of calls, which helps in staffing predictions. For example, you can calculate the probability of answering 85 calls out of 100 and adjust staffing based on the expected call-answer rate.
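
A minimal sketch of that calculation (again assuming SciPy) with n = 100 calls and p = 0.8:

```python
# Probability of answering exactly 85 (or at least 85) of 100 calls when each
# call is answered with probability 0.8.
from scipy.stats import binom

n, p = 100, 0.8
p_exactly_85 = binom.pmf(85, n, p)   # P(X = 85), roughly 0.05
p_at_least_85 = binom.sf(84, n, p)   # P(X >= 85), roughly 0.13

print(f"P(X = 85)  = {p_exactly_85:.3f}")
print(f"P(X >= 85) = {p_at_least_85:.3f}")
```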

Poisson Distribution: Next, you want to forecast the number of incoming calls in a contact center over the next hour. The center receives an average of 20 calls per hour.

  • Rate (λ) = 20 calls per hour

The Poisson Distribution is appropriate here because you’re modeling the count of events (call arrivals) occurring in a fixed interval (1 hour). Using the Poisson distribution, you can calculate the probability of receiving exactly 15 calls, 20 calls, or 25 calls during the next hour. This helps in determining how many agents to schedule based on the likely call volume. For instance, if the probability of receiving more than 25 calls is high, you may want to schedule more agents during that time.
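
Here is a minimal sketch of those probabilities (assuming SciPy) with λ = 20 calls per hour:

```python
# Probability of exactly 15 calls, and of more than 25 calls, in the next hour
# when calls arrive at an average rate of 20 per hour.
from scipy.stats import poisson

lam = 20
p_exactly_15 = poisson.pmf(15, lam)    # P(X = 15), ~0.05
p_more_than_25 = poisson.sf(25, lam)   # P(X > 25) = 1 - P(X <= 25), ~0.11

print(f"P(X = 15) = {p_exactly_15:.3f}")
print(f"P(X > 25) = {p_more_than_25:.3f}")
```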

Uniform Distribution: In your contact center, calls are received at unpredictable but equally likely intervals. The time between successive calls ranges from 5 to 10 minutes.

  • Minimum time (a) = 5 minutes
  • Maximum time (b) = 10 minutes

Here, the Uniform Distribution is used because the time between calls is equally likely to be anywhere between 5 and 10 minutes. By modeling this distribution, you can estimate the average time between calls and plan staffing more effectively. For example, knowing that the gap between calls is equally likely to be any value between 5 and 10 minutes helps in scheduling agents at consistent intervals and optimizing agent availability.
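
A minimal sketch of this model (assuming SciPy; note that SciPy parameterizes the uniform distribution by loc = a and scale = b − a):

```python
# Time between calls modeled as Uniform(5, 10) minutes.
from scipy.stats import uniform

a, b = 5, 10
gap = uniform(loc=a, scale=b - a)   # Uniform on [5, 10]

print(f"Average gap: {gap.mean()} min")        # (a + b) / 2 = 7.5
print(f"P(gap < 6 min) = {gap.cdf(6):.1f}")    # (6 - 5) / (10 - 5) = 0.2
```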

4. Central Limit Theorem (CLT)

The Central Limit Theorem (CLT) is a fundamental concept in probability and statistics. It states that the distribution of the sample mean will approach a normal distribution as the sample size increases, regardless of the shape of the population distribution. This property holds true for sufficiently large sample sizes, typically n≥30.

The standard deviation of the sample mean (the standard error) is σ/√n, so the larger the sample size n, the smaller the spread of the sample mean around the population mean.
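
A quick simulation sketch (using NumPy) illustrates both points: sample means drawn from a skewed (exponential) population become tightly concentrated as n grows, with spread close to σ/√n.

```python
# Simulate sample means from a skewed Exponential(1) population (sigma = 1)
# and compare their spread to the CLT prediction sigma / sqrt(n).
import numpy as np

rng = np.random.default_rng(42)
sigma = 1.0  # standard deviation of the Exponential(1) population

for n in (5, 30, 200):
    sample_means = rng.exponential(scale=1.0, size=(10_000, n)).mean(axis=1)
    print(f"n={n:>3}: std of sample means = {sample_means.std():.3f}, "
          f"sigma/sqrt(n) = {sigma / np.sqrt(n):.3f}")
```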

For a detailed explanation, refer to my article on CLT and LLN.

5. Law of Large Numbers

The Law of Large Numbers (LLN) is a fundamental principle in probability and statistics that states that as the size of a sample increases, the sample mean (or other sample statistics) will get closer to the true population mean. In other words, the larger the sample size, the more accurate the estimate of the population parameter becomes.

This principle is crucial because it gives us confidence that with enough data, the sample mean (or any statistic derived from the sample) will closely approximate the population mean.
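
A tiny simulation (using NumPy) makes this visible: the average of fair die rolls drifts toward the true mean of 3.5 as the number of rolls grows.

```python
# Law of Large Numbers: the mean of n die rolls approaches 3.5 as n grows.
import numpy as np

rng = np.random.default_rng(0)
rolls = rng.integers(1, 7, size=100_000)   # fair six-sided die (values 1-6)

for n in (10, 1_000, 100_000):
    print(f"mean of first {n:>7,} rolls: {rolls[:n].mean():.3f}")
```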

For a detailed explanation, refer to my article on CLT and LLN.

6. Hypothesis Testing

Hypothesis testing is a statistical method used to evaluate whether there is enough evidence in a sample of data to support or reject a particular assumption (hypothesis) about a population. It involves setting up two competing hypotheses: the null hypothesis (H₀) and the alternative hypothesis (H₁).

  • Null Hypothesis (H₀): This hypothesis assumes that there is no effect, difference, or relationship in the population. It represents the default or status quo assumption.
  • Alternative Hypothesis (H₁): This hypothesis posits that there is an effect, difference, or relationship in the population. It is what you want to prove or support through your data analysis.

The goal of hypothesis testing is to determine whether the observed data provides sufficient evidence to reject the null hypothesis in favor of the alternative hypothesis.

Common Types of Hypothesis Tests:

  • z-test: Used when the population variance is known or the sample size is large (typically n > 30). It tests the difference between the sample mean and the population mean.
  • t-test: Used when the population variance is unknown and the sample size is small (typically n ≤ 30). It compares means (e.g., a one-sample t-test against a known value, an independent t-test for two different groups, or a paired t-test for related groups); a minimal sketch follows this list.
  • Chi-Square Test: Used for categorical data to test the association between two variables or the goodness of fit of a model.
  • ANOVA (Analysis of Variance): Used to compare the means of three or more groups to determine if at least one of the group means is different from the others.
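
As a minimal sketch (assuming SciPy), here is a one-sample t-test that asks whether the store’s average daily revenue from the earlier 7-day sample differs from a claimed $3000:

```python
# One-sample t-test: H0 says the true mean daily revenue is $3000.
from scipy.stats import ttest_1samp

revenue = [3000, 3200, 3100, 2800, 2900, 3100, 3000]
t_stat, p_value = ttest_1samp(revenue, popmean=3000)

print(f"t = {t_stat:.2f}, p-value = {p_value:.3f}")
# A p-value below the chosen significance level (e.g., 0.05) would lead us to
# reject H0; here the p-value is large, so we fail to reject H0.
```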

For a detailed explanation, refer to my article on Hypothesis Testing.

7. Correlation vs. Causation

Correlation:

Correlation refers to a statistical relationship or association between two variables, meaning that as one variable changes, the other tends to change in a specific way. However, correlation does not imply that one variable causes the other to change. It simply means they move together in some manner, whether it be positively, negatively, or in some other pattern.

  • Positive Correlation: When one variable increases, the other variable also increases (e.g., height and weight).
  • Negative Correlation: When one variable increases, the other decreases (e.g., amount of exercise and weight gain).
  • No Correlation: No predictable relationship between the two variables.

Causation:

Causation means that one variable directly affects another. If there is a causal relationship, a change in one variable will directly cause a change in the other. This relationship is often more difficult to establish because it requires more rigorous evidence, such as controlled experiments or causal inference methods.

  • Example of Causation: Smoking causes lung cancer, or a company’s marketing efforts directly increase its sales.

Key Differences:

  • Correlation does not prove causation. Just because two variables are correlated (move together) does not mean that one causes the other to change.
  • Causation requires a cause-and-effect relationship, while correlation only measures how variables are related.

Application in Data Science:

Understanding the difference between correlation and causation is crucial in data science for making accurate predictions, drawing valid conclusions, and avoiding misleading interpretations. Misinterpreting correlation as causation can lead to faulty insights, incorrect business decisions, or misguided policy recommendations.

Example: Suppose you are analyzing the relationship between the number of staff in a contact center and the number of incoming calls. You find a strong positive correlation between the two variables: as the number of staff increases, the number of calls also tends to increase.

  • Correlation: This observation tells you that staffing levels and call volumes tend to move together, but it doesn’t mean that having more staff directly causes more calls.
  • Causation: The actual cause of the increase in calls might be factors like marketing campaigns, seasonal trends, or customer service issues. Staffing decisions alone may not be the direct cause of higher call volumes.

Understanding that the increase in staff does not directly cause more calls helps you avoid the assumption that increasing the number of staff will automatically lead to more calls. It’s important to explore other potential drivers (e.g., advertising, promotions, or external factors like holidays) before making staffing decisions based solely on correlation.
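
A minimal sketch with made-up (hypothetical) weekly numbers shows how easily a strong correlation appears, while saying nothing about what causes what:

```python
# Pearson correlation between weekly staff counts and weekly call volumes
# (hypothetical data). A value close to 1 does NOT imply causation.
import numpy as np

staff = np.array([10, 12, 15, 18, 20, 22, 25])
calls = np.array([410, 480, 620, 690, 800, 850, 990])

r = np.corrcoef(staff, calls)[0, 1]
print(f"Pearson correlation: {r:.2f}")   # close to 1
```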

This article covered a vast array of concepts in Probability and Statistics that are foundational for data science. While the depth and breadth of these topics might feel overwhelming, it’s important to remember that you don’t need to master every concept in exhaustive detail to excel as a data scientist.

Certain topics, like descriptive statistics, normal distribution, and correlation, are frequently used in day-to-day tasks, such as data exploration and model evaluation. Understanding these core concepts thoroughly will significantly enhance your ability to work effectively with data.

On the other hand, advanced topics like hypothesis testing and Bayesian inference may only be required for specific projects or roles. The key is to focus on building a strong foundational understanding and to revisit these concepts when the need arises in your work.

If you found this article helpful, I’d love to hear your thoughts! Feel free to leave a comment below with your feedback or any questions you have. Don’t forget to share this article with others who might find it useful. Your support means a lot — thank you for reading!
