Shapiro–Wilk test: Test of normality

The Shapiro-Wilk test is a widely used statistical test for assessing the normality of a dataset. It helps determine whether a sample comes from a normally distributed population. This test is particularly useful in fields such as data science, finance, and biology, where the assumption of normality is often critical for further statistical analysis.

Understanding the Shapiro-Wilk Test

The Shapiro-Wilk test is based on the null hypothesis that the sample data is drawn from a normal distribution. If the p-value obtained from the test is less than a predetermined significance level (commonly 0.05), we reject the null hypothesis and conclude that the data is unlikely to have come from a normal distribution. Conversely, if the p-value is greater than or equal to the significance level, we do not have enough evidence to reject the null hypothesis. The data is then consistent with normality, although the test cannot prove that it is normal.

Key Points

Null Hypothesis (H0): The sample is from a normal distribution.
Alternative Hypothesis (H1): The sample is not from a normal distribution.
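P-value: The probability, assuming the null hypothesis is true, of obtaining a test statistic at least as extreme as the one observed. Small p-values count as evidence against normality.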
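
The decision rule described above can be sketched in a few lines of plain Python. The function name shapiro_decision and its return strings are illustrative choices made for this sketch and are not part of any library. The two p-values passed in are example values only.

```python
def shapiro_decision(p_value: float, alpha: float = 0.05) -> str:
    """Translate a p-value into the test's decision at significance level alpha."""
    if not 0.0 <= p_value <= 1.0:
        raise ValueError("p_value must lie between 0 and 1")
    if p_value < alpha:
        # Small p-value: the sample would be unusual if H0 were true.
        return "reject H0: evidence against normality"
    # Otherwise H0 stands, which is not the same as proving normality.
    return "fail to reject H0: data is consistent with normality"


print(shapiro_decision(0.8689))   # illustrative high p-value
print(shapiro_decision(0.00299))  # illustrative low p-value
```

Keeping alpha as a parameter makes it explicit that the 0.05 threshold is a convention rather than something built into the test.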

Performing the Shapiro-Wilk Test in Python

In Python, the Shapiro-Wilk test can be performed with the scipy.stats module. The main function is shapiro(), which returns both the test statistic (W) and the p-value.

Example Code

Here’s a step-by-step guide to performing the Shapiro-Wilk test in Python:

Import Required Libraries:
You need to import numpy for generating sample data and shapiro from scipy.stats for performing the test.

   import numpy as np
   from scipy.stats import shapiro

Generate Sample Data:
You can create sample data from a normal distribution and a non-normal distribution for testing purposes.

   # Normal distribution
   normal_data = np.random.normal(loc=0, scale=1, size=100)

   # Non-normal distribution (e.g., Poisson)
   non_normal_data = np.random.poisson(lam=5, size=100)

Perform the Shapiro-Wilk Test:
Call the shapiro() function with your sample data.

   # Test on normally distributed data
   stat, p_value = shapiro(normal_data)
   print(f'Statistic: {stat}, P-value: {p_value}')

   # Test on non-normally distributed data
   stat, p_value = shapiro(non_normal_data)
   print(f'Statistic: {stat}, P-value: {p_value}')

Output Interpretation

For the normal data, you might see a high p-value (e.g., 0.8689). In that case we fail to reject the null hypothesis: the data is consistent with a normal distribution, though this does not prove that it is normal.
For the non-normal data, a low p-value (e.g., 0.00299) falls below 0.05, so we reject the null hypothesis, suggesting that the data does not follow a normal distribution.
Because the samples are random and no seed is set, the exact values will differ from run to run.
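
Since a single run can go either way, it may help to see how often each kind of sample is rejected over many repetitions. The following is a rough simulation sketch, not part of the original example: the seed, the number of repetitions, and the helper name rejection_rate are arbitrary assumptions made here. If the test behaves as intended, normal samples should be rejected at a rate near the significance level, while the Poisson samples should be rejected far more often.

```python
import numpy as np
from scipy.stats import shapiro

ALPHA = 0.05       # significance level used in the article
N_REPEATS = 200    # arbitrary number of repetitions for this sketch
SAMPLE_SIZE = 100  # same sample size as the example above

rng = np.random.default_rng(0)  # fixed seed, chosen arbitrarily


def rejection_rate(draw_sample) -> float:
    """Fraction of simulated samples for which shapiro() gives p < ALPHA."""
    rejections = 0
    for _ in range(N_REPEATS):
        _, p_value = shapiro(draw_sample())
        if p_value < ALPHA:
            rejections += 1
    return rejections / N_REPEATS


normal_rate = rejection_rate(lambda: rng.normal(loc=0, scale=1, size=SAMPLE_SIZE))
poisson_rate = rejection_rate(lambda: rng.poisson(lam=5, size=SAMPLE_SIZE))

print(f"Normal samples rejected:  {normal_rate:.1%}")
print(f"Poisson samples rejected: {poisson_rate:.1%}")
```

The nonzero rejection rate for truly normal samples is expected: at a 0.05 significance level, roughly one normal sample in twenty is rejected by chance.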

Important Considerations

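The Shapiro-Wilk test is best suited to small and moderate sample sizes (typically fewer than 5,000 observations). For larger samples, the SciPy documentation notes that the W statistic remains accurate but the p-value may not be. With very large samples the test also tends to flag small, practically unimportant departures from normality, so alternative methods like the Kolmogorov-Smirnov test may be more appropriate.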
Always visualize your data (e.g., using histograms or Q-Q plots) in conjunction with statistical tests to get a better understanding of its distribution.
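
As a companion to the visual check suggested above, the numbers behind a Q-Q plot can be computed with scipy.stats.probplot. This is a minimal sketch under stated assumptions: the exponential sample, the seed, and the helper name qq_correlation are illustrative choices, not part of the original example. The returned correlation coefficient r measures how closely the ordered sample follows the straight line expected under normality.

```python
import numpy as np
from scipy.stats import probplot

rng = np.random.default_rng(1)  # fixed seed, chosen arbitrarily
normal_sample = rng.normal(loc=0, scale=1, size=200)
skewed_sample = rng.exponential(scale=1.0, size=200)  # clearly non-normal


def qq_correlation(sample) -> float:
    """Correlation between ordered sample values and normal quantiles."""
    # probplot returns the Q-Q points and a least-squares line fit to them.
    (theoretical_q, ordered_values), (slope, intercept, r) = probplot(sample, dist="norm")
    return float(r)


print(f"Normal sample Q-Q correlation: {qq_correlation(normal_sample):.4f}")
print(f"Skewed sample Q-Q correlation: {qq_correlation(skewed_sample):.4f}")
```

A value of r close to 1 means the Q-Q points lie near a straight line. To draw the plot itself, probplot also accepts a plot argument such as matplotlib.pyplot, assuming matplotlib is installed.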

Conclusion

The Shapiro-Wilk test is a powerful tool for assessing normality in datasets. With Python's scipy.stats module, you can run the test in a few lines and use its result to inform your statistical analyses. Keep in mind what the result does and does not show: a low p-value is evidence against normality, while a high p-value only means that no departure from normality was detected.
