Chi-square statistics in research for data analysis
In this blog, I will explain what a chi-square test is and how it can be used for data analysis.
Chi-Square Test
Things don’t always turn out the way you expect in statistics. There might be a hidden bias in the choices people make, or the data may not be evenly distributed. The chi-square test is designed to compare the expected with the observed. It works with frequencies (counts) of data rather than means, unlike many other statistical tests.
The chi-square test is often used to decide whether to retain or reject the null hypothesis of a study. If you have two categorical variables in your data and you want to test the relationship between them, the chi-square test serves the purpose. For any data analysis, the important thing is to formulate the analysis plan (test statistic, significance level) in advance. It should describe how the data will be used to reject, or fail to reject, the null hypothesis. Suppose you wish to run a chi-square test of independence between two categorical variables. The main ingredients of the analysis are:
- Degrees of freedom
- Expected frequencies
- Test statistic
- P-value
Let me explain what exactly the chi-square test does and how it can be used for data analysis by means of an example.
I love to watch horror movies. Out of curiosity, I once asked some of my classmates whether they like to watch horror movies too, and gathered the answers so that I could look for patterns. Here is the data I got:
Gender | Like to watch horror movies: Yes | Like to watch horror movies: No |
Women | 32 | 38 |
Men | 30 | 12 |
Total | 62 (55.4%) | 50 (44.6%) |
At first glance, with 32 women and 30 men answering yes, it might seem that men and women like horror movies in roughly equal proportion. A closer look shows otherwise: 32 of 70 women (45.7%) said yes, compared with 30 of 42 men (71.4%). The question is whether this gap reflects a real association with gender or is just chance, which led me to test the data for statistical significance.
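The within-group proportions can be checked with a few lines of Python. This is only an illustrative sketch: the dictionary layout and the helper name `share_yes` are my own choices, not part of the original survey.

```python
# Observed counts from the table above: (likes horror movies, does not)
observed = {"Women": (32, 38), "Men": (30, 12)}

def share_yes(counts):
    """Return the fraction of a group that answered 'Yes'."""
    yes, no = counts
    return yes / (yes + no)

for group, counts in observed.items():
    print(f"{group}: {share_yes(counts):.1%} like horror movies")
# prints "Women: 45.7% like horror movies" and "Men: 71.4% like horror movies"
```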
In this case, there are no mean values to work with!
Since the data is purely categorical, I need a test that works with counts rather than means. The chi-square test is the most widely used test for this situation, in the same way that the t-test and F-test are used for means. The hypotheses are:
Null hypothesis: There is no association between movie preference and gender.
Alternative hypothesis: There is an association between movie preference and gender.
And from Agresti (2002), the chi-square test statistic can be represented as
$$\chi^2 = \sum_{i}\sum_{j} \frac{(O_{ij} - E_{ij})^2}{E_{ij}}$$
where O_ij is the observed frequency and E_ij is the expected frequency in row i and column j of the table.
Now comes a question! What is the expected frequency here, and how do you calculate it?
The expected frequencies are the counts we would see if the null hypothesis were true, that is, if liking horror movies were independent of gender. Let us calculate them.
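Under independence, the expected count for a cell is (row total × column total) / grand total. For women who like horror movies (Women-yes), there are 70 women and 62 "yes" answers among 112 respondents, so the expected count is (70 × 62) / 112 = 38.75, which is the same as 112 × (70/112) × (62/112). Likewise, the other expected frequencies are Women-no = 31.25, Men-yes = 23.25, and Men-no = 18.75.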
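The expected frequencies can be reproduced with a short sketch. The function name `expected_counts` and the list-of-lists layout are assumptions made for illustration. Rows are Women and Men, columns are Yes and No.

```python
# Observed counts: rows = Women, Men; columns = Yes, No
observed = [[32, 38], [30, 12]]

def expected_counts(table):
    """Expected counts under independence: row total * column total / grand total."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    return [[r * c / n for c in col_totals] for r in row_totals]

expected = expected_counts(observed)
for label, row in zip(["Women", "Men"], expected):
    print(label, row)
```

Each row of `expected` sums to the same row total as the observed data (70 and 42), which is a quick sanity check on the calculation.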
Plugging the observed and expected frequencies into the formula gives a chi-square value of 7.02. Next, we have to decide whether it is statistically significant. For this, we compare the value with the critical value of the chi-square distribution for the corresponding degrees of freedom. Degrees of freedom are calculated as (number of rows - 1) × (number of columns - 1), which is (2 - 1) × (2 - 1) = 1 here. If the calculated value exceeds the critical value, we conclude that the two variables are not independent. For the horror movie example, the calculated value of 7.02 is higher than the critical value of 3.841 for 1 degree of freedom at the 5% level of significance, leading us to reject the null hypothesis, i.e., liking horror movies is not independent of gender.
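The whole calculation can be sketched in plain Python as follows. The function name `chi_square_independence` is hypothetical, and no continuity correction is applied, which matches the hand calculation. Some software applies Yates' correction to 2×2 tables by default and will then report a smaller statistic. The `erfc` shortcut for the p-value holds only for 1 degree of freedom.

```python
import math

# Observed counts: rows = Women, Men; columns = Yes, No
observed = [[32, 38], [30, 12]]

def chi_square_independence(table):
    """Return (statistic, degrees of freedom) for a table of counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, o in enumerate(row):
            e = row_totals[i] * col_totals[j] / n  # expected count under independence
            stat += (o - e) ** 2 / e
    df = (len(table) - 1) * (len(table[0]) - 1)
    return stat, df

stat, df = chi_square_independence(observed)

CRITICAL_5PCT_DF1 = 3.841  # chi-square critical value for 1 df at the 5% level
# For 1 degree of freedom only, the upper-tail probability is erfc(sqrt(x / 2)).
p_value = math.erfc(math.sqrt(stat / 2))

print(f"chi-square = {stat:.2f}, df = {df}, p-value = {p_value:.4f}")
print("reject the null" if stat > CRITICAL_5PCT_DF1 else "retain the null")
```

Comparing the statistic with the critical value and comparing the p-value with 0.05 are two views of the same decision, so both lead to rejecting the null here.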
The main use of the chi-square statistic is to test whether the difference between observed and expected frequencies is statistically significant, and it applies to categorical (count) data. The chi-square goodness-of-fit test is similar in purpose to the non-parametric Kolmogorov-Smirnov test, which also compares observed data with an expected distribution. The chi-square test also has certain limitations. If any expected frequency is less than 5, the test may give invalid results. In addition, with a small sample size, the chi-square approximation may not be reliable.
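The rule of thumb about expected frequencies can be checked before running the test. This is a minimal sketch: the helper `smallest_expected` and the constant `MIN_EXPECTED` are illustrative names, and the threshold of 5 is simply the one quoted above.

```python
# Observed counts: rows = Women, Men; columns = Yes, No
observed = [[32, 38], [30, 12]]

MIN_EXPECTED = 5  # rule-of-thumb threshold for expected frequencies

def smallest_expected(table):
    """Return the smallest expected count under independence."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    return min(r * c / n for r in row_totals for c in col_totals)

lowest = smallest_expected(observed)
if lowest < MIN_EXPECTED:
    print(f"Smallest expected count is {lowest:.2f}: results may be unreliable")
else:
    print(f"Smallest expected count is {lowest:.2f}: rule of thumb satisfied")
```

For the horror movie data the smallest expected count is 18.75 (Men-no), so the rule of thumb is met.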
Uses of chi-square test for data analysis:
Let’s look at situations where the Chi-Square Test is useful for Data Analysis.
- A marketing company wants to identify the relationship between customers’ geographical location and their brand preferences. In such a case, the chi-square test plays an important role, and based on the result the company can tailor its marketing strategy to different locations.
- The chi-square test helps in data analysis to test homogeneity or independence between categorical variables, or to test the goodness-of-fit of a model.
- It is flexible enough to handle two or more groups. It is used in various fields such as research, marketing, finance, economics, psychology, and medicine.
- It is a non-parametric (distribution-free) test for categorical data, so it does not assume that the data follow a particular distribution such as the normal distribution.
- It doesn’t require a mean or variance, unlike other tests such as the t-test, F-test, and ANOVA.
- It is easy to compute, gives detailed information, and is easily carried out in software such as R, SAS, and SPSS.
One of the main applications of the chi-square statistic is in medicine. If a researcher wants to compare the outcomes of a drug group with those of a control group, the chi-square test meets the need. Likewise, many fields still rely on the omnibus chi-square statistic to identify the relationship between two categorical variables.
References
- Agresti, A. (2002). Categorical Data Analysis (Second Edition). Wiley, New York.
- Kateri, M. (2014). Contingency Table Analysis. Springer.
- Voinov, V., Nikulin, M., & Balakrishnan, N. (2013). Chi-squared Goodness-of-fit Tests with Applications. Elsevier.