Effects of Sample Size, Test Length, and Group Balance on the Performance of the Generalized Mantel–Haenszel (GMH) Method in Detecting Differential Item Functioning for Graded Response Items: A Simulation Study
DOI: https://doi.org/10.53285/artsep.v7i4.2964
Keywords: Generalized Mantel–Haenszel (GMH), Differential Item Functioning (DIF), Graded Response Model (GRM), polytomously scored scales
Abstract
This study examines the effectiveness of the Generalized Mantel–Haenszel (GMH) method for detecting differential item functioning (DIF) in graded response items. A Monte Carlo simulation under the graded response model (GRM) was used, systematically varying sample size, test length, group-size balance, the proportion of DIF items, and the type and magnitude of DIF. Statistical power increased with DIF magnitude, and GMH was more sensitive to uniform than to nonuniform DIF, while the Type I error rate remained close to the nominal level. Power declined when group sizes were imbalanced, whereas longer tests and larger samples improved performance. In practice, GMH performed well when tests were longer, groups were balanced, and the overall sample size was adequate. When nonuniform DIF is suspected, increasing the sample size and complementing GMH with additional methods, such as the IRT likelihood-ratio (IRT–LR) test or MIMIC models, can strengthen measurement fairness and accuracy.
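The simulation design described in the abstract can be illustrated with a minimal Python sketch of a single replication. It is not the authors' code: the function names (simulate_grm, gmh_statistic, run_demo), the item parameters, the sample sizes, and the DIF shift of 0.6 are all hypothetical choices made for illustration. The sketch generates five-category responses under the GRM, adds uniform DIF to one item by shifting its thresholds for the focal group, and computes the GMH chi-square for 2 x K tables stratified on an anchor score that is assumed to be DIF-free.

```python
import numpy as np

def simulate_grm(theta, a, b, rng):
    """Draw graded responses 0..K-1 for one item under the GRM.
    a: discrimination; b: K-1 increasing thresholds."""
    b = np.asarray(b, dtype=float)
    # Cumulative probabilities P(X >= k | theta) for k = 1..K-1
    p_star = 1.0 / (1.0 + np.exp(-a * (theta[:, None] - b[None, :])))
    u = rng.random(theta.shape[0])
    # The response is the number of thresholds passed by one uniform draw
    return (u[:, None] < p_star).sum(axis=1)

def gmh_statistic(item, group, matching, n_cat):
    """GMH chi-square for 2 x n_cat tables stratified by `matching`.
    group: 0 = reference, 1 = focal. Returns (statistic, df)."""
    diff = np.zeros(n_cat - 1)
    cov = np.zeros((n_cat - 1, n_cat - 1))
    for s in np.unique(matching):
        in_s = matching == s
        ref = in_s & (group == 0)
        n_tot, n_ref = int(in_s.sum()), int(ref.sum())
        n_foc = n_tot - n_ref
        if n_ref == 0 or n_foc == 0:
            continue  # a stratum with a single group carries no information
        # Column totals and reference-group counts, last category dropped
        col = np.bincount(item[in_s], minlength=n_cat)[:-1].astype(float)
        obs = np.bincount(item[ref], minlength=n_cat)[:-1]
        diff += obs - n_ref * col / n_tot
        # Multivariate hypergeometric covariance of the reference counts
        cov += (n_ref * n_foc * (n_tot * np.diag(col) - np.outer(col, col))
                / (n_tot ** 2 * (n_tot - 1)))
    # pinv guards against a singular covariance (e.g. an unused category)
    stat = float(diff @ np.linalg.pinv(cov) @ diff)
    return stat, n_cat - 1

def run_demo(seed=0, n_per_group=1000, n_items=10, shift=0.6):
    """One replication with hypothetical settings (not the study's design)."""
    rng = np.random.default_rng(seed)
    group = np.repeat([0, 1], n_per_group)
    theta = rng.standard_normal(2 * n_per_group)
    b = np.array([-1.5, -0.5, 0.5, 1.5])
    data = np.column_stack(
        [simulate_grm(theta, 1.5, b, rng) for _ in range(n_items)]
    )
    # Uniform DIF on item 0: all thresholds shifted for the focal group
    focal = group == 1
    data[focal, 0] = simulate_grm(theta[focal], 1.5, b + shift, rng)
    # Simplification: match on the summed score of items 2..n_items-1 only
    anchor = data[:, 2:].sum(axis=1)
    dif_stat, df = gmh_statistic(data[:, 0], group, anchor, n_cat=5)
    null_stat, _ = gmh_statistic(data[:, 1], group, anchor, n_cat=5)
    return dif_stat, null_stat, df

dif_stat, null_stat, df = run_demo()
print(f"DIF item: GMH = {dif_stat:.2f}; DIF-free item: GMH = {null_stat:.2f}; df = {df}")
```

Each statistic is compared with the chi-square critical value for K - 1 degrees of freedom (9.488 for df = 4 at the 0.05 level). Repeating run_demo over many seeds and counting rejections would estimate power from the DIF item and the Type I error rate from the DIF-free item. Matching on an anchor score that excludes the studied items is a simplification adopted here; the matching variable used in the study is not stated in the abstract.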
License

This work is licensed under a Creative Commons Attribution 4.0 International License.
Copyright and Licensing
Copyright is retained by the authors. Articles are licensed under an open access Creative Commons CC BY 4.0 license, meaning that anyone may download and read the paper for free. In addition, the article may be reused and quoted provided that the original published version is cited. These conditions allow for maximum use and exposure of the work.







