Evaluation of the reliability of an online multiple-choice assessment tool using generalizability theory
Introduction: Due to the COVID-19 pandemic in our country, the Council of Higher Education (CoHE) decided that assessment and evaluation in higher education would be conducted online, valid only for the pandemic period.
Aim: Our study aimed to evaluate the basic analyses of the multiple-choice assessment tool administered online to 1st-year students for Committee 3 at Süleyman Demirel University, using classical test theory and generalizability theory.
Method: Our study used a quantitative research design. The study population consisted of 1st-year students actively studying at the Faculty of Medicine of Süleyman Demirel University (n = 271). The multiple-choice assessment tool for Committee 3, administered online to 1st-year students, was analyzed with SPSS and EduG.
Results: When the exam was scored on a 100-point scale, the mean was 78.5 ± 11.05 (min: 27.4; max: 98.0), the variance was 122.229, the kurtosis was -1.196, and the skewness was 1.683. The mean item difficulty was 0.785, the mean discrimination index was 0.262, and the reliability coefficient (KR-20) was 0.902. For the 95-item exam, the G coefficient was calculated as 0.91 and the Phi coefficient as 0.90.
Discussion: In line with the recommendations of CoHE and the Association for Evaluation and Accreditation of Medical Education Programs, our faculty has ensured the monitoring of assessment and evaluation practices in distance education. Although the evaluation of a single assessment tool is a limitation of our study, it provided valuable information on reviewing the basic analyses of multiple-choice assessment tools. Based on this experience, we believe that such basic analyses of online assessment and evaluation applications can continue to be preferred for the analysis of assessment tools after the pandemic.
Due to the COVID-19 pandemic in our country, the Council of Higher Education (CoHE) decided that assessment and evaluation in higher education would be conducted online, to be valid only for the pandemic period.(1) Main principles such as transparency, fairness, and controllability were identified for assessment and evaluation practices in higher education during the pandemic.(2)
In line with these principles, the Association for Evaluation and Accreditation of Medical Education Programs (TEPDAD) published a document titled “Suggestions of the Association for Evaluation and Accreditation of Medical Education Programs on Assessment and Evaluation in Medical Education during the COVID-19 Pandemic” on May 31, 2020.(3) In addition, the National Standards for Pre-Graduate Medical Education-2020 recommend that medical faculties continuously improve their assessment and evaluation systems, meet the development standards, and evaluate the effectiveness of their practices.(4)
These recommendations have provided in-depth explanations of the assessment and evaluation process.(5) “Assessment/evaluation monitoring”, in which the validity, reliability, and practicality of assessment tools are evaluated, is recommended for this purpose.(4,7) One of the essential parameters of this monitoring is reliability.(8) Reliability tests whether an assessment yields consistent results, and therefore, it is recommended that “the reliability of scores obtained through an assessment tool be measured”.(9)
Generalizability theory focuses on the generalization of assessment results to the universe of admissible observations.(10) It calculates a single reliability value by evaluating multiple error sources simultaneously. G-theory makes the following possible: evaluating multiple sources of variance in a single analysis; identifying the size of each source of variance; calculating two different coefficients, one for relative decisions and one for absolute decisions about individual performance (the G coefficient and the Phi coefficient, respectively); and designing assessments in which measurement error is minimized (decision, or “D”, studies).(11,12) In the present study, G-theory, which allows multiple error sources to be evaluated, was preferred for the reliability analysis and the decision study.
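As a minimal sketch of these computations (not the EduG workflow used in the study; the function name and design are illustrative), a single-facet crossed person × item G-study estimates the three variance components from the two-way ANOVA mean squares and derives the G and Phi coefficients:

```python
import numpy as np

def g_study(scores):
    """Single-facet crossed (person x item) G-study on a persons-by-items
    score matrix: estimate variance components from the two-way ANOVA
    mean squares, then compute the G and Phi coefficients."""
    n_p, n_i = scores.shape
    grand = scores.mean()
    person_means = scores.mean(axis=1)
    item_means = scores.mean(axis=0)
    ms_p = n_i * ((person_means - grand) ** 2).sum() / (n_p - 1)
    ms_i = n_p * ((item_means - grand) ** 2).sum() / (n_i - 1)
    resid = scores - person_means[:, None] - item_means[None, :] + grand
    ms_pi = (resid ** 2).sum() / ((n_p - 1) * (n_i - 1))
    # Estimated variance components (negative estimates truncated at zero)
    var_pi = ms_pi
    var_p = max((ms_p - ms_pi) / n_i, 0.0)
    var_i = max((ms_i - ms_pi) / n_p, 0.0)
    # Relative (G) and absolute (Phi) coefficients for the n_i-item test
    g = var_p / (var_p + var_pi / n_i)
    phi = var_p / (var_p + (var_i + var_pi) / n_i)
    return var_p, var_i, var_pi, g, phi
```

Here `scores` is a persons-by-items matrix of 0/1 item scores; `g` applies to relative (rank-order) decisions and `phi` to absolute decisions.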
The study aims to evaluate the multiple-choice assessment tool administered online to 1st-year students at Süleyman Demirel University for Committee 3, using classical test theory and generalizability theory.
Our study had a quantitative research design. The study population was identified as 1st-year students actively studying at the Faculty of Medicine of Süleyman Demirel University (n = 271). The multiple-choice assessment tool for Phase 1, Committee 3, administered to students online, was analyzed.
The assessment tool was prepared by the faculty members of the relevant departments and consisted of 95 items in accordance with the learning objectives of Phase 1, Committee 3. It was administered through the faculty's learning management system (MOODLE). Exam security was ensured through student ID numbers, student passwords, IP addresses, and analysis of exam speed.
Students were able to reconnect after connection problems. The order of questions and answer options was shuffled for each administration. The exam duration [the number of questions multiplied by 1.50 minutes] and the exam administration window [the number of questions multiplied by 1.25 minutes] were calculated. All 271 students in the phase participated in the online exam; no student was unable to take it. After the administration, the assessment tool was shared with students for further questions and score objections, and feedback on online assessment and evaluation was collected.
MS-Excel, SPSS, and EduG were used for data analysis.(13,14) The generalizability analyses were performed and compared with SPSS and EduG.
In our study, we evaluated the exam for Phase 1, Committee 3, held on April 24, 2020. The exam was accessible for 2 hours and 30 minutes, and 271 students participated. When the exam was scored on a 100-point scale, the mean was 78.5 ± 11.05 (min: 27.4; max: 98.0), the variance was 122.229, the kurtosis was -1.196, and the skewness was 1.683. The general characteristics of this assessment are given in Table 1.
The mean item difficulty was 0.785 and the mean discrimination index was 0.262, while the reliability coefficient (KR-20) was 0.902. The minimum score in the high-scoring group (n = 75) was 82.00, and the maximum score in the low-scoring group (n = 80) was 71.00. The mean scores, difficulty indices, and discrimination indices for all items on the assessment tool are provided in Table 2.
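For illustration, the classical test theory statistics reported above (item difficulty, upper-lower discrimination index, and KR-20) can be computed from a persons-by-items 0/1 score matrix as in the following sketch; the function name and the 27% group split are assumptions, not taken from the study:

```python
import numpy as np

def item_stats(scores):
    """Classical item analysis on a persons-by-items matrix of 0/1 scores.
    Returns per-item difficulty (proportion correct), upper-lower 27%
    discrimination indices, and the KR-20 reliability coefficient."""
    n_p, n_i = scores.shape
    totals = scores.sum(axis=1)
    difficulty = scores.mean(axis=0)
    # Discrimination: proportion correct in the top 27% minus the bottom 27%
    k = max(1, int(round(0.27 * n_p)))
    order = np.argsort(totals)
    discrimination = scores[order[-k:]].mean(axis=0) - scores[order[:k]].mean(axis=0)
    # KR-20: (k/(k-1)) * (1 - sum(p*q) / total-score variance)
    pq = difficulty * (1 - difficulty)
    kr20 = (n_i / (n_i - 1)) * (1 - pq.sum() / totals.var())
    return difficulty, discrimination, kr20
```

Population variances are used consistently for both the item variances and the total-score variance, matching the usual KR-20 formula.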
In the G-theory analysis of scores on the assessment tool, in a single-facet crossed design, the estimated variance component for individuals accounted for 7.2% of the total variance, the component for items for 17.8%, and the component for the individual-item interaction for 75% (Table 3).
The G coefficient was found to be 0.91 and the Phi coefficient 0.90 for the 95-item exam. The G and Phi coefficients calculated in the decision (D) study for varying numbers of items are given in Table 4.
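The D-study projection behind such a table follows directly from the estimated variance components: holding the components fixed, the G and Phi coefficients are recomputed for different item counts. A sketch using the variance proportions reported above (7.2%, 17.8%, 75%); small differences from the EduG output arise from rounding of these proportions:

```python
def d_study(var_p, var_i, var_pi, n_items):
    """Project the G (relative) and Phi (absolute) coefficients for a
    hypothetical test length, holding the variance components fixed."""
    g = var_p / (var_p + var_pi / n_items)
    phi = var_p / (var_p + (var_i + var_pi) / n_items)
    return g, phi

# Variance-component proportions as reported for this exam (Table 3)
var_p, var_i, var_pi = 0.072, 0.178, 0.750
for n in (50, 75, 95, 120):
    g, phi = d_study(var_p, var_i, var_pi, n)
    print(f"n_items={n}: G={g:.3f}, Phi={phi:.3f}")
```

At n_items = 95 this gives G ≈ 0.90 and Phi ≈ 0.88, close to the reported 0.91 and 0.90.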
CoHE's decision that final exams and other exams in the spring semester of the 2019-2020 academic year could not be held face-to-face due to the COVID-19 pandemic, and that alternative methods would be implemented via digital means or through assignments and projects, was communicated to universities on May 11.(1) In this context, the evaluation process in distance learning could use online supervised or unsupervised open-ended/multiple-choice exams, assignments, online quizzes, projects, activities on the learning management system (LMS), LMS analytics, and similar practices.
Based on the principles of “transparency and controllability” for online exams, it was recommended that exam security measures be implemented within the LMS or other digital means. These measures included random selection of questions and full-screen or browser locking.
First of all, the approach, system, and main principles for online assessment and evaluation were identified at our faculty with the help of the literature.(15,16) The faculty's online service infrastructure was evaluated in this context. Before the actual assessments, trial runs consisting of a few questions were conducted and analyzed so that students could adapt to the system. In line with these evaluations, it was decided that assessment and evaluation through the learning management system would meet the requirements of this service.
It had been recommended that universities provide information and training for faculty members and students on the use of online exams. Faculty members and students were regularly provided with information and technical support on our faculty's website. Practice principles, including the course of the exam, exam duration, grading principles, student responsibilities, score objections, and ethical rules for online assessment/evaluation, were prepared and published on the faculty's website.(17) An assessment tool comprising items in accordance with the learning objectives was then formed.
A trial assessment was conducted so that students could adapt to the system. After improvements in line with the feedback received during this preparation phase, the exam was administered. The assessment tool was then shared with students for further questions and score objections, and feedback on online assessment and evaluation was collected.
It is recommended that an assessment tool be valid, reliable, and practical.(7,18,19) Since subject-matter expert faculty members prepared the online Phase 1, Committee 3 exam of April 24, 2020, in accordance with the learning objectives of the faculty's accredited education program, the content validity of the exam was ensured. In terms of practicality, the exam proved practical with respect to item entry, the course of the exam, and student satisfaction as reflected in feedback.
In terms of reliability, the exam's reliability coefficient was found to be 0.90, which is reliable according to both classical test theory and generalizability theory. In the item analyses, the mean item difficulty was 0.785 and the mean discrimination index was 0.262. Based on these analyses, 14 potentially problematic questions were identified: 13 items were considered problematic because their difficulty level was greater than 0.95, and Item 93 both had a difficulty level greater than 0.95 and a negative point-biserial correlation coefficient. Feedback was provided in order to improve the item-level quality of the assessment tools.
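The flagging rule described here (difficulty above 0.95 or a negative point-biserial correlation) can be sketched as follows; the function name is hypothetical, and the corrected item-total correlation is used as the point-biserial, which may differ in detail from the SPSS output:

```python
import numpy as np

def flag_items(scores, max_difficulty=0.95):
    """Flag items that are too easy (p > max_difficulty) or that have a
    negative point-biserial correlation with the rest of the test."""
    n_p, n_i = scores.shape
    p = scores.mean(axis=0)
    flagged = []
    for j in range(n_i):
        rest = scores.sum(axis=1) - scores[:, j]  # corrected total (item removed)
        if scores[:, j].std() == 0:
            r_pb = 0.0  # no variance on the item: correlation undefined
        else:
            r_pb = np.corrcoef(scores[:, j], rest)[0, 1]
        if p[j] > max_difficulty or r_pb < 0:
            flagged.append(j)
    return flagged
```

Items flagged this way are candidates for expert review, not automatic removal.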
Moreover, in the G-theory analysis of scores in a single-facet crossed design, the low percentage of estimated variance attributable to individuals suggests that the exam did not differentiate sufficiently among students. The unbalanced distribution of item difficulty, reflected in the estimated variance component for items, weakens generalizability.
In contrast, the size of the estimated variance component for the individual-item interaction suggests that systematic or non-systematic error sources could not be controlled. Given the content validity considerations in the decision (D) study conducted with varying numbers of items, it is recommended that the number of questions be reconsidered. The unique value of this study is that it presents an analysis not frequently seen in medical education for use in the field.
Although the evaluation of a single assessment tool is a limitation of our study, the study has provided valuable information regarding the review and status evaluation of such analyses, and it is also valuable for the sustainability of future analyses. In line with the recommendations of CoHE and the Association for Evaluation and Accreditation of Medical Education Programs, our faculty has carried out assessment and evaluation practices in distance education and ensured their monitoring.
Along with the pandemic experience, new approaches will be developed for the practices proposed in the area of assessment and evaluation.(20,22) The most important practical implication of this study is that it enables the faculty's assessment and evaluation system to be monitored through sustainable analyses. Its social implication is its contribution to training physicians with the quality promised by the faculty's goals, in relation to increasing quality in higher education. Based on this experience, we are of the opinion that online assessment and evaluation practices could be reliably used for summative assessment during the pandemic and for formative assessment after the pandemic.
We thank the Rector's Office of Süleyman Demirel University for its material and moral support of our faculty's distance education approach throughout the distance education process.
- Yüksek Öğretim Kurulu. YÖK’ten Üniversitelerdeki Uzaktan Eğitime Yönelik Değerlendirme. https://www.yok.gov.tr/Sayfalar/Haberler/2020/uzaktan-egitime-yonelik-degerlendirme.aspx Accessed 08.05.2020.
- Yüksek Öğretim Kurulu. YÖK’ten Sınavlara İlişkin Karar. https://www.yok.gov.tr/Sayfalar/Haberler/2020/yok-ten-sinavlara-iliskin-karar.aspx Accessed 08.05.2020.
- TEPDAD. Mezuniyet Öncesi Tıp Eğitimi Ulusal Standartları-2020. http://tepdad.org.tr/announcement/9 Accessed 08.05.2020.
- TEPDAD. COVID-19 Nedeniyle TEPDAD tarafından yapılan önerilerin tümü. http://tepdad.org.tr/announcement/9 Accessed 04.06.2020.
- Van der Vleuten CPM, Schuwirth LWT. Assessing professional competence: from methods to programmes. Medical Education 2005; 39(3): 309–17. doi: 10.1111/j.1365-2929.2005.02094.x.
- Ercan İ, Kan İ. Ölçeklerde Güvenirlik ve Geçerlik. Uludağ Üniversitesi Tıp Fakültesi Dergisi 2004; 30(3): 211–6.
- Norcini JJ, McKinley DW. Assessment methods in medical education. Teaching and Teacher Education 2007; 23(3): 239–50. doi: 10.1016/j.tate.2006.12.021.
- Cronbach LJ. Coefficient alpha and the internal structure of tests. Psychometrika 1951; 16: 297–334. doi: 10.1007/BF02310555.
- Güler N. Genellenebilirlik kuramı ve SPSS ile GENOVA programlarıyla hesaplanan G ve K çalışmalarına ilişkin sonuçların karşılaştırılması. Eğitim ve Bilim 2009; 34(154): 93–104.
- Brennan RL. Generalizability Theory. New York, US: Springer-Verlag (Statistics for Social Science and Public Policy), 2001. doi: 10.1007/978-1-4757-3456-0.
- Shavelson RJ, Webb NM. Generalizability Theory: A Primer. Vol. 1. Thousand Oaks, CA, US: Sage Publications (Measurement Methods for the Social Sciences Series), 1991.
- Atılgan H. Genellenebilirlik Kuramı ve Uygulaması. 1. Baskı. Ankara, 2019.
- Mushquash C, O’Connor BP. SPSS, SAS, and MATLAB programs for generalizability theory analyses. Behavior Research Methods 2006; 38(3): 542–7.
- EduG. English program, IRDP. Institut de recherche et de documentation pédagogique. https://www.irdp.ch/institut/english-program-1968.html Accessed 05.06.2020.
- Al-Wardy NM. Assessment methods in undergraduate medical education. Sultan Qaboos University Medical Journal 2010; 10(2): 203–9. doi: 10.4103/0331-8540.108463.
- Epstein RM. Assessment in medical education. The New England Journal of Medicine 2007; 356(4): 387–96. doi: 10.1056/NEJMra054784.
- Tıp Fakültesi Dekanlığı. Uzaktan Eğitim İle İlgili Duyurular. Süleyman Demirel Üniversitesi Tıp Fakültesi Dekanlığı. http://tip.sdu.edu.tr/tr/sayfalar/uzaktan-egitim-ile-ilgili-duyurular-11163s.html Accessed 10.06.2020.
- Hays R. Assessment in medical education: Roles for clinical teachers. Clinical Teacher 2008; 5(1): 23–7. doi: 10.1111/j.1743-498X.2007.00165.x.
- Speyer R, et al. Reliability and validity of student peer assessment in medical education: A systematic review. Medical Teacher 2011; 33(11): 572–85. doi: 10.3109/0142159X.2011.610835.
- Challis M. AMEE Medical Education Guide No. 11 (revised): Portfolio-based learning and assessment in medical education. Medical Teacher 1999; 21(4): 370–86. doi: 10.1080/01421599979310.
- Pololi LH, et al. A needs assessment of medical school faculty: Caring for the caretakers. Journal of Continuing Education in the Health Professions 2003; 23(1): 21–9. doi: 10.1002/chp.1340230105.
- Durak Hİ, et al. Use of case-based exams as an instructional teaching tool to teach clinical reasoning. Medical Teacher 2007; 29(6): 170–4. doi: 10.1080/01421590701506866.