VIETNAM NATIONAL UNIVERSITY, HANOI
UNIVERSITY OF LANGUAGES AND INTERNATIONAL STUDIES
******
NGUYỄN THỊ QUỲNH YẾN
DOCTORAL DISSERTATION
AN INVESTIGATION INTO THE CUT-SCORE VALIDITY
OF THE VSTEP.3-5 LISTENING TEST
MAJOR: ENGLISH LANGUAGE TEACHING METHODOLOGY
CODE: 9140231.01
HANOI, 2018
VIETNAM NATIONAL UNIVERSITY, HANOI
UNIVERSITY OF LANGUAGES AND INTERNATIONAL STUDIES
******
NGUYỄN THỊ QUỲNH YẾN
DOCTORAL DISSERTATION
AN INVESTIGATION INTO THE CUT-SCORE VALIDITY
OF THE VSTEP.3-5 LISTENING TEST (Nghiên cứu xác trị các điểm cắt của kết quả bài thi Nghe Đánh giá năng lực tiếng Anh từ bậc 3 đến bậc 5 theo Khung năng lực Ngoại ngữ 6 bậc dành cho Việt Nam)
MAJOR: ENGLISH LANGUAGE TEACHING METHODOLOGY
CODE: 9140231.01
SUPERVISORS: 1. PROF. NGUYỄN HÒA
2. PROF. FRED DAVIDSON
HANOI, 2018
This dissertation was completed at the University of Languages and International Studies, Vietnam National University, Hanoi.
This dissertation was defended on 10th May 2018.
This dissertation can be found at:
- National Library of Vietnam
- Library and Information Center - Vietnam National University, Hanoi
DECLARATION OF AUTHORSHIP
I hereby certify that the thesis I am submitting is entirely my own original work except where otherwise indicated. I am aware of the University's regulations concerning plagiarism, including those regulations concerning disciplinary actions that may result from plagiarism. Any use of the works of any other author, in any form, is properly acknowledged at their point of use.
Date of submission: _
Ph.D Candidate’s Signature: _
I certify that I have read this dissertation and that, in my opinion, it is fully adequate in scope and quality as a dissertation for the degree of Doctor of Philosophy.

_

Prof. Nguyễn Hòa (Co-supervisor)

I certify that I have read this dissertation and that, in my opinion, it is fully adequate in scope and quality as a dissertation for the degree of Doctor of Philosophy.

_

Prof. Fred Davidson (Co-supervisor)
TABLE OF CONTENTS
LIST OF FIGURES……… viii
LIST OF TABLES……… ix
LIST OF KEY TERMS……… xiii
ABSTRACT……… xvii
ACKNOWLEDGMENTS……… xix
CHAPTER I: INTRODUCTION……… 1
1 Statement of the problem……… 1
2 Objectives of the study……… 4
3 Significance of the study……… 4
4 Scope of the study……… 4
5 Statement of research questions……… 5
6 Organization of the study……… 5
CHAPTER II: LITERATURE REVIEW……… 7
1 Validation in language testing……… 7
1.1 The evolution of the concept of validity……… 7
1.2 Aspects of validity……… 9
1.3 Argument-based approach to validation……… 11
2 Standard setting for an English proficiency test……… 15
2.1 Definition of standard setting……… 15
2.2 Overview of standard setting methods……… 17
2.3 Common elements in standard setting……… 21
2.3.1 Selecting a standard-setting method……… 21
2.3.2 Choosing a standard setting panel……… 23
2.3.3 Preparing descriptions of performance-level descriptors……… 24
2.3.4 Training panelists……… 24
2.3.5 Providing feedback to panelists……… 26
2.3.6 Compiling ratings and obtaining cut scores……… 27
2.3.7 Evaluating standard setting……… 27
2.4 Evaluating standard setting……… 28
2.4.1 Procedural evidence……… 30
2.4.2 Internal evidence……… 32
2.4.3 External evidence……… 32
2.4.3.1 Comparisons to other standard-setting methods……… 33
2.4.3.2 Comparisons to other sources of information……… 33
2.4.3.3 Reasonableness of cut scores……… 34
3 Testing listening……… 34
3.1 Communicative language testing……… 34
3.2 Listening construct……… 36
4 Statistical analysis for a language test……… 42
4.1 Statistical analysis of multiple choice (MC) items……… 42
4.2 Investigating reliability of a language test……… 46
5 Review of validation studies……… 49
5.1 Review of validation studies on standard setting……… 49
5.2 Review of studies employing argument-based approach in validating language tests……… 52
6 Summary……… 60
CHAPTER III: METHODOLOGY……… 61
1 Context of the study……… 61
1.1 About the VSTEP.3-5 test……… 61
1.1.1 The development history of the VSTEP.3-5 test……… 61
1.1.2 The administration of the VSTEP.3-5 test in Vietnam……… 62
1.1.3 Test takers……… 62
1.1.4 Test structure and scoring rubrics……… 62
1.1.5 The establishment of the cut scores……… 63
1.2 About the VSTEP.3-5 listening test……… 64
1.2.1 Test purpose……… 64
1.2.2 Test format……… 64
1.2.3 Performance standards……… 64
1.2.4 The establishment of the cut scores of the VSTEP.3-5 listening test……… 68
2 Building an interpretive argument for the VSTEP.3-5 listening test……… 68
3 Methodology……… 70
3.1 Research questions……… 70
3.2 Description of methods of the study……… 71
3.2.1 Analysis of the test tasks and test items……… 72
3.2.1.1 Analysis of test tasks……… 72
3.2.1.2 Analysis of test items……… 73
3.2.2 Analysis of test reliability……… 75
3.2.3 Validation of cut-scores……… 76
3.2.3.1 Procedural……… 76
3.2.3.2 Internal……… 76
3.2.3.3 External……… 77
3.3 Description of Bookmark standard setting procedures……… 78
3.4 Selection of participants of the study……… 81
3.4.1 Test takers of early 2017 administration……… 81
3.4.2 Participants for Bookmark standard setting method……… 82
3.5 Descriptions of tools for data analysis……… 83
3.5.1 Text analyzing tools……… 83
3.5.1.1 English Profile……… 83
3.5.1.2 Readable.io……… 84
3.5.2 Speech rate analyzing tool……… 84
3.5.3 Statistical analyzing tools……… 85
3.5.3.1 WINSTEPS (3.92.1)……… 85
3.5.3.2 Iteman 4.3……… 86
4 Summary……… 87
CHAPTER IV: DATA ANALYSIS……… 89
1 Analysis of the test tasks and test items……… 89
1.1 Analysis of the test tasks……… 89
1.1.1 Characteristics of the test rubric……… 89
1.1.2 Characteristics of the input……… 94
1.1.3 Relationship between the input and response……… 102
1.2 Analysis of the test items……… 102
1.2.1 Overall statistics of item difficulty and item discrimination……… 102
1.2.2 Item analysis……… 107
2 Analysis of the test reliability……… 128
3 Analysis of the cut-scores……… 130
3.1 Procedural evidence……… 130
3.2 Internal evidence……… 131
3.3 External evidence……… 132
CHAPTER V: FINDINGS AND DISCUSSIONS……… 145
1 The characteristics of the test tasks and test items……… 145
2 The reliability of the VSTEP.3-5 listening test……… 151
3 The accuracy of the cut scores of the VSTEP.3-5 listening test……… 151
CHAPTER VI: CONCLUSION……… 154
1 Overview of the thesis……… 154
2 Contributions of the study……… 157
3 Limitations of the study……… 158
4 Implications of the study……… 158
5 Suggestions for further research……… 159
LIST OF THESIS-RELATED PUBLICATIONS……… 161
REFERENCES……… 162
APPENDIX 1: Structure of the VSTEP.3-5 test……… 172
APPENDIX 2: Summary of the directness and interactiveness between the texts and the questions of the VSTEP.3-5 listening test……… 174
APPENDIX 3: Consent form (workshops)……… 177
APPENDIX 4: Agenda for Bookmark standard-setting procedure……… 179
APPENDIX 5: Panelist recording form……… 180
APPENDIX 6: Evaluation form for standard-setting participants……… 181
APPENDIX 7: Control file for WINSTEPS……… 183
APPENDIX 8: Timeline of the VSTEP.3-5 test administration……… 185
APPENDIX 9: List of the VSTEP.3-5 developers……… 186
LIST OF FIGURES
Figure 2.1: Model of Toulmin's argument structure (1958, 2003)……… 12
Figure 2.2: Sources of variance in test scores (Bachman, 1990)……… 47
Figure 2.3: Overview of interpretive argument for ESL writing course placements……… 57
Figure 4.1: Item map of the VSTEP.3-5 listening test……… 105
Figure 4.2: Graph for item 2……… 108
Figure 4.3: Graph for item 3……… 110
Figure 4.4: Graph for item 6……… 112
Figure 4.5: Graph for item 13……… 115
Figure 4.6: Graph for item 14……… 117
Figure 4.7: Graph for item 15……… 119
Figure 4.8: Graph for item 19……… 121
Figure 4.9: Graph for item 20……… 123
Figure 4.10: Graph for item 28……… 125
Figure 4.11: Graph for item 34……… 126
Figure 4.12: Total score for the scored items……… 129
LIST OF TABLES
Table 2.1: Review of standard-setting methods (Hambleton & Pitoniak, 2006)……… 21
Table 2.2: Standard setting evaluation elements (Cizek & Bunch, 2007)……… 30
Table 2.3: Common steps required for standard setting (Cizek & Bunch, 2007)……… 32
Table 2.4: A framework for defining listening task characteristics (Buck, 2001)……… 38
Table 2.5: Criteria for item selection and interpretation of item difficulty index……… 44
Table 2.6: Criteria for item selection and interpretation of item discrimination index……… 46
Table 2.7: General guideline for interpreting test reliability (Bachman, 2004)……… 48
Table 2.8: Number of proficiency levels & test reliability……… 48
Table 2.9: Summary of the warrant and assumptions associated with each inference in the TOEFL interpretive argument (Chapelle et al., 2008)……… 56
Table 3.1: Structure of the VSTEP.3-5 test………
Table 3.2: The cut scores of the VSTEP.3-5 test………
Table 3.3: Performance standard of Overall Listening Comprehension (CEFR: learning, teaching, assessment)………
Table 3.4: Performance standard of Understanding conversation between native speakers (CEFR: learning, teaching, assessment)………
Table 3.5: Performance standard of Listening as a member of a live audience (CEFR: learning, teaching, assessment)………
Table 3.6: Performance standard of Listening to announcements and instructions (CEFR: learning, teaching, assessment)………
Table 3.7: Performance standard of Listening to audio media and recordings (CEFR: learning, teaching, assessment)………
Table 3.8: The cut scores of the VSTEP.3-5 test………
Table 3.9: Criteria for item selection and interpretation of item difficulty index………
Table 3.10: Criteria for item selection and interpretation of item discrimination index………
Table 3.11: Number of proficiency levels & test reliability………
Table 3.12: The venue for Angoff and Bookmark standard setting method………
Table 3.14: Summary of the interpretative argument for the interpretation and use of the VSTEP.3-5 listening cut-scores……… 88
Table 4.1: General instruction of the VSTEP.3-5 listening test…….……… 90
Table 4.2: Instruction for Part 1……….……….……… 91
Table 4.3: Instruction for Part 2……… ……… 92
Table 4.4: Instruction for Part 3…….……… 93
Table 4.5: Information provided in the specifications for the VSTEP.3-5 listening test…… 94
Table 4.6: Summary of the texts for items 1-8……… 96
Table 4.7: Description of language levels for texts of items 1-8 in the specification……… 97
Table 4.8: Summary of the texts for items 9-20……… 98
Table 4.9: Description of language levels for texts of items 9-20 in the specification……… 99
Table 4.10: Summary of the texts for items 21-35……… 100
Table 4.11: Description of language levels for texts of items 21-35 in the specification……… 101
Table 4.12: Summary of item discrimination and item difficulty……… 104
Table 4.13: Summary statistics for the flagged items……… 106
Table 4.14: Information for item 2……… 108
Table 4.15: Item statistics for item 2……… 109
Table 4.16: Option statistics for item 2……… 109
Table 4.17: Quantile plot data for item 2……… 109
Table 4.18: Information for item 3……… 110
Table 4.19: Item statistics for item 3……… 110
Table 4.20: Option statistics for item 3……… 111
Table 4.21: Quantile plot data for item 3……… 111
Table 4.22: Information for item 6……… 112
Table 4.23: Item statistics for item 6……… 112
Table 4.24: Option statistics for item 6……… 113
Table 4.25: Quantile plot data for item 6……… 113
Table 4.26: Information for item 13……… 115
Table 4.27: Item statistics for item 13……… 115
Table 4.28: Option statistics for item 13……… 116
Table 4.29: Quantile plot data for item 13……… 116
Table 4.30: Information for item 14……… 118
Table 4.31: Item statistics for item 14……… 118
Table 4.32: Option statistics for item 14……… 118
Table 4.33: Quantile plot data for item 14……… 118
Table 4.34: Information for item 15……… 120
Table 4.35: Item statistics for item 15……… 120
Table 4.36: Option statistics for item 15……… 120
Table 4.37: Quantile plot data for item 15……… 120
Table 4.38: Information for item 19……… 121
Table 4.39: Item statistics for item 19……… 121
Table 4.40: Option statistics for item 19……… 122
Table 4.41: Quantile plot data for item 19……… 122
Table 4.42: Information for item 20……… 123
Table 4.43: Item statistics for item 20……… 123
Table 4.44: Option statistics for item 20……… 124
Table 4.45: Quantile plot data for item 20……… 124
Table 4.46: Information for item 28……… 125
Table 4.47: Item statistics for item 28……… 125
Table 4.48: Option statistics for item 28……… 125
Table 4.49: Quantile plot data for item 28……… 126
Table 4.50: Information for item 34……… 127
Table 4.51: Item statistics for item 34……… 127
Table 4.52: Option statistics for item 34……… 127
Table 4.53: Quantile plot data for item 34……… 127
Table 4.54: Summary of statistics……… 129
Table 4.55: Test reliability……… 129
Table 4.56: The person reliability and item reliability of the test……… 130
Table 4.57: Number of proficiency levels and test reliability……… 131
Table 4.58: The test reliability of the VSTEP.3-5 listening test……… 132
Table 4.59: Order of items in the booklet……… 133
Table 4.60: Summary of output from Round 1 of Bookmark standard-setting procedure……… 135
Table 4.61: Conversion table………
Table 4.62: Summary of statistics in raw score metric for Round 1………
Table 4.63: Summary of output from Round 2 of Bookmark standard-setting procedure………
Table 4.64: Round 3 feedback for Bookmark standard-setting procedure………
Table 4.65: Summary of output from Round 3 of Bookmark standard-setting procedure………
Table 4.66: The cut scores set for the VSTEP.3-5 listening test by Bookmark method………
Table 4.67: The cut scores set for the VSTEP.3-5 listening test by Angoff method………
Table 4.68: Comparison between the results of two standard-setting methods………
LIST OF KEY TERMS
Construct: A construct refers to the knowledge, skill or ability that is being tested. In a more technical and specific sense, it refers to a hypothesized ability or mental trait which cannot necessarily be directly observed or measured, for example, listening ability. Language tests attempt to measure the different constructs which underlie language ability.
Cut score: A score that represents achievement of the criterion, the line between success and failure, mastery and non-mastery.

Descriptor: A brief description accompanying a band on a rating scale, which summarizes the degree of proficiency or type of performance expected for a test taker to achieve that particular score.

Distractor: The incorrect options in multiple-choice items.
Expert panel: A group of target language experts or subject matter experts who provide comments about a test.

High-stakes test: A high-stakes test is any test used to make important decisions about test takers.

Inference: A conclusion that is drawn about something based on evidence and reasoning.

Input: Input material provided in a test task for the test taker to use in order to produce an appropriate response.
Interpretive argument: Statements that specify the interpretation and use of the test performances in terms of the inferences and assumptions used to get from a person's test performance to the conclusions and decisions based on the test results.
Item (also, test item): Each testing point in a test which is given a separate score or scores. Examples are: one gap in a cloze test; one multiple-choice question with three or four options; one sentence for grammatical transformation; one question to which a sentence-length response is expected.
Key: The correct option or response to a test item.

Multiple-choice item: A type of test item which consists of a question or incomplete sentence (stem), with a choice of answers or ways of completing the sentence (options). The test taker's task is to choose the correct option (key) from a set of possibilities. There may be any number of incorrect possibilities (distractors).
Options: The range of possibilities in a multiple-choice item or matching task from which the correct one (key) must be selected.

Panelist: A target language expert or subject matter expert who provides comments about a test.
Performance level description: Brief operational definitions of the specific knowledge, skills, or abilities that are expected of examinees whose performance on a test results in their classification into a certain performance level; elaborations of the achievement expectations connoted by performance level labels.
Performance level label: A hierarchical group of single words or short phrases that are used to label the two or more performance categories created by the application of cut scores to examinee performance on a test.
Performance standard: The abstract conceptualization of the minimum level of performance distinguishing examinees who possess an acceptable level of knowledge, skill, or ability judged necessary to be assigned to a category, or for some other specific purpose, and those who do not possess that level. This term is sometimes used interchangeably with cut score.
Proficiency test: A test which measures how much of a language someone has learned. Proficiency tests are designed to measure the language ability of examinees regardless of how, when, why, or under what circumstances they may have experienced the language.
Readability: Readability is the ease with which a reader can understand a written text. The readability of a text depends on its content (the complexity of its vocabulary and syntax) and its presentation (typographic aspects such as font size, line height, and line length).
Reliability: The reliability of a test is concerned with the consistency of scoring and the accuracy of the administration procedures of the test.
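As one concrete illustration of this concept (internal consistency is only one of several ways to estimate reliability, and this formula is not part of the definition above), Cronbach's alpha for a test of $k$ items is

$$\alpha = \frac{k}{k-1}\left(1 - \frac{\sum_{i=1}^{k}\sigma_i^2}{\sigma_X^2}\right),$$

where $\sigma_i^2$ is the variance of item $i$ and $\sigma_X^2$ is the variance of the total scores; values closer to 1 indicate more consistent measurement.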
Response probability (RP) criterion: In the context of Bookmark and similar item-mapping standard-setting procedures, the criterion used to operationalize participants' judgment regarding the probability of a correct response (for dichotomously scored items) or the probability of achieving a given score point or higher (for polytomously scored items). In practical applications, two RP criteria appear to be used most frequently (RP50 and RP67); other RP criteria have also been used, though considerably less frequently.
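As a worked illustration (assuming a Rasch model, which the definition above does not itself specify), an RP criterion maps each dichotomous item to a location on the ability scale: for an item of difficulty $b$, the ability $\theta$ at which the probability of a correct response equals RP satisfies

$$\frac{1}{1 + e^{-(\theta - b)}} = \mathrm{RP} \quad\Longrightarrow\quad \theta = b + \ln\frac{\mathrm{RP}}{1 - \mathrm{RP}},$$

so RP50 locates the item at $\theta = b$, while RP67 locates it at roughly $\theta = b + 0.7$ logits.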
Rubric: A set of instructions or guidelines on an exam paper.

Selected-response: An item format in which the test taker must choose the correct answer from the alternatives provided.
Specifications (also, test specifications): A description of the characteristics of a test, including what is tested, how it is tested, and details such as the number and length of forms and the item types used.

Standard setting: A measurement activity in which a procedure is applied to systematically gather and analyze human judgment for the purpose of deriving one or more cut scores for a test.
Standardized test: A standardized test is any form of test that (1) requires all test takers to answer the same questions, or a selection of questions from a common bank of questions, in the same way, and that (2) is scored in a "standard" or consistent manner, which makes it possible to compare the relative performance of individual students or groups of students.
Test form: Test forms are different versions of a test that are designed in the same format and used for different administrations.

Validation: The action of checking or proving the validity or accuracy of something. The validity of a test can only be established through a process of validation.

Validity: The degree to which a test measures what it is supposed to measure, or can be used successfully for the purpose for which it is intended. A number of different statistical procedures can be applied to a test to estimate its validity. Such procedures generally seek to determine what the test measures, and how well it does so.
ABSTRACT
Standard setting is an important phase in the development of an examination program, especially for a high-stakes test. Standard setting studies are designed to identify reasonable cut scores and to provide backing for this choice of cut scores. This study was aimed at investigating the validity of the cut scores established for a VSTEP.3-5 listening test administered in early 2017 to 1562 test takers by one institution permitted by the Ministry of Education and Training, Vietnam to design and administer the VSTEP.3-5 tests. The study adopted the current argument-based validation approach with a focus on three main inferences constructing the validity argument: (1) test tasks and items, (2) test reliability and (3) cut scores. The argument is that, in order for the cut-scores of the VSTEP.3-5 listening test to be valid, the test tasks and test items first needed to be designed in accordance with the characteristics specified in the specifications. Second, the listening test scores should be sufficiently reliable so as to reasonably reflect test-takers' listening proficiency. Third, the cut scores should be reasonably established for the VSTEP.3-5 listening test.

In this study, qualitative and quantitative methods were combined and structured to provide backing for or against the assumptions underlying each of these three inferences. With regard to the first and second inferences, an analysis of the test tasks and the test items was conducted, and test reliability was investigated to see whether it fell within the acceptable range. In terms of the third inference, concerning the cut scores of the VSTEP.3-5 listening test, the Bookmark standard setting method was implemented and its results were compared with the cut scores currently applied for the test. This study offers contributions in three areas. First, it supports the widely held notion of validity as a unitary concept and of validation as the process of building an interpretive argument and collecting evidence in support of that argument. Second, it contributes towards raising the awareness of the
importance of evaluating the cut scores of high-stakes language tests in Vietnam so that fairness can be ensured for all of the test takers. Third, this study contributes to the construction of a systematic, transparent and defensible body of validity argument for the VSTEP.3-5 test in general and its listening component in particular. The results of this study are helpful in providing informative feedback on the establishment of the cut scores for the VSTEP.3-5 listening test, the test specifications, and the test development process. Positive results can provide evidence to strengthen the reasonableness of the cut scores, the specifications and the quality of the VSTEP.3-5 listening test; negative results can suggest changes or improvements in the cut scores, the specifications and the design of the VSTEP.3-5 listening test.
ACKNOWLEDGMENTS

First of all, I would like to express my deepest gratitude to my supervisor, Professor Nguyễn Hòa, who has been a tremendous mentor for me.
I would also like to thank my co-supervisor, Fred Davidson, Professor Emeritus of the University of Illinois, for giving me the very first ideas, advice and guidance on how to start my Ph.D. study. His advice on both research and my career has been invaluable.
I am especially thankful to Professor Nathan T. Carr from California State University, Fullerton for conducting a series of workshops on designing and analyzing language tests at the University of Languages and International Studies, Vietnam National University - Hanoi. Being able to discuss my work with him has been invaluable for developing my ideas. His sharing of his knowledge and experience about language testing and assessment in general and standard-setting methods in particular has been a great contribution to the completion of my Ph.D. thesis.
I want to thank all of my colleagues at the University of Languages and International Studies, Vietnam National University - Hanoi, especially my post-graduate colleagues.

Words cannot express how grateful I am to my family. I want to say thank you to my parents and siblings for their encouragement during the time I conducted my study.
This thesis is dedicated to my beloved husband and my daughter for their love, endless support, encouragement and sacrifices throughout this experience.

As a final word, I would like to thank each and every individual who has been a source of support and encouragement and helped me to achieve my goal and complete my thesis work successfully.
CHAPTER I INTRODUCTION
This chapter introduces the topic of the study and presents the main reasons for choosing it. After that, the chapter presents the questions that are going to be addressed within the scope of the study. A brief overview of the organization of the thesis closes the chapter.
1 Statement of the problem
The term "cut scores" refers to the lowest possible scores on a standardized test, high-stakes test or other form of assessment that separate a test score scale into two or more regions, creating categories of performance or classifications of examinees. Clearly, if the cut scores are not appropriately set, the results of the assessment could come into question. For this reason, establishing cut scores for a test has been considered an important and practical aspect of standard setting. In his discussion of test validation, Kane (2006), besides emphasizing the importance of carefully defining the selected cut scores, highlights the evaluation of the reasonableness of the cut scores and states that the establishment of the cut scores is a complex endeavor, but the validation of the cut scores is even more difficult.
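To make the classificatory role of cut scores concrete, the following minimal Python sketch shows how a set of cut scores partitions a score scale into performance levels. The thresholds used here are hypothetical placeholders on a 0-10 scale, not the official VSTEP.3-5 cut scores:

# Hypothetical cut scores on a 0-10 scale (illustration only, not VSTEP.3-5 values),
# ordered from the highest level down so the first threshold reached wins.
CUT_SCORES = [(8.5, "Level 5 (C1)"), (6.0, "Level 4 (B2)"), (4.0, "Level 3 (B1)")]

def classify(score):
    """Return the highest performance level whose cut score the test taker reaches."""
    for cut, label in CUT_SCORES:
        if score >= cut:
            return label
    return "Below Level 3"

for s in (3.5, 5.0, 7.0, 9.0):
    print(s, "->", classify(s))

A score thus falls into exactly one region, and moving any threshold reclassifies every examinee near it, which is why the placement of cut scores carries such weight.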
According to the Standards for Educational and Psychological Testing (AERA et al., 1999, p.9), validity is defined as "the degree to which evidence and theory support the interpretation of test scores entailed by proposed uses of tests", and test validation is the process of making a case for the proposed interpretation and use of test scores. This case takes the form of an argument that states a series of propositions supporting the proposed interpretation and use of test scores and summarizes the evidence supporting these propositions (Kane, 2006). With regard to standard setting, since there are no "gold standards" and "true cut scores", to validate established cut scores means to provide evidence in support of the plausibility and appropriateness of the proposed cut score interpretation, and of their credibility and defensibility (Kane et al., 1999). Worldwide, although plenty of studies have been conducted on the validity of cut scores established for a test, these studies mainly aim at cross-validating two different methods of standard setting and comparing their results, rather than investigating the validity of cut scores as a whole.
In Vietnam, the National Foreign Language 2020 Project (NFL2020) was initiated in 2008 with the aim to "renovate the teaching and learning of foreign languages within the national education system" so that "… by 2020, most Vietnamese students graduating from secondary, vocational schools, colleges and universities will be able to use a foreign language confidently in their daily communication, their study and work in an integrated, multi-cultural and multi-lingual environment, making foreign languages a comparative advantage of development for Vietnamese people in the cause of industrialization and modernization for the country" (Decision 1400/QD-TTg). Language assessment is considered a major component of this project. The biggest achievement of this component is the emergence of the first-ever standardized test of English proficiency in Vietnam (the VSTEP.3-5 test). The test was officially released by the Ministry of Education and Training, Vietnam on 11th March 2015. The test aims at measuring English ability across a broad language proficiency continuum from level 3 to level 5, equivalent to levels B1 - C1 of the CEFR (Common European Framework of Reference for Languages). The cut scores of the VSTEP.3-5 test help to categorize test takers and certify them based on the levels they achieve. These cut scores are applied to all of the results of the VSTEP.3-5 tests, which are supposed to be strictly built in accordance with the test specifications.
At the moment, the results and certificates of the VSTEP.3-5 test are used by many companies as a requirement for a job position and by many educational institutions as a "visa" for learners to be accepted into or graduate from an academic program. For example, English teachers in primary schools and secondary schools throughout Vietnam are expected to obtain level 4 in English (equivalent to B2), while the requirement for those working in high schools, colleges and universities is level 5 (equivalent to C1). Besides, in order to graduate from university, English major students need to show evidence of their English at level 5 (equivalent to C1), while the requirement for non-English major students is level 3 (equivalent to B1). This shows that the uses of the VSTEP.3-5 test and the decisions that are made from the test cut scores have important consequences for the stakeholders. Like other high-stakes tests such as TOEFL, IELTS, PTE, or the Cambridge tests, in order to gain credibility and defensibility, more research needs to be conducted on the test in general and the validity of the VSTEP.3-5 cut scores in particular. However, so far there have been few studies on the VSTEP.3-5 test, and there is no validation research on the cut scores of the test.
Among the skills tested in high-stakes examinations, listening is the skill that researchers least often choose to study. According to Buck (2001), the assessment of listening ability is one of the least understood and least developed areas of language assessment. However, Buck (2001) also states that the assessment of listening ability is one of the most important aspects of testing. In terms of standard setting and cut score validation, the procedure for listening tests is also much more complicated and time-consuming. Nevertheless, for the author of this study, listening is a particularly interesting skill and thus deserves investigation.
All of the reasons mentioned above have motivated the author of this doctoral thesis to conduct a validation study on the cut scores of the VSTEP.3-5 listening test by using the argument-based validation model proposed by Kane (2013). A validity argument is a set of related propositions that, taken together, form an argument in support of an intended use or interpretation of the test scores. With the deeply-rooted desire to develop a good listening proficiency test in Vietnam, this research is expected to bring the author of this doctoral thesis a profound insight into this specific area of interest for her future professional development.
2 Objectives of the study
As mentioned, since the VSTEP.3-5 test is a newly developed high-stakes test, the need to standardize it is imperative. Thus, this doctoral research is conducted as an ongoing attempt at building a systematic, transparent and defensible body of validity argument for the VSTEP.3-5 test in general and its listening component in particular. By adopting the argument-based approach recommended by Kane (2013), the study aims at investigating the validity of the cut-scores of the VSTEP.3-5 listening test.
3 Significance of the study
This study will be a significant endeavor in building a systematic, transparent and defensible body of validity argumentation for the VSTEP.3-5 test in general and its listening component in particular. It will also contribute to the practice of validating the cut scores of a test by adopting the argument-based approach. Moreover, the results of this study will be helpful in providing a close look at the test specifications of the VSTEP.3-5 listening test, the test development process and the establishment of the cut scores of the test. These results can provide evidence either to support the reasonableness of the test specifications, the test development process and the establishment of the cut scores, or to suggest adjustments to them.
4 Scope of the study
In the current context of English testing and assessment in Vietnam, the cut scores of the VSTEP.3-5 listening test are pre-established and applied to all of the test forms, which are supposed to be strictly designed in accordance with the specifications. Thus, when a VSTEP.3-5 listening test is delivered by any authorized institution, it is supposed to have been constructed based on the specifications so that the cut scores can be interpreted in the preset way. Within the scope of this study, the focus is on the validation of the cut-scores of the VSTEP.3-5 listening test administered in early 2017 by one institution permitted by the Ministry of Education and Training, Vietnam (MOET) to design and administer the VSTEP.3-5 test (hereinafter referred to as the VSTEP.3-5 listening test). The argument is that, in order for the cut scores of the VSTEP.3-5 listening test to be valid, (1) the test tasks and test items must be designed in accordance with the test specifications; (2) the test scores must be reliable in measuring test-takers' proficiency; and (3) the cut scores must be reasonably established so that they are useful for making decisions about test takers' English listening competency. Thus, the three aspects of the test that will be taken into consideration in this study are: (1) the design of test tasks and test items; (2) the test reliability; and (3) the accuracy of the cut scores.
5 Statement of research questions
Based on the interpretive argument for the validity of the cut scores of the VSTEP.3-5 listening test, there is one main research question for the study, which is then clarified by three sub-questions.

The main research question is:

To what extent do the cut scores of the VSTEP.3-5 listening test provide a reasonable interpretation of the test-takers' listening ability?
The three sub-questions that help to clarify the main research question are:
1 To what extent are the test tasks and the test items of the VSTEP.3-5 listening test properly designed in accordance with the specifications?
2 To what extent are the VSTEP.3-5 listening test scores reliable in measuring the test takers’ English proficiency?
3 To what extent are the cut scores reasonably established for the VSTEP.3-5 listening test?
6 Organization of the study
The study consists of six chapters as follows:
Chapter I: Introduction
Chapter II: Literature Review
Chapter III: Methodology
Chapter IV: Data analysis
Chapter V: Findings and Discussions
Chapter VI: Conclusion
Chapter I is aimed at introducing the topic of the study and presenting the main reasons for the author to implement this project.

Chapter II provides the theoretical and empirical background, with a critical discussion of the relevant concepts, models, and theories for the study.

Chapter III presents the context of the study and how the study is conducted, together with a review of each selected method.

Chapter IV presents the data analysis of the study.

Chapter V presents the findings of the study and discusses these results.

Chapter VI has two aims. First, it specifies the limitations of the study. Second, it suggests some directions for future studies.
CHAPTER II LITERATURE REVIEW
This chapter reviews theories and research which are fundamental to the current study. The first part of this chapter starts with a presentation of how the concept of validity has changed over the years. Then, it discusses validation approaches and procedures for articulating a validation argument before describing different kinds of evidence that can be collected in support of a validation argument. The second part of this chapter focuses on standard setting, including definitions of the concept, an overview of different standard setting methods, a discussion of common elements in standard setting, and issues related to standard setting validation. The third part of this chapter first addresses issues in testing listening and then presents the framework for analyzing the listening test tasks. The fourth part of this chapter describes the important statistical analyses for a language test, including item difficulty, item discrimination and test reliability. Finally, a review of related validation studies ends the chapter.
1 Validation in language testing
1.1 The evolution of the concept of validity
Validity is considered one of the most important concepts in psychometrics, but, as Sireci (2009) states, validity has taken on many different meanings over the years. In the early 20th century, validity was primarily defined in terms of the correlation of test scores with some other criteria, and tests were described as valid for anything they correlated with (Kelly, 1927; Thurstone, 1932; Bingham, 1937). Another theoretical definition of validity was proposed by Garrett (1937), who defined validity simply as the degree to which "a test measures what it is supposed to measure".

The Technical Recommendations (APA, 1954) proposed four types of validity – namely, predictive validity, concurrent validity, content validity and construct validity. In contrast, Anastasi (1954) organized the presentation of validity in terms of face validity, content validity, factorial validity and empirical validity before adopting the framework promulgated in the 1954 Standards. However, the framework was later presented in terms of content validity, criterion-related validity and construct validity (Anastasi, 1982; Cronbach, 1955, 1960). In this framework, predictive validity and concurrent validity were seen as criterion-oriented validity, more space was allocated to construct validity, and content validity was mentioned in an additional extended discussion of proficiency tests and construct validity.
In the Standards for Educational and Psychological Testing (AERA et al., 1985), validity is defined as follows: "The concept refers to the appropriateness, meaningfulness, and usefulness of the specific inferences made from test scores. Test validation is the process of accumulating evidence to support such inferences. A variety of inferences may be made from scores produced by a given test, and there are many ways of accumulating evidence to support any particular inference. Validity, however, is a unitary concept" (APA, 1985, p.9). It was clearly stated that "the inferences regarding specific uses of a test are validated, not the test itself" (APA, 1985, p.9).
The definition in the 1985 Standards was well explained and elaborated by Bachman (1990). According to Bachman (1990), "the process of validation starts with the inferences that are drawn and the uses that are made of scores. These uses and inferences dictate the kinds of evidence and logical argument that are required to support judgments regarding validity. Judging the extent to which an interpretation or use of a given test score is valid thus requires the collection of evidence supporting the relationship between the test score and an interpretation or use" (Bachman, 1990, p.243). Messick (1989) defines validity as a unitary concept and describes validity "as an integrated evaluative judgment of the degree to which empirical evidence and theoretical rationales support the adequacy and appropriateness of inferences and actions based on test scores" (p.13). This theoretical conceptualization of validity as a unitary concept has been widely endorsed by the measurement profession as a whole (Shepard, 1993; AERA et al., 1999; Messick, 1989; Kane, 1992, 2006, 2009, 2013) and has a strong influence on current practice in educational and psychological testing in many parts of the world. In this study, this conceptualization of validity is adopted as the theoretical foundation for understanding different aspects of validity.
1.2 Aspects of validity

First, validity is a property of the interpretation and use of test scores rather than of the test itself; it can be quite reasonable to talk about the validity of a test only if an interpretation or use has already been adopted, explicitly or implicitly.
Second, validity is a matter of degree, and it may change over time as the interpretation and use develop and as new evidence accumulates. Since a particular test score can never provide a perfectly accurate measure of a given ability, and the validity of interpretation and use always depends on the logic of the interpretive argument and the strength of the evidence provided in support of this argument, we can never prove that our interpretation and use are valid. Thus, it is better to provide evidence that the intended interpretation and use are more plausible than other interpretations that might be offered.
Third, validity is always specific to a particular use or interpretation. When a test is developed, we always have a particular set of interpretations and uses in mind. These intended interpretations will depend on how the construct or ability to be measured is defined. The way we define a given ability may be different from the way it might be defined for different purposes or different groups of test-takers. Thus, the scores of a particular test will not necessarily be appropriate for other situations or other purposes.
Fourth, validity is a unitary concept. We often hear about different types of validity, such as content validity, concurrent validity, predictive validity or construct validity. However, as defined above, validity is a single quality of the ways in which we use a particular test. Many different kinds of evidence, such as test content analysis or correlations with other measures of ability, can be provided in support of the intended interpretation and use. For this reason, the evidence needed for validation depends on the interpretation and use, and different interpretations or uses will require different kinds and different amounts of evidence for their validation.
Fifth, validity involves an overall evaluative judgment. A validation argument typically includes several parts and is supported by different kinds of evidence, none of which by itself is sufficient to justify the intended inferences and uses of a particular test. Thus, when evaluating the validity of inferences and uses, it is important to consider the interpretive argument in its entirety, as well as all the supporting evidence.
In summary, given the aspects of validity described above, investigating the validity of test use, which is called validation, can be seen as the process of building an interpretive argument and collecting evidence in support of that argument (Kane, 1992, 2006, 2013). This validation approach was termed the argument-based approach to validation by Kane (1992) and is widely applied in validation practice nowadays.
1.3 Argument-based approach to validation
Kane (1992, 2006, 2013) recommends that researchers follow an argument-based approach to make the task of validating inferences from test scores both scientifically sound and manageable. In this approach, the validator builds an argument that focuses on defending the interpretation and use of test scores for a particular purpose and is based on empirical evidence supporting that particular use. According to Kane (2006), validation consists of two types of arguments: an interpretive argument and a validity argument. The interpretive argument is built upon a number of inferences and assumptions that are meant to justify score interpretation and use, whereas the validity argument provides an evaluation of the interpretive argument in terms of how reasonable and coherent it is as well as how plausible its assumptions are (Cronbach, 1988). To be more specific, the establishment of an interpretive argument involves (1) the determination of the inferences based on test scores, (2) the articulation of assumptions, (3) the decision on the sources of evidence that can support or refute those inferences, (4) the collection of appropriate data and (5) the analysis of the evidence. In order for the interpretation or use of test scores to be valid, all of the inferences and assumptions inherent in that interpretation or use have to be plausible.
Kane (1992, 2006, 2013) cites Toulmin's (1958, 2003) framework as a guide for applying his approach to validation. Toulmin's (1958) framework essentially requires that a chain of reasoning be established that is able to build a case towards a conclusion, which in this case would be to determine the plausibility and reasonableness of score interpretation and use. Figure 2.1 shows Toulmin's (1958, 2003) argument structure, which is built on several components, including the grounds, claim, warrant, backing, and rebuttal.
Figure 2.1: Model of Toulmin’s argument structure (1958, 2003)
In terms of test score interpretation and use, the claim of an argument is the conclusion drawn about a test-taker based on test performance, whereas the grounds serve as the data or observations upon which the claim is based. An example can be illustrated by the case given by Mislevy et al. (2003). One may make the claim that a student's English speaking abilities are inadequate for studying at an English-medium university based on the grounds/observation that the student did not perform well in the final oral examination in an intensive English class. The performance that the teacher observed when the student spoke in front of the class on the assigned topic was characterized by hesitations and mispronunciations.
In the interpretive argument underlying the claim about the student's readiness for university study, the inference linking the grounds to the claim is not given, and therefore justification is needed in the form of a warrant (or assumption). Figure 2.1 shows the interpretive argument as consisting of one inference, which is authorized by a warrant. The warrant in Toulmin's model is regarded as a rule, principle, or established procedure that is meant to provide justification for the inference connecting the grounds to the claim. In the example provided above, the warrant is the generally held principle that hesitations and mispronunciations are characteristic of students with low levels of English speaking ability, who would have difficulty at an English-medium university.
The warrants in turn need backing (or evidence), which comes in the form of theories, research, data, and experience. In relation to the example provided above, backing might be drawn from the teacher's training and previous experience with non-native speakers at an English-medium university. Stronger backing could be obtained by having the student's speaking performance rated by another teacher and then showing the agreement between the two raters.
Finally, a rebuttal also acts as a link between the grounds and the claim but serves to weaken the initial argument by providing evidence or a possible explanation that may call the warrant into question. Going back to the previous example, a possible rebuttal may be that the assigned topic for the oral presentation required the student to use highly technical and unfamiliar vocabulary. This rebuttal would serve to weaken the inference connecting the grounds – that the oral presentation contained many hesitations and mispronunciations – and the claim that the student's speaking ability was at a level that would not allow him to succeed at an English-medium university.
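To summarize the relationships just described, the following Python sketch represents the components of a Toulmin argument as a simple data structure and populates it with the speaking-test example from Mislevy et al. (2003). This is a schematic aid only, not part of Toulmin's or Kane's formal apparatus:

from dataclasses import dataclass, field
from typing import List

@dataclass
class ToulminArgument:
    grounds: str                # the observation on which the claim rests
    claim: str                  # the conclusion drawn from the grounds
    warrant: str                # the rule or principle licensing grounds -> claim
    backing: List[str] = field(default_factory=list)    # evidence supporting the warrant
    rebuttals: List[str] = field(default_factory=list)  # conditions weakening the inference

argument = ToulminArgument(
    grounds="The final oral exam contained many hesitations and mispronunciations.",
    claim="The student's speaking ability is inadequate for an English-medium university.",
    warrant="Hesitations and mispronunciations characterize low English speaking ability.",
    backing=["The teacher's training and experience with non-native speakers",
             "Agreement between two independent raters of the performance"],
    rebuttals=["The assigned topic required highly technical, unfamiliar vocabulary."],
)
print(argument.claim)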
As can be seen, these components are all connected with each other and are essential for establishing an inferential connection between the claims and grounds. Using Toulmin's framework, test score interpretation and use can be seen as an interpretive argument (Kane, 2001, 2006), involving a number of inferences, each supported by a warrant. To validate the interpretive argument for the proposed interpretation and use of test scores, it is necessary to evaluate the clarity, coherence, and completeness of the interpretive argument and to evaluate the plausibility of each of the inferences and assumptions in it. The proposed interpretation and use of the test scores are supposed to be explicitly stated in terms of a network of inferences and assumptions, and then the plausibility of the warrants for these inferences can be evaluated using relevant evidence. The validation can be simple if the interpretation and use are simple and limited, involving a few plausible inferences. In contrast, if the interpretation and use are more ambitious, involving many inferences and assumptions, the validation will be more demanding.
The argument-based approach to validation has several advantages. First, it requires that the evidence collected for validation be consistent with the interpretation and use being proposed.
Second, the argument-based approach to validation provides definite guidelines for systematically evaluating the validity of the proposed interpretation and use of test scores. In other words, this approach provides the researcher with a clear place to begin and a direction to follow; as a result, it helps the researcher to focus serious attention on validation.
Third, in the argument-based approach, the evaluation of the interpretive argument does not lead to any absolute decision about validity, but it does provide a way to measure progress (Kane, 2006). In other words, this approach views validation as an on-going and critical process instead of just answering either "valid" or "invalid". Because the most problematic inferences and their supporting assumptions are checked and are either supported by the evidence or at least rendered less problematic, the reasonableness of the interpretive argument as a whole can improve.
Fourth, the argument-based approach to validation may increase the chance that research on validity will lead to improvements in measurement procedures. Since the argument-based approach focuses attention on specific parts of the interpretive argument and on specific aspects of measurement procedures, evidence which indicates the existence of a problem, such as inadequate coverage of content, points directly to the part of the procedure that needs improvement.
2 Standard setting for an English proficiency test
2.1 Definition of standard setting
Probably the most difficult and controversial part of testing is standard setting. In this study, standard setting refers to the establishment of cut scores for a test, i.e., determining the points on the score scale that separate examinees into performance categories such as pass/fail. However, in order to have a complete understanding of the concept, it is important to become familiar with the theoretical foundations of the term.
Cizek (1993) suggests an elaborate and theoretically grounded definition of standard setting. He defines standard setting as "the proper following of a described, rational system of rules or procedures resulting in the assignment of a number to differentiate between two or more states or degrees of performance" (Cizek, 1993, p.10). This definition highlights the procedural aspect of standard setting and draws on the legal framework of due process and traditional definitions of measurement. However, the definition suggested by Cizek (1993) suffers from at least one deficiency in that it addresses only one aspect of the legal principle known as due process: it covers procedural due process but not substantive due process, that is, whether the results of following the procedure lead to a decision or result that is fundamentally fair. The notion of fairness is, to some extent, subjective. The aspect of fundamental fairness is related to what has been called the "consequential basis of test use" in Messick's (1989, p.84) explication of the various sources of evidence in support of the use and interpretation of test scores.
Kane (1994) provides another definition of standard setting that highlights the conceptual nature of the endeavor. He states that "it is useful to draw a distinction between the passing score, defined as a point on the score scale, and the performance standard, defined as the minimally adequate level of performance for some purpose… The performance standard is the conceptual version of the desired level of competence, and the passing score is the operational version" (Kane, 1994, p.426). For this reason, as stated, this study uses the terms "standard setting" and "cut scores" interchangeably.
Another key concept underlying Kane's definition of standard setting is the concept of inference. As discussed in the previous part, an inference is the interpretation, conclusion, or meaning that is intended to be drawn about an examinee's underlying, unobserved level of knowledge, skill, or ability. Thus, from this perspective, validity refers to the accuracy of the inferences made about the examinee, usually based on observations of the examinee's performance. The assumption underlying the inference of standard setting is that the passing score creates meaningful categories that distinguish between individuals who meet some performance standard and those who do not.

Thus, for this study, the primacy of test purpose and the intended inference or test score interpretation is essential to understanding the definition of standard setting. To validate the standard setting, or the establishment of the cut scores, is to evaluate the accuracy of the inferences that are made when examinees are classified based on application of the cut scores.
Finally, in wrapping up the definition of standard setting, it is important to note what standard setting is not. According to Cizek and Bunch (2007, p.18), standard setting "does not seek to find 'true' cut scores that separate real, unique categories on a continuum of competence". Since there is no external truth for all the things we care about, and no set of minimum competences that are necessary and sufficient for life success, all standard setting is judgmental. Cizek and Bunch (2007) also state that, because standard setting necessarily involves human opinions and values, it can to some degree also be seen as a combination of technical, psychometric methods and policy making. Seen in this way, standard setting is a procedure that enables participants to bring their judgments to bear, by using a specified method, in such a way as to translate the policy positions of authorizing entities into locations on a score scale.
2.2 Overview of standard-setting methods
In recent years, more attention has been paid to establishing the credibility of existing standard-setting methods and to investigating new methods. Many of the new approaches have been developed to present judges with meaningful activities and to better accommodate the changing nature of assessments. Cizek & Bunch (2007) summarize 18 different methods of cut-score establishment. One common element of all these methods is that they involve, to one degree or another, human beings expressing informed judgments based on the best evidence available to them, and these judgments are summarized in some systematic way, typically with the aid of a mathematical model, to produce one or more cut scores. Cizek & Bunch (2007) state that each standard-setting method combines art and science and that different methods may yield different results. For this reason, it is hard to say that the cut scores established by one particular method are always superior to those set by another.
Trang 40Method Overview of panelist’s task
Methods involving ratings of test items and scoring rubrics
Angoff Panelists estimate the probability that the borderline examinee will
answer each multiple-choice item correctly
Ebel Panelists estimate each item along two dimensions – difficulty and
relevance For each combination of the dimensions, panelists estimate the percentage of items that the borderline examinee will answer correctly
Nedelsky Panelists estimate for each item the number of distractors that they
think the borderline examinee would be able to rule out as incorrect; the reciprocal of the number of distractors not ruled out, plus one (for the correct answer), is taken as the probability that the borderline examinee will answer the item correctly by resorting to random
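As a numerical illustration of how the judgmental ratings described in this overview become cut scores, the following Python sketch computes an Angoff cut score (the mean across panelists of each panelist's summed item probabilities) and a Nedelsky cut score (the sum across items of the reciprocal of the options remaining after elimination). All ratings are invented for illustration and are not data from the VSTEP.3-5 study or any actual panel:

# Angoff: each panelist estimates P(correct) for the borderline examinee on each item.
angoff_ratings = [
    [0.6, 0.8, 0.5, 0.7],  # panelist 1, items 1-4 (hypothetical values)
    [0.5, 0.9, 0.4, 0.6],  # panelist 2
]
# Per-panelist cut = sum of item probabilities; panel cut = mean across panelists.
angoff_cut = sum(sum(ratings) for ratings in angoff_ratings) / len(angoff_ratings)

# Nedelsky: with k options and r distractors ruled out, (k - r) options remain
# (the surviving distractors plus the key), so P(correct by guessing) = 1 / (k - r).
options_per_item = [4, 4, 4, 4]   # number of options for each item
ruled_out = [2, 3, 1, 2]          # distractors one panelist says can be eliminated
nedelsky_cut = sum(1 / (k - r) for k, r in zip(options_per_item, ruled_out))

print(f"Angoff cut score:   {angoff_cut:.2f} out of 4 raw points")
print(f"Nedelsky cut score: {nedelsky_cut:.2f} out of 4 raw points")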