VIETNAM NATIONAL UNIVERSITY, HANOI
UNIVERSITY OF LANGUAGES AND INTERNATIONAL STUDIES
******
NGUYỄN THỊ QUỲNH YẾN
DOCTORAL DISSERTATION
AN INVESTIGATION INTO THE CUT-SCORE VALIDITY
OF THE VSTEP.3-5 LISTENING TEST
MAJOR: ENGLISH LANGUAGE TEACHING METHODOLOGY
CODE: 9140231.01
HANOI, 2018
VIETNAM NATIONAL UNIVERSITY, HANOI
UNIVERSITY OF LANGUAGES AND INTERNATIONAL STUDIES
******
NGUYỄN THỊ QUỲNH YẾN
DOCTORAL DISSERTATION
AN INVESTIGATION INTO THE CUT-SCORE VALIDITY
OF THE VSTEP.3-5 LISTENING TEST (Nghiên cứu xác trị các điểm cắt của kết quả bài thi Nghe Đánh giá năng lực tiếng Anh từ bậc 3 đến bậc 5 theo Khung năng lực Ngoại ngữ 6 bậc dành cho Việt Nam)
MAJOR: ENGLISH LANGUAGE TEACHING METHODOLOGY
CODE: 9140231.01
SUPERVISORS: 1. PROF. NGUYỄN HÒA
2. PROF. FRED DAVIDSON
HANOI, 2018
This dissertation was completed at the University of Languages and International Studies, Vietnam National University, Hanoi.
This dissertation was defended on 10th May 2018.
This dissertation can be found at:
- National Library of Vietnam
- Library and Information Center - Vietnam National University, Hanoi
DECLARATION OF AUTHORSHIP
I hereby certify that the thesis I am submitting is entirely my own original work except where otherwise indicated. I am aware of the University's regulations concerning plagiarism, including those regulations concerning disciplinary actions that may result from plagiarism. Any use of the works of any other author, in any form, is properly acknowledged at their point of use.
Date of submission: _
Ph.D Candidate’s Signature: _
I certify that I have read this dissertation and that, in my opinion, it is fully adequate in scope and quality as a dissertation for the degree of Doctor of Philosophy.

_

Prof. Nguyễn Hòa (Co-supervisor)

I certify that I have read this dissertation and that, in my opinion, it is fully adequate in scope and quality as a dissertation for the degree of Doctor of Philosophy.

_

Prof. Fred Davidson (Co-supervisor)
TABLE OF CONTENTS
LIST OF FIGURES……… viii
LIST OF TABLES……… ix
LIST OF KEY TERMS……… xiii
ABSTRACT……… xvii
ACKNOWLEDGMENTS……… xix
CHAPTER I: INTRODUCTION……… 1
1 Statement of the problem……… 1
2 Objectives of the study……… 4
3 Significance of the study……… 4
4 Scope of the study……… 4
5 Statement of research questions……… 5
6 Organization of the study……… 5
CHAPTER II: LITERATURE REVIEW……… 7
1 Validation in language testing……… 7
1.1 The evolution of the concept of validity……… 7
1.2 Aspects of validity……… 9
1.3 Argument-based approach to validation……… 11
2 Standard setting for an English proficiency test……… 15
2.1 Definition of standard setting……… 15
2.2 Overview of standard setting methods……… 17
2.3 Common elements in standard setting……… 21
2.3.1 Selecting a standard-setting method……… 21
2.3.2 Choosing a standard setting panel……… 23
2.3.3 Preparing descriptions of performance-level descriptors……… 24
2.3.4 Training panelists……… 24
2.3.5 Providing feedback to panelists……… 26
2.3.6 Compiling ratings and obtaining cut scores……… 27
2.3.7 Evaluating standard setting……… 27
2.4 Evaluating standard setting……… 28
2.4.1 Procedural evidence……… 30
2.4.2 Internal evidence……… 32
2.4.3 External evidence……… 32
2.4.3.1 Comparisons to other standard-setting methods……… 33
2.4.3.2 Comparisons to other sources of information……… 33
2.4.3.3 Reasonableness of cut scores……… 34
3 Testing listening……… 34
3.1 Communicative language testing……… 34
3.2 Listening construct……… 36
4 Statistical analysis for a language test……… 42
4.1 Statistical analysis of multiple choice (MC) items……… 42
4.2 Investigating reliability of a language test……… 46
5 Review of validation studies……… 49
5.1 Review of validation studies on standard setting……… 49
5.2 Review of studies employing argument-based approach in validating language tests……… 52
6 Summary……… 60
CHAPTER III: METHODOLOGY……… 61
1 Context of the study……… 61
1.1 About the VSTEP.3-5 test……… 61
1.1.1 The development history of the VSTEP.3-5 test……… 61
1.1.2 The administration of the VSTEP.3-5 test in Vietnam……… 62
1.1.3 Test takers……… 62
1.1.4 Test structure and scoring rubrics……… 62
1.1.5 The establishment of the cut scores……… 63
1.2 About the VSTEP.3-5 listening test……… 64
1.2.1 Test purpose……… 64
1.2.2 Test format……… 64
1.2.3 Performance standards……… 64
1.2.4 The establishment of the cut scores of the VSTEP.3-5 listening test……… 68
2 Building an interpretive argument for the VSTEP.3-5 listening test……… 68
3 Methodology……… 70
3.1 Research questions……… 70
3.2 Description of methods of the study……… 71
3.2.1 Analysis of the test tasks and test items……… 72
3.2.1.1 Analysis of test tasks……… 72
3.2.1.2 Analysis of test items……… 73
3.2.2 Analysis of test reliability……… 75
3.2.3 Validation of cut-scores……… 76
3.2.3.1 Procedural……… 76
3.2.3.2 Internal……… 76
3.2.3.3 External……… 77
3.3 Description of Bookmark standard setting procedures……… 78
3.4 Selection of participants of the study……… 81
3.4.1 Test takers of early 2017 administration……… 81
3.4.2 Participants for Bookmark standard setting method……… 82
3.5 Descriptions of tools for data analysis……… 83
3.5.1 Text analyzing tools……… 83
3.5.1.1 English Profile……… 83
3.5.1.2 Readable.io……… 84
3.5.2 Speech rate analyzing tool……… 84
3.5.3 Statistical analyzing tools……… 85
3.5.3.1 WINSTEPS (3.92.1)……… 85
3.5.3.2 Iteman 4.3……… 86
4 Summary……… 87
CHAPTER IV: DATA ANALYSIS……… 89
1 Analysis of the test tasks and test items……… 89
1.1 Analysis of the test tasks……… 89
1.1.1 Characteristics of the test rubric……… 89
1.1.2 Characteristics of the input……… 94
1.1.3 Relationship between the input and response……… 102
1.2 Analysis of the test items……… 102
1.2.1 Overall statistics of item difficulty and item discrimination……… 102
1.2.2 Item analysis……… 107
2 Analysis of the test reliability……… 128
3 Analysis of the cut-scores……… 130
3.1 Procedural evidence……… 130
3.2 Internal evidence……… 131
3.3 External evidence……… 132
CHAPTER V: FINDINGS AND DISCUSSIONS……… 145
1 The characteristics of the test tasks and test items……… 145
2 The reliability of the VSTEP.3-5 listening test……… 151
3 The accuracy of the cut scores of the VSTEP.3-5 listening test……… 151
CHAPTER VI: CONCLUSION……… 154
1 Overview of the thesis……… 154
2 Contributions of the study……… 157
3 Limitations of the study……… 158
4 Implications of the study……… 158
5 Suggestions for further research……… 159
LIST OF THESIS-RELATED PUBLICATIONS……… 161
REFERENCES……… 162
APPENDIX 1: Structure of the VSTEP.3-5 test……… 172
APPENDIX 2: Summary of the directness and interactiveness between the texts and the questions of the VSTEP.3-5 listening test……… 174
APPENDIX 3: Consent form (workshops)……… 177
APPENDIX 4: Agenda for Bookmark standard-setting procedure……… 179
APPENDIX 5: Panelist recording form……… 180
APPENDIX 6: Evaluation form for standard-setting participants……… 181
APPENDIX 7: Control file for WINSTEPS……… 183
APPENDIX 8: Timeline of the VSTEP.3-5 test administration……… 185
APPENDIX 9: List of the VSTEP.3-5 developers……… 186
LIST OF FIGURES
Figure 2.1: Model of Toulmin's argument structure (1958, 2003)……… 12
Figure 2.2: Sources of variance in test scores (Bachman, 1990)……… 47
Figure 2.3: Overview of interpretive argument for ESL writing course placements……… 57
Figure 4.1: Item map of the VSTEP.3-5 listening test……… 105
Figure 4.2: Graph for item 2……… 108
Figure 4.3: Graph for item 3……… 110
Figure 4.4: Graph for item 6……… 112
Figure 4.5: Graph for item 13……… 115
Figure 4.6: Graph for item 14……… 117
Figure 4.7: Graph for item 15……… 119
Figure 4.8: Graph for item 19……… 121
Figure 4.9: Graph for item 20……… 123
Figure 4.10: Graph for item 28……… 125
Figure 4.11: Graph for item 34……… 126
Figure 4.12: Total score for the scored items……… 129
LIST OF TABLES
Table 2.1: Review of standard-setting methods (Hambleton & Pitoniak, 2006)……… 21
Table 2.2: Standard setting evaluation elements (Cizek & Bunch, 2007)……… 30
Table 2.3: Common steps required for standard setting (Cizek & Bunch, 2007)……… 32
Table 2.4: A framework for defining listening task characteristics (Buck, 2001)……… 38
Table 2.5: Criteria for item selection and interpretation of item difficulty index……… 44
Table 2.6: Criteria for item selection and interpretation of item discrimination index……… 46
Table 2.7: General guideline for interpreting test reliability (Bachman, 2004)……… 48
Table 2.8: Number of proficiency levels & test reliability……… 48
Table 2.9: Summary of the warrant and assumptions associated with each inference in the TOEFL interpretive argument (Chapelle et al., 2008)……… 56
Table 3.1: Structure of the VSTEP.3-5 test………
Table 3.2: The cut scores of the VSTEP.3-5 test………
Table 3.3: Performance standard of Overall Listening Comprehension (CEFR: learning, teaching, assessment)………
Table 3.4: Performance standard of Understanding conversation between native speakers (CEFR: learning, teaching, assessment)………
Table 3.5: Performance standard of Listening as a member of a live audience (CEFR: learning, teaching, assessment)………
Table 3.6: Performance standard of Listening to announcements and instructions (CEFR: learning, teaching, assessment)………
Table 3.7: Performance standard of Listening to audio media and recordings (CEFR: learning, teaching, assessment)………
Table 3.8: The cut scores of the VSTEP.3-5 test………
Table 3.9: Criteria for item selection and interpretation of item difficulty index………
Table 3.10: Criteria for item selection and interpretation of item discrimination index………
Table 3.11: Number of proficiency levels & test reliability………
Table 3.12: The venue for Angoff and Bookmark standard setting method………
Table 3.14: Summary of the interpretative argument for the interpretation and use of the VSTEP.3-5 listening cut-scores……… 88
Table 4.1: General instruction of the VSTEP.3-5 listening test…….……… 90
Table 4.2: Instruction for Part 1……….……….……… 91
Table 4.3: Instruction for Part 2……… ……… 92
Table 4.4: Instruction for Part 3…….……… 93
Table 4.5: Information provided in the specifications for the VSTEP.3-5 listening test…… 94
Table 4.6: Summary of the texts for items 1-8……… 96
Table 4.7: Description of language levels for texts of items 1-8 in the specification……… 97
Table 4.8: Summary of the texts for items 9-20……… 98
Table 4.9: Description of language levels for texts of items 9-20 in the specification……… 99
Table 4.10: Summary of the texts for items 21-35……… 100
Table 4.11: Description of language levels for texts of items 21-35 in the specification……… 101
Table 4.12: Summary of item discrimination and item difficulty……… 104
Table 4.13: Summary statistics for the flagged items……… 106
Table 4.14: Information for item 2……… 108
Table 4.15: Item statistics for item 2……… 109
Table 4.16: Option statistics for item 2……… 109
Table 4.17: Quantile plot data for item 2……… 109
Table 4.18: Information for item 3……… 110
Table 4.19: Item statistics for item 3……… 110
Table 4.20: Option statistics for item 3……… 111
Table 4.21: Quantile plot data for item 3……… 111
Table 4.22: Information for item 6……… 112
Table 4.23: Item statistics for item 6……… 112
Table 4.24: Option statistics for item 6……… 113
Table 4.25: Quantile plot data for item 6……… 113
Table 4.26: Information for item 13……… 115
Table 4.27: Item statistics for item 13……… 115
Table 4.28: Option statistics for item 13……… 116
Table 4.29: Quantile plot data for item 13……… 116
Table 4.30: Information for item 14……… 118
Table 4.31: Item statistics for item 14……… 118
Table 4.32: Option statistics for item 14……… 118
Table 4.33: Quantile plot data for item 14……… 118
Table 4.34: Information for item 15……… 120
Table 4.35: Item statistics for item 15……… 120
Table 4.36: Option statistics for item 15……… 120
Table 4.37: Quantile plot data for item 15……… 120
Table 4.38: Information for item 19……… 121
Table 4.39: Item statistics for item 19……… 121
Table 4.40: Option statistics for item 19……… 122
Table 4.41: Quantile plot data for item 19……… 122
Table 4.42: Information for item 20……… 123
Table 4.43: Item statistics for item 20……… 123
Table 4.44: Option statistics for item 20……… 124
Table 4.45: Quantile plot data for item 20……… 124
Table 4.46: Information for item 28……… 125
Table 4.47: Item statistics for item 28……… 125
Table 4.48: Option statistics for item 28……… 125
Table 4.49: Quantile plot data for item 28……… 126
Table 4.50: Information for item 34……… 127
Table 4.51: Item statistics for item 34……… 127
Table 4.52: Option statistics for item 34……… 127
Table 4.53: Quantile plot data for item 34……… 127
Table 4.54: Summary of statistics……… 129
Table 4.55: Test reliability……… 129
Table 4.56: The person reliability and item reliability of the test……… 130
Table 4.57: Number of proficiency levels and test reliability……… 131
Table 4.58: The test reliability of the VSTEP.3-5 listening test……… 132
Table 4.59: Order of items in the booklet……… 133
Table 4.60: Summary of output from Round 1 of Bookmark standard-setting procedure……… 135
Table 4.61: Conversion table………
Table 4.62: Summary of statistics in raw score metric for Round 1………
Table 4.63: Summary of output from Round 2 of Bookmark standard-setting procedure………
Table 4.64: Round 3 feedback for Bookmark standard-setting procedure………
Table 4.65: Summary of output from Round 3 of Bookmark standard-setting procedure………
Table 4.66: The cut scores set for the VSTEP.3-5 listening test by Bookmark method………
Table 4.67: The cut scores set for the VSTEP.3-5 listening test by Angoff method………
Table 4.68: Comparison between the results of two standard-setting methods………
LIST OF KEY TERMS
Construct: A construct refers to the knowledge, skill or ability that is being tested. In a more technical and specific sense, it refers to a hypothesized ability or mental trait which cannot necessarily be directly observed or measured, for example, listening ability. Language tests attempt to measure the different constructs which underlie language ability.
Cut score: A score that represents achievement of the criterion, the line between success and failure, mastery and non-mastery.

Descriptor: A brief description accompanying a band on a rating scale, which summarizes the degree of proficiency or type of performance expected for a test taker to achieve that particular score.

Distractor: The incorrect options in multiple-choice items.
Expert panel: A group of target language experts or subject matter experts who provide comments about a test.

High-stakes test: A high-stakes test is any test used to make important decisions about test takers.

Inference: A conclusion that is drawn about something based on evidence and reasoning.

Input: Input material provided in a test task for the test taker to use in order to produce an appropriate response.
Interpretive argument: Statements that specify the interpretation and use of the test performances in terms of the inferences and assumptions used to get from a person's test performance to the conclusions and decisions based on the test results.
Item (also, test item): Each testing point in a test which is given a separate score or scores. Examples are: one gap in a cloze test; one multiple-choice question with three or four options; one sentence for grammatical transformation; one question to which a sentence-length response is expected.
Key: The correct option or response to a test item.

Multiple-choice item: A type of test item which consists of a question or incomplete sentence (stem), with a choice of answers or ways of completing the sentence (options). The test taker's task is to choose the correct option (key) from a set of possibilities. There may be any number of incorrect possibilities (distractors).
Options: The range of possibilities in a multiple-choice item or matching task from which the correct one (key) must be selected.

Panelist: A target language expert or subject matter expert who provides comments about a test.
Performance level description: Brief operational definitions of the specific knowledge, skills, or abilities that are expected of examinees whose performance on a test results in their classification into a certain performance level; elaborations of the achievement expectations connoted by performance level labels.
Performance level label: A hierarchical group of single words or short phrases that are used to label the two or more performance categories created by the application of cut scores to examinee performance on a test.
Performance standard: The abstract conceptualization of the minimum level of performance distinguishing examinees who possess an acceptable level of knowledge, skill, or ability judged necessary to be assigned to a category, or for some other specific purpose, and those who do not possess that level. This term is sometimes used interchangeably with cut score.
Proficiency test: A test which measures how much of a language someone has learned. Proficiency tests are designed to measure the language ability of examinees regardless of how, when, why, or under what circumstances they may have experienced the language.
Readability: Readability is the ease with which a reader can understand a written text. The readability of a text depends on its content (the complexity of its vocabulary and syntax) and its presentation (typographic aspects such as font size, line height, and line length).
Reliability: The reliability of a test is concerned with the consistency of scoring and the accuracy of the administration procedures of the test.
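As one concrete illustration of this concept (internal consistency is only one of several ways to estimate reliability, and this formula is not part of the definition above), Cronbach's alpha for a test of $k$ items is

$$\alpha = \frac{k}{k-1}\left(1 - \frac{\sum_{i=1}^{k}\sigma_i^2}{\sigma_X^2}\right),$$

where $\sigma_i^2$ is the variance of item $i$ and $\sigma_X^2$ is the variance of the total scores; values closer to 1 indicate more consistent measurement.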
Response probability (RP) criterion: In the context of Bookmark and similar item-mapping standard-setting procedures, the criterion used to operationalize participants' judgment regarding the probability of a correct response (for dichotomously scored items) or the probability of achieving a given score point or higher (for polytomously scored items). In practical applications, two RP criteria appear to be used most frequently (RP50 and RP67); other RP criteria have also been used, though considerably less frequently.
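As a worked illustration (assuming a Rasch model, which the definition above does not itself specify), an RP criterion maps each dichotomous item to a location on the ability scale: for an item of difficulty $b$, the ability $\theta$ at which the probability of a correct response equals RP satisfies

$$\frac{1}{1 + e^{-(\theta - b)}} = \mathrm{RP} \quad\Longrightarrow\quad \theta = b + \ln\frac{\mathrm{RP}}{1 - \mathrm{RP}},$$

so RP50 locates the item at $\theta = b$, while RP67 locates it at roughly $\theta = b + 0.7$ logits.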
Rubric: A set of instructions or guidelines on an exam paper.

Selected-response: An item format in which the test taker must choose the correct answer from the alternatives provided.
Specifications (also, test specifications): A description of the characteristics of a test, including what is tested, how it is tested, and details such as the number and length of forms and the item types used.

Standard setting: A measurement activity in which a procedure is applied to systematically gather and analyze human judgment for the purpose of deriving one or more cut scores for a test.
Standardized test: A standardized test is any form of test that (1) requires all test takers to answer the same questions, or a selection of questions from a common bank of questions, in the same way, and that (2) is scored in a "standard" or consistent manner, which makes it possible to compare the relative performance of individual students or groups of students.
Test form: Test forms are different versions of a test that are designed in the same format and used for different administrations.

Validation: The action of checking or proving the validity or accuracy of something. The validity of a test can only be established through a process of validation.

Validity: The degree to which a test measures what it is supposed to measure, or can be used successfully for the purpose for which it is intended. A number of different statistical procedures can be applied to a test to estimate its validity. Such procedures generally seek to determine what the test measures, and how well it does so.
ABSTRACT
Standard setting is an important phase in the development of an examination program, especially for a high-stakes test. Standard setting studies are designed to identify reasonable cut scores and to provide backing for this choice of cut scores. This study was aimed at investigating the validity of the cut scores established for a VSTEP.3-5 listening test administered in early 2017 to 1562 test takers by one institution permitted by the Ministry of Education and Training, Vietnam to design and administer the VSTEP.3-5 tests. The study adopted the current argument-based validation approach with a focus on three main inferences constructing the validity argument: (1) test tasks and items, (2) test reliability and (3) cut scores. The argument is that, in order for the cut-scores of the VSTEP.3-5 listening test to be valid, the test tasks and test items first needed to be designed in accordance with the characteristics specified in the specifications. Second, the listening test scores should be sufficiently reliable so as to reasonably reflect test-takers' listening proficiency. Third, the cut scores should be reasonably established for the VSTEP.3-5 listening test.

In this study, qualitative and quantitative methods were combined and structured to provide backing for or against the assumptions underlying each of these three inferences. With regard to the first and second inferences, an analysis of the test tasks and the test items was conducted, and test reliability was investigated to see whether it fell within the acceptable range. In terms of the third inference, concerning the cut scores of the VSTEP.3-5 listening test, the Bookmark standard setting method was implemented and its results were compared with the cut scores currently applied for the test. This study offers contributions in three areas. First, it supports the widely held notion of validity as a unitary concept and of validation as the process of building an interpretive argument and collecting evidence in support of that argument. Second, it contributes towards raising the awareness of the
importance of evaluating the cut scores of high-stakes language tests in Vietnam so that fairness can be ensured for all of the test takers. Third, this study contributes to the construction of a systematic, transparent and defensible body of validity argument for the VSTEP.3-5 test in general and its listening component in particular. The results of this study are helpful in providing informative feedback on the establishment of the cut scores for the VSTEP.3-5 listening test, the test specifications, and the test development process. Positive results can provide evidence to strengthen the reasonableness of the cut scores, the specifications and the quality of the VSTEP.3-5 listening test; negative results can suggest changes or improvements in the cut scores, the specifications and the design of the VSTEP.3-5 listening test.
ACKNOWLEDGMENTS

First of all, I would like to express my deepest gratitude to my supervisor, Professor Nguyễn Hòa, who has been a tremendous mentor for me.
I would also like to thank my co-supervisor, Fred Davidson, Professor Emeritus of the University of Illinois, for giving me the very first ideas, advice and guidance on how to start my Ph.D. study. His advice on both research and my career has been invaluable.
I am especially thankful to Professor Nathan T. Carr from California State University, Fullerton for conducting a series of workshops on designing and analyzing language tests at the University of Languages and International Studies, Vietnam National University - Hanoi. Being able to discuss my work with him has been invaluable for developing my ideas. His sharing of his knowledge and experience about language testing and assessment in general and standard-setting methods in particular has been a great contribution to the completion of my Ph.D. thesis.
I want to thank all of my colleagues at the University of Languages and International Studies, Vietnam National University - Hanoi, especially my post-graduate colleagues.

Words cannot express how grateful I am to my family. I want to say thank you to my parents and siblings for their encouragement during the time I conducted my study.
This thesis is dedicated to my beloved husband and my daughter for their love, endless support, encouragement and sacrifices throughout this experience.

As a final word, I would like to thank each and every individual who has been a source of support and encouragement and helped me to achieve my goal and complete my thesis work successfully.
CHAPTER I INTRODUCTION
This chapter introduces the topic of the study and presents the main reasons for choosing it. After that, the chapter presents the questions that are going to be addressed within the scope of the study. A brief overview of the organization of the thesis closes the chapter.
1 Statement of the problem
The term "cut scores" refers to the lowest possible scores on a standardized test, high-stakes test or other form of assessment that separate a test score scale into two or more regions, creating categories of performance or classifications of examinees. Clearly, if the cut scores are not appropriately set, the results of the assessment could come into question. For this reason, establishing cut scores for a test has been considered an important and practical aspect of standard setting. In his discussion of test validation, Kane (2006), besides emphasizing the importance of carefully defining the selected cut scores, highlights the evaluation of the reasonableness of the cut scores and states that the establishment of the cut scores is a complex endeavor, but the validation of the cut scores is even more difficult.
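To make the classificatory role of cut scores concrete, the following minimal Python sketch shows how a set of cut scores partitions a score scale into performance levels. The thresholds used here are hypothetical placeholders on a 0-10 scale, not the official VSTEP.3-5 cut scores:

# Hypothetical cut scores on a 0-10 scale (illustration only, not VSTEP.3-5 values),
# ordered from the highest level down so the first threshold reached wins.
CUT_SCORES = [(8.5, "Level 5 (C1)"), (6.0, "Level 4 (B2)"), (4.0, "Level 3 (B1)")]

def classify(score):
    """Return the highest performance level whose cut score the test taker reaches."""
    for cut, label in CUT_SCORES:
        if score >= cut:
            return label
    return "Below Level 3"

for s in (3.5, 5.0, 7.0, 9.0):
    print(s, "->", classify(s))

A score thus falls into exactly one region, and moving any threshold reclassifies every examinee near it, which is why the placement of cut scores carries such weight.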
According to the Standards for Educational and Psychological Testing (AERA et al., 1999, p.9), validity is defined as "the degree to which evidence and theory support the interpretation of test scores entailed by proposed uses of tests", and test validation is the process of making a case for the proposed interpretation and use of test scores. This case takes the form of an argument that states a series of propositions supporting the proposed interpretation and use of test scores and summarizes the evidence supporting these propositions (Kane, 2006). With regard to standard setting, since there are no "gold standards" and "true cut scores", to validate established cut scores means to provide evidence in support of the plausibility and appropriateness of the proposed cut score interpretation, and of their credibility and defensibility (Kane et al., 1999). Worldwide, although plenty of studies have been conducted on the validity of cut scores established for a test, these studies mainly aim at cross-validating two different methods of standard setting and comparing their results, rather than investigating the validity of cut scores as a whole.
In Vietnam, the National Foreign Language 2020 Project (NFL2020) was initiated in 2008 with the aim to "renovate the teaching and learning of foreign languages within the national education system" so that "… by 2020, most Vietnamese students graduating from secondary, vocational schools, colleges and universities will be able to use a foreign language confidently in their daily communication, their study and work in an integrated, multi-cultural and multi-lingual environment, making foreign languages a comparative advantage of development for Vietnamese people in the cause of industrialization and modernization for the country" (Decision 1400/QD-TTg). Language assessment is considered a major component of this project. The biggest achievement of this component is the emergence of the first-ever standardized test of English proficiency in Vietnam (the VSTEP.3-5 test). The test was officially released by the Ministry of Education and Training, Vietnam on 11th March 2015. The test aims at measuring English ability across a broad language proficiency continuum from level 3 to level 5, equivalent to levels B1 - C1 of the CEFR (Common European Framework of Reference for Languages). The cut scores of the VSTEP.3-5 test help to categorize test takers and certify them based on the levels they achieve. These cut scores are applied to all of the results of the VSTEP.3-5 tests, which are supposed to be strictly built in accordance with the test specifications.
At the moment, the results and certificates of the VSTEP.3-5 test are used by many companies as a requirement for a job position and by many educational institutions as a "visa" for learners to be accepted into or graduate from an academic program. For example, English teachers in primary schools and secondary schools throughout Vietnam are expected to obtain level 4 in English (equivalent to B2), while the requirement for those working in high schools, colleges and universities is level 5 (equivalent to C1). Besides, in order to graduate from university, English major students need to show evidence of their English at level 5 (equivalent to C1), while the requirement for non-English major students is level 3 (equivalent to B1). This shows that the uses of the VSTEP.3-5 test and the decisions that are made from the test cut scores have important consequences for the stakeholders. Like other high-stakes tests such as TOEFL, IELTS, PTE, or the Cambridge tests, in order to gain credibility and defensibility, more research needs to be conducted on the test in general and the validity of the VSTEP.3-5 cut scores in particular. However, so far there have been few studies on the VSTEP.3-5 test, and there is no validation research on the cut scores of the test.
Among the skills tested in high-stakes examinations, listening is the skill that researchers least often choose to study. According to Buck (2001), the assessment of listening ability is one of the least understood and least developed areas of language assessment. However, Buck (2001) also states that the assessment of listening ability is one of the most important aspects of testing. In terms of standard setting and cut score validation, the procedure for listening tests is also much more complicated and time-consuming. Nevertheless, for the author of this study, listening is a particularly interesting skill and thus deserves investigation.
All of the reasons mentioned above have motivated the author of this doctoral thesis to conduct a validation study on the cut scores of the VSTEP.3-5 listening test by using the argument-based validation model proposed by Kane (2013). A validity argument is a set of related propositions that, taken together, form an argument in support of an intended use or interpretation of the test scores. With the deeply-rooted desire to develop a good listening proficiency test in Vietnam, this research is expected to bring the author of this doctoral thesis a profound insight into this specific area of interest for her future professional development.
2 Objectives of the study
As mentioned, since the VSTEP.3-5 test is a newly developed high-stakes test, the need to standardize it is imperative. Thus, this doctoral research is conducted as an ongoing attempt at building a systematic, transparent and defensible body of validity argument for the VSTEP.3-5 test in general and its listening component in particular. By adopting the argument-based approach recommended by Kane (2013), the study aims at investigating the validity of the cut-scores of the VSTEP.3-5 listening test.
3 Significance of the study
This study will be a significant endeavor in building a systematic, transparent and defensible body of validity argumentation for the VSTEP.3-5 test in general and its listening component in particular. It will also contribute to the practice of validating the cut scores of a test by adopting the argument-based approach. Moreover, the results of this study will be helpful in providing a close look at the test specifications of the VSTEP.3-5 listening test, the test development process and the establishment of the cut scores of the test. These results can provide evidence either to support the reasonableness of the test specifications, the test development process and the establishment of the cut scores, or to suggest adjustments to them.
4 Scope of the study
In the current context of English testing and assessment in Vietnam, the cut scores of the VSTEP.3-5 listening test are pre-established and applied to all of the test forms, which are supposed to be strictly designed in accordance with the specifications. Thus, when a VSTEP.3-5 listening test is delivered by any authorized institution, it is supposed to have been constructed based on the specifications so that the cut scores can be interpreted in the preset way. Within the scope of this study, the focus is on the validation of the cut-scores of the VSTEP.3-5 listening test administered in early 2017 by one institution permitted by the Ministry of Education and Training, Vietnam (MOET) to design and administer the VSTEP.3-5 test (hereinafter referred to as the VSTEP.3-5 listening test). The argument is that, in order for the cut scores of the VSTEP.3-5 listening test to be valid, (1) the test tasks and test items must be designed in accordance with the test specifications; (2) the test scores must be reliable in measuring test-takers' proficiency; and (3) the cut scores must be reasonably established so that they are useful for making decisions about test takers' English listening competency. Thus, the three aspects of the test that will be taken into consideration in this study are: (1) the design of test tasks and test items; (2) the test reliability; and (3) the accuracy of the cut scores.
5 Statement of research questions
Based on the interpretive argument for the validity of the cut scores of the VSTEP.3-5 listening test, there is one main research question for the study, which is then clarified by three sub-questions.

The main research question is:

To what extent do the cut scores of the VSTEP.3-5 listening test provide a reasonable interpretation of the test-takers' listening ability?
The three sub-questions that help to clarify the main research question are:
1 To what extent are the test tasks and the test items of the VSTEP.3-5 listening test properly designed in accordance with the specifications?
2 To what extent are the VSTEP.3-5 listening test scores reliable in measuring the test takers’ English proficiency?
3 To what extent are the cut scores reasonably established for the VSTEP.3-5 listening test?
6 Organization of the study
The study consists of six chapters as follows:
Chapter I: Introduction
Chapter II: Literature Review
Chapter III: Methodology
Chapter IV: Data analysis
Chapter V: Findings and Discussions
Chapter VI: Conclusion
Chapter I is aimed at introducing the topic of the study and presenting the main reasons for the author to implement this project.

Chapter II provides the theoretical and empirical background, with a critical discussion of the relevant concepts, models, and theories for the study.

Chapter III presents the context of the study and how the study is conducted, together with a review of each selected method.

Chapter IV presents the data analysis of the study.

Chapter V presents the findings of the study and discusses these results.

Chapter VI has two aims. First, it specifies the limitations of the study. Second, it suggests some directions for future studies.
CHAPTER II LITERATURE REVIEW
This chapter reviews theories and research which are fundamental to the current study. The first part of this chapter starts with a presentation of how the concept of validity has changed over the years. Then, it discusses validation approaches and procedures for articulating a validation argument before describing different kinds of evidence that can be collected in support of a validation argument. The second part of this chapter focuses on standard setting, including definitions of the concept, an overview of different standard setting methods, a discussion of common elements in standard setting, and issues related to standard setting validation. The third part of this chapter first addresses issues in testing listening and then presents the framework for analyzing the listening test tasks. The fourth part of this chapter describes the important statistical analyses for a language test, including item difficulty, item discrimination and test reliability. Finally, a review of related validation studies ends the chapter.
1 Validation in language testing
1.1 The evolution of the concept of validity
Validity is considered one of the most important concepts in psychometrics, but, as Sireci (2009) states, validity has taken on many different meanings over the years. In the early 20th century, validity was primarily defined in terms of the correlation of test scores with some other criteria, and tests were described as valid for anything they correlated with (Kelly, 1927; Thurstone, 1932; Bingham, 1937). Another theoretical definition of validity was proposed by Garrett (1937), who defined validity simply as the degree to which "a test measures what it is supposed to measure".

The Technical Recommendations (APA, 1954) proposed four types of validity – namely, predictive validity, concurrent validity, content validity and construct validity. In contrast, Anastasi (1954) organized the presentation of validity in terms of face validity, content validity, factorial validity and empirical validity before adopting the framework promulgated in the 1954 Standards. However, the framework was later presented in terms of content validity, criterion-related validity and construct validity (Anastasi, 1982; Cronbach, 1955, 1960). In this framework, predictive validity and concurrent validity were seen as criterion-oriented validity, more space was allocated to construct validity, and content validity was mentioned in an additional extended discussion of proficiency tests and construct validity.
In the Standards for Educational and Psychological Testing (AERA et al., 1985), validity is defined as follows: "The concept refers to the appropriateness, meaningfulness, and usefulness of the specific inferences made from test scores. Test validation is the process of accumulating evidence to support such inferences. A variety of inferences may be made from scores produced by a given test, and there are many ways of accumulating evidence to support any particular inference. Validity, however, is a unitary concept" (APA, 1985, p.9). It was clearly stated that "the inferences regarding specific uses of a test are validated, not the test itself" (APA, 1985, p.9).
The definition in the 1985 Standards was well explained and elaborated by Bachman (1990). According to Bachman (1990), "the process of validation starts with the inferences that are drawn and the uses that are made of scores. These uses and inferences dictate the kinds of evidence and logical argument that are required to support judgments regarding validity. Judging the extent to which an interpretation or use of a given test score is valid thus requires the collection of evidence supporting the relationship between the test score and an interpretation or use" (Bachman, 1990, p.243). Messick (1989) defines validity as a unitary concept and describes validity "as an integrated evaluative judgment of the degree to which empirical evidence and theoretical rationales support the adequacy and appropriateness of inferences and actions based on test scores" (p.13). This theoretical conceptualization of validity as a unitary concept has been widely endorsed by the measurement profession as a whole (Shepard, 1993; AERA et al., 1999; Messick, 1989; Kane, 1992, 2006, 2009, 2013) and has a strong influence on current practice in educational and psychological testing in many parts of the world. In this study, this conceptualization of validity is adopted as the theoretical foundation for understanding different aspects of validity.
1.2 Aspects of validity

First, validity is a property of the interpretation and use of test scores rather than of the test itself; it can be quite reasonable to talk about the validity of a test only if an interpretation or use has already been adopted, explicitly or implicitly.
Second, validity is a matter of degree, and it may change over time as the interpretation and use develop and as new evidence accumulates. Since a particular test score can never provide a perfectly accurate measure of a given ability, and the validity of interpretation and use always depends on the logic of the interpretive argument and the strength of the evidence provided in support of this argument, we can never prove that our interpretation and use are valid. Thus, it is better to provide evidence that the intended interpretation and use are more plausible than other interpretations that might be offered.
Third, validity is always specific to a particular use or interpretation. When a test is developed, we always have a particular set of interpretations and uses in mind. These intended interpretations will depend on how the construct or ability to be measured is defined. The way we define a given ability may be different from the way it might be defined for different purposes or different groups of test-takers. Thus, the scores of a particular test will not necessarily be appropriate for other situations or other purposes.
Fourth, validity is a unitary concept. We often hear about different types of validity, such as content validity, concurrent validity, predictive validity or construct validity. However, as defined above, validity is a single quality of the ways in which we use a particular test. Many different kinds of evidence, such as test content analysis or correlations with other measures of ability, can be provided in support of the intended interpretation and use. For this reason, the evidence needed for validation depends on the interpretation and use, and different interpretations or uses will require different kinds and different amounts of evidence for their validation.
Fifth, validity involves an overall evaluative judgment. A validation argument typically includes several parts and is supported by different kinds of evidence, none of which by itself is sufficient to justify the intended inferences and uses of a particular test. Thus, when evaluating the validity of inferences and uses, it is important to consider the interpretive argument in its entirety, as well as all the supporting evidence.
In summary, given the aspects of validity described above, investigating the validity of test use, which is called validation, can be seen as the process of building an interpretive argument and collecting evidence in support of that argument (Kane, 1992, 2006, 2013). This validation approach was termed the argument-based approach to validation by Kane (1992) and is widely applied in validation practice nowadays.
1.3 Argument-based approach to validation
Kane (1992, 2006, 2013) recommends that researchers follow an argument-based approach to make the task of validating inferences from test scores both scientifically sound and manageable. In this approach, the validator builds an argument that focuses on defending the interpretation and use of test scores for a particular purpose and is based on empirical evidence supporting that particular use. According to Kane (2006), validation consists of two types of arguments: an interpretive argument and a validity argument. The interpretive argument is built upon a number of inferences and assumptions that are meant to justify score interpretation and use, whereas the validity argument provides an evaluation of the interpretive argument in terms of how reasonable and coherent it is as well as how plausible its assumptions are (Cronbach, 1988). To be more specific, the establishment of an interpretive argument involves (1) the determination of the inferences based on test scores, (2) the articulation of assumptions, (3) the decision on the sources of evidence that can support or refute those inferences, (4) the collection of appropriate data and (5) the analysis of the evidence. In order for the interpretation or use of test scores to be valid, all of the inferences and assumptions inherent in that interpretation or use have to be plausible.
Kane (1992, 2006, 2013) cites Toulmin's (1958, 2003) framework as a guide for applying his approach to validation. Toulmin's (1958) framework essentially requires that a chain of reasoning be established that is able to build a case towards a conclusion, which in this case would be to determine the plausibility and reasonableness of score interpretation and use. Figure 2.1 shows Toulmin's (1958, 2003) argument structure, which is built on several components, including the grounds, claim, warrant, backing, and rebuttal.
Figure 2.1: Model of Toulmin’s argument structure (1958, 2003)
In terms of test score interpretation and use, the claim of an argument is the conclusion drawn about a test-taker based on test performance, whereas the grounds serve as the data or observations upon which the claim is based. An example can be illustrated by the case given by Mislevy et al. (2003). One may make the claim that a student's English speaking abilities are inadequate for studying at an English-medium university based on the grounds/observation that the student did not perform well in the final oral examination in an intensive English class. The performance that the teacher observed when the student spoke in front of the class on the assigned topic was characterized by hesitations and mispronunciations.
In the interpretive argument underlying the claim about the student's readiness for university study, the inference linking the grounds to the claim is not given, and therefore justification is needed in the form of a warrant (or assumption). Figure 2.1 shows the interpretive argument as consisting of one inference, which is authorized by a warrant. The warrant in Toulmin's model is regarded as a rule, principle, or established procedure that is meant to provide justification for the inference connecting the grounds to the claim. In the example provided above, the warrant is the generally held principle that hesitations and mispronunciations are characteristic of students with low levels of English speaking ability, who would have difficulty at an English-medium university.
The warrants in turn need backing (or evidence), which comes in the form of theories, research, data, and experience. In relation to the example provided above, backing might be drawn from the teacher's training and previous experience with non-native speakers at an English-medium university. Stronger backing could be obtained by having the student's speaking performance rated by another teacher and then showing the agreement between the two raters.
Finally, a rebuttal also acts as a link between the grounds and the claim but serves to weaken the initial argument by providing evidence or a possible explanation that may call the warrant into question. Going back to the previous example, a possible rebuttal may be that the assigned topic for the oral presentation required the student to use highly technical and unfamiliar vocabulary. This rebuttal would serve to weaken the inference connecting the grounds – that the oral presentation contained many hesitations and mispronunciations – and the claim that the student's speaking ability was at a level that would not allow him to succeed at an English-medium university.
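To summarize the relationships just described, the following Python sketch represents the components of a Toulmin argument as a simple data structure and populates it with the speaking-test example from Mislevy et al. (2003). This is a schematic aid only, not part of Toulmin's or Kane's formal apparatus:

from dataclasses import dataclass, field
from typing import List

@dataclass
class ToulminArgument:
    grounds: str                # the observation on which the claim rests
    claim: str                  # the conclusion drawn from the grounds
    warrant: str                # the rule or principle licensing grounds -> claim
    backing: List[str] = field(default_factory=list)    # evidence supporting the warrant
    rebuttals: List[str] = field(default_factory=list)  # conditions weakening the inference

argument = ToulminArgument(
    grounds="The final oral exam contained many hesitations and mispronunciations.",
    claim="The student's speaking ability is inadequate for an English-medium university.",
    warrant="Hesitations and mispronunciations characterize low English speaking ability.",
    backing=["The teacher's training and experience with non-native speakers",
             "Agreement between two independent raters of the performance"],
    rebuttals=["The assigned topic required highly technical, unfamiliar vocabulary."],
)
print(argument.claim)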
As can be seen, these components are all connected with each other and are essential for establishing an inferential connection between the claims and grounds. Using Toulmin's framework, test score interpretation and use can be seen as an interpretive argument (Kane, 2001, 2006), involving a number of inferences, each supported by a warrant. To validate the interpretive argument for the proposed interpretation and use of test scores, it is necessary to evaluate the clarity, coherence, and completeness of the interpretive argument and to evaluate the plausibility of each of the inferences and assumptions in it. The proposed interpretation and use of the test scores are supposed to be explicitly stated in terms of a network of inferences and assumptions, and then the plausibility of the warrants for these inferences can be evaluated using relevant evidence. The validation can be simple if the interpretation and use are simple and limited, involving a few plausible inferences. In contrast, if the interpretation and use are more ambitious, involving many inferences and assumptions, the validation will be more demanding.
The argument-based approach to validation has several advantages. First, it requires that the evidence collected for validation be consistent with the interpretation and use being proposed.
Second, the argument-based approach to validation provides definite guidelines for systematically evaluating the validity of the proposed interpretation and use of test scores. In other words, this approach provides the researcher with a clear place to begin and a direction to follow; as a result, it helps the researcher to focus serious attention on validation.
Third, in the argument-based approach, the evaluation of the interpretive argument does not lead to any absolute decision about validity, but it does provide a way to measure progress (Kane, 2006). In other words, this approach views validation as an on-going and critical process instead of just answering either "valid" or "invalid". Because the most problematic inferences and their supporting assumptions are checked and are either supported by the evidence or at least rendered less problematic, the reasonableness of the interpretive argument as a whole can improve.
Fourth, the argument-based approach to validation may increase the chance that research on validity will lead to improvements in measurement procedures. Since the argument-based approach focuses attention on specific parts of the interpretive argument and on specific aspects of measurement procedures, evidence which indicates the existence of a problem, such as inadequate coverage of content, points directly to the part of the procedure that needs improvement.
2 Standard setting for an English proficiency test
2.1 Definition of standard setting
Probably the most difficult and controversial part of testing is standard setting. In this study, standard setting refers to the establishment of cut scores for a test, i.e., determining the points on the score scale that separate examinees into performance categories such as pass/fail. However, in order to have a complete understanding of the concept, it is important to become familiar with the theoretical foundations of the term.
Cizek (1993) suggests an elaborate and theoretically grounded definition of standard setting. He defines standard setting as "the proper following of a described, rational system of rules or procedures resulting in the assignment of a number to differentiate between two or more states or degrees of performance" (Cizek, 1993, p.10). This definition highlights the procedural aspect of standard setting and draws on the legal framework of due process and traditional definitions of measurement. However, the definition suggested by Cizek (1993) suffers from at least one deficiency in that it addresses only one aspect of the legal principle known as due process: it covers procedural due process but not substantive due process, that is, whether the results of following the procedure lead to a decision or result that is fundamentally fair. The notion of fairness is, to some extent, subjective. The aspect of fundamental fairness is related to what has been called the "consequential basis of test use" in Messick's (1989, p.84) explication of the various sources of evidence in support of the use and interpretation of test scores.
Kane (1994) provides another definition of standard setting that highlights the conceptual nature of the endeavor. He states that "it is useful to draw a distinction between the passing score, defined as a point on the score scale, and the performance standard, defined as the minimally adequate level of performance for some purpose… The performance standard is the conceptual version of the desired level of competence, and the passing score is the operational version" (Kane, 1994, p.426). For this reason, as stated, this study uses the terms "standard setting" and "cut scores" interchangeably.
Another key concept underlying Kane's definition of standard setting is the concept of inference. As discussed in the previous part, an inference is the interpretation, conclusion, or meaning that is intended to be drawn about an examinee's underlying, unobserved level of knowledge, skill, or ability. Thus, from this perspective, validity refers to the accuracy of the inferences made about the examinee, usually based on observations of the examinee's performance. The assumption underlying the inference of standard setting is that the passing score creates meaningful categories that distinguish between individuals who meet some performance standard and those who do not.

Thus, for this study, the primacy of test purpose and the intended inference or test score interpretation is essential to understanding the definition of standard setting. To validate the standard setting, or the establishment of the cut scores, is to evaluate the accuracy of the inferences that are made when examinees are classified based on application of the cut scores.
Finally, in wrapping up the definition of standard setting, it is important to note what standard setting is not. According to Cizek and Bunch (2007, p.18), standard setting "does not seek to find 'true' cut scores that separate real, unique categories on a continuum of competence". Since there is no external truth for all the things we care about, and no set of minimum competences that are necessary and sufficient for life success, all standard setting is judgmental. Cizek and Bunch (2007) also state that, because standard setting necessarily involves human opinions and values, it can to some degree also be seen as a combination of technical, psychometric methods and policy making. Seen in this way, standard setting is a procedure that enables participants to bring their judgments to bear, by using a specified method, in such a way as to translate the policy positions of authorizing entities into locations on a score scale.
2.2 Overview of standard-setting methods
In recent years, more attention has been paid to establishing the credibility of existing standard-setting methods and to investigating new methods. Many of the new approaches have been developed to present judges with meaningful activities and to better accommodate the changing nature of assessments. Cizek & Bunch (2007) summarize 18 different methods of cut-score establishment. One common element of all these methods is that they involve, to one degree or another, human beings expressing informed judgments based on the best evidence available to them, and these judgments are summarized in some systematic way, typically with the aid of a mathematical model, to produce one or more cut scores. Cizek & Bunch (2007) state that each standard-setting method combines art and science and that different methods may yield different results. For this reason, it is hard to say that the cut scores established by one particular method are always superior to those set by another.
Trang 40Method Overview of panelist’s task
Methods involving ratings of test items and scoring rubrics
Angoff Panelists estimate the probability that the borderline examinee will
answer each multiple-choice item correctly
Ebel Panelists estimate each item along two dimensions – difficulty and
relevance For each combination of the dimensions, panelists estimate the percentage of items that the borderline examinee will answer correctly
Nedelsky Panelists estimate for each item the number of distractors that they
think the borderline examinee would be able to rule out as incorrect; the reciprocal of the number of distractors not ruled out, plus one (for the correct answer), is taken as the probability that the borderline examinee will answer the item correctly by resorting to random
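As a numerical illustration of how the judgmental ratings described in this overview become cut scores, the following Python sketch computes an Angoff cut score (the mean across panelists of each panelist's summed item probabilities) and a Nedelsky cut score (the sum across items of the reciprocal of the options remaining after elimination). All ratings are invented for illustration and are not data from the VSTEP.3-5 study or any actual panel:

# Angoff: each panelist estimates P(correct) for the borderline examinee on each item.
angoff_ratings = [
    [0.6, 0.8, 0.5, 0.7],  # panelist 1, items 1-4 (hypothetical values)
    [0.5, 0.9, 0.4, 0.6],  # panelist 2
]
# Per-panelist cut = sum of item probabilities; panel cut = mean across panelists.
angoff_cut = sum(sum(ratings) for ratings in angoff_ratings) / len(angoff_ratings)

# Nedelsky: with k options and r distractors ruled out, (k - r) options remain
# (the surviving distractors plus the key), so P(correct by guessing) = 1 / (k - r).
options_per_item = [4, 4, 4, 4]   # number of options for each item
ruled_out = [2, 3, 1, 2]          # distractors one panelist says can be eliminated
nedelsky_cut = sum(1 / (k - r) for k, r in zip(options_per_item, ruled_out))

print(f"Angoff cut score:   {angoff_cut:.2f} out of 4 raw points")
print(f"Nedelsky cut score: {nedelsky_cut:.2f} out of 4 raw points")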