A study on the validity of a Vietnamese-developed test of English reading proficiency = Nghiên cứu xác trị bài thi Đọc hiểu của một định dạng bài thi tiếng Anh của Việt Nam
INTRODUCTION
Rationale of the study
One of the most important reasons for choosing the topic of the current study is the central position of validity in the theory and practice of language testing and assessment (AERA, APA, & NCME, 1985; Bachman, 1990). Throughout history, tests have played an indispensable and irreplaceable role in every educational context. Since the first half of the twentieth century, test validity has increasingly become the core issue and the basic quality of a test, and theoretical exploration and empirical research on test validity have never stopped (Lado, 1961; Valette, 1977; Heaton, 1975; Harrison, 1983; Finocchiaro & Sako, 1983; Henning, 1987; Hughes, 1989; Messick, 1988, 1989, 1995; Bachman, 1990; Borsboom, Mellenbergh, & Heerden, 2004; Weir, 2005; Chapelle, 2012; AERA et al., 2014). It is therefore reasonable to continue discussing issues of validity in language testing and assessment, and to explore whether suitable theoretical frameworks exist to support any given test and whether sufficient evidence can be generated and collected to operationalize those frameworks.
In Vietnam, one of the significant achievements in English language testing and assessment is the development of the Vietnamese Standardized Test of English Proficiency (VSTEP). VSTEP was independently developed by a team of researchers in Vietnam, and the Ministry of Education and Training officially issued the decision on the format of the test in March 2015 (Nguyen Thi Quynh Yen, 2018). Since then, VSTEP has been in effect countrywide for eight years. VSTEP.3-5 was designed to assess English proficiency from level 3 to level 5 according to the Common European Framework of Reference for Languages for Vietnamese learners (CEFR-VN), or from level B1 to level C1 according to the Common European Framework of Reference for Languages (CEFR), for test takers in different majors and professions. However, VSTEP.3-5 is a newly developed test and has not yet gained wide national and international recognition; validation is therefore crucial for its development. Basically, VSTEP.3-5 is administered in two forms, paper-based and computer-based, and the current study focuses only on investigating the validity of the paper-based form.
Reading is a remarkable activity of the human brain and a basic skill for schooling. Most of the information people receive every day comes from reading. There is no doubt that an accurate test of this basic competence has great implications for all stakeholders. However, testing reading is by no means an easy task. In fact, it has been one of the most controversial issues in language testing and assessment because, unlike writing or speaking, reading lacks an overt process that can be directly observed: writing and speaking can be performed by test takers in concrete ways that allow researchers to scrutinize various indicators. The traditional approach to studying the validity of reading tests is to observe test takers' responses through indirect testing methods and infer the nature of their reading processes (Kong, 2016). However, validating reading tests from this perspective alone is not enough; multi-dimensional sources of evidence from both test products and test processes are needed to generate new and valuable findings. All of the reasons mentioned above have prompted the researcher to conduct the current study.
Objective of the study
The objective is to build a systematic, transparent and reasonable section of the validity argument for the VSTEP.3-5 reading test. Specifically, framed within an argument-based approach combining qualitative and quantitative analyses, this study aims to investigate the validity of the VSTEP.3-5 reading test comprehensively from both product-based and process-based perspectives.
It is important to pay special attention to the inferences that are most central to the proposed interpretations and uses of test scores. In addition, test product evidence alone is not sufficient to support the claim that a test is valid; emphasis should also be placed on evidence of test takers' test-taking processes. In this way, one can prioritize the kinds of evidence needed to justify the most relevant inferences in the validity argument. As stated, the current study aims to provide evidence for the assumptions corresponding to two inferences in the proposed interpretive argument for the VSTEP.3-5 reading test. Thus, the main research questions structured for this study are as follows:
1. To what extent do the test tasks and test items of the VSTEP.3-5 reading test meet the requirements of the test specifications?
2. To what extent do the test takers' genders and majors affect their reading scores?
3. To what extent are the VSTEP.3-5 reading test scores adequately reliable in measuring the test takers' English proficiency?
4. What reading processes are applied to answer VSTEP.3-5 reading test items correctly? Based on think-aloud protocols, to what extent do the expected reading processes correspond with the processes actually engaged in by test takers while taking the VSTEP.3-5 reading test?
5. How do successful and unsuccessful test takers differ in processing the different types of VSTEP.3-5 reading test items in terms of their eye movements? Based on eye tracking, to what extent does the VSTEP.3-5 reading test discriminate among test takers at different proficiency levels in terms of cognitive reading processes?
Among them, the first three questions provide evidence supporting the assumptions proposed in the generalization inference of the interpretive argument for the VSTEP.3-5 reading test, and thus belong to the investigation of validity from the product-based perspective. Specifically, the first question verifies the assumptions that the configuration of tasks on the VSTEP.3-5 reading measure is appropriate for the intended interpretation and that the test items are designed with the difficulty predetermined in the test specifications. The second question provides evidence for the assumption that test takers' genders and majors do not affect their reading scores. The third question examines the assumption that the reliability of the VSTEP.3-5 reading test is in a good range. The last two questions offer evidence for the assumptions proposed in the explanation inference of the interpretive argument, and thus belong to the investigation of validity from the process-based perspective. Specifically, the fourth question supports the assumption that the reading processes engaged in by test takers vary according to theoretical expectations. The last question inquires into the assumptions that successful and unsuccessful test takers differ in processing the different types of VSTEP.3-5 reading test items in terms of their eye movements, and that the VSTEP.3-5 reading test can discriminate among test takers at different proficiency levels in terms of cognitive reading processes.
Scope of the study
This study was conducted in the context of COVID-19, which increased the difficulty of collecting data. Because both Vietnam and China had severe outbreaks at that time, it was not feasible for the researcher to carry out the whole study in Vietnam. Thus, data such as the test takers' scores on the VSTEP.3-5 reading test, think-aloud protocols and eye tracking were collected in China for this study. The research sites were two universities in China: one a comprehensive university, the other a normal university. The two universities are major public tertiary education institutions in China with a high reputation for quality teaching and learning, where English is taught for both general and specific purposes. Both offer undergraduate and postgraduate programs to more than 20,000 students in different academic disciplines. Admittedly, VSTEP.3-5 aims at assessing English proficiency from level 3 to level 5 according to the CEFR-VN for Vietnamese learners. But as its designers stated, it aims to be both nationally and internationally recognized; thus the results of this study may indicate that the test can also be taken by international students.
Clarifications of the key terms
If the key terms in the current study are not clarified first, the investigation cannot be conducted effectively. Validity in the current study refers to the degree to which evidence and theory support the interpretations of test scores entailed by the proposed uses of a test, as defined in the Standards for Educational and Psychological Testing (AERA, APA, & NCME, 2014). Furthermore, validity as defined here relates not only to test products but also to the processes that give rise to those products; understanding the characteristics being measured requires an in-depth understanding of the cognitive processes test takers engage in to complete the test tasks. The product-based perspective in the current study refers to relying mainly on evidence collected from assessment products, such as observed test scores, to validate a test. The opposite term, the process-based perspective, refers to relying mainly on evidence collected from test-taking processes to validate a test. In order to draw clearer conclusions about the validity of the test, this study combines evidence from test products with experimental evidence of the specific cognitive processes test takers engage in during testing.
Context of the study
The following aspects, including an introduction of VSTEP.3-5 and an introduction of the VSTEP.3-5 reading test, provide the necessary information for the context of the study.
VSTEP is a 6-level program for Vietnamese students which is considered a breakthrough solution in renewing and evaluating foreign language competencies in Vietnam. The VSTEP.3-5 has been officially implemented nationwide in Vietnam since March 2015, after being developed by a professional scientific research team at the University of Languages and International Studies, Vietnam National University-Hanoi. VSTEP.3-5 is the first English proficiency test developed in Vietnam according to the rigorous test-development process of the Association of Language Testers in Europe (ALTE), aiming to assess levels 3 to 5 of English proficiency according to the CEFR-VN, or levels B1 to C1 according to the CEFR.
By 2021, the Ministry of Education and Training had authorized several universities to administer VSTEP.3-5: University of Languages and International Studies, Vietnam National University-Hanoi; University of Foreign Studies-University of Da Nang; University of Foreign Studies-Hue University; Ho Chi Minh City University of Education; Thai Nguyen University; Can Tho University; Hanoi National University of Education; Hanoi University; Saigon University; Tra Vinh University; Van Lang University; and Vinh University. Basically, VSTEP.3-5 is organized in two forms: paper-based and computer-based. As of August 2020, University of Languages and International Studies, Vietnam National University-Hanoi was still organizing paper-based tests, while the rest of the universities administered the test on computers.
VSTEP.3-5 is widely used as a standard to measure the comprehensive English ability of Vietnamese English learners. According to the National Foreign Language 2020 Project (NFL 2020) and related documents, the test takers of VSTEP.3-5 include several categories as follows: (1) VSTEP.3 (level B1) is for students preparing to defend their master's degree or to apply for a doctoral degree, non-English-major students of universities and colleges, those preparing for post-graduate entrance examinations, and those preparing for the civil service examination or currently working as civil servants in major and professional positions. (2) VSTEP.4 (level B2) is for English teachers at primary and lower secondary schools, those preparing for a doctoral degree, high-level students of Vietnam National University, Hanoi, and senior specialists. (3) VSTEP.5 (level C1) is for high school English teachers, non-specialized English teachers at universities and colleges, and students on strategic missions of Vietnam National University, Hanoi.
VSTEP.3-5 includes four subtests: a listening test, a reading test, a writing test and a speaking test. Table 1.1 presents the format of each subtest of VSTEP.3-5.
Table 1.1. Format of each subtest of VSTEP.3-5

Listening
- Time: 40 minutes, including time to transfer answers to the answer sheet
- Task types: Test takers are required to listen to short announcements/instructions, extended dialogues, lectures and talks, then answer multiple-choice questions in the test paper
- Objectives: To examine test takers' listening proficiency ranging from level 3 to level 5 (i.e., level B1 to C1 based on the CEFR): listening for details, listening for gist, listening for attitudes, and inferring meaning

Reading
- Time: 60 minutes, including time to transfer answers to the answer sheet
- Task types: Test takers are required to read 4 passages on different topics, with levels ranging from level 3 to level 5; the total number of words is between 1,900 and 2,050, and there are 10 questions for each passage
- Objectives: To examine test takers' reading proficiency ranging from level 3 to level 5 (i.e., level B1 to C1 based on the CEFR): reading for details, reading for main idea, reading for attitudes, inferring meaning, and understanding vocabulary in context

Writing
- Task types: Task 1: Letter/email writing (at least … words, 1/3 of the test score); Task 2: Essay writing (at least 250 words, 2/3 of the test score)
- Objectives: To examine test takers' interactive and productive writing skills

Speaking
- Task types: Solution discussion: test takers are provided with a situation and three suggested solutions and must give their own opinion of the best solution among the three options; test takers also give a talk about a topic using suggested ideas or their own ideas, with some follow-up questions when they finish their talk
- Objectives: To examine sub-skills of speaking, including interacting, discussing and presenting a topic
1.5.1.5 Cut-off scores of VSTEP.3-5
The scores of each subtest in VSTEP.3-5 are used to assess each language skill of the test takers respectively. The final score is the average of the four subtests' scores. The cut-off scores are preset on the VSTEP.3-5 model, and the results are applied to all tests on the assumption that the tests are strictly designed according to the test specifications. The following table shows the cut-off scores of VSTEP.3-5.
Table 1.2. Cut-off scores of VSTEP.3-5
1.5.2 Introduction of VSTEP.3-5 reading test
1.5.2.1 Test purpose of VSTEP.3-5 reading test
The VSTEP.3-5 reading test is a subtest of VSTEP.3-5. The test is designed to assess the English reading competence of Vietnamese English learners. Generally speaking, the VSTEP.3-5 reading test targets different reading subskills, at difficulty levels from 3 to 5, according to the test specifications: reading for details, reading for main idea, reading for attitudes, inferring meaning, and understanding vocabulary in context. The results are used to infer the reading competence of test takers. The inference from the reading test, together with the results of the other subtests (listening, writing and speaking), is used to determine whether a test taker is below Level 3 (B1) or at Level 3 (B1), Level 4 (B2) or Level 5 (C1).
1.5.2.2 Test format of VSTEP.3-5 reading test
The VSTEP.3-5 reading test consists of four different passages, each of which is followed by 10 multiple-choice questions. For these 40 questions, test takers are supposed to choose the one best answer to each question, then find the question number on the answer sheet and fill in the space corresponding to the selected answer letter. Test takers need to answer all questions following a passage on the basis of what is stated or implied in that passage. Test takers have 60 minutes to answer all the questions, including the time to transfer the answers to the answer sheet. The lengths of the four passages vary, ranging from 400 to 550 words, with the total number of words being approximately 1,900 to 2,050. The passages of the VSTEP.3-5 reading test cover a wide range of texts, roughly divided into the following categories: daily life readings, natural science articles, social science articles, articles from other disciplines, specialized academic or professional publications, literary writings, and background information about Asia, the Association of Southeast Asian Nations (ASEAN) or Vietnam.
1.5.2.3 Proficiency level description of VSTEP.3-5 reading test
The VSTEP.3-5 reading test focuses on evaluating English language learners' reading proficiency from level 3 to level 5. Table 1.3 presents detailed information about the proficiency levels of the VSTEP.3-5 reading test.
Table 1.3. Proficiency level description of VSTEP.3-5 reading test
Test takers at level 5 can understand in detail lengthy, complex texts commonly encountered in social, professional and academic life, and can perceive subtle details including the attitude of the writer or the related person and the explicit or implicit opinions expressed. Specifically, test takers at level 5 can perceive the purpose/probability of details and/or arguments given by the author or speaker, and are quite sensitive to cultural elements of English embedded in the text. Test takers can appreciate a variety of expressions and structures that are highly idiomatic. Test takers can analyze the structure of complex texts to see how information is organized and can predict logical developments of the text.
Test takers at level 4 can independently read newspapers, articles and reports on a wide range of career fields, as well as specialized academic or professional publications in their field. Specifically, test takers at level 4 have a wide active reading vocabulary but may still experience some difficulty with less familiar idiomatic expressions. New words may not interfere much with their reading comprehension because test takers can guess the meaning based on the context or know how to skip new vocabulary and still understand the content. Test takers can skim long and complex texts for content and understand ideas expressed in texts. Test takers can also skim the texts for general information and understand rephrased information or implicit details such as the author's attitude, opinions and style, though with some difficulty.
Test takers at level 3 can read explicit factual texts on subjects related to their field of interest. Specifically, test takers at level 3 can understand and recognize important information in explicit texts on familiar topics. Test takers can skim long texts based on linguistic cues such as means of connection, words of reference, etc. to locate and find information; they can gather information from different parts of a text or from different texts to perform a particular task, and perceive explicit information that has been rephrased. Test takers can recognize the structure of arguments presented in the text, but not at a specific level.
Significance of the study
The current study contributes theoretically and practically to the field of research on the VSTEP.3-5 reading test. Test validation has so far relied mainly on evidence derived from test products, and the collected product data have usually been analyzed through statistical methods; test takers' cognitive processes under testing conditions have seldom been tapped into. Theoretically, this study explores the possibility of test validation from both product-based and process-based perspectives, providing a referable paradigm for future studies seeking to validate VSTEP.3-5 tests based on a certain theoretical framework. In other words, this study may strengthen the understanding of the current theoretical basis for test validation. Practically, this study may provide references for the test designers to refine the design of the VSTEP.3-5 reading test, aiming to improve its validity. Meanwhile, this study may help English teachers and learners better understand the VSTEP.3-5 reading test, and can therefore benefit teachers in effective language teaching and learners in test preparation.
Overall structure of the thesis
The whole dissertation consists of seven chapters. Chapter I has provided a brief introduction to the whole study. The rationale of the study has first been presented, with the main reasons that motivated the conduct of this study. After that, the objective of the study, the scope of the study and clarifications of key terms have been stated. Besides, the context of the study, the significance of the study and the overall structure of the thesis have also been clearly described in this chapter.
Chapter II reviews the literature related to both the subjects and the approaches of the current study. In the first part, the nature of second language (L2) reading and the construct of L2 reading are discussed. The differences between L2 reading and first language (L1) reading are mentioned in the section on the nature of L2 reading. The construct of L2 reading is discussed from the processing perspective, the task perspective and the reader purpose perspective, which are particularly relevant to the purpose of the current study. Because this study aims to validate the VSTEP.3-5 reading test not only from the perspective of test products but also from the perspective of test-taking processes, multidimensional evidence should be collected. In the second part, Item Response Theory (IRT) and Rasch models are reviewed for the test-product perspective; for the test-taking process perspective, think-aloud protocols and eye tracking are discussed.
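For orientation, the dichotomous Rasch model reviewed in that part can be stated compactly. Under conventional notation (the symbols below are the standard ones, not fixed by this chapter), the probability that test taker p answers item i correctly depends only on the difference between the person's ability θ_p and the item's difficulty b_i:

```latex
P(X_{pi} = 1 \mid \theta_p, b_i) = \frac{\exp(\theta_p - b_i)}{1 + \exp(\theta_p - b_i)}
```

When θ_p = b_i the probability is exactly 0.5, which is why Rasch item difficulties and person abilities can be placed on the same logit scale.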
Chapter III starts with a brief introduction to the concept of validity. It then introduces five representative validation frameworks under the unitary view of validity, with the advantages and disadvantages of each, which lays a sound foundation for developing the validation framework of the current research, namely an interpretive argument. The chapter then discusses the justification for the use of the interpretive argument and finally elaborates the articulation of an interpretive argument for the VSTEP.3-5 reading test.
Chapter IV is devoted to research methodology. In this chapter, the design of the current study is elaborated in several aspects: the ontology and epistemology underlying the design, namely the mixed-methods approach, which combines the strengths and compensates for the weaknesses of quantitative and qualitative research methods; the convergent mixed-methods design; the participants; the instruments; the data collection procedures; the tools for data analysis; and the data analysis procedures of the study.
Chapter V presents findings and discussions on the validity of the VSTEP.3-5 reading test from the product-based perspective. In this chapter, findings on the test tasks are first generated by comparing them with the test specifications in terms of the characteristics of the input and the characteristics of the expected response, combined with the answers from the expert interview. Secondly, findings on the test items explore the role each item plays for test takers and how appropriately the items are designed in line with the difficulty levels predetermined in the test specifications. Thirdly, findings on differential item functioning (DIF) by gender and major are investigated to determine whether or not the genders and majors of test takers affect their reading scores. Fourthly, findings on test reliability are presented, aiming to answer whether or not the test is adequately reliable for measuring test takers' proficiency. Discussion of the necessary and important points follows each set of findings.
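As an illustration of the kind of internal-consistency reliability index discussed in that chapter, the sketch below computes Cronbach's alpha from a persons-by-items matrix of item scores. This is a minimal NumPy sketch under stated assumptions: the dissertation does not specify which coefficient or software was used, and the function name `cronbach_alpha` and the toy response data are purely illustrative.

```python
import numpy as np

def cronbach_alpha(scores):
    """Cronbach's alpha for a persons-by-items matrix of item scores.

    alpha = k/(k-1) * (1 - sum(item variances) / variance(total scores)),
    where k is the number of items; sample variances use ddof=1.
    """
    scores = np.asarray(scores, dtype=float)
    k = scores.shape[1]                          # number of items
    item_vars = scores.var(axis=0, ddof=1)       # variance of each item across persons
    total_var = scores.sum(axis=1).var(ddof=1)   # variance of persons' total scores
    return (k / (k - 1)) * (1 - item_vars.sum() / total_var)

# Toy example: two perfectly consistent items yield alpha = 1.0
responses = [[1, 1], [0, 0], [1, 1], [0, 0], [1, 1]]
print(round(cronbach_alpha(responses), 2))  # prints 1.0
```

Values near 1 indicate that the items rank test takers consistently; conventionally, alpha above roughly 0.7–0.8 is treated as adequate for a proficiency test, though the appropriate threshold depends on the stakes of the decision.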
Chapter VI presents findings and discussions on the validity of the VSTEP.3-5 reading test from the process-based perspective. In this chapter, findings from the expert judgement, the think-aloud protocols and the eye tracking are presented in turn, with due attention to the cognitive processes involved in the test-taking phase. The discussions center on reflections following each set of findings.
Chapter VII is the concluding chapter of the whole dissertation. First, the proposed interpretive argument for the VSTEP.3-5 reading test is revisited with respect to the generalization inference and the explanation inference. Next, the theoretical, methodological and practical implications of the current study are presented. Besides, the limitations of the current study are also mentioned. Suggestions for future studies are listed last.
Summary
This chapter has briefly discussed the rationale of the study, including several reasons for investigating the validity of the VSTEP.3-5 reading test from both product-based and process-based perspectives. In this chapter, the objective of the study, the scope of the study, clarifications of key terms and the context of the study have also been identified. Furthermore, the research significance has been stated in both theoretical and practical terms. Lastly, the overall structure of the thesis has been clearly presented.
LITERATURE REVIEW
Review of L2 reading
Because VSTEP.3-5 reading belongs to L2 reading, a review of L2 reading may provide a better understanding for conducting the study on the validity of the VSTEP.3-5 reading test. This part begins with descriptions of the nature of L2 reading. Then, by articulating three main perspectives, it explains how these perspectives inform the conceptualization of the constructs of L2 reading.
A review of the research literature on reading shows that it is hard to define reading in a sufficient and comprehensive way, though doing so should be a necessary and logical step in such a work. Therefore, diversified versions of an initial introduction or description of reading have been adopted in different studies. In the most general terms, it is accepted that reading is a dynamic process involving the reader, the text and the interaction between the reader and the text (Rumelhart, 1977). To specify the process of this interaction, professional studies indicate that it is readers' comprehension and interpretation of the symbols on the paper (Harris & Sipay, 1985; Keith & Alexander, 1989; William & Fredricka, 2005) and that it engages both cognitive operations and linguistic knowledge, with which readers construct a meaningful representation of an author's message (Francoise, 1981). Different readers diverge in their reconstruction of the written language as they search for and process information and make sense of written words based on their own background knowledge, building mental models of the text by integrating their own cognitive and mental particularities.
If the basic model of the general reading process works as described above, what happens to L2 readers, who deal with a target text in a language they have to learn by effort rather than acquire in a natural situation? In the absence of a comprehensive theory of L2 reading development, researchers focusing on L2 reading tended to rely on the theoretical frameworks established in the study of the native language and assumed that L2 reading development comprises components similar or identical to those identified in L1 (Jolly, 1978). However, such simple inferential theories could not stand for long: abundant research with English as a second language (ESL) readers as subjects was conducted and analyzed, producing progressive reports of evident differences between the L2 and L1 reading processes. According to Adrian (2003, p. 157), "reading in a second language may place additional demands on the readers due to L2 language and cultural proficiency as well as previous literacy experiences and beliefs". William and Fredricka (2005) summarize the differences between L1 and L2 reading in three aspects: (1) linguistic and processing differences; (2) individual and experiential differences; and (3) socio-cultural and institutional differences. The salient and unique features of L2 reading sub-categorized under these three aspects include: (1) L2 readers usually have less lexical, grammatical and discourse knowledge, which impacts their understanding of the passage; (2) L2 readers show greater metalinguistic and metacognitive awareness in L2 settings; and (3) different ways of organizing discourse and texts are relatively unfamiliar to L2 readers, and so forth. In summary, L2 reading follows the universal rules of reading behavior but, compared with L1 reading, involves more complexities in the comprehension process, which require special effort and attention from researchers in relevant studies.
To conclude this section, a definition of reading employed by the Programme for International Student Assessment (OECD, 2016) is selected, which demonstrates the multiplicity of the nature of reading: reading literacy is understanding, using and reflecting on written texts in order to achieve one's goals, to develop one's knowledge and potential, and to participate in society. In sum, the activity of reading is not simply a decoding process for extracting information. Instead, it combines readers' active contributions to constructing meaning for a variety of purposes, and the activity is embedded in a sociocultural context.
Unlike testing productive skills, testing receptive skills such as reading is more complicated because the actual reading process is internal to the reader and difficult to observe, describe and inspect (Alderson, Haapakangas, Huhta, Nieminen, & Ullakonoja, 2015). In effect, how to test reading ability hinges on how to conceptualize and operationalize the constructs of reading. Enright et al. (2000) present three perspectives for thinking about the constructs of reading comprehension for the TOEFL 2000 project: the processing perspective, the task perspective and the reader purpose perspective, which are particularly relevant to the purpose of the current study and are discussed below respectively.
From a processing perspective, an array of linguistic and processing variables is conceived to drive the construct of reading. In order to investigate whether a reading test is valid, it is important to have a good understanding of what test takers do to complete the reading comprehension tasks in the test. There has been a growing recognition of the importance of gaining a better understanding of reading comprehension test-taking processes as part of construct validation (Anderson, Bachman, Perkins & Cohen, 1991, p. 42). Cohen notes that in order to assess test takers' reading comprehension competence in a second or foreign language, it is necessary to have a working knowledge of what their reading comprehension test-taking processes entail (Cohen, 1994, p. 211). The main processes and conceptual models are explained in the following part.
A review of the reading literature shows that the evolution of theories and studies on the processes of L2 reading has been influenced by the following main approaches from L1 reading. The first is the bottom-up approach. Associated with behaviorism in the 1940s and 1950s, this approach views the reading process as a serial model in which readers decode graphic-phonemic-syntactic-semantic systems in a linear sequence (Alderson, 2000; Gough, 1972). The bottom-up approach emphasizes lower-level skills or linguistic knowledge – the ability to decipher the orthographic, phonological, lexical, syntactic and discourse features of the texts (Birch, 2002; Cohen & Upton, 2006).
Opposite to the bottom-up approach, the top-down approach underscores what readers bring to the texts through their background knowledge. Owing much of its work to Smith and Goodman (1971), this approach draws on schema theory. Bartlett (1932, p. 201) defines a schema as "an active organization of past reactions, or past experience, which must always be supposed to be operating in any well-adapted organic response". The top-down approach assumes that readers activate their schemata – prior knowledge – to interact with the texts. Whilst schema theory brings to light readers' contributions to the reading process, it has been questioned by some researchers, since how readers' prior knowledge is called up from long-term memory and used in understanding texts remains unclear (Grabe, 1991).
Both bottom-up and top-down models mirror the process of reading at some level, and both have gained some convincing evidence. However, the bottom-up model does not attend to higher-level processes, nor does it explain how other sources of knowledge contribute to reading comprehension, while the top-down model glosses over the lower-level processes, especially the use of graphic cues. A more adequate approach, which takes into account both bottom-up and top-down models, is known as the interactive approach (Grabe, 1988; Rumelhart, 1977, 1980; Stanovich, 1980). This approach takes the position that interaction during reading occurs both between the reader and the text and among the different linguistic skills involved in reading (Grabe, 1991). Readers employ both lower-level and higher-level skills to construct meaning from the text, and the modes of interaction vary across readers and texts. In a nutshell, reader and text variables intertwine to exert a joint impact on the reading process and product.
Apart from these three metaphors of the reading process, a useful addition to the approaches takes a sociocultural view of reading. As both readers and texts are rooted in a sociocultural context, the reading activity is socially embedded and context-bound (Alderson, 2000). As a result, reading is no longer conceived as an individualized mental activity but as one of the social literacy events (Barton, 1994). In this view, what literacy entails, how reading is valued, and how good or bad readers are evaluated differ from culture to culture and should not be taken for granted. This sociocultural dimension of reading poses challenges for reading pedagogy and assessment, especially for L2 reading.
Urquhart and Weir (1998) distinguish between careful reading, which refers to the thorough processing of a text to extract its full meaning, and expeditious reading, which refers to rapid, selective, and efficient reading to obtain the needed information contained in the text. Khalifa and Weir (2009) propose a cognitive processing approach to establishing the construct measured by reading tests and examining tests’ cognitive validity. As a strand of construct validity, cognitive validity addresses “the extent to which a test requires a test taker to engage in cognitive processes that resemble or parallel those that would be employed in non-test circumstances” (Field, 2013, p. 78). As reading is a cognitive activity which largely takes place in the mind (Urquhart & Weir, 1998), the cognitive validity of reading tests lies in whether the tasks elicit the cognitive processes involved in the target reading contexts beyond the test itself (Khalifa & Weir, 2009). Thus, the aim of Khalifa and Weir’s cognitive model of reading is to reflect what readers do when they engage in different reading activities in real life.
Figure 2.1. Khalifa and Weir’s cognitive model of reading (2009)
In the model illustrated in Figure 2.1, the left-hand column specifies a goal setter, because readers’ decision about what type of reading to employ determines the levels of processing to be activated in the central core of the model. Based on Urquhart and Weir’s (1998) classification, the reading goals can involve either expeditious or careful reading and can take place at the local or the global level. Readers monitor whether their reading is progressing in line with the generated goals and remediate their reading behavior when necessary (Brunfaut & McCray, 2015). The processing core in the middle column includes eight cognitive processes in a hierarchical system, with the lower levels comprising word recognition, lexical access, syntactic parsing, and establishing propositional meaning, and the higher levels including inferencing, building a mental model, and creating a text-level or intertextual representation (Khalifa & Weir, 2009). The lower levels of cognitive processing correspond to local comprehension – the understanding of propositions within the sentence (individual phrases, clauses and sentences) – while the higher levels focus on global comprehension beyond the sentence level (Weir, Hawkey, Green, & Devi, 2009). The right-hand column indicates the knowledge base readers deploy when engaged in reading, which is linked to specific levels of the processing core, such as knowledge of the macro-structure of a text and of the logical or rhetorical relationships among ideas.
The model distinguishes three types of expeditious reading: scanning, skimming, and search reading. Scanning is a form of expeditious reading at the local level, which involves reading selectively to find specific words, numbers or phrases in a text. Skimming is often defined as rapid reading by sampling the text to abstract key points, general impressions, or superordinate ideas (Urquhart & Weir, 1998; Weir, 2005); it specifically involves global reading to establish the macrostructure and discourse topic of a text. Unlike skimming, search reading involves predetermined topics. The reader does not necessarily have to build a macro-propositional structure for the entire text, but rather seeks information that meets his or her requirements. Moreover, unlike scanning, search reading does not look for exact word matches but for words in the same semantic domain as the target information. Search reading can thus be classified as local when the desired information can be found within a single sentence, and as global when the information must be constructed across sentences.
Review of test validation approaches
As discussed above, test takers’ internal cognitive processing, including text comprehension and task completion, cannot be directly observed. The processes of reading comprehension are closely related to the products, as the former lead to the latter. Thus, what test takers do with the text is as important as the products. The methodological advancements in language testing and assessment in the 1980s led researchers such as Sternberg (1981, p. 20) to call for a shift beyond product measurement to process measurement. Alderson (1990a, p. 437) holds that collecting information about the test-taking process is part of construct validation and that introspective data can be used to elucidate the nature of the features under consideration. Therefore, the focus of the current study is to validate the VSTEP.3-5 reading test not only from the perspective of test products but also from the perspective of test-taking processes, which means that multidimensional evidence is taken into consideration. Specifically, from the perspective of test products, the current study is mainly based on IRT and adopts a suitable Rasch model to collect evidence. As to the perspective of test-taking processes, the current study mainly collects evidence through think-aloud protocols and eye tracking. By integrating the data on products and processes, test takers’ reading comprehension test-taking processes can be depicted more comprehensively. Therefore, it is natural and reasonable to review the test validation approaches used during the validation that relate to both the product-based and the process-based perspectives.
2.2.1 Test validation approaches from product-based perspective
Item Response Theory (IRT) (Rasch, 1960; Lord & Novick, 1968), as an extension of Classical Test Theory (CTT) (Gulliksen, 1950), is characterized by a collection of mathematical models in which a set of parameters (discrimination and/or difficulty and/or guessing) is used to describe the probability of a given response to an item conditional on a test taker’s ability level. In contrast to CTT, which uses aggregated test scores as the level of analysis, IRT analyzes item-level data by plotting an Item Characteristic Curve (ICC) (for dichotomous IRT models) or a Category Characteristic Curve (CCC) (for polytomous IRT models) showing the unique properties of individual test items (Baker, 1985; Embretson & Reise, 2000). The biggest criticism of CTT is its inability to separate the dependency between test characteristics and test taker characteristics; IRT overcomes this shortcoming through its properties of item parameter invariance and ability parameter invariance (Yang, 2005). This means that the estimated parameters of an item are not contingent on the particular set of test takers to whom it is administered, and the estimated ability of a test taker does not depend on the set of items the test taker is administered. These merits have contributed to the prevalence of IRT in both research and operational test settings (Embretson & Reise, 2000).
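The relationship between these item parameters and the probability of a correct response can be illustrated with a short sketch. This is illustrative only: the parameter values below are invented for demonstration and are not drawn from VSTEP.3-5 data.

```python
import math

def irt_prob(theta, a=1.0, b=0.0, c=0.0):
    """Three-parameter-logistic (3PL) item response function.

    theta: test taker ability; a: discrimination; b: difficulty;
    c: pseudo-guessing lower asymptote. Setting a=1 and c=0 reduces
    this to the dichotomous Rasch (1PL) model.
    """
    return c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))

# Trace an item characteristic curve (ICC) for a hypothetical item of
# difficulty b = 0.5: the probability of success rises with ability.
for theta in (-2.0, -1.0, 0.0, 1.0, 2.0):
    print(f"theta = {theta:+.1f} -> P(correct) = {irt_prob(theta, b=0.5):.3f}")
```

At `theta == b` the probability is exactly 0.5 (when `c = 0`), which is what makes the difficulty parameter interpretable as the ability level at which a test taker has a fifty-fifty chance on the item.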
In an era of high-stakes testing and evaluation in education, it is imperative to use rigorous methods and standards to obtain evidence of the reliability of measures and the validity of inferences, and the Rasch model can play a substantial role in this process (Smith Jr., 2004). Rasch analysis is an approach to obtaining objective, fundamental measures from stochastic observations of ordered category responses. After decades of development, the basic Rasch model has grown into a family of Rasch models. The most commonly used include the dichotomous Rasch model (Rasch, 1960), the rating scale model (also called the polytomous Rasch model) (Andrich, 1978), the equidistant model (Andrich, 1982), the partial credit model (Masters, 1982), the many-facet Rasch model (Linacre, 1989), and the mixed Rasch model (Rost, 1990).
The dichotomous Rasch model is the simplest model in the Rasch family (Wright & Mok, 2004). It is designed for ordinal data scored in two categories (usually 1 or 0). Specifically, the responses to an item may be “right”/“wrong” or “yes”/“no”, with the “right” or “yes” response marked as “1” and the “wrong” or “no” response marked as “0”. The dichotomous Rasch model uses sum scores from ordinal responses to calculate interval-level estimates that represent person locations (i.e., person ability or person achievement) and item locations (i.e., the difficulty of providing a correct or positive response) on a linear scale that represents the latent variable. The difference between person and item locations can then be used to calculate the probability of a correct or positive response rather than an incorrect or negative one (Stefanie & Cheng, 2022). The rating scale model (Andrich, 1978) states that the probability of person “n” scoring “x” on the “m”-step item “i” is a function of the person’s location θn and the difficulties of the “m” steps. In this model, thresholds are combined with item parameter estimates. The model assumes equal category thresholds for all items (Khosravi, 2019); it is mainly used for validating questionnaires and for accounting for local dependencies in testlet-based assessments (Baghaei, 2007; Baghaei, Monshi, & Boori, 2009; Baghaei, 2010; Pishghadam, Baghaei, Shams, & Shamsaee, 2011; Baghaei & Cassady, 2014; Tabatabaee-Yazdi, Motallebzadeh, Ashraf, & Baghaei).
The equidistant model (Andrich, 1982) assumes that the distances between the thresholds within an item are equal, but not necessarily across items. This model is especially advised for accounting for local dependency in educational tests where several items are based on one prompt, by forming super-items (Khosravi, 2019). The partial credit model (Masters, 1982) stipulates that the distances between the steps can vary across items and within each single item, and even the number of steps can vary. That is to say, each item has a unique rating scale structure, making this the least restrictive model in terms of the distances between steps within and between items, compared with the rating scale model (Andrich, 1978) and the equidistant model (Andrich, 1982).
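The contrast between shared and item-specific thresholds can be made concrete with a minimal sketch of the rating scale model’s category probabilities. The threshold values below are invented for demonstration; under the partial credit model the same computation would simply use a different threshold list per item.

```python
import math

def rsm_category_probs(theta, delta, taus):
    """Category probabilities for one item under Andrich's rating scale model.

    theta: person location; delta: item location; taus: step thresholds
    shared by ALL items. The partial credit model (Masters, 1982) would
    instead give each item its own threshold list.
    """
    # Cumulative logit sums; category 0 has an exponent of 0 by convention.
    exponents = [0.0]
    running = 0.0
    for tau in taus:
        running += theta - delta - tau
        exponents.append(running)
    denom = sum(math.exp(e) for e in exponents)
    return [math.exp(e) / denom for e in exponents]

# A person located exactly at the item's difficulty, with symmetric
# thresholds, is equally likely to land in the lowest and highest category.
probs = rsm_category_probs(theta=0.0, delta=0.0, taus=[-1.0, 1.0])
print([round(p, 3) for p in probs])
```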
The many-facet Rasch model (Linacre, 1989) extends the basic Rasch model by incorporating more facets than the two typically included in a test (i.e., test takers and items), such as raters, scoring criteria, and tasks. In the analysis of performance assessments, the many-facet Rasch model allows the inclusion and estimation of the effects of additional facets of interest besides test takers and items, such as raters, criteria, tasks, and assessment occasions. Within each facet, the model represents each element by a separate parameter. The parameters denote distinct attributes of the facets involved, such as proficiency or ability (for test takers), severity or harshness (for raters), and difficulty (for scoring criteria or tasks) (Khosravi, 2019). The many-facet Rasch model is thus particularly advantageous for analyzing test results whose scores are easily affected by multiple factors, as in speaking, writing, and translation tests. It can parameterize the degree of these influencing factors and express them as values, which helps minimize the impact of factors other than test takers’ proficiency and thus represent test takers’ proficiency levels more faithfully. The mixed Rasch model (Rost, 1990) combines the Rasch model with latent class analysis. It can be applied when the difficulty patterns of items consistently differ across classes of the population, allowing item parameters to vary across classes, i.e., when the unidimensional Rasch model does not fit the entire population (Rost, 1990; Rost & von Davier, 1995). The mixed Rasch model has been used in personality testing to identify latent classes differing in their use of the response scale, which can detect test taker heterogeneity, the associated item profiles, the latent score distribution, and the size of the latent classes (Rost, 1990). Studies adopting this model include Baghaei and Carstensen (2013); Pishghadam, Baghaei, and Seyednozadi (2017); and Baghaei, Kemper, Reichert, and Greif (2019).
In general, Rasch models take objective measurement in the natural sciences as a benchmark and establish a set of objective standards for measurement in the social sciences, so as to ensure that the information provided by measurement is more objective and reliable, offering a feasible solution to the problem of objectivity in educational and psychological science.
Rationale for using the dichotomous Rasch model
Because the VSTEP.3-5 reading test is composed of 40 multiple-choice items in an objective response format scored in two categories (1 or 0), the dichotomous Rasch model is selected for the analysis in the current study. The formula of the dichotomous Rasch model can be written as:
loge(Pni1 / (1 - Pni1)) = Bn - Di
where Bn is the ability measure of person “n”, Di is the difficulty measure of item “i”, “e” is the natural constant (e = 2.71828), and Pni1 is the probability that person “n” encountering item “i” is scored “1”. Likewise, the probability that person “n” encountering item “i” is scored “0” is 1 - Pni1. As the formula suggests, the probability of a correct response to an item depends on the person’s ability and the item’s difficulty. Under the dichotomous Rasch model, the likelihood that a given test taker will correctly answer a given test item is modeled as a logistic function of the difference between the test taker’s ability level (the person parameter) and the difficulty of the test item (the item parameter). The dichotomous Rasch model thus parameterizes the person’s ability and the item’s difficulty and compares them on the same logit scale. It involves complex statistical modelling, but in essence it ranks test takers according to their ability (as determined by their total test scores) and simultaneously ranks the test items according to difficulty (in terms of how many test takers are able to answer them successfully) (Schmitt, 2011). Comparisons between test takers and test items can then be made. The assumptions of the dichotomous Rasch model are that each person is characterized by an ability and each item by a difficulty, both of which can be expressed by numbers along one line; finally, from the difference between these numbers (and nothing else), the probability of observing any particular scored response can be computed (Bond & Fox, 2015, p. 32).
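A short numeric sketch of this formula (the ability and difficulty values are chosen purely for illustration):

```python
import math

def rasch_p(b_n, d_i):
    """P(X=1): probability that person n with ability b_n answers
    item i with difficulty d_i correctly (dichotomous Rasch model)."""
    return math.exp(b_n - d_i) / (1.0 + math.exp(b_n - d_i))

# Ability equal to difficulty gives exactly 0.5; each logit of
# advantage pushes the probability further toward 1.
print(rasch_p(0.0, 0.0))            # 0.5
print(round(rasch_p(1.0, 0.0), 3))  # 0.731
print(round(rasch_p(-1.0, 1.0), 3))
```

Because only the difference Bn - Di enters the formula, persons and items can be placed and compared on the same logit scale, which is what the Wright map exploits.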
Baghaei (2008) holds that the fit indices of the Rasch model and the distribution of item difficulty in the variable map are indicators of test quality. A fit index indicates whether each item fits a unidimensional variable (Linacre, 2023); if it does not, the item measures dimensions unrelated to the proposed construct. In the variable map, the difficulties of the test items should be evenly distributed without overly large gaps. A large gap indicates that the difficulty difference between adjacent items is too big: test takers located in the blank area of the item distribution are not measured accurately, and the construct is not sufficiently covered by the items. In addition to these two indicators, unidimensionality and differential item functioning (DIF) are also considered indicators for verifying the validity of test items, showing whether an item measures a single construct and whether its difficulty remains constant across different groups (Bond & Fox, 2007). In educational and psychological measurement, DIF is often used to represent the fairness of a test. DIF means that, after controlling for group ability, an item displays different statistical characteristics in different groups. DIF analysis is a scientific and effective method for detecting items that may treat test takers unfairly, so as to ensure that all items are fair and effective for the target test takers.
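The fit indices mentioned above can be computed directly from the model residuals, as the following simplified sketch shows. The response data are invented for demonstration; operational analyses would rely on software such as WINSTEPS.

```python
import math

def item_fit(responses, abilities, difficulty):
    """Infit and outfit mean-square statistics for one dichotomous item.

    responses: 0/1 scores of each person; abilities: their Rasch ability
    estimates; difficulty: the item's difficulty estimate. Values near 1.0
    indicate the data match Rasch-model expectations; common practice
    treats values far outside roughly 0.5-1.5 as misfitting.
    """
    squared_resid, variances, z_squared = [], [], []
    for x, theta in zip(responses, abilities):
        p = 1.0 / (1.0 + math.exp(-(theta - difficulty)))  # expected score
        var = p * (1.0 - p)                                # model variance
        squared_resid.append((x - p) ** 2)
        variances.append(var)
        z_squared.append((x - p) ** 2 / var)
    infit = sum(squared_resid) / sum(variances)  # information-weighted
    outfit = sum(z_squared) / len(z_squared)     # unweighted average
    return infit, outfit

# Deterministic, Guttman-like responses overfit the model (values < 1).
infit, outfit = item_fit([1, 1, 0, 0], [2.0, 1.0, -1.0, -2.0], 0.0)
print(round(infit, 3), round(outfit, 3))
```

The infit statistic weights each residual by its model variance, so it is most sensitive to unexpected responses from persons located near the item; the outfit statistic averages unweighted standardized residuals and is more sensitive to lucky guesses and careless errors from persons far from the item.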
Many studies have used the above indicators to verify the validity of dichotomously scored items (Eckes & Grotjahn, 2006; Dávid, 2007; Beglar, 2010; Pae, Greenberg, & Morris, 2012). In the context of the current study, some researchers have used the dichotomous Rasch model as part of their analyses to validate VSTEP.3-5. For example, Nguyen Thi Phuong Thao (2018) evaluates the relevance and coverage of the content of a VSTEP.3-5 reading test in the article “An Investigation into the Content Validity of a Vietnamese Standardized Test of English Proficiency (VSTEP.3-5) Reading Test”. Nguyen Thi Quynh Yen (2018) discusses the cut-score validity of the VSTEP.3-5 listening test in her doctoral dissertation.
In the current study, the dichotomous Rasch model is applied to test quality analysis from the product-based perspective using the WINSTEPS software, mainly in the following steps: firstly, the unidimensionality of the test items is checked to determine whether they measure only one construct of English reading proficiency; secondly, item estimates are analyzed to determine whether the test items function well enough for their intended use; thirdly, item and person calibrations are analyzed and their relative standings compared directly through the Wright map; fourthly, DIF by test takers’ gender and major is tested to determine whether any test items exhibit substantial DIF; lastly, the reliability and separation indices for items and persons are checked for the overall quality of the investigated test.
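As a rough illustration of the DIF step, an item’s difficulty can be calibrated separately for two groups and the logit difference inspected. This sketch is deliberately simplified (it ignores the ability matching that WINSTEPS performs), and the response data are invented.

```python
import math

def item_logit_difficulty(scores):
    """Crude item difficulty for one group: the logit of the failure rate."""
    p_correct = sum(scores) / len(scores)
    return math.log((1.0 - p_correct) / p_correct)  # higher = harder

def dif_contrast(scores_group_a, scores_group_b):
    """Difference between two groups' difficulty estimates for one item.

    A contrast near 0 suggests the item functions similarly for both
    groups; a common rule of thumb flags |contrast| >= 0.5 logits
    for closer review.
    """
    return item_logit_difficulty(scores_group_a) - item_logit_difficulty(scores_group_b)

# Invented 0/1 responses to one item from two groups of test takers.
group_a = [1, 1, 1, 0, 1, 1, 0, 1]  # 6/8 correct: item looks easier here
group_b = [1, 0, 1, 0, 0, 1, 0, 0]  # 3/8 correct: item looks harder here
print(round(dif_contrast(group_a, group_b), 2))
```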
2.2.2 Test validation approaches from process-based perspective
The current methodological interest is to obtain direct data on the cognitive processes test takers experience under test conditions. A literature review centering on the two concurrent data elicitation approaches utilized in the current study, think-aloud protocols and eye tracking, enables the researcher to better evaluate the appropriateness of these two research approaches for the purpose of this study.
Rationale for using think-aloud protocols
Summary
In this chapter, the literature pertaining to the nature and the construct of L2 reading has been discussed in detail; this constitutes the first pillar of the literature review, the subject review of the current study. Then, the test validation approaches of the current study have been reviewed from both the product and the process perspectives. On the one hand, IRT and Rasch models, which belong to the product approach, have been discussed. On the other hand, methodological considerations in utilizing the process approach, think-aloud protocols and eye tracking in particular, have been reviewed. The externalization of cognitive processes through introspection methodologies such as think-aloud protocols and observation methodologies such as eye tracking supplements interpretations inferred from product approaches based solely on test score data. These components constitute the second pillar of the literature review, the methodological review of this study, which serves as an operational foundation for the current study. The literature review sheds light on the study of the nature of the reading test product and the test-taking process, since those discussions not only inspire the researcher to formulate the research inferences and questions of the present study but also provide references to draw on during the investigation.
THEORETICAL FRAMEWORK
Validity
Validity studies need to begin with a clear clarification of the concept of validity. Validity is the first factor to be considered in the development, interpretation and use of language tests, because it relates to the correct and reasonable interpretation and use of test results (scores) (AERA, APA, & NCME, 1985; Bachman, 1990). For a long time, validity has been the most important indicator for judging the quality of a test and the basic starting point of language testing. Therefore, what validity is and how to evaluate the validity of a test is an enduring topic that the testing field is constantly exploring.
In the past 60 years, with the development of educational and psychometric theories, the academic community’s understanding of validity has undergone successive shifts from a “single” view, through a “categorized” view, to a “unitary” view. Before the 1950s, education and psychometrics generally adhered to the view that “relevance is validity”. However, a test can be related to many kinds of things, so it is not easy to determine the “related” thing. Kelly (1981) states that the problem of validity is whether a test really measures what it purports to measure. As a result, researchers began to discuss validity from a “categorized” perspective, and different types of validity appeared. For example, in 1954, the American Educational Research Association (AERA), the American Psychological Association (APA), and the National Council on Measurement in Education (NCME) jointly promulgated the Technical Recommendations for Psychological Tests and Diagnostic Techniques, in which validity is divided into four categories: predictive validity, concurrent validity, content validity and construct validity. In 1966, the Standards for Educational and Psychological Tests and Manuals (APA, 1966) combined predictive validity and concurrent validity into criterion-related validity, turning the previous four-category classification of validity into a three-category one.
In 1961, Lado introduced the concept of validity from education and psychometrics into language testing for the first time in the founding work of modern language testing, Language Testing. Lado (1961, p. 321) asks, “Does a test measure what it is supposed to measure? If it does, it is valid.” Later, the field of language testing followed Lado’s point of view in defining validity (e.g., Valette, 1977; Heaton, 1975; Harrison, 1983; Finocchiaro & Sako, 1983). Among them, Heaton (1975, p. 153) divides language test validity into four types: face validity, content validity, construct validity and empirical validity. During this period, language test validation mainly used the methods proposed by Lado, such as selecting and designing content-related and learning-related items; revising test items whose difficulty is increased by non-verbal factors; and administering both a test of established high validity and the self-developed test to a representative sample of students and calculating the correlation coefficient between the two sets of scores to determine test validity (Lado, 1961). In a nutshell, analyzing test content and calculating criterion-related correlation coefficients were the main methods of language test validity research during this period.
In the mid-1970s, although the third edition of the Standards for Educational and Psychological Tests and Manuals (1974) still divides validity into different categories, the definition of validity changed: it holds that “validity mainly refers to the appropriateness of inferences from test results” (1974, p. 25).
In the 1980s, as the study of validity continued to deepen, the fourth edition of the Standards for Educational and Psychological Testing (1985) further revised the definition of validity, defining it as the appropriateness, meaningfulness, and usefulness of the specific inferences made from test scores. Henning (1987) defines validity as the appropriateness of a given test or any of its component parts as a measure of what it is purported to measure. A validity study is the process of collecting evidence to support the inferences made, and the replacement of “validity types” with “evidence types” was advocated. During this period, the traditional concept of categorized validity was challenged, because the field of educational measurement found that the validity evidence collected by categorized methods was too scattered, and that validation did not consider the value implications of test scores or the social consequences of their use. Hughes (1989, p. 22) considers a test valid if it measures “accurately what it is intended to measure”. Concurrently, Messick (1988, 1989) proposed the unitary concept of validity, pointing out that there is only one validity, namely construct validity, while the evidence supporting validity can be multifaceted, and used the facets-of-validity framework (also known as the progressive matrix) to explain this view. The facets-of-validity framework comprises four dimensions: test interpretation, test use, evidential basis and consequential basis. Later, Messick (1995, p. 742) defined validity broadly as “nothing less than an evaluative summary of both the evidence for and the actual – as well as potential – consequences of score interpretation and use.” Messick’s unitary validity updated the concept of test validation: since then, validation has covered not only the evaluation of the test itself and its scores, but also the evaluation of the interpretation and use of test results. In education and psychometrics, Messick’s unitary validity dominated validity research for more than ten years from the mid-to-late 1980s (Newton & Shaw, 2014). Since Messick, the scope of validity has been further expanded, and external factors such as utility value, relevance, and social consequences have also been included (Shepard, 1993).
Following Messick, validity is regarded as an argument about test interpretation and test use, and according to Bachman (1990) it is all test users’ responsibility to justify the validity of tests. Borsboom, Mellenbergh, and van Heerden (2004) propose another approach, treating validity as a property of the test being evaluated rather than a judgement about the test. Weir (2005, p. 12) states that “validity is perhaps better defined as the extent to which a test can be shown to produce data, i.e., test scores, which are an accurate representation of a candidate’s level of language knowledge or skills.” Validity is no longer a question of “whether” but a question of “degree”, and test validation is a process of collecting validity evidence from different aspects during the testing process (Chapelle, 2012).
From the concepts of validity proposed by the scholars above, it can be concluded that the concept of validity has shifted its focus from the test itself, or the test score, to the interpretation of the test score, which can be seen as a great advance. As mentioned in the first chapter, the Standards for Educational and Psychological Testing (AERA et al., 2014) defines validity as the degree to which evidence and theory support the interpretations of test scores entailed by the proposed uses of a test. The current study adopts this definition and collects evidence not only from the product-based perspective but also from the process-based perspective to validate the VSTEP.3-5 reading test.
Studies on the validity of VSTEP.3-5
Several researchers have conducted studies on the validity of VSTEP.3-5 from different aspects, such as validity argument, content validity, cut-score validity, comparative studies of validity, validity from the scoring perspective, and validity from the test takers’ perspective.
To be specific, regarding validity argument, Nguyen Thi Quynh Yen (2017) proposes a validity argument for the VSTEP.3-5, which has inspired later researchers. Besides, Vo Ngoc Hoi (2021) builds a validity argument for the reading component of VSTEP.3-5, employing a mixed-method paradigm to examine three interrelated aspects of the construct of the VSTEP.3-5 reading test, thereby offering insights into the extent to which the pattern of test scores, the reading processes of test takers, and the linguistic features of the reading texts correspond with what the test is intended to measure.
As for content validity, Nguyen Thi Phuong Thao (2018) evaluates the relevance and coverage of the content of a VSTEP.3-5 reading test in the article “An Investigation into the Content Validity of a Vietnamese Standardized Test of English Proficiency (VSTEP.3-5) Reading Test”. Although the findings of her research help confirm the content validity of the specific test paper investigated, more tests need to be investigated in order to reach generalized conclusions. Moreover, evaluating content validity alone is not sufficient for validity research on a test. Therefore, validation of the VSTEP.3-5 reading test from different perspectives should be taken into consideration.
As for cut-score validity, Nguyen Thi Quynh Yen (2018) discusses the cut-score validity of the VSTEP.3-5 listening test in her doctoral dissertation, which contributes to raising awareness of the importance of evaluating the cut scores of high-stakes language tests in Vietnam so that fairness can be ensured for all test takers. In general, the investigation of the cut-score validity of the VSTEP.3-5 listening test is an ongoing attempt at building a systematic, transparent and defensible body of validity argumentation for the VSTEP.3-5.
For comparative studies of validity, Dunlea et al. (2018) compare APTIS and VSTEP.3-5 in the context of higher education in Vietnam, aiming to contribute to the theoretical framework for carrying out such comparability studies by using the socio-cognitive model for language test validation to design a multimethod data collection and analysis approach. Wei (2021) conducts a comparative study on the validity of the VSTEP.3-5 and PETS-5 reading tests in terms of context validity, criterion-related validity and cognitive validity under Weir’s socio-cognitive framework. With an eye to examining test takers’ English language skills more comprehensively and enhancing the tests’ validity, both reading tests should be improved in certain respects: the VSTEP.3-5 reading test should appropriately enrich its item types, and the PETS-5 reading test should expand the genres and subject matter of its reading texts.
As to validity from the scoring perspective, Nguyen Thi Ngoc Quynh et al. (2020) investigate the impact of the VSTEP speaking rating scale training session in the rater training program provided by the University of Languages and International Studies - Vietnam National University, Hanoi (ULIS-VNU). Meaningful implications for both future practices of rater training and rater training research methodology can be drawn from the study.
As to validity from the test takers' perspective, Wei and Nguyen Dinh Hoa (2021) conduct validity research on the VSTEP.3-5 based on Weir's socio-cognitive framework and Cheng and DeLuca's thematic framework. The research collects validity evidence on the VSTEP.3-5 regarding test-taker characteristics, context and consequence, uses a thematic coding process to analyze 50 test takers' reflective journals on the VSTEP.3-5, and proposes a validation framework for the VSTEP.3-5. The results show that test preparation, psychological factors and test content are the three main factors that affect the validity of the VSTEP.3-5.
To sum up, there is no doubt that the above-mentioned empirical studies have made contributions from different perspectives and have inspired the researcher in the current study. Besides, the VSTEP.3-5 is a newly developed test; thus, validation is ongoing to ensure that the test is valid and reliable, and the validation of a test form in this study may offer implications for the test specifications and test development.
Validation
Validation is the process of collecting various types of evidence to ensure validity (AERA et al., 2014; Bachman, 2005, 2010; Messick, 1989). In other words, validation is an activity conducted by researchers to determine whether a test is valid. "Validity is a concept like truth. It represents an ideal or to find out which way to go. Validity is about ontology; validation is about epistemology" (Borsboom et al., 2004, p. 1063).
In order to solve the operational problem of test validation, different frameworks have been produced in the period of the unitary validity view. Among them, there are several representative validation frameworks: Bachman and Palmer's (1996) test usefulness framework, Weir's (2005) socio-cognitive framework, Kane's (2006, 2013, 2016) interpretive argument framework, Chapelle, Enright and Jamieson's (2008) interpretive argument framework, and Bachman and Palmer's (2010) assessment use argument framework. The first two are evidence-based validation frameworks, and the last three are argument-based validation frameworks. Discussing these five representative validation frameworks under the unitary validity view is conducive to an in-depth understanding of the advantages and disadvantages of each framework, and lays a sound foundation for the construction of the validation framework of the current study later on.
3.3.1.1 Bachman and Palmer’s test usefulness framework
Bachman and Palmer (1996) advocate Messick's (1989) approach to evidence collection, using the test usefulness framework to provide clearer guidance on the types of validity evidence that should be collected (see Figure 3.1). The test usefulness framework contains six qualities: reliability, construct validity, authenticity, interactiveness, impact and practicality. Reliability refers to the stability of test results; construct validity refers to the extent to which the interpretation of test scores is meaningful and appropriate; authenticity refers to the consistency between test task characteristics and target language use task characteristics; interactiveness refers to the type and degree of involvement of test takers' personal characteristics when they complete the test tasks; impact refers to the influence of a test on individuals, the education system and society as a whole; practicality refers to the relationship between the resources required to design, develop and use a test and the resources available. Among them, reliability and construct validity are the most basic quality attributes of all tests, while authenticity, interactiveness, impact and practicality are closely related to the social and educational situation of a test. Bachman and Palmer (1996) point out that test usefulness is the most important indicator of test quality. To verify test usefulness, that is, to conduct validation, one needs to collect validity evidence for the six qualities in the framework on the basis of a clarified test purpose and the social and educational situation of the test.
Figure 3.1. Test usefulness framework (Bachman & Palmer, 1996)
The test usefulness framework readily interprets Messick's validity theory, embodies the concept of unitary validity (Han & Luo, 2013), and better solves the operability problem of test validation frameworks (Chapelle & Voss, 2013). In the following ten years, the test usefulness framework became the authoritative model for the validation of language tests (Weigle, 2002; Xi, 2008; Luo, 2019), and played an important role in guiding the development and use of language tests. It not only moved validation away from the past fragmentary collection of validity evidence, but also turned the main focus of validation from the traditional "reliability" to "construct validity" and "influence" (consequences), prompting the field of language testing to pay more attention to the core issues of construct definition, the verification of construct validity, and the washback of tests (Luo, 2019). However, while pursuing operability, the test usefulness framework lost theoretical coherence (McNamara, 2003); in particular, the logical relationship between the six qualities needs to be further clarified (Xi, 2008). Bachman (2005) himself also believes that the test usefulness framework lacks consideration of the relative importance of the six qualities, and that the priority for validation needs to be determined by the practitioner according to the actual situation. The lack of relevance and logic among the constituent qualities made the test usefulness framework gradually lose its dominant position in the practice of test validation. This may explain why Bachman and Palmer (2010) adopt an argument-based framework for investigating validity in their later work (Fulcher, 2015).
Similarly, in order to solve the problem of disconnection between validity theory and practical operation under the unitary validity view, Weir (2005) proposes the "socio-cognitive framework" (see Figure 3.2). The framework systematically integrates the social, cognitive and scoring dimensions into a validation model for the first time (O'Sullivan & Weir, 2011). It advocates that validation is a process of evidence collection, and the evidence to be collected is divided into five types according to the types of validity: (1) context validity; (2) theory-based validity; (3) scoring validity; (4) criterion-related validity; (5) consequential validity, as presented in Figure 3.2.
Figure 3.2. A socio-cognitive framework for test validation (Weir, 2005, pp. 44-47)
Compared with the test usefulness framework of Bachman and Palmer (1996), the socio-cognitive framework defines two stages of test validation, clarifies the relationship between the validity evidence at each stage to a certain extent, points out the possible problems with the various types of validity evidence, and further develops Messick's unitary view of validity. The framework has been used to verify the validity of both the Aptis tests and the Main Suite Examinations (MSE).
The socio-cognitive framework provides more specific and clearer guidance on what kinds of validity evidence should be collected. However, like Bachman and Palmer's (1996) test usefulness framework, it may not be easy to operate, as addressing each element can be time-consuming and complex (Fulcher, 2015). It also does not specify the collection, selection and judgment criteria for the five types of validity evidence. Some types even directly carry over concepts from the period of categorized validity, such as "criterion-related validity", which shows that the framework does not fully break away from classification. Validation still remains at the simple listing of evidence, and there is no way to know where evidence collection starts and ends. Therefore, in essence, the test usefulness framework and the socio-cognitive framework still embody categorized concepts of validity; although they were born in the period of unitary validity, they can only be regarded as a "transition product" (Luo, 2019, p. 79). Neither framework gives a satisfactory answer to how to integrate all kinds of evidence so that validation becomes a coherent process from start to finish. Hence, an argument-based approach to validation was adopted by Kane (1992, 2001, 2002, 2004, 2006) and Kane et al. (1999), and it has been widely accepted and revised by scholars (e.g., Bachman, 2005; Bachman & Palmer, 2010; Chapelle et al., 2008; Kane, 2013; Mislevy, Steinberg & Almond, 2003). The following sections discuss several validation frameworks under the argument-based approach in detail.
The interpretive argument framework originates from Kane's (1992) reflection on the existing definitions of validity. He considers that the definitions of validity from Cronbach (1971), AERA et al. (1985) and Messick (1989) take into account the rationality of the inferences made based on the interpretation of test scores, but lack sufficient explanatory power regarding the collection, analysis and demonstration of validity evidence. In view of this, Kane (1992) believes that validity should be the overall rationality of the argument for the interpretation of test scores.
Based on this thinking on the definition of validity, in 1992 Kane puts forward an argument-based validation framework composed of an interpretive argument and a validity argument. The framework is divided into two steps: the first step is to build a theoretical framework (the interpretive argument); the second step is to test the theoretical framework (the validity argument). The interpretive argument is later renamed the interpretive/use argument to show that equal attention is paid to the interpretation and the use of scores (Kane, 2013, 2016). In the adjusted formulation, the interpretive/use argument explicitly brings test consequences into the scope of validation, which reflects Messick's unitary validity idea, that is, test validation is not just about the interpretation of scores but also includes the evaluation of the use of test scores and their consequences.
Figure 3.3. Model of Toulmin's argument structure (Toulmin, 1958, 2003)
The validation framework of Kane (2006) adopts the model of Toulmin's argument structure (see Figure 3.3) as the basis for logical reasoning. The model consists of three necessary elements: data, claims and warrants. Among them, data are the basis of claims, and warrants are the basis for proving the rationality of claims. In addition, the elements of backing, rebuttals and rebuttal evidence only need to provide further support or counterevidence when a warrant is questioned; qualifiers (such as "probably", "possibly" and "certainly") are used to strengthen or weaken the support of warrants for claims, and they are also the object of rebuttals (Toulmin, 2003). Therefore, according to the model of Toulmin's argument structure, inferencing is a derivation process from data to claims. This process should be supported by sufficient warrants, and the warrants in turn should be supported by related backing. The rationality of the reasoning depends on whether the backing supporting the warrant is reliable and whether the rebuttals can be established. If there is a rebuttal, it is necessary to provide conclusive rebuttal evidence to cancel it. Kane's interpretive argument framework (2006, 2013, 2016) specifies the scoring inference, which links observations as data (test takers' performance) to observed scores (test scores); the generalization inference, which links the observed scores to universe scores (consistent scores); the extrapolation inference, which links the universe scores to target scores (the predicted performance in real life) as evidence for test use; and, finally, the decision-making inference, which links the target scores to decisions as a claim (see Figure 3.4).
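To make the structure of such an argument concrete, the chain of inferences can be represented as a simple data structure. The following sketch is purely illustrative: the class, its field names and the short descriptions are inventions of this example, not notation taken from Kane or Toulmin.

```python
from dataclasses import dataclass, field

@dataclass
class Inference:
    """One Toulmin-style inference in an interpretive argument.

    All field names here are illustrative labels chosen for this sketch,
    not terminology fixed by Kane (2006) or Toulmin (2003).
    """
    name: str        # e.g. "generalization"
    grounds: str     # the data the inference starts from
    claim: str       # what the inference concludes
    warrant: str     # why the grounds license the claim
    backing: list = field(default_factory=list)    # evidence for the warrant
    rebuttals: list = field(default_factory=list)  # conditions that defeat it

    def is_supported(self, surviving_rebuttals):
        # A claim stands only if the warrant has backing and no rebuttal
        # survives scrutiny.
        return bool(self.backing) and not surviving_rebuttals

# The chain for a reading test, following the order scoring ->
# generalization -> extrapolation -> decision described in the text.
chain = [
    Inference("scoring", "test performance", "observed score",
              "scoring rules are applied accurately"),
    Inference("generalization", "observed score", "universe score",
              "observed scores estimate scores over parallel tasks and forms"),
    Inference("extrapolation", "universe score", "target score",
              "test performance predicts real-life performance"),
    Inference("decision", "target score", "decision",
              "scores support the intended decisions"),
]

# Each inference's claim serves as the grounds of the next, which is what
# makes the interpretive argument a single connected chain.
links_ok = all(a.claim == b.grounds for a, b in zip(chain, chain[1:]))
```

The point of the sketch is only the chaining: a weakness in any earlier inference propagates forward, which is why validation effort must be allocated across the chain rather than to one link alone.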
Empirical validation studies with interpretive argument framework
Several validation studies have adopted interpretive argument frameworks so far. The current study mainly reviews some major validity studies conducted under Chapelle et al.'s (2008) interpretive argument framework, in order to provide references and food for thought for the current study.
Lim (2009) conducts a study on the writing section of the Michigan English Language Assessment Battery (MELAB), which is designed by the English Language Institute of the University of Michigan and used internationally, with Chapelle et al.'s (2008) interpretive argument framework, focusing mainly on the evaluation inference. The participants are 23 raters and 10,536 test takers, and the data are 29,831 ratings. The findings of the study show that different prompts and raters do not pose a threat to the validity of the test results.
Chapelle, Enright and Jamieson (2010) employ the interpretive argument framework to conduct a study on an academic grammar test designed and used by Iowa State University. Two sets of the grammar test questions were used as instruments in the study. Several findings were reported: the test questions were developed based on research on L2 grammar development and were modified according to the opinions of test takers and teachers; the scoring standards were formulated and revised according to the analysis of test scores; generally speaking, the test questions were reliable; there were differences between groups with different proficiency levels; there was a positive, though not high, correlation between the test and TOEFL scores; and the predicted and actual test results were consistent.
Gaillard (2014) applies the interpretive argument framework (Chapelle et al., 2008) in investigating an elicited imitation task for French proficiency assessment in institutional and research settings. The participants include teaching assistants and language testing experts. Through the analysis of internal consistency, interrater reliability, correlation coefficients and regression on expert judgment and feedback on the scoring rubric and test items, test scores of the elicited imitation task, etc., the study obtains several sources of evidence showing that the elicited imitation task supports the meaning of the test scores and their use.
Brooks and Swain (2014) conduct a study on the TOEFL iBT speaking section employing the interpretive argument framework (Chapelle et al., 2008). The TOEFL iBT speaking section, recordings of in-class and out-of-class speaking activities, and background questionnaire interviews serve as instruments. Speaking performances regarding grammatical, discourse and vocabulary features, as well as perspectives on the TOEFL oral tasks and on oral performance in academic settings, were analyzed to identify similarities and differences between oral performance in class and in the test.
Li (2015) applies Chapelle et al.'s interpretive argument framework to examine the English Placement Test (EPT) developed and adopted by Iowa State University. The study takes the extrapolation and utilization inferences into consideration, analyzing test scores of the EPT and the TOEFL iBT, test scores and essays from in-class tests and final exams in ESL courses, responses from a questionnaire and self-assessment, and interview data. The results indicate that there is a moderate relationship between EPT scores and TOEFL iBT scores, and a weak to moderate relationship with self-assessment. ESL students initially felt frustrated but, owing to the benefits of attending the mandatory courses, were satisfied with the placement decision. Teachers were usually satisfied with the accuracy of the proficiency levels of EPT test takers. Academic advisors perceived the EPT and the placement decisions positively. EPT scores predicted test takers' academic performance.
Riazi (2016) investigates the TOEFL iBT writing section with the interpretive argument framework (Chapelle et al., 2008). Twenty international postgraduate students were recruited as participants. Text features from the test tasks and assignments were analyzed through repeated-measures analysis of covariance. The results indicate a high degree of consistency between the text features in the tests and in the assignments.
Sun (2016) conducts research on the College English Test Band 4 (CET-4), which is designed by the Chinese Ministry of Education and used by universities and enterprises in China, by using the interpretive argument framework. The research mainly focuses on CET-4 test developers, test users and test takers. The researcher analyzed interview data from test developers and test users, as well as a questionnaire on washback, and found that the uses test users made of CET-4 scores differed from the uses recommended by the test developers.
There is no doubt that the above-mentioned empirical studies have made many contributions and have inspired the researcher in the current study. In general, it is necessary to absorb the merits of the existing frameworks. Chapelle et al.'s (2008) interpretive argument framework is adopted to guide the current study in building an interpretive argument for the VSTEP.3-5 reading test, because this framework provides a holistic and systematic process for assessing test constructs through inferences such as domain description and explanation. The next sections discuss the justification for the use of the interpretive argument and the articulation of an interpretive argument for the VSTEP.3-5 reading test respectively.
Interpretive argument for the VSTEP.3-5 reading test
3.5.1 Justification for the use of interpretive argument
The current study draws on an argument-based framework as an operational validation model, specifically adapted from Chapelle et al.'s (2008) interpretive argument, for the following reasons. On the one hand, an interpretive argument gives an explicit statement of the proposed interpretation and use of test scores, which sets a concrete research agenda for test validation. One of the important purposes of the current study is to investigate the test-taking process of test takers, and Chapelle et al.'s (2008) adaptation of the interpretive argument to the TOEFL iBT project serves as a good example in this respect. On the other hand, an interpretive argument is tolerant and can be applied to any kind of test interpretation, as it does not preclude any claim or evidence. This flexibility enables test developers, users and other stakeholders to customize the interpretive argument to fit the purpose for which the test is used and to allocate validation efforts reasonably. Thus, Chapelle et al.'s (2008) interpretive argument is conceived as the most appropriate framework to guide the validation process of the current study.
3.5.2 Articulation of an interpretive argument for the VSTEP.3-5 reading test
As mentioned above, an interpretive argument involves a chain of inferences leading from test scores to test uses and consequences. For instance, the interpretive argument for the TOEFL iBT (Chapelle et al., 2008) includes six inferences in total, targeting different aspects of score interpretation and use. Given the practical constraints, it is impossible for the current study to address all the inferences as the TOEFL project did. Instead, a plausible solution is to take into consideration the specific context of the VSTEP.3-5 reading test and to allocate validation efforts effectively, seeking evidence to support the most critical and relevant inferences related to the test purpose.
As mentioned in Chapter I, the current study aims to investigate the validity of the VSTEP.3-5 reading test from both product-based and process-based perspectives. In this respect, the study prioritizes the generalization and explanation inferences as the main research focuses, as they are among the most important inferences in the validity argument. The following part elaborates on the flow of the interpretive argument for the VSTEP.3-5 reading test by articulating the warrant, assumptions, backing evidence and potential rebuttals of each inference.
The generalization inference involves the link from observed scores to expected scores over relevant parallel versions of tasks and test forms; it answers the first three research questions. Four assumptions are proposed to authorize the generalization inference for the VSTEP.3-5 reading test. In order to yield sufficient evidence to back these four assumptions, four lines of research inquiry are suggested: (1) findings on the characteristics of the VSTEP.3-5 reading test tasks; (2) findings from the statistical data on item difficulty of the VSTEP.3-5 reading test; (3) findings on DIF with respect to test takers' genders and majors; (4) findings from the data on the test reliability of the VSTEP.3-5 reading test. The strength with which the generalization inference claim is made increases if it survives the rebuttals. Potential rebuttals to the generalization inference of the VSTEP.3-5 reading test may include: (1) the test tasks are beyond the scope specified in the test specifications; (2) the test items are not designed in accordance with the test specifications; (3) the VSTEP.3-5 reading test shows DIF for genders or majors; (4) the test reliability is lower than expected. The warrant, assumptions, backing evidence and potential rebuttals underlying the generalization inference are listed in Table 3.2.
Table 3.2. Summary of the generalization inference for the VSTEP.3-5 reading test
Warrant: Observed scores are estimates of expected scores over the relevant parallel versions of tasks and test forms.
Assumptions:
1. The configuration of tasks on the VSTEP.3-5 reading measures is appropriate for the intended interpretation.
2. The test items are designed with the predetermined difficulty described in the test specifications.
3. Test takers' genders and majors do not affect their reading scores.
4. The test reliability of the VSTEP.3-5 reading test is in a good range.
Backing:
1. Findings on the characteristics of the VSTEP.3-5 reading test tasks.
2. Findings from the statistical data on the items of the VSTEP.3-5 reading test.
3. Findings on DIF with respect to test takers' genders and majors.
4. Findings from the data on the test reliability of the VSTEP.3-5 reading test.
Potential rebuttals:
1. The test tasks are beyond the scope specified in the test specifications.
2. The test items are not designed in accordance with the test specifications.
3. The VSTEP.3-5 reading test shows DIF for genders or majors.
4. The test reliability is lower than expected.
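The second backing line rests on estimates of item difficulty. As a minimal, purely illustrative sketch (the study's actual analyses would use dedicated Rasch software; the function name and toy response matrix below are invented for this example), item difficulties on the logit scale can be approximated directly from a scored response matrix:

```python
import math

def prox_item_difficulties(responses):
    """Approximate Rasch item difficulties with a simple logit transform.

    responses: list of rows, one per test taker, with 1/0 item scores.
    Returns difficulties centred at 0 logits (higher = harder), a first
    approximation in the spirit of the PROX normal-approximation method.
    """
    n_items = len(responses[0])
    n_persons = len(responses)
    difficulties = []
    for i in range(n_items):
        correct = sum(row[i] for row in responses)
        # Smoothing avoids infinite logits for items everyone got
        # right or wrong.
        p = (correct + 0.5) / (n_persons + 1.0)
        difficulties.append(math.log((1 - p) / p))
    mean_d = sum(difficulties) / n_items
    return [d - mean_d for d in difficulties]

# Toy data: item 1 is answered correctly less often than item 0,
# so it should come out as the harder item.
demo = [[1, 0], [1, 0], [1, 1], [0, 0], [1, 1]]
d = prox_item_difficulties(demo)
```

In a validation context, estimates like these would then be compared against the difficulty levels predetermined in the test specifications (assumption 2 above).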
Because the explanation inference involves a link from test scores to a theoretical construct, it is fundamentally similar to the traditional concept of construct validity, and a thorough understanding of influential perspectives on construct validity is important for formulating the assumptions that support this inference. The explanation inference answers the last two research questions. Three assumptions are proposed to authorize the explanation inference for the VSTEP.3-5 reading test. In order to yield sufficient evidence to back these assumptions, three main lines of research inquiry are suggested: (1) findings from the comparison between the reading processes that experts judged the test items to require and the reading processes in which the test takers actually engaged; (2) findings from the analysis of eye movements of successful and unsuccessful test takers in processing the different types of VSTEP.3-5 reading test items; (3) findings on the extent to which the VSTEP.3-5 reading test discriminates between test takers at different proficiency levels in terms of cognitive reading processes. If the claim survives the rebuttals, the strength of the explanation inference increases. Potential rebuttals to the explanation inference of the VSTEP.3-5 reading test may include: (1) test takers might employ test-wise strategies to answer the test questions; (2) successful and unsuccessful test takers do not differ in their eye movements when processing the different types of VSTEP.3-5 reading test items; (3) the VSTEP.3-5 reading test cannot discriminate between test takers at different proficiency levels in terms of cognitive reading processes. The warrant, assumptions, backing evidence and potential rebuttals underlying the explanation inference are listed in Table 3.3.
Table 3.3. Summary of the explanation inference for the VSTEP.3-5 reading test
Warrant: Test takers' scores on the VSTEP.3-5 reading test can be attributed to the construct of English reading proficiency.
Assumptions:
1. Reading processes engaged in by test takers vary according to theoretical expectations.
2. There are differences between successful and unsuccessful test takers in processing the different types of VSTEP.3-5 reading test items in terms of their eye movements.
3. The VSTEP.3-5 reading test can discriminate between test takers at different proficiency levels in terms of cognitive reading processes.
Backing:
1. Findings from the comparison between the reading processes that experts judged the test items to require and the reading processes in which the test takers actually engaged.
2. Findings from the analysis of eye movements of successful and unsuccessful test takers in processing the different types of VSTEP.3-5 reading test items.
3. Findings on the extent to which the VSTEP.3-5 reading test discriminates between test takers at different proficiency levels in terms of cognitive reading processes.
Potential rebuttals:
1. Test takers might employ test-wise strategies to answer the test questions.
2. Successful and unsuccessful test takers do not differ in their eye movements when processing the different types of VSTEP.3-5 reading test items.
3. The VSTEP.3-5 reading test cannot discriminate between test takers at different proficiency levels in terms of cognitive reading processes.
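Assumption 2 of the explanation inference turns on group differences in eye-movement measures. The following sketch is illustrative only: the fixation values are invented, and Cohen's d is chosen here as one common summary statistic for such a comparison, not as the study's actual procedure. It compares mean fixation durations of successful and unsuccessful test takers with a standardized mean difference:

```python
import statistics

def cohens_d(group_a, group_b):
    """Standardized mean difference (Cohen's d) using the pooled
    sample standard deviation of the two groups."""
    na, nb = len(group_a), len(group_b)
    va, vb = statistics.variance(group_a), statistics.variance(group_b)
    pooled_sd = (((na - 1) * va + (nb - 1) * vb) / (na + nb - 2)) ** 0.5
    return (statistics.mean(group_a) - statistics.mean(group_b)) / pooled_sd

# Hypothetical mean fixation durations (ms) per test taker on one
# item type; a positive d means the first group fixates longer.
successful = [210, 230, 200, 220, 215]
unsuccessful = [260, 280, 255, 270, 265]
effect = cohens_d(unsuccessful, successful)
```

An effect size near zero across item types would align with rebuttal 2 above (no group difference), whereas a sizeable effect would back the assumption.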
Summary
Validation is an endless process of investigation. In this chapter, the development of the concept of validity and the representative validation frameworks have been documented in the first part to provide guidance and reference for the current study. In the second part, the justification for the use of the interpretive argument and the articulation of an interpretive argument for the VSTEP.3-5 reading test have been elaborated. This chapter serves as the theoretical underpinning for achieving the aims of the current study, which investigates the validity of the VSTEP.3-5 reading test from both product-based and process-based perspectives.
RESEARCH METHODOLOGY
Ontology and epistemology for the design of the study
In order to make up for the shortcomings of independent qualitative and quantitative methods, mixed-methods research has become a third type of research design parallel to quantitative research and qualitative research, and it is widely used in the fields of sociology, education, medicine, evaluation, etc. (Greene, 2007; Teddlie & Tashakkori, 2008). It emerged in the 1960s and became common in the 1980s. Researchers have realized that mixed methods have a number of potential advantages; in other words, quantitative and qualitative methods can complement each other and enhance the reliability and validity of research results.
With the deepening understanding of mixed methods, different classifications have emerged. Tashakkori and Teddlie (2006) conduct the first study to systematically classify mixed methods, dividing mixed methods designs into four types: (1) concurrent design; (2) sequential design; (3) conversion design; (4) fully integrated design. Creswell and Clark (2007) divide mixed methods research designs into four categories from the perspective of the combination of qualitative and quantitative research methods: (1) triangulation design; (2) embedded design; (3) sequential explanatory design; (4) sequential exploratory design. Creswell and Clark (2011) adjust these to seven types. Other researchers have also proposed classifications, such as Creswell et al. (2003), Teddlie and Tashakkori (2003), and Leech and Onwuegbuzie (2009). In the latest work of Creswell and Clark (2018), mixed methods are divided into three types, mainly according to the different purposes for which qualitative and quantitative data are used: (1) the convergent design, in which quantitative and qualitative data are collected at the same time, with the purpose of comparing the quantitative and qualitative analysis results, combining the two kinds of data for interpretation in the analysis, or integrating the two kinds of data through conversion; (2) the explanatory sequential design, in which quantitative research precedes qualitative research, aiming to further explain the quantitative results of the first stage with qualitative data; (3) the exploratory sequential design, which moves from qualitative research to quantitative research, with the results of the two stages integrated in the overall analysis stage in order to supplement, explain or verify the conclusions of the qualitative research with quantitative data.
In the process of test design and development, it is necessary to adopt a variety of research methods and integrate evidence from different fields to ensure the most important quality attribute of a test: validity (AERA et al., 2014). As the definitions and concepts of language ability and test validity continue to change and expand, the paradigm in the field of language testing and assessment has shifted from a quantitative approach to a more balanced and flexible mixed approach (Jang et al., 2014). Specifically, there are two main reasons for this paradigm shift. Firstly, the evolution of validity, as well as the shift in focus from validating test scores to validating the interpretation and use of test scores, inevitably leads to applying multiple methods instead of a single method to address the multi-faceted nature of language testing and assessment. Secondly, language proficiency is an important component of language testing and assessment, which requires researchers to place themselves in a dynamic and interactive language environment; therefore, a single perspective for validation is not enough. In recent years, many studies in the field of language testing have used mixed qualitative and quantitative methods to investigate validity, such as Lee and Greene (2007), Jang et al. (2008), Kim (2008), Norris (2008), Turner (2009), Barkaoui (2007, 2010), Tan (2011), Plakans and Gebril (2012), Baker (2012), Youn (2013, 2015), Liu (2015), Elliott and Lim (2016), Galaczi and Khabbazbashi (2016), Khalifa and Docherty (2016), and Lv (2017). Using mixed research methods to study validity is an important trend in the field of language testing. Therefore, the current study adopts a convergent mixed methods design to investigate the validity of the VSTEP.3-5 reading test, which is elaborated in the following section.
The convergent mixed methods design for the study
The validation of L2 reading is a multi-component and multi-stage process, which can be conceptualized from the perspectives of products and processes. The product perspective relies on scores, such as score patterns, item difficulty, etc.; the process perspective focuses on the reading test-taking processes related to test scores. Neither perspective can be explained separately; they should be integrated through mixed methods so that they complement each other. That is to say, quantitative methods alone cannot be used to explore in-depth information about test takers' reading test-taking processes, and it is highly advisable that quantitative analysis be supplemented by qualitative analyses. Therefore, in order to validate the construct of the VSTEP.3-5 reading test in terms of test tasks, test items and test reliability, and to investigate what cognitive processes test takers undergo while taking the VSTEP.3-5 reading test, the current study employed a convergent mixed methods research design.
The following table illustrates the overall design of the current study. To be specific, the mixed methods design is applied to explore the conceptualization of products and processes for the VSTEP.3-5 reading test construct, aiming to provide appropriate insights into validity issues. The evidence obtained from the mixed methods supports the assumptions in each of the two inferences in the interpretive argument of the VSTEP.3-5 reading test.
The convergent mixed methods design for the study
Research questions | Methods | Data collection | Data analyses
Q1 | Qualitative and quantitative | A VSTEP.3-5 reading test paper; the VSTEP.3-5 reading test specifications; expert interview; test takers' scores of the VSTEP.3-5 reading test | Document-based analysis; content-based analysis; thematic analysis; Rasch model analysis
Q2 | Quantitative | Test takers' scores of the VSTEP.3-5 reading test | DIF analysis
Q3 | Quantitative | Test takers' scores of the VSTEP.3-5 reading test | Reliability analysis
Q4 | Qualitative | Expert judgement; think aloud protocols | Content-based analysis; thematic analysis
Q5 | Quantitative and qualitative | Eye tracking; reading processing checklist | Quantitative and qualitative analyses
Table 4.1 presents an overview of the research design, containing a summary of the interpretive argument inferences, research questions, data collection and data analyses. To address the generalization inference, questions 1, 2 and 3 were discussed respectively. To answer question 1, which investigated the extent to which the test tasks and test items of the VSTEP.3-5 reading test met the requirements of the test specifications, a VSTEP.3-5 reading test paper, the VSTEP.3-5 reading test specifications, expert interviews and test takers' scores on the VSTEP.3-5 reading test were collected; both qualitative and quantitative analyses were employed to examine the test tasks and test items. To address question 2, which explored the extent to which test takers' genders and majors affected their reading scores, test takers' scores on the VSTEP.3-5 reading test were subjected to DIF analysis to generate a quantitative description. To answer question 3, which examined the extent to which VSTEP.3-5 reading test scores are adequately reliable in measuring test takers' English proficiency, test takers' scores on the VSTEP.3-5 reading test were analyzed quantitatively.
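The DIF analysis for question 2 can be illustrated with a minimal sketch. The study itself relied on Rasch-based analyses, so the Mantel-Haenszel statistic below is only an assumed, simplified stand-in for detecting gender- or major-related DIF, and the data structures are hypothetical:

```python
from collections import defaultdict

def mantel_haenszel_dif(responses, groups, item):
    """Mantel-Haenszel common odds ratio for one item.

    responses: one dict per test taker mapping item id -> 0/1 score.
    groups: 'ref' or 'focal' label per test taker.
    An odds ratio near 1 suggests no DIF; values far from 1 flag the item.
    """
    # Stratify test takers by total score; each stratum holds a 2x2 table:
    # rows = reference/focal group, columns = correct/incorrect on the item.
    strata = defaultdict(lambda: [[0, 0], [0, 0]])
    for resp, grp in zip(responses, groups):
        table = strata[sum(resp.values())]
        row = 0 if grp == 'ref' else 1
        col = 0 if resp[item] == 1 else 1
        table[row][col] += 1
    num = den = 0.0
    for (a, b), (c, d) in strata.values():
        n = a + b + c + d
        num += a * d / n
        den += b * c / n
    return num / den if den else float('nan')
```

In practice the item under study is examined against matched total-score strata so that group differences in overall ability do not masquerade as item bias.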
To address the explanation inference, questions 4 and 5 were answered respectively. For question 4, which investigated the reading processes test takers applied to answer VSTEP.3-5 reading test items correctly and the extent to which the expected reading processes corresponded with the processes test takers actually engaged in while taking the test, expert judgement data and think aloud protocols were collected and analyzed qualitatively. For question 5, which aimed to distinguish successful and unsuccessful test takers in processing the different types of VSTEP.3-5 reading test items in terms of their eye movements, and to investigate the extent to which the VSTEP.3-5 reading test discriminates test takers at different proficiency levels in terms of cognitive reading processes, eye tracking data and a reading processing checklist were analyzed both quantitatively and qualitatively.
Participants of the study
Following the convergent mixed methods research design, two sampling techniques, cluster sampling and purposive sampling, were adopted in the current study. Quantitative and qualitative research methods require different sample sizes because of their different focuses. With quantitative methods, numerical data are gathered and then generalized across groups of participants to explain trends or phenomena. In the view of Dörnyei (2007), qualitative research methods involve data collection procedures that result primarily in open-ended, non-numerical data, which are then analyzed primarily by non-statistical methods. Thus, qualitative research should purposefully select a small number of participants who can provide rich research information, which makes its sampling different from the probability sampling of quantitative research. Brief information about the participants in the study is shown in Table 4.2.
Sampling of participants in the study
Study stage | Participants | No. of participants
VSTEP.3-5 reading test | College students in different majors from two Chinese universities | 200
Expert interview and judgement | College EFL teachers in Vietnam and China | 4
Think aloud protocols | College students from Chinese universities | 6
Eye tracking | College students from Chinese universities | 15
Framed within an argument-based approach to test validation, the study engaged different groups of participants so as to collect evidence for different inferences. The current study included several stages: one test (the VSTEP.3-5 reading test), expert interviews, expert judgement, think aloud protocols and eye tracking. In total, 225 participants were recruited. The following part introduces the participants and the rationale for choosing them.
4.3.1 Test takers for the VSTEP.3-5 reading test
VSTEP.3-5 has been widely used as a standard to measure the comprehensive English proficiency of Vietnamese English learners, and college students in Vietnam have been an important group of test takers, accounting for a considerable proportion. However, because of Covid-19, the researcher could not collect data from Vietnamese college students. Therefore, the participants recruited were college students in China, selected by cluster sampling. To ensure sampling diversity, the researcher recruited college students from two universities in northern China. CET-4 is an English proficiency test for college students in China; for a long time, employers have regarded CET-4 scores as an important indicator of the English proficiency of job applicants, and the test has high validity (Wang, 2008). Therefore, the researcher set the criterion that participants had taken the CET-4 in the previous June and had passed the test (CET-4 scores equal to or greater than 425).
With the help of colleagues, the researcher contacted the staff members (A and B) who worked in the offices of academic affairs at the two universities and asked them to provide information about students, including genders, majors and CET-4 scores. The researcher promised that the students' information would be used for research purposes only. In total, 864 students (538 from university A and 326 from university B) met the criterion. Several issues were considered in selecting participants to take the VSTEP.3-5 reading test. Firstly, because of administrative restrictions, it was very hard to organize 864 students to take the test at the same time. Secondly, considering the DIF analysis in the second research question, the test takers should be gender-balanced and major-balanced to make the statistics more reliable. Thus, to ensure a sufficient amount of score data for the quantitative analysis while taking the above issues into account, the researcher aimed to recruit 200 participants from the two universities for the first study stage, specifically 100 test takers from each university. To make the test takers as representative as possible, the researcher selected 50 male and 50 female test takers majoring in Physics, Chemistry, Geography, etc. from university A, which is known as a prestigious university in the natural sciences, and 50 male and 50 female test takers majoring in Journalism, Education, Tourism Management, etc. from university B, which excels in the humanities and social sciences. After a list of target test takers was identified, the counselors at the target test takers' faculties were contacted with the help of staff members A and B. After it was explained that students were being gathered to take the VSTEP.3-5 reading test for research purposes, the counselors helped contact the target test takers. All of the target test takers agreed to cooperate with the study and signed consent forms. The following table presents the information of the test takers for the investigated VSTEP.3-5 reading test.
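The gender- and major-balanced selection described above is, in effect, stratified random sampling. The sketch below illustrates the idea; the field names and fixed seed are hypothetical, not taken from the study:

```python
import random

def stratified_sample(students, per_stratum, seed=2024):
    """Draw an equal-sized random sample from every (university, gender) stratum.

    students: list of dicts with 'university' and 'gender' keys
    (plus any identifying fields). Returns the combined sample.
    """
    rng = random.Random(seed)  # fixed seed keeps the draw reproducible
    strata = sorted({(s['university'], s['gender']) for s in students})
    selected = []
    for uni, gender in strata:
        pool = [s for s in students
                if s['university'] == uni and s['gender'] == gender]
        # sample without replacement, capped at the pool size
        selected.extend(rng.sample(pool, min(per_stratum, len(pool))))
    return selected
```

With two universities, two genders and `per_stratum=50`, this yields the 200-taker design described above.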
Information of the test takers for the investigated test
Majors included Literature of Theatre, Film and Television; Chinese Language and Literature; Tourism Management; etc.
4.3.2 Experts for interview and judgement
Expert interviews and expert judgement were employed as complementary methods to provide more reliable evidence for investigating the validity of the VSTEP.3-5 reading test in the current study. Experts for interview and judgement were recruited by purposive sampling. The standards for recruiting experts were clearly stated and reflected both theoretical and practical backgrounds and knowledge. Firstly, the expert should hold a doctoral degree in a field related to applied linguistics and should have attended courses in language testing and assessment. Secondly, the expert should be actively participating in the development and evaluation of second language reading tests in his/her own teaching practice. Thirdly, the expert should have experience in teaching L2 reading courses or language testing and assessment courses at his/her own institution.
Two English teachers in Vietnam (teachers A and B) and two English teachers in China (teachers C and D) who met the recruitment requirements were invited to participate in this study. Specifically, the researcher sent each teacher a letter of invitation via email, explaining the purposes of the study and their roles in it. All of them replied and agreed to participate. The following table presents the basic information of the experts for interview and judgement.
Information of the experts for interview and judgement
ID | Nationality | Degree | Years of teaching | Expertise
Teacher A | Vietnam | PhD | 31 | General English teaching and assessment
Teacher B | Vietnam | PhD | 20 | Language testing and assessment
Teacher C | China | PhD | 22 | English language testing and reading teaching
Teacher D | China | PhD | 15 | English language testing
Teacher A is a member of the VSTEP.3-5 development team and works at an international studies and foreign language university in Vietnam, with 31 years of experience in teaching and assessing general English skills. Teacher B also works at an international studies and foreign language university in Vietnam and has 20 years of experience in teaching courses on language testing and assessment. Teacher C works at a normal university in China, with 22 years of experience in teaching courses on English language testing and English reading. Teacher D works at a comprehensive university and has 15 years of experience in teaching courses on language testing and assessment for English majors.
4.3.3 Participants for think aloud protocols
For the small-sample think aloud protocol data collection, participants were recruited by purposive sampling. The researcher posted a recruitment advertisement for think aloud protocol participants on WeChat, a very popular communication tool in China. The advertisement included information such as the purposes and procedures of the think aloud protocols, the criteria for participation, and the confidentiality of participants' personal information.
As research has shown that proficiency may influence test takers' use of reading subskills and strategies (Cohen, 2006), participants' CET-4 scores were used as a criterion in an attempt to recruit students at different proficiency levels. The CET-4 scores were based on participants' official grade reports, and participants were allowed to report their highest CET-4 scores. Twenty-three students expressed their willingness to take part in the think aloud protocols. However, on the one hand, the CET-4 scores reported by some of them were close together; on the other hand, the procedures for analyzing think aloud protocol data are very time-consuming. Thus, after screening, the researcher recruited six college students whose CET-4 scores were representative of different levels to participate in the think aloud protocols. The basic information of the participants for the think aloud protocols is provided in Table 4.5.
Information of the participants for think aloud protocols
4.3.4 Participants for eye tracking
The current study adopted purposive convenience sampling to recruit participants for eye tracking. The researcher asked the counselors from the two universities in China to help send the eye tracking recruitment advertisement to WeChat groups of students. The advertisement contained information about the purposes and procedures of the eye tracking, the recruitment criteria, and the confidentiality of participants' information. First, because eyesight affects eye movement data, participants should have good eyesight and not wear glasses. Second, participants should have passed the CET-4 and have official grade reports. Third, participants should be familiar with computers and on-screen tests. Fifteen college students from the two universities in China contacted the researcher and were recruited as participants for eye tracking. The detailed information of the participants for eye tracking is provided in Table 4.6.
Information of the participants for eye tracking
Instruments of the study
4.4.1 A version of the VSTEP.3-5 reading test paper
A version of the VSTEP.3-5 reading test paper was the key research instrument in the current study and was used in all of the study stages. The researcher wrote a letter of request to the VSTEP.3-5 development team to access this VSTEP.3-5 reading test paper and acquired the team's permission to use it for research purposes. The VSTEP.3-5 reading test paper includes four passages, each followed by 10 multiple-choice questions with four options. The time constraint is 60 minutes. The basic information, including passage number, genre, topic and content, is presented in Table 4.7.
Basic information of the VSTEP.3-5 reading test paper
Passage number | Genre | Topic | Content
Passage 1 | | | Introduces different blood type personalities
Passage 2 | Argumentative | Social science | Discusses whether volunteering abroad is beneficial
Passage 3 | | | Explains whether freezing temperatures kill bees
Passage 4 | Argumentative | Professional scientific article | Discusses the origin of continents
4.4.2 The VSTEP.3-5 reading test specifications
The VSTEP.3-5 reading test specifications, which are confidential, were provided by the VSTEP.3-5 development team for research purposes. The specifications have two main parts: the instructions for scoring the VSTEP.3-5 reading test, and the specific instructions for writing the reading passages and items. The first part lists the scoring rubrics and cut-off scores for levels 3-5. The second part elaborates the general information of the four passages and the specific writing principles for each passage, such as passage sources, topics, reading subskills and difficulty levels. The VSTEP.3-5 reading test specifications were applied as a standard for investigating the validity of the VSTEP.3-5 reading test paper from both product-based and process-based perspectives.
4.4.3 Expert interview questions
Expert interview questions were developed for the experts who participated in evaluating the test tasks of the investigated VSTEP.3-5 reading test, as analyzed by the researcher. There were two sets of interview questions (A and B). Interview questions A were designed to validate an important instrument, the modified framework of task characteristics for the VSTEP.3-5 reading test. Interview questions B were designed for judging the researcher's analysis, which covered several indicators under the characteristics of the input and the characteristics of the expected response, based on the modified framework of task characteristics for the VSTEP.3-5 reading test. Interview questions A are presented as follows:
1 To what extent do you think the modified framework of task characteristics contains sufficient aspects for investigating the test tasks of the VSTEP.3-5 reading test?
2 To what extent do you think the modified framework of task characteristics for VSTEP.3-5 reading test is reasonable in the design of the input?
3 To what extent do you think the modified framework of task characteristics for the VSTEP.3-5 reading test is reasonable in the design of the expected response? Interview questions B are presented as follows:
1 To what extent do you approve of the analyses of the investigated VSTEP.3-5 reading test in terms of text length, vocabulary, grammar, text domain, text topic and text level in the current study?
2 To what extent do you approve of the analyses of the investigated VSTEP.3-5 reading test in terms of response type in the current study?
4.4.4 The modified framework of task characteristics for the VSTEP.3-5 reading test
The framework of task characteristics (Bachman & Palmer, 1996) is widely used in language testing and assessment, but, as Manxia (2008) states, this framework is not designed for any specific type of test task or exam. Based on the nature of reading and the characteristics of the VSTEP.3-5 reading test, the current study developed a modified framework of task characteristics for analyzing the VSTEP.3-5 reading test.
The modified framework of task characteristics for the VSTEP.3-5 reading test
1 Characteristics of the input of VSTEP.3-5 reading test
2 Characteristics of the expected response of VSTEP.3-5 reading test
1 Characteristics of the input of VSTEP.3-5 reading test
Weir (1993) holds that reading is a selective process that happens between the reader and the text. The text in reading refers to each passage that the test takers have to read, process, and respond to. The quality of the text, that is, the quality of the input, determines the content validity of the VSTEP.3-5 reading test. The modified framework examines the characteristics of the input in detail, in terms of text length, language of input, text domain, text topic and text level.
The text length of the reading material is an important variable influencing test takers' performance, because it constrains the time available to access the information. Thus, it is important to take text length into consideration during language test development. An inevitable challenge that all test designers encounter is how long a text should be for them to test what they intend to test. Usually, text length is determined by the purpose of the test and the language proficiency of the intended test takers, and it usually increases as test takers' language proficiency improves.
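Checking a draft passage against a specified length band is straightforward to automate; the sketch below uses placeholder bounds, not the actual VSTEP.3-5 specification values:

```python
def within_spec_length(passage: str, min_words: int, max_words: int) -> bool:
    """Check whether a passage's word count falls inside the specified range."""
    n_words = len(passage.split())
    return min_words <= n_words <= max_words

# Example with placeholder bounds (not the real VSTEP.3-5 specification):
sample_text = "Bees survive freezing temperatures by clustering together for warmth."
print(within_spec_length(sample_text, 5, 100))  # -> True (9 words)
```

A test developer would run such a check over each of the four passages before piloting, flagging any passage whose length drifts outside the band set for its proficiency level.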
Language of input in the modified framework refers to the vocabulary and grammar of the four passages in the VSTEP.3-5 reading test. Alderson (2000) indicates that vocabulary plays an important role in reading passages, and it is vital for passage comprehension and hence for passage performance. Grammar also plays an indispensable part in reading passages.
Text domain in the modified framework refers to the general field of the passages. In order to guarantee the validity of reading tests, test designers must choose reading materials from several domains according to the requirements of the test specifications.
The topics of the testing passages were selected and decided by the test designers. Bachman (1990) holds that the input topic has a significant effect on test takers' performance when they are processing a reading comprehension test; this is a kind of "passage effect". To be specific, test takers who are familiar with the topic they are reading can understand the passage better than those who are not. Therefore, it is ideal to select topics at an appropriate level of specificity, and a topic should not be culturally biased or favor one section of the test population (Weir, 1993). It is commonly assumed that what readers know affects what they understand when reading, and that text content therefore affects how readers process text. Reading comprehension tests of satisfactory quality should test reading comprehension abilities across a wide range of topics (Alderson, 2000). In order to guarantee the validity of reading tests, test designers must choose reading materials according to the requirements of the test specifications.
When choosing reading passages, it is important to take the difficulty level fully into account, because passages that are too difficult or too easy cannot measure test takers' actual level. Readability is of particular concern because it helps identify which features make a text readable and allows the difficulty of the text to be adjusted to the target readers.
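One common way to operationalize readability is a formula-based index such as Flesch Reading Ease, which combines average sentence length with average syllables per word. The specifications' actual difficulty criteria are not reproduced here, so the sketch below is purely illustrative, and its syllable counter is a crude vowel-group heuristic:

```python
import re

def count_syllables(word: str) -> int:
    """Rough heuristic: count groups of consecutive vowels (min 1)."""
    groups = re.findall(r'[aeiouy]+', word.lower())
    return max(1, len(groups))

def flesch_reading_ease(text: str) -> float:
    """Flesch Reading Ease = 206.835 - 1.015*(words/sentences) - 84.6*(syllables/words).

    Higher scores indicate easier text.
    """
    sentences = [s for s in re.split(r'[.!?]+', text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    syllables = sum(count_syllables(w) for w in words)
    return (206.835
            - 1.015 * (len(words) / len(sentences))
            - 84.6 * (syllables / len(words)))
```

Indices like this are only proxies; in practice they would be checked against expert judgements of the kind collected in this study.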
2 Characteristics of the expected response of VSTEP.3-5 reading test
In a reading test, the design of the test questions, which carries the designers' expectations about test takers' responses, acts as one of the most influential factors in content validity. Bachman and Palmer (1996) explain that the characteristics of the expected response concern what test developers attempt to elicit, whether language use or physical responses, through written instructions, task design or the provided text input. The format of the expected response is the way in which the response is generated.
Bachman and Palmer (1996) propose three types of response: selected response, limited production response and extended production response. In their view, the typical selected response is the multiple-choice item, which requires test takers to select the best answer from two or more options.
3 Validation of the modified framework of task characteristics for the VSTEP.3-5 reading test
Data collection procedures of the study
4.5.1 Data collection procedures for research questions 1, 2 and 3
To answer the first part of research question 1, a version of the VSTEP.3-5 reading test paper and the VSTEP.3-5 reading test specifications were collected from the VSTEP.3-5 development team. A team member who had received permission from the team leader sent the electronic editions to the researcher and asked the researcher to sign a confidentiality agreement. After that, expert interview data were collected from four experts, concerning the modified framework of task characteristics for the VSTEP.3-5 reading test and the analyses of the test tasks of the VSTEP.3-5 reading test. Before the interview, the four experts each received a copy of the task characteristics framework (Bachman & Palmer, 1996), the modified framework of task characteristics for the VSTEP.3-5 reading test, the investigated VSTEP.3-5 reading test paper and the VSTEP.3-5 reading test specifications. All of the materials were in English; the original language of the VSTEP.3-5 reading test specifications was Vietnamese, and they were translated by the researcher into English. In the interviews, the experts shared their opinions about the reasonableness of the modified framework of task characteristics for the VSTEP.3-5 reading test and their insights into the test characteristics in terms of input and expected response. The researcher recorded all of the interview content for further analysis. Collecting the above-mentioned data helped answer the first part of research question 1, which aimed to investigate the compatibility between the test tasks and the test specifications.
To answer the latter part of research question 1, VSTEP.3-5 reading test scores were collected from the 200 test takers. Since the 200 test takers came from two universities and it was very hard to organize them to take the test together, they were invited to take the VSTEP.3-5 reading test at the same time in two classrooms at their own universities, using the traditional pen-and-paper method. The researcher asked three colleagues to help supervise the test to ensure the authenticity, objectivity and fairness of the test process. Then, the researcher collected all 200 test papers, scored each of them, and entered the scores on the computer.
For research questions 2 and 3, the data were also the VSTEP.3-5 reading test scores collected from the 200 test takers, and the data collection followed the same procedures as described above.
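For the reliability side of question 3, internal consistency of dichotomously scored items is commonly summarized with Cronbach's alpha. Whether the study used this exact statistic or a Rasch-based reliability index is not stated here, so the following is an assumed illustration over an item-by-taker score matrix:

```python
def cronbach_alpha(item_scores):
    """Cronbach's alpha for a score matrix.

    item_scores: list of per-item score lists, one inner list per item,
    aligned across the same test takers. Uses sample variance (n - 1).
    """
    def variance(xs):
        m = sum(xs) / len(xs)
        return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)

    k = len(item_scores)                              # number of items
    totals = [sum(col) for col in zip(*item_scores)]  # total score per taker
    item_var_sum = sum(variance(item) for item in item_scores)
    return k / (k - 1) * (1 - item_var_sum / variance(totals))
```

Perfectly parallel items drive alpha toward 1; uncorrelated items drive it toward 0.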
4.5.2 Data collection procedures for research question 4
To address the first sub-question of research question 4, expert judgement data were collected through the following procedures. Firstly, the four experts each received electronic editions of the investigated VSTEP.3-5 reading test paper, the VSTEP.3-5 reading test specifications in English, and expert judgement form A. Secondly, the experts were asked to categorize the reading subskills needed in the reading process, based on the VSTEP.3-5 reading test specifications and Khalifa and Weir's (2009) cognitive model of reading; the experts completed the entire judgement process independently. Then, a Zoom discussion was conducted to ensure the consistency of the four experts' categorizations of the reading subskills. Thirdly, expert judgement form B was provided to each of the four experts to identify the reading subskills targeted by each test item of the investigated VSTEP.3-5 reading test. The experts were given as much time as they needed to deliver reliable and accurate judgements of the reading processes. Later, in Zoom discussions, the subskills identified by the experts were reviewed item by item, and the experts were consulted on the reasons for their choices of some of the reading subskills, to ensure that any disagreements were resolved before the judgement was finalized. Finally, the final expert judgement was summarized and treated as a standard against which to compare the test takers' actual reading processes from the think aloud protocols.
To address the second sub-question of research question 4, think aloud protocol data were collected. The think aloud protocols were divided into two sessions: a training session and a test-taking session, the latter being the source of the formal data. The six recruited volunteers performed the think aloud protocols in a quiet and secure place to keep disruptions minimal. A training session, which started with sharing the purpose, principles and major procedures of the think aloud protocols, was arranged prior to the formal protocols. To promote each participant's understanding of the research method and ensure that he/she fully understood the requirements, the researcher provided him/her with clear and unambiguous instructions for performing think aloud protocols in Chinese. Once the participant understood the nature and demands of the method, two training tasks, an anagram and a short passage with comprehension questions, were provided to familiarize the participant with the process. Only one puzzle and a very short text with the same response format were involved in the training session, as the researcher did not want the participant to be exhausted and thereby affect his/her performance in the test-taking session. The training session aimed to promote reliable data in the formal data collection, not to affect its quality. The tasks in the training session were slightly easier than those in the test-taking session, as participants needed some time to familiarize themselves with the method and should not be discouraged or frustrated.
The training session was conducted before the formal data collection. The researcher used the training session to determine whether the participant was ready to enter the test-taking session, rather than rushing into it directly; if the participant was not well prepared, additional explanations were provided. The training lasted no more than 10 minutes; otherwise, the participant might feel exhausted during the upcoming test-taking session and his/her attitude might not be positive enough. During the training session, the participant and the researcher sat face to face to better communicate the instructions and procedures. After the training session, the researcher provided feedback to the participant. When the participant seemed familiar with and adapted to the procedures and requirements, he/she was given a brief rest to recover, and then entered the test-taking session with full energy. During the test-taking session, the participant and the researcher sat at an angle, so that they could see each other without directly facing each other, and with some distance between them, so as to eliminate any discomfort or tension imposed on the participant. Each participant was asked to complete the four passages with 40 items of the VSTEP.3-5 reading test, as assigned by the researcher, and was allowed to produce think aloud protocols in Chinese or English. It was crucial to minimize interruptions of the participant during the think aloud process. Therefore, the researcher avoided probing the participant's language expressions, to eliminate any impact on his/her performance, and used neutral and brief reminders such as "Can you tell me what you are thinking when you answer this question?", "What else do you think?" or "Please keep talking." if there was a lapse longer than 10 seconds in the participant's protocol. The formal data collection, including the training session and the test-taking session, was kept within two hours. The think aloud protocols were recorded using digital voice recorders and transcribed shortly after the sessions. The following table lists the participants and dates of the think aloud protocol data collection.
Participants and dates of the think aloud protocols data collection
4.5.3 Data collection procedures for research question 5
Eye tracking starts from the observer's perspective, and its data come from observation of test takers' behavior, which constitutes first-hand, direct process data with a certain objectivity. Thus, eye tracking data were collected to answer research question 5. After reviewing the relevant literature on eye tracking theories and practices, the researcher learned to use Experiment Builder 2.3.3.8, a visual experiment programming software, to write the eye tracking experiment material program that ran with the Eyelink Portable Duo eye tracker. The experiment material was the investigated VSTEP.3-5 reading test, composed of four passages, each with 10 multiple-choice items, 40 items in total. In the experiment material program, each screen displayed a passage on the left and an item on the right, which helped capture the eye movements precisely. Figure 4.2 shows the layout of the experiment materials as designed for the purposes of the current study. For reasons of confidentiality, the experiment materials have been obscured.
The layout of experiment materials of eye tracking
The researcher also studied the Eyelink Portable Duo technical specifications for about a week before the experiment, becoming familiar with the settings of the eye tracker and the adjustment of the camera position in order to improve calibration quality. The researcher applied to use the eye tracking laboratory and invited two colleagues to participate in a pilot study individually. The purposes of the pilot study were to trial the eye tracking experiment material program on the laboratory computer, complete the eye tracking training tutorial and calibration, and identify any potential problems or shortcomings in the experiment material program or the training procedures. Accordingly, each of the two colleagues sat for only two passages of the VSTEP.3-5 reading test (one colleague for the first two passages with 20 items and the other for the last two passages with 20 items) rather than the whole test, while their eye movement data were recorded.
The pilot study proved very important because problems emerged at this stage. One colleague's eye tracker calibration failed many times no matter how hard the researcher tried, because his eyesight was poor and the glasses he wore made the calibration unsatisfactory. To ensure the success of the formal data collection, the researcher planned to recruit only participants with good eyesight, and all participants needed to take part in the formal data collection with naked eyes. The eyesight requirement in the recruitment process was therefore strict, as poor eyesight could disqualify a participant from the formal data collection.
The pilot study also gave a rough idea of how long each participant would need to complete the formal data collection. Considering individual differences and any unexpected situations that might occur, the researcher decided to reserve an hour and a half for each participant in the formal data collection. In addition, favorable environmental conditions were ensured to prevent participants from being distracted or disturbed. The eye tracking laboratory was an isolated room, and once an experiment was underway, a written notice stating "Experiment in Progress" was posted on the laboratory door. Therefore, the laboratory was completely quiet, without external noise interference or interference from other participants. The following table lists the participants and dates of the eye tracking data collection.
Participants and dates of the eye tracking data collection
Fifteen participants were invited to the eye tracking laboratory to take part in the experiment. In the training session, the researcher first briefly introduced to each participant the purpose of the study, the purpose of using the eye tracking method, and the major procedures to be followed in the formal eye tracking data collection.
In order to promote participants' understanding of the eye tracking methods and procedures, a brief and clear training was conducted, explaining the various aspects of data collection in Chinese. Then, the participants took the VSTEP.3-5 reading test with 40 test items on a computer screen with the Eyelink Portable Duo eye tracker attached. As indicated in Figure 4.3, the right eye pupil was tracked, as is usual in eye tracking experiments. The Eyelink Portable Duo eye tracker recorded eye movements at a rate of 2000 Hz. A binocular camera was attached to the bottom of a 15.6-inch wide-screen monitor. The tracking distance between the eyes and the camera was held between 42 cm and 62 cm to optimize gaze accuracy. Before reading, the participants adjusted their sitting posture and performed a 9-point calibration. Figure 4.4 presents a successful calibration and validation of the eye tracker camera on the host computer screen. The system reported the error at different positions on the screen in turn, and finally reported the average error and maximum error in the lower right corner of the host computer. In text reading tasks, the average error was generally required to be less than 0.5° and the maximum error no more than 1°; if a calibration point was missing or large errors occurred (e.g., a large difference between the gaze point calculated by the eye tracker and the actual dot position), the erroneous points were recalibrated. The calibration process was repeated between items. Each item started with a drift correction point: a small point was presented in the middle of an empty screen, and if the participant's gaze fell on the point, the next item was presented; if not, recalibration was performed. Eighty minutes were allowed for the reading test, during which the eye tracker recorded the participants' eye movements. Data Viewer 4.1.1 was employed to display the textual stimuli, collect eye tracking statistics and maps, and obtain eye tracking metrics. Immediately after completing the 10 items of each passage, the participants were required to fill out a reading processing checklist, which served as complementary data for interpreting the eye tracking data.
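The calibration criteria described above (average error below 0.5° of visual angle, maximum error no more than 1°) can be expressed as a simple screening check. The sketch below is illustrative only; the function name and error values are the author's assumptions, not part of the EyeLink software.

```python
def calibration_acceptable(point_errors_deg, avg_limit=0.5, max_limit=1.0):
    """Return True if a 9-point calibration meets the thresholds used in
    this study: mean error below avg_limit and max error at most max_limit.
    point_errors_deg: per-point gaze errors in degrees of visual angle."""
    avg_error = sum(point_errors_deg) / len(point_errors_deg)
    return avg_error < avg_limit and max(point_errors_deg) <= max_limit

# One point at 0.9 deg is tolerable as long as the average stays low
print(calibration_acceptable([0.3, 0.2, 0.4, 0.3, 0.9, 0.2, 0.3, 0.4, 0.3]))  # True
```

A run failing this check would trigger recalibration of the erroneous points, as described above.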
Eye tracking mode in the study
Successful calibration and validation of the eye tracker camera on the host computer screen
Tools for data analysis
Compleat Lexical Tutor, abbreviated as LexTutor (https://www.lextutor.ca), is a vocabulary profiler tool developed by Tom Cobb of Canada. The software has been optimized and updated many times and has so far reached version 8.3. It covers 26 tools, such as a concordancer, vocabulary profiler, exercise maker, interactive exercises, etc. The current study mainly used one of its important tools, Vocabprofile, to analyze the vocabulary of the four passages of the VSTEP.3-5 reading test. This tool provides statistics on an inputted text based on research from the British National Corpus (BNC), representing a vocabulary profile of frequency lists from K1 to K20. Among them, K1 and K2 words belong to the high-frequency ones, while K3 words and beyond represent academic words and off-list words (Szudarski, 2018). Similarly to this study, Nguyen Thi Phuong Thao (2018) also used this tool to analyze the vocabulary of the passages in one version of the VSTEP.3-5 reading test paper.
4.6.2 Readable.io
Readability refers to the ease with which a passage of written text can be understood; it is often used in assessing the suitability of a text for readers. Readable.io is an online text readability analysis software (http://www.readable.io) used to obtain measures of text difficulty, including the Flesch Reading Ease, the Flesch-Kincaid Grade Level, the CEFR level, a readability rating, etc. The Flesch Reading Ease ranges from 0 to 100: the larger the value, the more legible the text. The Flesch-Kincaid Grade Level is divided into 12 levels: the higher the level, the more difficult the text. In the current study, Readable.io was employed to analyze text level. Other studies related to VSTEP.3-5, such as Nguyen Thi Phuong Thao (2018) and Nguyen Thi Quynh Yen (2018), have also applied this tool to analyze text levels.
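The two indices named above can be computed directly from word, sentence, and syllable counts using the standard published formulas. The sketch below is illustrative; the counts in the example are assumed, since automatic syllabification is a separate problem handled internally by tools such as Readable.io.

```python
def flesch_reading_ease(words, sentences, syllables):
    # Flesch Reading Ease: higher scores mean easier text (typical range 0-100)
    return 206.835 - 1.015 * (words / sentences) - 84.6 * (syllables / words)

def flesch_kincaid_grade(words, sentences, syllables):
    # Flesch-Kincaid Grade Level: approximates the US school grade required
    return 0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59

# A 453-word passage with 25 sentences and 680 syllables (illustrative counts)
print(round(flesch_reading_ease(453, 25, 680), 1))
print(round(flesch_kincaid_grade(453, 25, 680), 1))
```

Both formulas depend only on average sentence length and average syllables per word, which is why longer sentences and longer words lower the Reading Ease score while raising the Grade Level.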
WINSTEPS 5.4.1 is a powerful, flexible and professional Rasch model software which helps users easily analyze item response data. WINSTEPS is designed for item validation and person performance diagnosis, one-step (concurrent) test equating, item banking, rating scales, partial credit, etc. It is widely used in the field of educational testing, for example with the Graduate Record Examination (GRE) and TOEFL. In the current study, WINSTEPS was applied to analyze the scores of 200 test takers on the VSTEP.3-5 reading test. Many other studies have also adopted WINSTEPS to analyze test takers' scores in the context of VSTEP.3-5, such as Nguyen Thi Phuong Thao (2018), Vo Ngoc Hoi (2021) and Nguyen Thi Quynh Yen (2018).
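Under the dichotomous Rasch model that WINSTEPS implements, the probability of a correct response depends only on the difference between person ability and item difficulty, both expressed in logits. A minimal sketch of this core equation (the numeric values are illustrative, not drawn from the study's data):

```python
import math

def rasch_prob(ability, difficulty):
    """P(correct) under the dichotomous Rasch model, with both
    ability and difficulty expressed in logits."""
    return 1.0 / (1.0 + math.exp(-(ability - difficulty)))

# A test taker whose ability equals the item difficulty has a 50% chance
print(rasch_prob(0.0, 0.0))  # 0.5
# Higher ability relative to difficulty raises the probability
print(rasch_prob(1.0, -0.35) > rasch_prob(0.0, -0.35))  # True
```

All the later indices in this chapter (fit statistics, separation, DIF) are derived from the residuals between observed responses and these model-predicted probabilities.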
SPSS is short for Statistical Product and Service Solutions. It is a statistical software offering in-depth data analysis, convenient use and complete functionality. The SPSS statistical analysis workflow includes several categories such as descriptive statistics, mean comparison, general linear models, correlation analysis, regression analysis, etc. In the current study, SPSS 26.0 was applied to analyze the eye tracking data for the VSTEP.3-5 reading test, focusing on the Mann-Whitney U test and the Kruskal-Wallis H test. Many eye tracking studies investigating test takers' cognitive processes have also used this software for data analysis, such as Bax (2013), Brunfaut and McCray (2015) and Bax and Chan (2016).
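The two non-parametric tests named above compare groups without assuming normally distributed eye tracking measures. As an illustration, they can be run with SciPy (used here in place of SPSS); the fixation durations below are hypothetical, not the study's data:

```python
from scipy.stats import mannwhitneyu, kruskal

# Hypothetical total fixation durations (seconds) per passage for two groups
high_scorers = [41.2, 38.5, 44.0, 39.7, 42.1]
low_scorers = [55.3, 49.8, 58.1, 52.6, 50.4]

# Mann-Whitney U compares two independent groups by ranks
u_stat, p_value = mannwhitneyu(high_scorers, low_scorers, alternative="two-sided")
print(u_stat, p_value)

# Kruskal-Wallis H extends the rank comparison to three or more groups
mid_scorers = [45.0, 47.2, 46.1, 48.3, 44.9]
h_stat, p_kw = kruskal(high_scorers, low_scorers, mid_scorers)
print(h_stat, p_kw)
```

Because both tests operate on ranks, they are robust to the skewed distributions typical of fixation and dwell-time measures.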
Data analysis procedures of the study
4.7.1 Data analysis procedures for research question 1
To address the first half of research question 1, the test tasks of the VSTEP.3-5 reading test were analyzed by comparing them with the related content in the VSTEP.3-5 reading test specifications and with the expert interview responses on test tasks. The modified framework of task characteristics for analyzing the VSTEP.3-5 reading test was presented in detail in the instrument section. The current study focused on analyzing the characteristics of the input and the characteristics of the expected response, and on examining the expert interview questions B one by one.
Firstly, the requirements on text length in the test specifications and their manifestations in the four passages of the investigated test paper were listed and compared one by one in the current study.
Secondly, the language of input was analyzed, covering the vocabulary and grammar of the four passages in the VSTEP.3-5 reading test. The current study used the Vocabprofile tool to analyze the vocabulary of the four passages with the following procedure: first, the website was opened to enter the software interface directly; then the Vocabprofile tool was selected to enter the tool interface. There were three types, named VP-Kids, VP-Classic and VP-Compleat, applicable to grades 0-3, 4-8 and 9-university respectively. The current study selected the VP-Compleat category for retrieval. Next, each passage was pasted into the text input area and the "SUBMIT-Window" button was clicked to complete the retrieval and obtain the results. Grammar also plays an indispensable part in reading passages. Grammar mainly involved several sentence patterns, namely simple sentences, compound sentences and complex sentences, which were also analyzed and compared for each passage in the current study.
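The frequency-band profiling that Vocabprofile performs can be sketched as follows. The band word lists here are tiny stand-ins invented for illustration; the real tool uses BNC-based K1-K20 lists of 1,000 word families each, and handles lemmatization that this sketch omits.

```python
# Illustrative stand-in band lists (the real K1/K2 lists are far larger)
K1 = {"the", "a", "of", "and", "people", "blood", "type"}
K2 = {"personality", "volunteer", "impact"}

def profile(text):
    """Return the percentage of tokens falling in each frequency band."""
    tokens = [t.strip(".,;:!?").lower() for t in text.split()]
    counts = {"K1": 0, "K2": 0, "off-list": 0}
    for t in tokens:
        if t in K1:
            counts["K1"] += 1
        elif t in K2:
            counts["K2"] += 1
        else:
            counts["off-list"] += 1
    total = len(tokens)
    return {band: round(100 * n / total, 1) for band, n in counts.items()}

print(profile("Blood type and personality: the impact of a popular belief."))
```

The output percentages correspond to the kind of K1/K2/off-list coverage figures compared against the test specifications in Table 5.2.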
Thirdly, the text domain, which referred to the general field of the passages in the VSTEP.3-5 reading test, was analyzed by comparison with the related descriptions in the VSTEP.3-5 reading test specifications.
Fourthly, the text topic, meaning the subject a passage discusses, was analyzed by comparing it with the descriptions in the VSTEP.3-5 reading test specifications one by one.
Finally, the text level was analyzed using Readable.io (http://www.readable.io). The researcher first pasted the four passages of the investigated VSTEP.3-5 reading test into the tool to generate the Flesch Reading Ease, Flesch-Kincaid Grade Level and CEFR level for each. Then, the researcher compared the text level of each passage in the investigated VSTEP.3-5 reading test with the descriptions in the VSTEP.3-5 reading test specifications.
Characteristics of the expected response
The characteristics of the expected response in the current study were analyzed through the response type, that is, by pointing out how similarly or differently the test paper under evaluation was written compared with the test specifications.
Expert interview on test tasks
Apart from the above-mentioned analysis of the test tasks in the investigated VSTEP.3-5 reading test, the analysis of expert interview questions B also served as a piece of supporting evidence for answering the first part of research question 1. First, the recorded content of the expert interview was transcribed sentence by sentence, and under each of the key aspects (text length, vocabulary, etc.) codes (totally approve, partly approve and disapprove) were identified. Second, the frequency of the responses to the expert interview questions was presented point by point in a table. Third, the content of the expert interview was summarized according to each of the questions.
To address the latter part of research question 1, the statistics of the VSTEP.3-5 reading test items were analyzed to explore the role of each test item in the test for a group of test takers, and how appropriately the items were designed in line with the difficulty levels predetermined in the test specifications. The dichotomous Rasch model was applied in the item quality analysis of the current study, which was mainly carried out according to the following steps.
Firstly, unidimensionality was checked, that is, whether the test items measure only one construct of English reading proficiency. Accordingly, this study adopted Principal Component Analysis (PCA) of residuals to examine the unidimensionality of the test items. If the eigenvalues reported by PCA approximate the size predicted by the Rasch model, the data are effectively unidimensional; otherwise, the bigger the eigenvalues, the less unidimensional the data (Linacre,
Secondly, item estimates were analyzed to determine whether the test items functioned well enough for their intended use according to the stipulated cutoff points for item fit. Aryadoust, Ng, and Sayama (2021) state that a universal agreement on fit statistics in Rasch measurement has not been reached; that is to say, different researchers tend to use different thresholds for infit and outfit statistics to determine item fit. Nonetheless, the current study was based on the rules provided by studies such as Wright and Linacre (1994), Smith, Schumacker, and Bush (1998), Smith (2005), Wilson (2005), Bond and Fox (2015) and Linacre (2017). It is generally accepted that when the infit mean square (MNSQ) value ranges from 0.5 to 1.5, the item can be regarded as a good fit to the Rasch model and considered productive for measurement. Linacre (2017) also notes that although MNSQ values of 1.5 to 2 could be considered unproductive for the construction of measurement, they might not degrade the result of the test. However, an MNSQ value greater than 2 could be considered a signal of unexpected observations which might severely underfit the Rasch model and distort or degrade the result of the test. Besides, according to Aviad-Levitzky, Laufer, and Goldstein (2019), not every MNSQ value above 2 should be considered significantly underfitting; this must be confirmed by the standardized z-score (ZSTD), and only items with both MNSQ and ZSTD values greater than 2 could be considered significantly underfitting. Items with MNSQ values below 0.5 are regarded as too predictable and thus likely to overfit the Rasch model, which might produce misleadingly high reliability and separation coefficients (Linacre, 2017). Another index analyzed, the point-measure correlation (PTMA-CORR.), indicated the correlation between a test taker's response to a specific item and the overall proficiency measure (Bond & Fox, 2015). Negative reported correlations could indicate that the orientation of the scoring on the item, or by the person, may be opposite to the orientation of the latent variable; this might be caused by item miskeying, reverse scoring, person special knowledge, guessing, data entry errors, or the expected randomness in the data. The point-measure expected correlation (PTMA-EXP.) is the expected value of the point-measure correlation when the data fit the Rasch model with the estimated measures (Linacre, 2023). That is, the closer the reported correlation value is to the expected correlation value, the closer the item is to the measurement target.
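The infit and outfit MNSQ statistics that these thresholds refer to can be computed directly from the standardized residuals. The sketch below assumes person abilities and the item difficulty are already estimated; it illustrates the definitions rather than the WINSTEPS estimation procedure:

```python
import numpy as np

def item_fit(x, theta, b):
    """Infit and outfit mean-square statistics for one dichotomous item.
    x: 0/1 responses of all persons to the item; theta: person abilities;
    b: the item's difficulty (both in logits)."""
    p = 1.0 / (1.0 + np.exp(-(theta - b)))
    var = p * (1 - p)                          # model variance per response
    z2 = (x - p) ** 2 / var                    # squared standardized residuals
    outfit = z2.mean()                         # unweighted: outlier-sensitive
    infit = ((x - p) ** 2).sum() / var.sum()   # information-weighted
    return infit, outfit

# When the data are generated by the model itself, both values hover near 1
rng = np.random.default_rng(7)
theta = rng.normal(size=2000)
p = 1.0 / (1.0 + np.exp(-theta))               # item difficulty b = 0
x = (rng.random(2000) < p).astype(float)
print(item_fit(x, theta, 0.0))
```

Because outfit weights all residuals equally, a few lucky guesses by low-ability test takers inflate it sharply, while infit down-weights responses from persons far from the item's difficulty.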
Thirdly, item and person calibration were analyzed through the Wright Map, which provides a picture of the test items by placing their difficulty on the same measurement scale as the ability of the test takers, for a better understanding of how appropriately the investigated VSTEP.3-5 reading test measured test takers' ability.
4.7.2 Data analysis procedures for research question 2
Summary
This chapter has presented the methodology of the current study. Firstly, the ontology and epistemology underlying the design of the study have been provided to justify the adopted convergent mixed methods design, which combines the strengths and compensates for the weaknesses of quantitative and qualitative research methods. The necessary components of the convergent mixed methods design have then been elaborated in detail, including the participants, instruments, data collection procedures, tools for data analysis, and data analysis procedures for the VSTEP.3-5 reading test. In the following chapters, the findings and discussions are presented based on the multiple sources of data analysis to shed light on the products and the processes of the test takers' performances when they are engaged in the VSTEP.3-5 reading test.
FINDINGS AND DISCUSSIONS ON VALIDITY OF VSTEP.3-5
Findings and discussions for research question 1
This section presents the findings for the first research question: "To what extent do the test tasks and test items of the VSTEP.3-5 reading test meet the requirements of the test specifications?" The findings from test tasks and test items are analyzed respectively in the following parts to provide evidence for the first two assumptions put forward in the generalization inference for the VSTEP.3-5 reading test. Besides, some points merit further discussion.
As mentioned in the methodology chapter, the analysis of the VSTEP.3-5 reading test tasks was mainly based on a comparison of the related content from the modified framework of task characteristics for the VSTEP.3-5 reading test with the corresponding descriptions in the VSTEP.3-5 reading test specifications, and on the answers to the expert interview questions.
5.1.1.1 Comparison of test tasks in the investigated test and test specifications
To be specific, the modified framework is composed of two main parts: (1) characteristics of the input and (2) characteristics of the expected response. Thus, the findings are elaborated around these two parts.
Through a careful reading of the VSTEP.3-5 reading test specifications, it can be found that detailed content about the input for the test is provided in a direct and clear way. For reasons of confidentiality, the specific numerical values and descriptions of the test specifications are not presented in the following analyses. This part focuses on the findings on the characteristics of the input in the aspects of text length, language of input, text domain, text topic and text level.
As for text length, the VSTEP.3-5 reading test specifications stipulate specific requirements for each of the passages. Table 5.1 presents the text length of the VSTEP.3-5 reading test by comparing the requirements in the test specifications with the manifestations of the investigated test paper.
Comparison of text length in the investigated test and test specifications
Text length described in the test specifications
Text length in the investigated test
Passage 1: 453 words; Passage 2: 436 words; Passage 3: 440 words; Passage 4: 518 words; Total length: 1847 words
The test specifications of the VSTEP.3-5 reading test stipulate that the reading passages should contain around ……-…… words in total. Specifically, the first three passages should each contain around …… words and the fourth passage should contain around … words. In the investigated test paper, the first passage contains 453 words, the second 436 words, the third 440 words, and the fourth 518 words. The total length of the four passages is 1847 words. Compared with the test specifications of the VSTEP.3-5 reading test, all the texts of the investigated test satisfy the length requirement described in the test specifications.
For the language of input, Table 5.2 presents detailed information about the vocabulary and grammar of the VSTEP.3-5 reading test by comparing the requirements in the test specifications with the manifestations of the investigated test paper.
Comparison of language of input in the investigated test and test specifications
Language of input in the test specifications
Language of input in the investigated test
Passage 1, 2, 3: a combination of simple, compound and complex sentences
Passage 4: a majority of compound and complex sentences
Based on the information about vocabulary in the above table, it can be observed that the proportions of high-frequency and low-frequency words in the four passages of the investigated test paper satisfy the requirements in the test specifications. Besides, in the investigated test paper, the first three passages are composed of a combination of simple, compound and complex sentences, and the last passage contains a majority of compound and complex sentences. Compared with the test specifications of the VSTEP.3-5 reading test, all the passages of the investigated test meet the grammar requirement stipulated in the test specifications.
As to the text domain, the comparison between the requirements in the test specifications and the manifestations of the investigated test paper is presented as follows.
Comparison of text domain in the investigated test and test specifications
Text domain in the test specifications Text domain in the investigated test
Passage 1, 2, 3 & 4 should belong to one of the four domains: ……
Passage 1: personal domain Passage 2: public domain Passage 3 & 4: educational domain
The above table shows that the first and second passages belong to the personal domain and the public domain respectively, and that passages 3 and 4 both belong to the educational domain in the investigated VSTEP.3-5 reading test paper. Therefore, the text domain of the investigated test paper matches the VSTEP.3-5 reading test specifications.
As to text topic, the following table presents the comparison between the test specifications and the investigated test paper
Comparison of text topic in the investigated test and test specifications
Text topic in the test specifications Text topic in the investigated test
Passage 1 should be mainly related to……;
Passage 2 should be mainly related to ……;
Passage 3 should be mainly related to……;
Passage 4 should be mainly related to……
-At least one of the passages contains……
Passage 1 is about daily life with the content of Asia
Passage 2 is about social science
Passage 3 is about natural science
Passage 4 is about professional works
The topics of the four passages in the investigated test paper are elaborated as follows. The first passage is mainly about blood type and personality, which belongs to the topic of daily life; some content of the passage refers to the situations of Asian countries. The second passage explains the positive and negative impacts of volunteering abroad, which belongs to the social science topic. The third passage is about bees, which belongs to the natural science topic. The fourth passage refers to the emergence of the continents, which belongs to professional works. After comparison and analysis, it can be concluded that the topics of all the passages in the investigated test paper satisfy the text topic requirement described in the VSTEP.3-5 reading test specifications.
Comparison of text level in the investigated test and test specifications
Text level in the test specifications Text level in the investigated test
Passage 1: Level 3 (B1) (Flesch-Kincaid Grade Level: 8.6 Flesch Reading Ease: 57.5) Passage 2: Level 3 (B1) (Flesch-Kincaid Grade Level: 12.1 Flesch Reading Ease: 42.3)
Passage 3: Level 5 (C1) (Flesch-Kincaid Grade Level: 9.3 Flesch Reading Ease: 53.1) Passage 4: Level 5 (C1) (Flesch-Kincaid Grade Level: 12.5 Flesch Reading Ease: 45.6)
As for the text level, from the above table it can be inferred that the requirement from the test specifications embodies the goal of the test, which is to distinguish test takers' reading proficiency at levels 3 (B1), 4 (B2) and 5 (C1). The four passages in the investigated test paper were analyzed with the online tool Readable.io, and Table 5.5 shows the text level of each passage of the investigated VSTEP.3-5 reading test. Among them, the readability indices of passages 1 and 4 fit the Flesch readability scores for the CEFR B1 (CEFR-VN 3) and CEFR C1 (CEFR-VN 5) levels respectively: for passage 1, the Flesch-Kincaid Grade Level and Flesch Reading Ease are 8.6 and 57.5, and for passage 4 they are 12.5 and 45.6. The Flesch-Kincaid Grade Level and Flesch Reading Ease of passage 2 are 12.1 and 42.3 respectively, and those of passage 3 are 9.3 and 53.1 respectively. That is to say, passage 2 is more difficult than expected for level 3, and passage 3 is easier than expected for level 5.
Characteristics of the expected response
In this section, the current study probes the response type within the characteristics of the expected response. According to the following table, the investigated VSTEP.3-5 reading test paper is made up of four passages, each followed by 10 multiple-choice questions, for a total of 40 items. This means that all of the items adopt selected response and satisfy the test specifications' requirement on response type.
Comparison of response type in the investigated test and test specifications
Response type in the test specifications
Response type in the investigated test
5.1.1.2 Expert interview on test tasks
As stated in the methodology chapter, there were two expert interview questions B concerning the test tasks of the investigated VSTEP.3-5 reading test paper. The findings from the expert interview are presented in the following table according to the experts' responses to each of the two interview questions.
Frequency on the results of interview questions B
Interview question B 1 on characteristic of the input
Totally approve Partly approve Disapprove
Interview question B 2 on characteristic of the expected response
The number before the slash indicates the number of experts who supported the statement; the number after the slash indicates the number of experts who mentioned the specific aspect.
Interview question B 1: To what extent do you approve the analyses for the investigated VSTEP.3-5 reading test in terms of text length, vocabulary, grammar, text domain, text topic and text level in the current study?
As for the first interview question, combining the frequencies in the above table, it can be concluded that all four experts voiced their approval of the results of the analyses in the aspects of text length, language of input, text domain and text topic.
As for text level, the experts only partly agreed with the analysis from the online readability tool and held different opinions, which deserve further discussion.
Findings and discussions for research question 2
In this section, the current study provides the findings and discussions for the second research question: "To what extent do the test takers' genders and majors affect their reading scores?" At the same time, the current study also provides backing evidence, namely the analysis of DIF by test takers' genders and majors, to support the third assumption in the generalization inference for the VSTEP.3-5 reading test. Adopting the dichotomous Rasch model in WINSTEPS is the premise for detecting DIF in this study.
5.2.1 Differential item functioning by genders
The DIF measure is the difficulty of an item for a given class, with all else held constant: the more difficult the item, the higher the DIF measure. In the above table, DIF measure (M) and DIF measure (F) stand for the DIF measures for male and female test takers respectively. The DIF contrast is the "effect size" in logits, the difference between the two groups (Linacre, 2023); a positive DIF contrast indicates that the item is more difficult for the first, left-hand-listed group. "Welch Prob." shows the probability of observing this amount of contrast by chance when there is no systematic item bias effect; for statistically significant DIF on an item, it should be less than 0.05.
From the above table, it can be seen that although the absolute DIF contrast values of Items 19, 20, 25 and 28 are 0.78, 0.67, 0.68 and 0.60 respectively, reaching the level of attention (≥ 0.5), the DIF is not statistically significant (Welch Prob. > 0.05). Therefore, the difficulty of the test items does not change significantly between male and female test takers; that is, none of the 40 items show substantial DIF by gender.
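The combination of a sizeable contrast with a non-significant Welch probability can be illustrated from the two group calibrations. The sketch below is a simplified approximation of the Welch test as applied to separately estimated item difficulties; the difficulty values, standard errors and group sizes are invented for illustration, not taken from the study's tables:

```python
import math
from scipy.stats import t as t_dist

def dif_welch(b1, se1, n1, b2, se2, n2):
    """DIF contrast and two-sided Welch probability for one item.
    b1/b2: item difficulty estimated separately in each group (logits);
    se1/se2: their standard errors; n1/n2: group sizes."""
    contrast = b1 - b2
    se = math.sqrt(se1 ** 2 + se2 ** 2)
    t = contrast / se
    # Welch-Satterthwaite approximation for the degrees of freedom
    df = (se1 ** 2 + se2 ** 2) ** 2 / (se1 ** 4 / (n1 - 1) + se2 ** 4 / (n2 - 1))
    p = 2 * t_dist.sf(abs(t), df)
    return contrast, p

# Hypothetical item: a 0.78-logit contrast, yet large standard errors
print(dif_welch(0.39, 0.35, 100, -0.39, 0.36, 100))
```

With standard errors of this size, a 0.78-logit contrast yields a probability well above 0.05, which is exactly the pattern reported for Items 19, 20, 25 and 28: a notable effect size that chance alone could plausibly produce.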
5.2.2 Differential item functioning by majors
In the table of DIF by majors, DIF measure (H&S) and DIF measure (N) stand for the DIF measures for humanities and social sciences and for natural sciences respectively. From the above table, it can be seen that the absolute DIF contrast values of all 40 items do not reach the level of attention (≥ 0.5), and the "Welch Prob." values are all greater than 0.05. Therefore, the difficulty of the test items does not change significantly between humanities and social sciences test takers and natural sciences test takers; that is, none of the 40 items show substantial DIF by major. All in all, combining the findings from the above DIF analyses, it can be concluded that the test takers' genders and majors do not significantly affect their VSTEP.3-5 reading scores, which answers research question 2 and supports the third assumption of the generalization inference.
According to the classification of O'Sullivan (2000), test taker characteristics include physical, psychological and experiential characteristics. To be specific, physical characteristics cover age, gender, short-term ailments and longer-term disabilities; psychological characteristics contain personality, memory, cognitive style, affective schemata, concentration, motivation and emotional state; and experiential characteristics relate to education, examination preparedness, examination experience, target language-country residence, topic knowledge and knowledge of the world (O'Sullivan, 2000). In the context of this study, the 200 test takers recruited to take the VSTEP.3-5 reading test in the traditional way were college students who share representative physical and experiential characteristics, namely genders and majors, which may be factors affecting reading scores. This is why the study focused on the consistency of item difficulty across genders and majors. Just as Kunnan (2010) states, only when the cross-group results of a test are stable is decision-making based on the observed scores meaningful, reliable and fair, and only then do the items have higher construct validity. The finding that none of the items in the investigated reading test show substantial DIF by gender or major thus contributes one aspect to guaranteeing the validity of the investigated VSTEP.3-5 reading test.
Findings and discussions for research question 3
In this section, the current study aims to verify the reliability of the investigated VSTEP.3-5 reading test and to provide the findings and discussions for the third research question: "To what extent are the VSTEP.3-5 reading test scores adequately reliable in measuring the test takers' English proficiency?" The evidence for verifying the reliability of the investigated test is obtained from the analysis of the reliability indices produced by the dichotomous Rasch model in WINSTEPS 5.4.1.
Separation and reliability measures of the investigated test
As can be seen from Table 5.12, the mean person measure is -0.35, meaning that this test is rather difficult for these 200 test takers. Person separation and person reliability are 2.08 and 0.81, greater than 2 and 0.7 respectively, indicating that test takers of different levels can be sufficiently distinguished by the investigated VSTEP.3-5 reading test. Item separation and item reliability are high, at 4.23 and 0.95, also greater than 2 and 0.7 respectively, showing high internal consistency for the items in the investigated VSTEP.3-5 reading test. In short, the investigated test has a wide spread of item difficulty, and the spread of test taker levels is sufficient to confirm the item difficulty hierarchy.
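The separation and reliability indices reported above are linked by a simple formula, G = sqrt(R / (1 - R)), so each pair can be checked for internal consistency. A minimal sketch:

```python
import math

def separation_from_reliability(r):
    """Rasch separation index G implied by a reliability coefficient R:
    G = sqrt(R / (1 - R)). A reliability of about 0.8 corresponds to
    G = 2, the usual benchmark for distinguishing at least two strata."""
    return math.sqrt(r / (1 - r))

print(round(separation_from_reliability(0.81), 2))  # ~2.06, near the reported 2.08
print(round(separation_from_reliability(0.95), 2))  # ~4.36, near the reported 4.23
```

The small discrepancies from the reported WINSTEPS values are expected, since the software computes separation from model error variances rather than back-converting from the rounded reliability coefficient.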
Therefore, the VSTEP.3-5 reading test scores are adequately reliable in measuring the test takers' English proficiency, which answers research question 3 and provides evidence supporting the fourth assumption in the generalization inference.
Like the current study, Nguyen Thi Phuong Thao (2018) applied Rasch analysis to estimate the reliability and separation of items in one version of the VSTEP.3-5 reading test. The results show a very high internal consistency for the items in the reading test; simply put, the test has a wide spread of item difficulty, and the number of test takers was large enough to confirm a reproducible item difficulty hierarchy, which matches the description in the test specifications that item difficulty levels range from B1 low to C1 high (Nguyen Thi Phuong Thao, 2018). Besides, Vo Ngoc Hoi (2021), investigating another version of the VSTEP.3-5 reading test, concludes that high item reliability and item separation indices indicate that the sample size is large enough to reproduce the item difficulty hierarchy and that the items are widely spread on the measurement continuum. It can be seen that in the few studies on VSTEP.3-5 reading test validation, the reliability of the VSTEP.3-5 reading test has generally been ensured, which provides strong support for its content validity.
Summary
All of the above findings for the three research questions have shed light on the products of the participants’ performances when they are engaged in the VSTEP.3-5 reading test. It can be concluded that (1) both the test tasks and the test items largely meet the requirements of the test specifications; (2) the test takers’ genders and majors do not affect their VSTEP.3-5 reading scores; and (3) the test scores are adequately reliable in measuring the test takers’ English proficiency. To sum up, it is inferred that the validity of the investigated VSTEP.3-5 reading test paper can be guaranteed from the product-based perspective.
FINDINGS AND DISCUSSIONS ON VALIDITY OF VSTEP.3-5
Findings and discussions for research question 4
In order to shed light on the reading processes of test takers, this section explains the findings and discussion for the fourth research question: “What reading processes are applied to answer the VSTEP.3-5 reading test correctly? To what extent do the expected reading processes correspond with the processes actually engaged in by test takers while taking the VSTEP.3-5 reading test?” The exploration of this correspondence helps clarify the relationship between the construct that the test is designed to measure and the test takers’ actual performance on the test. That is to say, the main concern of this section is the comparison between the reading processes that the experts judged the test items to require and the reading processes that the test takers actually engaged in, which constitutes the backing evidence for the first assumption of the explanation inference for the VSTEP.3-5 reading test.
6.1.1 Expert judgement on reading subskills in the investigated test
The following table shows the reading subskills and descriptors of the VSTEP.3-5 reading test, as agreed upon by the four experts in order to bring each subskill description into focus. These reading subskills and descriptors are the basis for comparison with the reading subskills targeted by each item of the investigated VSTEP.3-5 reading test.
Reading subskills and descriptors of VSTEP.3-5 reading test
A Understanding explicit information
- Can understand specific information that is explicitly stated in the passage, using simple grammatical structures and vocabulary
- Can scan for specific information in a paragraph/some paragraphs
- Can identify and understand paraphrased information explicitly stated in the passage
B Understanding word meanings in context
- Can recognize word meanings in context (words with different meanings)
- Can recognize word meanings in context (idiomatic expressions/words with highly colloquial expressions)
C Understanding cohesive devices
- Can identify the antecedent of a pronoun
- Can understand the relationship between sentences or ideas using connective devices such as discourse markers, anaphoric and cataphoric references, substitutions, repetitions
D Making inferences
- Can infer complex references in the passage (with confounding factors)
- Can identify and understand an implicit detail that is rewritten using different words
- Can identify the implication of a sentence/a detail
E Integrating information
- Can locate and integrate information across a paragraph
- Can locate and integrate information across the passage
F Summarizing main ideas
- Can summarize the main ideas of a paragraph or the passage
- Can identify supporting information for an argument or the main idea of a paragraph or the passage
- Can understand significant points that are stated and relevant to the main ideas
G Understanding explicit/implicit author’s opinion/attitude
- Can understand the explicit attitude/opinion of the author
- Can understand the implicit stance/opinion/intention of the author in the passage
- Can understand the general tone of the passage
H Understanding text function
- Can identify the purpose and function of the passage
- Can identify the organizational structure of the passage
- Can identify the genre of the passage

The following table presents the results of the expert judgement of reading subskills in the investigated VSTEP.3-5 reading test, which serves as the standard for comparison with the actual reading processes of the test takers.
Table 6.2. Results of the expert judgement of reading subskills in the investigated test
Item | Primary subskill identified | Potential involvement of other subskills
Item 1 | Understanding explicit information | Understanding cohesive devices
Item 2 | Understanding cohesive devices | Understanding explicit information
Item 3 | Integrating information | Understanding explicit information; Understanding cohesive devices
Item 4 | Understanding explicit information |
Item 5 | Understanding explicit information |
Item 6 | Integrating information | Understanding explicit information; Making inferences
Item 7 | Understanding word meanings in context |
Item 8 | Making inferences | Understanding explicit information
Item 9 | Understanding explicit information | Understanding cohesive devices
Item 10 | Understanding explicit/implicit author’s opinion/attitude |
Item 11 | Understanding explicit information |
Item 12 | Making inferences | Understanding explicit information; Understanding cohesive devices
Item 13 | Understanding word meanings in context |
Item 14 | Understanding explicit/implicit author’s opinion/attitude |
Item 15 | Understanding cohesive devices | Understanding explicit information
Item 16 | Making inferences | Understanding explicit information
Item 17 | Integrating information | Understanding explicit information; Making inferences; Understanding cohesive devices
Item 18 | Integrating information | Understanding explicit information; Making inferences
Item 19 | Understanding explicit information | Understanding word meanings in context
Item 20 | Summarizing main ideas | Understanding explicit information; Integrating information; Understanding cohesive devices; Making inferences
Passage 3
Item 21 | Understanding explicit/implicit author’s opinion/attitude |
Item 22 | Making inferences | Understanding explicit information
Item 23 | Making inferences | Understanding explicit information
Item 24 | Understanding word meanings in context |
Item 25 | Understanding explicit information |
Item 26 | Making inferences | Understanding explicit information
Item 27 | Summarizing main ideas | Understanding explicit information; Integrating information
Item 28 | Making inferences | Understanding explicit information
Item 29 | Understanding word meanings in context |
Item 30 | Making inferences | Understanding explicit information
Item 31 | Making inferences | Understanding explicit information
Item 32 | Making inferences | Understanding explicit information
Item 33 | Making inferences | Understanding explicit information
Item 34 | Understanding word meanings in context |
Item 35 | Making inferences | Understanding explicit information
Item 36 | Making inferences | Understanding explicit information
Item 37 | Making inferences | Understanding explicit information
Item 38 | Understanding word meanings in context | Making inferences
Item 39 | Summarizing main ideas | Understanding explicit information; Integrating information
Item 40 | Understanding text function | Understanding explicit information; Integrating information
As Table 6.2 presents, all the reading subskills specified in the VSTEP.3-5 reading test specifications are judged to be assessed by at least one test item, showing that the test specifications which informed the test development process are well represented by the test items, as judged by the experts. “Making inferences” is the most commonly assessed subskill, serving as the primary subskill of fourteen items, followed by “understanding explicit information” (seven items) and “understanding word meanings in context” (six items). “Integrating information” is assessed by four items. “Summarizing main ideas” and “understanding explicit/implicit author’s opinion/attitude” are each assessed by three items. “Understanding cohesive devices” is assessed by two items, while “understanding text function” is targeted by one item. Thirty-five items require a combination of at least two subskills to arrive at the answer; Item 20 is judged by the experts to engage test takers with as many as five subskills. Most of the 40 items require two subskills, as the specifications state. In addition, the experts identified no subskills other than the proposed ones, affirming that the set of subskills covers the crucial processes required by the items of the investigated VSTEP.3-5 reading test.
“Making inferences” is the primary subskill targeted by Items 8, 12, 16, 22, 23, 26, 28, 30, 31, 32, 33, 35, 36 and 37, making it the most frequently tested primary subskill in the investigated VSTEP.3-5 reading test paper. Answering items of this type requires the potential involvement of a range of other subskills, such as “understanding explicit information”, “integrating information” and, in one case, “understanding cohesive devices”.
“Understanding explicit information” requires test takers to identify specific information that is explicitly stated in the passage using simple grammatical structures and vocabulary, to scan for specific information in a paragraph or several paragraphs, or to identify and understand paraphrased information explicitly stated in the passage; it is judged by the experts to be primarily assessed by Items 1, 4, 5, 9, 11, 19 and 25. Another potential subskill, “understanding cohesive devices”, is also required in some cases.
Items 7, 13, 24, 29, 34 and 38, judged by the experts to primarily assess “understanding word meanings in context”, require test takers to recognize the meaning of a word or phrase from context (words or phrases with different meanings, or idiomatic expressions/words with highly colloquial expressions). In answering items of this type, test takers may draw on a range of other potential subskills, including “understanding explicit information” and, in some cases, “making inferences”.
“Integrating information” is believed to be primarily assessed by Items 3, 6, 17 and 18, all of which require test takers to employ “understanding explicit information”. Most items of this type, namely Items 6, 17 and 18, also require the subskill of “making inferences”. “Understanding cohesive devices” is required in Items 3 and 17 because the key sentence in the text is preceded by a conjunction, which asks test takers to identify the sentence meaning.
“Summarizing main ideas” requires test takers to summarize the main ideas of a paragraph or the passage, identify supporting information for an argument or the main idea of a paragraph or the passage, or understand significant points that are stated and relevant to the main ideas; it is judged by the experts to be primarily assessed by Items 20, 27 and 39. To answer all items of this type, test takers are expected to employ “understanding explicit information” and “integrating information”. In addition, other potential subskills such as “understanding cohesive devices” and “making inferences” are required in some cases.
“Understanding explicit/implicit author’s opinion/attitude” is another subskill in the experts’ judgements for the VSTEP.3-5 reading test; it requires test takers to understand the explicit attitude/opinion of the author, understand the implicit stance/opinion/intention of the author in the passage, and understand the general tone of the passage. It is believed to be primarily assessed by Items 10, 14 and 21. Answering items of this type requires the potential involvement of a range of other subskills, including “understanding explicit information” and “integrating information”.
“Understanding cohesive devices” is the primary subskill targeted by Items 2 and 15, which require test takers to identify the antecedent of a pronoun and to understand the relationship between sentences or ideas expressed through connective devices such as discourse markers, anaphoric and cataphoric references, substitutions and repetitions. The other potentially involved subskill is “understanding explicit information”.
“Understanding text function” is believed to be primarily assessed by Item 40, which requires test takers to identify the purpose and function of the passage, identify the organizational structure of the passage, or identify the genre of the passage. In this item, “understanding explicit information” and “integrating information” are also assessed as potential subskills.
Only a few items (Items 4, 5, 11 and 25) assess test takers’ understanding of explicit information at the local level, while the majority of items in the test were judged by the experts to require understanding at the global level and the ability to deploy multiple subskills to arrive at the answers.
In short, the above analysis of the expert judgement of reading subskills in the investigated VSTEP.3-5 reading test answers the first sub-question of research question 4.
From the above findings of the expert judgement, two main points can be concluded and discussed. Firstly, the experts identified eight reading subskills based on the VSTEP.3-5 reading test specifications and the cognitive model of reading (Khalifa & Weir, 2009). Among the eight reading subskills, only one, “understanding explicit information”, was considered a local-level process of reading, as defined by Grabe (2009) and Khalifa and Weir (2009). Secondly, based on contemporary views of the L2 reading process and empirical evidence from L2 reading assessment research (Nassaji, 2003; Rupp, Ferne & Choi, 2006), the experts believed that the process of answering specific test items in the current study involved multiple subskills at both low and high levels. Among the 40 test items, 35 were judged to require at least two reading subskills. Besides, the subskill of
Findings and discussions for research question 5
In addition to the think-aloud protocols discussed in the previous section, another effective instrument, eye tracking, is used to investigate the reading processes of test takers. By using eye-tracking methodology to externalize parts of the reading process in the form of records of eye movement patterns, researchers can explore how test scores are related to actual behavior (Solheim & Uppstad, 2011). This section presents the findings for the fifth research question of the current study. Its main content involves the analysis of the differences between successful and unsuccessful test takers in processing the different types of VSTEP.3-5 reading test items in terms of their eye movements, and of the extent to which the VSTEP.3-5 reading test discriminates test takers at different proficiency levels in terms of cognitive reading processes, which constitute the backing evidence proposed in the explanation inference for the VSTEP.3-5 reading test.
6.2.1 Eye tracking of successful and unsuccessful test takers across item types
Before presenting the eye-tracking data in this section, it is important to reiterate that successful and unsuccessful test takers are distinguished by whether they answered the given item correctly. Based on the primary subskills identified by the experts, all 40 items can be categorized into eight types for analysis. For reasons of space, the numerical eye-tracking data are not compared between successful and unsuccessful test takers for all 40 items; instead, the data on the most typical and representative item of each type are compared and presented below. For the sake of confidentiality, the text content in the heat maps and gaze plots that follow has been obscured.
Item type of “understanding explicit information”
Table 6.12. Test takers’ eye tracking data on Item 1 (indicators of global, text and task processing and the number of AOI switches between text and task, for successful and unsuccessful test takers)
Heat map from the successful test taker on Item 1
Heat map from the unsuccessful test taker on Item 1
From Table 6.12, it can be seen that when the Mann-Whitney U test is applied to the differences between successful and unsuccessful test takers’ eye tracking on the AOIs, no significant difference is noted on any of the indicators examined. This type of item requires test takers to understand specific information that is explicitly stated in the text. Examination of the heat maps for Item 1 from the successful and unsuccessful test takers (see Figure 6.1 and Figure 6.2 for details) shows that the fixations of both test takers fall on the first paragraph of the passage; however, the intensive fixations of the successful test taker fall on the key information, whereas those of the unsuccessful test taker do not fall on the area containing the correct answer.
Item type of “understanding word meanings in context”
Table 6.13. Test takers’ eye tracking data on Item 29 (indicators of global, text and task processing and the number of AOI switches between text and task)
Gaze plot from the successful test taker on Item 29
Gaze plot from the unsuccessful test taker on Item 29
Table 6.13 shows test takers’ eye tracking data on Item 29, which tests the subskill of “understanding word meanings in context”. When the Mann-Whitney U test is applied to the differences between successful and unsuccessful test takers’ eye tracking on the AOIs, no significant difference is noted on the indicators investigated, except for the number of AOI switches between text and task. Compared to the successful test takers, the unsuccessful test takers spend more time on both text and task; however, the difference is not statistically significant (U (N successful = 6, N unsuccessful = 9) = 0.00, Z= -1.88, p= 0.06). Besides, the total number of fixations and the number of forward saccades of unsuccessful test takers in the text AOI are higher than those of successful test takers, but the difference is not statistically significant either (U (N successful = 6, N unsuccessful = 9) = 0.00, Z= -1.88, p= 0.06). As for the number of regressions, there are only slight differences between successful and unsuccessful test takers (U (N successful = 6, N unsuccessful = 9) = 3.00, Z= -0.49, p= 0.62). The number of AOI switches between text and task of unsuccessful test takers, however, is significantly higher than that of successful test takers (U (N successful = 6, N unsuccessful = 9) = 0.00, Z= -2.00, p= 0.046), which can also be examined in the gaze plots from the successful and unsuccessful test takers on Item 29 (see Figure 6.3 and Figure 6.4 for details). This probably indicates that unsuccessful test takers switch between the text AOI and the task AOI repeatedly to construct an answer, owing to the difficulty of matching the word meanings contained in the text with the task when selecting the correct answer.
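Comparisons like those above can be run with any standard implementation of the Mann-Whitney U test (for example, `scipy.stats.mannwhitneyu`), but the statistic itself is simple enough to sketch directly. The dwell times below are hypothetical values invented for illustration; the sketch only shows why U = 0.00 appears so often in these tables:

```python
def mann_whitney_u(x, y):
    """U statistic: for each pair (xi, yj), count 1 if xi > yj and 0.5 if
    tied; the conventionally reported U is the smaller directional count."""
    u_x = sum(1.0 if xi > yj else 0.5 if xi == yj else 0.0
              for xi in x for yj in y)
    u_y = len(x) * len(y) - u_x
    return min(u_x, u_y)

# Hypothetical dwell times in seconds: every successful test taker here is
# faster than every unsuccessful one, so the two groups do not overlap.
successful = [38.5, 39.8, 40.3, 41.2, 42.1, 44.0]                      # n = 6
unsuccessful = [55.6, 56.8, 57.4, 58.9, 59.5, 60.0, 61.2, 62.3, 63.1]  # n = 9

print(mann_whitney_u(successful, unsuccessful))  # 0.0: no group overlap
```

A U of 0.00, as reported in several tables in this section, therefore means the two groups’ values do not overlap at all on that indicator; the accompanying Z and p values then determine whether that separation is statistically significant for the given sample sizes.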
Item type of “understanding cohesive devices”
Table 6.14. Test takers’ eye tracking data on Item 2 (indicators of global, text and task processing and the number of AOI switches between text and task)
Heat map from the successful test taker on Item 2
Heat map from the unsuccessful test taker on Item 2
Table 6.14 presents test takers’ eye tracking data on Item 2. When the Mann-Whitney U test is applied to the differences between successful and unsuccessful test takers’ eye tracking on the AOIs, no significant difference is noted in terms of total dwell time, total number of fixations, number of forward saccades, number of regressions, task dwell time, or number of AOI switches between text and task, with one exception: in terms of text dwell time, unsuccessful test takers spend significantly longer on the text than successful test takers (U (N successful = 8, N unsuccessful = 7) = 0.00, Z= -1.96, p= 0.048). The heat maps (Figure 6.5 and Figure 6.6) from the successful and unsuccessful test takers on Item 2 also indicate that unsuccessful test takers spend much more time on the text than successful test takers. This would appear to support the conclusion drawn in Bax (2013) and Brunfaut and McCray (2015) that successful test takers are better able to find and identify the appropriate part of a text, using search reading, and can then read it carefully, while unsuccessful test takers, who are less able to set the appropriate reading goal, fail to do so.
Item type of “making inferences”
Table 6.15. Test takers’ eye tracking data on Item 23 (indicators of global, text and task processing and the number of AOI switches between text and task)
Heat map from the successful test taker on Item 23
Heat map from the unsuccessful test taker on Item 23
From Table 6.15, it can be seen that when the Mann-Whitney U test is applied to the differences between successful and unsuccessful test takers’ eye tracking on the AOIs, differences are noted on the indicators investigated. The total dwell time and the text dwell time of unsuccessful test takers are much greater than those of successful test takers (U (N successful = 6, N unsuccessful = 9) = 0.00, Z= -1.96, p= 0.048); these differences are statistically significant. Besides, the total number of fixations and the number of forward saccades of unsuccessful test takers in the text AOI are higher than those of successful test takers (U (N successful = 6, N unsuccessful = 9) = 0.00, Z= -1.88, p= 0.06). The number of regressions and the number of AOI switches between text and task of unsuccessful test takers are also higher (U (N successful = 6, N unsuccessful = 9) = 0.00, Z= -1.91, p= 0.057). This item requires test takers to identify and understand an implicit detail that is rewritten using different words. Examination of the heat maps on Item 23 from the successful and unsuccessful test takers (see Figure 6.7 and Figure 6.8 for details) shows that the fixations of the successful test taker fall on the first and second paragraphs of the text, with the intensive fixations falling on the key information, whereas the fixations of the unsuccessful test taker fall only on the second paragraph. The unsuccessful test taker may have chosen the incorrect answer because he did not locate and identify the implicit detail in the correct area of the text.
Item type of “integrating information”
Table 6.16. Test takers’ eye tracking data on Item 6 (indicators of global, text and task processing and the number of AOI switches between text and task, for successful and unsuccessful test takers)
Gaze plot from the successful test taker on Item 6
Gaze plot from the unsuccessful test taker on Item 6
The above table presents test takers’ eye tracking data on Item 6, which tests the subskill of integrating information across the passage. When the Mann-Whitney U test is applied to the differences between successful and unsuccessful test takers’ eye tracking on the AOIs, significant differences are noted on all of the AOIs. Specifically, in the global AOI, there are significant differences in total dwell time (U (N successful = 7, N unsuccessful = 8) = 0.00, Z= -1.96, p= 0.048) and total number of fixations (U (N successful = 7, N unsuccessful = 8) = 0.00, Z= -1.96, p= 0.048), which can also be seen in Figure 6.9 and Figure 6.10, possibly indicating that it is more difficult for unsuccessful test takers to extract information. In the text AOI, unsuccessful test takers spend more time reading the text than successful test takers (U (N successful = 7, N unsuccessful = 8) = 0.00, Z= -1.96, p= 0.048), possibly indicating a longer search for information. Unsuccessful test takers’ forward saccade count and regression count are also significantly higher than those of successful test takers (U (N successful = 7, N unsuccessful = 8) = 0.00, Z= -1.96, p= 0.048 and U (N successful = 7, N unsuccessful = 8) = 0.00, Z= -1.99, p= 0.046, respectively), which can likewise be seen in Figure 6.9 and Figure 6.10, possibly indicating that unsuccessful test takers experience a longer search process and encounter doubt or difficulty in reading the text. In the task AOI, unsuccessful test takers spend more time reading the task than successful test takers (U (N successful = 7, N unsuccessful = 8) = 0.00, Z= -1.96, p= 0.048). The number of AOI switches between text and task of unsuccessful test takers is significantly higher than that of successful test takers (U (N successful = 7, N unsuccessful = 8) = 0.00, Z= -1.99, p= 0.046), indicating that unsuccessful test takers check both text and task frequently.
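The indicators compared throughout this section (dwell time, fixation count, forward saccades, regressions and AOI switches) are all derived from the raw fixation stream. The sketch below shows one simplified way such indicators can be computed; the `Fixation` type, the 5-pixel same-line tolerance and the sample data are illustrative assumptions, not the study’s actual extraction pipeline:

```python
from dataclasses import dataclass

@dataclass
class Fixation:
    x: float         # horizontal gaze position in pixels
    y: float         # vertical gaze position in pixels
    duration: float  # fixation duration in milliseconds
    aoi: str         # area of interest: "text" or "task"

def summarize(fixations):
    """Derive the eye-tracking indicators used in this section from a
    chronological sequence of fixations."""
    stats = {
        "total_dwell": sum(f.duration for f in fixations),
        "fixation_count": len(fixations),
        "text_dwell": sum(f.duration for f in fixations if f.aoi == "text"),
        "task_dwell": sum(f.duration for f in fixations if f.aoi == "task"),
        "forward_saccades": 0,
        "regressions": 0,
        "aoi_switches": 0,
    }
    LINE_TOL = 5  # pixels treated as "same line" (an assumed tolerance)
    for prev, cur in zip(fixations, fixations[1:]):
        if cur.aoi != prev.aoi:
            stats["aoi_switches"] += 1
        elif cur.y > prev.y + LINE_TOL or (abs(cur.y - prev.y) <= LINE_TOL
                                           and cur.x > prev.x):
            stats["forward_saccades"] += 1  # movement in reading order
        else:
            stats["regressions"] += 1       # movement back through the text
    return stats

# Tiny hypothetical recording: three text fixations, then two task fixations.
sample = [Fixation(10, 100, 200, "text"), Fixation(60, 100, 180, "text"),
          Fixation(20, 100, 150, "text"), Fixation(30, 400, 220, "task"),
          Fixation(80, 400, 100, "task")]
print(summarize(sample))
```

Real eye trackers apply velocity-based event detection and richer AOI geometry, but the counts compared between successful and unsuccessful test takers in the tables above are conceptually of this kind.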
Item type of “summarizing main ideas”
Table 6.17. Test takers’ eye tracking data on Item 27 (indicators of global, text and task processing and the number of AOI switches between text and task)
Gaze plot from the successful test taker on Item 27
Gaze plot from the unsuccessful test taker on Item 27
Table 6.17 presents test takers’ eye tracking data on Item 27, which tests the subskill of “summarizing the main idea” of a specific paragraph. When the Mann-Whitney U test is applied to the differences between successful and unsuccessful test takers’ eye tracking on the AOIs, significant differences are noted on all the AOIs. Specifically, in the global AOI, there are statistically significant differences in total dwell time (U (N successful = 10, N unsuccessful = 5) = 0.00, Z= -1.96, p= 0.048) and total number of fixations (U (N successful = 10, N unsuccessful = 5) = 0.00, Z= -1.96, p= 0.048), which can also be seen in Figure 6.11 and Figure 6.12, probably indicating that it is more difficult for unsuccessful test takers to extract information. In the text AOI, unsuccessful test takers spend more time reading the text than successful test takers (U (N successful = 10, N unsuccessful = 5) = 0.00, Z= -1.96, p= 0.048), possibly indicating that their search for information takes much longer. Unsuccessful test takers’ forward saccade count and regression count are also significantly higher than those of successful test takers (U (N successful = 10, N unsuccessful = 5) = 0.00, Z= -1.96, p= 0.048 and U (N successful = 10, N unsuccessful = 5) = 0.00, Z= -2.12, p= 0.034, respectively), which can likewise be seen in Figure 6.11 and Figure 6.12, possibly indicating that unsuccessful test takers experience a longer search process and encounter doubt or difficulty in reading the text. In the task AOI, unsuccessful test takers spend more time reading the task than successful test takers (U (N successful = 10, N unsuccessful = 5) = 0.00, Z= -1.96, p= 0.048). The number of AOI switches between text and task of unsuccessful test takers is significantly higher than that of successful test takers (U (N successful = 10, N unsuccessful = 5) = 0.00, Z= -2.12, p= 0.034), indicating that unsuccessful test takers may check both text and task frequently to search for uncertain information.
Item type of “understanding explicit/implicit author’s opinion/attitude”
Test takers’ eye tracking data on Item 10 (indicators of global, text and task processing and the number of AOI switches between text and task)
Gaze plot from the successful test taker on Item 10
Gaze plot from the unsuccessful test taker on Item 10
Summary
All of the above findings for the two research questions from the process-based perspective have shed light on the processes of the participants’ performances when they are engaged in the VSTEP.3-5 reading test. From this perspective, it can be summarized that (1) the expert judgement and the data from participants’ think-aloud protocols are basically consistent in terms of the primary reading subskills assessed by the test items, except for some test-wiseness strategies; (2) there are differences between successful and unsuccessful test takers in terms of their eye movements; and (3) the eye-tracking data show that the test items are effective in discriminating test takers at different proficiency levels in terms of cognitive reading processes.
CONCLUSION
Interpretive argument for the VSTEP.3-5 reading test revisited
The purpose of the current study is to validate one version of the VSTEP.3-5 reading test from both product-based and process-based perspectives. A reasonable interpretive argument has been constructed to consider the specific context of the VSTEP.3-5 reading test and to allocate verification efforts effectively in seeking evidence to support the most critical and relevant inferences related to the purpose of the test. Framed within an argument-based framework of language test validation (Chapelle et al., 2008), two inferences of score interpretation and use, which pertain to the most critical purposes of the current study, have been evaluated: the generalization inference and the explanation inference. Corresponding to these central inferences, five research questions have been proposed. To answer them, an interpretive argument for the VSTEP.3-5 reading test comprising warrants and assumptions has been constructed, and multiple sources of evidence to support or challenge each validity inference have been collected and analyzed.
7.1.1 A recap of the generalization inference
From the proposed interpretive argument for the VSTEP.3-5 reading test, the generalization inference is based on the warrant that observed scores are estimates of expected scores over relevant parallel versions of tasks and test forms. Four assumptions addressing different aspects of this warrant have been articulated.
Assumption 1 proposes that the configuration of tasks on the VSTEP.3-5 reading measure is appropriate for the intended interpretation. As backing evidence, the findings on the characteristics of the VSTEP.3-5 reading test tasks support this assumption. Specifically, these findings, based mainly on the modified framework of task characteristics for the VSTEP.3-5 reading test and the expert interviews, show that the test tasks largely meet the requirements of the test specifications. Thus, Assumption 1 is well supported by the backing evidence, with no potential rebuttals.
Assumption 2 states that the test items are designed with the predetermined difficulty described in the test specifications. The backing evidence is the findings of the statistical analysis of item difficulty of the VSTEP.3-5 reading test, relying mainly on the dichotomous Rasch model, the reasons for whose use are discussed in the literature review chapter. These findings help explore the role of each test item for this group of test takers and whether the items are appropriately designed in line with the difficulty levels predetermined in the test specifications; the analysis comprises three parts: (1) item unidimensionality, (2) item estimates, and (3) item and person calibration. In a word, the findings of the statistical analysis of item difficulty strongly support Assumption 2, that the test items are designed with the predetermined difficulty described in the test specifications, and there is no potential rebuttal that the test items are improperly designed in relation to the specifications.
Assumption 3 proposes that the genders and majors of test takers do not affect their reading scores. The findings on DIF across test takers’ genders and majors act as backing evidence for this assumption. In the current study, DIF is determined by testing the stability of item difficulty across test takers of different genders and majors, so as to test the universality of the applicability of the test items; the study mainly relies on the Rasch-Welch t-test statistics. In short, these findings provide evidence that genders and majors of test takers do not affect their reading scores and therefore serve as one line of backing evidence for the generalization inference.
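The Rasch-Welch procedure behind these DIF statistics compares an item's difficulty estimate in one group with its estimate in the other. A minimal sketch of the computation is given below; the difficulty estimates, standard errors and group sizes are hypothetical, and the flagging thresholds reflect common Rasch DIF practice rather than this study's own report:

```python
import math

def rasch_welch_t(d1, se1, n1, d2, se2, n2):
    """Welch t statistic and approximate degrees of freedom for the DIF
    contrast between two groups' Rasch item difficulties (in logits)."""
    t = (d1 - d2) / math.sqrt(se1 ** 2 + se2 ** 2)
    # Welch-Satterthwaite approximation of the degrees of freedom
    df = (se1 ** 2 + se2 ** 2) ** 2 / (
        se1 ** 4 / (n1 - 1) + se2 ** 4 / (n2 - 1))
    return t, df

# Hypothetical male vs. female difficulty estimates for a single item.
t, df = rasch_welch_t(0.42, 0.21, 96, 0.31, 0.23, 104)
# Common practice flags DIF only when the contrast is at least about
# 0.5 logits AND |t| is large enough to be significant; neither holds here,
# so this hypothetical item would show no meaningful DIF.
print(round(t, 2), round(df))
```

An item is tested once per grouping variable (gender, then major); stable, non-significant contrasts across all items are what license the conclusion that scores generalize across these subgroups.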
The last assumption under the generalization inference is that the reliability of the VSTEP.3-5 reading test falls within a good range. Backing evidence for this assumption is established through the findings on the reliability of the investigated VSTEP.3-5 reading test paper. In brief, under the generalization inference, the findings indicate that the test scores are adequately reliable measures of the test takers' English reading proficiency, which supports assumption 4 and rules out the potential rebuttal that the test reliability is lower than expected.
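The study's reliability evidence comes from Rasch-based indices, but the underlying idea of internal consistency for dichotomously scored items can be illustrated with the KR-20 coefficient (equivalent to Cronbach's alpha for 0/1 data). The function name and data below are illustrative sketches, not taken from the study:

```python
from statistics import pvariance

def kr20(responses):
    """Kuder-Richardson 20 reliability for dichotomous (0/1) item scores.

    `responses` is a list of test takers, each a list of 0/1 item scores.
    """
    n_persons = len(responses)
    n_items = len(responses[0])
    # p * q (proportion correct * proportion incorrect) for each item
    item_vars = []
    for i in range(n_items):
        p = sum(person[i] for person in responses) / n_persons
        item_vars.append(p * (1 - p))
    # Population variance of the total scores
    total_var = pvariance([sum(person) for person in responses])
    return (n_items / (n_items - 1)) * (1 - sum(item_vars) / total_var)

# Hypothetical data: 4 test takers answering 3 items
data = [[1, 1, 1], [1, 1, 0], [1, 0, 0], [0, 0, 0]]
print(kr20(data))  # 0.75
```

Values conventionally above 0.8 are read as "good range" for a proficiency test, which is the kind of threshold the assumption appeals to.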
7.1.2 A recap of the explanation inference
According to the proposed interpretive argument for the VSTEP.3-5 reading test, the explanation inference is based on the warrant that test takers' scores on the VSTEP.3-5 reading test can be attributed to the construct of English reading proficiency. Three assumptions have accordingly been proposed to authorize the explanation inference.
Assumption 1 is that the reading processes engaged by test takers vary according to theoretical expectations. As an important component of the explanation inference for validating the VSTEP.3-5 reading test, the findings from expert judgment and from participants' think-aloud protocols while answering test items contribute to understanding the structure of the investigated test. The correspondence between expected and actual subskill use provides reliable evidence for interpreting the test scores as the items measuring what they claim to measure. In general, the findings strongly support assumption 1, although one potential rebuttal emerged from the think-aloud protocols: some test takers employed test-wiseness strategies to answer the test questions.
Assumption 2 claims that successful and unsuccessful test takers differ in their eye movements when processing the different types of VSTEP.3-5 reading test items. Backing evidence for this assumption is sought in the analysis of the eye movements of successful and unsuccessful test takers processing the different item types. The findings show that such differences do exist. Therefore, this line of evidence strongly supports assumption 2 in the explanation inference, without any potential rebuttals.
Assumption 3 for the explanation inference is that the VSTEP.3-5 reading test can discriminate between test takers at different proficiency levels in terms of cognitive reading processes. The findings on the extent to which the test does so serve as backing evidence for this assumption.
Combining the findings from the product-based perspective with those from the process-based perspective under the interpretive argument proposed for the study, it can be inferred that the validity of the investigated VSTEP.3-5 reading test is warranted. The related discussions also leave room for further reflection: VSTEP.3-5 is a newly developed test that has not yet gained wide national and international recognition, so continued validation remains crucial to its development.
Implications of the study
The implications of the current study fall into theoretical, methodological and practical aspects.
Theoretically, the argument-based approach is employed to provide a framework for the overall research design. The current study demonstrates the usefulness of the argument-based approach, and of the interpretive argument (Chapelle et al., 2008) in particular, in validating a high-stakes standardized test of English proficiency independently developed in Vietnam.
First, the contribution of the current study owes much to its adaptation of the interpretive argument template to the VSTEP.3-5 reading test. As Kane (1992) argues, interpretive arguments are artifacts which change with time and are modified for circumstances. This flexibility enables the interpretive argument to work in the context in which it is used, for the purpose for which it is proposed, and with the population taking the test (Kane, 2004). For the VSTEP.3-5 reading test, two central and distinguishing perspectives on test score interpretation and use are identified in the current study: product-based and process-based. Thus, the current study focuses on two inferences: the generalization inference and the explanation inference.
Second, through articulating the validity warrants and assumptions, the interpretive argument for the VSTEP.3-5 reading test provides a logical framework for allocating the relevant evidence to justify each validity inference. Accordingly, different types of evidence involving both quantitative and qualitative data are collected and analyzed. As an overall research framework, the interpretive argument shows its merits in four aspects: (1) articulating explicit statements of the proposed test score interpretation and use, (2) guiding logical reasoning through validity warrants and assumptions, (3) offering a practical framework for identifying, collecting, and analyzing empirical data to address each assumption, and (4) synthesizing the findings in terms of backing evidence and potential rebuttals to highlight the strengths and weaknesses in the validity of the test.
Methodologically, previous studies have generally involved test product data, whereas the current study places great emphasis on process data tapped through introspective or observational approaches such as think-aloud protocols and eye tracking, both of which have been used widely in psychological studies but only to a limited extent in language testing and assessment. The attempt to explore test takers' cognitive processes in the VSTEP.3-5 reading test mainly through think-aloud protocols and eye tracking has illuminated some of the unobservable sides of test takers' mental processes while completing tests, which also demonstrates the potential for wider use of this research paradigm.
It is worth mentioning that methodologies designed to explore actual thought processes through biological indicators, including eye movements, can go beyond the evaluation of a product obtained in a group setting that permits answers arrived at in a covert fashion (Kong, 2016). Therefore, eye tracking can be extremely valuable as part of test validation studies even though it is quite labor intensive. In recent years, a few researchers in the field of language testing and assessment have actively explored the possibility of using eye tracking in research on language testing and assessment, and great advances have been made in this regard (Bax, 2013; Brunfaut & McCray, 2015; Bax & Chan, 2016, 2019). In Vietnam, eye tracking has not yet been used in research on language testing and assessment. This study can be regarded as a new attempt and a reference for future studies.
Practically, the multiple sources of evidence from both product-based and process-based perspectives in the current study may provide potential implications for stakeholders of the VSTEP.3-5 reading test, including test takers, teachers, test designers, curriculum designers and researchers. In fact, the discussion of test fairness plays an important role in the practical implications. As stated in the Standards for Educational and Psychological Testing (AERA et al., 2014), test fairness is the extent to which the inferences based on test scores are valid for different groups of test takers. This definition carries implications for different aspects of test development and test validity; among them, fairness in test design and fairness in test use relate to the multiple sources of evidence in the current study. A major issue in this phase of test design is the anticipation of potential sources of score variance, since construct-irrelevant variance is detrimental to test fairness. Therefore, any factors that are irrelevant to the test constructs should be avoided in the test design phase. Reviewing the findings of the current study, some evidence of construct-irrelevant factors can be found. Data from the think-aloud protocols and reading processing checklists show that some test takers tend to use the test-wiseness strategies of elimination and guessing. This indicates that the test has been constructed in a way that potentially induces test takers to resort to reading processes other than the expected ones.
Identifying these construct-irrelevant sources has implications not only for test designers, but also for teachers, test takers, and researchers. For test designers, caution must be exercised when designing test items in order to avoid building in construct-irrelevant sources. As for teachers and test takers, the teaching and learning of reading comprehension should pay close attention to comprehension itself, and the ultimate goal of subskill training should be to cultivate test takers with good reading ability, flexible use of subskills, and familiarity with different types and genres of reading materials; test-oriented teaching should be minimized as much as possible. As for researchers, transparent and in-depth studies should be conducted to gain more insight into problematic aspects of the test. The argument-based approach employed in the current study is a viable framework for a comprehensive investigation of these aspects.
Limitations of the study
The current study is one of the few that investigate the validity of the VSTEP.3-5 reading test comprehensively from both product-based and process-based perspectives. However, some limitations also need to be pointed out to inform future validation studies of the VSTEP.3-5 reading test.
On the one hand, due to the confidentiality policy covering test scores and the outbreak of the Covid-19 pandemic, the researcher was unable to obtain real score data from test takers in Vietnam.
On the other hand, the sample sizes for some of the qualitative and quantitative data collection, such as the think-aloud protocols and eye tracking, are small due to administrative restrictions. Therefore, caution should be exercised in interpreting the findings of the current study.
Suggestions for future studies
In view of the limitations discussed above, some suggestions for further studies are offered.
Firstly, since the test takers recruited in the current study are exclusively students at two universities, it would be meaningful and worthwhile for future studies to include a wider range of test takers, such as civil servants in major and professional positions and people working in other fields, so as to allow more confident generalization of the findings.
Secondly, since the sample size for the think-aloud protocols and the eye-tracking experiment is not sufficiently large, it would be reasonable to increase the number of participants in future studies, because the difficulties in interpreting some findings in the current study may result from the limited number of participants.
Thirdly, the current study examines just one test form that materializes the VSTEP.3-5, not the VSTEP.3-5 test as an abstraction. More validation efforts should therefore be carried out so that the validity of the test can be guaranteed and the test can enjoy public trust as a gate-keeping instrument.
LIST OF PUBLICATIONS RELATED TO DISSERTATION
1. Wei, Y., & Nguyen, D. H. (2021). A validity study on the Vietnamese Standardized Test of English Proficiency (VSTEP.3-5): From test-takers' perspectives. Journal of
2. Wei, Y. (2021). A comparative study on the validity of VSTEP.3-5 and PETS-5 reading tests. International Graduate Research Symposium (IGRS), 40-51.
LIST OF PUBLICATIONS IN OTHER FIELDS
1. Wei, Y. (2020). A study on the application of modern information technology in the intercultural teaching of College English in application-oriented universities in China. Journal of the College of Northwest Adult Education, 06, 55-57.
REFERENCES
Adrian, J. W. (2003). Reading in a second language: A reading problem or a language problem? Journal of College Reading and Learning, 33(2), 157.
Afflerbach, P. P. (1990). The influence of prior knowledge on expert readers' main idea construction strategies. Reading Research Quarterly, 25(1), 31-46.
Afflerbach, P. P., & Johnston, P. H. (1984). On the use of verbal reports in reading research. Journal of Reading Behavior, 16(4), 307-321.
Alanen, R. (1995). Input enhancement and rule presentation in second language acquisition. In R. Schmidt (Ed.), Attention and awareness in foreign language learning (pp. 259-302). Honolulu: University of Hawaii Press.
Alderson, J. C. (1990a). Testing reading comprehension skills (Part One). Reading in a Foreign Language, 6(2), 425-438.
Alderson, J. C. (1990b). Testing reading comprehension skills (Part Two): Getting students to talk about a reading test (A pilot study). Reading in a Foreign Language, 7(1), 465-503.
Alderson, J. C. (2000). Assessing reading. Cambridge: Cambridge University Press.
Alderson, J. C., & Lukmani, Y. (1989). Cognition and reading: Cognitive levels as embodied in test questions. Reading in a Foreign Language, 5(2), 253-270.
Alderson, J. C., Haapakangas, E. L., Huhta, A., Nieminen, L., & Ullakonoja, R. (2015). The diagnosis of reading in a second or foreign language. New York, NY: Routledge.
American Educational Research Association, American Psychological Association, & National Council on Measurement in Education. (1985). Standards for educational and psychological testing. Washington, DC: American Educational Research Association.
American Educational Research Association, American Psychological Association, & National Council on Measurement in Education. (1999). Standards for educational and psychological testing. Washington, DC: American Educational Research Association.
American Educational Research Association, American Psychological Association, & National Council on Measurement in Education. (2014). Standards for educational and psychological testing. Washington, DC: American Educational Research Association.
American Psychological Association. (1954). Technical recommendations for psychological tests and diagnostic techniques. Psychological Bulletin, 51, 201-
American Psychological Association, American Educational Research Association, & National Council on Measurement in Education. (1966). Standards for educational and psychological tests and manuals. Washington, DC: American Psychological Association.
American Psychological Association, American Educational Research Association, & National Council on Measurement in Education. (1974). Standards for educational and psychological tests and manuals. Washington, DC: American Psychological Association.
Anderson, N., Bachman, L., Perkins, K., & Cohen, A. D. (1991). An exploratory study into the construct validity of a reading comprehension test: Triangulation of data sources. Language Testing, 8(1), 41-66.
Andrich, D. (1978). A rating formulation for ordered response categories.
Andrich, D. (1982). An extension of the Rasch model for ratings providing both location and dispersion parameters. Psychometrika, 47, 105-113.
Aryadoust, V., Ng, L. Y., & Sayama, H. (2021). A comprehensive review of Rasch measurement in language assessment: Recommendations and guidelines for research. Language Testing, 38(1), 6-40.
Ashby, J., Rayner, K., & Clifton, C. (2005). Eye movements of highly skilled and average readers: Differential effects of frequency and predictability. The Quarterly Journal of Experimental Psychology, 58A(6), 1065-1086.
Aviad-Levitzky, T., Laufer, B., & Goldstein, Z. (2019). The new Computer Adaptive Test of Size and Strength (CATSS): Development and validation. Language Assessment Quarterly, 16(3), 345-368.
Bachman, L. F. (1990). Fundamental considerations in language testing. Oxford: Oxford University Press.
Bachman, L. F. (2005). Building and supporting a case for test use. Language Assessment Quarterly, 2(1), 1-34.
Bachman, L. F., & Palmer, A. S. (1996). Language testing in practice: Designing and developing useful language tests. Oxford, UK: Oxford University Press.
Bachman, L. F., & Palmer, A. S. (2010). Language assessment in practice: Developing language assessments and justifying their use in the real world. Oxford, UK: Oxford University Press.
Bachman, L. F., Vanniarjan, A. K. S., & Lynch, B. (1988). Task and ability analysis as a basis for examining content and construct comparability in two EFL proficiency batteries. Language Testing, 5(2), 128-159.
Baghaei, P. (2007). Applying the Rasch rating-scale model to set multiple cut-offs.
Baghaei, P. (2008). The Rasch model as a construct validation tool. Rasch Measurement Transactions, 22(1), 1145-1146.
Baghaei, P. (2010). A comparison of three polychotomous Rasch models for super-item analysis. Psychological Test and Assessment Modeling, 52, 313-323.
Baghaei, P., & Carstensen, C. H. (2013). Fitting the mixed Rasch model to a reading comprehension test: Identifying reader types. Practical Assessment, Research & Evaluation.
Baghaei, P., & Cassady, J. (2014). Validation of the Persian translation of the Cognitive Test Anxiety Scale. Sage Open, 4, 1-11.
Baghaei, P., Kemper, C., Reichert, M., & Greif, S. (2019). Mixed Rasch modeling in assessing reading comprehension. In V. Aryadoust & M. Raquel (Eds.), Quantitative data analysis for language assessment (Vol. II) (pp. 15-32). New York, NY: Routledge.
Baghaei, P., Monshi, T. M. T., & Boori, A. A. (2009). An investigation into the validity of conversational C-Test as a measure of oral abilities. Iranian EFL Journal, 4, 94-109.
Baker, F. B. (1985). The basics of item response theory. Portsmouth, NH: Heinemann.
Barkaoui, K. (2007). Rating scale impact on EFL essay marking: A mixed-method study. Assessing Writing, 12(2), 86-107.
Barkaoui, K. (2010). Do ESL essay raters' evaluation criteria change with experience? A mixed-methods, cross-sectional study. TESOL Quarterly, 44(1), 31-57.
Barkaoui, K. (2011). Think-aloud protocols in research on essay rating: An empirical study of their veridicality and reactivity. Language Testing, 28(1), 51-75.
Bartlett, F. C. (1932). Remembering: A study in experimental and social psychology. Cambridge, UK: Cambridge University Press.
Barton, M. E. (1994). Input and interaction in language acquisition. Cambridge: Cambridge University Press.
Bax, S. (2013). The cognitive processing of candidates during reading tests: Evidence from eye-tracking. Language Testing, 30(4), 441-465.
Bax, S., & Chan, S. (2016). Researching the cognitive validity of GEPT High-Intermediate and Advanced Reading: An eye tracking and stimulated recall study. The Language Training and Testing Center (LTTC), Taipei, Taiwan (R.O.C.).
Bax, S., & Chan, S. (2019). Using eye-tracking research to investigate language test validity and design. System, 83, 64-73.
Bax, S., & Weir, C. J. (2012). Investigating learners' cognitive processes during a computer-based CAE reading test. Cambridge ESOL: Research Notes, 47, 3-14.
Beglar, D. (2010). A Rasch-based validation of the Vocabulary Size Test. Language Testing, 27(1), 101-118.
Birch, B. M. (2002). English L2 reading: Getting to the bottom. Mahwah, NJ: Lawrence Erlbaum Associates.
Block, E. (1986). The comprehension strategies of second language readers. TESOL Quarterly, 20(3), 463-494.
Bond, T. G., & Fox, C. M. (2007). Applying the Rasch model: Fundamental measurement in the human sciences. Mahwah, NJ: Erlbaum.
Bond, T. G., & Fox, C. M. (2015). Applying the Rasch model: Fundamental measurement in the human sciences (3rd ed.). New York: Routledge.
Borsboom, D., Mellenbergh, G. J., & Heerden, J. (2004). The concept of validity.
Bowles, M. A. (2010). The think-aloud controversy in second language research.
Brooks, L., & Swain, M. (2014). Contextualizing performances: Comparing performances during TOEFL iBT™ and real-life academic speaking activities.
Brunfaut, T. (2016). Looking into reading II: A follow-up study on test-takers' cognitive processes while completing APTIS B1 reading tasks. London: The British Council.
Brunfaut, T., & McCray, G. (2015). Looking into test-takers' cognitive processes whilst completing reading tasks: A mixed-method eye-tracking and stimulated recall study. London: The British Council.
Brutten, S. R., Perkins, K., & Upshur, J. A. (1991). Measuring growth in ESL reading. Paper presented at the 13th Annual Language Testing Research Colloquium, Princeton, New Jersey.
Caron, T. A. (1989). Strategies for reading expository prose. In S. E. McCormick & J. E. Zutell (Eds.), Cognitive and social perspectives for literacy research instruction (Thirty-eighth yearbook of the National Reading Conference) (pp. 293-
Carrell, P. L., & Grabe, W. (2002). Reading. In N. Schmitt (Ed.), An introduction to applied linguistics (pp. 233-250). London: Arnold.
Carroll, B. J. (1980). Testing communicative competence: An interim study. Oxford: Pergamon Press.
Carroll, B. J. (1993). Human cognitive abilities. Cambridge: Cambridge University Press.
Carver, R. P. (1992). What do standardized tests of reading comprehension measure in terms of efficiency, accuracy and rate? Reading Research Quarterly, 27, 347-
Chapelle, C. A. (2012). Conceptions of validity. In G. Fulcher & F. Davidson (Eds.), The Routledge handbook of language testing (pp. 21-33). New York, NY: Routledge.
Chapelle, C. A., & Voss, E. (2013). Evaluation of language tests through validation research. In A. J. Kunnan (Ed.), The companion to language assessment (pp. 1-
Chapelle, C. A., Enright, M. K., & Jamieson, J. M. (2008). Building a validity argument for the Test of English as a Foreign Language. New York, NY: Routledge.
Chapelle, C. A., Enright, M. K., & Jamieson, J. M. (2010). Does an argument-based approach to validity make a difference? Educational Measurement: Issues and Practice, 29(1), 3-13.
Chi, M. T. H., Bassock, M., Lewis, M. W., Reimann, P., & Glaser, R. (1989). Self-explanations: How students study and use examples in learning to solve problems.
Cohen, A. D. (1984). On taking language tests: What the students report. Language Testing, 1(1), 70-81.
Cohen, A. D. (1994). Assessing language ability in the classroom (2nd ed.). Boston: Heinle & Heinle.
Cohen, A. D. (2006). The coming of age of research on test-taking strategies.
Cohen, A. D., & Upton, T. A. (2006). Strategies in responding to the new TOEFL reading tasks. TOEFL Monograph. Princeton, NJ: Educational Testing Service.
Cote, N., & Goldman, S. R. (1999). Building representations of informational text: Evidence from children's think-aloud protocols. In H. V. Oostendor & S. R. Goldman (Eds.), The construction of mental representations during reading (pp. 169-193). Mahwah: Lawrence Erlbaum Associates.
Crain-Thoreson, C., Lippman, M. Z., & McClendon-Magnuson, D. (1997). Windows on comprehension: Reading comprehension processes as revealed by two think-aloud procedures. Journal of Educational Psychology, 89(4), 579-591.
Creswell, J. W. (2008). Educational research: Planning, conducting and evaluating quantitative and qualitative research. Upper Saddle River, NJ: Pearson.
Creswell, J. W., & Clark, V. L. P. (2007). Designing and conducting mixed methods research. Australian & New Zealand Journal of Public Health.
Creswell, J. W., & Clark, V. L. P. (2011). Designing and conducting mixed methods research (2nd ed.). Thousand Oaks, CA: Sage.
Creswell, J. W., & Clark, V. L. P. (2018). Designing and conducting mixed methods research. Los Angeles, CA: Sage Publications.
Cronbach, L. J. (1971). Test validation. In R. L. Thorndike (Ed.), Educational measurement (2nd ed., pp. 443-507). Washington, DC: American Council on Education.
Dávid, G. (2007). Investigating the performance of alternative types of grammar items.
Davey, B. (1988). Factors affecting the difficulty of reading comprehension items for successful and unsuccessful readers. Experimental Education, 56, 67-76.
Dawadi, S., & Shrestha, P. N. (2018). Construct validity of the Nepalese school leaving English reading test. Educational Assessment, 23(2), 102-120.
Dörnyei, Z. (2007). Research methods in applied linguistics. Oxford: Oxford University Press.
Drahozal, E. C., & Hanna, G. S. (1979). Reading comprehension subscores: Pretty bottles for ordinary wine. Journal of Reading, 21(5), 416-420.
Drum, P. A., Calfee, R. C., & Cook, L. K. (1981). The effects of variables on performance in reading comprehension tests. Reading Research Quarterly, 16,
Dunlea, J., Spiby, R., Nguyen, T. N. Q., Nguyen, T. M. H., Nguyen, T. Q. Y., Nguyen, T. P. T., Thai, H. L. T., & Bui, T. S. (2018). APTIS-VSTEP comparability study: Investigating the usage of two EFL tests in the context of higher education in Vietnam. British Council Validation Series VS/2018/001. London: British Council.
Duran, R. P. (1989). Testing of linguistic minorities. In R. L. Linn (Ed.), Educational measurement (3rd ed.) (pp. 573-587). New York: Macmillan Publishing Co. Inc.
Dussias, P. E. (2010). Uses of eye-tracking data in second language sentence processing research. Annual Review of Applied Linguistics, 30(3), 149-166.
Eckes, T., & Grotjahn, R. (2006). A closer look at the construct validity of C-Tests.
Elliott, M., & Lim, G. S. (2016). The development of a new reading task: A mixed methods approach. In A. J. Moeller, J. Creswell, & N. Saville (Eds.), Second language assessment and mixed methods research (pp. 208-222). Cambridge: Cambridge University Press.
Embretson, S. E., & Reise, S. P. (2000). Item response theory for psychologists. NJ: Lawrence Erlbaum Associates.
Enright, M. K., Grabe, W., Koda, W., Mosenthal, P., Mulcahy-Ernt, P., & Schedl, M. (2000). TOEFL 2000 reading framework: A working paper (TOEFL Monograph Series No. 17). Princeton, New Jersey: Educational Testing Service.
Ericsson, K. A., & Simon, H. A. (1980). Verbal reports as data. Psychological Review,
Ericsson, K. A., & Simon, H. A. (1984). Protocol analysis: Verbal reports as data.
Ericsson, K. A., & Simon, H. A. (1993). Protocol analysis: Verbal reports as data (Rev. ed.). Cambridge: MIT Press.
Ericsson, K. A., & Simon, H. A. (1999). Protocol analysis: Verbal reports as data (3rd ed.). Cambridge: Bradford Books.
Farr, R., Pritchard, R., & Smitten, B. (1990). A description of what happens when an examinee takes a multiple-choice reading comprehension test. Journal of Educational Measurement, 27(3), 209-226.
Field, J. (2013). Cognitive validity. In A. Geranpayeh & L. Taylor (Eds.), Examining listening (pp. 77-151). Cambridge: Cambridge University Press.
Finocchiaro, M., & Sako, S. (1983). Foreign language testing: A practical approach. New York: Regents Publishing Company, Inc.
Fox, M. C., Ericsson, K. A., & Best, R. (2011). Do procedures for verbal reporting of thinking have to be reactive? A meta-analysis and recommendations for best reporting methods. Psychological Bulletin, 137(2), 316-344.
Francoise, G. (1981). Developing reading skills. Cambridge: Cambridge University Press.
Fransson, A. (1984). Cramming or understanding? Effects of intrinsic and extrinsic motivation on approach to learning and test performance. In J. C. Alderson & A. H. Urquhart (Eds.), Reading in a foreign language (pp. 86-121). London: Longman.
Frenck-Mestre, C. (2005). Eye-movement recording as a tool for studying syntactic processing in a second language: A review of methodologies and experimental findings. Second Language Research, 21(2), 175-198.
Fulcher, G. (2015). Re-examining language testing: A philosophical and social inquiry. London & New York: Routledge.
Gagne, E. D., Yekovich, C. W., & Yekovich, F. R. (1993). The cognitive psychology of school learning. New York: Harper Collins College.
Gaillard, S. (2014). The elicited imitation task as a method for French proficiency assessment in institutional and research settings (Doctoral dissertation). Retrieved from ProQuest Dissertations & Theses database.
Galaczi, E. D., & Khabbazbashi, N. (2016). Rating scale development: A multistage exploratory sequential design. In A. J. Moeller, J. Creswell, & N. Saville (Eds.), Second language assessment and mixed methods research (pp. 208-222). Cambridge: Cambridge University Press.
Gao, X. Y., & Liu, J. P. (2009). The use of mixed method in higher education research. International and Comparative Education, 31(3), 49-54.
Garner, R. (1987). Metacognition and reading comprehension. Norwood: Ablex.
Gass, S. M., & Mackey, A. (2000). Stimulated recall methodology in second language research. Mahwah: Lawrence Erlbaum Associates.
Gholamreza, H. (2013). The relative difficulty and significance of reading skills. International Journal of English Language Education, 1(3), 208-222.
Gough, P. B. (1972). One second of reading. Visible Language, 6(4), 291-320.
Grabe, W. P. (1988). Reassessing the term "interactive". In P. L. Carrell, J. Devine, & D. E. Eskey (Eds.), Interactive approaches to second language reading (pp. 56-70). Cambridge: Cambridge University Press.
Grabe, W. P. (1991). Current development in second language reading research.
Grabe, W. P. (2009). Reading in a second language: Moving from theory to practice. Cambridge: Cambridge University Press.
Grabe, W. P., & Stoller, F. L. (2002). Teaching and researching reading. New York: Longman.
Green, A. (1998). Verbal protocol analysis in language testing research: A handbook. Cambridge: Cambridge University Press.
Greene, J. C. (2007). Mixed methods in social inquiry. San Francisco, CA: Jossey-Bass.
Grotjahn, R. (1986). Test validation and cognitive psychology: Some methodological considerations. Language Testing, 3(2), 159-185.
Gulliksen, H. (1950). Theory of mental tests. Hillsdale, NJ: Lawrence Erlbaum.
Häikiö, T., Bertram, R., Hyönä, J., & Niemi, P. (2009). Development of the letter identity span in reading: Evidence from the eye movement moving window paradigm. Journal of Experimental Child Psychology, 102(2), 167-181.
Haladyna, T. M. (2004). Developing and validating multiple choice test items (3rd ed.). Mahwah, NJ: Lawrence Erlbaum.
Hama, M., & Leow, R. P. (2010). Learning without awareness revisited. Studies in Second Language Acquisition, 32(3), 465-491.
Han, B. C., & Luo, K. Z. (2013). The evolution of validity and validation in language assessment. Foreign Language Teaching and Research, 45(03), 411-425+481.
Harris, A. J., & Sipay, E. R. (1985). How to increase reading ability. New York:
Harrison, A. (1983). A language testing handbook. London: Macmillan Press.
Hartman, D. K. (1991). The intertextual links of readers using multiple passages: A postmodern/semiotic/cognitive view of meaning making. In J. E. Zutell & S. E. McCormick (Eds.), Learner factors/teacher factors: Issues in literacy research and instruction (Fortieth yearbook of the National Reading Conference) (pp. 49-
Hartman, D. K. (1995). Eight readers reading: The intertextual links of proficient readers reading multiple passages. Reading Research Quarterly, 30(3), 520-561.
Heaton, J. B. (1975). Writing English language tests: A practical guide for teachers of English as a second or foreign language. London: Longman.
Henning, G. (1987). A guide to language testing: Development, evaluation, research. Cambridge, Massachusetts: Newbury House Publishers.
Hochberg, J., & Brooks, V. (1970). Reading as an intentional behavior. In H. Singer & R. B. Ruddell (Eds.), Theoretical models and processes of reading (pp. 304-314). Newark, Del: International Reading Association.
Hudson, T. (1996). Assessing second language academic reading from a communicative competence perspective: Relevance for TOEFL 2000. TOEFL monograph series.
Huey, E. B. (1908). The psychology and pedagogy of reading. Massachusetts: MIT
Hughes, A. (1989). Testing for language teachers. Cambridge, UK: Cambridge University Press.
Jang, E. E., Wagner, M., & Park, G. (2014). Mixed methods research in language testing and assessment. Annual Review of Applied Linguistics, 34, 123-153.
Jiang, Y. M. (2009). Mixed methods research as the third methodological movement.
Jin, Y. (1998). An applied linguistics model of Chinese undergraduates' EAP reading and its operationalizations in the advanced English reading test (Unpublished doctoral dissertation). Shanghai Jiao Tong University.
Johnson, R. B., & Onwuegbuzie, A. J. (2004). Mixed methods research: A research paradigm whose time has come. Educational Researcher, 33(7), 14-26.
Johnson, R. B., Onwuegbuzie, A. J., & Turner, L. A. (2007). Toward a definition of mixed methods research. Journal of Mixed Methods Research, 1(2), 112-133.
Jolly, D. (1978). The establishment of a self-access scheme for intensive reading. Paper presented at the Goethe Institute, British Council on Reading, Paris.
Kane, M T (1992) An argument-based approach to validity Psychological Bulletin,
Kane, M T (2001) Current concerns in validity theory Journal of Educational
Kane, M T (2002) Validating high-stakes testing programs Educational Measurement: Issues and Practice, 21(1), 31-41
Kane, M T (2004) Certification testing as an illustration of argument-based validation Measurement: Interdisciplinary Research and Perspective, 2(3), 135-
Kane, M T (2006) Validation In R L Brennan (Ed.), Educational measurement (4th ed., pp 17-64) Washington, DC: American Council on Education/Praeger
Kane, M T (2013) Validating the interpretations and uses of test scores Journal of
Kane, M T (2016) Validation Strategies: Delineating and Validating Proposed Interpretations and Uses of Test Scores In S Lane, M Raymond, & T M Haladyna (Eds.), Handbook of test development (Vol 2nd) New York, NY: Routledge
Kane, M T., Crooks, T., & Cohen, A D (1999) Validating measures of performance
Educational Measurement: Issues and Practice, 18(2), 5-17
Keith, R., & Alexander, P (1989) The psychology of reading Hillsdale, NJ:
Kelley, T L (1981) Interpretation of educational measurements University
Khalifa, H., & Docherty, C (2016) Investigating the impact of international assessment: A convergent parallel mixed methods approach In A.J Moeller, J Creswell, & N Saville (Eds.), Second language assessment and mixed methods research (pp.208-222) Cambridge: Cambridge University Press
Khalifa, H., & Weir, C J (2009) Examining reading: Research and practice in assessing second language reading Cambridge: Cambridge University Press
Khosravi, A (2019) Extensions and applications of the Rasch model Retrieved from HAL Open Science
Kim, J Y (2008) Development and validation of an ESL diagnostic reading-to-write test: An effect-driven approach (Doctoral dissertation) Retrieved from ProQuest
Knutson, E M (1997) Reading with a purpose: Communicative reading tasks for the foreign language classroom Foreign Language Annals, 30(1): 49-57
Kong, J F (2016) Construct validity investigation from a process-focused perspective-A case study of L2 reading comprehension test (Doctoral dissertation) Shanghai International Studies University
Krieber, M., Bartl-Pokorny, K D., Pokorny, F B., Einspieler, C., Langmann, A., Körner, C., Falck-Ytter, T., & Marschik, P B (2016) The relation between reading skills and eye movement patterns in adolescent readers: evidence from a regular orthography PLoS One, 11(1), e0145934
Krstić, K., Šoškić, A., Ković, V., & Holmqvist, K (2018) All good readers are the same, but every low-skilled reader is different: an eye-tracking study of the PISA data European Journal of Psychology of Education, 33: 543-558
Kunnan, A J (2010) Test fairness and Toulmin’s argument structure Language
Lado, R (1961) Language testing: The construction and use of foreign language tests London: Longman
Langer, J A (1990) The process of understanding reading for literary and informative purposes Research in the Teaching of English, 24(3), 229-260
Lee, Y J., & Greene, J C (2007) The predictive validity of an ESL placement test: A mixed methods approach Journal of Mixed Methods Research, 1, 366-389
Leighton, J P (2004) Avoiding misconception, misuse, and missed opportunities: The collection of verbal reports in educational achievement testing Educational
Leow, R P (1998) Toward operationalizing the process of attention in SLA: Evidence for Tomlin and Villa’s (1994) fine-grained analysis of attention
Leow, R P (2000) A study of the role of awareness in foreign language behavior: Aware versus unaware learners Studies in Second Language Acquisition, 22(4), 557-584
Leow, R P (2001) Do learners notice enhanced forms while interacting with the L2?:
An online and offline study of the role of written input enhancement in L2 reading Hispania, 84(3), 496-509
Leow, R P., Hsieh, H C., & Moreno, N (2008) Attention to form and meaning revisited Language Learning, 58(3), 665-695
Leow, R P., Grey, S., Marijuan, S., & Moorman, C (2014) Concurrent data elicitation procedures, processes, and the early stages of L2 learning: A critical overview Second Language Research, 30(2), 111-127
Leow, R P., & Morgan-Short, K (2004) To think aloud or not to think aloud: The issue of reactivity in SLA research methodology Studies in Second Language Acquisition, 26(1), 35-57
Li, Z (2015) An argument-based validation study of the English placement test (EPT)— Focusing on the inferences of extrapolation and ramification
(Doctoral dissertation) Retrieved from ProQuest Dissertations & Theses database
Lim, G S (2009) Prompt and rater effects in second language writing performance assessment (Doctoral dissertation) Retrieved from ProQuest Dissertations & Theses database
Linacre, J M (1989) Many-facet Rasch measurement Chicago, IL: MESA Press
Linacre, J M (2017) Teaching Rasch measurement Rasch Measurement Transactions, 31(2), 1630-1631
Linacre, J M (2023) A user’s guide to WINSTEPS: Rasch-Model computer programs (Program Manual 5.4.1) Available from http://www.winsteps.com
Liu, M (2015) Justifying the validity of the listening-based retelling task in the National Matriculation English Test (Doctoral dissertation) Beijing Foreign Studies University
Lord, F M & Novick M R (1968) Statistical theories of mental test scores
Lumley, T (1993) The notion of sub-skills in reading comprehension tests: An EAP example Language Testing, 10(3), 211-234
Lumley, T (2005) Assessing second language writing: The rater’s perspective New York: Peter Lang
Lunzer, E., Waite, M., & Dolan, T (1979) Comprehension and comprehension tests
In E Lunzer, & K Gardner, (Eds.), The effective use of reading London:
Luo, K Z (2019) The four unification approaches to validation in language testing: Interpretation, evaluation and implication Foreign Language Education, 40(06), 76-81
Magliano, J P., Trabasso, T., & Graesser, A C (1999) Strategic processes during comprehension Journal of Educational Psychology, 91(4), 615-629
Manxia, D (2008) Content validity study on reading comprehension tests of NMET CELEA Journal, 31(4), 29-39
Martinková, P., Drabinová, A., Liaw, Y.-L., Sanders, E A., McFarland, J L., and Price, R M (2017) Checking equity: Why differential item functioning analysis should be a routine part of developing conceptual assessments CBE—Life Sciences Education, 16(2), rm2
Masters, G N (1982) A Rasch model for partial credit scoring Psychometrika, 47,
Matthews, M (1990) Skill taxonomies and problems for the testing of reading
McNamara, T F (1996) Measuring second language performance London: Longman
McNamara, T F (2003) Looking back, looking forward: Rethinking Bachman
Messick, S (1988) The once and future issues of validity: assessing the meaning and consequences of measurement In H Wainer and H I Braun (Eds.), Test validity Hillsdale, NJ: Lawrence Erlbaum
Messick, S (1989) Validity In R L Linn (Ed.), Educational Measurement (3rd ed., pp 13-103) New York: American Council on Education and Macmillan
Messick, S (1995) Validity of psychological assessment: Validation of inferences from persons’ responses and performances as scientific inquiry into score meaning American Psychologist, 50, 741-749
Messick, S (1996) Validity and washback in language testing Language Testing,
Miles, M B., & Huberman, A M (1994) Qualitative data analysis: An expanded sourcebook (2nd ed.) London: Sage Publications
Mislevy, R J., Steinberg, L S., & Almond, R G (2003) Focus article: On the structure of educational assessments Measurement: Interdisciplinary Research and Perspectives, 1(1), 3-62
Morgan-Short, K., Heil, J., Botero-Moriarty, A., & Ebert, S (2012) Allocation of attention to second language form and meaning: Issues of think-alouds and depth of processing Studies in Second Language Acquisition, 34(4), 659-685
Morse, J M (2003) Principles of mixed methods and multimethod research design
Handbook of mixed methods in social and behavioral research, (pp 189-208)
Munby, J (1978) Communicative syllabus design Cambridge: Cambridge University Press
Nassaji, H (2003) Higher-level and lower-level text processing skills in advanced ESL reading comprehension The Modern Language Journal, 87(2), 261-276
Neurath, M., & Cohen, R S (2012) Empiricism and sociology Dordrecht: D Reidel Publishing Company
Nevo, N (1989) Test-taking strategies on a multiple-choice test of reading comprehension Language Testing, 6(2) 199-215
Newton, P E., & Shaw, S D (2014) Validity in educational & psychological assessment London, UK: Sage
Nguyen, T P T (2018) An investigation into the content validity of a Vietnamese Standardized Test of English Proficiency (VSTEP.3-5) reading test VNU Journal of Foreign Studies, 34(4), 129-143
Nguyen, T Q Y (2017) Building a validity argument for the Vietnamese Standardized Test of English Proficiency (VSTEP.3-5) Graduate Research Symposium (GRS), 705-712
Nguyen, T Q Y (2018) An investigation into the cut-score validity of the VSTEP.3-5 listening test (Doctoral dissertation) Hanoi: University of Languages and International Studies, Vietnam National University, Hanoi
Nguyen, T N Q., Nguyen, T Q Y., Tran, T T H., Nguyen, T P T., Bui T S., Nguyen, T C., & Nguyen, Q H (2020) The effectiveness of VSTEP.3-5 speaking rater training VNU Journal of Foreign Studies, 36(4), 99-112
Nisbett, R E., & Wilson, T D (1977) Telling more than we can know: Verbal reports on mental processes Psychological Review, 84(3), 231-259
O’Sullivan, B., & Weir, C (2011) Language testing and validation In B O’Sullivan (Ed.), Language testing: Theory and practice (pp 13-32) Oxford: Palgrave
Organization for Economic Cooperation and Development (2016) PISA 2018 draft analytical frameworks [EB/OL]
Olshavsky, J E (1977) Reading as problem solving: an investigation of strategies
Olson, G., Duffy, S A., & Mack, R L (1984) Thinking-out-loud as a method for studying real-time comprehension processes In D E Kieras, & M A Just (Eds.),
New methods in reading comprehension research (pp 253-286) Mahwah:
Oranje, A., Gorin, J., Jia, Y., & Kerr, D (2017) Collecting, analyzing, and interpreting response time, eye tracking and log data In K Ercikan, & J.W Pellegrino (Eds.), Validation of score meaning for the next generation of assessments: the use of response processes (pp 39-51) New York: Routledge
Ozuru, Y., Best, R., Bell, C., Witherspoon, A., & McNamara, D S (2007) Influence of question format and text availability on the assessment of expository text comprehension Cognition and Instruction, 25(4), 399-438
Pae, H K., Greenberg, D & Morris, R D (2012) Construct validity and measurement invariance of the Peabody Picture Vocabulary Test-III Form A
Paulson, E J., & Henry, J (2002) Does the degrees of reading power assessment reflect the reading process? An eye-movement examination Journal of Adolescent and Adult Literacy, 46(3), 234-244
Perkins, K (1992) The effect of passage and topical structure types on ESL reading comprehension difficulty Language Testing, 9(2), 163-172
Phakiti, A (2003) A closer look at the relationship of cognitive and metacognitive strategy use to EFL reading achievement test performance Language Testing, 20(1), 26-56
Pishghadam, R., Baghaei, P., & Seyednozadi, Z (2017) Introducing emotioncy as a potential source of test bias: A mixed Rasch modeling study International Journal of Testing, 17, 127-140
Pishghadam, R., Baghaei, P., Shams, M.A., & Shamsaee, S (2011) Construction and validation of a narrative intelligence scale with the Rasch rating scale model The
International Journal of Educational and Psychological Assessment, 8, 75-90
Plakans, L (2009) The role of reading strategies in integrated L2 writing tasks
Journal of English for Academic Purposes, 8(4), 252-266
Plakans, L., & Gebril, A (2012) A close investigation into source use in integrated second language writing tasks Assessing Writing 17(1), 18-34
Pollitt, A., Hutchinson, C., Entwistle, N., & DeLuca, C (1985) What makes exam questions difficult? An analysis of ‘O’ grade questions and answers Edinburgh:
Pressley M., & Afflerbach, P (1995) Verbal protocols of reading: The nature of constructively responsive reading Hillsdale: Lawrence Erlbaum Associates
Qi, D S., & Lapkin, S (2001) Exploring the role of noticing in a three-stage second language writing task Journal of Second Language Writing, 10(4), 277-303
Rasch, G (1960) Probabilistic models for some intelligence and attainment tests Chicago: The University of Chicago Press
Rayner, K (1998) Eye movements in reading and information processing: 20 years of research Psychological Bulletin, 124(3), 372-422
Rayner, K (2009) Eye movements and attention in reading, scene perception, and visual search Quarterly Journal of Experimental Psychology, 62(8), 457-506
Rayner, K., Foorman, B R., Perfetti, C A., & Pesetsky, D (2001) How psychological science informs the teaching of reading Psychological Science in the Public Interest, 2(2), 31-74
Rayner, K., & Pollatsek, A (1989) The psychology of reading Boston: Prentice Hall
Rayner, K., Pollatsek, A., Ashby, J., & Clifton, C (2012) The psychology of reading (2nd ed.) New York: Psychology Press
Rayner, K., Slattery, T J., & Bélanger, N N (2010) Eye movements, the perceptual span, and reading speed Psychonomic Bulletin & Review, 17(6), 834-839
Riazi, A M (2016) Comparing writing performance in TOEFL-iBT and academic assignments: An exploration of textual features Assessing Writing, 28, 15-27
Rosa, E M., & Leow, R P (2004) Awareness, different learning conditions, and L2 development Applied Psycholinguistics, 25(2), 269-292
Rosenshine, B V (1980) Skills hierarchies in reading comprehension In R J Spiro et al (Eds.), Theoretical issues in reading comprehension (pp 535-554) Hillsdale, NJ: Erlbaum
Rost, D (1993) Assessing the different components of reading comprehension: fact or fiction? Language Testing, 10(1), 79-92
Rost, J (1990) Rasch models in latent classes: An integration of two approaches to item analysis Applied Psychological Measurement, 14, 271-282
Rost, J., & von Davier, M (1995) Mixture distribution Rasch models In G H Fischer, & I W Molanaar (Eds.), Rasch models: Foundations, recent developments and applications (pp 257- 268) New York: Springer Verlag
Rumelhart, D E (1977) Towards an interactive model of reading In S Dornic (Ed.),
Attention and performance VI Hillsdale, NJ: Lawrence Erlbaum Associates
Rumelhart, D E (1980) Schemata: the building blocks of cognition In R J Spiro et al (Eds.), Theoretical issues in reading comprehension Hillsdale, NJ:
Rupp, A A., Ferne, T., & Choi, H (2006) How assessing reading comprehension with multiple-choice questions shapes the construct: a cognitive processing perspective Language Testing, 23(4), 441-474
Sandelowski, M (2003) Tables or tableaux? The challenges of writing and reading mixed methods studies Handbook of Mixed Methods in Social and Behavioral
Research (pp 321-350) Thousand Oaks, CA: Sage
Schmitt, T A (2011) Current methodological considerations in exploratory and confirmatory factor analysis Journal of Psychoeducational Assessment, 29,
Seagull, F J., Wickens, C D., & Loeb, R G (2001) When is less more? Attention and workload in auditory, visual, and redundant patient-monitoring conditions
Human Factors and Ergonomics Society Annual Meeting Proceedings, 45(18),
Shepard, L A (1993) Evaluating test validity Review of Research in Education, 19,
Smith, E (2005) Effect of item redundancy on Rasch item and person estimates
Smith, F (1994) Understanding reading Cambridge: Cambridge University Press
Smith, F., & Goodman, K S (1971) On the psycholinguistic method of teaching reading Elementary School Journal, 71, 177-181
Smith Jr., E V (2004) Evidence for the reliability of measures and validity of measure interpretation: a Rasch measurement perspective In E V Smith Jr., &
R M Smith (Eds.), Introduction to Rasch measurement: Theory, models and applications, (pp 93-122) JAM Press
Smith, M L (2006) Multiple methodology in education research Handbook of complementary methods in education research (pp 457-475) New Jersey:
Smith, R M., Schumacker, R E., & Bush, M J (1998) Using item mean squares to evaluate fit to the Rasch model Journal of Outcome Measurement, 2, 66-78
Solheim, O J., & Uppstad, P H (2011) Eye-tracking as a tool in process-oriented reading test validation International Electronic Journal of Elementary Education,
Someren, M W V., Barnard, Y F., & Sandberg, J A (1994) The think aloud method: A practical guide to modeling cognitive processes London: Academic Press
Spearritt, D (1972) Identification of subskills of reading comprehension by maximum likelihood factor analysis Reading Research Quarterly, 8, 92-111
Spivey, M., Richardson, D., & Dale, R (2009) The movement of eye and hand as a window into language and cognition In E Morsella & J Bargh (Eds.), Oxford handbook of human action (pp 225-248) New York: Oxford University Press
Stanovich, K E (1980) Toward an interactive-compensatory model of individual differences in the development of reading fluency Reading Research Quarterly,
Wind, S., & Hua, C (2022) Rasch measurement theory analysis in R: Illustrations and practical guidance for researchers and practitioners London,
Sternberg, R J (1981) Intelligence as thinking and learning skills Educational Leadership, 39(1), 18-20
Sun, Y (2016) Context, construct, and consequences: Washback of the college English test in China Retrieved from ProQuest Dissertations & Theses database
Szudarski, P (2018) Corpus Linguistics for Vocabulary: A Guide for Research Routledge Corpus Linguistics Guides New York: Routledge
Tabatabaee-Yazdi, M., Motallebzadeh, K., Ashraf, H., & Baghaei, P (2018) Development and validation of a teacher success questionnaire using the Rasch model International Journal of Instruction, 11, 129-144
Tai, R H., Loehr, J F., & Brigham, F J (2006) An exploration of the use of eye- gaze tracking to study problem-solving on standardized science assessments
International Journal of Research and Method in Education, 29(2), 185-208
Tashakkori, A., & Teddlie, C (2003) Handbook of mixed methods in social and behavioral research Thousand Oaks, CA: Sage
Tashakkori, A., & Teddlie, C (2006) Validity issues in mixed methods research: Calling for an integrative framework Paper presented at the annual meeting of the American Educational Research Association, San Francisco
Teddlie, C., & Tashakkori, A (2008) Foundations of mixed methods research— Integrating quantitative and qualitative approaches in the social and behavioral sciences Thousand Oaks, CA: Sage
Thomas, R M (2003) Blending qualitative and quantitative research methods in theses and dissertations New York: Corwin Press
Thorndike, E L (1917) Reading as reasoning: A study of mistakes in paragraph reading Journal of Educational Psychology, 8(6), 323-332
Toulmin, S E (1958) The uses of argument Cambridge, UK: Cambridge University Press
Toulmin, S E (2003) The uses of argument (2nd ed.) Cambridge, UK: Cambridge University Press
Trabasso, T., & Magliano, J P (1996) Conscious understanding during comprehension Discourse Processes, 21(3), 255-287
Upton, T., & Lee-Thompson, L (2001) The role of the first language in second language reading Studies in Second Language Acquisition, 23(4), 469-495
Urquhart, S., & Weir, C J (1998) Reading in a second language: Process, product and practice Harlow: Longman
Valette, R M (1977) Modern language testing (2nd ed.) New York: Harcourt Brace Jovanovich
Vidal, E (2010) Individual differences for Self-regulating task-oriented reading activities Journal of Educational Psychology, 102(4), 817-826
Vo, N H (2021) Constructing a validity argument for a locally developed test of English reading proficiency (Doctoral dissertation) Queensland University of
Voss, E (2012) A validity argument for score meaning of a computer-based ESL academic collocational ability test based on a corpus-driven approach to test design (Doctoral dissertation) Iowa State University Ames, IA
Waern, Y (1980) Thinking aloud during reading: A descriptive model and its application Scandinavian Journal of Psychology, 21(1), 123-132
Wallace, C (1992) Reading Oxford: Oxford University Press
Wang, H X (2008) A systems approach to the reform of College English Testing— Report on the “Survey of College English Testing Reform” Foreign Languages in China, 4, 4-12
Wang, H (2010) Investigating the justifiability of an additional use: An application of assessment use argument to an English as a foreign language test (Doctoral dissertation) Retrieved from ProQuest Dissertations & Theses database
Weakly, S (1993) Procedures in the content validation of an EAP proficiency test of reading comprehension (Unpublished MA thesis) University of Reading
Wei, Y (2021) A comparative study on the validity of VSTEP.3-5 and PETS-5 reading tests International Graduate Research Symposium (IGRS), 40-51
Wei, Y., & Nguyen, D H (2021) A validity study on the Vietnamese Standardized Test of English Proficiency (VSTEP.3-5): From test-takers’ perspectives Journal of China Examinations, 10, 67-73
Weigle, S C (2002) Assessing writing Cambridge, UK: Cambridge University Press
Weir, C J (1988) The specification, realization and validation of an English language proficiency test In A Hughes (Ed.), Testing English for university study, ELT documents, 127, 45-110
Weir, C J (1993) Understanding and Developing Language Tests New York: Prentice Hall
Weir, C J (2005) Language testing and validation: An evidence-based approach
Weir, C J., Hawkey, R., Green, T., & Devi, S (2009) The cognitive processes underlying the academic reading construct as measured by IELTS British Council/IDP Australia IELTS Research Reports, 9(4), 157-189
Weir, C J., Hughes, A., & Porter, D (1990) Reading skills: Hierarchies, implicational relationships and identifiability Reading in a Foreign Language, 7(1), 505-510
Weir, C J., & Porter, D (1994) The multi-divisible or unitary nature of reading: The language tester between Scylla and Charybdis Reading in a Foreign Language,
Weir, C J., Yang, H Z., & Jin, Y (2000) An empirical investigation of the componentiality of L2 reading in English for academic purposes Cambridge:
Whitney, P., Ritchie, B G., & Clark, M B (1991) Working-memory capacity and the use of elaborative inferences in text comprehension Discourse Processes, 14(2), 133-145
Grabe, W., & Stoller, F L (2005) Teaching and researching: Reading Beijing: Foreign Language Teaching and Research Press
Williams, E., & Moran, C (1989) Reading in a foreign language at intermediate and advanced levels with particular reference to English Language Teaching, 22(4),
Wilson, M (2005) Constructing measures: An item response modeling approach
Wright, B D & Linacre, J M (1994) Reasonable mean-square fit values Rasch Measurement Transactions, 8(3): 370
Wright, B D & Mok, M M C (2004) An overview of the family of Rasch measurement models In E V Smith & R M Smith (Eds.) Introduction to Rasch measurement (pp 1-24) JAM Press
Xi, X (2008) Methods of test validation In E Shohamy & N H Hornberger (Eds.)
Encyclopedia of language and education (2nd ed., pp 177-196) New York, NY:
Yan, Z (2010) Objective measurement in psychological science: An overview of Rasch Model Advances in Psychological Science, 18(8), 1298-1305
Yang, T K (2005) Measurement of Korean EFL college students’ foreign language classroom speaking anxiety: Evidence of psychometric properties and accuracy of A Computerized Adaptive Test (CAT) with dichotomously scored items using a CAT simulation (Doctoral dissertation) Austin: The University of Texas at Austin
Control file for WINSTEPS
Title= "C:\Users\DELL\Desktop\VSTEP.3-5 reading test.xlsx"
; Excel file created or last modified: 2023/3/6 21:36:15
ITEM1 = 1 ; Starting column of item responses
NAME1 = 42 ; Starting column for person label in data record
NAMLEN = 8 ; Length of person label
XWIDE = 1 ; Matches the widest data value observed
TOTALSCORE = Yes ; Include extreme responses in reported scores
; Person Label variables: columns in label: columns in line
&END ; Item labels follow: columns in label
i1 ; Item 1 : 1-1
i2 ; Item 2 : 2-2
i3 ; Item 3 : 3-3
i4 ; Item 4 : 4-4
i5 ; Item 5 : 5-5
i6 ; Item 6 : 6-6
i7 ; Item 7 : 7-7
i8 ; Item 8 : 8-8
i9 ; Item 9 : 9-9
i10 ; Item 10 : 10-10
i11 ; Item 11 : 11-11
i12 ; Item 12 : 12-12
i13 ; Item 13 : 13-13
i14 ; Item 14 : 14-14
i15 ; Item 15 : 15-15
i16 ; Item 16 : 16-16
i17 ; Item 17 : 17-17
i18 ; Item 18 : 18-18
i19 ; Item 19 : 19-19
i20 ; Item 20 : 20-20
i21 ; Item 21 : 21-21
i22 ; Item 22 : 22-22
i23 ; Item 23 : 23-23
i24 ; Item 24 : 24-24
i25 ; Item 25 : 25-25
i26 ; Item 26 : 26-26
i27 ; Item 27 : 27-27
i28 ; Item 28 : 28-28
i29 ; Item 29 : 29-29
i30 ; Item 30 : 30-30
i31 ; Item 31 : 31-31
i32 ; Item 32 : 32-32
i33 ; Item 33 : 33-33
i34 ; Item 34 : 34-34
i35 ; Item 35 : 35-35
i36 ; Item 36 : 36-36
i37 ; Item 37 : 37-37
i38 ; Item 38 : 38-38
i39 ; Item 39 : 39-39
i40 ; Item 40 : 40-40
END LABELS
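The fixed-column data layout declared in the control file (ITEM1=1, XWIDE=1, NAME1=42, NAMLEN=8) can be illustrated with a short parsing sketch. This is a minimal, hypothetical example written for this appendix, not part of WINSTEPS itself; the `parse_record` helper and the sample record are invented for illustration only.

```python
# Minimal sketch of the data layout implied by the control file:
# 40 one-character item responses starting at column 1 (ITEM1=1, XWIDE=1)
# and an 8-character person label starting at column 42 (NAME1=42, NAMLEN=8).
# The helper and the sample record are hypothetical, for illustration only.

def parse_record(line: str, n_items: int = 40,
                 item1: int = 1, name1: int = 42, namlen: int = 8):
    """Split one fixed-column data record into (person_label, responses)."""
    responses = line[item1 - 1 : item1 - 1 + n_items]
    label = line[name1 - 1 : name1 - 1 + namlen].strip()
    return label, [int(c) for c in responses]  # dichotomously scored items

# A made-up record: 40 scored responses, one spacer column, then the label
record = "1011011101" * 4 + " P000001"
label, resp = parse_record(record)
print(label, sum(resp))  # person id and raw total score over the 40 items
```

Each data record below the control file would be read this way, with the raw total score over the 40 responses feeding the Rasch estimation.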
Person measures of the investigated VSTEP.3-5 reading test
(Table columns: Person, Total score, Rasch measure; the table body is not reproduced in this excerpt.)
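For context, the Rasch person measures reported in this table come from the dichotomous Rasch model, in which the probability of a correct response depends only on the difference between person ability and item difficulty (in logits). The sketch below illustrates the model with made-up item difficulties; the values are not taken from the VSTEP.3-5 analysis.

```python
import math

def rasch_p(theta: float, b: float) -> float:
    """P(correct) under the dichotomous Rasch model: exp(theta-b) / (1 + exp(theta-b))."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))

def expected_total(theta: float, difficulties) -> float:
    """Expected raw score for a person of ability theta: sum of item probabilities."""
    return sum(rasch_p(theta, b) for b in difficulties)

# Illustrative item difficulties in logits (not the actual VSTEP.3-5 items)
difficulties = [-1.5, -0.5, 0.0, 0.5, 1.5]
print(round(expected_total(0.0, difficulties), 2))  # prints 2.5
```

Under the Rasch model the raw total score is a sufficient statistic for the person measure, which is why each total score in the table corresponds to a single logit measure.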
Expert interview
Thank you very much for participating in this interview. I will ask you some questions about your views on the modified framework of task characteristics for the VSTEP.3-5 reading test and on the analyses of the current study, which cover several indicators under the characteristics of the input and the characteristics of the expected response based on that modified framework. Your answers will help us improve the VSTEP.3-5 reading test. The interview will be recorded, and the content will be used for research only. Thank you for your cooperation.
Have you read the task characteristics framework (Bachman and Palmer, 1996), the modified framework of task characteristics for the VSTEP.3-5 reading test, the investigated VSTEP.3-5 reading test paper, and the VSTEP.3-5 reading test specifications? If yes, please answer the following questions.
1 To what extent do you think the modified framework of task characteristics contains enough aspects for investigating the test tasks of the VSTEP.3-5 reading test?
2 To what extent do you think the modified framework of task characteristics for VSTEP.3-5 reading test is reasonable in the design of the input?
3 To what extent do you think the modified framework of task characteristics for the VSTEP.3-5 reading test is reasonable in the design of the expected response?
1 To what extent do you approve of the analyses of the investigated VSTEP.3-5 reading test in terms of text length, vocabulary, grammar, text domain, text topic and text level in the current study?
2 To what extent do you approve of the analyses of the investigated VSTEP.3-5 reading test in terms of response type in the current study?
Expert judgement form A
Professional title: Years of teaching: Research area:
Part II Categorization of the reading subskills needed in the reading process for the VSTEP.3-5 reading test
Please refer to the VSTEP.3-5 reading test specifications and Khalifa and Weir’s (2009) cognitive model of reading, and then categorize the reading subskills needed in the reading process for the VSTEP.3-5 reading test in the following table.
Reading subskills and descriptors of the VSTEP.3-5 reading test
Expert judgement form B
Please read the passages and items of the reading test, browse the reading subskills listed in Table 1, and then judge the items in Table 2
1) For each item, you can select one or more subskills
2) If you think an item examines one subskill, please mark “1” below that subskill. If you recognize that two or more subskills are investigated by the item, please select the main subskills that have the greatest impact on the item: mark “1” for the primary subskill, and mark “2” for the other subskill(s) to represent secondary subskill(s).
3) If you think an item examines subskills not listed in Table 1, please describe them in the “Other subskills” column.
Table 1 Results on reading subskills and descriptors of the VSTEP.3-5 reading test
-Can understand specific information that is explicitly stated in the passage, using simple grammatical structures and vocabulary
-Can scan for specific information in a paragraph/ some paragraphs
-Can identify and understand paraphrased information explicitly stated in the passage
B Understanding word meanings in context
-Can recognize word meanings in context (words with different meanings)
-Can recognize word meanings in context (idiomatic expressions/words with highly colloquial expressions)
-Can identify the antecedent of a pronoun
-Can understand the relationship between sentences or ideas using connective devices such as discourse markers, anaphoric and cataphoric references, substitutions, repetitions
-Can infer complex references in the passage (with confounding factors)
-Can identify and understand an implicit detail that is rewritten using different words
-Can identify the implication of a sentence/a detail
-Can locate and integrate information across a paragraph
-Can locate and integrate information across the passage
-Can summarize the main ideas of a paragraph or the passage
-Can identify supporting information for an argument or the main idea of a paragraph or the passage
-Can understand significant points that are stated and relevant to the main ideas
G Understanding explicit/implicit author’s opinion/attitude
-Can understand the explicit attitude/opinion of author
-Can understand the implicit stance/opinion/intention of author in the passage
-Can understand the general tone of the passage
-Can identify the purpose and function of the passage
-Can identify the organizational structure of the passage
-Can identify the genre of the passage
Table 2 Correspondence between subskills and each item of the VSTEP.3-5 reading test
Item A B C D E F G H Other subskills
VSTEP.3-5 reading test think aloud protocols experiment operation instructions
Thank you very much for your participation in this study. The purpose of this study is to investigate the validity of the VSTEP.3-5 reading test by exploring the thinking processes of test takers as they complete reading tasks, that is, what subskills they use in reading and in answering questions. This study adopts think-aloud protocols, which require participants to speak out all of the ideas and information in their minds while completing certain tasks. In this experiment, you are expected to do the reading test while speaking out your thinking process, that is, to “think aloud”. You do not need to plan or organize your language carefully; you can simply act as if you are talking to yourself. You may give your think-aloud report in Chinese or English. The experiment is expected to take no more than two hours. The experiment will be recorded; the recorded content is for research use only, and your personal information will be kept strictly confidential.
1 First of all, I will guide you to do two exercises of thinking aloud to help you get familiar with the reporting method of thinking aloud
2 After completing the exercise, you will read four passages and answer 40 multiple-choice questions
3 Please say all of your thoughts when reading the passages and answering the questions. The oral report should cover not only the reasons and subskills behind choosing an option but also the whole process of answering the questions. During the reporting process, if you pause in your speech for a long time, you may be prompted to continue speaking.
If you’re thinking in English, just speak English; if you are thinking in Chinese, just speak Chinese
1 First, let’s do a word scramble. I will give you some letters, and you will reorder these letters to form an English word; all the letters should be used. For example, the letters o-d-o-r can be reordered to form the word ‘door’. These letters are c-l-k-u. Please say all your thoughts.
2 Please read the following passage and answer the questions You can start the think aloud protocols now
These laws are universal in their application, regardless of cultural beliefs, geography, or climate If pots have no bottoms or have large openings in their sides, they could hardly be considered containers in any traditional sense Since the laws of physics, not some arbitrary decision, have determined the general form of applied-art objects, they follow basic patterns, so much so that functional forms can vary only within certain limits
The underlined word 'they' in the passage refers to
3. Do you have any other questions? If not, let's start the formal experiment.
During the think-aloud protocol, if a participant falls silent, prompt them with the following reminders:
1. Can you tell me what you are thinking as you answer this question?
2. What else are you thinking?
Reading processing checklist
In order to answer these test items, I used the following reading subskills…
(Choose only one main subskill for each item. If you think an item examines a subskill not listed, please explain it in the 'Other' column.)
B Understanding word meanings in context
G Understanding explicit/implicit author’s opinion/attitude
Interpretive argument for the VSTEP.3-5 reading test
Columns: Inference / Warrant / Assumption / Backing evidence / Potential rebuttal

Warrant 1: Observed scores are estimates of expected scores over the relevant parallel versions of tasks and test forms.

Assumptions:
1. The configuration of tasks on the VSTEP.3-5 reading measures is appropriate for the intended interpretation.
2. The test items are designed with the predetermined difficulty described in the test specifications.
3. Test takers' genders and majors do not affect their reading scores.
4. The reliability of the VSTEP.3-5 reading test is within a good range.

Backing evidence:
1. Findings on the characteristics of the VSTEP.3-5 reading test tasks.
2. Findings from the statistical data on the items of the VSTEP.3-5 reading test.
3. Findings on DIF across test takers' genders and majors.
4. Findings from the data on the test reliability of the VSTEP.3-5 reading test.

Potential rebuttals:
1. The test tasks are beyond the scope specified in the test specifications.
2. The test items are not properly designed in accordance with the test specifications.
3. The VSTEP.3-5 reading test exhibits DIF by gender or major.
4. The test reliability is lower than expected.

Warrant 2: Test takers' scores on the VSTEP.3-5 reading test can be attributed to the construct of English reading proficiency.

Assumptions:
1. The reading processes engaged by test takers vary according to theoretical expectations.
2. Successful and unsuccessful test takers differ in their eye movements when processing the different types of VSTEP.3-5 reading test items.

Backing evidence:
1. Findings from the comparison between the reading processes that experts judged the test items to require and the reading processes that the test takers actually engaged in.
2. Findings from the analysis of eye movements of successful and unsuccessful test takers in processing the different types of VSTEP.3-5 reading test items.

Potential rebuttals:
1. Test takers might employ test-wise strategies to answer the test questions.
2. Successful and unsuccessful test takers do not differ in their eye movements when processing the different types of VSTEP.3-5 reading test items.
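The reliability assumption above (assumption 4 under Warrant 1) is typically checked with an internal-consistency index such as Cronbach's alpha. A minimal sketch of the computation on hypothetical dichotomous (0/1) item responses; the data and function names here are illustrative only, not the study's actual results:

```python
def variance(xs):
    # Sample variance (n - 1 denominator)
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)

def cronbach_alpha(scores):
    """scores: one list of 0/1 item scores per test taker."""
    k = len(scores[0])                         # number of items
    item_cols = list(zip(*scores))             # transpose to per-item columns
    totals = [sum(person) for person in scores]
    sum_item_var = sum(variance(col) for col in item_cols)
    return (k / (k - 1)) * (1 - sum_item_var / variance(totals))

# Hypothetical responses of five test takers on four multiple-choice items
data = [
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 1, 1],
    [0, 0, 0, 0],
    [1, 1, 1, 1],
]
print(round(cronbach_alpha(data), 3))  # 0.696
```

For the real 40-item test, the same formula would be applied to the full response matrix; the DIF assumption (assumption 3) would instead call for a method such as Mantel-Haenszel or logistic regression across gender and major subgroups.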