TABLE OF CONTENTS
APPENDIX A: EPT W RATING SCALE
DECISION ON THE ASSIGNMENT OF THE THESIS TOPIC (Copy)

LIST OF FIGURES
Figure 2.1: An illustration of inferences in the interpretative argument (adapted from Chapelle et al., 2008)
Figure 2.2: Bridges that represent inferences linking components in performance assessment (adapted from Kane et al., 1999)
Figure 2.3: Evidence to build the validity argument for the test (p. 15)
Figure 3.1: Participants (p. 24)
Figure 5.1: Generalization inference in the validity argument for the PT W with assumptions and backing (p. 40)
Figure 5.2: The explanation inference in the validity argument for the PT test with assumption and backing (p. 42)

LIST OF TABLES
Table 1.1: The structure of the EPT W
Table 2.1: Summary of the inferences and warrants in the TOEFL validity argument with their underlying assumptions (Chapelle et al., 2010, p. 7) (p. 12)
Table 2.2: A framework of sub-skills in academic writing (McNamara, 1991) (p. 19)
Table 3.1: Texts and word counts in the two levels of the EPT sub-corpora (p. 30)
Table 4.1: Variance components attributed to test scores (p. 32)
Table 4.2: Dependability estimate (p. 37)
Table 4.3: Distribution of vocabulary across proficiency levels (p. 38)

ACKNOWLEDGMENTS

I would like to express my sincere appreciation to my supervisor, Dr. Vo Thanh Son Ca, who inspired me and gave me her devoted instruction throughout the project. From the outset my research topic was quite broad, but with her support and guidance I learned how to combine theory and practice. Thanks to her instruction and her willingness to motivate and guide me with many questions and comments, I came to appreciate the important role of research in language testing and assessment. More than that, I welcomed her dedicated and detailed support through her quick feedback and comments on my drafts; she observed every step of my work thoroughly and helped me make significant improvements. Without my supervisor's dedicated support, the research would not have been completed. I would also like to take this chance to thank my family and friends, who always take care of, assist, and encourage me. I could not have completed my dissertation without the support of all these marvelous people.

ABSTRACT

Foreign or second language writing is one of the important skills in language learning and teaching, and universities use scores from writing assessments to make decisions on placing students in language support courses. For the inferences based on test scores to be valid, it is important to build a validity argument for the test. This study built a validity argument for the English Placement Writing test (EPT W) at BTEC International College Danang Campus. In particular, the study examined two inferences, generalization and explanation, by investigating the extent to which tasks and raters contributed to test score variability, how many raters and tasks need to be involved in the assessment to obtain a test score dependability of at least .85, and the extent to which vocabulary distributions differed across proficiency levels of academic writing. To achieve these goals, test score data from 21 students who took two writing tasks were analyzed using generalizability theory. Decision studies (D-studies) were employed to investigate the number of tasks and raters needed to obtain a dependability of 0.85. The 42 written responses from the 21 students were then analyzed to examine vocabulary distributions across proficiency levels. The results suggested that tasks were the main source of test score variability, whereas raters contributed to score variance in a more limited way. To obtain a dependability of 0.85, the test would need to include 14 raters and 10 tasks, or 10 raters and 12 tasks. In terms of vocabulary distributions, low-level students produced less varied language than higher-level students. The findings suggest that higher
proficiency learners produce a wider range of word families than their lower-proficiency counterparts.

CHAPTER 1 INTRODUCTION

This chapter introduces test validity and states the purpose of this thesis. The chapter concludes with the significance of the study.

1.1 INTRODUCTION TO TEST VALIDITY

Language tests are needed to measure students' English ability in college settings. Among the most commonly developed are entrance or placement tests, which are used to place students into appropriate language courses; the use of their scores therefore plays a very important role. The placement test at BTEC International College serves as the example for this study and for building a validity argument for further research purposes. The test has certain impacts on students, administrators, and instructors at BTEC International College Da Nang Campus. First, the test score helps students know whether they are ready for collegiate courses taught in English. Second, it helps administrators of English programs place students into class levels appropriate to their English language use. Information on students' ability would also help instructors with their instruction and lesson planning. Moreover, students would come to value the importance of language use ability for their success in college and pay more attention to improving their academic skills. Given the important role of entrance tests, test validity is the focus of this study. Test validity is the extent to which a test accurately measures what it is supposed to measure; validity refers to the interpretations of test scores entailed by proposed uses of tests, as supported by evidence and theory (American Educational Research Association, American Psychological Association, & National Council on Measurement in Education, 1999). In other words, validation is a process in which test developers and/or test users gather evidence to provide "a sound scientific basis" for
interpreting test scores. Validity researchers emphasize the quality, rather than the quantity, of evidence supporting a validity interpretation. Such evidence falls into four categories: evidence based on test content, evidence based on response processes, evidence based on relations to other variables, and evidence based on the consequences of testing (American Educational Research Association, American Psychological Association, & National Council on Measurement in Education, 1999). Providing these four categories of evidence requires different kinds of research studies: the Achieve alignment method is used for evidence based on test content (Rothman, Slattery, Vranek, & Resnick, 2002); evidence based on response processes is provided with the help of the cognitive interview method (Willis, 2005; Miller, Chepp, Wilson, & Padilla, 2013); the predictive method provides evidence based on relations to other variables; and evidence based on the consequences of testing can be backed up with argument-based approaches built around a test's interpretative argument and validity argument (Kane, 2006).

1.2 THE STUDY

BTEC International College - FPT University administers its placement test (PT) every semester to incoming students to measure their English proficiency for university studies. The test is composed of four skills: reading, listening, speaking, and writing; only the writing skill is the focus of this study. This study developed a validity argument for the English Placement Writing test (EPT W) at BTEC International College - FPT University. Developed and first administered in Summer 2019, the EPT W is intended to measure the writing skills test takers need for success in academic contexts. (See Table 1.1 for the structure of the EPT W.)
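The validity argument for the EPT W relies on generalizability (G-theory) analyses and decision studies to ask how many tasks and raters are needed to reach a given dependability. As a rough sketch of how a decision study turns estimated variance components into a dependability (Phi) coefficient for a fully crossed persons x tasks x raters design, consider the following; the variance components here are hypothetical placeholders, not the thesis's actual G-study estimates:

```python
# Illustrative decision study (D-study) for a fully crossed
# persons x tasks x raters design, as used in generalizability theory.
# The variance components below are HYPOTHETICAL placeholders,
# not the thesis's actual estimates.

def dependability(var, n_tasks, n_raters):
    """Phi coefficient (dependability for absolute decisions)."""
    error = (var["t"] / n_tasks
             + var["r"] / n_raters
             + var["pt"] / n_tasks
             + var["pr"] / n_raters
             + var["tr"] / (n_tasks * n_raters)
             + var["ptr"] / (n_tasks * n_raters))
    return var["p"] / (var["p"] + error)

# Hypothetical variance components: persons (p), tasks (t), raters (r),
# and their interactions (ptr is the residual term).
var = {"p": 0.9, "t": 0.299, "r": 0.05,
       "pt": 0.4, "pr": 0.1, "tr": 0.05, "ptr": 0.6}

for n_t, n_r in [(2, 2), (10, 14), (12, 10)]:
    print(f"tasks={n_t}, raters={n_r}, phi={dependability(var, n_t, n_r):.3f}")
```

Increasing the numbers of tasks and raters shrinks the error term in the denominator, which is why a D-study can trade off rater and task counts against a dependability target such as 0.85.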
Building a validity argument for this test is therefore very important, and it helps educators and researchers understand the consequences of assessment. In particular, the objectives of this study are to investigate: 1) the extent to which tasks and raters contributed to score variability; 2) how many tasks and raters need to be involved in the assessment to obtain a test score dependability of at least .85; and 3) the extent to which vocabulary distributions differ across proficiency levels of academic writing.

Table 1.1 The structure of the EPT W

Total test time: 30 minutes
Number of parts: 2

Part 1
Total time: 15 minutes
Task content: Write a paragraph using one tense on any familiar topic. For example: Write a paragraph (100-120 words) to describe an event you attended recently.

Part 2
Total time: 15 minutes
Task content: Write a paragraph using more than one tense on a topic that relates to publicity. For example: Write a paragraph (100-120 words) to describe a vacation trip from your childhood, using these clues: Where did you go? When did you go? Who did you go with? What did you do? What is the most memorable thing?
The EPT W uses a rating rubric to assess test takers' performance. The appropriateness of a response is judged against a list of criteria, such as task achievement, grammatical range and accuracy, lexical resource, and coherence and cohesion (see Appendix A).

1.3 SIGNIFICANCE OF THE STUDY

The results of the study should contribute theoretically to the field of language assessment. By providing evidence to support inferences based on EPT W scores, this study adds to the discussion of test validity in the context of academic writing. Practically, the findings on the quantity of tasks and raters needed to assess writing ability should provide a usable reference, and the analysis of the different kinds of language elicited, and of how they affect test score variability, should offer guidance on choosing an appropriate task for measuring academic writing.

CHAPTER 2 LITERATURE REVIEW

This chapter discusses previous studies on validity and introduces generalizability theory (G-theory), which served as the background for the data analyses.

2.1 STUDIES ON VALIDITY DISCUSSION

2.1.1 The conception of validity in language testing and assessment

What is validity?
The definition of validity in language testing and assessment can be traced through three main periods. First, Messick (1989) stated that "validity is an overall evaluative judgment of the degree to which empirical evidence and theoretical rationales support the adequacy and appropriateness of interpretations and actions based on test scores or other modes of assessment" (p. 13). Messick's view of validity was then supported and found official recognition in AERA, APA, and NCME (1985), which describes validity as follows:

The concept refers to the appropriateness, meaningfulness, and usefulness of the specific inferences made from test scores. Test validation is the process of accumulating evidence to support such inferences. A variety of inferences may be made from scores produced by a given test, and there are many ways of accumulating evidence to support any particular inference. Validity, however, is a unitary concept. (p. 9)

Second, the definition of validity was well explained and elaborated by Bachman (1990). In concert with Messick's view, Bachman's definition confirms that the objects of validation are the inferences made on the basis of test scores and their uses, rather than the tests themselves. According to Bachman, validity has a complex nature comprising a number of aspects, including content validity, construct validity, concurrent validity, and consequences of test use. AERA et al. ...

... EPT L2. In contrast, K2 tokens and AWL tokens increased across score levels: EPT L1 responses included 0.52 K2 tokens and 0.05 AWL tokens, while EPT L2 responses included 0.53 K2 tokens and 0.15 AWL tokens. However, there was no statistically significant difference in lexical diversity, lexical density, or lexical sophistication between the two proficiency levels. As for word families, there were 372 word families in the EPT L1 responses and 379 in the EPT L2 responses, and the difference in the proportion of word families between the two groups of learners was statistically significant. This suggests that the higher-level
learners used more word families in their written discourse, suggesting that the higher-level written responses were linguistically and cognitively more complex than the lower-level output. Overall, although there was not much difference in lexical density, lexical diversity, and lexical sophistication between the EPT levels, when these are combined with the other individual measures, such as types, tokens, and word families, the higher-level learners' written discourse comes across as more complex than the lower-level learners' written language.

CHAPTER 5 DISCUSSION AND CONCLUSIONS

The purpose of this study was to build a validity argument for the EPT W test. The study focused on two inferences, generalization and explanation (Chapelle et al., 2008). For the generalization inference, this study investigated the extent to which tasks and raters contributed to score variability and how many tasks and raters need to be involved in the assessment to obtain a test score dependability of at least .85. For the explanation inference, this study analyzed the written responses of two groups of students at different proficiency levels, examining the extent to which vocabulary distributions differ across proficiency levels of academic writing. This chapter presents a summary and discussion of the findings for each question.

5.1 GENERALIZATION INFERENCE

Figure 5.1 Generalization inference in the validity argument for the PT W with assumptions and backing

The first assumption is that a sufficient number of tasks is included on the test to provide stable estimates of test takers' performance. Backing for the first assumption came from the generalizability analysis: the results showed that the PT W with two tasks provided a dependability estimate of 50% for test takers' performance. The second assumption is that a sufficient number of raters is included on the test to provide stable estimates of test takers' performance. Backing for the second assumption was also supported by G-theory: the findings suggested that the test with two raters
provided a dependability estimate of 50% for test takers' performance. Decision studies were employed to determine the number of tasks and raters needed to obtain a dependability index of 0.85. To reach a dependability of 0.85, we would need 14 raters and 10 tasks, or 10 raters and 12 tasks. However, given the practical constraints and resources at BTEC, we could not meet the demands of a 14-rater, 10-task or 10-rater, 12-task test design. Unlike high-stakes tests such as the TOEFL iBT or IELTS, the EPT W is a medium-stakes test intended to place students into language support courses. The PT at BTEC is medium-stakes despite being compulsory: students who do not pass the third level of the rating scale (see Appendix A) cannot yet enter their major at BTEC, but college regulations do not force students who fail the PT to drop out; they have another chance to retake the test. Therefore, given the nature of the EPT W test and the resources available at BTEC, it would be appropriate to lower the target dependability to 0.7. The smaller numbers of tasks and raters required for this dependability seem practical, given the current resources at BTEC.

5.2 EXPLANATION INFERENCE

Figure 5.2 The explanation inference in the validity argument for the PT test with assumption and backing

The assumption underlying the warrant of the explanation inference is that the linguistic knowledge, processes, and strategies required to successfully complete tasks vary in keeping with theoretical expectations. Backing for this assumption came from discourse analysis of test takers' written responses in terms of vocabulary frequency. The analyses of EPT W written discourse suggested that lexical frequency distributions in students' language varied between proficiency levels. The findings, based on two EPT sub-corpora consisting of 42 texts and
3920 tokens, suggested that the single-word-based measures, such as the numbers of types and tokens, increased as the proficiency level increased, and that higher-proficiency learners produced a wider range of word families than their lower-proficiency counterparts.

5.3 SUMMARY AND IMPLICATIONS OF THE STUDY

Overall, the results supported the two assumptions of the generalization inference only moderately; more evidence needs to be investigated under the generalization inference for the PT W test. For the explanation inference, the single assumption investigated in this study was supported by qualitative evidence, though the quantitative analysis was limited. This study has several implications. First, a practical implication is that the two-task, two-rater test could reach a dependability estimate of only 50%; to obtain a higher value, the test would need to be enlarged with more tasks and raters, and the exact numbers can be decided by decision makers at the college with reference to the findings of this study. Second, the research has methodological implications in that a mixed-methods approach can be employed to maximize the backing for the assumptions of the inferences. Third, the study suggests that raters and tasks were the main components contributing to test score variability, which implies that rater training is important in performance assessments such as writing or speaking tests.

5.4 LIMITATIONS OF THE STUDY AND SUGGESTIONS FOR FUTURE RESEARCH

As the very first investigation into the validity of the EPT Writing test at BTEC, this study has a number of limitations. First, due to the scope of the study, only two inferences (generalization and explanation) were investigated. A validity argument with strong evidence should be backed by evidence for six inferences: domain description, evaluation, generalization, explanation, extrapolation, and utilization. For each
inference, many assumptions underlie the warrant that supports the inference. The second limitation is that only one or two assumptions per inference were examined as examples of the backing sought to support them. For instance, the generalization inference, on which the first and second research questions were based, has four assumptions: 1) a sufficient number of tasks is included on the test to provide stable estimates of test takers' performances; 2) the configuration of tasks on measures is appropriate for the intended interpretation; 3) appropriate scaling and equating procedures for test scores are used; and 4) task, test, and rating specifications are well defined so that parallel tasks and test forms can be created. This study focused on only the first assumption (Chapelle et al., 2008). This leaves open suggestions for future research on building a validity argument for the PT test with other inferences supported by other assumptions. Third, with regard to linguistics, the study examined vocabulary distributions across proficiency levels by counting texts and tokens in two different sub-corpora. Other aspects of language, such as grammar, semantics, and pragmatics, should be analyzed in future research.

REFERENCES

American Educational Research Association, American Psychological Association, & National Council on Measurement in Education. (1999). The standards for educational and psychological testing. American Educational Research Association.

American Educational Research Association, American Psychological Association, & National Council on Measurement in Education. (1985). The standards for educational and psychological testing. American Educational Research Association.

Bachman, L. F. (1990). Fundamental considerations in language testing. Oxford University Press, 188-197.

Bachman, L. F., & Palmer, A. S. (1996). Language testing in practice:
Designing and developing useful language tests. Oxford University Press.

Borsboom, D., & Mellenbergh, G. J. (2004). The concept of validity. Psychological Review, 111(4), 1061-1071.

Brown, C. R., Moore, J. L., Silkstone, B. E., & Botton, C. (1996). The construct validity and context dependency of teacher assessment of practical skills in some pre-university level science examinations. 3(3), 377-392.

Brown, J. D. (1989). Improving ESL placement tests using two perspectives. TESOL Quarterly, 23(1), 65-83.

Brown, J. D. (1996). Testing in language programs. New Jersey: Prentice Hall.

Chapelle, C. A., Jamieson, J., & Hegelheimer, V. (2003). Validation of a web-based ESL test. Language Testing, 20(4), 409-439.

Chapelle, C. A., Enright, M. K., & Jamieson, J. M. (2008). Building a validity argument for the Test of English as a Foreign Language.

Chapelle, C. A., Enright, M. K., & Jamieson, J. M. (2010). Does an argument-based approach to validity make a difference? Educational Measurement: Issues and Practice, 1(29), 3-13.

Crooks, T. J., Kane, M. T., & Cohen, A. S. (1996). Threats to the valid use of assessments. Assessment in Education: Principles, Policy & Practice, 3(3), 265-286.

Cronbach, L. J., & Meehl, P. E. (1955). Construct validity in psychological tests. Psychological Bulletin, 52(4), 281-302.

Douglas, D. (Ed.). (2003). English language testing in U.S. colleges and universities (2nd ed.).
Washington, D.C.: Association of International Educators.

Douglas, D. (2009). Understanding language assessment. London: Hodder.

Fulcher, G. (1997). An English language placement test: Issues in reliability and validity. Language Testing, 14(2), 113-139.

Kane, M. (1992). An argument-based approach to validity. Psychological Bulletin, 112(3), 527-535.

Kane, M. (2001). Current concerns in validity theory. Journal of Educational Measurement, 38(4), 319-342.

Kane, M. (2002). Validating high stakes testing programs. Educational Measurement: Issues and Practice, 21(1), 31-41.

Kane, M. (2004). The analysis of interpretive arguments: Some observations inspired by the comments. Measurement: Interdisciplinary Research and Perspectives, 2(3), 192-200.

Kane, M. (2006). Validation. In R. Brennan (Ed.), Educational measurement (4th ed., pp. 17-64). Westport, CT: American Council on Education and Praeger.

Kane, M., Crooks, T., & Cohen, A. (1999). Validating measures of performance. Educational Measurement: Issues and Practice, 18(2), 5-17.

Lee, Y. J., & Greene, J. (2007). The predictive validity of an ESL placement test: A mixed methods approach. Journal of Mixed Methods Research, 1(4), 366-389.

Messick, S. (1989). Meaning and values in test validation: The science and ethics of assessment. Educational Researcher, 18(2), 5-11.

Mislevy, R. L. (2003). Argument substance and argument structure in educational assessment (CSE Technical Report 605). Los Angeles: Center for the Study of Evaluation.

Shavelson, R., & Webb, N. (1991). Generalizability theory: A primer. Sage Publications.

Raimes (1994). Testing writing in EFL exams: The learners' viewpoint as valuable feedback for improvement. Procedia - Social and Behavioral Sciences, 199(2015), 30-37.

Lines (2004). Guiding the reader (or not) to re-create coherence: Observations on postgraduate student writing in an academic argumentative writing task. Journal of English for Academic Purposes, 16(2014), 14-22.

Toulmin, S.,
Rieke, R., & Janik, A. (1984). An introduction to reasoning. New York: Macmillan.

Usaha, S. (2000). Effectiveness of Suranaree University's English placement test. Suranaree University of Technology. Retrieved on 12 September 2010 from http://hdl.handle.net/123456789/2213

Wall, D., Clapham, C., & Alderson, J. C. (1994). Evaluating a placement test. Language Testing, 11(3), 321-344.

APPENDIX A: EPT W RATING SCALE

Task achievement
- Does not attend: Writes nothing or writes no English words.
- Top Notch Fundamental (0-4.5): Answer is barely related to the task.
- Top Notch 1 (4.5-5.0): Answer sometimes presents related ideas.
- Top Notch 2 (5.0-5.5): Attempts to address the task, but the answer often does not cover all key points.
- Top Notch 3 (5.5-6.5): Attempts to address the task and usually addresses all key points.

Grammatical range and accuracy
- Does not attend: Writes nothing or writes no English words.
- Top Notch Fundamental (0-4.5): Barely uses correct grammatical structures.
- Top Notch 1 (4.5-5.0): Sometimes uses correct tenses, words, and sentences; uses only simple sentences (the first subject always) and correct forms of verbs; has frequent grammatical errors.
- Top Notch 2 (5.0-5.5): Attempts to use a variety of structures, but with only rare use of subordinate clauses; has frequent grammatical errors.
- Top Notch 3 (5.5-6.5): Usually uses a range of structures; attempts complex sentences, but these tend to be less accurate than simple sentences.

Lexical resource
- Does not attend: Writes nothing or writes no English words.
- Top Notch Fundamental (0-4.5): Only uses a few isolated words.
- Top Notch 1 (4.5-5.0): Uses a limited range of words and expressions with no control of word formation and/or spelling.
- Top Notch 2 (5.0-5.5): Uses basic vocabulary which may be used repetitively or may be inappropriate for the task; has limited control of word formation.
- Top Notch 3 (5.5-6.5): Uses a limited range of vocabulary, but this is minimally adequate for the task; has good control of word formation.

Coherence and cohesion
- Does not attend: Writes nothing or writes no English words.
- Top Notch Fundamental (0-4.5): Rarely conveys any message due to lack of control of organisational features such as cohesive devices; a lot of spelling and punctuation errors that hinder comprehension.
- Top Notch 1 (4.5-5.0): Presents information and ideas, but these are not arranged coherently and there is no clear progression in the response; has very little cohesion; frequently has spelling and punctuation errors that hinder comprehension.
- Top Notch 2 (5.0-5.5): Presents information with some organisation, but there may be a lack of overall progression and of close connection among sentences; uses some basic cohesive devices, but these may be inaccurate or repetitive; sometimes has spelling and punctuation errors that may hinder comprehension.
- Top Notch 3 (5.5-6.5): Uses cohesive devices effectively, but cohesion within and/or between sentences may be faulty or mechanical; may be repetitive because of lack of referencing and substitution; has some spelling and punctuation errors, but they do not hinder comprehension.

THE UNIVERSITY OF DANANG
UNIVERSITY OF FOREIGN LANGUAGE STUDIES
No. .../QĐ-ĐHNN

SOCIALIST REPUBLIC OF VIETNAM
Independence - Freedom - Happiness
Da Nang, ... October 2019

DECISION on the assignment of the thesis topic and the responsibilities of the supervisor of a master's thesis

THE RECTOR OF THE UNIVERSITY OF FOREIGN LANGUAGE STUDIES

Pursuant to Decree No. 32/CP dated 04 ... 1994 of the Government on the establishment of the University of Danang;
Pursuant to Decision No. 709/QĐ-TTg dated 26 ... 2002 of the Government on the establishment of the University of Foreign Language Studies under the University of Danang;
Pursuant to Circular No. 08/2014/TT-BGDĐT dated 20 ... 2014 of the Minister of Education and Training promulgating the Regulation on the organization and operation of regional universities and their member higher-education institutions;
Pursuant to Circular No. 15/2014/TT-BGDĐT dated 15 ... 2014 of the Minister of Education and Training promulgating the Regulation on master's-level training;
Pursuant to the Regulation on the duties and powers of the University of Danang, its member higher-education institutions and
affiliated units, promulgated together with Decision No. 6950/QĐ-ĐHĐN dated 01 December 2014 of the Director of the University of Danang;
Pursuant to Decision No. 975/QĐ-ĐHNN dated 04 November 2016 of the Rector of the University of Foreign Language Studies - University of Danang promulgating the Regulation on master's-level training;
Pursuant to the minutes of the council meeting on the defense of detailed master's thesis proposals in English Linguistics, cohort 36, dated 17/7/2019;
At the proposal of the Head of the Academic Affairs Office,

IT IS HEREBY DECIDED:

Article 1. To assign graduate student Võ Thị Thu Hiền, class K36.NNA.ĐN, major in English Linguistics, the thesis topic "Developing a Validity Argument for the English Placement Test at International College BTEC FPT Da Nang Campus", supervised by Dr. Võ Thanh Sơn Ca, University of Foreign Language Studies - University of Danang.

Article 2. The graduate student and the supervisor named in Article 1 shall enjoy the rights and perform the duties stipulated in the Regulation on master's-level training issued by the Ministry of Education and Training and in the Regulation on master's-level training of the University of Foreign Language Studies - University of Danang.

Article 3. The heads of the relevant units of the University of Foreign Language Studies - University of Danang, the thesis supervisor, and the graduate student named in Article 1 shall implement this Decision.

Recipients:
- The Rector (for reporting);
- As in Article 3;
- Filed: VT, P.ĐT.

FOR THE RECTOR
VICE RECTOR

... raters read each performance entirely and then gave analytical ratings and a holistic rating for each task, based on the rating rubric.

3.4.3 Data analysis

SPSS version 22 was used for data analysis ...

... measured by the test, averaging over raters and tasks.

Task: The variance estimate for task (0.299) accounts for 11% of the total variance. This suggests that one task was more difficult than another.
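The single-word-based measures reported for the explanation inference, such as tokens, types, and word families, can be illustrated with a small sketch. The tokenizer and the tiny word-family mapping below are hypothetical simplifications for demonstration only; the thesis used established frequency lists (K1, K2, AWL) and corpus tools, not this code:

```python
# Illustrative computation of single-word-based lexical measures:
# tokens, types, type-token ratio (TTR), and word-family counts.
# The tokenizer and the tiny FAMILY mapping are hypothetical
# simplifications, not the instruments used in the thesis.
import re

def tokenize(text):
    """Lowercase word tokens, keeping simple contractions."""
    return re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())

# Hypothetical mapping of inflected forms to a base word family.
FAMILY = {"went": "go", "goes": "go", "going": "go",
          "memorable": "memory", "memories": "memory"}

def lexical_profile(text):
    tokens = tokenize(text)
    types = set(tokens)
    families = {FAMILY.get(t, t) for t in tokens}
    return {"tokens": len(tokens),
            "types": len(types),
            "ttr": len(types) / len(tokens) if tokens else 0.0,
            "families": len(families)}

low = "I went to the beach. I went with my family. The beach was fun."
high = ("Last summer my family organised a memorable trip to a quiet "
        "beach, where we swam, explored, and relaxed.")

print(lexical_profile(low))
print(lexical_profile(high))
```

On measures like these, a more varied response yields more types and word families per token, which is the pattern the thesis reports for higher-proficiency writers.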