VIETNAM NATIONAL UNIVERSITY, HANOI
COLLEGE OF FOREIGN LANGUAGES
DEPARTMENT OF POSTGRADUATE STUDIES
NGUYEN THI VIET HA
A STUDY ON THE RELIABILITY OF THE FINAL ACHIEVEMENT COMPUTER-BASED MCQS TEST 1 FOR THE 4TH SEMESTER NON-ENGLISH MAJORS AT HANOI UNIVERSITY OF BUSINESS AND TECHNOLOGY
(ĐÁNH GIÁ ĐỘ TIN CẬY CỦA BÀI THI TRẮC NGHIỆM THỨ NHẤT TRÊN MÁY TÍNH CUỐI KỲ 4 DÀNH CHO SINH VIÊN NĂM THỨ HAI KHÔNG CHUYÊN NGÀNH TIẾNG ANH TRƯỜNG ĐẠI HỌC KINH DOANH VÀ CÔNG NGHỆ HÀ NỘI)
Minor Programme Thesis
Field: Methodology
HANOI, 2008
Code: 601410
Supervisor: Nguyen Thu Hien, M.A.
CANDIDATE’S STATEMENT
I hereby state that I, Nguyen Thi Viet Ha, Class 14A, being a candidate for the degree of Master of Arts (TEFL), accept the requirements of the College relating to the retention and use of Master of Arts theses deposited in the library.
In terms of these conditions, I agree that the original of my thesis deposited in the library should be accessible for the purposes of study and research, in accordance with the normal conditions established by the librarian for the care, loan or reproduction of the thesis.
Signature:
Date:
ACKNOWLEDGMENTS
In completing this thesis, I have received a great deal of support. Of primary importance has been the role of my supervisor, Ms Nguyen Thu Hien, M.A., teacher in the Department of English and American Languages & Cultures, College of Foreign Languages, Vietnam National University, Hanoi. I am deeply grateful to her for her precious guidance, enthusiastic encouragement and invaluable critical feedback. Without her dedicated support and correction, this thesis could not have been completed.
I am deeply indebted to my dear teacher, Mr Vu Van Phuc, M.A., Head of the Testing Center, College of Foreign Languages, VNU, who provided me with many useful suggestions and much assistance with my study.
I would also like to express my sincere thanks to all the teachers and colleagues in the English Department, HUBT, for their help in conducting the survey, sharing opinions and making suggestions for the study. In particular, my thanks go to Ms Le Thi Kieu Oanh, Assistant of the English Department, HUBT, for her willingness to provide test score data.
I wish to express my special thanks to the students of K11 at Hanoi University of Business and Technology, who actively participated in the survey.
Finally, it is my great pleasure to acknowledge my gratitude to the beloved members of my family, especially my husband, who constantly encouraged and helped me with my thesis.
ABSTRACT
The main aim of this minor thesis is to evaluate the reliability of the final achievement computer-based MCQs Test 1 for the 4th semester non-English majors at Hanoi University of Business and Technology.
In order to achieve this aim, a combination of qualitative and quantitative research methods was adopted. The findings indicate that there is a certain degree of unreliability in the final achievement computer-based MCQs Test 1 and that two main factors cause this unreliability: test item quality and test-takers' performance.
On the basis of a thorough analysis of the collected data, the author makes some suggestions to improve the quality of the final achievement test and MCQs Test 1 for non-English majors in the 4th semester at Hanoi University of Business and Technology. Firstly, the test objectives, sections and skill weight should be adjusted to be more compatible with the course objectives and the syllabus. Secondly, a testing committee should be set up to construct and develop a multiple-choice item bank made up of test items with good p-values and discrimination values.
LIST OF ABBREVIATIONS
CBT: Computer-based testing
HUBT: Hanoi University of Business and Technology
MC: Multiple choice
MCQs: Multiple-choice questions
ML Pre-: Market Leader Pre-intermediate
KR: Kuder-Richardson
SD: Standard deviation
Main points in the vocabulary section
Topics in reading section
Items in the functional language sections
Test reliability coefficient
p-value of items in 4 sections
Discrimination value of items in 4 sections
Number of test items with acceptable p-value and discrimination value in 4 sections
Suggested scoring format
Proposed test specifications
Students' response on test content
Students' response on item discrimination value
Students' response on time length
Students' response arbitrariness
Students' response on relation between test score and their achievement
Chapter 1: INTRODUCTION
1.1 Rationale of the study
1.2 Aims and research questions
1.3 Scope of the study
1.4 Theoretical and practical significance of the study
1.5 Method of the study
1.6 Organization of the paper
Chapter 2: LITERATURE REVIEW
2.1 Language testing
2.1.1 What is a language test?
2.1.2 The purposes of language tests
2.1.3 Types of language tests
2.1.4 Criteria of a good language test
2.2 Achievement test
2.2.1 Definition
2.2.2 Types of achievement test
2.2.3 Considerations in final achievement test construction
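2.3 MCQs test
2.3.1 Definition
2.3.2 Benefits of MCQs test
2.3.3 Limitations of MCQs tests
2.3.4 Principles to construct MC items
2.4 Reliability of a test
2.4.1 Definition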
2.4.2 Methods for test reliability estimate
2.4.3 Measures to improve test reliability
2.5 Summary
Chapter 3: The Context of the Study
3.1 The current English learning, teaching and testing situation at HUBT
3.2 The course objectives, syllabus and materials used for second-year non-English majors in Semester 4
3.2.1 The course objectives
3.2.2 Business English syllabus
3.2.3 The course book
3.2.4 Specification grid for the final achievement Computer-based MCQs test in Semester 4
Chapter 4: Methodology
4.1 Participants
4.2 Data collection instruments
4.3 Data collection procedure
4.4 Data analysis procedure
Chapter 5: RESULTS AND DISCUSSIONS
5.1 The compatibility of the objectives, content and skill weight format of the final achievement computer-based MCQs test 1 for the 4th semester with the course objectives and the syllabus
5.1.1 The test objectives and the course objectives
5.1.2 The test item content in four sections and the syllabus content
5.1.3 The skill weight format in the test and the syllabus
5.2 The reliability of the final achievement test
5.2.1 Reliability coefficient
5.2.2 Item difficulty and discrimination value
5.3 The attitude of students towards the MCQs test 1
5.4 Pedagogical implications and suggestions on improvements of the existing final achievement computer-based MCQs test 1 for the non-English majors at HUBT
5.5 Summary
Chapter 6: CONCLUSION
6.1 Summary of the findings
6.2 Limitations of the study
6.3 Suggestions for further study
Chapter 1: Introduction
1.1 Rationale of the study
Testing plays a very important role in the teaching and learning process. It is one form of measurement used to point out strengths and weaknesses in students' learned abilities. Through testing, and especially through test scores, we may discover the performance of both students and teachers. As far as students are concerned, test scores reveal what they have achieved after a period of learning. As for teachers, test scores indicate what they have taught their students. Based on test results, we may improve teaching, learning and testing for better instructional effectiveness.
Another reason for selecting testing as the subject of this study lies in the fact that current language testing at Hanoi University of Business and Technology (HUBT) has been the subject of much controversy among students and teachers. Testing is mainly carried out in the form of two objective tests on computers (named Test 1 and Test 2), which are administered at the end of each semester. The scores that a student gets on these tests are the main indicators of his or her performance during the whole semester. There are different comments on the results of these tests, especially Test 1 for second-year non-English majors. Some subject teachers claim that these tests do not truly reflect the students' language competence. Others say that the tests are appropriate to what students have learnt in class, compatible with the course objectives, and therefore reliable. Opposing views also exist among the students. Many think that the tests are more difficult than what they have learnt and studied for the exam; others say that the test items are easy and relevant to what they have been taught. It is therefore essential to find out whether the tests are closely related to what the students have learnt and what the teachers have taught, and whether the tests are reliable.
For the two reasons mentioned above, the author undertakes this study, entitled "A study on the reliability of the final achievement Computer-based MCQs Test 1 for the 4th semester non-English majors at Hanoi University of Business and Technology", with the intention of examining these conflicting claims about the test. In addition, the author hopes that the study results will help to raise awareness among teachers as well as others who are interested in this field. At the same time, the study results can, to some extent, be applied to improve the current testing situation at HUBT.
1.2 Aims and research questions
The main aim of the study is to investigate the reliability of the existing final achievement MCQs Test 1 (4th semester) for non-English majors at HUBT by analyzing the test objectives, test content and skill weight format, students' scores, test items, and students' perceptions of and comments on the test, and then to make suggestions for improving the test.
To achieve this aim, the following research questions are set for exploration:
1. Are the objectives, content and skill weight format of the final achievement computer-based MCQs Test 1 compatible with the course objectives, the syllabus content and its skill weight format?
2. To what extent is Test 1 reliable?
3. What is the students' attitude towards the final achievement computer-based MCQs Test 1?
1.3 Scope of the study
The study is limited to the existing final achievement computer-based MCQs Test 1 in the 4th semester for second-year non-English majors at HUBT.
1.4 Theoretical and practical significance of the study
Theoretically, the study underlines that testing is crucial for measuring and evaluating the quality of learning and teaching, and that test reliability is one of the most important criteria for evaluating a test.
Practically, the study shows how reliable the final achievement MCQs Test 1 administered at HUBT is and how its quality can be improved.
1.5 Method of the study
Both qualitative and quantitative methods are used.
A qualitative method is applied to the literature review on language testing, the course objectives, the syllabus, the objectives, content and format of achievement Test 1 for the 4th term, and the results of the student questionnaires.
A quantitative method is used for the analysis of test scores and test items.
1.6 Organization of the paper
The study is composed of six chapters.
Chapter 1 - Introduction briefly states the rationale, aims and research questions, scope of the study, theoretical and practical significance of the study, method of the study and organization of the paper.
Chapter 2 - Literature review discusses relevant theories of language testing, final achievement tests, computer-based MCQs tests and test reliability.
Chapter 3 - The context of the study deals with the English learning, teaching and testing situation at HUBT, the course book, the syllabus and the checklist for the test.
Chapter 4 - Methodology presents the participants, data collection instruments, and data collection and data analysis procedures.
Chapter 5 - Results and Discussions presents and discusses the results of the study. Suggestions for the improvement of achievement Test 1 are also proposed in this chapter.
Chapter 6 - Conclusion summarizes the findings, mentions the limitations and provides suggestions for further study.
Chapter 2: Literature review
2.1 Language testing
2.1.1 What is a language test?
There is a wide variety of definitions of a language test, but they share one point: a language test is considered a device for measuring individuals' language ability.
According to Henning (1987, p.1), "Testing, including all forms of language testing, is one form of measurement". In his opinion, tests such as listening or reading comprehension tests are delivered in order to find out the extent to which these abilities are present in the learners. Similarly, Bachman (1990, p.20) stated: "A test is a measurement instrument designed to elicit a specific sample of an individual's behavior". He also considered the elicitation of a specific sample of behavior to be what distinguishes a test from other types of measurement.
Brown H.D. (1995, p.384) presented the notion in a simpler way: "A test, in plain words, is a method of measuring a person's ability or knowledge in a given domain". He explained that a test is first and foremost a method which includes items and techniques requiring performance from the testees. Via this performance, a person's ability or language competence is measured.
These viewpoints show that a language test is an effective tool for measuring and assessing students' language knowledge and skills and for providing valuable information for better future teaching and learning.
2.1.2 The purposes of language tests
Scholars view the purposes of language tests from different perspectives. Heaton (1990), for example, mentioned seven purposes, which can be listed as follows:
• Finding out about progress
• Encouraging students
• Finding out about learning difficulties
• Finding out about achievement
• Placing students
• Selecting students
• Finding out about proficiency
In general, a language test is used to evaluate both teachers' and students' performance, to make judgments about and adjustments to teaching materials and methods, and to strengthen students' motivation for further study.
2.1.3 Types of language tests
Language tests can be classified into different types according to their purposes. Heaton (1990), Brown (1995), Harrison (1983) and Hughes (1989) pointed out that language tests include four main types: proficiency tests, diagnostic tests, placement tests and achievement tests, with the characteristics illustrated in the following table:

Type of test | Characteristics
Proficiency test | Measure people's abilities in a language regardless of any training they may have had in that language
Diagnostic test | Check students' progress for their strengths and weaknesses and what further teaching is necessary
Placement test | Classify students into groups at different levels at the beginning
2.1.4 Criteria of a good language test
Just like any measuring device, a language test is subject to measurement error. For the purpose of investigating, evaluating and "testing" a test, researchers such as Brown (1995), Henning (1987), Bachman (1990) and Harrison (1983) identified criteria for determining whether a test is good or not. A good language test must have four essential qualities: reliability, validity, practicality and discrimination.
The reliability of a test is its consistency (Brown, 1995; Harrison, 1983). A test is reliable only when it yields the same results regardless of the circumstances in which it is administered or the markers who score it. The validity of a test refers to "the degree to which the test actually measures what it is intended to measure" (Brown, 1995, p.387). A test is considered valid if it possesses content validity, face validity and construct validity. The practicality of a test concerns its administration. A test is practical when it saves time and money and is easy to administer, mark and interpret. The discrimination of a test is the extent to which it separates the students from each other (Harrison, 1983). In other words, it is the capacity of the test to discriminate among different students and to reflect the performance of individuals within the same group.
2.2 Achievement test
2.2.1 Definition
Achievement tests are used extensively at different levels of education because of their distinctive characteristics. Researchers define the notion of an achievement test in various ways.
Henning (1987, p.6) held that:
    Achievement tests are used to measure the extent of learning in a
    prescribed content domain, often in accordance with explicitly stated
    objectives of a learning program.
From this definition, it follows that an achievement test is a measurement tool designed to examine the language competence of learners over a period of instruction and to evaluate the instructional program. By the same token, Hughes (1989) stated that achievement tests are intended to assess how successful individual students, groups of students or the courses themselves have been in achieving objectives. Achievement tests play an important role in education programs, especially in evaluating the language knowledge and skills that students have acquired during a given course.
2.2.2 Types of achievement test
Achievement tests can be subdivided into final achievement tests and progress achievement tests according to the time of administration and the desired objectives (Heaton, 1990).
Final achievement tests are usually given at the end of the school year or at the end of the course to measure how far students have achieved the teaching goals. The contents of these tests must be related to the teaching content and objectives concerned. Progress achievement tests are usually administered during the course to measure the progress that students are making. The results of these tests enable teachers to identify learners' weaknesses and to diagnose the areas not properly mastered by students during the course so that remedial action can be taken.
Heaton (1990) also stated that the two types of test differ in that final achievement tests are designed to cover a longer period of learning and should attempt to cover as much of the syllabus as possible.
2.2.3 Considerations in final achievement test construction
Given these characteristics, Heaton (1990) held that covering much of the content of a syllabus or course book is a requirement in designing a final achievement test. Testers should avoid basing the test on their own teaching rather than on the syllabus or course book, in order to establish and maintain a certain standard. In addition, McNamara (2000) stated that test writers should draw up a test specification before writing a test. A test specification results from the process of designing the test content and test method. It has to include information on the length and structure of each part of the test, the type of materials with which the candidates will have to engage, the source of the materials, the extent to which authentic materials may be altered, the response format and how responses are scored. Specifications are usually written first, and the test is then written on the basis of them. After the test is written, the specification should be consulted again to see whether the test matches the objectives set in it.
2.3 MCQs test
2.3.1 Definition
Multiple-choice question tests (MCQs tests) are objective tests which require no particular knowledge or training in the examined content area on the part of the scorer (Henning, 1990). They differ from subjective tests in terms of scoring method: no matter which examiner marks the test, a testee will get the same score on it (Heaton, 1988).
MCQs tests use multiple-choice questions, also called multiple-choice items, as a testing technique. An MC item is a test item where the test taker is required to choose the only correct answer from a number of given options (McNamara, 2000; Weir, 1990).
In the view of Heaton (1988), MC items take many forms, but their basic structure includes two parts. The initial part is known as the stem. The primary purpose of the stem is to present the problem clearly and concisely. The stem needs to give the testees a very general idea of the problem and the answer required. It may be in the form of an incomplete statement, a complete statement or a question. The other part consists of the choices from which the students select their answers, referred to as options, responses or alternatives. In an MC item there may be three, four or five options, of which one is the correct option, or key, while the others are distractors, whose task is to distract the majority of poor students from the correct option. The optimum number of options for each multiple-choice item in most public tests is five, and it is desirable to use four options for grammar items and five for vocabulary and reading items.
2.3.2 Benefits of MCQs test
MC items are undoubtedly one of the most widely used item types in objective tests (Heaton, 1988). The popularity of this testing technique results from its efficiency. Researchers such as Weir (1990), Heaton (1988) and Hughes (1989) pointed out a number of benefits, which are detailed below.
Firstly, the scoring of MCQs tests is perfectly reliable, rapid and economical. There is only one correct answer in an MC item, so the scorers' influence on the test is minimized. Scorers cannot impose their personal expertise, experience, attitudes or judgment when giving marks to testees' responses. The testees thus always get a consistent result, whoever the scorers are and whenever their tests are marked. In addition, MCQs tests can be marked mechanically with minimal human intervention. As a result, the marking is not only reliable and simple but also more rapid and often more cost-effective than for other forms of written test (Weir, 1990).
Secondly, an MCQs test can cover a much wider sample of knowledge than a subjective test. When taking an MCQs test, a candidate only has to make a mark on the paper, and therefore testers can include more items in a given period of time (Hughes, 1989). With a large number of items in the test, the coverage of knowledge is broad, which is very useful for identifying students' strengths and weaknesses and for distinguishing their abilities.
Thirdly, MCQs tests increase test reliability. According to Heaton (1988) and Weir (1990), it is not difficult to obtain reliability for MCQs tests because scoring is perfectly objective. Besides, because testees do not have to deploy writing skills as in open-ended tests and MC items have a clear and unequivocal format, the extent to which measurement error affects the trait being assessed is reduced.
Another benefit is that MC items can be trialed beforehand fairly easily. From these trials, the difficulty level of each item and of the test as a whole can usually be estimated in advance (Weir, 1990). These item difficulty estimates contribute greatly to designing a test that is better suited to candidates' level of language.
In addition, Heaton (1988, p.27) claimed that "multiple-choice items can provide a useful means of teaching and testing in various learning situations (particularly at the lower levels) provided that it is always recognized that such items test knowledge of grammar, vocabulary, etc. rather than the ability to use language". MC items can be very useful in measuring students' ability to recognize correct grammatical forms, etc., and can therefore help both teachers and students to identify areas of difficulty.
As far as computer-based MCQs tests are concerned, according to McNamara (2000) many important national and international language tests, including TOEFL, are moving to computer-based testing (CBT) as a result of rapid developments in computer technology. The main feature of CBT is that stimulus texts and prompts are presented not in examination booklets but on the screen, with candidates being required to key in their responses. The advent of CBT has not necessarily involved any change in test content but often simply represents a change in test method. McNamara (2000) noted that the proponents of computer-based testing can point to a number of advantages. First, just as with paper-based MCQs tests, fixed-response items can be scored automatically and the candidate can be given a score immediately. Second, the computer can deliver tests that are tailored to the particular abilities of the candidate. This type of test, also called a computer-adaptive test, can provide far more information about the testee's ability.
2.3.3 Limitations of MCQs tests
Despite the fact that MCQs tests bring many benefits, especially to test administrators, there are several problems associated with the use of MC items. These problems were identified by a number of researchers such as Weir (1990), Hughes (1989), Heaton (1988), McCoubrie (2004) and McNamara (2000).
First of all, Hughes (1989) criticized the MCQ technique for testing only recognition knowledge. To do a given task, a testee just needs to look at the stem and the four or five options and then pick out the key. His or her performance is not much more than the recognition of the right form of language. It gives no evidence that this person can produce the language. There may be a gap between at least some candidates' productive and receptive skills, and therefore performance on an MCQs test may give an inaccurate picture of these candidates' ability (Hughes, 1989). Heaton (1988) also pointed out that an MC item does not lend itself to the testing of language as communication, and that the process involved in actually selecting one out of four or five options does not bear much relation to the language used in most real-life situations. Normally, in everyday situations we are required to produce and receive language, while MC items are merely aimed at testing receptive skills.
Another problem that arises when using MCQs tests is that "the multiple-choice item is one of the most difficult and time consuming types of items to construct" (Heaton, 1988, p.27). In order to write a good item, test designers have to follow certain principles strictly. For example, they have to write many more items than they actually need for a test. After that, they have to pre-test the items, analyze students' performance on them, evaluate the items and identify the usable ones, or even rewrite items to arrive at a satisfactory final version. These procedures take a lot of test constructors' time and need far more careful preparation than subjective tests.
Furthermore, objective tests of the MCQs type encourage guessing (Weir, 1990; Heaton, 1988; Hughes, 1989). Hughes estimated that the chance of guessing the correct answer in a three-option multiple-choice item is roughly 33%; in a four- or five-option item it is 25% or 20% respectively. The format of MC items makes it possible for testees to complete some items without reference to the texts they are set on. As a result, the scores gained on MCQs tests may be suspect and the score range may become narrow.
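The guessing percentages above can be turned into an expected score for a testee who answers every item blindly. The following is only a rough sketch: the function name and the 50-item test length are hypothetical and are not taken from the test examined in this study.

```python
def expected_guess_score(n_items: int, n_options: int) -> float:
    """Expected number of correct answers from blind guessing.

    Assumes exactly one key per item and equally attractive options,
    so the chance of a correct guess on each item is 1 / n_options.
    """
    if n_items < 0 or n_options < 2:
        raise ValueError("need n_items >= 0 and n_options >= 2")
    return n_items / n_options


# Hypothetical 50-item test with 3, 4 and 5 options per item.
for k in (3, 4, 5):
    print(k, "options:", round(expected_guess_score(50, k), 1), "items by chance")
```

Under these assumptions, a testee could expect roughly 17, 12.5 and 10 correct answers out of 50 by chance alone, which illustrates why scores near the bottom of the range are hard to interpret.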
Some other limitations in the use of MC items involve backwash and cheating. Backwash may be harmful because MC items require students to memorize as many structures and forms as possible and do not stimulate them to produce language. Thus practicing MC items is not a good way to improve learners' command of language. Cheating may be facilitated because MC items make it easy for students to communicate with each other and exchange selected responses nonverbally.
With regard to computer-based tests, according to McNamara (2000) this type of test requires the prior creation of an item bank whose items have been thoroughly trialed. Preparing a standardized item bank that estimates item difficulty for candidates at given levels of ability as precisely as possible is not easy. In addition, delivering CBT raises questions of validity and reliability. For example, different levels of familiarity with computers or with reading texts on computer screens will affect students' performance. These differences might make it difficult to draw conclusions about a candidate's ability.
2.3.4 Principles to construct MC items
In order to construct a good MC item, a large number of principles need to be followed, which can be summarized as follows (Heaton, 1988):
• Each MC item should have only one correct answer.
• Only one feature at a time should be tested.
• Each option should be grammatically correct when placed in the stem, except in the case of specific grammar test items.
• All multiple-choice items should be at a level appropriate to the proficiency level of the testees.
• Multiple-choice items should be as brief and as clear as possible.
• Multiple-choice items should be arranged in rough order of increasing difficulty, and there should be one or two simple items to "lead in" the testees.
2.4 Reliability of a test
2.4.1 Definition
In research, the term reliability means 'repeatability' or 'consistency'. A test is considered reliable if it would give the same result over and over again, assuming that what is being measured does not change. Lynch (2003, p.83) stated that reliability refers to "the consistency of our measurement". In the same vein, Harrison (1983) explained that, to be reliable, tests should not be elastic in their measurement. Whichever version of the test a testee takes, whenever the test is administered, and whoever scores it, it should still yield the same results.
2.4.2 Methods of test reliability estimate
Reliability may be estimated through a variety of methods, which are presented below:
* Test-retest method is a classic way to calculate the reliability coefficient of a test. The test is given to a group of students and then given again to the same students shortly afterward (the interval between the two administrations is no more than two weeks). The test is assumed to be perfectly reliable if the students get the same scores on the first and the second administration (Alderson et al., 1995).
* Parallel-form methods involve correlating the scores from two or more similar (parallel) tests which are administered to the same sample of persons. A formula for this method may be expressed as follows:
$R_{tt} = r_{A,B}$ (Henning, 1987)
$R_{tt}$: the reliability coefficient
$r_{A,B}$: the correlation of form A with form B of the test when administered to the same people at the same time
* Inter-rater method is applied when scores on the test are independent estimates by two or more raters. It involves correlating the ratings of one rater with those of another. The following formula is used in calculating reliability:
$$R_{tt} = \frac{n\,r_{A,B}}{1 + (n-1)\,r_{A,B}}$$ (Henning, 1987)
$R_{tt}$: inter-rater reliability
$n$: the number of raters whose combined estimates form the final mark for the examinee
$r_{A,B}$: the correlation between the raters, or the average correlation among all raters if there are more than two
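The inter-rater formula can be checked with a few lines of code. This is a sketch only: the function name and the example correlation of 0.6 are hypothetical values chosen for illustration.

```python
def inter_rater_reliability(r_ab: float, n_raters: int) -> float:
    """Reliability of the combined mark of n raters.

    r_ab is the correlation between two raters, or the average correlation
    among all raters when there are more than two.
    """
    if n_raters < 1:
        raise ValueError("n_raters must be at least 1")
    return n_raters * r_ab / (1 + (n_raters - 1) * r_ab)


# Hypothetical case: raters correlate at 0.6 with each other.
for n in (1, 2, 3):
    print(n, "rater(s):", round(inter_rater_reliability(0.6, n), 3))
```

With one rater the estimate equals the correlation itself; with two raters it rises to 0.75, which shows how combining independent ratings increases the reliability of the final mark.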
* Internal consistency method judges the reliability of the test by estimating how consistent test-takers' performances on different parts of the test are with each other (Bachman, 1990). The following internal consistency measures can be used:
Split-half reliability involves dividing a test into two halves and correlating them. The more strongly the two halves correlate, the higher the reliability will be. This method uses the following formula:
$$R_{tt} = \frac{2\,r_{A,B}}{1 + r_{A,B}}$$
$R_{tt}$: reliability estimated by the split-half method
$r_{A,B}$: the correlation of the scores from one half of the test with those from the other half
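The split-half procedure can be sketched in Python as shown below. This is a minimal sketch under stated assumptions: the test is split into odd-numbered and even-numbered items (one common way of forming halves, not necessarily the one intended by the sources cited), responses are coded 1 for correct and 0 for incorrect, and the function names and the sample data are hypothetical.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation between two equally long lists of scores."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)


def split_half_reliability(responses):
    """responses: one list of 0/1 item scores per student.

    Assumption: halves are the odd-numbered items (indices 0, 2, ...)
    and the even-numbered items (indices 1, 3, ...).
    """
    half_a = [sum(row[0::2]) for row in responses]
    half_b = [sum(row[1::2]) for row in responses]
    r_ab = pearson(half_a, half_b)
    # Step up the half-test correlation to full-test length
    return 2 * r_ab / (1 + r_ab)


# Hypothetical responses of four students to a four-item test
sample = [
    [1, 1, 1, 0],
    [1, 1, 0, 1],
    [0, 0, 0, 0],
    [1, 1, 0, 0],
]
print(round(split_half_reliability(sample), 3))
```

In this sample the two halves correlate at 0.5, so the stepped-up estimate for the full test is higher than the raw correlation between the halves.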
Kuder-Richardson Formula 20 (KR-20) is based on item-level data and is used when the tester has the results for each test item. The KR-20 is as follows:
$$R_{tt} = \frac{n}{n-1} \cdot \frac{s_t^2 - \sum s_i^2}{s_t^2}$$
$R_{tt}$: the KR-20 reliability estimate
$n$: the number of items in the test
$s_t^2$: the variance of test scores
$\sum s_i^2$: the sum of the variances of all items (or $\sum pq$)
Kuder-Richardson Formula 21 (KR-21) is based on total test scores and assumes that all items are of an equal level of difficulty. The KR-21 is as follows:
$$R_{tt} = \frac{n}{n-1}\left(1 - \frac{\bar{X}\,(n - \bar{X})}{n\,s_t^2}\right)$$
$R_{tt}$: the KR-21 reliability estimate
$n$: the number of items in the test
$\bar{X}$: the mean of scores on the test
$s_t^2$: the variance of test scores
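The two Kuder-Richardson estimates can be illustrated with a short Python sketch. It is offered under assumptions rather than as the procedure used in this study: items are scored 1 or 0, the variance is the population variance (dividing by the number of students), and the function names and the five-student sample are hypothetical.

```python
def variance(values):
    """Population variance (divides by the number of values)."""
    n = len(values)
    mean = sum(values) / n
    return sum((v - mean) ** 2 for v in values) / n


def kr20(responses):
    """KR-20 from item-level data: one list of 0/1 scores per student."""
    n_students = len(responses)
    n_items = len(responses[0])
    total_variance = variance([sum(row) for row in responses])
    # For a 0/1 item the variance equals p * q, with q = 1 - p
    sum_pq = 0.0
    for j in range(n_items):
        p = sum(row[j] for row in responses) / n_students
        sum_pq += p * (1 - p)
    return (n_items / (n_items - 1)) * (total_variance - sum_pq) / total_variance


def kr21(n_items, mean_score, total_variance):
    """KR-21 from total scores only; assumes items of equal difficulty."""
    correction = mean_score * (n_items - mean_score) / (n_items * total_variance)
    return (n_items / (n_items - 1)) * (1 - correction)


# Hypothetical responses of five students to a four-item test
sample = [
    [1, 1, 1, 1],
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 0],
]
totals = [sum(row) for row in sample]
print(round(kr20(sample), 3))
print(round(kr21(4, sum(totals) / len(totals), variance(totals)), 3))
```

For this sample KR-21 comes out lower than KR-20, which is expected when the items differ in difficulty, since KR-21 assumes they do not.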
Alderson, J.S et al. (1995) stated that for internal consistency reliability, the perfect reliability index is +1.0. In the same view, Hughes (1989, p.31-32) noted that "the ideal reliability coefficient is 1 - a test with a reliability coefficient of 1 is one which would give precisely the same results for a particular set of candidates regardless of when it happened to be administered". The reliability coefficient for a good vocabulary, structure and reading test is usually in the 0.90 to 0.99 range, for an auditory comprehension test it is more often in the 0.80 to 0.89 range, and for an oral production test it may be in the 0.70 to 0.79 range, while an MCQs test typically has a reliability coefficient of more than 0.80 (Hughes, 1989).
Among the above ways of estimating reliability, the test-retest and parallel-form methods require at least two test administrations, while the inter-rater and internal consistency methods need only a single administration. For reasons of convenience and satisfaction, KR-20 and KR-21 are chosen more often than the others and are considered the two most common formulae (Alderson J.S et al., 1995).
Concerning MCQs tests, besides estimating the test reliability coefficient, item analysis, including item difficulty and item discrimination, provides more detailed insight into test reliability (Henning, 1987).
The formula for calculating item difficulty is:
$$p = \frac{\sum C_r}{N}$$
$p$: proportion correct
$\sum C_r$: the sum of correct responses
$N$: the number of students
Henning (1987) pointed out that the p value for each item should be between 0.33 and 0.67 for the level of difficulty of the item to be acceptable. If the p value is below 0.33, the item is considered too difficult. If it is above 0.67, the item is considered too easy.
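The calculation and the acceptance range described above can be sketched in Python as follows. The function names and the sample responses are hypothetical; only the thresholds 0.33 and 0.67 come from the text.

```python
def item_difficulty(item_responses):
    """Proportion correct: sum of correct responses divided by number of students.

    item_responses: list of 0/1 scores of all students on one item.
    """
    return sum(item_responses) / len(item_responses)


def classify_difficulty(p, lower=0.33, upper=0.67):
    """Label an item using the 0.33 to 0.67 acceptance range."""
    if p < lower:
        return "too difficult"
    if p > upper:
        return "too easy"
    return "acceptable"


# Hypothetical responses of eight students to one item
responses = [1, 0, 1, 1, 0, 0, 1, 0]
p = item_difficulty(responses)
print(p, classify_difficulty(p))  # prints 0.5 acceptable
```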
The formula for computation of item discrimination is:
$$D = \frac{H_c}{H_c + L_c}$$ (Henning, 1987)
$D$: discriminability
$H_c$: the number of correct responses in the high group
$L_c$: the number of correct responses in the low group
The optimal size of each group is 28% of the total sample. For very large samples of examinees, the high and low groups are each reduced to 20% of the sample for computational convenience. The acceptable discrimination value by the sample separation method is >= 0.67 (Henning, 1987).
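The sample separation method can be sketched in Python as below. This is an illustrative sketch, not the procedure of this study: students are ranked by total score, the top and bottom 28% form the high and low groups, and the rounding of the group size, the function name and the ten-student sample are all assumptions.

```python
def item_discrimination(responses, item_index, group_fraction=0.28):
    """D = Hc / (Hc + Lc) for one item.

    responses: one list of 0/1 item scores per student.
    group_fraction: share of the sample in each of the high and low groups.
    """
    ranked = sorted(responses, key=sum, reverse=True)
    # Assumption: group size is rounded to the nearest whole student
    size = max(1, round(len(ranked) * group_fraction))
    high_group, low_group = ranked[:size], ranked[-size:]
    hc = sum(row[item_index] for row in high_group)
    lc = sum(row[item_index] for row in low_group)
    if hc + lc == 0:
        return None  # nobody in either group answered correctly
    return hc / (hc + lc)


# Hypothetical responses of ten students to a three-item test
sample = [
    [1, 1, 1], [1, 1, 1], [1, 1, 1], [1, 1, 0], [1, 1, 0],
    [0, 1, 1], [0, 1, 0], [1, 0, 0], [0, 0, 0], [0, 0, 0],
]
d = item_discrimination(sample, 0)
print(d, "acceptable" if d is not None and d >= 0.67 else "not acceptable")
```

A value of 0.5 means the high and low groups answered the item correctly equally often, so the item does not separate stronger from weaker students.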
2.4.3 Measures to improve test reliability
Reliability may be improved by eliminating its sources of error. Hughes (1989) makes a list of recommendations to improve test reliability as follows:
• Take enough samples of behavior
• Do not allow candidates too much freedom
• Write unambiguous items
• Provide clear and explicit instructions
• Ensure that tests are well laid out and perfectly legible
• Candidates should be familiar with format and testing techniques
• Provide uniform and non-distracting conditions of administration
Furthermore, item difficulty and item discriminability indicate whether the reliability of an MCQs test is low or high (Henning, 1987). Therefore, the most straightforward way to improve test reliability is to design MCQs items with a good level of difficulty and a good discrimination value.
2.5 Summary
This chapter presents the theoretical framework for the study. In Section 2.1, the notion of a language test as a device for measuring people's ability is reviewed. Additionally, the purposes of language testing, types of language tests and criteria of a good test are discussed. Section 2.2 classifies achievement tests into two types and mentions considerations in designing final achievement tests. The definition, benefits and limitations of MCQs tests and the principles of constructing this type of test are dealt with in Section 2.3. The final section, 2.4, is concerned with test reliability, methods for estimating test reliability, and ways to make language tests more reliable.
Chapter 3: The Context of the Study
3.1 The current English learning, teaching and testing situation at HUBT
There are over 1500 second-year non-majors of English at HUBT. English is their required foreign language subject. Their levels of proficiency vary because of their different backgrounds, knowledge of the language, exposure to English, characteristics, learning attitudes, motivations and so on. These students have to cover a comparatively large amount of knowledge of English, as English carries the most credits of all subjects. In the English Department, HUBT, there are 62 teachers in total who work enthusiastically with the non-English majors to help them with the foreign language. They are all dedicated and qualified, with an average of five years' teaching experience.
With the aim of equipping students with the business English and communication skills necessary for their future careers, learning and teaching activities for the second-year non-English majors mainly focus on developing speaking and listening skills. However, the testing process is quite complicated and can be described as follows.
In semester 4 the students undergo daily assessment and take four tests altogether. Daily assessment includes checking vocabulary and speaking skills and doing tasks in the course book and practice files. The four tests comprise two paper tests and two computer-based MCQs tests. These tests are designed by teachers of the English Department, HUBT. The paper tests, given in the middle of the term (week 9) and at the end of the term (week 17), focus on listening, writing, grammar and vocabulary. The computer-based MCQs tests are administered on computers in week 19. Each test lasts 2 hours and includes 150 multiple-choice items emphasizing vocabulary, grammar, reading and functional language. The construction of the first test (hereafter achievement test 1) is based on the three units of the course book (Units 7, 8, 9) that the students have already learnt. The second one (achievement test 2) is designed on the basis of the last three units of the course book (Units 10, 11, 12). Items of the MCQs tests are selected by one person in charge of teaching English in the 3rd and 4th semesters for the second-year students.
The computer-based MCQs test administered at HUBT is similar to a paper-based one. The main difference is that the test is delivered on computers and students simply click the mouse to choose their response among A, B, C and D. This kind of test is different from computer adaptive tests, which are tailored to the particular abilities of the candidate. In other words, a computer-based MCQs test at HUBT is in fact an MCQs test delivered on computers.
The following chart illustrates testing guideline for semester 4:
[Chart: Semester 4 (12 credits), made up of the first score (6 credits) and the second score (6 credits)]
3.2.1 The course objectives
The training objectives in the 4th semester are to help students to:
- Further develop speaking and listening skills in business contexts
- Further develop the skill of reading business texts
- Consolidate basic grammar
- Broaden business vocabulary
- Further practice pronunciation
- Write business letters and memorandums
3.2.2 Business English syllabus
The syllabus is described in the following table:
1 220 7 Starting up- Vocabulary C.B 62-63 | P F 28-29
Note: C.B: Course book; P.F: Practice file; T.B: Teacher’s book
Table 3: The syllabus for the 4th semester (for non-English majors)
Time allocation for language skills and sections is illustrated as follows:
Skills Class numbers ( period ) Percentage (%)
practicing functional language)
The course book in semester 4 for the second-year students at HUBT is Market Leader Pre-intermediate, which was written by David Cotton, David Falvey and Simon Kent and published in 2002 by Longman. The book mainly focuses on three skills: speaking, listening and reading. It does not put great emphasis on grammar. The book is divided into 12 units, which are closely interrelated but each have a slightly different emphasis. The pattern of Starting up, Vocabulary, Listening, Reading, Language review, Skills and Case study is the same for all units. In the fourth semester, students study the last six units of this book (Unit 7 to Unit 12).
The course book checklists necessary for examining the tasks and content in the course book used for the construction of the achievement computer-based MCQs test 1 are in Appendix 1.
3.2.4 Specification grid and scoring scale for the final achievement computer-based MCQs test 1 in Semester 4
In order to evaluate students' achievement, the following grid is used to design achievement test 1:
sentences, multiple approx 18 choice words
factual test, | multiple approx 60 choice words
Table 5: Specification grid for the final computer-based MCQs test 1
The scoring scale for the test is designed by the teachers at HUBT and includes two levels as follows:
Pass: for students who score 50% or more on the whole test
Fail: for students who score below 50% on the whole test
Chapter 4: Methodology
4.1 Participants
The first group of subjects comprised 349 second-year students from 14 classes. Their test scores were collected in order to compute and analyze the internal consistency reliability, item difficulty and item discriminability. The second group, who answered a questionnaire, comprised 236 second-year non-English majors. Their responses to 14 questions were analyzed in order to investigate the students' attitudes towards the final achievement MCQs test 1.
4.2 Data collection instruments
The following instruments were adopted to obtain information for the study:
- Kuder-Richardson Formula 20 for internal consistency reliability estimate
- Item difficulty and item discrimination formulae mentioned in section 2.4.2
- A questionnaire survey for students (see Appendix 2)
The questionnaire was designed on the basis of Henning's (1987) list of threats to the reliability of a test. The objective was to find out students' attitudes towards the reliability of the current achievement MCQs test 1 in the 4th term. The questionnaire included 14 items and was written in Vietnamese to make sure that the informants understood the questions properly (see Appendix 2). These items focus on the characteristics of the test, the test administration and the test-takers.
4.3 Data collection procedure
The data about the test objectives and the course objectives were obtained from the English Department Bulletin, HUBT, issued in 2003. The data about the syllabus content were collected from the syllabus for the second-year students. The data about the test content and test format were obtained from a copy of the official current test from the English Department.
The data about the students' test scores and item responses were obtained from a file containing both the students' scores and their responses on the test, provided by the Informatics Department, HUBT.
The questionnaire data were collected from 236 second-year students who were randomly selected one week after they had finished the final achievement test 1.
4.4 Data analysis procedure
First, the test objectives were compared with the course objectives, the test content with the syllabus content, and the skill weighting in the test format with that in the syllabus, in order to determine whether they are compatible with each other.
Second, the reliability coefficient and the item difficulty and item discrimination indices of MCQs test 1 were analyzed in order to determine the extent to which the final achievement test 1 is reliable.
Finally, students' responses to the questionnaire were analyzed in order to find out their attitudes towards the MCQs test given to them.
Chapter 5: Results and Discussions
5.1 The compatibility of the objectives, content and skill weight format of the final achievement computer-based MCQs test 1 for the 4th semester with the course objectives and the syllabus
5.1.1 The test objectives and the course objectives
As mentioned in section 3.2.1, the course mainly aims to further develop students' essential business communication skills in speaking, such as making presentations, taking part in meetings, negotiating, telephoning and using English in social situations. Through many discussion activities, students build up their confidence in using English and improve their fluency. The course also aims to develop students' listening skills, such as listening for information and note-taking. In addition, it provides students with important new words and phrases and increases their business vocabulary. Students' reading skills are also built up through authentic articles on a variety of business topics. The course also helps students to revise and consolidate basic grammar, to improve their pronunciation and to perform some writing tasks on business letters and memorandums.
MCQs test 1 is designed to check what students have learnt about vocabulary, grammar, reading topics and functional language in Units 7, 8 and 9 of Market Leader Pre-intermediate. It is also constructed to assess students' achievement at the end of the course, especially to evaluate their results after completing these three units. In particular, the vocabulary and grammar sections, which make up 100 items, aim to examine the amount of vocabulary and grammar that students have been taught. The reading section of 30 items measures students' reading skills on business topics such as marketing, planning and managing. The functional language section of 20 items measures students' ability to communicate in daily business situations.
Clearly, the objectives of the course and of MCQs test 1 are only partially compatible with each other. The course provides students with knowledge of vocabulary, grammar and functional language and develops their reading skills, and MCQs test 1 is designed to measure this knowledge and these skills. However, the course objectives target both receptive and productive skills, whereas the test focuses merely on the receptive skill of reading and examines students' ability to recognize knowledge rather than to produce language.
5.1.2 The test item content in four sections and the syllabus content
* Grammar section
The grammar items in the test are shown clearly and specifically in the table below
tested items tested items
at in grammar part of the syllabus such as prepositions, connectors, comparatives, verb tense and verb form
tested items | of tested
to manage)
Table 7: Main points in the vocabulary section
A comparison with the vocabulary checklist (see Appendix 1) shows that 80% of the test items in the vocabulary section are the same as vocabulary items in the course book. That is to say, the test items stick to what students have learnt, such as noun-noun collocations relating to marketing terms, verb-noun collocations relating to ways to plan and verb-preposition collocations relating to ways to manage. Nevertheless, there are also items, such as verbs showing trends, multi-word verbs and adjectives related to profits, which are not included in the vocabulary part of the syllabus but appear in the reading articles in Units 7, 8 and 9.
* Reading comprehension section
In this section, there are 30 extracts, whose main topics are shown as follows:
tested items | of tested
department
Table 8: Topics in reading section
By comparing the reading section with the reading checklist (see Appendix 1), it can be observed that the topics in the MCQs test 1 such as managing, marketing and planning are highly relevant to the ones that the students have already learnt
* Functional language section
This section includes 20 items of business situations The function of language in these situations is presented in the following table:
Table 9: Items in the functional language sections
A comparison of Table 9 with the functional language checklist (see Appendix 1) shows that the test items broadly cover what the students have already been taught about business situations (for example, telephoning, meeting, and socializing and entertaining). However, language items for interrupting and making excuses are missing, although they are focal points in the syllabus.
To sum up, with regard to content, the items in the four sections of MCQs test 1 are to a large extent relevant to the course book.
5.1.3 The skill weight format in the test and the syllabus
According to the skill weight format in the syllabus illustrated in Table 4 (section 3.2.2), among the four parts (reading, vocabulary, grammar and functional language), reading has the highest proportion of skill weight (18%) and ranks number 1. Grammar ranks number 2 with 14%. Functional language ranks number 3 with 13%, and vocabulary is at the bottom with 12%.
However, in the test specification grid, the skill weighting of the four sections does not follow the same ranking as in the syllabus. The vocabulary and grammar sections, with 50 test items each, share rank number 1, whereas reading (30 test items) and functional language (20 test items) rank number 3 and 4 respectively. Thus it can be seen