MINISTRY OF EDUCATION AND TRAINING
DEPARTMENT OF SECONDARY EDUCATION
SECONDARY EDUCATION DEVELOPMENT PROGRAMME

TRAINING MATERIALS FOR ADMINISTRATORS AND TEACHERS
ON DESIGNING TESTS AND BUILDING A LIBRARY OF QUESTIONS AND EXERCISES
ENGLISH, LOWER SECONDARY LEVEL
(For internal circulation)

Hà Nội, December 2010

Compiled by: Đặng Trần Thinh

TABLE OF CONTENTS

PART ONE: GUIDELINES ON INNOVATING TESTING AND ASSESSMENT
1. Guidelines for innovating testing and assessment
2. Key tasks in directing the innovation of testing and assessment
PART TWO: DESIGNING TESTS
I - TEST-WRITING TECHNIQUES
1. Techniques for writing Vocabulary questions
2. Techniques for writing Grammar questions
3. Techniques for writing Reading questions
4. Techniques for writing Writing questions
5. Evaluating the tests
6. Testing and assessment against the standards of knowledge and skills
7. Questioning skills
II - SAMPLE TESTS FOR GROUP WORK
PART THREE: THE LIBRARY OF QUESTIONS AND EXERCISES
1. Question types
2. Number of questions
3. Requirements for the questions
4. Document format
5. Steps in writing questions for a subject
6. Using the questions in the question library
PART FOUR: GUIDELINES FOR ORGANISING TRAINING IN THE LOCALITIES
APPENDIX

PART ONE
GUIDELINES ON INNOVATING TESTING AND ASSESSMENT

Testing and assessing students' learning outcomes serves to monitor their learning process, to supply timely measures for adjusting the teacher's methods of teaching and the students' methods of learning, and to help students make progress and reach the objectives of education.

According to the Vietnamese Dictionary, testing means examining the actual situation in order to assess and comment on it. Testing thus supplies the data and information on which the assessment of students is based. Some researchers hold that "testing is a term for the ways of working that teachers use to collect information about the knowledge, skills and attitudes students display in their learning, in order to supply the data on which assessment is based". In a broad sense, testing means monitoring the learning process, and
in a narrow sense it can be understood as the instruments of testing, the tests given in examinations; "testing supplies the data and information on which assessment is based."

There are many definitions of assessment, given in the literature by many different authors. According to the Vietnamese Dictionary, "to assess is to judge value." Below are some definitions commonly met in the literature on assessing students' learning outcomes:

- "Assessment is the process of collecting and processing, promptly and systematically, information about the current state, the capability or the causes of the quality and effectiveness of education, measured against the objectives of education, as a basis for the policies, measures and educational actions that build on good results and correct shortcomings."
- "Assessment of students' learning outcomes is the process of collecting and processing information about students' level and ability in attaining the learning objectives, together with the effects and causes of that situation, in order to give teachers and the school a basis for the pedagogical decisions that help students learn better and better."
- "To assess means: to collect a set of information that is sufficient, relevant, valid and reliable; to examine the degree of fit between this set of information and a set of criteria appropriate to the objectives set at the outset, or adjusted during the collection of the information; with a view to making a decision."
- "Assessment is understood as the process of forming judgments about the results of work, based on analysing the information gathered and comparing it with the objectives and standards set out, in order to propose appropriate decisions for improving the situation and for adjusting and raising the quality and effectiveness of educational work."
- "Assessment is the process of collecting information and evidence about the object assessed, and of making judgments about the degree of attainment against given criteria, standards or learning outcomes" (the ARC model).
- "Assessment is the process of collecting information and evidence about the object assessed, and of making judgments about the degree of attainment against given criteria, standards or learning outcomes. Assessment may be quantitative, based on figures, or qualitative, based on opinions and values."

Assessment has three main stages: collecting information, processing it, and making a decision. It is a process that begins when we set the objectives to be pursued and ends when we make a decision related to those objectives, a decision which at the same time opens a new cycle of education. Assessment performs two functions at once: it is a source of feedback on the teaching and learning process, and it contributes to adjusting that process.

Standards are the essential basis of assessment; a standard is understood as the basic, minimum requirement to be attained when the quality of a product is examined. Assessment must satisfy the following requirements:

- Objectivity and accuracy. It must reflect the results exactly as they are, by comparison with the stated objectives, independent of the assessor's subjective wishes.
- Comprehensiveness. It must cover in full the aspects to be assessed, according to the stated requirements and purposes.
- Systematicity. It must be carried out continuously and regularly according to a fixed plan; regular, systematic assessment yields full, clear information and creates the basis for a comprehensive assessment.
- Openness and development. Assessment must be conducted openly and the results announced promptly, creating a motivation for those assessed to strive upward; it should reinforce what is good and limit what is bad.
- Fairness. Students who carry out the same learning activities at the same level and show the same effort must receive the same assessment results.

1. Guidelines for innovating testing and assessment

1) There must be close guidance and direction from education management at every level.

Innovating testing and assessment (KT-ĐG) is a necessary requirement of innovating teaching methods (PPDH) and of innovating education as a whole. Educational innovation must start from a review of practice, so as to build on strengths and overcome what is limited, backward and weak, and on that basis absorb and apply the modern achievements of educational science at home and abroad to the realities of our country. Education management at every level must direct the work closely, attaching importance to guiding the lower-level management bodies, the schools, the subject groups and the teachers (GV), from organisation and implementation through to the final review and evaluation of effectiveness. The measure of the success of the guiding measures is a change in the thinking and working habits of education managers and teachers, together with indicators of improved teaching quality.

2) There must be support from colleagues, above all teachers of the same subject.

The unit in which the innovation of teaching methods and of testing and assessment is carried out is the school and the subject taught, under its own concrete teaching conditions. Because the innovation of testing and assessment must be tied to the characteristics of each subject, importance must be attached to the role of the subject groups, which are the place where experience is exchanged and difficulties and obstacles are resolved. In organising the innovation, the role of experienced, outstanding teachers and of core subject teachers must be brought into play to support new teachers and teachers whose skills are not yet strong; no teacher should be left to struggle alone. Importance must be given to workshops, demonstration lessons and class observation, so that lessons can be drawn promptly and the effectiveness of concrete measures evaluated: tests of assured quality, for instance, combining the essay format with the multiple-choice format to suit the character of the subject.

3) Students' constructive opinions should be collected to perfect teaching methods and testing and assessment.

The innovation of teaching methods and of assessment brings results only when students play an active, proactive, creative part, find effective learning methods for themselves, and learn to study and to assess their own results independently. In a friendly pedagogical environment, collecting students' constructive opinions (to help teachers see themselves accurately, find ways to overcome limitations and shortcomings, and perfect their teaching and assessment) is both necessary and a practice that brings many benefits, strengthening the mutually reinforcing relationship between teacher and learner.

4) The innovation of testing and assessment must keep step with the related stages of the work and with improving the conditions that guarantee teaching quality.

The innovation of testing and assessment is bound up with the innovation of the teachers' teaching methods and of the students' learning methods, and combines internal assessment with external assessment. At a lower level, teachers may use tests prepared by others (by colleagues, supplied by the school, or drawn from the data sources on specialist websites) to assess the learning outcomes of their own classes. At a higher level, a school may ask another school, or an outside professional body, to organise the assessment of its own students' learning outcomes. The innovation is effective only when the teacher's assessment is combined with the students' self-assessment. After each test, teachers should set aside time to return the papers, guiding students in assessing their own work, marking their own papers, and commenting on the accuracy of the teacher's marking. Throughout teaching and assessment, teachers must know how to "exploit errors" to help students see their own mistakes clearly, thereby training their learning methods and their thinking. Directing this innovation must go together with raising the quality and capacity of the teaching staff, with investment in upgrading facilities (including teaching equipment), and with organising the emulation movements well so that they take full effect.

5) The role of testing and assessment in driving the innovation of teaching methods must be brought into play.

In the two-way relationship between the innovation of testing and assessment and the innovation of teaching methods, a strong innovation of teaching methods creates an objective requirement to innovate testing and assessment, keeping the two in step on the road to higher teaching quality. When the innovation of testing and assessment guarantees objectivity, accuracy and fairness, it creates the preconditions for a friendly pedagogical environment and a driving force for innovating both teaching methods and management. From there, it helps teachers and the management bodies determine teaching effectiveness correctly, giving teachers a basis for innovating their methods and giving management at every level a basis for appropriate management measures.

6) The direction of testing and assessment must be made a central part of the campaign "Each teacher is an example of morality, self-study and creativity" and of the emulation movement "Building friendly schools, active students".

In the school, teaching is the central activity through which the assigned political task, the mission of "cultivating people", is carried out. Teaching reaches high effectiveness when a healthy pedagogical environment and a friendly atmosphere are established and the active, proactive, creative role of the students is promoted ever more fully. The direction of the innovation of teaching methods in general, and of testing and assessment in particular, must therefore become a central part of the campaign "Each teacher is an example of morality, self-study and creativity" and of the emulation movement "Building friendly schools, active students". In the same relationship, each step forward of the campaign and the movement will create a driving force for the innovation of teaching methods and of assessment, toward the final goal of raising the overall quality of education.

2. Key tasks in directing the innovation of testing and assessment

2.1 Work to be organised and carried out

a) Education management at every level and the schools need plans for directing the innovation of teaching methods, including the innovation of testing and assessment, for this school year and the years to come. The plan must lay down clearly the content, the steps and procedures to be followed, the professional inspection and supervision, and strict measures of evaluation; final effectiveness shows in the results teachers obtain in applying it.

b) To clarify the scientific basis of testing and assessment, training must be organised so that the core teachers and the whole teaching staff firmly grasp the general education curriculum of their level: the objectives of the level, the structure of the curriculum, the curricula of the subjects, the educational activities and, above all, the standards of knowledge and skills (KT-KN) and the attitudes required of learners. The ingrained habit by which teachers rely only on the textbook for lesson preparation, teaching and assessment must be overcome; it leaves students' knowledge narrow and disconnected from real life, makes lessons dry and rigid, and leads to monotonous testing and assessment that does not stimulate students' creativity.

c) To give due weight both to raising awareness and to innovating the teachers' assessment practice, the school and the subject group must be taken as the units of implementation. From the 2010-2011 school year, the Departments of Education and Training should direct schools to carry out a number of professional-development topics (organised at several levels: the subject group, the school, clusters of schools, and the whole province or city):

- On studying the general education curriculum: the standards of knowledge and skills and the attitudes required of learners in each subject and educational activity; exploiting the standards in lesson preparation, classroom teaching, and testing and assessment.
- On active teaching methods: recognising active methods and how to apply them in teaching; the art of fostering students' feeling for and interest in learning; bringing into play the reinforcing relationship between the innovation of assessment and the innovation of teaching methods.
- On innovating testing and assessment: methods and techniques for assessing students' learning outcomes and how to apply them; how to combine the teacher's assessment with the students' self-assessment, and internal assessment with external assessment.
- On the techniques of writing tests and examination papers: techniques for essay tests and for multiple-choice tests, and how to combine the essay and multiple-choice formats sensibly to suit the content tested and the character of the subject; building the test matrix; knowing how to exploit open data sources, such as the library of questions and exercises on the specialist websites.
- On using the textbook: how teachers should use the textbook and the standards of knowledge and skills of the subject curriculum scientifically; how to use the textbook sensibly in class; and how to use it in testing and assessment.
- On applying ICT: applying ICT to collect materials, in classroom teaching, in testing and assessment, and in professional management, scientifically and without over-use.
- On guiding students in innovating their learning methods and in self-assessment, and on collecting students' opinions about the teacher's teaching and assessment.

Beyond these, depending on their own circumstances, schools may add
a number of suitable, practical topics that meet their teachers' needs.

d) On direction by the education management bodies and the schools.

As to the method of proceeding in each school, every topic should first be applied on a trial basis; reports of experience should then be written and discussed, conclusions drawn, and the successful experience spread, the effectiveness of each topic being evaluated through class observation and professional inspection. On the basis of what the schools have done, the Departments of Education and Training can hold regional or province- and city-wide workshops to consolidate and spread the good experience already distilled. After that, thematic professional inspections should be carried out to urge teachers to apply the topics and to evaluate their effectiveness.

2.2 Methods of implementation

a) The innovation of testing and assessment is an important, long-term task; it calls for concrete, in-depth measures of direction for each school year, and must avoid generalities of the kind that launch a lively emulation movement merely to run a "campaign" for a certain period. The innovation of testing and assessment is a practical professional activity of a highly scientific character in the school. It therefore requires, at one and the same time, raising awareness, supplementing knowledge and equipping skills among the teaching staff and the mass of students, and organising real innovation in action (in ways of thinking and ways of working) in step with the innovation of teaching methods, while attaching importance to guidance, inspection, supervision and the verification of results, so as to build the confidence to keep innovating. The plan of direction must set out objectives and concrete steps for directing the innovation so that final results are obtained and a firm professional routine is built up in teaching:

- First of all, teachers must be required, and enabled, to master the standards of knowledge and skills and the attitudes required of learners laid down in the subject curriculum, for these are the objective legal basis for testing and assessment;
- Awareness must be raised of the objectives, the role and the importance of testing and assessment, and of the objective necessity of innovating it so that it is objective, accurate and fair, in order to raise the quality of teaching;
- Teachers must be equipped with the strictly necessary technical knowledge and skills of testing and assessment in general, and of each of its formats in particular, above all the techniques of constructing tests. A varied range of question types should be used in a test, and the questions written correctly from the technical point of view and to a good standard. This step is of special importance because, in reality,
most teachers were not equipped with these techniques during their training at the teacher-training institutions, and not every locality or school has yet dealt with the problem well. A considerable number of teachers still have to feel their own way toward the multiple-choice format, so the quality of multiple-choice tests is not yet high, and not yet suited to the content tested or to the character of the subject; in not a few cases the multiple-choice format is over-used.

- The innovation of testing and assessment must be directed topic by topic, with the necessary depth, attaching importance to spreading good experience and to removing difficulties and obstacles through the professional meetings of the subject groups and of teachers of the same subject.

b) Management at every level must attach importance to preliminary and final reviews, to distilling experience, and to multiplying the advanced models (collectives and individuals) in the innovation of testing and assessment.

c) During the school year, management at every level should organise rounds of thematic inspection to evaluate the effectiveness of the innovation of testing and assessment in the schools, the subject groups and the teachers. Through them, lessons of direction are drawn; the units and individuals who do the work well are commended and rewarded; and manifestations of conservatism, reluctance to innovate, lack of responsibility and indifference are corrected.

2.3 Responsibility for implementation

a) Responsibilities of the Departments of Education and Training:

- To give concrete form to the Ministry of Education and Training's guidelines on innovating teaching methods and testing and assessment, making this direction a central part of the campaign "Each teacher is an example of morality, self-study and creativity" and of the emulation movement "Building friendly schools, active students", with the aim of building a healthy pedagogical environment and promoting the students' active role and their interest, initiative and creativity in learning;
- To draw up long-term, medium-term and annual plans for directing the innovation of teaching methods and of testing and assessment, giving concrete form to the central tasks of each school year:
+ Determining clearly the objectives to be reached, the content, the target groups, the methods of organising the training, and the forms of evaluating and verifying its results; integrating the evaluation of the training results with the annual classification of teachers and managers of educational institutions according to the promulgated standards;
+ Building a solid force of core teachers for every subject, and giving professional training in the innovation of teaching methods and of testing and assessment to those who carry out professional inspection;
+ Increasing investment in facilities and teaching equipment so as to create favourable conditions for the innovation;
+ Introducing the typical models, and organising exchanges that disseminate and amplify the effect of the exemplary models of innovation in teaching methods and in testing and assessment;
+ Organising teacher training well: the document "Guide to implementing the standards of knowledge and skills of the general education curriculum" issued by the Ministry should be put to organised use, so as to end as soon as possible the situation in which teachers rely only on the textbook for teaching and assessment and have neither the conditions nor the habit of studying and mastering the standards of knowledge and skills of the subject curriculum.

- To increase the exploitation of ICT in directing, and in informing about, the innovation of teaching methods and of testing and assessment:
+ Setting up a section of the Department's website on teaching methods and on testing and assessment, and building data sources: a library of questions and exercises, tests, lesson plans, experience, the guiding documents on the innovation, video recordings of model lessons, and so on;
+ Piloting forms of teaching and learning over the Internet (learning online) to support teachers and students in teaching, learning and revision for examinations;

- To direct the movement for innovating learning methods, so as to promote the students' active, proactive, creative role in learning and in moral training, linked with the struggle against school violence and against behaviour that breaches the regulations of the School Charter.

b) Responsibilities of the school, the subject group and the teacher:

- Responsibilities of the school:
+ To give concrete form to the guidelines of the Ministry and the Department of Education and Training on directing the innovation of teaching methods and of testing and assessment, putting them into the content of the school's long-term and annual plans with the requirements set out above. Targets must be set so that each step creates a real shift in the innovation; teachers must be guided persistently in implementation; reviews must be made promptly, lessons drawn and advanced models multiplied; and care must be taken to invest in the facilities and teaching equipment that serve the innovation;
+ To organise sensibly the collection of teachers' and students' opinions on the quality of each teacher's teaching and educational work; to evaluate closely and correctly each teacher's level and capacity for innovating teaching methods and testing and assessment, and on that basis to encourage and reward promptly those teachers who carry out the innovation effectively;
+ To organise teacher training well: (i) First of all, teachers must be organised to study and master the standards of knowledge and skills of the curriculum, to prepare teaching equipment actively and make their own teaching aids so as utterly to stamp out "bare-handed teaching", and to exploit their professional files, selecting materials that link lessons to real life so as to stimulate the students' interest in learning. (ii) They must study how to apply active teaching methods to the concrete conditions of their own classes; study
the psychology of the age group so as to apply it in educational work and in teaching; and study teaching skills and techniques and the skills of organising activities for the students. Classes in foreign languages and informatics should be organised for teachers so that they can master the teaching media, apply ICT and exploit the Internet in their study and in raising their professional level. (iii) Teachers must be guided in building up professional files and in exploiting them, so that teaching is readily linked to real life and the students' feeling for and interest in learning is fostered.
+ To organise forums on the innovation of teaching methods and of testing and assessment for teachers, and forums on the innovation of learning methods for students; to support teachers in the techniques of writing essay and multiple-choice questions, and of combining the two formats to suit the content tested and the character of the subject.
+ To inspect the subject groups and evaluate the teachers' pedagogical work: (i) inspecting the training and self-training of teachers, promptly encouraging creative effort, correcting manifestations of complacency and conservatism, and dealing with behaviour that shows a lack of responsibility; (ii) evaluating and classifying teachers against the promulgated standards objectively, accurately and fairly, and using the results as a basis for the policy of emulation and reward.
+ To coordinate with the Board of Representatives of Students' Parents in managing the students' study at home, fostering gifted students, helping the academically weak, and reducing repetition and drop-out: (i) maintaining order, routine and positive discipline in the school, resolutely combating school violence and breaches of the regulations of the School Charter, and consolidating a school culture that creates favourable conditions for continuing the innovation of teaching methods and of testing and assessment; (ii) organising the movement for innovating learning methods so as to stimulate the students' activeness, initiative and creativity, and collecting the students' feedback on the teachers' teaching and assessment.
+ To exploit ICT in directing the innovation of teaching methods and of testing and assessment: setting up a section of the school website on teaching methods and on testing and assessment, and building data sources (questions and exercises, tests, lesson plans, experience, the guiding documents on the innovation, video recordings of model lessons, and so on); piloting teaching over the school's LAN (learning online) so that outstanding teachers and specialists can support teachers and students in teaching, learning and revision for examinations.

- Responsibilities of the subject group:
+ The most important unit for organising regular professional training is the subject group. Importance should be attached to organising the
teachers' self-study and self-directed research, after which experienced or core teachers chair the discussion, answer questions and exchange experience. After each topic has been studied, class observations should be organised and lessons drawn from them, so as to support teachers in carrying out the innovation of teaching methods and of testing and assessment.
+ To organise the teachers to study and master the standards of knowledge and skills of the curriculum of the subject, and of the educational activity, they are in charge of; to organise class observation and the drawing of lessons regularly, educating a spirit of modest learning from one another and a readiness to share experience; and to discuss ways of solving new and difficult problems, promoting interaction and cooperation in professional work.
+ To require teachers to innovate the formats of testing and assessing the students. The types of assessment exercise should be diversified, for example: research-type exercises; assessment based on the products of the students' learning activity (portfolios of the students' best work; collections of pictures the students have gathered; essays, poems and newspaper articles collected by theme; the students' notebooks, and so on); assessment through the demonstration of the students' abilities (using a musical instrument, operating machines, and the like); assessment through presentations; assessment through cooperation in groups; assessment through the results of the group's joint activity, and so on.
+ To propose to the school management board an objective and fair professional evaluation and classification of the teachers, bringing into play the role of the outstanding teachers in helping weaker teachers and teachers new to the school.
+ To report and make proposals to the school on professional work and on teacher training; to identify advanced professional models and propose their multiplication; and to supply good lesson plans and good tests for colleagues to consult.
+ To evaluate correctly, and propose rewards for, teachers who carry out the innovation of teaching methods and of testing and assessment effectively.

- Responsibilities of the teacher:
+ Each teacher must settle on an attitude of willingness to learn and a spirit of lifelong study, free of complacency and self-satisfaction; take part of his or her own accord in training courses and in regular self-training, and stand ready to fulfil the duties of a core subject teacher if selected; and persistently apply what has been learned in order to raise the quality of teaching.
+ Teachers must strive for a real mastery of the content of the curriculum and of the innovation of teaching methods and of testing and assessment; train themselves in teaching skills and techniques (among them the skills of applying ICT and exploiting the Internet); build up their professional files and their professional standing among the teachers and the students; and unceasingly
raise their level in the fields that support professional work, such as foreign languages and informatics.
+ The teacher's innovation of teaching methods must go hand in hand with guiding the students to choose sensible learning methods, to learn to study, assess themselves and take charge of their own learning, and with modestly taking in the opinions of colleagues and students on one's own teaching and assessment in order to adjust them.
+ Teachers must take part in professional training; observe colleagues' lessons and receive colleagues observing their own, commenting frankly on colleagues' work and modestly accepting their comments in return; and take part of their own accord in teaching festivals, demonstration lessons, contests for outstanding teachers and reports of experience, so as to share and learn from experience and polish their professional capacity.

In the cause of educational innovation, the innovation of teaching methods and of testing and assessment is the key measure for raising the quality of teaching in particular and the overall quality of education in general. It is a requirement at once urgent and long-term; it demands close and continuous direction, and the persistent, creative effort of the teaching staff must be encouraged, drawing in the enthusiastic response of the mass of the students. To create the conditions for carrying out this innovation effectively, the level of the teaching staff must be raised step by step, and at the same time investment in facilities, above all teaching equipment, must be increased. The education management bodies must interweave the direction of the innovation of teaching methods and of testing and assessment closely with the organisation of the campaign "Each teacher is an example of morality, self-study and creativity" and the emulation movement "Building friendly schools, active students", so as to raise step by step the overall quality of education and meet the requirements of the industrialisation and modernisation of the country and of international integration.

PART TWO
DESIGNING TESTS

I - TEST-WRITING TECHNIQUES

In general, the language components and skills involved in communication are speaking, writing, reading, listening, vocabulary, grammar (or syntax) and pronunciation. On a higher level of language use, they may also include communication norms, socio-linguistic etiquette, pragmatics, and even cultural elements from a particular community. Nevertheless, the language requirements, that is, the goals for students at secondary level, are quite fundamental, according to the Handbook of Standards in Knowledge and Skills for Secondary Education. In
testing and evaluation, the components of language are and should be integrated into a unity in order to assess students’ competences Due to the limitations of this short training workshop in testing and evalation, the aforementioned components will be dealt with separately for the sake of understanding and undertaking Kĩ thuật biên soạn câu hỏi Từ vựng/Vocabulary questions The purpose of vocabulary tests is to measure the comprehension and production of words used in language skills In designing a test question, simply choosing difficult words or random Lists of words doesn't make much sense either Somehow we need to find out which words our students need to know The problem can be solved by referring to the Glossary that tells what words must be learned in each unit at the end of the course books Another way is to record the words that students misuse These become test items Still other sources are your textbook, reader, and exercise manual Finally, not overlook the words and phrases needed to run the class, such as "Take your seat" or "The assignment for tomorrow." 
These are useful test items at the beginning level.

Deciding how to test vocabulary is related to how we teach it. Most teachers today do not recommend having students simply memorize lists of words. Instead, they teach students to find the meaning of words through the context of the sentence, and they help increase comprehension by teaching important affixes (happy: unhappy / beauty: beautiful). In testing vocabulary, we also need to avoid presenting words in isolation. This part will illustrate a variety of ways to use context cues and word-building skills in testing vocabulary.

Checking vocabulary mastery can be adjusted to match your emphasis on oral or written skills. If improving conversation skills is your primary objective, you can test vocabulary by using aural cues ("What time is it?") and by requiring responses such as "It's nine o'clock." If, on the other hand, you are stressing reading, you can offer a written multiple-choice format:

He bought a cake at the _____. (A) bank, (B) bakery, (C) hardware store, (D) bookstore

1.1 Multiple-choice completion

A good vocabulary test type for students is multiple-choice completion. It makes the student depend on context cues and sentence meaning. This kind of item is constructed by deleting a word from a sentence, for example:

She quickly _____ her lunch.
(A) drank   (B) ate*   (C) drove   (D) slept

(The correct choice is marked with an asterisk "*".) After reading the sentence, students look at the group of words and choose the one that best completes what they have read. The following steps should be taken in writing multiple-choice completion items: (1) Select the words to be tested. (2) Find the right kind of sentence to put each word in (the sentence creating the context is called the stem). (3) Choose several wrong words to put the right word with (these wrong words are called distractors); three distractors plus the right word are enough for a written item. (4) Finally, prepare clear and
simple instructions. If this kind of test question is new to your students, it is advisable to prepare one or two examples.

Vocabulary Choice

When selecting vocabulary items, remember the suggestions given earlier. Realize, too, that sentence-completion items tend to give you a chance to test passive vocabulary. Since students have to recognize these words but not necessarily produce them, this is a good way to test more difficult vocabulary items than the usual ones that students study in class. But these should still be words or phrases that are useful to your students: words, for example, from their reading materials. Of course, words can be chosen from other sources such as newspapers, magazines, textbooks and other reference materials, if you have used these in your English class. Another point to remember is that usually only content words (nouns, verbs, adjectives, and adverbs) are included in vocabulary tests; function words (articles, determiners, prepositions, conjunctions, pronouns, auxiliary verbs) appear in grammar tests.

When using words not found in your classroom textbook, be careful of bias. Material from other sources could give the students who have encountered it a special advantage, since most students may not know the specific vocabulary related to those sources.

Context Preparation

With suitable words selected, our next step is to prepare contexts for them. Sometimes, especially for beginning students, more than one sentence is needed to help clarify meaning. You can prepare a two-line mini-dialog like those in the students' books to check the meaning of a word such as (paint) brush:

E.g.: "I want to paint, too!"
"All right. Use that _____ over there!"
*A brush   B pencil   C broom   D spoon

Another way is to find a passage (on your students' level) in which the word appears, remembering that some sentences are much more helpful than others. Consider a fairly difficult word: communicate. A passage from an English language reader might begin with the sentence: "Human beings communicate in many ways." This shows us only that communicate is a verb and that it can be performed by humans. Another sentence from the same passage limits the meaning of the word: "Some people communicate disapproval by holding their nose between their thumb and forefinger." This second sentence provides a better "frame" for the word. Other verbs such as interrogate, philosophize and investigate can be used as distractors with the second sentence. (An asterisk indicates the correct answer.)

E.g.: Some people _____ disapproval by holding their nose between their thumb and forefinger.
A interrogate   B philosophize   *C communicate   D investigate

A, B, and D are good distractors because none of them fits this context. Assume that a second, rather difficult word, superstitious, appears only in a general context: "Frank is certainly very superstitious." We see that a large number of words (such as old, tall, happy, kind; or ambitious, optimistic, courteous) could fit here. Since a better sentence is not available in the text, we can write one of our own: "Frank is so superstitious that he thinks you'll have bad luck if you break a mirror." Simplified slightly, it reads:

Frank is very _____; he says, "Break a mirror, and you'll have bad luck."
A ambitious   B optimistic   C courteous   *D superstitious

Finally, avoid contexts that are too difficult. The following sentence contextualizes the verb implies, which you may want to test, but notice how difficult it is to understand: "Present an analogy which implies the concept you wish to convey." The vocabulary item is much more easily understood in the following context: "He didn't actually say so, but he implied that you lied."

Distractor Preparation

There are two common ways to choose distractors. Experienced teachers often create their own. They can do so because they have developed a "feel" for the language that is appropriate for their students. But there is a second and equally good way: using student errors as distractors.

Teachers who create their own distractors should follow certain guidelines. Make sure the distractors are the same form of word as the correct answer.

E.g.: (A poor example) She had to help the _____ old man up the stairs.
*A weak   B slowly   C try   D wisdom

When distractors are not the same form as the right answer, students might answer the item correctly for the wrong reason. For example, some may know that an adjective is needed in this item, and they might notice that weak is the only adjective listed. (Note that words like strong, energetic and athletic are distractors that contrast with the old man's weakened condition. On the other hand, words such as wise, kind, pleasant or bent do not contrast as well and are therefore weaker distractors.)
Also be sure you don't give away the right answer through grammatical cues. Notice the effect of the article in the following example:

She needs to get up earlier, so she's buying an _____ clock.
A time *B alarm C watch D bell

In this question, meaning and grammar both indicate that alarm clock is right, because an is only used before a word beginning with a vowel sound. One way of correcting this would be to remove "an" from the sentence and use this form for the choices:
A a time *B an alarm C a watch D a bell

Multiple-choice options for any one question should be about the same level of difficulty, and ideally, the sentence context should not be difficult for students to read.

E.g.: They needed lots of training to operate such _____ equipment.
A easy *B sophisticated C blue D wise

Students might pick sophisticated simply because it contrasts in difficulty with the distractors, or because they can eliminate the three easy choices.

Also be sure not to include more than one correct answer.

E.g.: She sent the _____ yesterday.
A letter B gift C food D books

Actually, any one of the four choices is acceptable. The item would be improved by changing the verb to mailed. But we know that gifts, food, and books are also mailed. Therefore, we can use unmailable choices such as post office, friend, or courage. Another possibility is to choose a new sentence. But notice how this problem can arise again:

She wrote a _____ yesterday.
A letter B gift C friend D book

While "D" is unlikely, "C" is completely acceptable. So we still have two "correct" answers, and of course we should have only one. To eliminate slips like these, have someone else read through your items before you use them on a test.

At the beginning of the discussion on distractors it was suggested that you could write your own, or that you could use student errors. One source of student errors is the composition, and another is student speech. These are good sources because they involve actual communication. The difficulty is that they take a lot of time to sort through, and usually much of the information that we want is missing, because students can avoid words that they are not sure of. A more efficient way to find vocabulary errors is to look at homework and classroom exercises on vocabulary. But if the test that you're preparing is important enough, you can collect errors (for distractors) even more systematically: give the students sentence-completion items without the multiple-choice options, and simply have them fill in the blank in each sentence. You can then write down their wrong answers. You will also find some correct alternatives, but naturally you can't use these as distractors. For example, suppose you used "Frank is very _____; he says, 'Break a mirror, and you'll have bad luck.'" Besides superstitious, you might get words such as silly, wrong, stupid, liar, because, religious, knowing, lucky. The first three can't be used because they could possibly appear in such a sentence. The last three are adjectives, and so they seem usable. However, liar and because do not match distractor requirements. What do you do with these? You can probably use them anyway, particularly if more than one person wrote them down. Our guidelines are useful generalizations, but the errors made by your students reflect their exact level and their special way of "seeing" the language. Distractors chosen from these errors can test your class even better than those that you create yourself.

Instruction Preparation

The instructions for your test should be brief; students shouldn't have to spend a lot of time reading them. And they should be clear: anxiety can come from poorly worded questions, and resentment from misunderstood directions. Some teachers prefer to give instructions orally, but if any students come late, repeated instructions can distract those working on the exam. Keep in mind that instructions can really become a kind of "test," and oral instructions can amount to an unintended "listening test."
If you have used multiple-choice sentence-completion exercises in class, instructions can be very short: "Circle the letter of the right answer" or "Circle the letter of the word that best completes each sentence." Naturally the kind of directions given depends on your students' reading ability and how you want them to mark the test paper. Consider the following: "Read each sentence carefully. Then look at the four words below it. Choose the one that completes the sentence correctly. Put the letter of that word (A, B, C, or D) in the blank at the left."

You will find it helpful to give both oral and written instructions for students at the beginning level. For classes with very little skill in English, you can even give the instructions in the native language.

One final note: instructions can be made clearer by one or two examples. They are not given for practice; they are given to show how to answer the questions. Therefore, they should be simple enough that everyone can do them without any difficulty.

E.g.: They drove to work in their new _____.
A house *B car C office D street

If needed, a short explanation can follow the example: "We circle 'B' because 'car' is the only word that fits into the sentence."
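Several of the guidelines in this section - exactly one correct answer, a full set of options, no duplicates - can be checked mechanically before a test is typed up. The sketch below is a minimal illustration under those assumptions; the Item class is invented here for the example, not a standard tool.

```python
# Minimal representation of a multiple-choice completion item, with
# basic sanity checks drawn from the guidelines in this section.
from dataclasses import dataclass

@dataclass
class Item:
    stem: str      # the sentence with a blank
    options: list  # the choices shown to students
    key: str       # the single correct answer

    def problems(self):
        """Return a list of guideline violations (empty if none found)."""
        issues = []
        if len(self.options) != 4:
            issues.append("need exactly four options")
        if self.key not in self.options:
            issues.append("key missing from options")
        if len(set(self.options)) != len(self.options):
            issues.append("duplicate options")
        return issues

item = Item("They drove to work in their new _____.",
            ["house", "car", "office", "street"], "car")
print(item.problems())  # [] - the example item passes these checks
```

Checks like these catch slips of format only; whether a distractor is truly wrong in context (the "She sent the _____ yesterday" problem discussed earlier) still needs a human reader.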
As an alternative to sentence completion, we can use multiple-choice cloze. Cloze tests are made from stories or essays by deleting words at regular intervals. Students have to write in each blank the word that they think belongs there, or select the right choice from a group of options given. Multiple-choice cloze works like regular multiple-choice sentence completion, but usually both content words (like school or run) and function words (such as the or in) are deleted. In addition, cloze tests provide more context - often more than one paragraph. Multiple-choice cloze can test vocabulary when only content words are deleted.

E.g.: After the capture of Troy, Ulysses set out for his (A neighborhood B continent *C homeland D street) many miles away. But so many strange (A sights *B things C places D people) happened to him on his journey that ten (*A years B times C roads D cities) passed before he reached Ithaca.

Advantages of Multiple-Choice Completion
- It helps students see the full meaning of words by providing natural contexts. Also, it is a good influence on instruction: it discourages word-list memorization.
- Scoring is easy and consistent.
- It is a sensitive measure of achievement.

Limitations of Multiple-Choice Completion
- It is rather difficult to prepare good sentence contexts that clearly show the meaning of the word being tested.
- It is easy for students to cheat by copying what others have circled.

1.2 Completing sentences with the correct form of a given word (Word formation)

Word-formation items require students to fill in missing parts of words that appear in sentences. These missing parts are usually prefixes and suffixes - for example, the un- in untie or the -ful in thankful. A related task is to use words like the following in a sentence and have students supply missing syllables of any kind, such as the rel- in relative or the -ate in deliberate. We can see, then, that there is a different emphasis in simple-completion tests than in those we have just looked at. Context is still useful, but the emphasis is on word building. Moreover, this is a test of active, not passive, skills.

The steps in preparing a simple-completion vocabulary test are similar to those mentioned in the previous sections, but with one difference: no distractors are needed. Here are the steps: (1) List the prefixes and suffixes that you have taught to your students. Then match these with content words that they have studied (including even their passive vocabulary). (2) Prepare sentences that clarify the meaning of these words. (3) Then write your instructions and examples. If the test is quite important, try it out ahead of time. You can have other teachers take it, or possibly native English speakers. Then revise it and use it in your class.

Vocabulary Choice

Perhaps your students have studied the -ly ending used with many adverbs (and some adjectives). They might not know adjectives like manly or adverbs like extremely. But quick is part of their vocabulary, and so it can be used in testing the suffix (quickly). They might also know the negative prefix un-. Recalling that cooperative is part of their passive vocabulary, you decide to challenge them: you want to see if they can produce uncooperative on the test.

Context Preparation

Student success on the exam will depend in part on your sentence contexts. For example, one simple-completion vocabulary test included a stem (base word) requiring -ous. The sentence read, "He was a very nerv_____ person." But a number of these rather advanced students wrote in a "y" instead - an unexpected but correct ending. They thus produced nervy, which means "bold" or "offensive." The context did not show that the person intended was "worried" or "timid" (nervous), so nervy had to be accepted.

It is also possible to check student knowledge of when not to add a prefix or suffix. Compare the following:

My teach_____ is very helpful.
Did she teach_____ you anything?
In the first sentence, the suffix -er is required. In the second sentence, no suffix is needed. Such sentences are not left empty; students must put an "X" in the blank. But notice again how careful we must be in writing our sentences:

E.g.: That was a care_____ answer.

Note that either careful or careless can be used. So sometimes more context is needed to clarify which word we mean:

E.g.: Yesterday he got on the wrong bus. So today he was care_____ to find the right one.

Another very popular vocabulary test type is the stem-first procedure. An advantage of this type is that many words need spelling changes when suffixes are added. Following is an example:

She has a _____ new dress. (BEAUTY) (Answer: beautiful)

Advantages
- It reflects teaching approaches.
- It is generally faster and easier to construct than are items with distractors.

Limitations
- Fewer words can be tested this way than with multiple choice.
- There is some difficulty in avoiding ambiguous contexts.

Tasks

1. The following sentences contain examples of distractor difficulties. Identify the weakness in each item. Then correct it.
a. Do you need some _____ to write on? A paper B pen C table D material
b. The mouse _____ quickly away. A very B little C baby D ran
c. I think he'll be here in an _____. A hour B day after tomorrow C weekend D soon
d. They _____ me to get up right away. A asked B needed C told D wanted
e. Choose the odd one out. A pleased B nervous C study D interesting

2. Prepare five test items from words in your students' text, or use the following vocabulary words: truth/weekend/secret/ridiculous/perfume.
a. For each word write a sentence context that reflects the meaning of the word as clearly as possible.
b. Prepare three good distractors for each test item.
c. Write simple, clear instructions, and include an example.

3. First, supply a word with a prefix or suffix for each blank in the following sentences. Then prepare word-completion items. Example: It was a most _____ mistake. (Answer: deplorable/regrettable/inexcusable, etc.) (It was a most deplor_____ mistake.)
- When you write your check, make it _____ to my sister.
- Please wipe your hands on that _____ cloth.
- The police arrested him for _____ the riot.
- The _____ of the volcano destroyed several villages.
- The boy didn't _____ his shoelaces before taking off his shoes.

2. Techniques for writing grammar questions

Grammar tests are designed to measure student proficiency in matters ranging from inflections (bottle-bottles, bake-baked) to syntax. Syntax involves the relationship of words in a sentence, including matters such as word order, use of the negative, question forms, and connectives.

As indicated earlier, this material covers vocabulary, grammar, and pronunciation tests. Of these three, grammar tests seem to be the most popular. There are several reasons for this: much English teaching has been based on grammar, and, unlike various measures of communicative skills, there is general agreement on what to test. Grammar items, such as auxiliary verbs, are easy to identify, and errors in grammar can be quickly spotted and counted. As with vocabulary exams, either passive or active skills can be checked. Also, grammar tests can be tailored to beginners or to advanced learners. Of course, in testing grammar, we don't pretend to measure actual communication. But we can do a good job of measuring progress in a grammar class, and we can diagnose student needs in this area.

2.1 Multiple-choice completion

The test type presented in this part includes an incomplete sentence stem followed by four multiple-choice options for completing the sentence. Here is an easy sample item:

E.g.: She is _____ her breakfast.
*A eating B ate C eats D eaten

While multiple-choice completion is an efficient way to test grammar, teachers need to be cautioned about the temptation to use this kind of item for all of their testing needs. Many people are very excited about objective tests, feeling that multiple-choice objective exams in particular should be used to test everything. However, any given test is a kind of tool; it may be very useful for some jobs but not for others. For example, while multiple-choice tests can be used successfully in testing grammar, they don't seem to work as well in testing conversational ability.

Preparing multiple-choice completion grammar items follows about the same procedure as that described in the previous part for writing multiple-choice completion vocabulary items: (1) choose the grammar points that you need to test; (2) prepare the right kind of sentence context (or stem) for the grammar structure; (3) select three logical distractors; and (4) prepare clear, simple instructions.

Grammar Choice

Choosing grammar points to test is usually rather easy: just determine what structures you have taught since the last test. The results on quizzes or homework assignments can show those things that students have learnt well and those things that need reviewing. The points they know well can generally be ignored; a few of these, however, could be included at the beginning of the test to encourage students.

A related matter is how to give different "weight" to various grammar points. Let's say you spent three times longer on modal auxiliaries than on two-word verbs. You could prepare two or three times as many questions on the modals. This is part of the planning that is necessary: before starting to write the questions, you need to decide how many of each grammar type to include.

Context Preparation

Assuming that you have decided what points to test, what multiple-choice type to use, and how many questions to prepare, you are now ready to start writing the items. First, choose a structure and then use it correctly in a sentence. Remember, a good context is very important!
Sometimes only a few words are enough, such as "I don't want to go" (in testing "to plus verb"). But notice how much context is needed for other grammar points. In the following sentences, must is used to express a conclusion or deduction: "Jimmy hasn't eaten anything, and he won't talk or play. He must be ill." When many of your test items require a lot of context like this, you may consider using a two-sentence approach.

Distractor Preparation

We are now ready for distractors. You will recall that these are the incorrect options which we put with the correct word or phrase to complete the sentence. Experienced teachers usually have a good sense for what to use, but inexperienced teachers need some help. For example, "could of" has sometimes been used as a distractor for "could have." This won't work, because it is a native English-speaker error and is almost never made by non-native English speakers. Also, avoid using distractors that sound alike. Look at this item from an inexperienced teacher's test:

E.g.: _____ the ones who know the answers.
A They are B There C They're D Their

This is really just a spelling item. It might be used on a writing test, but not on a grammar test. Another problem is that both A and C are correct options.

It is also a good idea to avoid items that test divided usage, or items that only test different levels of formality.

E.g.: You can get it from the lady _____ he sold it to.
A which B who C whom D why

Debatable items like this just confuse non-native speakers. Notice that choice "C" is in the "correct" case, but choice "B" is closer to what native speakers would actually say. The easiest way of saying the sentence isn't even provided: dropping the relative pronoun altogether ("You can get it from the lady he sold it to"). In addition, the who/whom choices tend to stick out as the obvious pair to choose from, and "why" is a very weak distractor. But even with this help, how can the inexperienced teacher write distractors that sound right?
One way is to look at the errors that students make on exercises or cloze passages. These errors can be used as distractors. Another source of distractors is errors from students' writing.

It is good not to confuse or tire your students by having them reread unnecessary material. Take out any repeated words from the distractors and put these in the stem.

E.g.: If I had a new fur coat, _____.
A I showed it to everyone B I'd show it to everyone C I've shown it to everyone D I'll show it to everyone

(revised) If I had a new fur coat, _____ it to everyone.
A I showed *B I'd show C I've shown D I'll show

Also, it is best not to mix categories like the following:

E.g.: They just bought _____ furniture.
A a few B several C some D with

(revised) They just bought _____.
A a few furnitures B several furnitures *C some furniture D a furniture

The revised example requires recognition of furniture as a non-count noun and recognition of the right determiner to use with this word. In the original item, choice "D" (with) is unsatisfactory because it is a preposition and not a determiner.

Alternate Form of Multiple-Choice Completion

Unlike previous test items in this section, error identification does not require students to complete a sentence. Instead, they have to find the part containing an error. This kind of test question is particularly useful in testing grammar points for which there are few logical options, such as the choice between few and a few, little and a little, some and any, much and many, or this and that, etc.

E.g.: Rain is (*A) slight acidic even in unpolluted air, because (B) carbon dioxide in the atmosphere and (C) other natural acid-forming gases (D) dissolve in the water.

In addition to having students identify the error, it is also possible to have them give the correct form.

Advantages of Multiple-Choice Completion
- It is impossible for students to avoid the grammar point being evaluated.
- Scoring is easy and reliable.
- This is a sensitive measure of achievement (and, like other multiple-choice language tests, it allows teachers to diagnose specific problems of students).

Limitations of Multiple-Choice Completion
- Preparing good items is not easy.
- It is easy for students to cheat. (It is possible to create a second form of the test by rearranging the items, but this is time-consuming for the teacher.)
- It doesn't appear to measure students' ability to reproduce language structures (although in actual fact this kind of test is a good measure of the grammar subskill).
- It can have a negative influence on class work if used exclusively. (Students may see no need to practice writing if tests are objective.)

2.2 Sentence Completion

Simple-completion items used for testing grammar consist of a sentence from which a grammatical element has been removed. An elementary item would be "He went to _____ school." A more advanced open-ended item would be "I would have gone if he had invited me." Students may be asked to decide from the context what word or phrase to write in the blank; or they may be asked to write in an option from a list, or to change the form of a key word (such as write to wrote).

There are three steps to follow in preparing sentence-completion grammar tests: (1) select the grammar points that need to be tested; (2) provide an appropriate context; and (3) write good instructions. But it is also necessary to decide what kind of sentence-completion question to use. Some are easier than multiple-choice completion, and some are much more difficult. Most of this section will deal with the three basic kinds of sentence-completion grammar tests: (1) the option form, (2) the inflection form, and (3) the free-response form. These three forms vary not only in difficulty but also in objectivity and in the degree of active or passive response that is required. As a result, you can tailor the test to the students that you have.

Your advance planning will determine which general question type to use. If you need to check mastery of many structures, you will probably select multiple-choice
completion. If you have to test sentence combining, word order, or sentence transformation skill, you can use a guided writing procedure.

E.g.: Combine these two sentences: She knew something. He loved her.
(Answer) She knew that he loved her.

But if you want a quick way to check the mastery of a few specific points for only one or two classes of students, simple completion is ideal.

E.g.: He _____ (sing) very well when he was a child.

The Option Form

The easiest simple-completion items are like multiple-choice questions with only two options.

E.g.: Directions: Complete the following sentences with "do" or "make."
He _____ a lot of money last year.
I always _____ my best.

This option form can easily be adapted from exercises in your textbook. Sometimes a new pair of options is given for each sentence.

E.g.: The woman _____ for the tragedy. (was crying, cry)
The magician performed some _____ tricks. (astonishing, astonished)

Often there are three or four choices listed, and at times even more. For example, here is a nine-option completion item from an English test. Students choose the best question word from among the following: who, whom, where, what, when, why, how many, how much, how.

E.g.: QUESTION - ANSWER
1) _____ did the clock stop running? - At twelve o'clock.
2) _____ were you late? - We ran out of gas.

The Inflection Form

Testing the mastery of inflections provides for a productive response. These items vary from simple comparatives to verb tense questions:

E.g.: 1) He's the _____ (tall) person in the class.
2) They _____ (be) in Colorado last week.

When students have to write in their own answer like this, you have to be careful about context. For example, new teachers might think that if they write a sentence like "He _____ (sing) a song," only "is singing" will fit. If they're testing the progressive, they may be disappointed to find that several other answers are possible, such as sings, sang, has been singing, had been singing, will sing, etc. This problem can be solved by giving part of the verb or adding more context.

E.g.: He is _____ing (sing). (or) He _____ singing now. (Add one word.)
or "What's Tom doing now?" "Oh, he _____ (sing)."

Another technique is to use a separate blank for each word in the verb phrase.

E.g.: He _____ _____ (sing) now.

The Free-Response Form

Sometimes a few simple terms can be used, if everybody in the class knows what they mean. The free-response form illustrates how that common terminology can occasionally be used. Here are some sentences from an English test:

Example: Add a question tag to these sentences:
1) Hamlet was indecisive, _____?
2) Polonius knew a lot of aphorisms, _____?

It is good to use an example to make sure that no one is confused. Here are some illustrations used widely in English tests:

Example: Directions: Write in the missing part of the two-word verb. "What time did he get _____ this morning?"
Directions: Write in a two-word verb that has the same meaning as the word provided in the brackets. "Jack _____ (arose) later than usual."

The following example illustrates free response with a minimum amount of contextual control. (Here the conditional is being tested.)
"You would get better sooner if _____."

These take longer to correct than other completion types, and they also take more language skill to evaluate properly. Consider a few acceptable ways that students could complete the example: "if you dressed warmer," "if you'd see a doctor," "if Mother were here," "if we had some medicine for you." Obviously, this last kind of simple-completion question requires the most real productivity of all. It also provides flexibility, and it is perhaps the most communicative.

Advantages of Sentence Completion
- These are generally easier to prepare than are multiple-choice items.
- These give the appearance of measuring productive skills, because some items permit flexibility and original expression.
- There is no exposure to incorrect grammatical forms.
- These provide a sensitive measure of achievement.

Limitations of Sentence Completion
- These are usually more time-consuming to correct than are multiple-choice questions.
- Not only can poor penmanship be a problem, but also "irrelevant" errors beyond those being tested.
- Occasionally students can unexpectedly avoid the structure being tested.

Cloze

Cloze tests are prose passages, usually a paragraph or more in length, from which words have been deleted. The student relies on the context in order to supply the missing words. At the present time, no single test format is more popular than the cloze procedure. It is easy to prepare and rather easy to score. Teachers like it too because it is integrative - that is, it requires students to process the components of language simultaneously, much like what happens when people communicate. Moreover, studies have shown that it relates well to various language measures, from listening comprehension to overall performance on a battery of language tests. In brief, it is a good measure of overall proficiency.

But as we have seen in the introductory part, proficiency tests such as the cloze have some limitations. For one thing, they usually don't measure short-term gains very well. A good achievement test could show big improvement on question tags studied over a two- to three-week period, but a proficiency test generally would not show much if any improvement. Fortunately, a simple change in the cloze format can overcome this problem.

The cloze is simply a story or essay from which a number of words have been deleted. We fill in the missing words much as we do while conversing. In a noisy restaurant, we guess at the words that we don't hear by relying on the whole conversation. So in cloze tests, the overall meaning and surrounding grammar help us replace the missing parts. Sentence-completion vocabulary and grammar items are similar in a way to cloze tests; cloze passages simply have much larger contexts.

Preparing a Cloze Test

The steps in preparing a cloze test are simple: (1) select an appropriate passage (e.g., from the reading material in your English class); (2) decide on the words and number of words to take out; (3) write the instructions and prepare an example.

The first and most important step is to choose a text of the right level. Choosing a passage that is rather difficult for your students will simply frustrate them, so choose a passage that they can read with little or no difficulty. You can even use something that has already been read and discussed in class; they will not be able to answer the test from memory. The length of the selection depends on the number of blanks you plan to have, but most are not longer than 250-300 words. This means that you will often have to use only part of an article or story. When you do this, be sure your excerpt makes sense by itself. You might have to compose a sentence or two of your own to introduce or end your selection.

Also, there are a few things to avoid. Usually we ignore a passage that is full of proper nouns, numbers, and technical words: when these are left out, it is often impossible to know what to write in. Also, we usually do not pick an article containing a lot of quoted material; the quoted material might not be at the same level as the rest of the passage. But if there are only a few trouble spots in a good selection, you can edit or rewrite them.

With the passage chosen, you are ready to decide which words to take out. Leaving the first sentence or two and the last one intact will help students understand the overall meaning. How many words need to be deleted altogether? It depends on your students. Get by with 15 to 20 blanks at most. Naturally, the more blanks the test has, the more stable and reliable it is.

Advantages of Cloze
- It is easy to prepare and quite easy to score.
- It is a good measure of integrative English skills.
- Standard cloze is a good measure of overall ability in English.

Limitations of Cloze
- It is not a sensitive measure of short-term gains.
- It is difficult for teachers who are non-native English speakers to choose acceptable equivalent words.

Tasks

I- MULTIPLE-CHOICE COMPLETION

Each of the following items has some defect. Indicate what the difficulty is, and then correct it by rewriting the question.
a. "Eva nearly won that race!" "Yes, _____."
A she ran well, did she? B she ran well, wasn't she? C she ran well, was she? D she ran well, didn't she?
b. While she _____ the house, her children were playing outside.
A has been cleaning B cleaned C has cleaned D was cleaning
c. He has lived in this town for only a week and he already has _____ friends.
A few B a few C not many D your
d. "Mr Adams, _____ I be excused from class tomorrow?"
A ought to B can C may D wouldn't

Construct a multiple-choice completion question for each of the following grammar points, or choose five grammar points that you have taught to your students. Give the instructions and the answers.
a. The subordinator although (as in "Although he was tired, he walked to work").
b. Subject-verb agreement with some form of the verb be (as in "One of the boys was here last night").
c. Since as an expression of time (as in "They've been here since 10:00").
d. A question tag (as in "She works hard, doesn't she?").

II- SENTENCE COMPLETION

Write down as many words as you can (not phrases) that appropriately complete this sentence: "He walked _____ the house."

Prepare four two-option form items testing the too/enough contrast (that is, "too big to" versus "big enough to"). Prepare good contexts. Include the answers.

Prepare four verb-inflection items - a different verb and verb tense for each item. Include the uninflected form. Supply the answers. (Example: She _____ [drink] it this morning.)

Write four free-response items. Each one should test a different grammar point. (One of these can be the conditional, as in the example.)
guessed right But if we subtract the number that he missed [25] from the number he got right [75], the result is 50-the number of items that he actually knew the answer to A "guessing correction" can also be made for regular multiple-choice tests Normally we use a correction only when the test is timed, and many of the students not have a chance to finish This encourages guessing The correction is made by dividing the number of items wrong by the number of (67) distractors and subtracting this from the number of correct answers For fouroption items this is the number right minus the number wrong divided by 3, or R-w/3 For three-option items, this is R-w/2 [remember that of the three choices, one is the correct answer and two are distractors].) 33 (68) A second useful approach for testing beginning students who can read simple passages is the matching technique This procedure simply has students match material in the passage with material in the question It is like "copy work" in beginning writing classes For example, a question such as the following might be written on the "jazz" passage: “What played an integral part in fashioning folk music?” A work songs B jazz Americans C a new environment D Notice that the question and the answer are lifted right from the original passage This gives some practice in handling questions, but little comprehension is required A variation on this procedure asks students simple questions on dialogs that they have practiced in class E.g.: ANN: "Mr Martin never works in'the garage." KEN: "Yes, he does He worked in the garage last Saturday!' When was he working? 
A. in the garage   B. Mr. Martin   *C. last Saturday

We can see that short test passages like this often concentrate on grammar or vocabulary.

Question Techniques for More Advanced Students

Standard Multiple-choice

There are many ways to test reading. One of the best is a reading passage followed by multiple-choice questions. We have already mentioned the variety of sources available. Naturally you can use readings from your own English resource, but be careful to give everyone an equal chance to succeed.

The number of passages and the length of each depend on your particular test. Let's assume the whole exam is on reading. Multiple-choice questions can be asked on very short passages of 35 to 75 words. Quite a few of these can be answered in one period. Student level and passage difficulty naturally influence how many can be done. Usually longer passages will run from 100 to 300 words. This is sufficient since more than one passage will appear on a single test. Selections for less advanced students will run from about 100 to 200 words. Those for more advanced students will generally range from 150 to 300 words.

Selections with considerable variety, detail, and contrast are easiest to prepare questions on. Normally you will only be able to write roughly three questions per hundred words, or four at the most. More than this usually results in looking at insignificant details. Fewer than this is inefficient. Finally, in order not to give some students a special advantage, use at least three to five passages from different sources.

Students who read fairly well can answer about a question a minute, including the reading of the passage. Slower students and those reading difficult technical material may need almost twice as much time. It is a good idea to try some sample passages in class (of the same length and level of difficulty that you plan to use). This prepares students for the instructions and types of questions on the test, and it helps you decide on how much time to allow. Some
students will take all the time you give them, so have the students raise their hands when they have finished the in-class practice test. You can allow time for at least 80 percent to finish.

Plan to use a variety of types of questions on your reading test. One very important type is the paraphrase. Look at the following example:

Karate is a science of unarmed self-defense and counterattack. It is a sort of "weapon in an empty hand." In many U.S. cities thousands of young people are developing their minds as well as their bodies by learning karate.

The key portion that we will use for our paraphrase question is "In many U.S. cities, thousands of young people are learning karate." The paraphrase of this is "Karate is being taught to many young Americans." Every word but "karate" is different. Here is the resulting question:

In this passage we learn that karate _____.
*A. is being taught to many young Americans
B. and training for the mind are both being taught
C. can remove a weapon from someone's hand
D. is used to start a fight

A second type of question, the synthesis item, requires integration of ideas from more than one sentence, sometimes from the entire selection. For example, in one simple story used to test reading comprehension, a lady stops at a restaurant to eat. But she looks confused when it is time to pay her bill. Then she says, "I can't pay the bill. My purse is gone." At this elementary level, students simply have to complete "The lady couldn't pay for her lunch because _____." by choosing this option: "her purse was lost." In short, they just need to pull together the information found in the two sentences.

For a more advanced example of the synthesis question, we will look at the full version of the "karate" passage:

Karate is a science of unarmed self-defense and counterattack. It is a sort of "weapon in an empty hand."
In many U.S. cities thousands of young people are developing their minds as well as their bodies by learning karate. "I've been taking karate lessons for five years now," says sixteen-year-old Bobby Hamilton of Columbus, Ohio, "and it's great! I find myself doing things that I thought I could never do!"

Paula Jones has just begun taking karate lessons at her high school in Philadelphia. She feels that she has more self-confidence because of the lessons. "I am more aware of myself," she says. "I already have learned so much about self-control. I know everything in life is not going to be easy. Karate helps prepare me for the times when I'll have to meet my problems face to face."

1) A good title for this selection would be _____.
A. Americans Import a Japanese Sport
B. Karate-Weaponless Protection for People of All Ages
C. School Children Enjoy a New Kind of Physical Education Class
D. Self-Perfection through Self-Protection

A third kind of question is the inference item. It requires students to see implications in what they read. Here is another example from an English test:

[Two men, Gerard and Denys, were traveling in a forest. They had just been forced to kill a large baby bear in self-defense.]

Then Gerard heard a sound behind them. It was a strange sound, too, like something heavy, but not hard, rushing over dry leaves. It was a large bear, as big as a horse, running after them a short distance away. As soon as he saw it, Gerard cried out in fear, "The baby bear's mother!"
The mother bear was probably running because it _____.
A. was afraid of Gerard and Denys and wanted to escape
*B. wanted to hurt those who had killed its baby
C. was chasing a horse, a short distance away
D. enjoyed running, like horses and other animals

Various kinds of problems need to be avoided when preparing reading tests like these for intermediate and advanced students:

(1) Tests at these levels should not ask for words or phrases exactly as they appear in the passage.

(2) In addition, they should avoid illogical distractors like those in the following item:

E.g.: In this study, the high divorce rate was caused by _____.
A. the great kindness of husbands and wives to each other
B. heavy drinking by the mate who was working
C. positive relationships of parents and children
D. having lots of money to pay bills with

(3) They shouldn't be written in such a way that they can be answered from general knowledge:

E.g.: In the article, we learn that Adolf Hitler was _____.
A. a Russian spy
B. a French ballet dancer
C. an American baseball player
D. a German dictator

After you prepare an important reading test, you could try it out in the following way: Copy down only the questions and multiple-choice options. Then have another teacher's class or a group of your friends volunteer to "take" the test, without the reading passage. Those items that nearly 50 percent or more get right are probably poorly written: Examinees may be depending mostly on logic or general knowledge. (See the discussions of multiple-choice questions in the earlier parts for additional cautions in writing multiple-choice questions.)
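The guessing corrections described earlier in this part (R - W for true-false items, and R minus W divided by the number of distractors for multiple choice) are easy to compute. Here is a minimal sketch in Python; the function name is ours, invented for illustration, not taken from the text:

```python
def corrected_score(num_right, num_wrong, num_options=2):
    """Score corrected for guessing: the number right minus the number
    wrong divided by the number of distractors (options minus one).

    For true-false items (num_options=2) this is simply R - W.
    """
    num_distractors = num_options - 1
    return num_right - num_wrong / num_distractors

# A student who knows 50 of 100 true-false items and guesses the rest
# typically gets about 75 right and 25 wrong; the correction recovers 50.
print(corrected_score(75, 25, num_options=2))   # 50.0

# Four-option multiple choice: R - W/3
print(corrected_score(60, 30, num_options=4))   # 50.0
```

The same function covers both cases from the text, since the true-false item is just the two-option special case.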
Advantages of Passage Comprehension

This is the most integrative type of reading test. It is objective and easy to score. It can evaluate students at every level of reading development.

Limitations of Passage Comprehension

Passage comprehension is more time consuming to take than other kinds of tests. One pitfall in preparing this kind of test is utilizing questions that deal with trivial details. Passage comprehension tests which use questions on trivial details encourage word-by-word reading.

Bài tập/tasks

PASSAGE COMPREHENSION

Read this sample paragraph. Then write multiple-choice distractors for the question below. The question involves implication or inference.

Every line in a drawing is significant. Each one contributes to the work of the artist. Straight lines dominate drawings of urban streets with tall buildings. What is the reason? Can you guess? City buildings are often austere, cold, and functional. Space in a city is not at all plentiful. Every foot of space is important. Architects plan urban buildings efficiently and economically.

We see in this paragraph that straight lines reflect _____.
*A. the rigidity and economy of city buildings

Find a passage (preferably from an ESL reader used by your students); it should be approximately 200 to 300 words long. Prepare eight to ten multiple-choice questions on the passage. At least four should be paraphrase items, plus two or three items requiring synthesis, and two or three items involving inference. Prepare instructions. (Examples: paraphrase, synthesis, inference.)
Kĩ thuật biên soạn câu hỏi Viết/Writing questions

There are many kinds of writing tests. The reason for this is fairly simple: A wide variety of writing tests is needed to test the many kinds of writing tasks that we engage in. For one thing, there are usually distinct stages of instruction in writing, such as pre-writing, guided writing, and free writing. (The stages of instruction in writing can be categorized differently from those presented here.) Each stage tends to require different types of evaluation. Test variety also stems from the various applications of writing. These range from school uses such as note taking and class reports to common personal needs such as letter writing and filling out forms. Besides these, there are specialized advanced applications. Such different writing applications also often call for different test applications.

Another reason for the variety of writing tests in use is the great number of factors that can be evaluated: mechanics (including spelling and punctuation), vocabulary, grammar, appropriate content, diction (or word selection), rhetorical matters of various kinds (organization, cohesion, unity; appropriateness to the audience, topic, and occasion); as well as sophisticated concerns such as logic and style. The list is enough to boggle the mind.

Limited response

As we have indicated, formal tests are not needed for teaching the alphabet or cursive writing. Vocabulary and grammar, however, need attention much longer, and so both need to be evaluated. Growing out of grammar instruction are pre-writing activities such as sentence combining, expansion or contraction of sentence elements, copying, and oral cloze. These are only a few of the techniques that can be employed at this stage. The examples that follow illustrate each of these five procedures.

Sentence combining, a common pre-writing task, takes many forms. We will look at just two of them: combining by adding a connective and combining by putting one sentence inside
the other. When combining sentences by adding a connective, students can demonstrate their understanding of what various connectives mean, for example, connectives that indicate addition (and, moreover, furthermore), contrast (but, however, nevertheless), and result (so, consequently, therefore), to name but a few. You can provide simple completion contexts that require each one.

E.g.:
1) He likes ice cream but he won't eat any.
2) She didn't feel well today so she didn't go to work.
(Students will have learned that words like "and" and "moreover" are not always interchangeable.)

We can use this approach not only with sentence connectors but also with subordinators, for example, those expressing time (after, before, since), condition (if, whether or not, unless), and cause (since, because).

Combining sentences by having students make internal changes in the grammar also requires considerable proficiency on the part of students. Often the subordinators and conjunctions are provided as in these examples:

E.g.:
1) Some people come late. They will not get good seats. (that)
(Answer: People that come late will not get good seats.)
2) I am surprised. Nobody likes her. (It - that)
(Answer: It surprises me that nobody likes her.)

Sentence expansion is another kind of pre-writing evaluation. This can involve simply adding words such as adjectives and adverbs. Or it can require adding phrases and clauses.

E.g.:
1) The (_____) man hurried (_____) to the (_____) horse.
(Answer: The old man hurried out to the frightened horse.)
2) His decision (_____) surprised everyone (_____).
(Answer: His decision to quit his job surprised everyone that knew him.)

Sentence reduction, still another procedure used in evaluating pre-writing proficiency, often provides a cue word (as in the following examples) to show how to begin the new phrase:

(6.7) He told us about a man who had a wooden leg. (with)
(Answer: He told us about a man with a wooden leg.)
(6.8) Her father, who is certainly the stingiest man I know, wouldn't let us borrow his car. (one word)
(Answer: Her stingy father wouldn't let us borrow his car.)

Advantages of Limited-Response Items

These are generally quite easy to construct. These are suitable for students with limited ability in English. Except for the open-ended variety, these are rather objective for a writing-related task.

Limitations of Limited-Response Items

These do not measure actual writing skill. These can be rather slow to correct, especially the open-ended variety.

Guided writing

The objective in guided-writing tests is to check student ability to handle controlled or directed writing tasks. One way is to make certain kinds of changes in a story (text manipulation). Another is to expand the outline of an article or a chapter. Multiple-choice sentences can be used as they are easy to score, but they are slower to prepare, and only one thing can be checked in each item.

E.g.: Directions: The following sentences contain errors in mechanics. But there are no spelling errors. Find the part of the sentence where the mistake occurs. Then circle the letter of that part.

1) (A) We sent for / (B) a repairman to take / (C) a look at the / (D) telephone. In the office where I work.
2) (A) The Doctor told / (B) the young soldier / (C) to drive south through the valley / (D) for supplies at the nearest city.

In 1, the error occurs in part "D"; "in the office where I work" is not a complete sentence; this sentence "fragment" needs to be joined to the main sentence. In 2, the error occurs in part "A"; the word "doctor" should not be capitalized unless used with a person's name (Dr. Adams), and then it would be abbreviated.

There are fairly simple ways also to test larger elements like unity and organization. One way is to find a good unified paragraph and then add a sentence that is unrelated. Students have to find the sentence (or sentences) that don't fit. Here is an example:

(1) Some people think they have
an answer to the troubles of automobile crowding and dirty air in large cities. (2) Their answer is the bicycle, or "bike." (3) In a great many cities, hundreds of people now ride bicycles to work every day. *(4) Some work with their hands while others depend mostly on their brains while working. (5) A group of New York bike riders claim that if more people rode bicycles to work there would be less dirty air in the city from car engines.

We can use a similar approach to test organization. Find or write a well-organized paragraph with clear transition words. Then scramble the sentences. Students have to put the sentences back into their original order. Here is an example:

(1) So on April 18, 1775, he started across the Charles River, where he planned to wait for a signal from a friend. (2) The American Revolution was a citizens' revolution in which ordinary men took a large part. (3) He was living in Boston when British troops arrived to keep people under control. (4) When he saw the lights, he jumped on his horse and rode through the countryside warning the people that they must fight at daybreak. (5) One such man was Paul Revere, a silver worker. (6) Like others, Revere thought the British troops would move from Boston against the villagers. (7) That night after reaching the other side, Revere saw his friend's lantern signals.
(Key: 2, 5, 3, 6, 1, 7, 4)

Building from a Paragraph Outline

One kind of paragraph outline used for testing writing controls the content and the grammar. It takes the following form:

I / buy / new white swimsuit / I forget I bring / I / mad / Becky / mother / take / we / shop / Monday night / I find / pretty blue / not expensive / I start / pay / wallet / gone / I / borrow / money / Becky / mother / I / certainly / upset

The student paragraph might read: I bought a new white swimsuit, and then I forgot to bring it. I was really mad. But Becky's mother took us shopping Monday night, and I found a pretty blue one. It was not very expensive. I started to
pay for it, and my wallet was gone! I borrowed some money from Becky's mother, but I was certainly upset.

The next form of guided-essay test relaxes the grammar control a little more, although this particular sample promotes the present perfect tense. Students are to write a paragraph, beginning with this topic sentence: "Several things have contributed to my being an educated person." They are told to consider (but not limit themselves to) the following sentences:

• I have lived in _____ (countries).
• I have traveled in _____ (places).
• I have had certain responsibilities that have matured me. (Name them.)
• I have read _____. (Give an account of reading that has given you special insights.)
• I have talked to _____. (Tell about people from whom you have learned a lot.)
• My parents have taught me _____.

Our final example of a guided-writing test controls the content of the writing but not necessarily the grammar:

E.g.: Directions: Write a paragraph of about seventy-five words describing a store or business that you know very well. Base your paragraph on answers to the following questions: What is it called? When did it start to do business? How many employees does it have? What do the employees have to do? Does it have a lot of customers/clients? Why (not)? Why do you choose to go there rather than somewhere else? Is it a good example of what such a store or business should be?

Students' writing can be started like this: In my neighborhood there is a _____.

It is one thing to get students to write. It is quite another matter to grade their writing. As mentioned earlier, you need to decide ahead of time what to evaluate: such as the use of complete sentences, agreement of subject and verb, proper inflections (including tense), and basic mechanics. It is good to limit these to only a few criteria. (See dictation and free composition sections for the discussion on grading.)
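Deciding on a few criteria ahead of time can be made concrete with a simple weighted checklist that the teacher fills in per paper. This is only an illustrative sketch; the criteria names and weights below are invented for the example, not prescribed by the text:

```python
# Hypothetical criteria and weights for grading a guided-writing task.
# The teacher rates each criterion from 0.0 (not met) to 1.0 (fully met).
CRITERIA = {
    "complete sentences": 30,
    "subject-verb agreement": 30,
    "verb tense": 20,
    "basic mechanics": 20,
}

def grade(ratings):
    """Combine per-criterion ratings into a score out of 100."""
    return sum(CRITERIA[name] * ratings[name] for name in CRITERIA)

score = grade({
    "complete sentences": 1.0,
    "subject-verb agreement": 0.5,
    "verb tense": 1.0,
    "basic mechanics": 0.5,
})
print(score)  # 75.0
```

Because the weights are announced in advance, students know which few features are being evaluated, which is the point of limiting the criteria.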
Advantages of Guided-Writing Tests

Guided-writing tests are rather quick and easy to construct. Because they require an active rather than a passive response, guided testing techniques give the appearance of being an effective measure of writing. Guided-writing tests provide appropriate control for those students who are not ready to write on their own.

Limitations of Guided-Writing Tests

Guided-writing tests do not measure ingredients such as organization found in extended writing. Guided writing of the paragraph-outline variety is often rather time consuming and difficult to grade. Guided writing of the paragraph-outline variety is difficult to score with real consistency.

Free Writing

Few teachers have students write without giving them a specific topic. One reason for this is that the skills used in telling a story are quite different from those used in making analogies or refuting arguments. We need to make sure that we're testing what we have taught. Also we need to be certain that each student is performing the same kind of task. Otherwise, we cannot make a fair comparison of their writing. For these reasons, we have to provide clear and rather detailed guidelines for writing, even for advanced students.

Guidelines for Writing Tasks

At upper-intermediate to advanced levels (Grade 11 to Grade 12 supposedly), the aim in a writing test is generally to evaluate the effectiveness
the morning / after lunch) Having students write directions will produce adverbial expressions of place (across from, down, close to) A conversation task will require knowledge of specialized punctuation ("Come in," she said) When preparing a topic for a writing test, we need to be careful, then, to match the assignment to our students' level of training It is also good test practice to guide the content of what students write Rarely are we interested in testing creativity, ingenuity, or logic Guiding the content frees us to look at essentials One simple way to this is to use pictures A travel poster or an advertisement from a magazine can be useful for descriptive writing A series of pictures (as in some comic strips) can provide guidance for a narrative paragraph Another way of controlling content is to provide charts, tables, or diagrams to be explained With reference to the following lists, students could be told to write a paragraph comparing and contrasting the two cars: (your car) (your friend’s car) horse power 325 220 miles per gallon 16 25 doors wheel size 15 inch 15 inch seating fuel used unleaded regular stereo no Yes power steering yes no power brakes yes yes (93) 46 (94) Still another way to control content is to provide a situation that determines what students are to write about The following test item looks at upper-intermediate letter-writing skills: E.g.: Direction: You need a job for the summer You have just read a "Help Wanted" advertisement for teenage workers (reception, dining room, and cleanup) at the Grand Canyon Lodge in Arizona Address: U.S Forest Service, Grand Canyon Lodge, Box 1128, North Rim, Arizona 82117 Write a business letter; Indicate the position applied for Describe your qualifications, such as your age, language background, travel, and personality characteristics Indicate when you will be available and how long you can work Evaluating Student Writing The introduction to this part pointed out the numerous factors that can be 
evaluated in a single piece of writing. There are several good reasons why teachers ought to consider limiting the number of factors that they check in compositions, except at the most advanced levels. One reason to evaluate only a few factors at one time is that doing so helps us grade our papers more accurately and consistently. Another reason is to speed up our essay grading. A third reason for limiting the number of factors to be evaluated is to avoid unnecessary discouragement of our students.

This latter point deserves elaboration. Many students are inhibited in their writing because their work has been overcorrected. A more selective grading of writing can offer needed encouragement. Let's illustrate: Suppose you have been working to eliminate fragments ("He went home Because he finished his work"). Students could be assigned to write a paragraph on a specific topic, in class. In grading this short piece of work, you could look only for fragments. Regardless of other errors, you could give a score of 100 percent to each paper that was free of fragments (and perhaps 75 percent of the score to a paper with one fragment, 50 percent to a paper with two, etc.).
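The selective scheme just described (full credit for a fragment-free paragraph, with a fixed deduction per fragment) amounts to a one-line rule. A sketch, using the percentages from the example above:

```python
def fragment_score(num_fragments, deduction=25):
    """Score a paragraph checked only for sentence fragments:
    100 percent for none, minus a fixed deduction per fragment,
    with a floor of zero."""
    return max(0, 100 - deduction * num_fragments)

# 0 fragments -> 100, 1 -> 75, 2 -> 50, and so on.
for n in range(4):
    print(n, fragment_score(n))
```

The same shape works for any single criterion the class has been drilling; only the deduction size changes.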
Obviously this is not highly efficient, but it is consistent with our handling of spoken English. Students would be tongue-tied if every improper stress, intonation, and trace of foreign pronunciation were corrected. Yet many "conscientious" teachers greatly dishearten students by red-penciling every error in their writing that they can find. Certainly a compromise between this extreme and the one-item-per-paper scoring would be very natural. Generally, we can look for several items in a given paper. But the number should be rather limited, particularly on the beginning level, and they should always be drawn from concepts that have been covered in class.

There is an extreme to be avoided in the selective grading of papers, and that is an exclusive focus on grammar or mechanics. Writing, as we know, is much more than grammar. On the intermediate and advanced levels, we begin to give more attention to rhetorical matters of unity, organization, and coherence, in addition to grammatical accuracy. A test corrected only for grammar, even though written as an essay or letter, is still simply a grammar test. And there are more effective ways to test grammar, as we have seen in the previous part. Of course, an occasional focus on grammar or vocabulary or mechanics can have a good "backwash" effect on instruction: Students can appreciate the communicative application of these subskills through classwork.

There are basically two ways to give a formal grade to a piece of writing. One is called analytical, and the other holistic. Let us take a look at these. The analytical method attempts to evaluate separately the various components of a piece of writing; it can be illustrated with several approaches. One analytical approach is the "points-off" method. Students begin with 100 points or an A grade. Then they lose points or fractions of a grade for errors that occur in their piece of writing. What would we look for in student writing at or below intermediate level?
Mechanics might include capitalization (notably at the beginning of sentences), punctuation (especially end punctuation), and spelling (no penalty for more than one misspelling of the same word). Grammar would include basic material that had been taught (at least matters such as sentence sense, verb tense, and word order). A larger element of writing to be included might well be organization. Other possible factors are vocabulary choice and ability to follow the assigned writing task. To avoid failing a student for repeated errors of one kind, it is possible to use the following system: one to two errors = one unit off (for example, A to A-, or 100 to 95); three to five errors = two units off; and over five = three units off. It is also possible to have grammar errors count double or triple the amount off that mechanical errors do, and for errors in larger elements such as organization to be double the weight of grammar errors.

Another analytical approach reverses the procedure described above. Points are given for acceptable work in each of several areas. Consider the following:

mechanics ................... 20%
vocabulary choice ........... 20%
grammar & usage ............. 30%
organization ................ 30%
TOTAL ...................... 100%

Sometimes a big difference appears between the message that the student conveys and his mastery of the language. To encourage such students, it is possible to assign a "split grade" (for example, total grade = (A+B)/2). The one at the left can stand for quality of content; the one at the right, accuracy of language use; and the total score will be an average of both.

A major problem with analytical approaches is that one never knows just how to weight each error or even each area being analyzed. We avoid this difficulty in holistic grading. Also we focus on communication. We are aware of mechanics and grammar, for instance; but we ask ourselves, "How well does this paper communicate?"
Minor mechanical errors that interfere very little require very little penalty. In fact, we don't count them. Instead, we might reduce a grade a step (say, from A to A-) on the basis of a scattering of these errors. The same principle applies to other areas. To develop a "feel" for such grading, we compare one paper with another. The holistic approach doesn't make us feel as secure as we are when we grade a spelling quiz or grammar exam. Nevertheless, it is one of the best ways to evaluate the complex communicative act of writing. Therefore, although the analytical approach has some things to recommend it, the holistic approach is, on the whole, better.

Advantages of Free-Writing Approaches

Despite its limitations, this is an important, sound measure of overall writing ability. This can have a good effect on instruction: Students will be more motivated to write in and out of class, knowing that their test will be an actual writing task. There is virtually very little chance of getting a passing grade on a free-writing test by cheating.

Limitations of Free-Writing Approaches

Grading of free writing tends to lack objectivity and consistency. Free writing is time consuming to grade.

Bài Tập/tasks

Work in groups of four and read the assigned writing. Decide the factors to take into consideration and design a scoring rubric to score the writing. (Trainer will provide the authentic materials.)

Đánh giá đề kiểm tra / Evaluating the tests

The previous parts in this material have discussed how to construct and administer examinations of subskills and communication skills. But one thing more is needed: how to tell whether or not we have been successful, that is, have we produced a good test? Why is this important?
For one thing, good evaluation of our tests can help us measure student skills more accurately. It also shows that we are concerned about those we teach. For example, test analysis can help us remove weak items even before we record the results of the test. This way we don't penalize students because of bad test questions. Students appreciate an extra effort like this, which shows that we are concerned about the quality of our exams. And a better feeling toward our tests can improve class attitude, motivation, and even student performance.

Some insight comes almost intuitively. We feel good about a test if advanced students seem to score high and slower students tend to score low. Sometimes students provide helpful "feedback," mentioning bad questions, as well as questions on material not previously covered in class, and unfamiliar types of test questions.

Besides being on the right level and covering material that has been discussed in class, good tests are also valid and reliable. A valid test is one that in fact measures what it claims to be measuring. A listening test with written multiple-choice options may lack validity if the printed choices are so difficult to read that the exam actually measures reading comprehension as much as it does listening comprehension. It is least valid for students who are much better at listening than at reading. Similarly, a reading test will lack validity if success on the exam depends on information not provided in the passage, for example, familiarity with British or American culture.

A reliable test is one that produces essentially the same results consistently on different occasions when the conditions of the test remain the same. We noted in the previous part, for example, that teachers' grading of essays often lacks consistency or "reliability" since so many matters are being evaluated simultaneously. In defining reliability in this paragraph, we referred to consistent results when the conditions of the test remain the same. For example,
for consistent results, we would expect the same amount of time to be allowed on each test administration. When a listening test is being administered, we need to make sure that the room is equally free of distracting noises on each occasion. If a guided oral interview were being administered on two occasions, reliability would probably be hampered if the teacher on the first occasion were warm and supportive and the teacher on the second occasion abrupt and unfriendly.

In addition to validity and reliability, we should also be concerned about the affect of our test, particularly the extent to which our test may cause undue anxiety. Negative affect can be caused by a recording or reading, for example, that is far too difficult, or by an unfamiliar examination task, such as translation if this has not been used in class or on other school exams. There are differences, too, in how students respond to various forms of tests. Where possible, one should utilize test forms that minimize the tension and stress generated by our English language tests.

Besides being concerned about these general matters of validity, reliability, and affect, there are ways that we can improve our tests by taking time to evaluate individual items. While many teachers are too busy to evaluate each item in every test that they give, at least major class tests should be carefully evaluated. The following sections describe how this can be done.

Preparing an Item Analysis

Selection of appropriate language items is not enough by itself to ensure a good test. Each question needs to function properly; otherwise, it can weaken the exam. Fortunately, there are some rather simple statistical ways of checking individual items. This procedure is called "item analysis."
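As a rough sketch of the bookkeeping an item analysis involves, the ranking-and-splitting step described in the next section can be expressed in a few lines of Python. The scores and the function name here are hypothetical, not from the original; this is only an illustration of the "divide the papers into thirds" procedure.

```python
# Sketch of preparing papers for an item analysis (hypothetical data and
# names). Papers are ranked by score, then split into thirds; only the
# top and bottom thirds are kept for the response tally.

def split_into_thirds(scores):
    """Return (high_group, middle_group, low_group) of sorted scores."""
    ordered = sorted(scores, reverse=True)   # highest score first
    third = len(ordered) // 3
    high = ordered[:third]                   # top third of the papers
    low = ordered[-third:]                   # bottom third of the papers
    middle = ordered[third:len(ordered) - third]  # set aside for a while
    return high, middle, low

# With 30 papers we get 10 in each stack, as in the worked example below.
scores = [58, 91, 74, 66, 83, 95, 70, 62, 88, 77,
          54, 97, 81, 69, 73, 85, 60, 92, 79, 64,
          87, 71, 56, 94, 68, 82, 76, 59, 90, 65]
high, middle, low = split_into_thirds(scores)
print(len(high), len(middle), len(low))  # 10 10 10
```

When the class size is not divisible by three, the extra paper or two simply stays in the middle stack, which is not tallied anyway.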
It is most often used with multiple-choice questions. An item analysis tells us basically three things: how difficult each item is, whether or not the question "discriminates" (tells the difference between high and low students), and which distractors are working as they should. An analysis like this is used with any important exam, for example, review tests and tests given at the end of a school term or course.

To prepare for the item analysis, first score all of the tests. Then arrange them in order from the one with the highest score to the one with the lowest. Next, divide the papers into three equal groups: those with the highest scores in one stack and those with the lowest in another. (The classical procedure is to choose the top 27 percent and the bottom 27 percent of the papers for analysis. But since our classes are usually fairly small, dividing the papers into thirds gives us essentially the same results and allows us to use a few more papers in the analysis.) The middle group can be put aside for a while. You are now ready to record student responses. This can be done on lined paper as follows:

Item # 1           High Group     Low Group
  A
  B
  C
  D
  X (no answer)

Circle the letter of the correct answer. Then take the High Group papers, and start with question number one. Put a mark by the letter that each person chose, and do this for each question on the test. Then do the same in the "Low Group" column for those in the bottom group.

Difficulty Level

You are now ready to find the level of difficulty for each question. This is simply the percentage of students (high and low combined) who got each question right. To get the level of difficulty, follow these steps: (1) Add up the number of high students with the correct answer (to question number one, for example). (2) Then add up the number of low students with the correct answer. (3) Add the sums found in steps 1 and 2 together. (4) Now divide this figure by the total number of test papers in the high and low groups combined. A formula for this
would be:

    Difficulty = (High Correct + Low Correct) / Total Number in Sample, or (Hc + Lc) / N

An example will illustrate how to do this. Let's assume that 30 students took the test. We correct the tests and arrange them in order from high to low. Then we divide them into three stacks. We would have 10 in the high group and 10 in the low group; we set the middle 10 aside. The total number (N) in the sample is therefore 20. We now mark on the sheet how many high students selected A, B, C, or D, and how many low students marked these choices. (If the item is left blank by anyone, we mark the "X" line.) Below is the tally for item 1. Note that "B" is the right answer for this question:

Item # 1           High Group     Low Group
  A                   …              …
  B (correct)         5              2
  C                   …              …
  D                   …              …

We see that 5 in the high group and 2 in the low group got item number 1 correct. Thus, (5 + 2) / 20 = 7/20 = 35% answered this item correctly.

Now we can see if the item is too easy, too difficult, or "about right." Generally, a test question is considered too easy if more than 90 percent get it right. An item is considered too difficult if fewer than 30 percent get it right. (You can see why by noting that a person might get 25 percent on a four-option test just by guessing.)
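The difficulty formula above is easy to mechanize. In this sketch the function names and the 30-90% band check are ours, and the tallies are the ones from the worked example (5 of 10 high papers and 2 of 10 low papers correct):

```python
def difficulty_level(high_correct, low_correct, n_sample):
    """Difficulty = (Hc + Lc) / N: share of the sampled papers answered right."""
    return (high_correct + low_correct) / n_sample

def about_right(p):
    """Rule of thumb from the text: flag items above 90% or below 30%."""
    return 0.30 <= p <= 0.90

# 5 high + 2 low correct out of 20 sampled papers -> 7/20 = 35%
p = difficulty_level(5, 2, 20)
print(f"{p:.0%}", about_right(p))  # 35% True
```

An item scoring, say, 95% would fail the `about_right` check as too easy, and one scoring 25% as too difficult, matching the thresholds given in the text.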
Referring to the example, we find that item 1 is acceptable. However, it would be best to rewrite much of the test if too many items were in the 30's and 40's. If you plan to use your test again with another class, don't use items that are too difficult or too easy; rewrite them or discard them. Two or three very easy items can be placed at the beginning of the test to encourage students. Questions should also be arranged from easy to difficult. Not only is this good psychology, but it also helps those who don't have a chance to finish the test; at least they have a chance to try those items that they are most likely to get right. It is obvious that our sample item would come near the end of the test, since only about a third of the students got it right.

Before leaving this discussion of item difficulty, we need to point out that on many language tests (a grammar exam, for instance), it is not completely accurate to think of very difficult and very easy items as "weak" questions. "Difficult" items may simply be grammar points that you have not spent enough class time on or that you have not presented clearly enough. Adjusting your instruction could result in an appropriate level of difficulty for the item. And an easy item simply points up that almost all students in the class have mastered that grammar point. In short, this part of the analysis provides insight into our instruction as well as evaluating the test items themselves.

Discrimination Level

You can use the same high and low group tally from the previous section to check each item's level of discrimination (that is, how well it differentiates between those with more advanced language skill and those with less skill). Follow these steps to calculate item discrimination: (1) Again find the number in the top group who got the item right. (2) Find the number in the bottom group who got it right. (3) Then subtract the number getting it right in the low group from the number getting it right in the high group. (4) Divide
this figure by the total number of papers in the high and low groups combined. A formula for this would be:

    Discrimination = (High Correct - Low Correct) / Total Number in Sample, or (Hc - Lc) / N

Returning to sample item 1, note that choice "B" is the correct answer. So subtract the 2 persons in the low group getting the item right from the 5 in the high group getting it right. This leaves 3. Dividing 3 by 20, the number of highs plus lows, you get 0.15, or in other words, 15 percent. Generally it is felt that 10 percent discrimination or less is not acceptable, while 15 percent or higher is acceptable; between 10 and 15 percent is marginal or questionable. Applying this standard to the sample item, we see that it has acceptable discrimination.

There is one caution in applying discrimination to our language tests. When doing an item analysis of rather easy and rather difficult questions, be careful not to judge the items too harshly. For example, when almost 90 percent get an item right, this means that nearly all low students as well as high students have marked the same (correct) option. As a result, there is little opportunity for a difference to show up between the high and low groups; in other words, discrimination is automatically low. Also be careful when evaluating very small classes, for example, those with only 20 or 25 students. This is especially true if students have been grouped according to ability: you can't expect much discrimination on a test if all the students are performing at about the same level. But if you have a number of high and low students, the discrimination figure is very helpful in telling how effective the item is.

When you find items that do not discriminate well, or that are too easy or too difficult, you need to look at the language of the question to find the cause. Sometimes you will find negative discrimination, that is, more low students getting a question right than high students. Occasionally even useless items like this can be revised and made acceptable. For example, an evaluation of one
overseas test found a question with unacceptable discrimination. Most of the high group thought that the phrase "to various lands and peoples" was wrong; they had learned that "people" did not take the "s" plural, and they did not know this rare correct form. Simply changing this part of the test question resulted in a satisfactory item.

Distractor Evaluation

Weak distractors, as we have just seen, often cause test questions to have poor discrimination or an undesirable level of difficulty. No set percentage of responses has been agreed upon, but examiners usually feel uneasy about a distractor that isn't chosen by at least one or two examinees in a sample of 20 to 30 test papers. But sometimes it does happen that only one or two distractors attract attention. There are three common causes for this: (1) Sometimes an item is included that was drilled heavily in class, an item that almost everyone has mastered; the answer is obvious, so the distractors cannot "distract." (2) Sometimes a well-recognized pair is used (this/these, is/are, etc.). Even though not everyone has control of these yet, students know that one of the two is the right answer; no other choice seems likely. Here we need to choose another test format. (3) A third cause is the use of obviously impossible distractors: ("Did he do the work?" / A. Yes, he did. B. Birds eat worms. C. Trains can't fly.) The tally of student answers also shows how many people skipped each item. Sometimes many questions are left blank near the end of the test; in this case you will need to shorten the test or allow more time for it.

Bài tập / Tasks

To do tasks 1 to 5 below, do an item analysis on these four multiple-choice questions. There were 27 students in the class, and therefore 9 test papers in each group. (Note in the tallying below that | | means 2, and that | | | | means 5, etc.)

1. Calculate the level of difficulty for each of the four items. Which of these are too difficult, and which are too easy?
Submit your calculations with your answer.

2. Calculate the discrimination of each item. Which has the poorest discrimination? Which have unsatisfactory discrimination? Which are borderline? Submit your calculations.

3. Look at the distractors in the four items. In which are they the most effective? In which are they the least effective?

4. Do we have any item with negative discrimination? If so, which one?

5. Which item did the fewest students leave blank? Which item did the most leave blank?

6. Kiểm tra đánh giá theo chuẩn Kiến thức Kỹ năng

6.1. Thực trạng công tác kiểm tra, đánh giá dạy học môn học (Thuận lợi, khó khăn, nguyên nhân)

Với mạnh SGK Tiếng Anh từ THCS đến THPT và đặc thù môn, việc kiểm tra đánh giá kết học tập môn Tiếng Anh HS đã có ưu điểm sau:
- Đảm bảo tính khách quan quá trình đánh giá.
- Đảm bảo tính thường xuyên.

Tuy nhiên việc đánh giá kết học tập còn nhiều bất cập như:
- Chủ trương thi trắc nghiệm 100% kì thi tốt nghiệp THPT và thi vào cao đẳng và đại học đã làm ảnh hưởng đến quá trình dạy học tiếng Anh THPT là dạy theo định hướng thi, thực hành giao tiếp chưa chú trọng và đầu tư để đạt hiệu. Do đó dẫn đến hệ lụy là việc KTĐG không bảo đảm các yêu cầu kiến thức và kĩ mà chương trình đã đề ra, không làm chức KTĐG.
- Cũng ảnh hưởng thi trắc nghiệm nên các kĩ nói và nghe nhiều trường không đầu tư sở vật chất băng máy, để dạy và học hiệu.
- Chưa đảm bảo tính toàn diện, hệ thống và phát triển.

6.2. Quan niệm đánh giá theo chuẩn kiến thức, kỹ năng môn học
- Bám sát các yêu cầu KT-KN chuẩn KT-KN môn học.
- Đánh giá việc áp dụng các kiến thức ngôn ngữ vào các kĩ giao tiếp là kiểm tra các kiến thức ngôn ngữ.
- Phải vào chuẩn kiến thức và kĩ nội dung môn học cấp, lớp.
- Đa dạng hóa hình thức kiểm tra đánh giá kết học tập học sinh, tăng cường các hình thức đánh giá theo kết đầu.

6.3. Yêu cầu đổi công tác kiểm tra, đánh giá theo chuẩn KT-KN môn học
- Phải vào chuẩn kiến thức và kĩ nội dung môn học cấp, lớp.
- Chỉ đạo, kiểm tra việc thực chương trình, kế hoạch giảng dạy, học tập các nhà
trường; tăng cường đổi khâu kiểm tra, đánh giá thường xuyên, định kỳ; phối hợp đánh giá GV, đánh giá HS với HS và tự đánh giá HS, đánh giá nhà trường và đánh giá gia đình, cộng đồng Đảm bảo chất lượng kiểm tra, đánh giá thường xuyên, định kỳ: chính (117) 58 (118) xác, khách quan, công bằng; không hình thức, đối phó không gây áp lực nặng nề - Đánh giá kịp thời, có tác dụng giáo dục và động viên tiến HS, giúp học sinh sửa chữa thiếu sót Cần có nhiều hình thức và độ phân hoá đánh giá phải cao; chú ý tới đánh giá quá trình lĩnh hội tri thức học sinh, quan tâm tới mức độ hoạt động tích cực, chủ động HS tiết học tiếp thu kiến thức, hình thành kĩ - Đánh giá hoạt động dạy học không đánh giá thành tích học tập học sinh mà còn bao gồm đánh giá quá trình dạy học nhằm cải tiến quá trình dạy học Chú trọng kiểm tra, đánh giá hành động, tình cảm học sinh: nghĩ và làm; lực vận dụng vào thực tiễn, thể qua ứng xử, giao tiếp Chú trọng phương pháp, kĩ thuật lấy thông tin phản hồi từ học sinh để đánh giá quá trình dạy học - Đánh giá kết học tập học sinh, thành tích học tập học sinh không đánh giá kết cuối cùng mà chú ý quá trình học tập Tạo điều kiện cho học sinh cùng tham gia xác định tiêu chí đánh giá kết học tập với yêu cầu không tập trung vào khả tái tri thức mà chú trọng khả vận dụng tri thức việc giải các nhiệm vụ phức hợp - Nâng cao chất lượng đề kiểm tra, thi đảm bảo vừa đánh giá đúng chuẩn kiến thức, kỹ năng, vừa có khả phân hóa cao Đổi đề kiểm tra 15 phút, kiểm tra tiết, kiểm tra học kỳ theo hướng kiểm tra kiến thức, kỹ bản, lực vận dụng kiến thức người học, phù hợp với nội dung chương trình, thời gian quy định - Kết hợp hợp lý các hình thức kiểm tra, vấn đáp, tự luận, trắc nghiệm phát huy ưu điểm và hạn chế nhược điểm hình thức 6.4 Hướng dẫn việc kiểm tra đánh giá theo chuẩn KT-KN (xác định mục đích kiểm tra đánh giá; biên soạn câu hỏi, bài tập, đề kiểm tra; tổ chức kiểm tra; xử lý kết kiểm tra, đánh giá) 6.4.1 Để bảo đảm thực chức KTĐG, cần thực các yêu cầu sau trước biên 
soạn đề kiểm tra: Xác định rõ mục đích KTĐG: - Kiểm tra phân loại để đánh giá trình độ xuất phát người học Kiểm tra thường xuyên Xây dựng tiêu chí đánh giá: - Đảm bảo tính toàn diện: Đánh giá các mặt kiến thức, kỹ Đảm bảo độ tin cậy - Đảm bảo tính khả thi - Đảm bảo yêu cầu phân hoá (119) 59 (120) Xác định rõ nội dung cụ thể các kiến thức kĩ cần KTĐG, - Xây dựng ma trận nội dung KT cần kiểm tra: đơn vị bài, cụm đơn vị bài, cuối học kì, 6.4.2 Lưu ý biên soạn đề kiểm tra: Hình thức bài kiểm tra - Cấu trúc bài kiểm tra - Xác định mức độ cần đạt kiến thức, có thể xác định theo mức độ: nhận biết, thông hiểu, vận dụng, phân tích, tổng hợp, đánh giá (Bloom) Tuy nhiên, học sinh phổ thông, thường sử dụng với mức độ nhận thức đầu là nhận biết, thông hiểu và vận dụng (hoặc có thể sử dụng phân loại Nikko gồm mức độ: nhận biết, thông hiểu, vận dụng mức thấp, vận dụng mức cao) Các kĩ đặt câu hỏi (6 kĩ nhỏ để hình thành lực đặt câu hỏi nhận thức theo hệ thống phân loại các mức độ câu hỏi Bloom) 7.1 Câu hỏi “biết” Mục tiêu : Câu hỏi “biết” nhằm kiểm tra trí nhớ HS các kiện, số liệu, tên người địa phương, các định nghĩa, định luật, quy tắc, khái niệm Tác dụng HS : Giúp HS ôn lại gì đã biết, đã trải qua Cách thức dạy học : Khi hình thành câu hỏi GV có thể sử dụng các từ, cụm từ sau đây : Ai ? Cái gì ? Ở đâu ? Thế nào ? Khi nào ? Hãy định nghĩa ; Hãy mô tả ; Hãy kể lại 7.2 Câu hỏi “hiểu” Mục tiêu: Câu hỏi “hiểu” nhằm kiểm tra HS cách liên hệ, kết nối các kiện, số liệu, các đặc điểm tiếp nhận thông tin Tác dụng HS: - Giúp HS có khả nêu yếu tố bài học - Biết cách so sánh các yếu tố, các kiện bài học Cách thức dạy học: Khi hình thành câu hỏi GV có thể sử dụng các cụm từ sau đây : Hãy so sánh ; Hãy liên hệ ; Vì ? Giải thích ? 
Câu hỏi “Áp dụng” (121) 60 (122) Mục tiêu: Câu hỏi “áp dụng” nhằm kiểm tra khả áp dụng thông tin đã thu (các kiện, số liệu, các đặc điểm ) vào tình Tác dụng HS: - Giúp HS hiểu nội dung kiến thức, các khái niệm, định luật - Biết cách lựa chọn nhiều phương pháp để giải vấn đề sống Cách thức dạy học: - Khi dạy học GV cần tạo các tình mới, các bài tập, các ví dụ, giúp HS vận dụng các kiến thức đã học - GV có thể đưa nhiều câu trả lời khác để HS lựa chọn câu trả lời đúng Chính việc so sánh các lời giải khác là quá trình tích cực 7.4 Câu hỏi “Phân tích” Mục tiêu: Câu hỏi “phân tích” nhằm kiểm tra khả phân tích nội dung vấn đề, từ đó tìmramối liên hệ, chứng minh luận điểm, đến kết luận Tác dụng HS: Giúp HS suy nghĩ, có khả tìm các mối quan hệ tượng, kiện, tự diễn giải đưa kết luận riêng, đó phát triển tư logic Cách thức dạy học: - Câu hỏi phân tích thường đòi hỏi HS phải trả lời: Tại sao? (khi giải thích nguyên nhân) Em có nhận xét gì? (khi đến kết luận) Em có thể diễn đạt nào? (khi chứng minh luận điểm) - Câu hỏi phân tích thường có nhiều lời giải 7.5 Câu hỏi “Tổng hợp” Mục tiêu: Câu hỏi “tổng hợp” nhằm kiểm tra khả HS có thể đưa dự đoán, cách giải vấn đề, các câu trả lời đề xuất có tính sáng tạo Tác dụng HS: Kích thích sáng tạo HS hướng các em tìm nhân tố mới, Cách thức dạy học: - GV cần tạo tình huống, câu hỏi, khiến HS phải suy đoán, có thể tự đưa lời giải mang tính sáng tạo riêng mình (123) 61 (124) - Câu hỏi tổng hợp đòi hỏi phải có nhiều thời gian chuẩn bị 7.6 Câu hỏi “Đánh giá” Mục tiêu: Câu hỏi “đánh giá” nhằm kiểm tra khả đóng góp ý kiến, phán đoán HS việc nhận định, đánh giá các ý tưởng, kiện, tượng, dựa trên các tiêu chí đã đưa Tác dụng HS: Thúc đẩy tìm tòi tri thức, xác định giá trị HS Cách thức dạy học: GV có thể tham khảo số gợi ý sau để xây dựng các câu hỏi đánh giá: Hiệu sử dụng nó nào? Việc làm đó có thành công không? Tại sao? 
(125) 62 (126) II - ĐỀ KIỂM TRA MINH HỌA DÙNG CHO LÀM VIỆC THEO NHÓM KHUNG MA TRẬN ĐỀ KIỂM TRA (Dùng cho loại đề kiểm tra TL TNKQ) Tên Chủ đề (nội dung,chương…) Chủ đề Nhận biết Chuẩn KT, KNcần kiểm tra (Ch) Số câu Số điểm Tỉ lệ % Chủ đề Số câu Số điểm Số câu Số điểm Số câu Số điểm Tỉ lệ % Chủ đề n Số câu Số điểm Tổng số câu Tổng số điểm Tỉ lệ % Tỉ lệ % (Ch) (Ch) Số câu Số điểm Số câu Số điểm % Thông hiểu (Ch) Số câu Số điểm (Ch) Số câu Số điểm (Ch) Số câu Số điểm Số câu Số điểm % Vận dụng Cấp độ thấp Cấp độ cao (Ch) Cộng (Ch) Số câu Số điểm Số câu Số điểm (Ch) Số câu điểm= % (Ch) Số câu Số điểm Số câu Số điểm (Ch) Số câu điểm= % (Ch) Số câu Số điểm Số câu Số điểm Số câu Số điểm % Số câu điểm= % Số câu Số điểm (127) 63 (128) KHUNG MA TRẬN ĐỀ KIỂM TRA (Dùng cho loại đề kiểm tra kết hợp TL và TNKQ) Tên Chủ đề (nội dung,chương…) Chủ đề Nhận biết Thông hiểu Vận dụng Cấp độ thấp Cộng Cấp độ cao TNKQ TL TNKQ TL TNKQ TL TNKQ TL (Ch) (Ch) (Ch) (Ch) (Ch) (Ch) (Ch) (Ch) Số câu Số câu Số điểm Tỉ lệ % Số điểm Chủ đề (Ch) Số câu Số câu Số điểm Tỉ lệ % Số điểm Số câu Số điểm (Ch) Số câu Số điểm Số câu Số điểm (Ch) Số câu Số điểm Số câu Số điểm (Ch) Số câu Số điểm Số câu Số điểm (Ch) Số câu Số điểm Số câu Số điểm (Ch) Số câu Số điểm Số câu Số điểm (Ch) Số câu Số điểm Số câu Số điểm Số câu điểm= % (Ch) Số câu Số điểm Số câu điểm= % Chủ đề n (Ch) (Ch) Số câu Số câu Số câu Số điểm Số điểm Số điểm Tỉ lệ % Số câu Tổng số câu Số điểm Tổng số điểm % Tỉ lệ % (Ch) (Ch) Số câu Số câu Số điểm Số điểm Số câu Số điểm % (Ch) Số câu Số điểm (Ch) (Ch) Số câu Số câu Số điểm Số điểm Số câu Số điểm % (Ch) Số câu Số điểm Số câu điểm= % Số câu Số điểm (129) 64 (130) YÊU CẦU: Sử dụng ma trận cho trang trước, nhận xét, góp ý và điều chỉnh các đề kiểm tra sau cho phù hợp với nội dung tập huấn Xây dựng ma trận Xác định kiến thức kỹ Nhận xét các đề kiểm tra theo các tiêu chí vừa xác định Bổ sung, cải tiến, nâng cao chất lượng các đề kiểm tra Period 13 TEST Fullname: FIRST 45 MINUTE WRITTEN Class: I Choose 
a word which has pronunciation different from the others: A difficult B invite C different D visit A tropical B mausoleum C comprise D compulsory A design B symbol C inspiration D sweater A printed B added C started D mentioned II Give correct form of the verbs in the brackets: Mrs Minh (teach) in this school in 2000 -You ever (be) to the cinema? - Yes, I have They (live) in this house for months We used (write) to each other every month I don't know Hung's sister I (not meet) her yet He used to wear uniform when he (be) a student III Choose the correct word in the bracket to complete the sentences: We are proud (in/at/on/of) being Vietnamese Japanese women often wear (veil/skirt/kimono/pants) Everyone knows that women are equal ( to/at /on/of) men We have known each other (for/ in/ago/since) years IV Rewrite the following sentences: She wrote this novel last year This novel… 2.It isn't warm enough for us to go swimming I wish… 3.My grandmother often told us stories My grandmother used… 4.Tom has to this exercise carefully (131) This exercise… V Read the text and answer the questions: 65 (132) Mr Minh is a worker He went to University in 1982 and became a professor in 1990 He has written books about education He is married to his assistant, Linda They have two children In the 1990s they lived in the suburb of London They now live in the center of the city In fact, they have lived there since 2000 When did Mr Minh become a professor? Has he written books about education? Did his family used to live in the suburb of London? How long has his family lived in the center of the city? The End (133) 66 (134) Answer Key: I points: B invite A tropical A design D mentioned II points taught Have you ever been…? 
have lived to write have not met was III points of kimono to for IV points This novel was written last year I wish it were warm enough for us to go swimming My grandmother used to tell us stories This exercise has to be done carefully by Tom V points He became a professor in 1990 Yes, he has Yes, they did They have lived there since 2000 (135) 67 (136) Period 33 first semester test Fullname: Class: I Choose the word which is pronouced differently from the others: a unit b climate c city d ethnic a visited b wanted c ended d liked a secondary b religion c design d region a abroad b arrive c past d primary a mention b question c inspiration d addition II Give correct form of the verbs in brackets: That room (be) in use since we (decorate) it We wish we (will visit) England some day you (ever/ read) that novel? Your brother (already/visit) Ha Long Bay, hasn't he? What about (play) video games? I (not watch) TV last night I (listen) to music Nam is interested in (play) soccer They didn't use to (travel) by cyclo III Rewrite the following sentences with the words given: She hasn't finished the letter yet -> The letter "I have a test today" Ba said -> Ba said that I'm sorry I can't go camping with you -> I wish "Do you have a computer, Mai?" the teacher asked -> The teacher My brother must repair all the appliances before Tet ->All the appliances "How old is your little son?" said the doctor to Mrs Brown -> The doctor asked The house was cheap so we bought it -> Because Does she go to school by car everyday? -> She goes , ? 
IV Match the two parts of the sentences: If it is fine a so he will pass his exam Paul is studying hard b we will come to see you You will miss the bus c so I can't tell you I don't know the answer d if you get up late If he studies hard e if it is rainy (137) 68 (138) We won't go f he will past the exam V Use the correct form of the words in brackets: We like living here because the people are (friend) In big cities, the streets are (crowd) with people and vehicles Tet is the most important (celebrate) for Vietnamese Internet also has some (limit) The End (139) 69 (140) Answer Key: I Choose the word which is pronounced differently from the others: b climate d liked a secondary c past b question II Give correct form of the verbs in bracket: has been - decorated would visit Have you ever read…? has already visited playing didn't watch - listened playing travel III Rewrite the sentences with the words given: The letter hasn't been written yet Ba said that he had a test that day I wish I could go camping with you The teacher asked Mai if she had a computer All the appliances must be repaired before Tet The doctor asked Mrs Brown how old her little son was Because the house was cheap, we bought it She goes to school by car everyday, doesn't she? 
IV Match the two parts of the sentences: b a d c f e V Use the correct form of the words in bracket: friendly crowded celebration limitations (141) 70 (142) Period 53: Fullname: Class: The first semester Test Subject: English Time: 45 minutes I Tìm từ có phần gạch chân phát âm khác so với từ còn lại: A house B hard C hour D history A sit B uniform C tennis D fine A date B father C fast D hard A Christmas B chicken C school D chemistry II Gạch chân phương án trả lời thích hợp để hoàn thành câu: My school is (next/ on/ between/ to the right) the post office and the hospital He usually gets up early (in/ on/ at/ from) Monday morning The museum opens at a.m and (closes/ finishes/ starts/ ends) at 5p.m Tom and Mary (don't usually have/ usually don't have/ isn't usually having/ usually isn't having) a summer holiday How (often/ long/ many/ much) is that blue shirt? There aren't (any/ some/ little/ a lot of) shop in my village She is interested (at/ to/ in/ on) English My mother takes care of sick people She is a (worker/ nurse/ teacher/ farmer) I like (these/ those/ there/ this) doll 10 - (How/ How often/ How far/ How long) is it from your house to school? - It's not far About one kilometer III Cho dạng đúng động từ ngoặc: There (be) some magazines on the table -> What you (buy) when you (go) shopping tomorrow? -> The children (do) their homework in their room now -> He sometimes (help) his mother the housework -> IV Viết lại câu cho nghiã không đổi: I spend hours doing my homework every day -> It takes Minh has two days off Nam has one day off -> Nam has The movie is very interesting -> What ! Let's play some computer games -> How about ? 
V Đọc đoạn văn sau trả lời câu hỏi: (143) 71 (144) Hello, my name is Nam This is the library in my school It is not very large, but it is nice In the library, there are a lot of books, novels, magazines, newspapers and picture books There is also a study area The library opens at seven o'clock in the morning and closes at 4.30 in the afternoon I often go there I like reading in the library Is the library in Nam's school large? What can you find in the library? What time does it open and close? Why does Nam often go to the library? Answer Keys: I point (0,25 x 4) C hour D fine A date II 2,5 points (0,25 x 10) between on closes don’t usually have much chicken any in nurse this 10 How far III 2,5 points (0,5 x 5) are will buy/ go are doing helps IV points (0,5 x 4) It takes me hours to my homework every day Nam has few days off than Minh (does) 3 What an interesting movie! How about playing some computer games? V points (0,5 x 4) No, it isn’t We can find a lot of books, novels, magazines, newspapers and picture books It opens at seven o’clock in the morning and closes at 4.30 in the afternoon (145) 72 (146) The second 45 - minute test (grade 8) Name:……………………………… Class: …… I Phonetics: Odd one out a festival b behavior c receive d report a great b please c greedy d reach a unfortunate b pronunciation c community d fund a proud b hour c your d sound a traditional b magically c assistance d organization II Make a syllable stress on each word different translation educate chemical dislike university signature engineer III Choose the best answer He (said to / told to) me about you My father can (make / do) very nice tables He (encourages / injures) me in my studying English You should (earn / raise) funds by selling young trees BSA is not very (similar / different) from Y & Y Mrs Nga looked very (happy / happily) this morning Can I participate (in / on) this program? 
IV Use the correct form of verbs I hope (avoid) the heavy traffic I used (live) in the country when we were young (Listen) to music is her favorite activity in free time I (go) to the concert tonight It (start) at 8.00 pm Do you think it (rain) tomorrow? She dislikes (talk) about herself V Rewrite these sentences with the unchanged meaning “Don’t turn off the radio when I am awake” my grandmother said to me - My grandmother………………………………………………………… He often rode a bicycle to the sea with his friends - He used…………………………………………………………………… My father always smokes cigarettes - My father is……………………………………………………………… Playing games is his interest - He is………………………………………………………………………… (147) “You should phone me at p.m.” he said to me 73 (148) - He said They are fluent speakers - They……………………………………………………………………… VI Use the correct form of the word Our teacher always …………… us to work harder (courage) By recycling, we can save ………… resources (nature) You must try to……… English words regularly (pronunciation) This restaurant ……….food from many parts of the country.(service) The End (149) 74 (150) PHẦN THỨ BA THƯ VIỆN CÂU HỎI VÀ BÀI TẬP Thư viện câu hỏi, bài tập là tiền đề để xây dựng Ngân hàng câu hỏi, phục vụ cho việc dạy và học các thày cô giáo và học sinh, đặc biệt là để đánh giá kết học tập học sinh Trong khuôn khổ tài liệu này chúng tôi nêu số vấn đề Xây dựng Thư viện câu hỏi và bài tập trên mạng internet Mục đích việc xây dựng Thư viện câu hỏi, bài tập trên mạng internet là nhằm cung cấp hệ thống các câu hỏi, bài tập có chất lượng để giáo viên tham khảo việc xây dựng đề kiểm tra nhằm đánh giá kết học tập học sinh theo chuẩn kiến thức, kĩ chương trình giáo dục phổ thông Các câu hỏi thư viện chủ yếu để sử dụng cho các loại hình kiểm tra: kiểm tra thường xuyên và kiểm tra định kì; dùng cho hình thức luyện tập và ôn tập Học sinh có thể tham khảo Thư viện câu hỏi, bài tập trên mạng internet để tự kiểm tra, đánh giá mức độ tiếp thu kiến thức và lực học; các đối tượng khác phụ huynh học 
students and other readers interested in general education to consult. In recent years a number of provincial Departments of Education and Training (Sở GDĐT), district Offices (Phòng GDĐT) and schools have already posted tests, questions and exercises on their own websites for teachers and students to use. For the question-and-exercise libraries of schools, of the Sở GDĐT and of the Bộ GDĐT to grow steadily richer, questions and tests with suggested answers must continue to be compiled and selected, and the number of questions and exercises, the font and font size, and the way each unit creates its files should be standardised. Drawing on the question banks submitted by the Sở and on other sources, the Bộ GDĐT has been editing, vetting and publishing material on its website, with guidance for teachers and students on how to use it. To build and use an online question-and-exercise library effectively, note the following points.

Về dạng câu hỏi (Question types)

Both kinds of question should be written: essay (constructed-response) questions and objective test questions (multiple choice, gap-filling, true-false, matching, and so on). Besides closed questions, which form the majority, there should be open questions (for the essay format), and some questions for assessing the results of practical and experimental work.

Về số lượng câu hỏi (Number of questions)

The number of questions for each topic of the general education curriculum (GDPT), a topic corresponding roughly to a textbook chapter, is the number of periods allocated to that chapter in the official teaching schedule multiplied by the prescribed minimum number of questions per period, with at least a minimum number of questions for each standard to be assessed. Questions should be added every year so that the bank keeps growing. For each subject, the proportion of each question type in the total is for the subject teams to discuss and decide, with priority given to multiple-choice questions and essay questions. Across the cognitive levels (recognition, comprehension, application), set a proportion of questions for each level that suits the objectives of the topic, with an adequate share of application questions, especially application to real life. The choice of topics and the number and type of questions should be considered in close relation to the official teaching schedule, the chapters and sections of the textbook, and the regulations on periodic and regular testing. The number of questions depends on the number of topics and on the knowledge and skills standards (chuẩn KT, KN) of each topic in the curriculum. Each subject team should discuss and agree on the number of questions for each topic.

Yêu cầu về câu hỏi (Requirements for questions)

- Questions and exercises must be based on the knowledge and skills standards of the curriculum issued by the Bộ GDĐT, and must meet the requirements on theory, practice and skills of one subject or of several subjects in an integrated way.
- Questions must satisfy the criteria set out in Part One (page 3).
- They must clearly reflect the subject, the level, the grade and the topic of the subject they belong to.
- Content must be presented concretely, in clear, plain, easily understood wording.
- They must assess students on all three criteria: knowledge, skills and attitude.

Định dạng văn bản (Document format)

Questions and exercises should be edited as files and printed on paper for vetting and storage. Use the Times New Roman font at size 14. Each question or exercise may be written on a form such as:

BIÊN SOẠN CÂU HỎI (QUESTION SHEET)
Question identification code: ______   Subject: ______
General information
* Grade: ______   Semester: ______
* Topic: ______
* Standard to be assessed: ______
QUESTION AREA
ANSWER GUIDE OR KEY

Các bước tiến hành biên soạn câu hỏi của mỗi môn học (Steps in writing the questions for a subject)

Bước 1 (Step 1): Analyse the knowledge and skills standards of the subject curriculum, by grade and by topic, to choose the content and the standards to be assessed, adjusting them to fit the curriculum and the textbook.

Bước 2 (Step 2): Build a "question-count matrix" (or the matrix of a test) for each topic, specifying the number of questions for each sub-topic, the number of objective and of essay questions for each standard to be assessed and each cognitive level (with at least the minimum number of questions per standard). Build a coding system that matches the content structure set up in Step 1.

Illustrative example: SYSTEM OF TOPICS AND CORRESPONDING QUESTION COUNTS, Chapter I, Grade 9: Square roots and cube roots. [The original table plans, for each topic, the number of objective (TN) and essay (TL) questions at each of four levels (recognition, comprehension, low-level application, high-level application), with row and column totals. The topics and standards are: 1. The concept of square root - 1.1 (KT): understand the square root of a non-negative number, the radical notation, the distinction between the positive and negative square roots of a positive number, and the definition of the arithmetic square root; 1.2 (KN): compute the square root of a number or expression that is the square of another number or expression. 2. Operations and simple transformations of square roots - 2.1 (KN): carry out the operations on square roots (square root of a product and multiplication of radicals, square root of a quotient and division of radicals); 2.2: carry out simple transformations (taking a factor out of or into the radical sign, clearing the radicand's denominator, rationalising denominators); 2.3: use tables and pocket calculators to find the square root of a given positive number. 3. Cube roots - 3.1 (KT): understand the concept of the cube root of a real number; 3.2 (KN): compute the cube roots of numbers expressible as the cube of another number. The individual question counts in this copy are too garbled to be reliably recovered.]

Bước 3 (Step 3): Write the questions according to the matrix. Pay attention to: Where will the questions come from? How qualified are the question writers? How will the questions be kept secure?

Bước 4 (Step 4): Organise the vetting and evaluation of the questions. If conditions allow, trial the questions in practice on a representative sample of students.

Bước 5 (Step 5): Adjust the questions where necessary, finalise the question set and enter it in the library:
- design the question-library system on computer;
- secure the question bank;
- define how questions are stored and retrieved, and how tests are built from them;
- prepare a user handbook and train users of the library.

Sử dụng câu hỏi của mỗi môn học trong thư viện câu hỏi (Using the questions in the library)

For teachers: consult the questions, weigh each question's level against the standard to be tested, and use them to build tests or for revision and for systematising knowledge for students, in line with the knowledge and skills standards of the curriculum.

For students: retrieve questions, work through them and assess their own ability against the knowledge and skills standards of the curriculum, and from that draw lessons and set directions for their own study.

For parents: retrieve questions suited to the curriculum their children are studying and the goals the children are aiming at, set them for the children to do, assess the children's ability against the standards, and from that help shape the children's study.

PHẦN THỨ TƯ
HƯỚNG DẪN TỔ CHỨC TẬP HUẤN TẠI CÁC ĐỊA PHƯƠNG (PART FOUR: GUIDANCE ON ORGANISING LOCAL TRAINING)

- The content and format of local training should follow the training that the Bộ GD&ĐT has given to the core teachers.
- Study the objectives, content, participants and conditions of the training.
- Draw up a detailed plan for each training course (time, venue, numbers, requirements).
- Identify needs and evaluate the results of each course through survey and questionnaire forms (before and after the course).
- Organise the work around the teachers' own activities; presenters should talk little and create conditions for all teachers to take an active part.
- The expected outcome is that teachers firmly grasp the content of the knowledge and skills standards and can put it into practice in teaching and in testing and assessment. Specifically:

Đối với cán bộ quản lý (For administrators)

- Firmly grasp the Party's and the State's policy on renewing general education, and the purposes, requirements and content of that renewal as set out in the Ministry's guiding documents on the curriculum and textbooks, teaching methods, the use of teaching aids and equipment, forms of classroom organisation, and testing and assessment.
- Firmly grasp the requirement to teach close to the knowledge and skills standards of the curriculum while actively renewing teaching methods through testing and assessment.
- Have effective measures for managing and carrying out the renewal of teaching methods; regularly monitor and assess teaching that follows the standards while actively renewing methods.
- Promptly encourage and reward teachers who carry out the renewal of testing and assessment effectively.

Đối với giáo viên (For teachers)

- Study the curriculum, the teacher's book and the guide to implementing the knowledge and skills standards carefully, so as to set objectives lesson by lesson and design lessons that achieve the basic, minimum requirements on knowledge and skills, aiming at accurate, objective and comprehensive testing and assessment.
- On the basis of the knowledge and skills requirements in the standards guide, apply teaching methods and techniques creatively and flexibly so as to develop students' activeness, initiative, creativity and self-direction in learning.
- Design questions and exercises and guide students in discussing and answering them, so that students firmly grasp and understand the required knowledge and skills.
- Diversify the forms of classroom organisation to interest students and thereby help them grasp and understand the standards of the curriculum deeply, so as to meet the objectives set in testing and assessment.
- When teaching to the standards, make effective use of teaching equipment and apply information technology in teaching in a sensible way.

PHỤ LỤC (APPENDIX)

BỘ GIÁO DỤC VÀ ĐÀO TẠO
BIÊN SOẠN ĐỀ KIỂM TRA (WRITING A TEST)
(Attached to Official Letter No. …/BGDĐT-GDTrH of … December 2010 of the Bộ GDĐT)

Assessing students' learning outcomes is an important activity in the educational process. It is the process of collecting and processing information about students' attainment of the learning objectives, as a basis for the teacher's pedagogical adjustments, for decisions at every level of educational management, and for the students themselves, so that they learn with better results. Such assessment should combine many instruments, methods and formats. The written test is one of the most widely used instruments for assessing learning outcomes. Writing a test should follow this procedure:

Bước 1 (Step 1). Xác định mục đích của đề kiểm tra (Define the purpose of the test)

A test is an instrument for assessing learning outcomes after students finish a topic, a chapter, a semester, a grade or a level of schooling. The test writer should therefore base the purpose of the test on the specific aims and requirements of the occasion, on the knowledge and skills standards of the curriculum, and on the students' actual learning.

Bước 2 (Step 2). Xác định hình thức đề kiểm tra (Decide the format of the test)

A written test can take one of these formats:
1) an essay (constructed-response) test;
2) an objective test;
3) a test combining the two formats, with both essay questions and objective questions.

Each format has its own strengths and limitations, so the formats should be combined sensibly to fit the content of the test and the character of the subject, in order to raise effectiveness and allow learning outcomes to be assessed more accurately.
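Since the combined format requires the two parts' raw scores to be merged and converted to the 10-point scale, the conversion rules given later in this appendix (under "Cách tính điểm" in Step 5) can be sketched in a few lines of Python. This is only an illustration of the arithmetic in the circular; the function names are ours, not the document's.

```python
# Sketch of the score-conversion rules in this appendix ("Cách tính điểm").
# Function and variable names are illustrative, not from the official circular.

def scale_to_10(raw_score: float, max_score: float) -> float:
    """Convert a raw score X to the 10-point scale: 10 * X / X_max."""
    return 10 * raw_score / max_score

def essay_max_score(mcq_max: float, essay_minutes: float, mcq_minutes: float) -> float:
    """Marks for the essay part, proportional to time: X_TL = X_TN * T_TL / T_TN."""
    return mcq_max * essay_minutes / mcq_minutes

# Example 1 (objective test): 40 items, 1 mark each, 32 correct.
print(scale_to_10(32, 40))            # 8.0 on the 10-point scale

# Example 2 (combined test): 12 objective items (12 marks),
# with 40% of the time for the objective part and 60% for the essay part.
mcq_max = 12
tl_max = essay_max_score(mcq_max, essay_minutes=60, mcq_minutes=40)  # 18.0
total_max = mcq_max + tl_max                                         # 30.0
print(scale_to_10(27, total_max))     # a student scoring 27/30 -> 9.0
```

Both printed results match the worked examples in the appendix (10·32/40 = 8 and 10·27/30 = 9).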
If the test combines the two formats, there should be several versions of the paper, and students should sit the objective part separately from the essay part: they do the objective questions first, the papers are collected, and then they do the essay part.

Bước 3 (Step 3). Thiết lập ma trận đề kiểm tra (Set up the test matrix, the table describing the test's criteria)

Draw up a two-dimensional table: one dimension is the content, that is, the main strands of knowledge and skills to be assessed; the other is the students' cognitive levels: recognition, comprehension and application (the latter split into low-level and high-level application). Each cell holds the curriculum standard to be assessed, the percentage of the total mark, the number of questions, and the total mark of those questions. The number of questions in a cell depends on the importance of the standard to be assessed, the time allowed for the test, and the weight of marks assigned to that content strand and that cognitive level.

Steps in setting up the matrix:
B1. List the topics (content areas, chapters, ...) to be tested.
B2. Write the standards to be assessed at each cognitive level.
B3. Decide how to distribute the percentage of the total mark across the topics (content areas, chapters, ...).
B4. Decide the total mark of the test.
B5. Compute the mark for each topic from its percentage.
B6. Compute the percentage and the mark, and decide the number of questions, for each standard.
B7. Compute the total mark and the total number of questions in each column.
B8. Compute the percentage of the total mark allocated to each column.
B9. Review the matrix and revise it where necessary.

Points to note:
- When writing the standards to be assessed at each cognitive level:
+ Choose standards that play an important role in the subject curriculum: those given the most time in the official teaching schedule and those that other standards build on.
+ Each topic (content area, chapter, ...) should have at least one representative standard chosen for assessment.
+ The number of standards assessed in each topic should correspond to the time the teaching schedule allocates to that topic. Skill standards, and standards demanding higher-order thinking (application), should be relatively numerous.
- When deciding the percentage of the total mark for each topic (content area, chapter, ...): base it on the purpose of the test, on the topic's importance in the curriculum, and on the time allocated in the teaching schedule.
- When computing marks and deciding the number of questions for each standard:
+ Base the percentage of marks for each standard to be assessed, within each topic and each row, on the purpose of the test. Across the three levels of recognition, comprehension and application, the ratio should suit the topic, the content, and the students' level and ability.
+ Use the marks fixed in B5 to decide the marks and the corresponding number of questions; every objective (TNKQ) question must carry the same number of marks.
+ If the test combines objective and essay questions, fix a suitable percentage of the total mark for each format.

Bước 4 (Step 4). Biên soạn câu hỏi theo ma trận (Write the questions to the matrix)

Writing the questions must observe the principle that the type, number and content of the questions are as laid down in the matrix. For the questions to be of good quality, they should satisfy the following requirements (given here for the two question types most often used in tests):

a. Requirements for multiple-choice questions
1) The question must assess important content of the curriculum;
2) It must fit the test's criteria in presentation and in the marks assigned;
3) The stem must pose a direct question about a specific point;
4) Do not quote sentences verbatim from the textbook;
5) The wording and structure of the question must be clear and easy for students to understand;
6) Each distractor must look plausible to a student who has not mastered the material;
7) Each wrong option should be built on common student errors or misconceptions;
8) The key to one question must be independent of the keys to the other questions in the test;
9) The options must be consistent with, and fit the content of, the stem;
10) Each question has one and only one correct, most precise answer;
11) Do not use the options "All of the above are correct" or "None of the above is correct".

b. Requirements for essay questions
1) The question must assess important content of the curriculum;
2) It must fit the test's criteria in presentation and in the marks assigned;
3) It should require students to apply knowledge to new situations;
4) It must show clearly the content and the cognitive level to be measured;
5) It must state its demands and give concrete instructions on how to meet them;
6) Its demands must suit the students' level and awareness;
7) It should require students to understand rather than merely recall concepts and information;
8) Its language must convey all of the test writer's demands to the student;
9) It should give guidance on the expected length of the answer, the time for writing it, and the criteria to be met;
10) If it asks students to state a position and argue for it, it must make clear that the answer will be judged on the logic of the argument the student gives in support and defence of the position, not simply on the position itself.

Bước 5 (Step 5). Xây dựng hướng dẫn chấm (đáp án) và thang điểm (Build the marking guide and mark scheme)

The marking guide (key) and mark scheme must be scientifically sound and accurate in content; concrete and detailed, yet concise and clear, in presentation; and consistent with the test matrix (working towards descriptions of attainment levels that students can use to assess themselves).

Cách tính điểm (Computing the scores)

a. Objective tests
Cách 1 (Method 1): Take the whole test as 10 marks and divide by the total number of questions. For example, a 40-question test gives 0.25 marks per question.
Cách 2 (Method 2): The total mark of the test equals the number of questions, with 1 mark for each correct answer and 0 for each wrong one. Then convert each student's score to the 10-point scale by

  10X / X_max,

where X is the student's score and X_max is the total mark of the test.
For example, on a 40-question test with 1 mark per correct answer, a student who scores 32 converts to 10·32/40 = 8 on the 10-point scale.

b. Tests combining essay and objective questions
Cách 1 (Method 1): The whole test is 10 marks. Distribute the marks between the essay (TL) and objective (TNKQ) parts in proportion to the time students are expected to spend on each, with every objective question carrying the same mark. For example, if 30% of the time goes to the objective part and 70% to the essay part, the parts carry 3 marks and 7 marks; with 12 objective questions, each correct answer earns 3/12 = 0.25 marks.
Cách 2 (Method 2): The total mark of the test is the sum of the two parts' marks, distributed in proportion to the expected time, with each correct objective answer worth 1 mark and each wrong one 0. Fix the objective part's marks first, then compute the essay part's marks by

  X_TL = X_TN · T_TL / T_TN,

where X_TN and X_TL are the marks of the objective and essay parts, and T_TN and T_TL are the times allowed for answering them. Then convert the student's score to the 10-point scale by 10X / X_max, where X is the student's score and X_max is the total mark of the test.
For example, if the matrix gives 40% of the time to the objective part and 60% to the essay part, and there are 12 objective questions, then the objective part carries 12 marks and the essay part carries X_TL = 12·60/40 = 18 marks. The whole test carries 12 + 18 = 30 marks, and a student who scores 27 converts to 10·27/30 = 9 on the 10-point scale.

c. Essay tests
Compute the marks strictly by steps B3 to B7 of "Setting up the test matrix" above. Teachers are encouraged to use rubric techniques in marking and scoring essay papers (see the literature on assessing students' learning outcomes).

Bước 6 (Step 6). Xem xét lại việc biên soạn đề kiểm tra (Review the finished test)

After the test has been written, review it as follows:
1) Check each question against the marking guide and mark scheme, find any errors or inaccuracies in the paper and in the key, and correct wording and content wherever necessary to ensure soundness and accuracy.
2) Check each question against the matrix: does it fit the standard to be assessed? Does it fit the cognitive level to be assessed? Is the mark appropriate? Is the expected time appropriate? (It is useful for the teacher to sit the test; a working time for the teacher of about 70% of the time planned for students is appropriate.)
3) Trial the test so as to keep adjusting it to the objectives, the curriculum standards and the students (where conditions allow; some software now supports this, and teachers may consult it).
4) Finalise the paper, the marking guide and the mark scheme.

1 Teaching and testing

Many language teachers harbour a deep mistrust of tests and of testers. The starting point for this book is the admission that this mistrust is frequently well-founded. It cannot be denied that a great deal of language testing is of very poor quality. Too often language tests have a harmful effect on teaching and learning; and too often they fail to measure accurately whatever it is they are intended to measure.

Backwash

The effect of testing on teaching and learning is known as backwash. Backwash can be harmful or beneficial. If a test is regarded as important, then preparation for it can come to dominate all teaching and learning activities. And if the test content and testing techniques are at variance with the objectives of the course, then there is likely to be harmful backwash. An instance of this would be where students are following an English course which is meant to train them in the language skills (including writing) necessary for university study in an English-speaking country, but where the language test which they have to take in order to be admitted to a university does not test those skills directly. If the skill of writing, for example, is tested only by multiple choice items, then there is great pressure to practise such items rather than practise the skill of writing itself. This is clearly undesirable.

We have just looked at a case of harmful backwash. However, backwash need not always be harmful; indeed it can be positively beneficial. I was once involved in the development of an English language test for an English medium university in a non-English-speaking country. The test was to be administered at the end of an intensive year of English study there and would be used to determine which students would be allowed to go on to their undergraduate courses (taught in English) and which would have to leave the university. A test was devised which was based directly on an analysis of the English language needs of first year undergraduate students, and which included tasks as similar as possible to those which they would have to perform as undergraduates (reading textbook materials, taking notes during lectures, and so on). The introduction of this test, in place of one which had been entirely multiple choice, had an immediate effect on teaching: the syllabus was redesigned, new books were chosen, classes were conducted differently. The result of these changes was that by the end of their year's training, in circumstances made particularly difficult by greatly increased numbers and limited resources, the students reached a much higher standard in English than had ever been achieved in the university's history. This was a case of beneficial backwash.

Davies (1968: i) has said that 'the good test is an obedient servant since it follows and apes the teaching'. I find it difficult to agree. The proper relationship between teaching and testing is surely that of partnership. It is true that there may be occasions when the teaching is good and appropriate and the testing is not; we are then likely to suffer from harmful backwash. This would seem to be the situation that leads Davies to confine testing to the role of servant of teaching. But equally there may be occasions when teaching is poor or inappropriate and when testing is able to exert a beneficial influence. We cannot expect testing only to follow teaching. What we should demand of it, however, is that it should be supportive of good teaching and, where necessary, exert a corrective influence on bad teaching. If testing always had a beneficial backwash on teaching, it would have a much better reputation amongst teachers. Chapter 6 of this book is devoted to a discussion of how beneficial backwash can be achieved.

Inaccurate tests

The second reason for mistrusting tests is that very often they fail to measure accurately whatever it is that they are intended to measure. Teachers know this. Students' true abilities are not always reflected in the test scores that they obtain. To a certain extent this is inevitable. Language abilities are not easy to measure; we cannot expect a level of accuracy comparable to those of measurements in the physical sciences. But we can expect greater accuracy than is frequently achieved.

Why are tests inaccurate? The causes of inaccuracy (and ways of minimising their effects) are identified and discussed in subsequent chapters, but a short answer is possible here. There are two main sources of inaccuracy. The first of these concerns test content and techniques. To return to an earlier example, if we want to know how well someone can write, there is absolutely no way we can get a really accurate measure of their ability by means of a multiple choice test. Professional testers have expended great effort, and not a little money, in attempts to do it; but they have always failed. We may be able to get an approximate measure, but that is all. When testing is carried out on a very large scale, when the scoring of tens of thousands of compositions might seem not to be a practical proposition, it is understandable that potentially greater accuracy is sacrificed for reasons of economy and convenience. But it does not give testing a good name!
And it does set a bad example. While few teachers would wish to follow that particular example in order to test writing ability, the overwhelming practice in large-scale testing of using multiple choice items does lead to imitation in circumstances where such items are not at all appropriate. What is more, the imitation tends to be of a very poor standard. Good multiple choice items are notoriously difficult to write. A great deal of time and effort has to go into their construction. Too many multiple choice tests are written where such care and attention is not given (and indeed may not be possible). The result is a set of poor items that cannot possibly provide accurate measurements. One of the principal aims of this book is to discourage the use of inappropriate techniques and to show that teacher-made tests can be superior in certain respects to their professional counterparts.

The second source of inaccuracy is lack of reliability. Reliability is a technical term which is explained in Chapter 5. For the moment it is enough to say that a test is reliable if it measures consistently. On a reliable test you can be confident that someone will get more or less the same score, whether they happen to take it on one particular day or on the next; whereas on an unreliable test the score is quite likely to be considerably different, depending on the day on which it is taken. Unreliability has two origins: features of the test itself, and the way it is scored.

In the first case, something about the test creates a tendency for individuals to perform significantly differently on different occasions when they might take the test. Their performance might be quite different if they took the test on, say, Wednesday rather than on the following day. As a result, even if the scoring of their performance on the test is perfectly accurate (that is, the scorers do not make any mistakes), they will nevertheless obtain a markedly different score, depending on when they actually sat the test, even though there has been no change in the ability which the test is meant to measure. This is not the place to list all possible features of a test which might make it unreliable, but examples are: unclear instructions, ambiguous questions, items that result in guessing on the part of the test takers. While it is not possible entirely to eliminate such differences in behaviour from one test administration to another (human beings are not machines), there are principles of test construction which can reduce them.

In the second case, equivalent test performances are accorded significantly different scores. For example, the same composition may be given very different scores by different markers (or even by the same marker on different occasions). Fortunately, there are well-understood ways of minimising such differences in scoring. Most (but not all) large testing organisations, to their credit, take every precaution to make their tests, and the scoring of them, as reliable as possible, and are generally highly successful in this respect. Small-scale testing, on the other hand, tends to be less reliable than it should be. Another aim of this book, then, is to show how to achieve greater reliability in testing. Advice on this is to be found in Chapter 5.

The need for tests

So far this chapter has been concerned to understand why tests are so mistrusted by many language teachers. We have seen that this mistrust is often justified. One conclusion drawn from this might be that we would be better off without language tests. Teaching is, after all, the primary activity; if testing comes in conflict with it, then it is testing which should go, especially when it has been admitted that so much testing provides inaccurate information. This is a plausible argument - but there are other considerations, which might lead to a different conclusion.

Information about people's language ability is often very useful and sometimes necessary. It is difficult to imagine, for example, British and American universities accepting students from overseas without some knowledge of their proficiency in English. The same is true for organisations hiring interpreters or translators. They certainly need dependable measures of language ability. Within teaching systems, too, as long as it is thought appropriate for individuals to be given a statement of what they have achieved in a second or foreign language, then tests of some kind or other will be needed.¹ They will also be needed in order to provide information about the achievement of groups of learners, without which it is difficult to see how rational educational decisions can be made. While for some purposes teachers' assessments of their own students are both appropriate and sufficient, this is not true for the cases just mentioned. Even without considering the possibility of bias, we have to recognise the need for a common yardstick, which tests provide, in order to make meaningful comparisons.

If it is accepted that tests are necessary, and if we care about testing and its effect on teaching and learning, the other conclusion (in my view, the correct one) to be drawn from a recognition of the poor quality of so much testing is that we should do everything that we can to improve the practice of testing.

1. It will become clear that in this book the word 'test' is interpreted widely. It is used to refer to any structured attempt to measure language ability. No distinction is made between 'examination' and 'test'.

What is to be done?
The teaching profession can make two contributions to the improvement of testing: they can write better tests themselves, and they can put pressure on others, including professional testers and examining boards, to improve their tests. This book represents an attempt to help them do both. For the reader who doubts that teachers can influence the large testing institutions, let this chapter end with a further reference to the testing of writing through multiple choice items. This was the practice followed by those responsible for TOEFL (Test of English as a Foreign Language), the test taken by most non-native speakers of English applying to North American universities. Over a period of many years they maintained that it was simply not possible to test the writing ability of hundreds of thousands of candidates by means of a composition: it was impracticable and the results, anyhow, would be unreliable. Yet in 1986 a writing test (Test of Written English), in which candidates actually have to write for thirty minutes, was introduced as a supplement to TOEFL, and already many colleges in the United States are requiring applicants to take this test in addition to TOEFL. The principal reason given for this change was pressure from English language teachers who had finally convinced those responsible for the TOEFL of the overriding need for a writing task which would provide beneficial backwash.

READER ACTIVITIES

1. Think of tests with which you are familiar (the tests may be international or local, written by professionals or by teachers). What do you think the backwash effect of each of them is? Harmful or beneficial? What are your reasons for coming to these conclusions?
2. Consider these tests again. Do you think that they give accurate or inaccurate information? What are your reasons for coming to these conclusions?
Further reading

For an account of how the introduction of a new test can have a striking beneficial effect on teaching and learning, see Hughes (1988a). For a review of the new TOEFL writing test which acknowledges its potential beneficial backwash effect, but which also points out that the narrow range of writing tasks set (they are of only two types) may result in narrow training in writing, see Greenberg (1986). For a discussion of the ethics of language testing, see Spolsky (1981).

2 Testing as problem solving: an overview of the book

The purpose of this chapter is to introduce readers to the idea of testing as problem solving and to show how the content and structure of the book are designed to help them to become successful solvers of testing problems.

Language testers are sometimes asked to say what is 'the best test' or 'the best testing technique'. Such questions reveal a misunderstanding of what is involved in the practice of language testing. In fact there is no best test or best technique. A test which proves ideal for one purpose may be quite useless for another; a technique which may work very well in one situation can be entirely inappropriate in another. As we saw in the previous chapter, what suits large testing corporations may be quite out of place in the tests of teaching institutions. In the same way, two teaching institutions may require very different tests, depending amongst other things on the objectives of their courses, the purpose and importance of the tests, and the resources that are available. The assumption that has to be made therefore is that each testing situation is unique and so sets a particular testing problem. It is the tester's job to provide the best solution to that problem. The aims of this book are to equip readers with the basic knowledge and techniques first to solve such problems, secondly to evaluate the solutions proposed or already implemented by others, and thirdly to argue persuasively for improvements in testing practice where these seem necessary.

In every situation the first step must be to state the testing problem as clearly as possible. Without a clear statement of the problem it is hard to arrive at the right solution. Every testing problem can be expressed in the same general terms: we want to create a test or testing system which will:
- consistently provide accurate measures of precisely the abilities¹ in which we are interested;
- have a beneficial effect on teaching (in those cases where the tests are likely to influence teaching);
- be economical in terms of time and money.

1. 'Abilities' is not being used here in any technical sense. It refers simply to what people can do in, or with, a language. It could, for example, include the ability to converse fluently in a language, as well as the ability to recite grammatical rules (if that is something which we are interested in measuring!). It does not, however, include language aptitude.
objective and subjective testing.

In stating the testing problem in general terms above, we spoke of providing consistent measures of precisely the abilities we are interested in. A test which does this is said to be 'valid'. Chapter 4 addresses itself to various kinds of validity. It provides advice on the achievement of validity in test construction and shows how validity is measured.

The word 'consistently' was used in the statement of the testing problem. The consistency with which accurate measurements are made is in fact an essential ingredient of validity. If a test measures consistently (if, for example, a person's score on the test is likely to be very similar regardless of whether they happen to take it on, say, Monday morning rather than on Tuesday afternoon, assuming that there has been no significant change in their ability), it is said to be reliable. Reliability, already referred to in the previous chapter, is an absolutely essential quality of tests: what use is a test if it will give widely differing estimates of an individual's (unchanged) ability?
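The idea of consistency can be made concrete with a little arithmetic. One common way of estimating this kind of reliability (a standard statistical operation, of the sort the book's appendix deals with, not an example taken from the book itself) is to correlate two sets of scores obtained from the same candidates on two administrations of the same test. A minimal Python sketch, with invented scores for ten candidates:

```python
# Illustrative sketch (invented data): estimating test-retest reliability
# as the Pearson correlation between two administrations of the same test
# to the same candidates, assuming their ability has not changed.

def pearson_r(xs, ys):
    """Pearson product-moment correlation coefficient of two score lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x ** 0.5 * var_y ** 0.5)

# Scores of ten candidates on Monday morning and again on Tuesday
# afternoon (both lists invented for demonstration).
monday  = [52, 64, 71, 45, 80, 58, 67, 73, 49, 61]
tuesday = [55, 62, 74, 44, 78, 60, 65, 75, 51, 60]

r = pearson_r(monday, tuesday)
print(f"test-retest reliability estimate: r = {r:.2f}")
```

A coefficient close to 1 indicates that candidates are ranked almost identically on the two occasions; a test that gave widely differing estimates of unchanged ability would show a much lower value.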
Yet reliability is something which is distinctly lacking in very many teacher-made tests. Chapter 5 gives advice on how to achieve reliability and explains the ways in which it is measured.

The concept of backwash effect was introduced in the previous chapter. Chapter 6 identifies a number of conditions for tests to meet in order to achieve beneficial backwash.

All tests cost time and money: to prepare, administer, score and interpret. Time and money are in limited supply, and so there is often likely to be a conflict between what appears to be a perfect testing solution in a particular situation and considerations of practicality. This issue is also discussed later in the book.

To rephrase the general testing problem identified above: the basic problem is to develop tests which are valid and reliable, which have a beneficial backwash effect on teaching (where this is relevant), and which are practical. The next four chapters of the book are intended to look more closely at the relevant concepts and so help the reader to formulate such problems clearly in particular instances, and to provide advice on how to approach their solution.

The second half of the book is devoted to more detailed advice on the construction and use of tests, the putting into practice of the principles outlined in earlier chapters. Chapter 7 outlines and exemplifies the various stages of test construction. Chapter 8 discusses a number of testing techniques. Chapters 9-13 show how a variety of language abilities can best be tested, particularly within teaching institutions. Chapter 14 gives straightforward advice on the administration of tests. We have to say something
about statistics. Some understanding of statistics is useful, indeed necessary, for a proper appreciation of testing matters and for successful problem solving. At the same time, we have to recognise that there is a limit to what many readers will be prepared to do, especially if they are at all afraid of mathematics. For this reason, statistical matters are kept to a minimum and are presented in terms that everyone should be able to grasp. The emphasis will be on interpretation rather than on calculation. For the more adventurous reader, however, Appendix 1 explains how to carry out a number of statistical operations.

Further reading

The collection of critical reviews of nearly 50 English language tests (mostly British and American), edited by Alderson, Krahnke and Stansfield (1987), reveals how well professional test writers are thought to have solved their problems. A full understanding of the reviews will depend to some degree on an assimilation of the content of Chapters 3, 4 and 5 of this book.

Kinds of test and testing

This chapter begins by considering the purposes for which language testing is carried out. It goes on to make a number of distinctions: between direct and indirect testing, between discrete point and integrative testing, between norm-referenced and criterion-referenced testing, and between objective and subjective testing. Finally there is a note on communicative language testing.

We use tests to obtain information. The information that we hope to obtain will of course vary from situation to situation. It is possible, nevertheless, to categorise tests according to a small number of kinds of information being sought. This categorisation will prove useful both in deciding whether an existing test is suitable for a particular purpose and in writing appropriate new tests where these are necessary. The four types of test which we will discuss in the following sections are: proficiency tests, achievement tests, diagnostic tests, and placement tests.

Proficiency tests
Proficiency tests are designed to measure people's ability in a language regardless of any training they may have had in that language. The content of a proficiency test, therefore, is not based on the content or objectives of language courses which people taking the test may have followed. Rather, it is based on a specification of what candidates have to be able to do in the language in order to be considered proficient. This raises the question of what we mean by the word 'proficient'.

In the case of some proficiency tests, 'proficient' means having sufficient command of the language for a particular purpose. An example of this would be a test designed to discover whether someone can function successfully as a United Nations translator. Another example would be a test used to determine whether a student's English is good enough to follow a course of study at a British university. Such a test may even attempt to take into account the level and kind of English needed to follow courses in particular subject areas. It might, for example, have one form of the test for arts subjects, another for sciences, and so on. Whatever the particular purpose to which the language is to be put, this will be reflected in the specification of test content at an early stage of a test's development.

There are other proficiency tests which, by contrast, do not have any occupation or course of study in mind. For them the concept of proficiency is more general. British examples of these would be the Cambridge examinations (First Certificate Examination and Proficiency Examination) and the Oxford EFL examinations (Preliminary and Higher). The function of these tests is to show whether candidates have reached a certain standard with respect to certain specified abilities. Such examining bodies are independent of the teaching institutions and so can be relied on by potential employers etc. to make fair comparisons between candidates from different institutions and
different countries.

Though there is no particular purpose in mind for the language, these general proficiency tests should have detailed specifications saying just what it is that successful candidates will have demonstrated that they can do. Each test should be seen to be based directly on these specifications. All users of a test (teachers, students, employers, etc.) can then judge whether the test is suitable for them, and can interpret test results. It is not enough to have some vague notion of proficiency, however prestigious the testing body concerned.

Despite differences between them of content and level of difficulty, all proficiency tests have in common the fact that they are not based on courses that candidates may have previously taken. On the other hand, as we saw in Chapter 1, such tests may themselves exercise considerable influence over the method and content of language courses. Their backwash effect, for this is what it is, may be beneficial or harmful. In my view, the effect of some widely used proficiency tests is more harmful than beneficial. However, the teachers of students who take such tests, and whose work suffers from a harmful backwash effect, may be able to exercise more influence over the testing organisations concerned than they realise. The recent addition to TOEFL, referred to in Chapter 1, is a case in point.

Achievement tests

Most teachers are unlikely to be responsible for proficiency tests. It is much more probable that they will be involved in the preparation and use of achievement tests. In contrast to proficiency tests, achievement tests are directly related to language courses, their purpose being to establish how successful individual students, groups of students, or the courses themselves have been in achieving objectives. They are of two kinds: final achievement tests and progress achievement tests.

Final achievement tests are those administered at the end of a course of study. They may be written and administered by ministries of education, official examining boards, or by members of teaching institutions. Clearly the content of these tests must be
related to the courses with which they are concerned, but the nature of this relationship is a matter of disagreement amongst language testers.

In the view of some testers, the content of a final achievement test should be based directly on a detailed course syllabus or on the books and other materials used. This has been referred to as the 'syllabus-content approach'. It has an obvious appeal, since the test only contains what it is thought that the students have actually encountered, and thus can be considered, in this respect at least, a fair test. The disadvantage is that if the syllabus is badly designed, or the books and other materials are badly chosen, then the results of a test can be very misleading. Successful performance on the test may not truly indicate successful achievement of course objectives.

For example, a course may have as an objective the development of conversational ability, but the course itself and the test may require students only to utter carefully prepared statements about their home town, the weather, or whatever. Another course may aim to develop a reading ability in German, but the test may limit itself to the vocabulary the students are known to have met. Yet another course is intended to prepare students for university study in English, but the syllabus (and so the course and the test) may not include listening (with note taking) to English delivered in lecture style on topics of the kind that the students will have to deal with at university. In each of these examples, all of them based on actual cases, test results will fail to show what students have achieved in terms of course objectives.

The alternative approach is to base the test content directly on the objectives of the course. This has a number of advantages. First, it compels course designers to be explicit about objectives.
Secondly, it makes it possible for performance on the test to show just how far students have achieved those objectives. This in turn puts pressure on those responsible for the syllabus and for the selection of books and materials to ensure that these are consistent with the course objectives. Tests based on objectives work against the perpetuation of poor teaching practice, something which course-content-based tests, almost as if part of a conspiracy, fail to do. It is my belief that to base test content on course objectives is much to be preferred: it will provide more accurate information about individual and group achievement, and it is likely to promote a more beneficial backwash effect on teaching.

Of course, if objectives are unrealistic, then tests will also reveal a failure to achieve them. This too can only be regarded as salutary. There may be disagreement as to why there has been a failure to achieve the objectives, but at least this provides a starting point for necessary discussion which otherwise might never have taken place.

Now it might be argued that to base test content on objectives rather than on course content is unfair to students. If the course content does not fit well with objectives, they will be expected to do things for which they have not been prepared. In a sense this is true. But in another sense it is not. If a test is based on the content of a poor or inappropriate course, the students taking it will be misled as to the extent of their achievement and the quality of the course. Whereas if the test is based on objectives, not only will the information it gives be more useful, but there is less chance of the course surviving in its present unsatisfactory form. Initially some students may suffer, but future students will benefit from the pressure for change. The long-term interests of students are best served by final achievement tests whose content is based on course objectives.

The reader may wonder at this
stage whether there is any real difference between final achievement tests and proficiency tests. If a test is based on the objectives of a course, and these are equivalent to the language needs on which a proficiency test is based, then there is no reason to expect a difference between the form and content of the two tests. Two things have to be remembered, however. First, objectives and needs will not typically coincide in this way. Secondly, many achievement tests are not in fact based on course objectives. These facts have implications both for the users of test results and for test writers. Test users have to know on what basis an achievement test has been constructed, and be aware of the possibly limited validity and applicability of test scores. Test writers, on the other hand, must create achievement tests which reflect the objectives of a particular course, and not expect a general proficiency test (or some imitation of it) to provide a satisfactory alternative.

Progress achievement tests, as their name suggests, are intended to measure the progress that students are making. Since 'progress' is towards the achievement of course objectives, these tests too should relate to objectives. But how?
One way of measuring progress would be repeatedly to administer final achievement tests, the (hopefully) increasing scores indicating the progress made. This is not really feasible, particularly in the early stages of a course. The low scores obtained would be discouraging to students and quite possibly to their teachers. The alternative is to establish a series of well-defined short-term objectives. These should make a clear progression towards the final achievement test based on course objectives. Then if the syllabus and teaching are appropriate to these objectives, progress tests based on short-term objectives will fit well with what has been taught. If not, there will be pressure to create a better fit. If it is the syllabus that is at fault, it is the tester's responsibility to make clear that it is there that change is needed, not in the tests.

In addition to more formal achievement tests which require careful preparation, teachers should feel free to set their own 'pop quizzes'. These serve both to make a rough check on students' progress and to keep students on their toes. Since such tests will not form part of formal assessment procedures, their construction and scoring need not be too rigorous. Nevertheless, they should be seen as measuring progress towards the intermediate objectives on which the more formal progress achievement tests are based. They can, however, reflect the particular 'route' that an individual teacher is taking towards the achievement of objectives.

It has been argued in this section that it is better to base the content of achievement tests on course objectives rather than on the detailed content of a course. However, it may not be at all easy to convince colleagues of this, especially if the latter approach is already being followed. Not only is there likely to be natural resistance to change, but such a change may represent a threat to many people. A great deal of skill, tact and, possibly, political manoeuvring
may be called for, topics on which this book cannot pretend to give advice.

Diagnostic tests

Diagnostic tests are used to identify students' strengths and weaknesses. They are intended primarily to ascertain what further teaching is necessary. At the level of broad language skills this is reasonably straightforward. We can be fairly confident of our ability to create tests that will tell us that a student is particularly weak in, say, speaking as opposed to reading in a language. Indeed existing proficiency tests may often prove adequate for this purpose.

We may be able to go further, analysing samples of a student's performance in writing or speaking in order to create profiles of the student's ability with respect to such categories as 'grammatical accuracy' or 'linguistic appropriacy'. (See a later chapter for a scoring system that may provide such an analysis.) But it is not so easy to obtain a detailed analysis of a student's command of grammatical structures, something which would tell us, for example, whether she or he had mastered the present perfect/past tense distinction in English. In order to be sure of this, we would need a number of examples of the choice the student made between the two structures in every different context which we thought was significantly different and important enough to warrant obtaining information on. A single example of each would not be enough, since a student might give the correct response by chance.

As a result, a comprehensive diagnostic test of English grammar would be vast (think of what would be involved in testing the modal verbs, for instance). The size of such a test would make it impractical to administer in a routine fashion. For this reason, very few tests are constructed for purely diagnostic purposes, and those that there are do not provide very detailed information. The lack of good diagnostic tests is unfortunate. They could be extremely useful for individualised
instruction or self-instruction. Learners would be shown where gaps exist in their command of the language, and could be directed to sources of information, exemplification and practice. Happily, the ready availability of relatively inexpensive computers with very large memories may change the situation. Well-written computer programmes would ensure that the learner spent no more time than was absolutely necessary to obtain the desired information, and without the need for a test administrator. Tests of this kind will still need a tremendous amount of work to produce. Whether or not they become generally available will depend on the willingness of individuals to write them and of publishers to distribute them.

Placement tests

Placement tests, as their name suggests, are intended to provide information which will help to place students at the stage (or in the part) of the teaching programme most appropriate to their abilities. Typically they are used to assign students to classes at different levels.

Placement tests can be bought, but this is not to be recommended unless the institution concerned is quite sure that the test being considered suits its particular teaching programme. No one placement test will work for every institution, and the initial assumption about any test that is commercially available must be that it will not work well. The placement tests which are most successful are those constructed for particular situations. They depend on the identification of the key features at different levels of teaching in the institution. They are tailor-made rather than bought off the peg. This usually means that they have been produced 'in house'. The work that goes into their construction is rewarded by the saving in time and effort through accurate placement. An example of how a placement test might be designed within an institution is given in Chapter 7; the validation of placement tests is referred to later in the book.

Direct versus indirect testing

So far in this chapter we have
considered a number of uses to which test results are put. We now distinguish between two approaches to test construction.

Testing is said to be direct when it requires the candidate to perform precisely the skill which we wish to measure. If we want to know how well candidates can write compositions, we get them to write compositions. If we want to know how well they pronounce a language, we get them to speak. The tasks, and the texts which are used, should be as authentic as possible. The fact that candidates are aware that they are in a test situation means that the tasks cannot be really authentic. Nevertheless the effort is made to make them as realistic as possible.

Direct testing is easier to carry out when it is intended to measure the productive skills of speaking and writing. The very acts of speaking and writing provide us with information about the candidate's ability. With listening and reading, however, it is necessary to get candidates not only to listen or read but also to demonstrate that they have done this successfully. The tester has to devise methods of eliciting such evidence accurately and without the method interfering with the performance of the skills in which he or she is interested. Appropriate methods for achieving this are discussed in Chapters 11 and 12. Interestingly enough, in many texts on language testing it is the testing of productive skills that is presented as being most problematic, for reasons usually connected with reliability. In fact the problems are by no means insurmountable, as we shall see in Chapters 9 and 10.

Direct testing has a number of attractions. First, provided that we are clear about just what abilities we want to assess, it is relatively straightforward to create the conditions which will elicit the behaviour on which to base our judgements. Secondly, at least in the case of the productive skills, the assessment and interpretation of students' performance is also quite straightforward. Thirdly, since practice for
the test involves practice of the skills that we wish to foster, there is likely to be a helpful backwash effect.

Indirect testing attempts to measure the abilities which underlie the skills in which we are interested. One section of the TOEFL, for example, was developed as an indirect measure of writing ability. It contains items of the following kind:

    At first the old woman seemed unwilling to accept anything that was offered her by my friend.

where the candidate has to identify which of the underlined elements is erroneous or inappropriate in formal standard English. While the ability to respond to such items has been shown to be related statistically to the ability to write compositions (though the strength of the relationship was not particularly great), it is clearly not the same thing. Another example of indirect testing is Lado's (1961) proposed method of testing pronunciation ability by a paper and pencil test in which the candidate has to identify pairs of words which rhyme with each other.

Perhaps the main appeal of indirect testing is that it seems to offer the possibility of testing a representative sample of a finite number of abilities which underlie a potentially indefinitely large number of manifestations of them. If, for example, we take a representative sample of grammatical structures, then, it may be argued, we have taken a sample which is relevant for all the situations in which control of grammar is necessary. By contrast, direct testing is inevitably limited to a rather small sample of tasks, which may call on a restricted and possibly unrepresentative range of grammatical structures. On this argument, indirect testing is superior to direct testing in that its results are more generalisable.

The main problem with indirect tests is that the relationship between performance on them and performance of the skills in which we are usually more interested tends to be rather weak in strength and uncertain in nature.
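The weakness of such a relationship is easy to underestimate. A correlation of, say, 0.5 between an indirect grammar test and composition scores sounds respectable, yet under the standard coefficient-of-determination interpretation it means the indirect test accounts for only a quarter of the variance in writing performance. The figures in the sketch below are invented for illustration; they are not data from TOEFL or any real test:

```python
# Illustrative sketch (invented correlation values, not real test data):
# the proportion of variance in a directly measured skill (e.g.
# composition scores) that an indirect test accounts for is the square
# of the correlation between the two sets of scores.

for r in (0.3, 0.5, 0.7, 0.9):
    variance_explained = r ** 2
    print(f"correlation r = {r:.1f} -> variance explained = {variance_explained:.0%}")
```

Even a correlation of 0.7 leaves more than half the variance in the direct measure unaccounted for, which is one way of putting the chapter's point that statistically related is not the same thing as interchangeable.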
We do not yet know enough about the component parts of, say, composition writing to predict accurately composition writing ability from scores on tests which measure the abilities which we believe underlie it. We may construct tests of grammar, vocabulary, discourse markers, handwriting, punctuation, and what you will. But we still will not be able to predict accurately scores on compositions (even if we make sure of the representativeness of the composition scores by taking many samples).

It seems to me that in our present state of knowledge, at least as far as proficiency and final achievement tests are concerned, it is preferable to concentrate on direct testing. Provided that we sample reasonably widely (for example require at least two compositions, each calling for a different kind of writing and on a different topic), we can expect more accurate estimates of the abilities that really concern us than would be obtained through indirect testing. The fact that direct tests are generally easier to construct simply reinforces this view with respect to institutional tests, as does their greater potential for beneficial backwash. It is only fair to say, however, that many testers are reluctant to commit themselves entirely to direct testing and will always include an indirect element in their tests. Of course, to obtain diagnostic information on underlying abilities, such as control of particular grammatical structures, indirect testing is called for.

Discrete point versus integrative testing

Discrete point testing refers to the testing of one element at a time, item by item. This might involve, for example, a series of items each testing a particular grammatical structure. Integrative testing, by contrast, requires the candidate to combine many language elements in the completion of a task. This might involve writing a composition, making notes while listening to a lecture, taking a dictation, or completing a cloze passage. Clearly this distinction is not unrelated to that between indirect and direct testing. Discrete point tests will almost always be indirect, while integrative tests will tend to be direct. However, some integrative testing methods, such as the cloze procedure, are indirect.

Norm-referenced versus criterion-referenced testing

Imagine that a reading test is administered to an individual student. When we ask how the student performed on the test, we may be given two kinds of answer. An answer of the first kind would be that the student obtained a score that placed her or him in the top ten per cent of candidates who have taken that test, or in the bottom five per cent; or that she or he did better than sixty per cent of those who took it. A test which is designed to give this kind of information is said to be norm-referenced. It relates one candidate's performance to that of other candidates. We are not told directly what the student is capable of doing in the language.

The other kind of answer we might be given is exemplified by the following, taken from the Interagency Language Roundtable (ILR) language skill level descriptions for reading:

    Sufficient comprehension to read simple, authentic written materials in a form equivalent to usual printing or typescript on subjects within a familiar context. Able to read with some misunderstandings straightforward, familiar, factual material, but in general insufficiently experienced with the language to draw inferences directly from the linguistic aspects of the text. Can locate and understand the main ideas and details in materials written for the general reader. The individual can read uncomplicated, but authentic prose on familiar subjects that are normally presented in a predictable sequence which aids the reader in understanding. Texts may include descriptions and narrations in contexts such as news items describing frequently-occurring events, simple biographical information, social notices, formulaic business letters, and simple technical
information written for the general reader. Generally the prose that can be read by the individual is predominantly in straightforward/high-frequency sentence patterns. The individual does not have a broad active vocabulary but is able to use contextual and real-world clues to understand the text.

Similarly, a candidate who is awarded the Berkshire Certificate of Proficiency in German at a particular level can 'speak and react to others using simple language in the following contexts':

- to greet, interact with and take leave of others;
- to exchange information on personal background, home, school life and interests;
- to discuss and make choices, decisions and plans;
- to express opinions, make requests and suggestions;
- to ask for information and understand instructions.

In these two cases we learn nothing about how the individual's performance compares with that of other candidates. Rather we learn something about what he or she can actually do in the language. Tests which are designed to provide this kind of information directly are said to be criterion-referenced.

The purpose of criterion-referenced tests is to classify people according to whether or not they are able to perform some task or set of tasks satisfactorily. The tasks are set, and the performances are evaluated. It does not matter in principle whether all the candidates are successful, or none of the candidates is successful. The tasks are set, and those who perform them satisfactorily 'pass'; those who don't, 'fail'. This means that students are encouraged to measure their progress in relation to meaningful criteria, without feeling that, because they are less able than most of their fellows, they are destined to fail. In the case of the Berkshire German Certificate, for example, it is hoped that all students who are entered for it will be successful. Criterion-referenced tests therefore have two positive virtues: they set standards meaningful in terms of what people can do, which do not
change with different groups of candidates; and they motivate students to attain those standards.

The need for direct interpretation of performance means that the construction of a criterion-referenced test may be quite different from that of a norm-referenced test designed to serve the same purpose. Let us imagine that the purpose is to assess the English language ability of students in relation to the demands made by English-medium universities. The criterion-referenced test would almost certainly have to be based on an analysis of what students had to be able to do with or through English at university. Tasks would then be set similar to those to be met at university. If this were not done, direct interpretation of performance would be impossible.

The norm-referenced test, on the other hand, while its content might be based on a similar analysis, is not so restricted. The Michigan Test of English Language Proficiency, for instance, has multiple choice grammar, vocabulary, and reading comprehension components. A candidate's score on the test does not tell us directly what his or her English ability is in relation to the demands that would be made on it at an English-medium university. To know this, we must consult a table which makes recommendations as to the academic load that a student with that score should be allowed to carry, this being based on experience over the years of students with similar scores, not on any meaning in the score itself. In the same way, university administrators have learned from experience how to interpret TOEFL scores and to set minimum scores for their own institutions.

Books on language testing have tended to give advice which is more appropriate to norm-referenced testing than to criterion-referenced testing. One reason for this may be that procedures for use with norm-referenced tests (particularly with respect to such matters as the analysis of items and the estimation of reliability) are well established, while those
for criterion-referenced tests are not The view taken in this _' (196) book, and argued for in Chapter 6, is that criterion-referenced tests are often to be preferred, not least for the beneficial backwash effect they are likely to have The lack of agreed procedures for such tests is not sufficient reason for them to be excluded from consideration Objective testing versus subjective testing The distinction here is between methods of scorini and nothing else If n _ judhement is required on the part of the scorer, then the scoring is objective A multiple choice test, with the correct responses unambiguously identified, would he a case in point If judgement is called for, the scoring is said to be silh)z tiv e There are different degrees of subjectivity in testing The impressionistic scoring of a composition may he con- sidered more subjective than the scoring of short answers in response to questions on a reading People differ somewhat in their use of the term 'criterion-referenced' This is um important provided that the sense intended is made clear The sense in which it is used problems here is the one which I feel will be most useful to the reader in analysing testing 18 passage Objectivity in scoring is sought after by many testers, not for itself, but for the greater reliability it brings In general, the less subjective the scoring, the greater agreement there will be between two different scorers (and between the scores of one person scoring the same test paper on different occasions However, there are ways of obtaining reliable subjective scoring, even of compositions These are discussed first in Chapter J Communicative language testing Much has been written in recent years about 'communicative language testing' Discussions have centred on the desirability of measuring the ability to take part in acts of communication (including reading and listening' and on the best way to this It is assumed in this book that it is usually communicative ability which we want 
to test As a result, what I believe to be toe most significant points made in discussions of communi19 (197) Kinds of test and testing I cative testing are to be found throughout A recapitulation under a separate heading would therefore be redundant READER ACTIVITIES I I I I Consider a number of language tests with which you are familiar For each of them, answer the following questions: What is the purpose of the test? Does it represent direct or indirect testing (or a mixture of both)? 3.Are the items discrete point or integrative (or a mixture of both)? JWhich items are objective, and which are subjective? Can you order the subjective items according to degree of subjectivity? 1s the test norm-referenced or criterion-referenced? ' Does the test measure communicative abilities? Would you describe it as a communicative test? Justify your answers ) What relationship is there between the answers to question and the answers to the other questions? Further reading I For a discussion of the two approaches towards achievement test content specification, see Pilliner (1968) Alderson (1987) reports on research into the possible contributions of the computer to language testing Direct testing calls for texts and tasks to be as authentic as possible: Vol 2, No (1985) of the journal Language Testing is devoted to articles on authenticity in language testing An account of the development of an indirect test of writing is given in Godshalk et al (1966) Classic short papers on criterion-referencing and norm-referencing (not restricted to language testing) are by Popham (1978), favouring criterion-referenced testing, and Ebel ( 1978), arguing for the superiority of norm-referenced testing The description of reading ability given in this chapter comes from the Interagency Language Roundtable Language Skill Level Descriptions Comparable descriptions at a number of levels for the four skills, intended for assessing students in academic contexts, have been devised by the American 
Council for the Teaching of Foreign Languages (ACTFL). These ACTFL Guidelines are available from ACTFL at 579 Broadway, Hastings-on-Hudson, NY 10706, USA. It should be said, however, that the form that these take and the way in which they were constructed have been the subject of some controversy. Doubts about the applicability of criterion-referencing to language testing are expressed by Skehan (1984); for a different view, see Hughes (1986). Carroll (1961) made the distinction between discrete point and integrative language testing. Oller (1979) discusses integrative testing techniques. Morrow (1979) is a seminal paper on communicative language testing. Further discussion of the topic is to be found in Canale and Swain (1980), Alderson and Hughes (1981, Part 1), Hughes and Porter (1983), and Davies (1988). Weir's (1988a) book has as its title Communicative Language Testing.

Validity

We already know from Chapter 2 that a test is said to be valid if it measures accurately what it is intended to measure. This seems simple enough. When closely examined, however, the concept of validity reveals a number of aspects, each of which deserves our attention. This chapter will present each aspect in turn, and attempt to show its relevance for the solution of language testing problems.

Content validity

A test is said to have content validity if its content constitutes a representative sample of the language skills, structures, etc. with which it is meant to be concerned. It is obvious that a grammar test, for instance, must be made up of items testing knowledge or control of grammar. But this in itself does not ensure content validity. The test would have content validity only if it included a proper sample of the relevant structures. Just what are the relevant structures will depend, of course, upon the purpose of the test. We would not expect an achievement test for intermediate learners to contain just the same set of structures as one for advanced learners. In order to judge whether or not a test has content validity, we need a specification of the skills or structures etc. that it is meant to cover. Such a specification should be made at a very early stage in test construction. It isn't to be expected that everything in the specification will always appear in the test; there may simply be too many things for all of them to appear in a single test. But it will provide the test constructor with the basis for making a principled selection of elements for inclusion in the test. A comparison of test specification and test content is the basis for judgements as to content validity. Ideally these judgements should be made by people who are familiar with language teaching and testing but who are not directly concerned with the production of the test in question.

What is the importance of content validity? First, the greater a test's content validity, the more likely it is to be an accurate measure of what it is supposed to measure. A test in which major areas identified in the specification are under-represented, or not represented at all, is unlikely to be accurate. Secondly, such a test is likely to have a harmful backwash effect. Areas which are not tested are likely to become areas ignored in teaching and learning. Too often the content of tests is determined by what is easy to test rather than what is important to test. The best safeguard against this is to write full test specifications and to ensure that the test content is a fair reflection of these. Advice on the writing of specifications and on the judgement of content validity is to be found in Chapter 7.

Criterion-related validity

Another approach to test validity is to see how far results on the test agree with those provided by some independent and highly dependable assessment of the candidate's ability. This independent assessment is thus the criterion measure against which the test is validated. There are essentially two kinds of criterion-related validity: concurrent validity and predictive validity.

Concurrent validity is established when the test and the criterion are administered at about the same time. To exemplify this kind of validation in achievement testing, let us consider a situation where course objectives call for an oral component as part of the final achievement test. The objectives may list a large number of 'functions' which students are expected to perform orally, to test all of which might take 45 minutes for each student. This could well be impractical. Perhaps it is felt that only ten minutes can be devoted to each student for the oral component. The question then arises: can such a ten-minute session give a sufficiently accurate estimate of the student's ability with respect to the functions specified in the course objectives? Is it, in other words, a valid measure?

From the point of view of content validity, this will depend on how many of the functions are tested in the component, and how representative they are of the complete set of functions included in the objectives. Every effort should be made when designing the oral component to give it content validity. Once this has been done, however, we can go further. We can attempt to establish the concurrent validity of the component. To do this, we should choose at random a sample of all the students taking the test. These students would then be subjected to the full 45-minute oral component necessary for coverage of all the functions, using perhaps four scorers to ensure reliable scoring (see next chapter). This would be the criterion test against which the shorter test would be judged. The students' scores on the full test would be compared with the ones they obtained on the ten-minute session, which would have been conducted and scored in the usual way, without knowledge of their performance on the longer version. If the comparison between the two sets of scores reveals a high level of agreement, then the shorter version of the oral component may be considered valid, inasmuch as it gives results similar to those obtained with the longer version. If, on the other hand, the two sets of scores show little agreement, the shorter version cannot be considered valid; it cannot be used as a dependable measure of achievement with respect to the functions specified in the objectives. Of course, if ten minutes really is all that can be spared for each student, then the oral component may be included for the contribution that it makes to the assessment of students' overall achievement and for its backwash effect. But it cannot be regarded as an accurate measure in itself.

References to 'a high level of agreement' and 'little agreement' raise the question of how the level of agreement is measured. There are in fact standard procedures for comparing sets of scores in this way, which
generate what is called a 'validity coefficient', a mathematical measure of similarity. Perfect agreement between two sets of scores will result in a validity coefficient of 1. Total lack of agreement will give a coefficient of zero. To get a feel for the meaning of a coefficient between these two extremes, read the contents of the box below.

To get a feel for what a coefficient means in terms of the level of agreement between two sets of scores, it is best to square that coefficient. Let us imagine that a coefficient of 0.7 is calculated between the two oral tests referred to in the main text. Squared, this becomes 0.49. If this is regarded as a proportion of one, and converted to a percentage, we get 49 per cent. On the basis of this, we can say that the scores on the short test predict 49 per cent of the variation in scores on the longer test. In broad terms, there is almost 50 per cent agreement between one set of scores and the other. A coefficient of 0.5 would signify 25 per cent agreement; a coefficient of 0.8 would indicate 64 per cent agreement. It is important to note that a 'level of agreement' of, say, 50 per cent does not mean that 50 per cent of the students would each have equivalent scores on the two versions. We are dealing with an overall measure of agreement that does not refer to the individual scores of students. This explanation of how to interpret validity coefficients is very brief and necessarily rather crude. For a better understanding, the reader is referred to Appendix 1.

Whether or not a particular level of agreement is regarded as satisfactory will depend upon the purpose of the test and the importance of the decisions that are made on the basis of it. If, for example, a test of oral ability was to be used as part of the selection procedure for a high-level diplomatic post, then a coefficient of 0.7 might well be regarded as too low for a shorter test to be substituted for a full and thorough test of oral ability. The saving in time would not be worth the risk of appointing someone with insufficient ability in the relevant foreign language. On the other hand, a coefficient of the same size might be perfectly acceptable for a brief interview forming part of a placement test.

It should be said that the criterion for concurrent validation is not necessarily a proven, longer test. A test may be validated against, for example, teachers' assessments of their students, provided that the assessments themselves can be relied on. This would be appropriate where a test was developed which claimed to be measuring something different from all existing tests, as was said of at least one quite recently developed 'communicative' test.

The second kind of criterion-related validity is predictive validity. This concerns the degree to which a test can predict candidates' future performance. An example would be how well a proficiency test could predict a student's ability to cope with a graduate course at a British university. The criterion measure here might be an assessment of the student's English as perceived by his or her supervisor at the university, or it could be the outcome of the course (pass/fail etc.). The choice of criterion measure raises interesting issues. Should we rely on the subjective and untrained judgements of supervisors? How helpful is it to use final outcome as the criterion measure when so many factors other than ability in English (such as subject knowledge, intelligence, motivation, health and happiness) will have contributed to every outcome?
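The squaring rule set out in the box above is easy to check with a short computation. The sketch below is mine (the function name is not from the text); it simply squares a coefficient and expresses the result as a percentage:

```python
def percent_agreement(coefficient):
    """Interpret a validity coefficient: its square, expressed as a
    percentage, is the proportion of variation in one set of scores
    that the other set of scores predicts."""
    return round(coefficient ** 2 * 100, 1)

print(percent_agreement(0.7))  # 49.0 -- "almost 50 per cent agreement"
print(percent_agreement(0.5))  # 25.0
print(percent_agreement(0.8))  # 64.0
```

Note that the relationship is not linear: a coefficient of 0.5 corresponds to only a quarter of the variation, not half.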
Where outcome is used as the criterion measure, a validity coefficient of around 0.4 (only 20 per cent agreement) is about as high as one can expect. This is partly because of the other factors, and partly because those students whose English the test predicted would be inadequate are not normally permitted to take the course, and so the test's (possible) accuracy in predicting problems for those students goes unrecognised. As a result, a validity coefficient of this order is generally regarded as satisfactory. The further reading section at the end of the chapter gives references to the recent reports on the validation of the British Council's ELTS test, in which these issues are discussed at length.

Another example of predictive validity would be where an attempt was made to validate a placement test. Placement tests attempt to predict the most appropriate class for any particular student. Validation would involve an enquiry, once courses were under way, into the proportion of students who were thought to be misplaced. It would then be a matter of comparing the number of misplacements (and their effect on teaching and learning) with the cost of developing and administering a test which would place students more accurately.

Construct validity

A test, part of a test, or a testing technique is said to have construct validity if it can be demonstrated that it measures just the ability which it is supposed to measure. The word 'construct' refers to any underlying ability (or trait) which is hypothesised in a theory of language ability. One might hypothesise, for example, that the ability to read involves a number of sub-abilities, such as the ability to guess the meaning of unknown words from the context in which they are met. It would be a matter of empirical research to establish whether or not such a distinct ability existed and could be measured. If we attempted to measure that ability in a particular test, then that part of the test would have construct validity only if we were able to demonstrate that we were indeed measuring just that ability.

Gross, commonsense constructs like 'reading ability' and 'writing ability' are, in my view, unproblematical. Similarly, the direct measurement of writing ability, for instance, should not cause us too much concern: even without research we can be fairly confident that we are measuring a distinct and meaningful ability. Once we try to measure such an ability indirectly, however, we can no longer take for granted what we are doing. We need to look to a theory of writing ability for guidance as to the form an indirect test should take, its content and techniques.

Let us imagine that we are indeed planning to construct an indirect test of writing ability which must for reasons of practicality be multiple choice. Our theory of writing tells us that underlying writing ability are a number of sub-abilities, such as control of punctuation, sensitivity to demands on style, and so on. We construct items that are meant to measure these sub-abilities and administer them as a pilot test. How do we know that this test really is measuring writing ability? One step we would almost certainly take is to obtain extensive samples of the writing ability of the group to whom the test is first administered, and have these reliably scored. We would then compare scores on the pilot test with the scores given for the samples of writing. If there is a high level of agreement (and a coefficient of the kind described in the previous section can be calculated), then we have evidence that we are measuring writing ability with the test. So far, however, though we may have developed a satisfactory indirect test of writing, we have not demonstrated the reality of the underlying constructs (control of punctuation etc.).
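A coefficient of the kind mentioned above is, in the standard procedures, a Pearson correlation between the two sets of scores. The following is a minimal, self-contained sketch; the student scores are invented for illustration and are not data from the text:

```python
from math import sqrt

def validity_coefficient(xs, ys):
    """Pearson correlation between scores on one measure (e.g. a pilot
    multiple choice writing test) and scores on a criterion (e.g.
    reliably scored samples of writing)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return cov / sqrt(sum((x - mx) ** 2 for x in xs) *
                      sum((y - my) ** 2 for y in ys))

# Invented scores for five students:
pilot_test = [55, 62, 48, 70, 66]
writing_samples = [58, 65, 50, 74, 61]
print(round(validity_coefficient(pilot_test, writing_samples), 2))  # 0.91
```

A coefficient this high, squared as in the earlier box, would indicate that the pilot test predicts roughly 80 per cent of the variation in the scored writing samples.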
To demonstrate the reality of these constructs, we might administer a series of specially constructed tests, measuring each of the constructs by a number of different methods. In addition, compositions written by the people who took the tests could be scored separately for performance in relation to the hypothesised constructs (control of punctuation, for example). In this way, for each person, we would obtain a set of scores for each of the constructs. Coefficients could then be calculated between the various measures. If the coefficients between scores on the same construct are consistently higher than those between scores on different constructs, then we have evidence that we are indeed measuring separate and identifiable constructs.

Construct validation is a research activity, the means by which theories are put to the test and are confirmed, modified, or abandoned. It is through construct validation that language testing can be put on a sounder, more scientific footing. But it will not all happen overnight; there is a long way to go. In the meantime, the practical language tester should try to keep abreast of what is known. When in doubt, where it is possible, direct testing of abilities is recommended.

Face validity

A test is said to have face validity if it looks as if it measures what it is supposed to measure. For example, a test which pretended to measure pronunciation ability but which did not require the candidate to speak (and there have been some) might be thought to lack face validity. This would be true even if the test's construct and criterion-related validity could be demonstrated. Face validity is hardly a scientific concept, yet it is very important. A test which does not have face validity may not be accepted by candidates, teachers, education authorities or employers. It may simply not be used; and if it is used, the candidates' reaction to it may mean that they do not perform on it in a way that truly reflects their ability. Novel techniques, particularly those which provide indirect measures, have to be introduced slowly, with care, and with convincing explanations.

The use of validity

What use is the reader to make of the notion of validity? First, every effort should be made in constructing tests to ensure content validity. Where possible, the tests should be validated empirically against some criterion. Particularly where it is intended to use indirect testing, reference should be made to the research literature to confirm that measurement of the relevant underlying constructs has been demonstrated using the testing techniques that are to be used (this may often result in disappointment; another reason for favouring direct testing!). Any published test should supply details of its validation, without which its validity (and suitability) can hardly be judged by a potential purchaser. Tests for which validity information is not available should be treated with caution.

READER ACTIVITIES

Consider any tests with which you are familiar. Assess each of them in terms of the various kinds of validity that have been presented in this chapter. What empirical evidence is there that the test is valid? If evidence is lacking, how would you set about gathering it?

Further reading

For general discussion of test validity and ways of measuring it, see Anastasi (1976). For an interesting recent example of test validation (of the British Council ELTS test) in which a number of important issues are raised, see Criper and Davies (1988) and Hughes, Porter and Weir (1988). For the argument (with which I do not agree) that there is no criterion against which 'communicative' language tests can be validated (in the sense of criterion-related validity), see Morrow (1986). Bachman and Palmer (1981) is a good example of construct validation. For a collection of papers on language testing research, see Oller (1983).

Reliability

Imagine that a hundred students take a 100-item test at three o'clock one Thursday afternoon. The test is not impossibly difficult or ridiculously easy for these students, so they do not all get zero or a perfect score of 100. Now what if in fact they had not taken the test on the Thursday but had taken it at three o'clock the previous afternoon? Would we expect each student to have got exactly the same score on the Wednesday as they actually did on the Thursday?

The answer to this question must be no. Even if we assume that the test is excellent, that the conditions of administration are almost identical, that the scoring calls for no judgement on the part of the scorers and is carried out with perfect care, and that no learning or forgetting has taken place during the one-day interval, nevertheless we would not expect every individual to get precisely the same score on the Wednesday as they got on the Thursday. Human beings are not like that; they simply do not behave in exactly the same way on every occasion, even when the circumstances seem identical.

But if this is the case, it would seem to imply that we can never have complete trust in any set of test scores. We know that the scores would have been different if the test had been administered on the previous or the following day. This is inevitable, and we must accept it. What we have to do is construct, administer and score tests in such a way that the scores actually obtained on a test on a particular occasion are likely to be very similar to those which would have been obtained if it had been administered to the same students with the same ability, but at a different time. The more similar the scores would have been, the more reliable the test is said to be.

Look at the hypothetical data in Table 1(a). They represent the scores obtained by ten students who took a 100-item test (A) on a particular occasion, and those that they would have obtained if they had taken it a day later. Compare the two sets of scores. (Do not worry for the moment about the fact that we would never be able to obtain this information. Ways of estimating what scores people would have got on another occasion are discussed later. The most obvious of these is simply to have people take the same test twice.)
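Before any coefficient is calculated, the crudest comparison between two administrations of a test is simply the average size of each student's change in score. The sketch below uses invented scores of my own (not the data in the tables that follow) to make that comparison:

```python
def mean_absolute_difference(first, second):
    """Average gap between each student's two scores on a test taken
    twice; smaller gaps suggest a more reliable test."""
    return sum(abs(a - b) for a, b in zip(first, second)) / len(first)

# Hypothetical scores for five students on two administrations of an
# unreliable test (big swings) and of a more reliable one (small swings):
unreliable = mean_absolute_difference([68, 46, 19, 89, 82],
                                      [63, 59, 43, 76, 62])
reliable = mean_absolute_difference([65, 48, 52, 21, 85],
                                    [69, 52, 56, 19, 90])
print(unreliable, reliable)  # 15.0 3.8
```

This is exactly the informal inspection the tables below invite: note the size of the gaps, test by test.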
Note the size of the difference between the two scores for each student.

TABLE 1(a) SCORES ON TEST A (INVENTED DATA)

Student    Score obtained    Score which would have been obtained on the following day
Bill            68                63
Mary            46                59
Ann             19                43
Harry           89                35
Cyril           82                23
Pauline         28                27
Don             34                76
Colin           67                62
Irene           43                62
Sue             56                49

Now look at Table 1(b), which displays the same kind of information for a second 100-item test (B). Again note the difference in scores for each student.

TABLE 1(b) SCORES ON TEST B (INVENTED DATA)

Student    Score obtained    Score which would have been obtained on the following day
Bill            65                69
Mary            48                52
Ann             52                56
Harry           21                23
Cyril           85                90
Pauline         44                39
Don             59                57
Colin           38                35
Irene           19                16
Sue             67                62

Which test seems the more reliable? The differences between the two sets of scores are much smaller for Test B than for Test A. On the evidence that we have here (and in practice we would not wish to make claims about reliability on the basis of such a small number of individuals), Test B appears to be more reliable than Test A.

Look now at Table 1(c), which represents scores of the same students on an interview using a five-point scale.

TABLE 1(c) SCORES ON INTERVIEW (INVENTED DATA)

[Table of scores on a five-point scale for the same ten students, giving each student's score obtained and the score which would have been obtained on the following day; the individual figures are not recoverable in this copy.]

In one sense the two sets of interview scores are very similar. The largest difference between a student's actual score and the one which would have been obtained on the following day is small. But the largest possible difference is only 4!
Really the two sets of scores are very different This becomes apparent once we compare the size of the differences between students with the size of differences between scores for individual students They are of about the same order of magnitude The result of this can be seen if we place the students in order according to their interview score, the highest first The order based on their actual scores is markedly different from the one based on the scores they would have obtained if they had had the interview on the following day This interview turns out in fact not to be very reliable at all The reliability coefficient It is possible to quantify the reliability of a test in the form of a reliability coefficient Reliability coefficients are like validity coefficients (Chapter 4) They allow us to compare the reliability of different tests The ideal reliability coef ent�-is I - a test wit(i a relic-6ility coefficient oFT i one which would give preasely the same results for a _particular_set of candidates reganiless of when it happened to be administered A test which had a relial5ilify coefficient of zero-(and let us hope that no such 31 (209) Reliability test exists!) 
would give sets of results quite unconnected with each other, in the sense that the score that someone actually got on a Wednesday would be no help at all in attempting to predict the score he or she would get if they took the test the day after It is between the two extremes of and zero that genuine test reliability coefficients are to be found Certain authors have suggested how high a reliability coefficient we should expect for different types of language tests Lado (1961), for example, says that good vocabulary, structure and reading tests are usually in the 90 to 99 range, while auditory comprehension tests are more often in the 80 to 89 range Oral production tests may be in the 70 to 79 range He adds that a reliability coefficient of 85 might be considered high for an oral production test but low for a reading test These suggestions reflect what Lado sees as the difficulty in achieving reliability in the testing of the different abilities In fact the reliability coefficient that is to be sought will depend also on other considerations, most particularly the importance of the decisions that are to be taken on the basis of the test The more important the decisions, the greater reliability we must demand: if we are to refuse someone the opportunity to study overseas because of their score on a language test, then we have to be pretty sure that their score would not have been much different if they had taken the test a day or two earlier or later The next section will explain how the reliability coefficient can be used to arrive at another i f gure (the standard error of measurement) to estimate likely differences of this kind Before this is done, however, something has to he said about the way in which reliability coefficients are arrived at The first requirement is to have two sets of scores for comparison The most obvious way of obtaining these is to get a group of subjects to take the same test twice This is known as the test-retest method The drawbacks are 
not difficult to see If the second administration of the test is too soon after the first, then subjects are likely to recall items and their responses to them, making the same responses more likely and the reliability spuriously high If there is too long a gap between adininistrations, then learning (or forgetting!) will have taken place, and the coefficient will he lower than it should be However long the gap, the subjects are unlikely to be very motivated to take the same test twice, and this too is likely to have a depressing effect on the coefficient These effects are reduced somewhat by the use of two different forms of the same test (the alternate forms method) However, alternate forms are often simply not available It turns out, surprisingly, that the most common methods of obtaining the necessary two sets of scores involve only one administration of one test Such methods provide us with a coefficient of `internal consistency' The most basic of these is the split half method In this the subjects take the test in the usual way, but each subject is given two scores One score is (210) Reliability Because of the reduced length, which will cause the coefficient to be less than it would be for the whole test, a statistical adjustment has to be made (see Appendix for details) for one half of the test, the second score is for the other half The two sets got scores are then used to obtain the reliability coefficient as if the whole test had been taken twice In order for this method to work, it is necessary for the test to be split into two halves which are really equivalent through the careful matching of items (in fact where items in the test have been ordered in terms of difficulty, a split into odd-numbered items and even-numbered items may be adequate) It can be seen that this method is rather like the alternate forms method, except that the two `forms' are only half the length It has been demonstrated empirically that this altogether more economical method will 
indeed give good estimates of alternate forms coefficients, provided that the alternate forme are closely equivalent to each other Details of other methods of estimating reliability and of carrying out the necessary statistical calculations are to be found in Appendix I The standard error of measurement and the true score While the reliability coefficient allows us to compare the reliability of tests, it does not tell us directly how close an individual's actual score is to what he or she might have scored on another occasion With a little further calculation, however, it is possible to estimate how close a person's actual score is to what is called their `true score' Imagine that it were possible for someone to take the same language test over and over again, an indefinitely large number of times, without their performance being affected by having already taken the test, and without their ability in the language changing Unless the test is perfectly reliable, and provided that it is not so easy or difficult that the student always gets full marks or zero, we would expect their scores on the various administrations to vary If we had all of these scores we would be able to calculate their average score, and it would seem not unreasonable to think of this average as the one that best represents the student's ability with respect to this particular test It is this score, which for obvious reasons we can never know for certain, which is referred to as the candidate's true score We are able to make statements about the probability that a candidate's true score (the one which best represents their ability on the test) is within a certain number of points of the score they actually obtained on the test In order to this, we first have to calculate the standard error of measurement of the particular test The calculation (described in 32 (211) Reliability Appendix 1) is very straightforward, and is based on the reliability coefficient and a measure of the spread of all the 
scores on the test (for a given spread of scores, the greater the reliability coefficient, the smaller will be the standard error of measurement). How such statements can be made using the standard error of measurement of the test is best illustrated by an example. Suppose that a test has a standard error of measurement of 5. An individual scores 56 on that test. We are then in a position to make the following statements: We can be about 68 per cent certain that the person's true score lies in the range of 51 to 61 (i.e. within one standard error of measurement of the score actually obtained on this occasion). We can be about 95 per cent certain that their true score is in the range 46 to 66 (i.e. within two standard errors of measurement of the score actually obtained). We can be 99.7 per cent certain that their true score is in the range 41 to 71 (i.e. within three standard errors of measurement of the score actually obtained). These statements are based on what is known about the pattern of scores that would occur if it were in fact possible for someone to take the test repeatedly in the way described above. About 68 per cent of their scores would be within one standard error of measurement, and so on. If in fact they only take the test once, we cannot be sure how their score on that occasion relates to their true score, but we are still able to make probabilistic statements as above.(3)

(3) These statistical statements are based on what is known about the way a person's scores would tend to be distributed if they took the same test an indefinitely large number of times (without the experience of any test-taking occasion affecting performance on any other occasion). The scores would follow what is called a normal distribution (see Woods, Fletcher, and Hughes, 1986, for discussion beyond the scope of the present book). It is the known structure of the normal distribution which allows us to say what percentage of scores will fall within a certain range (for example, about 68 per cent of scores will fall within one standard error of measurement of the true score). Since about 68 per cent of actual scores will be within one standard error of measurement of the true score, we can be about 68 per cent certain that any particular actual score will be within one standard error of measurement of the true score. It should be clear that there is no such thing as a 'good' or a 'bad' standard error of measurement. It is the particular use made of particular scores in relation to a particular standard error of measurement which may be considered acceptable or unacceptable.

In the end, the statistical rationale is not important. What is important is to recognise how we can use the standard error of measurement to inform decisions that we take on the basis of test scores. We should, for example, be very wary of taking important negative decisions about people's future if the standard error of measurement indicates that their true score is quite likely to be equal to or above the score that would lead to a positive decision, even though their actual score is below it. For this reason, all published tests should provide users with not only the reliability coefficient but also the standard error of measurement. We have seen the importance of reliability. If a test is not reliable then we know that the actual scores of many individuals are likely to be quite different from their true scores. This means that we can place little reliance on those scores. Even where reliability is quite high, the standard error of measurement serves to remind us that in the case of some individuals there is quite possibly a large discrepancy between actual score and true score. This should make us very cautious about making important decisions on the basis of the test scores of candidates whose actual scores place them close to the cut-off point (the point that divides 'passes' from 'fails'). We should at least consider the possibility of gathering further relevant
information on the language ability of such candidates. Having seen the importance of reliability, we shall consider, later in the chapter, how to make our tests more reliable. Before that, however, we shall look at another aspect of reliability.

Scorer reliability

In the first example given in this chapter we spoke about scores on a multiple choice test. It was most unlikely, we thought, that every candidate would get precisely the same score on both of two possible administrations of the test. We assumed, however, that scoring of the test would be 'perfect'. That is, if a particular candidate did perform in exactly the same way on the two occasions, they would be given the same score on both occasions. That is, any one scorer would give the same score on the two occasions, and this would be the same score as would be given by any other scorer on either occasion. It is possible to quantify the level of agreement given by different scorers on different occasions by means of a scorer reliability coefficient, which can be interpreted in a similar way to the test reliability coefficient. In the case of the multiple choice test just described the scorer reliability coefficient would be 1. As we noted in Chapter 3, when scoring requires no judgement, and could in principle or in practice be carried out by a computer, the test is said to be objective. Only carelessness should cause the reliability coefficients of objective tests to fall below 1. However, we did not assume perfectly consistent scoring in the case of the interview scores discussed earlier in the chapter. It would probably have seemed to the reader an unreasonable assumption. We can accept that scorers should be able to be consistent when there is only one easily recognised correct response. But when a degree of judgement is called for on the part of the scorer, as in the scoring of performance in an interview, perfect consistency is not to be expected. Such subjective tests will not have
reliability coefficients of 1! Indeed there was a time when many people thought that scorer reliability coefficients (and also the reliability of the test) would always be too low to justify the use of subjective measures of language ability in serious language testing. This view is less widely held today. While the perfect reliability of objective tests is not obtainable in subjective tests, there are ways of making it sufficiently high for test results to be valuable. It is possible, for instance, to obtain scorer reliability coefficients of over 0.9 for the scoring of compositions. It is perhaps worth making explicit something about the relationship between scorer reliability and test reliability. If the scoring of a test is not reliable, then the test results cannot be reliable either. Indeed the test reliability coefficient will almost certainly be lower than scorer reliability, since other sources of unreliability will be additional to what enters through imperfect scoring. In a case I know of, the scorer reliability coefficient on a composition writing test was 0.92, while the reliability coefficient for the test was 0.84. Variability in the performance of individual candidates accounted for the difference between the two coefficients.

How to make tests more reliable

As we have seen, there are two components of test reliability: the performance of candidates from occasion to occasion, and the reliability of the scoring. We will begin by suggesting ways of achieving consistent performances from candidates and then turn our attention to scorer reliability.

Take enough samples of behaviour

Other things being equal, the more items that you have on a test, the more reliable that test will be. This seems intuitively right: if we wanted to know how good an archer someone was, we wouldn't rely on the evidence of a single shot at the target. That one shot could be quite unrepresentative of their ability. To be satisfied that we had a really reliable measure of the ability we would
want to see a large number of shots at the target. The same is true for language testing. It has been demonstrated empirically that the addition of further items will make a test more reliable. There is even a formula (the Spearman-Brown formula, see the Appendix) that allows one to estimate how many extra items similar to the ones already in the test will be needed to increase the reliability coefficient to a required level. One thing to bear in mind, however, is that the additional items should be independent of each other and of existing items. Imagine a reading test that asks the question: 'Where did the thief hide the jewels?' If an additional item following that took the form 'What was unusual about the hiding place?', it would not make a full contribution to an increase in the reliability of the test. Why not? Because it is hardly possible for someone who got the original question wrong to get the supplementary question right. Such candidates are effectively prevented from answering the additional question; for them, in reality, there is no additional question. We do not get an additional sample of their behaviour, so the reliability of our estimate of their ability is not increased. Each additional item should as far as possible represent a fresh start for the candidate. By doing this we are able to gain additional information on all of the candidates, information which will make test results more reliable. The use of the word 'item' should not be taken to mean only brief questions and answers. In a test of writing, for example, where candidates have to produce a number of passages, each of those passages is to be regarded as an item. The more independent passages there are, the more reliable will be the test. In the same way, in an interview used to test oral ability, the candidate should be given as many 'fresh starts' as possible. More detailed implications of the need to obtain sufficiently large samples of behaviour will be outlined later in the book, in
chapters devoted to the testing of particular abilities. While it is important to make a test long enough to achieve satisfactory reliability, it should not be made so long that the candidates become so bored or tired that the behaviour that they exhibit becomes unrepresentative of their ability. At the same time, it may often be necessary to resist pressure to make a test shorter than is appropriate. The usual argument for shortening a test is that it is not practical. The answer to this is that accurate information does not come cheaply: if such information is needed, then the price has to be paid. In general, the more important the decisions based on a test, the longer the test should be. Jephthah used the pronunciation of the word 'shibboleth' as a test to distinguish his own men from Ephraimites, who could not pronounce sh. Those who failed the test were executed. Any of Jephthah's own men killed in error might have wished for a longer, more reliable test.

Do not allow candidates too much freedom

In some kinds of language test there is a tendency to offer candidates a choice of questions and then to allow them a great deal of freedom in the way that they answer the ones that they have chosen. An example would be a test of writing where the candidates are simply given a selection of titles from which to choose. Such a procedure is likely to have a depressing effect on the reliability of the test. The more freedom that is given, the greater is likely to be the difference between the performance actually elicited and the performance that would have been elicited had the test been taken, say, a day later. In general, therefore, candidates should not be given a choice, and the range over which possible answers might vary should be restricted. Compare the following writing tasks:

a) Write a composition on tourism.
b) Write a composition on tourism in this country.
c) Write a composition on how we might develop the tourist industry in this country.
d) Discuss
the following measures intended to increase the number of foreign tourists coming to this country:
   i) More/better advertising and/or information (where? what form should it take?)
   ii) Improved facilities (hotels, transportation, communication etc.)
   iii) Training of personnel (guides, hotel managers etc.)

The successive tasks impose more and more control over what is written. The fourth task is likely to be a much more reliable indicator of writing ability than the first. The general principle of restricting the freedom of candidates will be taken up again in chapters relating to particular skills. It should perhaps be said here, however, that in restricting the students we must be careful not to distort too much the task that we really want to see them perform. The potential tension between reliability and validity is taken up at the end of the chapter.

Write unambiguous items

It is essential that candidates should not be presented with items whose meaning is not clear, or to which there is an acceptable answer which the test writer has not anticipated. In a reading test I once set the following open-ended question, based on a lengthy reading passage about English accents and dialects: Where does the author direct the reader who is interested in non-standard dialects of English?
The expected answer was the Further reading section of the book, which is where the reader was directed to. A number of candidates answered 'page 3', which was the place in the text where the author actually said that the interested reader should look in the Further reading section. Only the alertness of those scoring the test revealed that there was a completely unanticipated correct answer to the question. If that had not happened, then a correct answer would have been scored as incorrect. The fact that an individual candidate might interpret the question in different ways on different occasions means that the item is not contributing fully to the reliability of the test. The best way to arrive at unambiguous items is, having drafted them, to subject them to the critical scrutiny of colleagues, who should try as hard as they can to find alternative interpretations to the ones intended. If this task is entered into in the right spirit, one of good-natured perversity, most of the problems can be identified before the test is administered. Pretesting of the items on a group of people comparable to those for whom the test is intended (see Chapter 7) should reveal the remainder. Where pretesting is not practicable, scorers must be on the lookout for patterns of response that indicate that there are problem items.

Provide clear and explicit instructions

This applies both to written and oral instructions. If it is possible for candidates to misinterpret what they are asked to do, then on some occasions some of them certainly will. It is by no means always the weakest candidates who are misled by ambiguous instructions; indeed it is often the better candidate who is able to provide the alternative interpretation. A common fault of tests written for the students of a particular teaching institution is the supposition that the students all know what is intended by carelessly worded instructions. The frequency of the complaint that students are unintelligent,
have been stupid, or have wilfully misunderstood what they were asked to do, reveals that the supposition is often unwarranted. Test writers should not rely on the students' powers of telepathy to elicit the desired behaviour. Again, the use of colleagues to criticise drafts of instructions (including those which will be spoken) is the best means of avoiding problems. Spoken instructions should always be read from a prepared text in order to avoid introducing confusion.

Ensure that tests are well laid out and perfectly legible

Too often, institutional tests are badly typed (or handwritten), have too much text in too small a space, and are poorly reproduced. As a result, students are faced with additional tasks which are not ones meant to measure their language ability. Their variable performance on the unwanted tasks will lower the reliability of a test.

Candidates should be familiar with format and testing techniques

If any aspect of a test is unfamiliar to candidates, they are likely to perform less well than they would otherwise (on subsequently taking a parallel version, for example). For this reason, every effort must be made to ensure that all candidates have the opportunity to learn just what will be required of them. This may mean the distribution of sample tests (or of past test papers), or at least the provision of practice materials in the case of tests set within teaching institutions.

Provide uniform and non-distracting conditions of administration

The greater the differences between one administration of a test and another, the greater the differences one can expect between a candidate's performance on the two occasions. Great care should be taken to ensure uniformity. For example, timing should be specified and strictly adhered to; the acoustic conditions should be similar for all administrations of a listening test. Every precaution should be taken to maintain a quiet setting with no distracting sounds or movements, which, as we saw above, is essential to test reliability. We turn now to
ways of obtaining scorer reliability.

Use items that permit scoring which is as objective as possible

This may appear to be a recommendation to use multiple choice items, which permit completely objective scoring. This is not intended. While it would be mistaken to say that multiple choice items are never appropriate, it is certainly true that there are many circumstances in which they are quite inappropriate. What is more, good multiple choice items are notoriously difficult to write and always require extensive pretesting. A substantial part of a later chapter is given over to a discussion of the construction and use of multiple choice items. An alternative to multiple choice is the open-ended item which has a unique, possibly one-word, correct response which the candidates produce themselves. This too should ensure objective scoring, but in fact problems with such matters as spelling which makes a candidate's meaning unclear (say, in a listening test) often make demands on the scorer's judgement. The longer the required response, the greater the difficulties of this kind. One way of dealing with this is to structure the candidate's response by providing part of it. For example, the open-ended question 'What was different about the results?' may be designed to elicit the response 'Success was more closely associated with high motivation'. This is likely to cause problems for scoring. Greater scorer reliability will probably be achieved if the question is followed by: '________ was more closely associated with ________'. Items of this kind are discussed in later chapters.

Make comparisons between candidates as direct as possible

This reinforces the suggestion already made that candidates should not be given a choice of items and that they should be limited in the way that they are allowed to respond. Scoring the compositions all on one topic will be more reliable than if the candidates are allowed to choose from six topics, as has been the case in some
well-known tests. The scoring should be all the more reliable if the compositions are guided, as in the example above in the section 'Do not allow candidates too much freedom'.

Provide a detailed scoring key

This should specify acceptable answers and assign points for partially correct responses. For high scorer reliability the key should be as detailed as possible in its assignment of points. It should be the outcome of efforts to anticipate all possible responses and should have been subjected to group criticism. (This advice applies only where responses can be classed as partially or totally 'correct', not in the case of compositions, for instance.)

Train scorers

This is especially important where scoring is most subjective. The scoring of compositions, for example, should not be assigned to anyone who has not learned to score accurately compositions from past administrations. After each administration, patterns of scoring should be analysed. Individuals whose scoring deviates markedly and inconsistently from the norm should not be used again.

Agree acceptable responses and appropriate scores at outset of scoring

A sample of scripts should be taken immediately after the administration of the test. Where there are compositions, archetypical representatives of different levels of ability should be selected. Only when all scorers are agreed on the scores to be given to these should real scoring begin. Much more will be said in a later chapter about the scoring of compositions. For short answer questions, the scorers should note any difficulties they have in assigning points (the key is unlikely to have anticipated every relevant response), and bring these to the attention of whoever is supervising that part of the scoring. Once a decision has been taken as to the points to be assigned, the supervisor should convey it to all the scorers concerned.

Identify candidates by number, not name

Scorers inevitably have expectations of candidates that they know. Except in purely
objective testing, this will affect the way that they score. Studies have shown that even where the candidates are unknown to the scorers, the name on a script (or a photograph) will make a significant difference to the scores given. For example, a scorer may be influenced by the gender or nationality of a name into making predictions which can affect the score given. The identification of candidates only by number will reduce such effects.

Employ multiple, independent scoring

As a general rule, and certainly where testing is subjective, all scripts should be scored by at least two independent scorers. Neither scorer should know how the other has scored a test paper. Scores should be recorded on separate score sheets and passed to a third, senior, colleague, who compares the two sets of scores and investigates discrepancies.

Reliability and validity

To be valid a test must provide consistently accurate measurements. It must therefore be reliable. A reliable test, however, may not be valid at all. For example, as a writing test we might require candidates to write down the translation equivalents of 500 words in their own language. This could well be a reliable test; but it is unlikely to be a valid test of writing. In our efforts to make tests reliable, we must be wary of reducing their validity. Earlier in this chapter it was admitted that restricting the scope of what candidates are permitted to write in a composition might diminish the validity of the task. This depends in part on what exactly we are trying to measure by setting the task. If we are interested in candidates' ability to structure a composition, then it would be hard to justify providing them with a structure in order to increase reliability. At the same time we would still try to restrict candidates in ways which would not render their performance on the task invalid. There will always be some tension between reliability and validity. The tester has to balance gains in one against losses in the other.
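The double marking recommended under 'Employ multiple, independent scoring' can be sketched numerically. The short Python example below uses invented marks for six scripts and Pearson correlation as a simple agreement index; this is one common way of estimating a scorer reliability coefficient, though it is only a sketch, not the procedure of any particular examination board.

```python
import math

def pearson(xs, ys):
    """Pearson correlation between two lists of scores."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Invented marks given independently by two scorers to the same six scripts.
scorer_a = [12, 15, 9, 18, 11, 14]
scorer_b = [11, 16, 10, 17, 12, 13]

agreement = pearson(scorer_a, scorer_b)
print(f"scorer agreement: {agreement:.2f}")  # high, but below the 1 of objective scoring
```

In practice a third, senior colleague would then compare the two score sheets and investigate the scripts on which the marks diverge sharply.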
READER ACTIVITIES

1. What published tests are you familiar with? Try to find out their reliability coefficients. (Check the manual; check to see if there is a review in Alderson et al. 1987.) What method was used to arrive at these? What are the standard errors of measurement?

2. The TOEFL test has a standard error of measurement of 15. A particular American college states that it requires a score of 600 on the test for entry. What would you think of students applying to that college and making scores of 605, 600, 595, 590, 575?

3. Look at your own institutional tests. Using the list of points in the chapter, say in what ways you could improve their reliability.

4. What examples can you think of where there would be a tension between reliability and validity? In cases that you know, do you think the right balance has been struck?

Further reading

For more on reliability in general, see Anastasi (1976). For the more mathematically minded, Krzanowski and Woods' (1984) article on reliability will be of interest (note that errata in that article appear in a later issue (1.2, 1984) of the journal Language Testing). For what I think is an exaggerated view of the difficulty of achieving high reliability in more communicative tasks, see Lado (1961).
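Reader Activity 2 can be explored with a short numerical sketch. The Python below reproduces the chapter's worked example (standard error of measurement of 5, actual score 56) and then applies the same band reasoning to the TOEFL figures in the activity. The `standard_error` helper uses the standard formula SEM = SD × √(1 − r), which is consistent with the chapter's statement that the calculation is based on the reliability coefficient and the spread of scores; the verdict strings are only illustrative.

```python
import math

def standard_error(sd, reliability):
    """Standard error of measurement from score spread and reliability
    (standard formula: SEM = SD * sqrt(1 - r))."""
    return sd * math.sqrt(1 - reliability)

def true_score_band(score, sem, n_sems=1):
    """Range within which we can be about 68/95/99.7 per cent certain
    (n_sems = 1/2/3) that the true score lies."""
    return (score - n_sems * sem, score + n_sems * sem)

# The chapter's example: SEM of 5, actual score 56.
assert true_score_band(56, 5, 1) == (51, 61)   # ~68 per cent certain
assert true_score_band(56, 5, 2) == (46, 66)   # ~95 per cent certain

# Reader Activity 2: TOEFL SEM of 15, cut-off of 600.
for score in [605, 600, 595, 590, 575]:
    low, high = true_score_band(score, 15, 1)
    verdict = "true score may well be 600+" if high >= 600 else "well below cut-off"
    print(score, (low, high), verdict)
```

Only the 575 falls more than one SEM below the cut-off; for the other scores the one-SEM band reaches 600, which is exactly the kind of caution the chapter urges when taking decisions about candidates near a cut-off point.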