Introduction
Background of the Study
Testing is an essential aspect of both social and academic environments, particularly in language education, where it involves gathering information and assessing students' language knowledge and usage abilities (Chapelle & Brindley, 2002). In educational systems, tests serve multiple purposes, including regulating university admissions, upholding educational standards, and holding schools accountable for student success (Read & Chapelle, 2001).
English language tests are crucial components of entrance examinations for higher education institutions in Armenia, significantly impacting students' futures. These exams not only determine which academic institution a student can attend but also assess their readiness for university-level education. Beyond their gatekeeping role, entrance examination scores offer valuable insights for administrators, test developers, curriculum designers, and teachers, enabling them to improve instructional methods. The test items are designed to evaluate the essential content knowledge that students are expected to have mastered.
The items should give administrators and curriculum designers insight into the potential success of test takers as future students if admitted to the institution, addressing both the content validity of the test and its predictive validity.
This study aimed to determine whether the 2009 English entrance examination accurately reflected students' knowledge at the required level and aligned with the English language curriculum for Armenian secondary schools. According to the Ministry of Education and Science of Armenia, the unified school-leaving and university entrance exams are structured based on the officially adopted curriculum and approved textbooks. Consequently, these English tests assess the proficiency of English as a Foreign Language (EFL) students within the framework of secondary education in Armenia, serving as both school-leaving assessments and criteria for admission to Armenian state universities, excluding the Russian-Armenian and French universities.
The test is divided into two levels, A and B. Level A is designed for students completing their school-leaving certificate who do not intend to pursue higher education, while Levels A and B together are required of students aiming to enter universities. Comprising 80 multiple-choice items, the test offers practical advantages such as cost-effectiveness, ease of administration, and quick scoring, while also ensuring high reliability. However, the multiple-choice format presents challenges, as it fails to accurately measure real language performance and does not reflect actual language use in real-life situations (Underhill, 1982, p. 18).
Based on the research purpose, the objective of the current study was the following:
1. To find out the extent to which the content of the English entrance examination of 2009 was consistent with the curriculum (teaching/learning objectives) of the English language program in secondary schools of Armenia.
Significance of the Study
The recent introduction of unified tests in Armenia necessitates a thorough investigation of their validity and reliability, as these assessments could serve as a benchmark for enhancing future entrance examinations. The insights gained from this study will be valuable for test designers, curriculum developers, educators, and other stakeholders involved in language testing and assessment within Armenia's educational framework.
Research Paper Outline
The outline of the contents of the current research study is as follows:
Chapter I presents the introduction of the study, which consists of the background of the study, the objectives, the significance of the study, and the research paper outline.
Chapter II provides the theoretical background for the validity study: it discusses the conceptualization of validity, emphasizing the role of content validity in the validity paradigm, and then reviews the literature on the evaluation of test content and the factors affecting test validity.
Chapter III comprises research methodology, including description of the data and participants, as well as data collection and data analysis techniques.
Chapter IV illustrates and analyzes the findings of the study.
Chapter V covers the conclusion of the study, the implications of the findings; and, finally, discusses the limitations of this study and proposes future areas of research.
Review of the Related Literature
Types and Purpose of Language Tests
Tests are essential in educational settings, despite often causing anxiety for both students and teachers. They serve as a crucial method for assessing an individual's knowledge, skills, or performance in a specific area. According to Brown (2004), a test measures a person's abilities, while Farhady and colleagues describe it as a tool for gathering numerical data on various attributes.
Language tests have also served as significant historical documents, revealing insights into attitudes toward language, testing, and teaching throughout history. According to Weir (2005), these tests provide valuable evidence of past language classroom practices when little else remains to inform us about that era.
Language tests vary primarily in their design and objectives, with a key distinction between pen-and-paper tests and performance assessments. Pen-and-paper tests focus on evaluating specific language components such as grammar and vocabulary, as well as receptive skills like listening and reading comprehension. These tests often utilize fixed response formats, particularly multiple-choice questions (MCQs). While test designers are increasingly making MCQs more complex, requiring deeper thought and calculation, the fundamental nature of multiple-choice tests remains intact.
Multiple-choice tests have inherent limitations in the types of achievement they can assess, as they primarily require test-takers to identify the correct answer rather than produce original responses. This format cultivates an expectation for a definitive right answer, which can lead to frustration when test-takers face the complexities of real-life situations and further educational experiences.
Open-ended test items require students to provide responses in the form of short answers or extended essays; they are also referred to as "constructed-response" items, since students must create their answers rather than choose from given options (Zucker, 2003). These types of questions enable students to demonstrate their knowledge and apply critical thinking skills. Assessing writing ability, for instance, is challenging without the inclusion of essays or writing samples.
Constructed-response items necessitate human evaluation, which can lead to concerns about the fairness and objectivity of scoring. However, recent advancements in technology have introduced computer programs capable of scoring essays, addressing some of these concerns (Sireci, 2000; Rudner, 2001; Shermis).
Short-answer questions are typically evaluated by identifying key terms, as they usually do not require full sentences. In contrast, essays and longer responses are generally assessed using a standardized rubric, which offers a comprehensive description of the criteria needed for assigning specific scores.
In terms of test objectives, the main distinction lies among aptitude, placement, diagnostic, achievement, and proficiency tests. A brief description of each is provided below.
Aptitude tests are used for predicting learning rates and overall success in training (Carroll, 1981). Placement tests help categorize new students into teaching groups of similar skill levels within a specific learning area (Mousavi, 2009, p. 532). Diagnostic tests assess a learner's existing skills and knowledge, identifying their strengths and weaknesses (Mousavi, 2009, p. 190). Proficiency tests focus on future language use, independent of prior training or teaching methods, measuring performance against a predetermined criterion (Mousavi, 2009; Brown, 2005).
Achievement tests are administered at various stages of a training course, particularly at the end, to evaluate progress and identify learning challenges. These tests can be tailored to meet the specific objectives of the course designers. However, assessing learners' achievements across different courses that share similar subject matter but differ in content and rationale presents unique challenges. This situation is often addressed through external examinations, such as the USL and UEEE in Armenia, highlighting the complexities school-leavers face in having their academic performance evaluated.
In Armenia, high school students are required to take the same exit and entrance exams, ensuring a standardized assessment process. While they all study English, the instructional materials used vary significantly in both content and rationale, leading to diverse learning experiences.
Language tests in educational settings primarily aim to provide valuable information for decision-making, particularly in evaluation and measurement. As Bachman (1990) notes, evaluation involves two key components: information and value judgments. The decisions made on the basis of these evaluations often pertain to individuals and can significantly impact their lives.
Reliable and valid information is crucial for informed decision-making, particularly in the context of testing. According to Bachman (1990), these qualities serve as the primary justification for using test scores to make inferences or decisions.
Test Measurement Qualities
Reliability in measurement refers to the consistency of test outcomes across different conditions, as highlighted by Bachman and Palmer (1996) and further emphasized by Jonson and Jonson (1999), who state that a reliable test yields similar results under varying circumstances. This consistency in scores is crucial, as it ensures that the test accurately reflects the ability being assessed. Without reliability, test scores would lack the credibility necessary to provide meaningful insights into the measured abilities.
As has been said, the reliability of a test indicates its consistency. Brown (2005) views reliability as a precondition for validity, but a reliable test is not necessarily a valid one.
A test can exhibit reliability without necessarily being valid; that is, it may be designed for a specific purpose but consistently measure something different (Brown, 2005). While reliability and validity are distinct qualities, they are interconnected: enhancing the reliability of measures often meets essential conditions for validity, as a valid test score must first be reliable (Bachman, 1990). Davies and Elder (2004) draw a parallel between the relationship of validity and reliability and that of meaning and form in language studies:
Meaning is what matters, and yet, without Form, Meaning disappears.
Validity is essential for a test, providing it with its distinctiveness as a measure. However, for validity to be meaningful, reliability is also required. While reliability is important, it alone is not enough; the adequacy of a test ultimately hinges on its validity.
Validity is a crucial concept for all language test users, as it shapes the accepted practices of test validation that inform what qualifies as an effective language test for specific contexts (Chapelle, 1999). Essentially, beliefs about validity and the validation process are foundational to claims regarding the worth of a given type of test.
Validity is believed to be the most complex criterion of an effective test.
According to Davies and Elder (2004, p. 798), validity pertains to the truth-value of a test and its scores, making it a crucial yet complex aspect of language testing. It is described as "powerful" due to its influence on all facets of testing, and "precarious" because it is affected by various factors such as logic, reliability, and both local and universal contexts. Philosophers have also long debated the concept of validity, with Marcus (1995) providing a philosophical perspective that further enriches our understanding of this critical notion:
Modern logic defines valid arguments as those that are truth-preserving: an argument is considered valid if the combination of its premises and the negation of its conclusion leads to inconsistency.
In other words, an argument is valid when its premises support its conclusion. According to Davies and Elder (2004), validity is a self-contained concept, akin to abstractions like beauty and truth, and should be evaluated through a validation process. This process involves test users providing evidence to support the inferences or decisions derived from test scores (Cronbach, 1971, as cited in Crocker & Algina, 1986). Ultimately, the strength of validity depends on the rigor of its validation procedures.
Validity and validation are often confused, but they refer to different concepts. Validity is a characteristic of a test, while validation is the process researchers use to determine whether a test possesses validity. As noted by Borsboom, Mellenbergh, and van Heerden (2004), validation involves the activities aimed at assessing the validity of a test.
If validity is seen as an abstract notion, this may suggest that validity can be defined, established, and measured only operationally. Gronlund (1998, p. 226), for instance, defines test validity as the appropriateness, meaningfulness, and usefulness of inferences drawn from assessment results. Brown (2005, p. 220) defines it as the degree to which a test accurately measures what it claims to measure. Furthermore, Messick (1995, p. 742) views validity as a comprehensive evaluation of the evidence supporting score interpretation and its actual and potential consequences.
Validity, as defined by the Standards for Educational and Psychological Testing (AERA, APA, NCME, 1999), is the extent to which evidence and theory substantiate the interpretations of test scores associated with their intended uses. Validating a proposed interpretation involves assessing the claims derived from the test scores, and the evidence required for this validation depends on the specific inferences and assumptions involved.
Applied linguists argue that absolute validity is unattainable and categorize validity into various types based on different perspectives. Savignon (1983) identifies five distinct types of validity: face validity, content validity, predictive validity, concurrent validity, and construct validity.
Scholars such as Brown (2004) and Bachman (1990) suggest that validity encompasses a combination of content, criterion, and construct considerations. These elements should be seen as complementary types of evidence that are essential for the validation process.
Recent advancements in language testing have reshaped the understanding of validity, which is now viewed as a holistic and unified concept centered on construct validity (Farhady, 2006). This comprehensive theory emphasizes the significance of score interpretation and the social values associated with test use. Unified validity effectively combines aspects of content, criteria, and consequences within a singular construct framework.
Construct Validity
Before discussing construct validity, it is important to define the term 'construct.' Chapelle (1998) outlines three perspectives on this definition: a construct can be viewed as a trait, as a behavior, or as a combination of both. In the trait perspective, a person's consistent performance on a test reflects their stable knowledge and skills, which can be applied across various contexts.
A construct defined as behavior suggests that an individual's consistent test performance is tied to the context in which their behavior is assessed. This implies that test results reflect the person's abilities in a particular task or setting, but do not necessarily indicate their performance in different tasks or contexts.
The distinction between trait and behaviorist definitions of linguistic competence is evident when comparing multiple-choice grammar tests with open-ended performance assessments. A trait definition suggests that a person's performance on multiple-choice items reflects a consistent knowledge of grammar applicable in all contexts. Conversely, a behaviorist definition views an individual's essay writing as indicative of their performance on similar tasks, without necessarily predicting their ability on unrelated tasks.
The interactionist definition of a construct combines traits and behaviors, suggesting that an individual's test performance reflects both their underlying traits and the influence of the specific context in which the performance takes place. This approach allows for insights into both context-specific behaviors and person-specific traits, highlighting the interplay between individual characteristics and situational factors in assessment outcomes.
According to Chapelle (1998), in the field of Second Language Acquisition (SLA) research, a construct is defined as a meaningful interpretation of observed behavior, which must demonstrate performance consistency to be effectively interpreted.
In testing, performance consistency is operationally defined through test method facets or task characteristics (Bachman and Palmer, 1996). Adequate justification for the interpretations of test performance in operational settings is essential for appropriate decision-making in educational contexts and for theoretical development in research (Chapelle, 1998, p. 49). The process of obtaining sufficient justification is known as validation.
The unitary view of test validity emphasizes that the primary goal of measurement is to draw inferences from observed test scores to unobservable constructs. This perspective highlights the importance of evaluating the construct validity of these inferences when assessing a test's effectiveness.
Messick (1995) identified six facets of construct validity that serve as essential criteria for educational and psychological measurements, including performance assessments. These facets are: content validity, which assesses the relevance and technical quality of the test; substantive validity, which provides empirical evidence for the appropriateness of the content (Farhady, 2006, p. 151); structural validity, focusing on the reliability of the scoring structure; generalizability, which examines the applicability of results across different populations; external validity, pertaining to the relationships between assessment scores and other measures, reflecting the interactions expected from the construct theory (Messick, 1995, p. 746); and consequential validity, which considers issues of bias, fairness, and the broader social implications of the assessment.
Construct validity refers to the evidence and reasoning that support the reliability of score interpretations, focusing on the underlying concepts that explain test performance and the relationships between scores and other variables.
Construct validity encompasses not only reliance on various forms of evidence but also content relevance, representativeness, and criterion-relatedness. Understanding the range and limits of content coverage, along with the specific criterion behaviors that test scores predict, is crucial for accurate score interpretation. Additionally, the correlations between test scores and criterion measures, when considered alongside other supportive evidence, enhance the overall construct validity of both the predictor and the criterion.
Bachman and Palmer (1996) emphasized that construct validity is a crucial element of test usefulness, which they regarded as the most significant quality of any language assessment. Their model highlights the importance of ensuring that language tests accurately measure the intended constructs.
Bachman and Palmer (1996) identified six key qualities that determine the usefulness of a test: reliability, construct validity, authenticity, interactiveness, impact, and practicality. Understanding these qualities is crucial for grasping the overall concept of test usefulness. (The discussion of reliability is not repeated here, as it was addressed earlier in the paper.)
Authenticity refers to the alignment between language test tasks and real-world target language use (TLU) tasks, as defined by Bachman and Palmer (1996). This quality is essential for assessing how well test scores can be generalized to actual language use in TLU contexts. Furthermore, authenticity is closely linked to construct validity, as both concepts involve evaluating the generalizability of score interpretations beyond the test itself.
Interactiveness refers to the extent and type of a test taker's involvement, influenced by their language ability, background knowledge, and motivation. According to Bachman and Palmer (1996), this involvement plays a crucial role in completing test tasks. The relationship between interactiveness and construct validity is also significant, as whether high interactiveness supports construct validity hinges on the definition of the construct and the individual characteristics of the test taker.
The impact of tests is another crucial aspect of their usefulness, as it pertains to their influence on society, educational systems, and the individuals involved, such as test takers, educators, and decision makers (Bachman and Palmer, 1996). Tests can directly or indirectly affect all stakeholders within these systems.
Criterion Validity
Criterion validity is a crucial aspect of the validation process in language testing, as it establishes the relationship between test scores and a relevant criterion that indicates the tested ability. This relationship can be concurrent, with the criterion measure obtained at the same time as the test administration, or predictive, forecasting future performance on the basis of test results.
Concurrent criterion relatedness assesses differences in test performance among individuals with varying language abilities and examines correlations among different measures of a specific ability. Test users commonly inquire about the correlation between a new test and established standardized tests (Bachman, 1990, pp. 248-249). To address this, test developers frequently administer a standardized test alongside their own to determine whether the scores obtained correlate.
Predictive validity refers to the effectiveness of test scores in forecasting future behaviors, indicating that a test focused on one attribute can serve as a reliable predictor of outcomes associated with other attributes. This concept is particularly crucial for placement tests, admission assessments, and language aptitude evaluations, where the goal is to assess and predict a candidate's potential for future success (Brown, 2004, pp. 24-25).
Criterion-related validity is also essential for measures of learning and achievement, which play a crucial role in providing students with opportunities for college admission and employment. D'Agostino (2004) emphasizes the importance of researchers demonstrating that student achievement measures effectively predict significant life outcomes. However, he notes that there have been limited studies on predictive validity to confirm that achievement test scores can accurately forecast life events, including college attendance, securing meaningful employment, and avoiding unemployment.
Content Validity
The concept of content validity has sparked controversy since its inception, with many validity theorists arguing that the term is technically incorrect and does not fit within acceptable psychometric classifications (Messick, 1989b; Cronbach, 1989). According to the current unitary conceptualization of validity, content validity is seen as content representation in the instrument construction and evaluation processes (Sireci, 1998).
Researchers like LaDuca (1994) emphasize the importance of content validity evidence in various testing scenarios, arguing that it should not be overly simplified within a singular framework (D'Agostino, 2005). D'Agostino further notes that these researchers view content domains as constructs, suggesting that tests aimed at representing objectives rely on inferences related to abstract theoretical concepts, similar to tests designed to assess mental attributes.
Content validity is a crucial aspect to consider when developing a test, as it evaluates how well the test items represent the intended content. It assesses the extent to which the measure accurately samples the relevant material, ensuring that conclusions drawn from the test are valid (Rubio et al., 2003).
Content validity, also known as context validity, refers to how well the tasks chosen for a test represent the broader range of tasks the test is intended to sample (Weir, 2005). Hughes (1989) further elaborates on this concept, emphasizing its importance in ensuring that a test accurately reflects the skills or knowledge it aims to measure.
Content validity thus refers to the extent to which a test accurately represents the language skills and structures it aims to assess. To evaluate a test's content validity, it is essential to have a clear specification of the skills and structures intended for assessment, along with a principled selection of relevant elements for inclusion. The higher a test's content validity, the more likely it is to measure its intended outcomes effectively.
Bachman (1990) identifies two key aspects of content validation: content relevance and content coverage. Content relevance emphasizes the importance of aligning test content with the specific facets and ability domains being assessed. It is essential that the test items reflect the materials taught in the instructional program, as highlighted by Farhady (2006). The need for clear domain specification is also acknowledged by advocates of criterion-referenced test development, ensuring that assessments accurately measure the intended constructs.
Content coverage refers to how well the tasks in a test represent the specific behavioral domain being assessed. If the test developer clearly defines this domain and the potential tasks involved, a systematic random sampling method can be applied to ensure that the test tasks accurately reflect the intended domain (Bachman, 1990, pp. 244-247).
Language testers, such as Bachman (2002), have identified significant challenges in assessing content validity in language tests. Bachman (1990) notes that defining a clear and unambiguous domain of language use tasks is problematic, making it difficult to establish content relevance and coverage. While efforts have been made to identify representative samples, such as the needs analysis proposed by Branden, Depauw, and Gysen (2002), these methods face challenges, particularly when test-takers come from diverse backgrounds. This issue also extends to criterion-referenced tests, which are typically used to demonstrate the adequacy of content evidence for validating language assessments.
Kane (2006, p. 131) emphasizes that the interpretation of test scores hinges on evaluating the procedures used to generate those scores. He argues that inferences regarding constructs and the intended applications of test scores depend on the relevance of the observed performances to these interpretations. This evidence, derived from procedural and judgmental assessments during test development, is commonly known as content-related validity evidence.
Content-related validity evidence has, however, been criticized for a number of reasons.
Analyzing the structure and content of tests is essential for evaluating the validity of inferences and assumptions related to test score interpretations. According to Kane (2006), all validation strategies depend on content-related evidence to assess relevance, representativeness, and the effects of systematic errors. While this type of evidence alone does not completely validate interpretations, it is a crucial component of the overall validity analysis.
Messick (1995) argued that content-related validity evidence is restricted to the test's content and procedures, supporting claims about the test itself without addressing the interpretation or application of the scores. He emphasized that such evidence should not be used to validate conclusions that exceed its scope. Similarly, Kane (2006) noted that while content-related evidence can justify score interpretations based on expected performance across a domain, it does not support construct interpretation. Table 1 outlines the key shifts in the conceptualization of validation over time.
Table 1: Summary of contrasts between the two conceptualizations of validation (Chapelle, 1999)
Past conceptualization: Validity was considered a characteristic of a test: the extent to which a test measures what it is supposed to measure.
Current conceptualization: Validity is considered an argument concerning test interpretation and use: the extent to which test interpretations and uses can be justified.
Past conceptualization: Reliability was seen as distinct from and a necessary condition for validity.
Current conceptualization: Reliability can be seen as one type of validity evidence.
Past conceptualization: Validity was often established through correlations of a test with other tests.
Current conceptualization: Validity is argued on the basis of a number of types of rationales and evidence, including the consequences of testing.
Past conceptualization: Construct validity was seen as one of three types of validity (the three validities were content, criterion-related, and construct).
Current conceptualization: Validity is a unitary concept with construct validity as central (content and criterion-related evidence can be used as evidence about construct validity).
Past conceptualization: Establishing validity was considered within the purview of testing researchers responsible for developing large-scale, high-stakes tests.
Current conceptualization: Justifying the validity of test use is the responsibility of all test users.
To establish validity in testing, it is crucial to clearly define the objectives of the assessment, including what content is being measured and for what target population. Without this clarity, accurately evaluating the effectiveness of the test becomes unattainable (Brown, 2005, p. 222).
Evaluating content validity involves assessing both the test as a whole and its individual components. According to Sireci (1998), content validity refers to the credibility and reliability of the assessment tool in measuring the intended construct.
Content validity is narrower than construct validity, as noted by Sireci (1998), who emphasizes the importance of assessing the quality and relevance of the assessment instrument before evaluating score-based inferences. It is crucial to analyze the content characteristics of a test in relation to its specific purpose, as a test may demonstrate content validity for one purpose while lacking it for another.
Once the content to be tested is established, test developers must create item specifications that outline the testing objectives for the specific test. These item specifications, which include a general description, a sample item, stimulus attributes, and supplemental lists, enable any test writer to create items that assess the same concepts (Brown, 2005). According to Brown, ensuring a strong alignment between the test items and the specifications is crucial for supporting claims of content validity. Therefore, the careful consideration of test content plays a vital role in both the development and use of tests.
The Effect of Coaching/Tutoring on Validity
In Messick's framework, validation analyses explore the impact of individual differences, task characteristics, testing environments, and coaching on test outcomes, along with the social implications of testing. Although these aspects were not examined in the current study due to contextual constraints, their potential importance warrants careful consideration and structured investigation in future test validation research.
In Armenia, test preparation through coaching and tutoring is widely practiced, but it is essential to differentiate between the two terms. While both involve educational support, coaching encompasses traditional tutoring elements and aims to develop students into more effective learners, going beyond mere exam survival.
According to the Encyclopedic Dictionary of Language Testing (2009), coaching refers to focused, short-term training aimed at enhancing test-taking skills and familiarizing individuals with the specific types of questions found on the relevant examination (p. 98).
This paper uses "coaching" and "tutoring" to refer to practices aimed at improving test-takers' scores on standardized high-stakes tests. Standardized testing is characterized by a uniform administration and scoring process, ensuring that all students take the same test under identical conditions, which allows for accurate comparisons of performance across different educational settings. As a result, the outcomes of standardized tests can be compared across schools, districts, and states.
High-stakes testing refers to the practice of using standardized test results as the primary factor in significant decisions, such as student promotion or high school graduation (Brown, 2004; Resnick, 2004; Cizek, 2001). A notable example is college entrance exams, where a single multiple-choice test can have a profound impact on a student's future. Consequently, the implications of high-stakes testing extend beyond the students themselves, affecting their families as well.
Parents advocate for their children's best interests, especially when their future is at risk. They strive to shield their children from negative experiences, believing that students should not be penalized for a school's shortcomings in providing essential knowledge and skills. To ensure their children are adequately prepared for the exams, parents often seek the help of tutors, making coaching an essential resource in their educational journey.
The contemporary theory of test validation has broadened its focus to encompass the changes that the introduction of a test can bring about, particularly in the preparation of test takers. According to McNamara (2000), such changes can affect what the test measures, raising concerns about the fairness of inferences drawn about candidates. This concept is referred to as consequential validity. Messick (1989) was the first to integrate consequences into the validity argument, while Shepard (1993, 1997) expanded this definition by emphasizing the need to examine both the positive and negative, as well as the intended and unintended, consequences of score-based inferences in order to assess the validity of an assessment system.
Consequential validity refers to the social implications of using a specific test for its intended purpose, with its validity judged by the societal benefits derived from that use. While some testing experts emphasize the importance of these social consequences in determining a test's validity, others argue that such factors should not be included in the definition of validity itself.
Consequential validity refers to the comprehensive effects of a test, encompassing its accuracy in assessing intended criteria, the influence it has on test-taker preparation, its impact on learners, and the broader social implications of how the test is interpreted and utilized (Brown, 2004).
Coaching students for test performance can significantly affect the consequential validity of test preparation courses. It suggests that test results may partly reflect socioeconomic conditions, as access to coaching opportunities can vary with those conditions.
The availability of coaching for students varies significantly: some children benefit from years of training, while others receive only limited support due to financial constraints. This disparity in access can affect assessment outcomes and overall student performance.
In a study by Ferman (2004), the impact of a national EFL oral matriculation test on the Israeli educational system was analyzed, revealing that teachers of grades 11 and 12 adapted their teaching methods to prepare students specifically for the exam. Techniques included student coaching, focusing solely on test-relevant content, intensive practice for struggling students with cue cards, and integrative teaching approaches. This shift in teaching strategy is attributed to increasing pressures on educators to achieve higher standards, leading many to prioritize test-oriented material over traditional curriculum content (Kaufhold, 1998). Teachers often tailor their instruction to simulate exam tasks, prioritizing student success in assessments over genuine language acquisition. Many educators admit that their teaching methods would differ significantly if not for the pressure of exams. As Ferman (2004) highlights, the primary focus of English instruction tends to be on preparing students for exams rather than fostering true language learning.
This phenomenon is known as teaching for/to the test, which can actually invalidate test scores and defeat the purpose of the test. Robert L. Linn and Norman E. Gronlund, the authors of Measurement and Assessment in Teaching (2000), note that teaching to the test can lead to a detrimental reduction in the breadth of the curriculum, potentially inflating test scores and altering the interpretation of results; the focus may shift from essential problem-solving skills to mere memorization.
Powers (1986, pp. 1-2) presented some possible outcomes of special test preparation, as discussed by Messick (1981) and Anastasi (1981). One of them is that:
Special preparation, especially through test orientation or familiarization, can improve test-taking skills and increase scores that might otherwise be unfairly low due to irrelevant difficulties such as unfamiliarity with question types or unclear test instructions. By removing these unrelated difficulties, such preparation can enhance both the predictive validity and the construct validity of the assessments.
The Relationship of Alignment between Curriculum, Assessment, and Instruction to Validity
Alignment in education involves connecting content and performance standards to assessment, instruction, and learning in the classroom (Mousavi, 2009). It serves as a framework for evaluating how well various elements of an educational system work together to achieve shared objectives. On a broader scale, curricular alignment assesses how effectively the curriculum across different grades builds upon and reinforces prior learning (Tyler, 1949, cited in Martone & Sireci, 2009). La Marca et al. (2000) highlighted that assessments should enable students to showcase their knowledge and skills in line with curriculum expectations, ensuring accurate interpretations of their performance.
The assessment should thoroughly address the content standards with the necessary depth and emphasis, ensuring that it encompasses a range of performance scores. It must provide all students the chance to demonstrate their proficiency and be reported in a way that clearly communicates their proficiency in relation to the content standards.
The theory underlying alignment research is that a consistent message from all aspects of the educational structure will result in systemic, standards-based reform (Smith & O'Day). Porter (2002) emphasized the importance of a cohesive instructional system guided by content standards; these standards must then be reflected in assessments, curriculum materials, and professional development, ensuring that all elements are closely aligned with the established content standards (Martone & Sireci, 2009).
Sireci (1998a, 1998b) identifies four key aspects of content validity studies: defining the domain, representing the domain, ensuring domain relevance, and evaluating the appropriateness of test construction procedures.
Domain representation is crucial for assessing how well a test measures all aspects of the intended content area (Sireci, 1998a). To evaluate this representation, a thorough examination of the test items and tasks is necessary (Martone & Sireci, 2009). Typically, subject matter experts, such as teachers, review test items to determine their alignment with the test specifications (Crocker, Miller, & Franks, 1989; Sireci, 1998a). Domain relevance concerns the relevance of each test item to the assessed domain. Martone and Sireci (2009) emphasize that appropriate test development procedures ensure that the content accurately reflects the intended construct without including irrelevant material. Evaluating a state test requires analyzing both the test content and the state standards, as well as the instruction received by students. Alignment research encompasses these elements, offering validity evidence for evaluating tests, curricula, and instructional practices (Martone & Sireci, 2009).
Alignment research plays a crucial role in identifying potential deficiencies in assessment and instruction by systematically comparing various elements of the educational process. As noted by Webb (1999), a lack of alignment among educational components can lead to inconsistent messages regarding the values of the educational system. Consequently, alignment research serves as a valuable tool to address concerns about curriculum simplification (Linn, 2000) and to ensure that all students receive equitable opportunities to learn the material on which they are assessed (Winfield, 1993). At the same time, it has been noted that states have not adequately addressed the issue of enhancing instructional quality (Rothman et al., 2002). Alignment evaluations also expand upon the insights offered by standard content validity studies, which primarily assess the extent to which test items align with the domains outlined in a test blueprint (Martone & Sireci, 2009).
Statement of Purpose
This study considers the importance of evaluating the validity of standardized high-stakes tests in the light of the English unified school-leaving and university entrance exam of 2009. The research focused on assessing the content validity of the specified test by employing statistical procedures, specifically analyzing data derived from the administration of the test, including both test and item score data.
In light of the above-discussed factors, the research question to be addressed is:
1. To what extent is the content of the unified English school-leaving and entrance examination of 2009 consistent with the curriculum (teaching/learning objectives) of the English language program in secondary schools of Armenia?
Methodology
Data and Participants
The data for this study were obtained from the results of two tests administered to 62 school-leavers (46 females and 16 males), randomly selected from various secondary schools in Yerevan and from the Preparatory Course of the Russian-Armenian (Slavonic) University (RAU). The participants, primarily school-leavers from different Armenian schools, were deemed suitable for the research, as the Preparatory Course aims to equip students for RAU's entrance exams. Notably, RAU's English exam has distinct features compared to the Unified School Leaving and University Entrance Examination in English (USL and UEEE). It comprises three sections: the first two sections include multiple-choice questions assessing reading comprehension (Section 1) and grammar knowledge (Section 2), while Section 3 involves translating five independent sentences.
For more information about the Russian-Armenian (Slavonic) University, you may visit the site: http://www.rau.am/index.php?l=1&l1=6&l29.
The age range of the participants was 15 to 17. The research was conducted at the end of the academic year, in early June 2010.
Instrumentation
This study utilized two instruments for data collection, the primary tool being the 2009 Unified School Leaving and University Entrance Examination in English (USL and UEEE), available at www.atc.am/?id=3. The test comprises two levels, Level A and Level B, featuring a total of 80 multiple-choice questions (MCQs): 50 in Level A and 30 in Level B. These items are designed to assess test-takers' reading comprehension, vocabulary, and grammar knowledge. A description of the specific skills and knowledge evaluated in each subsection is provided below.
Level A comprises 8 subsections:
Subsection 1 is aimed at checking test-takers' reading comprehension skills: their ability to comprehend the main idea of a written text and to understand its sentences and vocabulary.
Subsection 2 tests school-leavers' ability to understand the gist of a coherent text by selecting the field the texts are taken from (e.g., Chemistry, Environment, Plants, Animals & Birds, etc.).
Subsections 3, 4 and 5 check testees' knowledge of the parts of speech through short texts.
Subsection 3 requires knowledge of the Verb; namely, tense and aspect, regular and irregular verbs, transitive and intransitive verbs, auxiliary verbs, link-verbs, the voice of the verb (Active and Passive), and the sequence of tenses.
Subsection 4 encompasses the following grammatical aspects: non-finite verbs (Infinitive, Gerund, Participle I and Participle II); the Noun (number and case); the Adjective (degrees of comparison, their formation and construction); ordinal and cardinal numerals; various types of pronouns (personal, possessive, reflexive, reciprocal, objective, relative, definite, indefinite, negative, and interrogative); and adverbs (their types and degrees of comparison).
Subsection 5 comprises Modal verbs (meanings and use) and Articles (Definite and Indefinite).
Subsection 6 tests test-takers' knowledge of word-formation, including the formation of nouns, adjectives, adverbs, numerals, and verbs by means of various suffixes and prefixes. Subsection 7 deals with the peculiarities of converting Direct Speech into Reported Speech and vice versa.
Subsection 8 checks the testees' knowledge of question types and their word order.
Level B comprises the remaining 4 subsections (Subsections 9-12):
Subsection 9 is aimed at checking test-takers' reading comprehension skills: their ability to recognize the main idea of the reading selection and identify the details presented in it, their ability to understand sentences and vocabulary, and their referencing skills. Items checking the school-leavers' ability to make inferences are also included in the task.
Subsection 10 is designed to check test-takers' syntactical knowledge: compound and complex sentences, primary and secondary clauses, types of subordinate clauses, and linking words/conjunctions. It mainly tests school-leavers' ability to complete the given sentences or to connect the two parts of a complex sentence by selecting logically and grammatically correct options.
Subsection 11 tests school-leavers' ability to understand the gist of a coherent text. The testees have to fit missing paragraphs into the gaps of the text to maintain cohesion between the paragraphs of the given text.
Subsection 12 covers some grammatical aspects; the testees here are required to find the odd word.
The second instrument utilized in this study was an achievement test developed specifically for this research, aligned with the textbook series sanctioned by the Ministry of Education and Science of Armenia. This test was designed with careful consideration of the official English exam requirements and the teaching and learning objectives established by the Ministry. Detailed information regarding the test development process can be found in the Procedure section.
(For convenience, the test designed for the present research will hereinafter be referred to as Test 1, and the USL and UEEE of 2009 as Test 2.)
Procedure
The study aimed to assess the relevance of the USL and UEEE to the instructional materials used in Grades 5 to 10. To achieve this, an analysis was conducted of the textbooks and of the High School English Language teaching objectives outlined by the Ministry of Education and Science (2007). While various instructional materials are used in Armenian schools, not all are officially endorsed by the Ministry. Therefore, gathering information on the textbooks approved by the Ministry was crucial to ensure alignment with the curriculum.
The unified school-leaving and university entrance exams in Armenia are structured based on the curriculum set by the Ministry of Education and Science, utilizing textbooks and test materials approved by the Ministry (Decision N 238, RA Government, March 11, 2010). Consequently, the initial phase of the research involved identifying the specific textbooks or series relevant to the study, leading to the selection of the following textbooks for examination:
Baghdassarian, S., S. Gurjayants & M. Araratian. (2005). English 10. Yerevan:
Baghdassarian, S., S. Gurjayants & M. Araratian. (2006). English 9. Yerevan:
Hovhannesian, N., H. Kachberuny & G. Gasparyan. (2000). English 8. Yerevan: Macmillan Armenia.
Grigoryan, L. (2002). English 7. Yerevan: Macmillan Armenia.
Gasparyan, G., N. Hovhannesian & H. Kachberuny. (2004). English 6. Yerevan:
Gasparyan, G., N. Hovhannesian & H. Kachberuny. (2003). English 5. Yerevan:
A comprehensive table of contents has been developed using data from the textbooks, encompassing essential information on language skills such as reading, writing, speaking, and listening, as well as grammar and vocabulary, including subject-specific terminology (refer to Appendix 1.2).
A table was created to align with the English Language teaching and learning objectives outlined in the high school curriculum, with key points summarized in Appendix 2.
To ensure the content and construct validity of the test, detailed specifications were developed based on the table of contents and the teaching objectives of the English Language curriculum (Mousavi, 2009, p. 761). Consequently, a table of specifications for the achievement test was created.
The initial draft of the test was developed on the basis of the specified tasks, incorporating sections for reading comprehension, grammar and structure, and writing. Listening comprehension, however, was omitted because of the limited attention given to listening in the textbooks: up to English 8, the textbooks include no listening tasks, while English 9 and English 10 feature only four listening activities, primarily recognition-based, with minimal emphasis on understanding main ideas or details. As a result, assessing listening comprehension skills was deemed unsuitable for the test.
The test had to align with the language abilities of the test-takers, incorporating tasks that reflect the topics covered in the relevant textbooks. Additionally, these tasks should engage students by connecting to their interests and age, while also drawing on their prior knowledge and experiences.
As a result, the developed draft of the test comprised three sections: a Reading Comprehension section with two texts followed by multiple-choice questions, a Grammar and Structure section divided into six subsections totaling 57 items, and a Production section in which test-takers wrote a composition on one of the suggested topics. The test thus offered students diverse item types, predominantly performance-based, with the exception of the reading comprehension items and Subsection III. All item types used are aligned with the textbooks under review.
For the reading comprehension assessment, texts were chosen to match an appropriate difficulty level. An analysis of the readability indexes of 12 passages from the English 8, English 9, and English 10 textbooks showed an average readability index of about 78. Consequently, two texts were selected for the test, with readability indexes of 77 and 72, respectively. The statistical analysis of the readability data is presented in Table A.
Table A: Descriptive Statistics of Readability Indexes (N, Min, Max, Mean, Std. Deviation)
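For readers who wish to reproduce this kind of readability screening, the sketch below is a minimal illustration. The thesis does not name the readability index used, so the Flesch Reading Ease formula is assumed here, and the sample passage is invented; the sketch does not reproduce the study's texts or values.

```python
import re

def flesch_reading_ease(text: str) -> float:
    # Flesch Reading Ease = 206.835 - 1.015*(words/sentences) - 84.6*(syllables/words)
    sentences = max(1, len(re.findall(r"[.!?]+", text)))
    words = re.findall(r"[A-Za-z']+", text)

    def count_syllables(word: str) -> int:
        # Rough heuristic: count groups of consecutive vowels.
        return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

    n_words = max(1, len(words))
    syllables = sum(count_syllables(w) for w in words)
    return 206.835 - 1.015 * (n_words / sentences) - 84.6 * (syllables / n_words)

# Hypothetical passage used purely for illustration.
sample = "The cat sat on the mat. It was a warm day, and the cat was happy to rest."
print(round(flesch_reading_ease(sample), 1))
```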
The Grammar and Structure section evaluated students' skills in recognizing, understanding, and applying grammatical forms through realistic and contextualized test items. These items were designed to align with the topics and structures outlined in the table of contents; a comprehensive description of each subsection is available in the table of specifications (see Appendix 2).
In Sections 1 and 2, each item was assigned equal weight, with 1 point awarded for a correct answer and 0 points for an incorrect one. For an answer to be considered correct, it needed to be both grammatically and lexically accurate. As stated in Memorandum #51 from the Evaluation and Examination Service of the University of Iowa, using inappropriate weighting could compromise reliability. To prevent any negative outcomes, a uniform 1/0 scoring system was implemented, avoiding differential weighting altogether.
In the final phase of the test, students completed the Production Section, where they were tasked with writing a short composition or essay of approximately 100 words on a given topic. Their performance was evaluated on the organization and development of their ideas, as well as the appropriateness of their vocabulary and their grammatical accuracy. Scoring adhered to the established criteria outlined in the Writing Rubrics of the California High School Proficiency Examination (the rubrics can be found in Test 1).
To establish interrater reliability, which measures the agreement between multiple raters, the writings were evaluated by a high school teacher and the researcher (Brown, 2005). According to Brown, using more than one rater is essential when assessing students' productive skills, such as writing and speaking, to ensure the reliability of the results (p. 185). To estimate this reliability, a correlation coefficient was calculated between the scores given by the two raters. The computations were performed in SPSS, using the Spearman rank-order correlation (Rho), as recommended by Hatch & Farhady (1981).
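As an illustration of the interrater reliability estimate described above, the sketch below computes Spearman's rank-order correlation between two raters' essay scores in Python rather than SPSS; the score lists are hypothetical placeholders, not the study's data.

```python
from scipy.stats import spearmanr

# Hypothetical rubric scores assigned by the two raters (placeholders only).
teacher_scores = [4, 3, 5, 2, 4, 3, 5, 4, 2, 3]
researcher_scores = [4, 3, 4, 2, 5, 3, 5, 4, 3, 3]

# Spearman rho measures agreement in the rank ordering of the two score sets.
rho, p_value = spearmanr(teacher_scores, researcher_scores)
print(f"Spearman rho = {rho:.2f}, p = {p_value:.3f}")
```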
Once the initial version of the test was completed, a colleague reviewed it, leading to necessary revisions and modifications before it was piloted. The piloting pursued the following aims:
1. To identify problems in task specifications and clarity of instructions.
2. To discover how test takers respond to the test tasks, i.e., to obtain preliminary information on the test-taking process, on test takers' perception of the test tasks, and on test performance.
3. To determine appropriate time allocation.
The pilot test, conducted with 19 school-leavers from school No. 114, identified several issues related to individual items, instructions, and entire tasks. Modifications were made based on the findings, particularly in Section 1, Text 2, where item 8, designed to assess students' ability to identify the main idea, posed challenges: the correct response was not clearly defined, leading many test-takers to choose an effective distracter instead. After this item was reformulated, the final administration showed that 38% of test takers, or 25 out of 62, answered it correctly.
Data Analysis
The current study utilized both quantitative and qualitative data in its analyses. Quantitative data were collected from the two tests (Test 1 and Test 2, the USL and UEEE of 2009) and from 40 essays. The total scores, the individual components of the tests, and the essay scores from the production section of Test 1 were analyzed in SPSS to determine whether there were statistically significant differences and to evaluate the associations between the tests and their components, which measure different skills, sub-skills, and grammar knowledge. Additionally, correlations between the essays and Tests 1 and 2 were examined to assess the relationship between test scores and writing performance, with the aim of ascertaining whether the test scores accurately reflected the writing abilities of the test takers.
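A brief sketch of the kind of association analysis mentioned above is given below. The thesis does not specify which coefficient was used for the test-essay correlations, so Pearson's r is shown here as an assumption, and all arrays are invented placeholders rather than the study's scores.

```python
import numpy as np
from scipy.stats import pearsonr

rng = np.random.default_rng(7)
test_totals = rng.normal(55, 8, 40)                      # hypothetical test totals (40 essay writers)
essay_scores = 0.5 * test_totals + rng.normal(0, 4, 40)  # hypothetical essay scores

# Pearson r quantifies the linear association between test scores and essay scores.
r, p = pearsonr(test_totals, essay_scores)
print(f"r = {r:.2f}, p = {p:.3f}")
```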
Qualitative data were gathered by assessing the alignment among the curriculum, the assessments, and the primary instructional materials, such as the textbooks used in schools. Martone and Sireci (2009) emphasize that effective tests should reflect the content taught in the classroom, ensuring coherence between curriculum and instruction. Alignment research thus serves as a valuable method for evaluating the relationship between testing, content standards, and instructional practices, including the materials used for teaching.
Results and Discussion
Results
4.1.1 The Reliability Analysis of Tests 1 and 2
Before evaluating content validity, it was crucial to assess the reliability of the tests used in the research. Reliability is a prerequisite for validity: while a measure can be reliable without being valid, a valid measure must always be reliable. As Brown (2005) states, "if a test is not reliable, it is not valid either" (p. 220). Therefore, determining the reliability of both Test 1 and Test 2 was essential.
To estimate the tests' reliability, the internal-consistency method, Cronbach's alpha coefficient (α), was used (Pallant, 2007); the higher the alpha, the more reliable the test. The Cronbach's alpha coefficient, which ideally should exceed .70, was α = .83 for Test 1 and α = .87 for Test 2, indicating very good internal consistency reliability for the scale in the sample of 62 participants. These results confirm that the precondition for the validity study was satisfied.
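As an illustration of the internal-consistency estimate, the following Python sketch computes Cronbach's alpha from a small, hypothetical respondents-by-items matrix of 1/0 item scores; the study's own coefficients were obtained in SPSS.

# Illustrative sketch: Cronbach's alpha computed from a small, hypothetical
# respondents-by-items matrix of 1/0 scores (the study's coefficients came from SPSS).
import numpy as np

def cronbach_alpha(item_scores):
    """item_scores: 2-D array, rows = test takers, columns = items."""
    items = np.asarray(item_scores, dtype=float)
    k = items.shape[1]                               # number of items
    item_variances = items.var(axis=0, ddof=1)       # variance of each item
    total_variance = items.sum(axis=1).var(ddof=1)   # variance of total scores
    return (k / (k - 1)) * (1 - item_variances.sum() / total_variance)

scores = np.array([[1, 0, 1, 1],
                   [1, 1, 1, 0],
                   [0, 0, 1, 0],
                   [1, 1, 1, 1]])
print(round(cronbach_alpha(scores), 2))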
This study employed multiple analyses to address the research question, beginning with data collection from two tests: Test 1, designed specifically for this research, and Test 2, the USL and UEEE test of 2009. Notably, the writing section of Test 1 was analyzed separately, as only 40 of the 62 participants completed it; this ensured the integrity of the findings and avoided distorted or misleading results.
A paired samples t-test (also known as a repeated measures t-test) was conducted to determine whether there was a statistically significant difference between the mean scores of Test 1 and Test 2. Given that parametric statistical methods, such as t-tests and ANOVA, rely on the assumption of a normal distribution in the population, it was essential to verify whether the study's data met this criterion. Histograms with curves illustrating the distribution of scores on Tests 1 and 2 are presented in Figures 1 and 2.
Figure 1: Distribution of scores on Test 1
Figure 2: Distribution of scores on Test 2
The data presented in Figure 1 indicate a slight negative skew in the "bell curve," suggesting that a greater number of participants achieved higher scores on Test 1. According to Pallant (2007, p. 238), with a sample size of 30 or more the impact of such skewness is minimal and unlikely to cause significant problems. Even if a distribution deviates from normality, it is often reasonable to treat it as a good approximation of a normal distribution, allowing the continued use of parametric methods such as the t-test.
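The following Python sketch illustrates, with hypothetical arrays standing in for the 62 paired total scores, the two steps described above: inspecting the skewness of the distributions and running a paired samples t-test.

# Illustrative sketch with hypothetical arrays standing in for the 62 paired total scores:
# checking the skewness of the distributions and running a paired samples t-test.
import numpy as np
from scipy.stats import skew, ttest_rel

rng = np.random.default_rng(0)
test1_scores = rng.normal(55, 10, 62)               # placeholder totals for Test 1
test2_scores = test1_scores + rng.normal(5, 4, 62)  # placeholder totals for Test 2

print("Skewness of Test 1 scores:", round(float(skew(test1_scores)), 2))
t_stat, p_value = ttest_rel(test1_scores, test2_scores)
print(f"t = {t_stat:.2f}, df = {len(test1_scores) - 1}, p = {p_value:.3f}")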
The results of the paired samples t-test are presented in Tables 1 and 2.
Table 1: Paired Samples T-test Descriptive Statistics
Table 2: Paired Samples T-Test Results
Table 1 presents the descriptive statistics for the total scores obtained on Tests 1 and 2. Table 2 indicates that the observed probability (p) value is .000 (2-tailed), which is less than the critical alpha value of .05; the obtained t value is -9.36, and the degrees of freedom (df) = 61. These results suggest that there is a significant difference between the mean scores of Tests 1 and 2.
To explore the relationship between the scores on Tests 1 and 2, a correlation analysis was conducted using the Pearson product-moment correlation coefficient (r). This coefficient ranges from -1.00 to 1.00; following Brown (2005, p. 159), r = .00 to .59 indicates a low correlation, r = .60 to .79 a moderate correlation, and r = .80 to 1.00 a strong correlation between two sets of scores.
The strength of association between the two tests was high (r = .855; see Table 2), the correlation was statistically significant (p < .001), and the shared variance was approximately 73% (.855² ≈ .73) (cf. Pallant, 2007; Brown, 2005; Campbell & Machin, 1999; Hatch & Farhady, 1981).
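As a minimal sketch of this step, the Pearson coefficient and the shared variance (r²) can be computed as follows; the arrays are hypothetical stand-ins for the paired total scores, not the study's data.

# Illustrative sketch of the Pearson product-moment correlation and the shared
# variance (r squared); the arrays are hypothetical stand-ins, not the study's scores.
import numpy as np
from scipy.stats import pearsonr

rng = np.random.default_rng(0)
test1_scores = rng.normal(55, 10, 62)               # placeholder totals for Test 1
test2_scores = test1_scores + rng.normal(5, 4, 62)  # placeholder totals for Test 2

r, p_value = pearsonr(test1_scores, test2_scores)
print(f"r = {r:.3f}, p = {p_value:.3f}, shared variance = {r**2:.0%}")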
A correlation analysis was also conducted to examine the relationships among the various components of Tests 1 and 2, which assessed different skills, sub-skills, and grammatical knowledge. This analysis explored the connections both between the two tests and within the individual components of each test.
Table 3 lists the components used for the correlation analysis; variables labeled 1 correspond to Test 1 and those labeled 2 correspond to Test 2, for a total of 23 components.
Table 3: Components of Tests 1 and 2
Table 4 presents a correlation matrix containing the Pearson product-moment correlation coefficients (r) for each pair of components. The matrix is symmetric: the correlation between any two components is the same regardless of the order of the variables; for example, the correlation between VOC 1 and M.IDEA 2 is identical to that between M.IDEA 2 and VOC 1. To interpret the results, therefore, attention was focused on the correlations on one side of the diagonal, as the information on the other side is redundant. The values in the correlation matrix lie in the [-1, 1] interval, and since the correlation of a variable with itself is 1, the diagonal elements of the matrix are all equal to 1.
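The construction of such a symmetric matrix can be illustrated with the following Python sketch, which uses hypothetical data and only four of the 23 component labels.

# Illustrative sketch of how a symmetric correlation matrix like Table 4 can be produced
# with pandas; the data and the four abridged component labels are hypothetical.
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
components = ["VOC 1", "M.IDEA 1", "VOC 2", "M.IDEA 2"]
data = pd.DataFrame(rng.integers(0, 6, size=(62, 4)), columns=components)

corr_matrix = data.corr(method="pearson")   # symmetric; diagonal elements equal 1
print(corr_matrix.round(2))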
Table 4: Correlation Matrix of the 23 Components between and within Tests 1 and 2
[Table 4 column headings: VOC 1 (1), M.IDEA 1 (2), SP.INFO 1 (3), REFER 1 (4), INFER 1 (5), VERB-FORM 1 (6), PAR.MV 1 (7), CONJ 1 (8), PR.AD.G 1 (9), W.FORM 1 (10), Reported Speech 1 (11), QUEST 1 (12), VOC 2 (13), M.IDEA 2 (14), SP.INFO 2 (15), REFER 2 (16), VERB-FORM 2 (17), PAR.MV 2 (18), CONJ 2 (19), PR.AD.G 2 (20), W.FORM 2 (21), Reported Speech 2 (22), QUEST 2 (23)]
Correlation between the Components of Tests 1 and 2
** Correlation is significant at the 0.01 level (2-tailed).
* Correlation is significant at the 0.05 level (2-tailed).
To facilitate the observation of relationships between variables, the r values are color-coded: yellow indicates a weak correlation (r = .10 to .29), green a moderate correlation (r = .30 to .49), and red a strong correlation (r = .50 to 1.0), following the guidelines of Cohen (1988, in Pallant, 2007, p. 132). The few blue cells indicate negative correlations. (Table 4.1 is provided to illustrate the relationships between the components of Test 1 and Test 2 only.)
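A minimal helper reflecting Cohen's benchmarks as cited above is sketched below; the thresholds are those given in the text, while the function itself is illustrative and not part of the study's procedure.

# Illustrative helper reflecting Cohen's benchmarks as cited above; the thresholds are
# those given in the text, while the function itself is not part of the study's procedure.
def correlation_strength(r):
    r = abs(r)
    if r >= 0.50:
        return "strong"
    if r >= 0.30:
        return "moderate"
    if r >= 0.10:
        return "weak"
    return "negligible"

print(correlation_strength(0.855))   # strong
print(correlation_strength(0.021))   # negligible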
Table 4 reveals that the relationships between scores on the components of Tests 1 and 2 are generally weak to moderate. The weakest correlation was found between Variable 2 (Main Idea 1) and Variable 18 (Preposition, Article, Modal Verb 2), with r = .021 and p = .868. Conversely, the strongest correlation was identified between Variable 10 (Word Formation 1) and Variable 19 (Conjunction 2).
There was also a strong positive relationship between Word Formation 1 and the following components of Test 2: Vocabulary (r = .511), Pronoun, Adjective, Gerund (r = .595), Word Formation (r = .616), Reported Speech (r = .551), and Questions (r = .595). For all of these correlations, the p value is .000, indicating a statistically significant relationship between the pairs of variables.
Variable 9 (Pron., Adj., Gerund 1) also had a high degree of correlation with Variables 19, 20, 22, 23, and 16 of Test 2: namely, Conjunction 2 (r = .569), Pronoun, Adjective, Gerund 2 (r = .533), Reported Speech 2 (r = .653), Questions 2 (r = .535), and Reference 2 (r = .569). For these pairs, too, the observed p is .000, i.e., there is a statistically significant relationship between the pairs.
The other relatively strong correlations between the components of Tests 1 and 2 were as follows: Variable 8 (Conjunction 1) was strongly related to Conjunction 2 (r = .524), Reported Speech 2 (r = .546), and Questions 2 (r = .643); Variable 1 (Vocabulary 1) had a high degree of correlation with Variables 19 (Conjunction 2, r = .536, p = .000), 20 (Pron., Adj., Gerund 2, r = .515, p = .000), and 21 (Word Formation 2, r = .549, p = .000); and a strong relationship was also observed between Variable 5 (Inference 1) and Variable 19 (Conjunction 2, r = .549, p = .000).
The analysis shows that the variables with moderate to strong correlations consistently have p values below the critical threshold of .05, indicating that these relationships are statistically significant. Conversely, many of the low correlations did not reach statistical significance at the traditional p < .05 level.