Dr. V.K. Maheshwari, M.A. (Socio, Phil), B.Sc., M.Ed., Ph.D
Former Principal, K.L.D.A.V.(P.G) College, Roorkee, India
Achievement is the accomplishment or proficiency of performance in a given skill or body of knowledge. It can therefore be said that achievement implies a pupil's overall mastery in a particular context. Any measuring instrument that measures a pupil's attainments or accomplishments must be valid and reliable.
Testing is a systematic procedure for comparing the behavior of two or more persons. Thus, an achievement test is an examination that reveals the relative standing of an individual in a group with respect to achievement.
As achievement is the competence of a person in relation to a domain of knowledge, an achievement test is a test of knowledge or proficiency based on something learned or taught. The purpose of an achievement test is to determine a student's knowledge in a particular subject area.
Characteristics of Good Measurement Instruments:
Measurement tools can be judged on a variety of merits. These include practical issues as well as technical ones. All instruments have strengths and weaknesses; no instrument is perfect for every task. Some of the practical issues that need to be considered include:
Criteria of a good measuring instrument
Practical Criteria:
* Ease in administration
* Cost
* Time and effort required for respondent to complete measure

Technical Criteria:
* Reliability
* Validity
Ease in administration:
A test is good only when the conditions of answering are simple (scientific and logical). Its instruction should be simple and clear.
Cost:
A good test should be inexpensive, not only from the viewpoint of money but also from the viewpoint of the time and effort taken in constructing the test. Fortunately, there is no direct relationship between cost and quality.
Time and effort required for respondent to complete measure:
Testing time is generally in short supply, and students do not readily accept very long tests. Therefore, a test should be neither very long nor very short.
A good test should be acceptable to the students to whom it is being given, without regard to any specific situation; that is, the questions in the test should be neither very difficult nor very easy.
Along with the practical issues, measurement tools may be judged on the following:
Consistency (Reliability): -
Reliability of a test refers to its consistency or stability. A test with good reliability means that the test taker will obtain the same test score over repeated testing, as long as no other extraneous factors have affected the score. Reliability is the extent to which the measurements resulting from a test are the result of characteristics of those being measured. For example, reliability has elsewhere been defined as “the degree to which test scores for a group of test takers are consistent over repeated applications of a measurement procedure and hence are inferred to be dependable and repeatable for an individual test taker” (Berkowitz, Wolkowitz, Fitch, and Kopriva, 2000).
Technically, the theoretical definition of reliability is the proportion of score variance that is caused by systematic variation in the population of test-takers. This definition is population-specific. If there is greater systematic variation in one population than another, such as in all public school students compared with only eighth-graders, the test will have greater reliability for the more varied population. This is a consequence of how reliability is defined. Reliability is a joint characteristic of a test and an examinee group, not just a characteristic of a test. Indeed, the reliability of any one test varies from group to group.
Reliability is the quality of a test which produces scores that are not affected much by chance. Students sometimes randomly miss a question they really knew the answer to or sometimes get an answer correct just by guessing; teachers can sometimes make an error or score inconsistently with subjectively scored tests.
Reliability of a measuring instrument depends on two factors:
1. Adequacy in sampling
2. Objectivity in scoring
A good instrument will produce consistent scores. An instrument’s reliability is estimated using a correlation coefficient of one type or another. For purposes of learning research, the major characteristics of good scales include:
• Test-retest Reliability:
The test-retest method is one of the simplest ways of assessing the stability and reliability of an instrument over time. In test-retest reliability, the same test is administered to the same sample on two different occasions. This approach assumes that there is no substantial change in the construct being measured between the two occasions. It reflects the ability of an instrument to give consistent scores from one time to another, and is also known as temporal consistency.
A test-retest reliability coefficient is obtained by administering the same test twice and correlating the scores. In concept, it is an excellent measure of score consistency because it allows the direct measurement of consistency from administration to administration. This coefficient is not recommended in practice, however, because of its problems and limitations. It requires two administrations of the same test with the same group of individuals, and the amount of time allowed between measures is critical. The shorter the time gap, the higher the correlation; the longer the time gap, the lower the correlation. If the time interval is short, people may be overly consistent because they remember some of the questions and their responses. If the interval is long, the results are confounded with learning and maturation, that is, changes in the persons themselves.
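As a minimal sketch (the scores below are hypothetical, and Python is used here only for illustration), the test-retest coefficient is simply the Pearson correlation between the two administrations:

```python
# Test-retest reliability: Pearson correlation between two
# administrations of the same test to the same examinees.
# The score lists below are hypothetical illustration data.

def pearson_r(x, y):
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5

first_admin  = [12, 15, 9, 20, 17, 11, 14, 18]   # scores, occasion 1
second_admin = [13, 14, 10, 19, 18, 10, 15, 17]  # same pupils, occasion 2

print(round(pearson_r(first_admin, second_admin), 3))
```

A coefficient near 1.0 indicates stable scores; as noted above, the value tends to fall as the interval between administrations grows.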
• Alternate-forms Reliability:
Most standardized tests provide equivalent forms that can be used interchangeably. For this purpose, two parallel forms are first created. One way to accomplish this is to create a large set of questions that address the same construct and then randomly divide the questions into two sets. Both instruments are administered to the same sample of people, and the correlation between the two parallel forms is the estimate of reliability. These alternate forms are typically matched in terms of content and difficulty, so the correlation of scores on pairs of alternate forms for the same examinees provides another measure of consistency or reliability. Even with the best test and item specifications, each form will contain slightly different content and, as with test-retest reliability, maturation and learning may confound the results.
• Split-half Reliability:
Split-half reliability assesses the consistency of items within a test by comparing one half of the scale with the other half. As the name suggests, a split-half coefficient is obtained by dividing a test into halves: all items that purport to measure the same construct are randomly divided into two sets, the entire instrument is administered to a sample of people, a total score is calculated for each half, the scores on the two halves are correlated, and the result is corrected for length. The split can be based on odd versus even numbered items, on randomly selected items, or on manually balancing content and difficulty. This approach has the advantage of requiring only a single test administration. Its weakness is that the resulting coefficient varies as a function of how the test was split. It is also not appropriate for tests in which speed is a factor.
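The procedure can be sketched as follows. The 0/1 response matrix is hypothetical, and the standard Spearman-Brown formula, 2r/(1+r), supplies the correction for length mentioned above:

```python
# Split-half reliability: correlate odd-item and even-item half scores,
# then correct for test length with the Spearman-Brown formula.
# The response matrix below is hypothetical (1 = correct, 0 = wrong).

def pearson_r(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5

def split_half_reliability(responses):
    # responses: one row per examinee, one column per item
    odd_scores  = [sum(row[0::2]) for row in responses]  # items 1, 3, 5, ...
    even_scores = [sum(row[1::2]) for row in responses]  # items 2, 4, 6, ...
    r_half = pearson_r(odd_scores, even_scores)
    return 2 * r_half / (1 + r_half)   # Spearman-Brown correction

responses = [
    [1, 1, 1, 0, 1, 1],
    [1, 0, 1, 1, 0, 1],
    [0, 0, 1, 0, 0, 0],
    [1, 1, 1, 1, 1, 1],
    [0, 1, 0, 0, 1, 0],
]
print(round(split_half_reliability(responses), 3))
```

An odd/even split is used here; a random or content-balanced split would give a somewhat different coefficient, which is precisely the weakness noted above.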
• Internal consistency reliability:
It estimates the consistency among all items in the instrument. Internal consistency focuses on the degree to which the individual items are correlated with each other and is thus often called homogeneity. Several statistics fall within this category. The best known are Cronbach’s alpha, the Kuder-Richardson Formula 20 (KR-20) and the Kuder-Richardson Formula 21 (KR-21). The KR-20, first published in 1937, is a measure of internal consistency reliability for measures with dichotomous choices. It is analogous to Cronbach’s α, except that Cronbach’s α is also used for non-dichotomous (continuous) measures. A high KR-20 coefficient (e.g., >0.90) indicates a homogeneous test.
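For dichotomously scored items, KR-20 can be computed directly from a response matrix. The sketch below uses hypothetical data and the usual formula KR-20 = (k/(k−1))·(1 − Σp·q/σ²), where k is the number of items, p the proportion answering an item correctly, q = 1 − p, and σ² the variance of total scores:

```python
# KR-20 internal-consistency coefficient for dichotomous (0/1) items:
#   KR-20 = (k / (k - 1)) * (1 - sum(p_i * q_i) / var_total)
# The response matrix is hypothetical illustration data.

def kr20(responses):
    k = len(responses[0])                        # number of items
    n = len(responses)                           # number of examinees
    totals = [sum(row) for row in responses]     # total score per examinee
    mean = sum(totals) / n
    var_total = sum((t - mean) ** 2 for t in totals) / n  # population variance
    pq = 0.0
    for i in range(k):
        p = sum(row[i] for row in responses) / n  # proportion correct on item i
        pq += p * (1 - p)
    return (k / (k - 1)) * (1 - pq / var_total)

responses = [
    [1, 1, 1, 0, 1, 1],
    [1, 0, 1, 1, 0, 1],
    [0, 0, 1, 0, 0, 0],
    [1, 1, 1, 1, 1, 1],
    [0, 1, 0, 0, 1, 0],
]
print(round(kr20(responses), 3))
```

The population variance is used here; some texts use the sample (n−1) variance, which changes the value slightly.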
• Inter-rater reliability:
Inter-rater reliability, inter-rater agreement, or concordance is the degree of agreement among raters. This type of reliability is assessed by having two or more independent judges score the test; the scores are then compared to determine the consistency of the raters' estimates. One way to test inter-rater reliability is to have each rater assign each test item a score. For example, each rater might score items on a scale from 1 to 10; the correlation between the two sets of ratings then gives the level of inter-rater reliability. Another approach is to have the raters determine which category each observation falls into and then calculate the percentage of agreement between them. So, if the raters agree 8 out of 10 times, the test has an 80% inter-rater reliability rate.
Inter-rater reliability is thus the degree to which different observers or raters give consistent scores using the same instrument, rating scale, or rubric; it is also called scoring agreement.
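The percentage-of-agreement calculation described above is straightforward; the ratings below are hypothetical:

```python
# Inter-rater agreement: percentage of observations on which two
# raters assign the same category. The ratings are hypothetical.

def percent_agreement(rater_a, rater_b):
    agreements = sum(1 for a, b in zip(rater_a, rater_b) if a == b)
    return 100 * agreements / len(rater_a)

rater_a = ["pass", "fail", "pass", "pass", "fail",
           "pass", "pass", "fail", "pass", "pass"]
rater_b = ["pass", "fail", "pass", "fail", "fail",
           "pass", "pass", "pass", "pass", "pass"]

print(percent_agreement(rater_a, rater_b))  # agree on 8 of 10 -> 80.0
```

Raw percent agreement does not correct for agreement expected by chance; chance-corrected statistics such as Cohen's kappa address that, but are beyond the scope of this sketch.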
Suggestions for improving the reliability
The best suggestions for improving the reliability of classroom tests are:
- Start planning the test and writing the items well ahead of the time the test is to be given. A test written hurriedly at the last minute is not likely to be a reliable test.
- Write clear directions and use standard administrative procedures.
- Pay more attention to the careful construction of the test questions. Phrase each question clearly so that students know exactly what you want. Try to write items that discriminate among good and poor students and are of an appropriate difficulty level.
- Write longer tests. The number of items needed to provide reliable measurement depends on the quality of the items, the difficulty of the items, the range of the scores, and other factors. So include as many questions as you think the students can complete in the testing time available.
Accuracy (Validity): -
Validity is the quality of a test which measures what it is supposed to measure. It is the degree to which evidence, common sense, or theory supports any interpretations or conclusions about a student based on his/her test performance. More simply, it is how one knows that a math test measures students’ math ability, not their reading ability.
Like reliability, validity also depends upon certain factors:
1. Adequacy in sampling
2. Objectivity in scoring
Thus, a valid measurement tool does a good job of measuring the concept that it purports to measure. It is important to remember that the validity of an instrument only applies to a specific purpose with a specific group of people.
A test is valid when it
- produces consistent scores over time.
- correlates well with a parallel form.
- measures what it purports to measure.
- can be objectively scored.
- has representative norms.
Forms of Validity
• Construct validity:
Construct validity refers to the extent to which a test captures a specific theoretical construct or trait. Construct validity establishes that the instrument is truly measuring the desired construct. This is the most important form of validity, because it really subsumes all of the other forms of validity.
Construct validity can be demonstrated through several kinds of evidence:
- Internal consistency: if a test has construct validity, scores on the individual test items should correlate highly with the total test score. This is evidence that the test is measuring a single construct.
- Developmental changes: tests measuring certain constructs can be shown to have construct validity if the scores on the tests show predictable developmental changes over time.
- Experimental intervention: if a test has construct validity, scores should change following an experimental manipulation, in the direction predicted by the theory underlying the construct.
• Convergent validity:
We can create two different methods to measure the same variable, and when they correlate we have demonstrated convergent validity. This type of validity is determined by hypothesizing and examining the overlap between two or more tests that presumably measure the same construct. In other words, convergent validity is used to evaluate the degree to which two or more measures that theoretically should be related to each other are, in fact, observed to be related to each other.
Comparison and correlation of scores on an instrument with other variables or scores that should theoretically be similar. A test has convergent validity if it has a high correlation with another test that measures the same construct
• Divergent validity:
A test's divergent validity is demonstrated through a low correlation with a test that measures a different construct. When we create two methods that measure different, unrelated constructs and they do not correlate, we have demonstrated divergent validity.
The goal of divergent validity is to demonstrate that we are measuring one specific construct and not combining two different constructs.
• Discriminant validity:
Comparison of scores on an instrument with other variables or scores from which it should theoretically differ. Measures that should not be related are not. Discriminant validity examines the extent to which a measure correlates with measures of attributes that are different from the attribute the measure is intended to assess.
• Factor structure:
A statistical look at the internal consistency of an instrument, usually one that has subscales or multiple parts. The items that are theoretically supposed to be measuring one concept should correlate highly with each other, but have low correlations with items measuring a theoretically different concept.
• Content validity:
Content validity of a test refers to the adequacy of sampling of content across the construct or trait being measured: given the published literature on a particular trait, are all aspects of that concept represented by items on the test? It establishes that the instrument includes items that comprise the relevant content domain. A test has content validity if it measures knowledge of the content domain it was designed to measure. In other words, content validity concerns, primarily, the adequacy with which the test items representatively sample the content area to be measured. For example, a math achievement test would lack content validity if good scores depended primarily on knowledge of English, or if it only had questions about one aspect of math; likewise, a test of English grammar should include questions on subject-verb agreement, but should not include items that test algebra skills.
• Face validity:
A subjective judgment about whether or not, on the “face of it”, the tool seems to be measuring what you want it to measure; a test has face validity when it appears valid to the examinees who take it, the personnel who administer it, and other untrained observers. It is perhaps the simplest type of validity. Face validity can refer to a single item or to all of the items on a test, and it indicates how well the item reveals the purpose or meaning of the test item or the test itself. Face validity is not a technical sense of test validity; just because a test has face validity does not mean it will be valid in the technical sense of the word.
• Criterion-related validity (also called concurrent or predictive validity):
This refers to the comparison of scores on a test with some other external measure of performance. The other measure should be theoretically related to the first, and their relationship can be assessed by a simple correlation coefficient: the instrument “behaves” the way it should, given your theory about the construct. This validity is a concern for tests that are designed to predict someone’s status on an external criterion measure. A test has criterion-related validity if it is useful for predicting a person’s behavior in a specified situation.
• Concurrent validity:
Comparison of scores on one instrument with current scores on another instrument. If the two instruments are theoretically related in some manner, the scores should reflect the theorized relationship. In concurrent validation, the predictor and criterion data are collected at or about the same time. This kind of validation is appropriate for tests designed to assess a person’s current criterion status.
In concurrent validity, a proposed test is given to a group of participants who complete other theoretically related measures concurrently (at the same point in time).
• Predictive validity:
Comparison of scores on some instrument with some future behavior or future scores on another instrument. The instrument scores should do a reasonable job of predicting the future performance. In predictive validation, the predictor scores are collected first and the criterion data are collected at some later point; this is appropriate for tests designed to assess a person’s future status on a criterion.
With predictive validity, the new test is given to a group of participants who are followed over time to see how well the original assessment predicts some important variable at a later point in time.
Relationship between reliability and validity
- If a test is unreliable, it cannot be valid.
- For a test to be valid, it must be reliable.
- However, just because a test is reliable does not mean it will be valid.
- Reliability is a necessary but not sufficient condition for validity!
Construction procedure of an Achievement Test:
If a test is to be truly valid, reliable and practical, it must be suitably planned, and qualitative improvement in the test has to be effected. For this, the following facts should be kept in view:
* The principles underlying available tests will have to be kept in view so that a sound test can be constructed.
* Skill will have to be acquired in constructing and writing different types of questions. This requires careful thought, determination of teaching objectives, analysis of content, and decisions about the types of questions to be given.
Ebel, in his book Measuring Educational Achievement, has suggested the following precautions in test construction:
- It should be decided when the test has to be conducted in the context of time and frequency.
- It should be determined how many questions have to be included in the test.
- It should be determined what types of questions have to be used in the test.
- Those topics should be determined from which questions have to be constructed. This decision is taken keeping in view the teaching objectives.
- The level of difficulty of the questions should be decided while planning the test.
- It should be determined if any correction has to be carried out for guessing.
- The format and type of printing should be decided in advance.
- It should be determined what should be the passing score.
- In order to control the personal bias of the examiner there should be a provision for central evaluation. A particular question should be checked by the same examiner.
- A rule book should be prepared before the evaluation of the scripts.
To construct an achievement test, the steps outlined below, if followed, will make the test objective, reliable and valid.
Selection of Teaching Objectives for Measurement: First, those teaching objectives should be selected, from all the teaching objectives of subject teaching, which are to be made the basis for test construction. There can be several grounds for selecting these objectives, such as how much content has been studied, what the needs of the students are, and what the importance of specific topics in the content is. For this, the following table can be used:
|Teaching Objectives||Selected Teaching Objectives||Reason for Selection|
|1. All objectives of the cognitive domain (knowledge, comprehension, application, analysis, synthesis, evaluation)|| || |
Assigning Weightage to Selected Objectives: After these objectives have been selected, the teacher assigns weightage to them, keeping in view the work done and the importance of each objective. It is desirable to use the following table:
|S. No.||Selected Teaching Objectives||Score||Percentage|
Weightage to Content: Content is the means of realizing objectives, and questions have to be constructed on its basis; therefore, it becomes necessary to give weightage to it. Each topic differs in nature, importance and scope, so weightage should be assigned with these facts in view; otherwise the test will not represent the whole subject.
|S. No.||Topics||Number of Items||Score||Percentage|
Giving Weightage to the Type of Items
In this step, the teacher determines the number of items, their types, and their relative marks. For this, it is convenient to use the following table:
|S. No.||Type of Items||Number of Items||Score||Percentage|
|1.||Long answer type|| || || |
|2.||Short answer type|| || || |
Determining Alternatives: At this level, it is determined how many alternatives or options should be given according to the type of questions. Giving alternatives influences the reliability and validity of a test; therefore, it is suggested that alternatives should not be given in objective type questions, while in essay type questions only internal choice can be given.
Division of Sections: If the scope or type of the questions is uniform, then it is not necessary to divide the test into sections. However, if it is diverse, different types of questions have been specified, and the nature of the test is heterogeneous, then a separate section should be made for each type of item.
|S. No.||Sections||Type of items||Score||Percentage|
|1.|| ||Long answer type|| || |
|2.|| ||Short answer type|| || |
Estimation of Time: At this step, the total time the whole test is likely to take is estimated. Time is estimated on the basis of the type and number of items. Some time should be reserved for the distribution and collection of answer sheets. The following table can be used for convenience:
|S. No.||Type of Items||Number of Items||Time (in minutes)|
|1.||Long answer type|| || |
|2.||Short answer type|| || |
Preparation of Blueprint: A blueprint provides a bird's-eye view of the entire test. In it we can see the topics, teaching objectives, types of questions, number of items, distribution of scores, and their mutual relationships. A blueprint is the basis for test construction. A format is given below:
|Types of Question
L- Long Answers Type S- Short Answers Type O-Objective Answers Type
Preparation of score key:
A score key increases the reliability of a test, so the test constructor should specify the procedure for scoring the answer scripts. Directions must be given as to whether the scoring will be made with a scoring key (when the answer is recorded on the test paper itself) or with a scoring stencil (when the answer is recorded on a separate answer sheet), and how marks will be awarded to the test items.
In the case of essay type items, it should be indicated whether to score with the ‘point method’ or the ‘rating method’. In the point method, each answer is compared with a set of ideal answers in the scoring key and a given number of points is assigned. In the rating method, the answers are rated on the basis of degrees of quality, which determines the credit assigned to each answer.
When students do not have sufficient time to answer the test, or are not ready to take the test at that particular time, they guess the correct answers. In that case, some measure must be employed to eliminate the effect of guessing. There is, however, a lack of agreement among psychometricians about the value of the correction formula so far as validity and reliability are concerned; in the words of Ebel, neither instructions nor penalties will remedy the problem of guessing. Keeping these opinions in view, and to avoid this situation, the test constructor should give enough time for answering the test.
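For reference, the conventional correction-for-guessing formula (a standard formula, not spelled out in the text) deducts a fraction of the wrong answers; the example figures are hypothetical:

```python
# Conventional correction for guessing on multiple-choice items:
#   corrected = R - W / (k - 1)
# where R = number right, W = number wrong (omitted items are NOT
# counted as wrong), and k = number of options per item.

def corrected_score(right, wrong, options_per_item):
    return right - wrong / (options_per_item - 1)

# Hypothetical examinee: 30 right and 12 wrong on 4-option items
print(corrected_score(30, 12, 4))  # -> 26.0
```

The formula assumes wrong answers arise from blind guessing among all k options, which is rarely strictly true; this is one reason psychometricians disagree about its value.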
Thus, in order to bring objectivity to a test, it is essential that the tester be fully clear about the type of answer expected from each question. If the scorers are acquainted with the right answers, diversity in scoring can be eradicated.
1.For Objective Type:
|S. No.||Item Serial||Answer||Score|
2.For Long Answer and Short Answer Type:
|S. No.||Item Serial||Outline of Answer||Score||Remarks|
First try-out of the test:
At this stage, the initial format of the test is administered to a small representative sample, after which item analysis is used to calculate the difficulty level and discriminative value of each item. There are a variety of techniques for performing an item analysis, which is often used, for example, to determine which items will be kept for the final version of a test. Item analysis helps to build reliability and validity into the test from the start. It can be both qualitative and quantitative: the former focuses on issues related to the content of the test, e.g. content validity; the latter primarily includes measurement of item difficulty and item discrimination.
An item’s difficulty level is usually measured in terms of the percentage of examinees who answer the item correctly. This percentage is referred to as the item difficulty index, or “p”.
Item discrimination refers to the degree to which items differentiate among examinees in terms of the characteristic being measured (e.g., between high and low scorers). This can be measured in many ways. One method is to correlate item responses with the total test score; items with the highest test correlation with the total score are retained for the final version of the test. This would be appropriate when a test measures only one attribute and internal consistency is important.
In multiple-choice items, the alternatives (distractors) are also analysed: an alternative opted for by very few students fails to function and should be revised or rejected.
Generally, a test is constructed for average students, so division according to ability grouping is essential. The ability distribution given by the normal probability curve usually provides the basis for this distribution.
On the basis of the N.P.C., it is advisable that a test not be constructed for extreme cases, such as backward or gifted students. Therefore, items solved only by the most gifted students, and items solved even by the weakest students, must be eliminated from the test, as they are too difficult or too easy respectively.
In the context of difficulty level, the following difficulty levels are suggested for the selection of questions, following Katz's (1959) recommendation:
|S.N0.||Types of items||Difficulty Level %|
|1.||Long answer type|| |
In the same way, items that measure the same content area twice should be identified (e.g. “Who was the founder of the Mughal empire in India?” and “Which empire was founded by Baber in India?” both test the single fact that Baber established the Mughal empire in India). One of the two questions should be treated as redundant and excluded from the test.
In the same way, the discriminating power of the items should be calculated, and the questions with the least discriminating power must be excluded from the test. Generally, items having a discrimination value of about 0.25 or higher are considered suitable for a test.
Preparation of final test:
The test will provide useful information about the students’ knowledge of the learning objectives. Considering the questions relating to the various learning objectives as separate subtests, the evaluator can develop a profile of each student’s knowledge of or skill in the objectives.
The final test is constructed after the above analysis. For this, a suitable format is prepared, norms are specified, and instructions for examinees are prepared.
A test constructed in accordance with the above procedure will necessarily assume a purpose, or an idea of what is good or desirable, from the standpoint of the individual, society, or both.
References:
- Bean, K.L.: Construction of Education & Personal Tests, McGraw-Hill Book Co., New York, 1953.
- Chassell, J.M.: Test for Originality, Journal of Educational Psychology, 1916.
- Hawkes, H.F.: The Construction & Use of Achievement Examinations, Houghton Mifflin, Boston, 1936.
- Kelley, T.L.: Interpretation of Educational Measurement, World Book Co., Yonkers, 1939.
- Micheels, W.J.: Measuring Educational Achievement, McGraw-Hill Book Co., New York, 1950.
- Walker, H.M.: Elementary Statistical Methods, Henry Holt & Co., New York, 1943.
- Whitney, F.I.: Elements of Research, Prentice-Hall, New York, 1950.