The Concept of Measurement in Education

Dr. V.K.Maheshwari, M.A(Socio, Phil) B.Sc. M. Ed, Ph.D

Former Principal, K.L.D.A.V.(P.G) College, Roorkee, India

A test is used to gather information. That information is presented in the form of measurement, and that measurement is then used to make an evaluation. Measurement is the process of obtaining a numerical description of the degree to which an individual possesses a particular characteristic. It answers the question, “How much?”

The English word measurement originates from the Latin mēnsūra and the verb metiri, through the Middle French mesure. Dictionaries define it in two senses:

The action of measuring something: “accurate measurement is essential”.

The size, length, or amount of something, as established by measuring.

Measurement is a broad term that refers to the systematic determination of outcomes or characteristics by means of some sort of assessment device. It is a systematic process of obtaining the quantified degree to which a trait or an attribute is present in an individual or object. In other words, it is the systematic assignment of numerical values to a trait or an attribute in a person or object.

Brief Historical Retrospect of Testing and Measurement

Testing is essential because its feedback helps to improve the learning and performance of children. This is why the history of testing started very early; it has grown from tests of individual differences to cover almost all aspects of education.

There is hardly an aspect of life that involves no form of measurement. This is because tests are the best means of detecting characteristics in a reasonably objective manner. They provide the kinds of information about learners and learning that are required to help students learn.

The history of measurement can be traced to the invention of tests to measure individual differences in skills. In January 1796, the Astronomer Royal at the Greenwich Observatory in England, Maskelyne, is recorded to have dismissed his assistant, Kinnebrook, for recording the movement of stars across the telescope eight-tenths of a second later than he did. According to Tuckman (1975), between 1820 and 1823 a German astronomer, Bessel, built on the work of Maskelyne by demonstrating the variability in personal equations and observations. He argued that fluctuations existed from situation to situation and from individual to individual, as there is variation in simple reaction time, a measure of the time required to react to a simple stimulus.

In 1863, Sir Francis Galton worked on the testing of individual differences. His work is regarded as the beginning of mental testing. In 1884, Galton opened an anthropometric laboratory to collect the characteristic measurements of individuals. James McKeen Cattell, an American psychologist, was also studying individual differences, primarily in physical terms. These are among the earliest recorded episodes in the history of testing. But the early measurement approaches in history, both written and oral, were informal. The first written tests were the informal examinations used by the Chinese to recruit people into the civil service, in about 2200 BC.

The oral examinations conducted by Socrates in the 5th century B.C. were also informal. In America, before 1815, educational achievement was assessed through oral examinations. Galton and James McKeen Cattell played significant roles in the development of tests. There were others, such as Karl Pearson, who developed the Pearson product-moment correlation coefficient, which is useful in checking the reliability and validity of standardized tests.

By 1904, Alfred Binet was studying the differences between bright and dull children. In 1905, with Théodore Simon, he developed a test for measuring the intelligence of children, known as the Binet-Simon intelligence test. In 1916, Lewis Terman and his associates at Stanford University revised the Binet-Simon scale and brought out the Stanford-Binet version. The development of group tests started during World War I, when the need arose to measure the intelligence of soldiers. As a result, a group of psychologists including R.M. Yerkes and A. Otis developed the Army Alpha, a written group intelligence test, and the Army Beta, a nonverbal group intelligence test. David Wechsler also developed a series of individual intelligence scales from 1939 to 1967.

George Fisher developed the first standardized objective test of achievement in 1864, and J.M. Rice developed the standard spelling objective scale in 1897. Those named above are not the only contributors to the field of testing; they are its pioneers.

Within the last few decades, educational evaluation has grown into a separate, independent discipline, though with some leanings on the ideas of psychologists, psychometricians and statisticians. In recent years, it has developed into a complex art and technology. Efforts of educational evaluators have been directed specifically towards using the precision, objectivity and mathematical rigour of psychological measurement in ways directly related to educational institutions, educational processes and purposes.

Definition of Measurement

Measurement may be defined as follows:

Measurement makes values meaningful by quantifying them in specific units. Measurements act as labels that make those values more useful in terms of detail.

Measurement is an act or a process that involves the assignment of a numerical index to whatever is being assessed.

Measurement is collection of quantitative data. A measurement is made by comparing a quantity with a standard unit.

In education, the numerical value of scholastic ability, aptitude, achievement, etc. can be obtained using instruments such as paper-and-pencil tests. It means that the values of the attribute are translated into numbers by measurement.

Measurement, beyond its general definition, refers to the set of procedures, and the principles for how to use those procedures, in educational tests and assessments. Some of the basic concepts of measurement in educational evaluation are raw scores, percentile ranks, derived scores and standard scores.

Measurement is the process of obtaining a numerical description of the degree to which an individual possesses a particular characteristic.

  • A test is used to gather information.
  • That information is presented in the form of measurement.
  • That measurement is then used to make an evaluation.

As a result of a test, a measure is obtained. An observation, a rating scale or any other device that allows us to obtain information in a quantitative form is a measurement.

Types of Measurement:

Generally, there are three types of measurement:

(i) Direct; (ii) Indirect; and (iii) Relative.

Direct: Finding the length and breadth of a table involves direct measurement, and this is always accurate if the tool is valid.

Indirect: Finding the quantity of heat contained in a substance involves indirect measurement, for we first have to find the temperature of the substance with a thermometer and can then calculate the heat it contains.

Relative: Measuring the intelligence of a boy involves relative measurement, for the score obtained by the boy in an intelligence test is compared with norms. It is obvious that psychological and educational measurements are relative.

Levels and  Classification of Educational Measures

A student’s achievement may be viewed at three different levels:

1.  Self-referenced: how the student is progressing with reference to himself/herself.

2.  Criterion-referenced: how the student is progressing with reference to the criteria set by the teacher. Individual scores are interpreted in terms of the student’s performance relative to some standard or criterion.

3.  Norm-referenced: how the student is progressing with reference to his/her peer group. Individual scores are interpreted relative to the scores of others in a well-defined norming group.

Classification of Educational Measures

There are three main classes of measurement

1-Cognitive or non-cognitive.

a-Cognitive measures focus on what a person knows or is able to do mentally.

b-Non-cognitive measures focus on affective traits or characteristics (e.g., personality traits, attitudes, values, interests, preferences, etc.)

2-Commercially prepared or locally developed.

a-Commercially prepared measures are developed for widespread use with a focus on technical merit.

b-Locally prepared measures are developed by a researcher for specific situations with some, but not extensive, concern for technical characteristics.

3-Self-report or observations by others.

a-Self-report measures require the subjects to supply the response (e.g., tests, questionnaires, interviews, etc.)

b-Observations by others require subjects to be observed by others who record the data (e.g., observations, unobtrusive measures, etc.)

Types of Educational Measures used in Quantitative Research

There are four types of educational measures used in quantitative research: tests, questionnaires, observations, and interviews.

Tests

A test is an instrument that requires subjects to complete a cognitive task by responding to a standard set of questions.

Score Interpretation

Norm-referenced - individual scores are interpreted relative to the scores of others in a well-defined norming group (e.g., John’s score places him in the 95th percentile; Sally’s score is in the bottom quartile).
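A norm-referenced statement such as "the 95th percentile" can be illustrated with a short Python sketch. The function name, the norming-group data, and the convention of counting half of any tied scores are assumptions made for illustration; test publishers may define percentile rank differently.

```python
def percentile_rank(score, norm_scores):
    """Percentage of the norming group scoring below `score`,
    counting half of any ties (one common convention, assumed here)."""
    below = sum(1 for s in norm_scores if s < score)
    ties = sum(1 for s in norm_scores if s == score)
    return 100.0 * (below + 0.5 * ties) / len(norm_scores)


# Hypothetical norming group of 20 raw scores (illustrative data only).
norm_group = [35, 41, 44, 47, 50, 52, 53, 55, 58, 60,
              61, 63, 65, 66, 68, 70, 73, 77, 82, 90]

print(percentile_rank(82, norm_group))  # 92.5
print(percentile_rank(44, norm_group))  # 12.5
```

The same raw score would receive a different percentile rank in a different norming group, which is why the norming group has to be well defined.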

Standard scores are transformations of raw scores into easily interpreted standard metrics.

Z-score – the difference between a raw score and the mean in standard deviation units (i.e., z = (raw score – mean) / standard deviation).
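The z-score formula can be sketched in a few lines of Python. The raw scores are made up, and the choice of the population standard deviation (`pstdev`) rather than the sample version is an assumption; either can be appropriate depending on how the norming group is treated.

```python
from statistics import mean, pstdev


def z_scores(raw_scores):
    """Convert raw scores to z-scores: (raw score - mean) / standard deviation."""
    m = mean(raw_scores)
    sd = pstdev(raw_scores)  # population SD; an assumption for this sketch
    return [(x - m) / sd for x in raw_scores]


# Hypothetical raw scores for a group of five students.
raw = [40, 50, 60, 70, 80]

for score, z in zip(raw, z_scores(raw)):
    print(score, round(z, 2))
```

A score equal to the group mean (60 here) gets a z-score of 0, and scores above and below the mean get positive and negative z-scores respectively.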

Z-scores are algebraically transformed to standard scales such as percentiles, grade equivalents, SAT, ACT, GRE, etc.

All standard scores are interpreted relative to the scores of others in the norming group.

For example, an SAT score of 700 is very good relative to the scores of the norm group because it is two (2) standard deviations above the mean (i.e., at about the 98th percentile).
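The link between a z-score, a reporting scale, and a percentile can be sketched as follows. The scale mean of 500 and standard deviation of 100 are assumptions chosen so that 700 lies two standard deviations above the mean, as in the example; the percentile also assumes that scores in the norm group follow a normal curve.

```python
from statistics import NormalDist


def to_scaled(z, scale_mean=500.0, scale_sd=100.0):
    """Linear transformation of a z-score onto a reporting scale.
    The default mean and SD are illustrative assumptions."""
    return scale_mean + scale_sd * z


def percentile_from_z(z):
    """Percentile of a z-score under an assumed normal distribution."""
    return 100.0 * NormalDist().cdf(z)


z = (700 - 500) / 100                  # two standard deviations above the mean
print(to_scaled(z))                    # 700.0
print(round(percentile_from_z(z), 1))  # 97.7
```

If the norm group's scores are not approximately normal, the percentile would have to be read from the observed score distribution instead.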

A grade equivalent score of 3.0 is poor for a student in the 6th grade, because it means he has scored at a level equal to that of third graders taking the test.

Criterion-referenced - individual scores are interpreted in terms of the student’s performance relative to some standard or criterion (e.g., Jeanne passed the Louisiana High School Graduate Exit Exam; Sammy did not make the cut off for being promoted to the 7th grade).


Standardized tests have uniform procedures for the administration, scoring, and interpretation of test scores.

Types of standardized tests

1-Achievement – tests of content knowledge or skills

2-Aptitude - tests which are used to predict future cognitive performance

3-Standards-based - criterion-referenced tests based on established standards


Standardized Tests vs. Informal Teacher-made Tests

Standardized tests assess broad, general content, while teacher-made tests tend to focus on specific objectives related to the instruction in a class.

Standardized tests are more technically sound than teacher-made tests.

Standardized tests are administered in “standardized” manners, while teacher-made tests tend to be administered informally.

Standardized tests are scored in consistent, reliable manners and produce sets of standard scores; teacher-made tests are scored in less reliable manners and are generally scored as the percentage of correct responses.


Questionnaires

A questionnaire is an instrument containing statements designed to obtain a subject’s perceptions, attitudes, beliefs, values, opinions, or other non-cognitive traits.

Personality inventories

Personality inventories are concerned with psychological orientation (i.e., general psychological adjustment) and educational orientation (i.e., traits such as self-concept or self-esteem that are related to learning and motivation).

Attitudes, values, or interests

Attitudes, values, or interests are affective traits that indicate some degree of preference toward something.


A scale is a continuum that describes a subject’s responses to a statement.

Likert Scales

Response options require the subject to determine the extent to which they agree with a statement

An odd number of options provides for a middle or neutral response (e.g., strongly agree, agree, neutral, disagree, or strongly disagree)

An even number of options eliminates a response of neutral (e.g., strongly agree, agree, disagree, or strongly disagree)

Statements must reflect extreme positive or extreme negative positions, such as “I hate my teacher.” or “The textbook has been a valuable resource.” A subject’s response positions them on a continuum. Strongly agreeing with the statement “I hate my teacher” indicates a very negative attitude. Strongly agreeing with the statement “The textbook has been a valuable resource” indicates a very positive attitude.
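Because agreement with a negative statement signals a negative attitude, negatively worded items are usually reverse-scored before responses are combined. The Python sketch below assumes a five-point coding (1 to 5) and an average as the combined score; both are illustrative choices, not something prescribed here.

```python
# Assumed numeric coding for a five-option Likert item.
LIKERT = {
    "strongly disagree": 1,
    "disagree": 2,
    "neutral": 3,
    "agree": 4,
    "strongly agree": 5,
}


def score_item(response, positively_worded):
    """Return 1-5 where higher always means a more positive attitude.
    Negatively worded items are reversed (5 becomes 1, 4 becomes 2, ...)."""
    value = LIKERT[response.lower()]
    return value if positively_worded else 6 - value


def attitude_score(responses):
    """Average item score for a list of (response, positively_worded) pairs."""
    return sum(score_item(r, p) for r, p in responses) / len(responses)


answers = [
    ("strongly agree", False),  # "I hate my teacher."
    ("strongly agree", True),   # "The textbook has been a valuable resource."
]
print(score_item(*answers[0]))  # 1
print(score_item(*answers[1]))  # 5
print(attitude_score(answers))  # 3.0
```

Without the reversal, a subject who strongly agreed with both statements would appear to hold a uniformly positive attitude.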

Semantic Differential

Response options reflect a continuum of bipolar adjectives related to some aspect of the trait being measured

Fair: __ __ __ __ :Unfair

Interesting: __ __ __ __ :Boring

Aspects of the traits being measured are usually stated in a few words (e.g., My teacher is … ; the textbook is … )

A subject’s response positions them on a continuum. Responses of “fair” and “interesting” to the statement “My teacher is ….” indicate a positive attitude. Responses of “unfair” and “boring” to the statement “My teacher is ….” indicate a negative attitude.


Checklists – responses require subjects to choose, from a list of specific options, those that appeal to them.

Ranked items

Ranked items – responses require students to place a limited number of items into sequential order.

Problems with Measuring Non-cognitive Traits

Difficulty in clearly defining what is being measured (e.g., self-concept or self-esteem).

Response set – a tendency to respond the same way to all items (e.g., strongly agreeing with each statement).

Social desirability – a tendency to respond to items in a way that is socially desired or accepted.

Faking – a tendency to respond inaccurately (e.g., agreeing with statements because of the negative consequences associated with disagreeing).

Controlling problems

These problems can be controlled by using equal numbers of positively and negatively worded statements, alternating positive and negative statements and/or bipolar adjectives, and providing confidentiality or anonymity to respondents.
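Two of these ideas lend themselves to a simple check in code: flagging a respondent who gives the identical answer to every item (a possible response set) and confirming that an instrument has equal numbers of positively and negatively worded statements. The function names and data below are hypothetical, and an identical-answer pattern is only a warning sign, not proof of a response set.

```python
def shows_response_set(item_responses):
    """True when every item received the identical response."""
    return len(item_responses) > 1 and len(set(item_responses)) == 1


def balanced_wording(item_directions):
    """True when '+' (positive) and '-' (negative) items are equal in number."""
    return item_directions.count("+") == item_directions.count("-")


# Hypothetical six-item questionnaire with alternating wording.
directions = ["+", "-", "+", "-", "+", "-"]

respondent_a = [5, 5, 5, 5, 5, 5]  # same answer throughout
respondent_b = [5, 1, 4, 2, 5, 1]  # answers vary with item wording

print(balanced_wording(directions))      # True
print(shows_response_set(respondent_a))  # True
print(shows_response_set(respondent_b))  # False
```

With alternating wording, a respondent who agrees strongly with every statement is contradicting themselves, which is what makes the pattern detectable.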


Observations

Observation involves the direct observation of behaviours in natural or controlled settings. Observations may be structured or unstructured, and observers may be detached or involved.

Inference in Observation

Low inference – involves little if any inference on the observer’s part (e.g., children are in their seats; the teacher uses math manipulatives).

High inference – involves high levels of inference on the observer’s part (e.g., children are happy; the teacher lectures effectively).

Laboratory observation is carried out in a specified environment, with the use of structured forms and procedures and a concern with demand characteristics.

Structured field observation is carried out in a natural setting. The structured forms and procedures used generally take the form of frequency counts, duration, interval, continuous, and time-sampling recording.
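Three of these recording forms can be sketched in Python from a single observation log. The log format (start and end times in seconds for each episode of a behaviour), the session length, and the sampling step are all hypothetical choices made for illustration.

```python
def frequency_count(episodes):
    """Number of separate episodes of the behaviour."""
    return len(episodes)


def total_duration(episodes):
    """Total time (in seconds) the behaviour was observed."""
    return sum(end - start for start, end in episodes)


def time_sample(episodes, session_length, step):
    """Check at fixed moments (0, step, 2*step, ...) whether the behaviour
    is occurring; return the proportion of checks at which it is."""
    checks = range(0, session_length, step)
    hits = sum(
        1 for t in checks
        if any(start <= t < end for start, end in episodes)
    )
    return hits / len(checks)


# Hypothetical log of off-task episodes during a 600-second lesson segment.
episodes = [(30, 75), (200, 230), (410, 500)]

print(frequency_count(episodes))       # 3
print(total_duration(episodes))        # 165
print(time_sample(episodes, 600, 60))  # 0.3
```

Time sampling trades accuracy for convenience: the second episode (200 to 230 seconds) falls between two checks and is missed entirely.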

Advantages of Observations

  • Yield firsthand data without the contamination that can arise from tests, inventories, or other self-report instruments
  • Allow for the description of behavior as it occurs naturally
  • Allow for the consideration of contextual factors that can influence the interpretation and use of the results


Interviews

Interviews involve orally questioning subjects and recording their responses. The types of questions used in interviews are structured, semi-structured, unstructured, and leading.

Sources of Concern

Sources of concern in interviews are bias, contamination, interviewer characteristics (e.g., age, race, gender, etc.), the conduct of the interview, and the recording of responses.


Advantages of Interviews

  • Establish rapport
  • Enhance motivation
  • Clarify responses through additional questioning
  • Capture the depth and richness of responses
  • Allow for flexibility
  • Reduce “no response” and/or “neutral” responses


Disadvantages of Interviews

  • Time consuming
  • Expensive
  • Small samples
  • Subjective

Criteria for Evaluating instruments

  • Validity evidence and reliability evidence
  • Descriptions of the instruments
  • Administration procedures
  • Norming information for norm-referenced tests (NRTs)
  • Standards for criterion-referenced tests (CRTs)
  • Meaningful scores and score interpretations
  • Avoidance of response problems in non-cognitive measures
  • Training observers and interviewers
  • High standards for observers using high inference observations
  • Minimum interviewer effects

Scales of Measurement

A basic understanding of scales of measurement is essential in order to present, interpret and analyse data. What a scale actually means depends on what its numbers represent. Numbers can be grouped into four types or levels: nominal, ordinal, interval, and ratio. The scales are distinguished by the relationships assumed to exist between objects having different scale values. The four scale types are ordered in that all later scales have all the properties of earlier scales plus additional properties. Nominal is the simplest, and ratio the most sophisticated.

Categorical or qualitative variables tend to be reported on nominal and ordinal scales, and quantitative variables on interval or ratio scales.


Nominal Scale

A nominal scale is not really a ‘scale’ because it does not scale objects along any dimension; it simply labels objects. Categorical data are measured on nominal scales, which merely assign labels to distinguish categories.

Nominal is hardly measurement. It refers to quality more than quantity. A nominal level of measurement is simply a matter of distinguishing by name, e.g., 1 = male, 2 = female. Even though we are using the numbers 1 and 2, they do not denote quantity. The binary category of 0 and 1 used for computers is a nominal level of measurement. They are categories or classifications. Nominal measurement is like using categorical levels of variables.

Nominal basically refers to categorically discrete data such as the name of your school, the type of car one drives or the name of a book. This one is easy to remember because nominal sounds like name.

In nominal measurement the numerical values just “name” the attribute uniquely. A nominal scale tells you to which group a unit/individual belongs. A nominal scale can be used to categorise. For example, gender can be categorised as male or female, and religion can be categorised as Jewish, Muslim, Christian, Buddhist, and ‘other’. Sometimes a numerical code is assigned to nominal variables (e.g. 1 = female, 2 = male) but the code does not imply order.


Ordinal Scale

Ordinal refers to order in measurement. In ordinal measurement the attributes can be rank-ordered, but the distances between attributes do not have any meaning. Ordinal refers to quantities that have a natural ordering. For example, we often use rating scales (Likert questions). This is also an easy one to remember, because ordinal sounds like order. An ordinal scale indicates direction, in addition to providing nominal information. Low/Medium/High and Faster/Slower are examples of ordinal levels of measurement. Many psychological scales or inventories are at the ordinal level of measurement.

An ordinal scale extends the information of a nominal scale to show order, i.e. that one unit has more of a certain characteristic than another unit. For example, an ordinal scale can be used

  • to rank job applicants from the best to the worst,
  • to categorise people according to their level of education, or
  • to measure people’s feelings about some matter using a measure like ‘strongly agree’, ‘agree’, ‘neutral’, ‘disagree’, ‘strongly disagree’.


Interval Scale

An interval scale is a scale on which equal intervals between objects represent equal differences.

Interval scales provide information about order, and also possess equal intervals. Equal-interval scales of measurement can be devised for opinions and attitudes. Constructing them involves an understanding of mathematical and statistical principles. But it is important to understand the different levels of measurement when using and interpreting scales.

Interval data is like ordinal except we can say the intervals between each value are equally split. The most common example is temperature in degrees Fahrenheit. The difference between 29 and 30 degrees is the same magnitude as the difference between 78 and 79. Attitudinal scales and Likert questions are rarely interval, although many points on such scales are likely to be at equal intervals.

Interval scales are not simply ordinal. They give a deeper meaning to order. An interval scale is a scale of measurement in which the magnitude of difference between measurements of any two units is meaningful. If weights are measured in kilograms (kg), then the difference in weights between two people whose weights are respectively 82 kg and 69 kg is the same as that between people whose respective weights are 64 kg and 51 kg. That is, the ‘intervals’ are the same (13 kg) and have the same meaning. Further, because weight also has a true zero, someone who weighs 100 kilograms is twice as heavy as someone who weighs 50 kilograms. Consequently, many measures that satisfy an interval scale, such as weight, are also meaningful on a ratio scale.


Ratio Scale

A ratio scale is a special form of interval scale that has a true zero. For some interval scales, measurement ratios are not meaningful. For example, 40° C does not represent a temperature which has twice the heat of 20° C because the zero on the Celsius scale is arbitrary, and does not represent an absence of heat. However, when we consider the Kelvin scale for temperature, there is a true zero (called ‘absolute zero’). Therefore, a measure of 40 K (i.e. 40 kelvins) is twice as hot as 20 K.
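The temperature example can be checked numerically. In the Python sketch below, the conversion constant 273.15 is the standard offset between the Celsius and Kelvin scales; the function name is only illustrative.

```python
def celsius_to_kelvin(celsius):
    """Convert degrees Celsius to kelvins."""
    return celsius + 273.15


# Ratio of the raw Celsius readings: looks like "twice as hot".
print(40 / 20)  # 2.0

# Ratio of the same two temperatures on the Kelvin scale, which has a true zero.
ratio = celsius_to_kelvin(40) / celsius_to_kelvin(20)
print(round(ratio, 3))  # 1.068
```

The ratio of 2 comes purely from where the Celsius zero happens to sit. Measured from absolute zero, 40° C is only about 7 percent higher than 20° C.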

Finally, in ratio measurement there is always an absolute zero that is meaningful. This means that you can construct a meaningful fraction (or ratio) with a ratio variable. Weight is a ratio variable.

In addition to possessing the qualities of nominal, ordinal, and interval scales, a ratio scale has an absolute zero (a point where none of the quality being measured exists). Ratio data is interval data with a natural zero point. Using a ratio scale permits comparisons such as being twice as high, or one-half as much. Reaction time (how long it takes to respond to a signal of some sort) uses a ratio scale of measurement: time. Although an individual’s reaction time is always greater than zero, we conceptualize a zero point in time, and can state that a response of 24 milliseconds is twice as fast as a response time of 48 milliseconds.

The Relationship between Numbers, Nominal, Ordinal, Interval and Ratio Scales

It’s important to recognize that there is a hierarchy implied in the level of measurement idea. At lower levels of measurement, assumptions tend to be less restrictive and data analyses tend to be less sensitive. At each level up the hierarchy, the current level includes all of the qualities of the one below it and adds something new. In general, it is desirable to have a higher level of measurement (e.g., interval or ratio) rather than a lower one (nominal or ordinal).

Numbers can be used to represent measurements on any of the four scales mentioned in this section. However, the relative values of these numbers have a deeper meaning as the scale goes progressively through nominal, ordinal, interval and ratio scales. For example, suppose the numbers 1, 2, and 3 represent 3 measurements on any one of those scales. On a nominal scale, the numbers could have been replaced equally by the same numbers in a different order such as 3, 1, 2 or three arbitrarily chosen different numbers such as 6, 4, 8. On an ordinal scale, the order of the numbers 1, 2, 3 is important, but the order tells us nothing about the magnitude of difference between 1 and 2 and 2 and 3. However, on an interval scale, the difference between 1 and 2 is the same as that between 2 and 3 and half of that between 1 and 3.
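The point about the numbers 1, 2 and 3 can be made concrete by asking which changes to the numbers leave each scale's information intact. The Python sketch below uses helper functions that are illustrative assumptions, not standard library routines.

```python
def same_grouping(a, b):
    """Nominal information: two units share a code in `a` exactly when
    they share a code in `b`."""
    n = len(a)
    return all((a[i] == a[j]) == (b[i] == b[j])
               for i in range(n) for j in range(n))


def same_order(a, b):
    """Ordinal information: the rank order of units is identical."""
    n = len(a)
    return all((a[i] < a[j]) == (b[i] < b[j])
               for i in range(n) for j in range(n))


def difference_ratio(v):
    """Interval information: size of the first gap relative to the second."""
    return (v[1] - v[0]) / (v[2] - v[1])


values = [1, 2, 3]
relabelled = [6, 4, 8]                  # arbitrary new codes
squared = [x ** 2 for x in values]      # order-preserving: [1, 4, 9]
linear = [10 * x + 32 for x in values]  # linear rescaling: [42, 52, 62]

print(same_grouping(values, relabelled))  # True: fine for a nominal scale
print(same_order(values, relabelled))     # False: order is lost
print(same_order(values, squared))        # True: fine for an ordinal scale
print(difference_ratio(squared))          # 0.6: equal intervals are lost
print(difference_ratio(linear))           # 1.0: fine for an interval scale
```

Each step up the hierarchy allows fewer transformations of the numbers, which is another way of saying that the numbers carry more information.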


The level of measurement for a particular variable is defined by the highest category that it achieves. For example, categorizing someone as extroverted (outgoing) or introverted (shy) is nominal. If we categorize people 1 = shy, 2 = neither shy nor outgoing, 3 = outgoing, then we have an ordinal level of measurement. If we use a standardized measure of shyness we would probably assume the shyness variable meets the standards of an interval level of measurement. As to whether or not we might have a ratio scale of shyness, although we might be able to measure zero shyness, it would be difficult to devise a scale where we would be comfortable talking about someone’s being 3 times as shy as someone else.

Measurement at the interval or ratio level is desirable because we can use the more powerful statistical procedures available for means and standard deviations. To have this advantage, ordinal data are often treated as though they were interval; an example is subjective rating scales (1 = terrible, 2 = poor, 3 = fair, 4 = good, 5 = excellent). Such a scale probably does not meet the requirement of equal intervals, since we don’t know that the difference between 2 (poor) and 3 (fair) is the same as the difference between 4 (good) and 5 (excellent). In order to take advantage of more powerful statistical techniques, researchers often assume that the intervals are equal.
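What the equal-interval assumption buys, and what it risks, can be seen in a small Python sketch. The two groups of ratings and the alternative coding are made up for illustration; the alternative coding keeps the order of the five categories but treats "excellent" as much further from "good" than the other steps.

```python
from statistics import mean, median


def recode_ratings(ratings, mapping):
    """Replace each rating code with its value under another coding."""
    return [mapping[r] for r in ratings]


# Hypothetical ratings on the 1-5 scale (1 = terrible ... 5 = excellent).
group_a = [2, 2, 5]
group_b = [3, 3, 4]

# An order-preserving but unequally spaced coding (an assumption).
stretched = {1: 1, 2: 2, 3: 3, 4: 4, 5: 10}
a_stretched = recode_ratings(group_a, stretched)
b_stretched = recode_ratings(group_b, stretched)

# Means: the comparison between groups depends on the spacing assumed.
print(mean(group_a) < mean(group_b))          # True  (3 vs about 3.33)
print(mean(a_stretched) < mean(b_stretched))  # False (about 4.67 vs 3.33)

# Medians: the comparison uses only order, so it does not change.
print(median(group_a) < median(group_b))          # True (2 vs 3)
print(median(a_stretched) < median(b_stretched))  # True (2 vs 3)
```

The mean-based conclusion reverses under a different but equally order-respecting coding, while the median-based conclusion does not. This is the risk a researcher accepts when treating ordinal ratings as interval data.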





