Chapter 03: How assessment measures are made

Coaley, K. (2009): An Introduction to Psychological Assessment and Psychometrics. Sage.


• How do publishers design a psychological measure?
Through careful planning: setting clear aims for what the test is for, defining the relevant attribute(s), writing a test plan, writing items, selecting items, standardizing, and final preparations incl. the administration guide, statistics etc.

• Distinctions between different types of construction?
Criterion-keyed selects items based on their discriminability between a criterion group and a control. Item content is irrelevant and the approach is atheoretical.
Factor analysis is based on analysis of correlations between variables, and is used to check whether items relate to the same trait/theme/factor.
In construction using classical test theory the aim is to generate a pool of items of varying difficulty that all measure the same thing (item homogeneity), which is checked by item/total correlations.
Using IRT and Rasch scaling, items first undergo factor analysis to ensure they all measure the same trait and are then analysed with a focus on item difficulty and on how well each item elicits the wanted trait.

• What are the principles of test standardization and percentile norms?
Standardization allows for comparison of scores between people, ex based on norms, means and standard deviations. Percentile norms show what percentage of people perform worse than a given score, ex a percentile rank of 75 means that 75% of the people in the sample perform worse.

• What are the differences between norm-referenced, criterion-referenced and self-referenced data?
Norm-ref. compares an individual's scores with those of a group of people, ex the general population, women, children aged 2-4 etc.
Criterion-ref. compares results with an external standard, ex job performance in some area, ex a person with this result is good at communicating with customers.
Self-ref. is not used for comparison between people, but looks at variations within an individual.


Criterion-keyed construction
• From a large item pool with various items, those items which can discriminate between a criterion group and a control group are selected (regardless of their content).
• A purely empirical approach, not based on theory. (A limitation)
• ex MMPI (Minnesota Multiphasic Personality Inventory), CPI (California Personality Inventory)
• Based upon true-false or yes-no items.

• No understanding of why such a test works due to the lack of theory behind item selection.
• Likely to produce scales of poor reliability, as scales would be measuring a mix of different possible attitudes.
• The test would be specific only to the group used in construction.
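
The item-selection step described above can be sketched in a few lines. This is only an illustration under stated assumptions: the true/false answers are made up, and the cut-off `min_difference` and the function names are hypothetical, not taken from the book.

```python
def endorsement_rate(responses, item):
    """Proportion of respondents who answered True to the given item."""
    return sum(person[item] for person in responses) / len(responses)


def select_items(criterion_group, control_group, min_difference=0.3):
    """Keep items whose endorsement rates differ between the two groups.

    Item content is never inspected, which is what makes the approach
    purely empirical. The 0.3 cut-off is an arbitrary assumption.
    """
    n_items = len(criterion_group[0])
    selected = []
    for item in range(n_items):
        difference = (endorsement_rate(criterion_group, item)
                      - endorsement_rate(control_group, item))
        if abs(difference) >= min_difference:
            selected.append(item)
    return selected


# Hypothetical true/false answers: rows are people, columns are items.
criterion_group = [
    [True, True, False, True],
    [True, False, False, True],
    [True, True, True, True],
    [True, False, False, False],
]
control_group = [
    [False, True, True, True],
    [False, False, True, True],
    [True, True, True, True],
    [False, False, False, False],
]

selected = select_items(criterion_group, control_group)
print(selected)  # indices of the items that separate the groups
```

Only items 0 and 2 separate the groups here. The code never looks at what the items say, which mirrors the limitation noted above: nothing explains why the selected items work.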


Factor analysis
• A multivariate data reduction tool which enables us to simplify correlations between sets of variables.
• Based upon correlations between variables. In test construction factor analysis is used to check whether items relate to a common theme or factor.
• Mathematical and objective.
• Used to generate assessments which only measure one factor. With oblique rotation the factors are allowed to correlate. With orthogonal rotation the factors are uncorrelated, ie. independent.
• ex 16PF, Eysenck Personality Inventory.

• Needs large samples.
• Complex technical problems, therefore good knowledge of its procedures is highly important.


Classical test theory (CTT)
• A test is generated by analysing items (100+) in a pilot sample to make sure they measure only one factor. The end result is a pool of items of differing difficulty which measure only one thing.
• This is done by item-whole/total correlation (correlating every single item with total scores).
• Two criteria for items: item homogeneity (they all measure the same thing) and a difficulty indicator.
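
A minimal sketch of the two item checks, assuming right/wrong items scored 1/0 and made-up pilot data. It uses the "corrected" item-total correlation (each item is correlated with the total of the other items), which is a common variant and an assumption here, and it takes the proportion of correct answers as the difficulty indicator.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation between two equally long lists of numbers."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def item_analysis(scores):
    """Return (difficulty, corrected item-total correlation) per item."""
    n_items = len(scores[0])
    results = []
    for i in range(n_items):
        item = [row[i] for row in scores]
        # Total of the remaining items, so the item is not correlated with itself.
        rest = [sum(row) - row[i] for row in scores]
        difficulty = sum(item) / len(item)  # proportion answering correctly
        results.append((difficulty, pearson(item, rest)))
    return results


# Hypothetical pilot data: rows are people, columns are items (1 = correct).
pilot_scores = [
    [1, 1, 1, 1],
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 1],
    [0, 0, 0, 1],
    [0, 0, 0, 0],
]

results = item_analysis(pilot_scores)
for index, (difficulty, correlation) in enumerate(results):
    print(index, round(difficulty, 2), round(correlation, 2))
```

In this made-up data the last item correlates negatively with the rest of the test, so it would be a candidate for removal on the homogeneity criterion.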

A mix of CTT construction and factor analysis appears to be a good solution.


Item response theory (IRT) and Rasch scaling
• A very large sample is needed.
• Items are constructed and an initial factor analysis is carried out to make sure that all items measure a single trait.
• In order to compute the Rasch parameters the sample data is split into groups of high and low scorers, so that items are examined at different levels of difficulty.
• If the facility of an item for eliciting the trait is the same for both groups, it is seen as conforming to the model and is chosen for the test.
• On completion of item selection an item-free measurement of all individuals is needed, ex by checking whether sub-groups of items (the hardest vs. the easiest) generate the same scores for each person. If items fit the model, each individual will score the same on both tests.
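
The notes do not state the model itself. As a hedged illustration, the sketch below assumes the standard one-parameter logistic form of the Rasch model, where the probability of a keyed response depends only on the difference between person ability and item difficulty. It shows why a fitting item should behave the same for high and low scorers: the log-odds gap between two items is the same whatever the person's ability. All ability and difficulty values are made up.

```python
from math import exp, log


def rasch_probability(ability, difficulty):
    """Probability of a keyed response under the Rasch model."""
    return 1.0 / (1.0 + exp(-(ability - difficulty)))


def log_odds(p):
    """Convert a probability to log-odds."""
    return log(p / (1.0 - p))


def item_gap(ability, item_a, item_b):
    """Difference in log-odds of success on two items for one person."""
    return (log_odds(rasch_probability(ability, item_a))
            - log_odds(rasch_probability(ability, item_b)))


# Hypothetical item difficulties and person abilities (in logits).
EASY_ITEM, HARD_ITEM = -1.0, 1.5

gap_low_scorer = item_gap(-0.5, EASY_ITEM, HARD_ITEM)
gap_high_scorer = item_gap(2.0, EASY_ITEM, HARD_ITEM)
print(gap_low_scorer, gap_high_scorer)  # the two gaps agree up to rounding
```

Under this assumption the gap equals the difference between the two item difficulties for every person, which is the property the split into high and low scorers is meant to check.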


Standardization
• Standard procedures for administration
• Standard procedures for scoring
• Development of norms.

NORMS = Lists of mean scores/raw scores for specific samples of people, ex according to age, profession etc.


Quality of norms depends on:
o Size of sample
o Representativeness of sample

A standardization sample can be chosen by, ex: random sampling, stratified sampling, stratified random sampling, or item sampling (administering different sets of items to different (random) groups).
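
Stratified random sampling, one of the options listed above, can be sketched as follows. The population, the `age_group` field and the 10% sampling fraction are all hypothetical; the point is only that each stratum is sampled at random in proportion to its size.

```python
import random
from collections import defaultdict


def stratified_sample(people, stratum_of, fraction, seed=0):
    """Draw the same fraction at random from every stratum."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    strata = defaultdict(list)
    for person in people:
        strata[stratum_of(person)].append(person)
    sample = []
    for members in strata.values():
        k = round(len(members) * fraction)
        sample.extend(rng.sample(members, k))
    return sample


# Hypothetical population: 60 people aged 18-39 and 40 people aged 40+.
population = [
    {"id": i, "age_group": "18-39" if i < 60 else "40+"}
    for i in range(100)
]

sample = stratified_sample(population, lambda p: p["age_group"], fraction=0.1)
print(len(sample))  # 10 people: 6 from the younger stratum, 4 from the older
```

Because every stratum contributes in proportion to its size, the sample keeps the age mix of the population, which addresses the representativeness criterion mentioned above.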

Standardization enables scores to be converted to relative scores, ex percentiles, and allows for comparison between different people.


Norm referencing
• Most psychological tests are norm referenced.
• Comparing a person’s scores/performance to others who have done the same assessment.
• A set of norms should also include the mean and standard deviation (an indicator of how adequately the mean represents the data set).
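
As a small illustration of why the mean and standard deviation belong in a set of norms, the sketch below expresses a raw score relative to a norm group as a z-score. The z-score is a common standard score that these notes do not name, so treat it as an assumption; the norm sample is made up.

```python
from statistics import mean, stdev

# Hypothetical raw scores from a (far too small) norm sample.
norm_sample = [12, 15, 9, 14, 10, 12, 13, 11]

norm_mean = mean(norm_sample)
norm_sd = stdev(norm_sample)  # sample standard deviation


def z_score(raw_score):
    """Distance of a raw score from the norm mean, in SD units."""
    return (raw_score - norm_mean) / norm_sd


print(norm_mean, norm_sd, z_score(15))
```

A positive value means the person scored above the norm group's mean, a negative value below it.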

Criterion referencing
• Relating test results to some external factor/criterion, ex job performance.
• Can be used to distinguish between those who have or have not developed the skills, knowledge and abilities for a specific activity, ex “A person with scores above 75 will work well in …”.
• The BIG question: How good are the external standards set?

Self referencing
• Ipsative scales/measures
• Ex forced choice: “Choose the one which describes you best: Active or Passive, Competitive or Uncompetitive etc.”
• Scores tell us only about the person doing the test and nothing about comparison with others.


Percentile norms
• Based on a rank ordering, ie. an ordinal scale.
• Ex a person with a score at the 75th percentile has a better score than 75% of the people in the sample.

Calculating percentiles from raw scores

• For each score the percentage/proportion of people who have done less well is calculated.
• To calculate the percentile rank you have to interpolate (take the mean) between the lower and upper limits, ex a raw score of 7 has a lower limit of 17.4% (ie. 17.4% get 6 or less) and an upper limit of 25.5% (ie. 25.5% get 7 or less). The percentile rank for a score of 7 is therefore: (17.4 + 25.5)/2 = 21.45.
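
The calculation above can be sketched as a short function. The list of raw scores is made up for illustration; the last lines simply redo the arithmetic of the worked example with the two limits given in the notes.

```python
def percentile_rank(scores, raw_score):
    """Mean of the percentage scoring below and the percentage at or below."""
    n = len(scores)
    lower_limit = 100 * sum(s < raw_score for s in scores) / n
    upper_limit = 100 * sum(s <= raw_score for s in scores) / n
    return (lower_limit + upper_limit) / 2


# Hypothetical raw scores for ten people.
scores = [3, 5, 5, 6, 7, 7, 8, 9, 9, 10]
rank = percentile_rank(scores, 7)
print(rank)  # 40% score below 7 and 60% score 7 or less, so the rank is 50.0

# The worked example from the notes, using the two limits directly.
notes_example = (17.4 + 25.5) / 2
print(round(notes_example, 2))
```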

The main disadvantage of using percentile norms is that they are based on an ordinal scale, which means that the units are not equal on all parts of the scale.
