Flanagan, D. P. & Kaufman, A. S. (2009): Essentials of WISC-IV assessment


Historical and contemporary views of the Wechsler scales



The Wechsler scales are the most widely used and popular methods of assessing human intelligence and cognitive abilities. Even critics acknowledge the Wechsler scales’ contributions to research and practice.

Brief history of intelligence test development


Latter half of the 19th century: A beginning interest in intelligence testing. Francis Galton developed the first intelligence test and is regarded as the father of the testing movement. He focused on simple sensory and motor tasks, as he believed that intelligence was a matter of how well developed one’s senses were.

End of the 19th century: Alfred Binet and colleagues developed tasks to measure the intelligence of children in Paris public schools. Finding that Galton’s tasks did not discriminate between children and adults and were too simple, Binet focused on language, judgment, memory, comprehension and reasoning tasks.

Later, Binet’s tests were translated and adapted for use in America, where they were further developed by Terman. The resulting Stanford-Binet test (now the SB5) led the field as the most popular test for 40 years.

Around 1917 the focus changed from assessment of children to assessment of adults, primarily due to the United States’ entry into World War I. A group-administered IQ test, the Army Alpha, with verbal content similar to the Stanford-Binet test, was developed to aid the selection of officers. A non-verbal group-administered test, the Army Beta, was developed to assess immigrants who spoke little English. Many of the non-verbal tasks had names that we can recognize in modern IQ tests, e.g. Picture Completion and Digit Symbol.

The mid-1930s saw the introduction of David Wechsler to the field of assessment. His first test was the Wechsler-Bellevue Intelligence Scale, which included both verbal tasks (like the Army Alpha) and performance tasks (like the Army Beta).

In 1949 the first WISC was introduced and soon became the most widely used measure of intellectual functioning in children, along with the Wechsler Preschool and Primary Scale of Intelligence (WPPSI). In 1955 the original Wechsler-Bellevue was revised and renamed the Wechsler Adult Intelligence Scale (WAIS).

Other measures include the Woodcock-Johnson Tests of Cognitive Abilities (now WJ III), the Kaufman Assessment Battery for Children (now KABC-II), the Differential Ability Scales (now DAS-II), the Cognitive Assessment System (CAS), the Universal Nonverbal Intelligence Test (UNIT) and the Reynolds Intellectual Assessment Scales (RIAS).

Common to these recently developed and revised tests is their reliance on theory, especially the Cattell-Horn-Carroll (CHC) theory.

Cattell-Horn-Carroll (CHC) theory


The CHC theory of cognitive abilities is an amalgamation of two similar theories about the content and structure of human cognitive abilities. The first of these is Gf-Gc theory (Cattell, 1941; Horn, 1965), and the second is Carroll’s (1993) Three-Stratum theory.

There are 10 broad abilities and over 70 narrow abilities below these. The broad abilities include:

  • Crystallized Intelligence (Gc): the breadth and depth of a person’s acquired knowledge, the ability to communicate one’s knowledge, and the ability to reason using previously learned experiences or procedures.
  • Fluid Intelligence (Gf): the broad ability to reason, form concepts, and solve problems using unfamiliar information or novel procedures.
  • Quantitative Knowledge (Gq): the ability to comprehend quantitative concepts and relationships and to manipulate numerical symbols.

The other broad abilities are Long-Term Storage and Retrieval (Glr), Short-Term Memory (Gsm), Reading and Writing (Grw), Visual Processing (Gv), Auditory Processing (Ga), Processing Speed (Gs) and Decision/Reaction Time/Speed (Gt).



The different approaches that have been used to interpret an individual’s performance on the Wechsler scales can be described in four waves:

1) The first wave: Quantification of general level


Characterised by offering an objective method of differentiating groups of people on the basis of their general intelligence (e.g. the Stanford-Binet scale).

The overall IQ score was the main focus, and the prevailing idea was that of intelligence as a global entity: the "g" factor (e.g. Spearman).

  • Focus: The global IQ and practical considerations regarding the need to classify people into separate groups.

2) The second wave: Clinical profile analysis

The analysis of patterns of subtest scaled scores. Clinical profile analysis was a method that aimed to go beyond the general IQ score and to interpret more specific aspects of a person’s cognitive functioning. The Wechsler-Bellevue was the first test to go beyond global IQ, grouping its subtests into Verbal and Performance composites.

  • Focus: Patterns of high and low subtest scores, which could presumably reveal diagnostic and psychotherapeutic considerations.

3) The third wave: Psychometric profile analysis

Improvements in technology and statistics (e.g. Cohen’s factor analyses and other work on the WISC and WAIS) revealed that the Wechsler subtests showed very poor specificity, which called the earlier clinical analysis of individual subtests into question. Instead of Wechsler’s Verbal-Performance dichotomy, factor analysis revealed a 3-factor structure of Verbal Comprehension, Perceptual Organisation and Freedom from Distractibility.

  • Focus: Psychometric precision and methods in profile analysis, rather than the loose interpretative attempts of clinical profile analysis. However, the lack of empirical support and of a theoretical background makes this approach controversial and lacking in validity.

4) The fourth wave: Application of theory

Following the lack of success of the (second and) third wave approaches, it became clear that theory and research needed to be integrated into the intelligence test interpretation process, and that interpretation could be enhanced by reorganizing subtests into clusters specified by a particular theory, e.g. the Cross-Battery approach based on CHC theory.

  • Focus: Grounding intelligence testing and the interpretation of scores in theory. The most popular theory in test development and interpretation is the CHC (Cattell-Horn-Carroll) theory.

Debate about the utility of intraindividual (ipsative) analysis

The critics (e.g. Glutting et al.): Ipsative scores have poor reliability, are not stable over time, and do not add anything to the prediction of achievement after "g" is accounted for.

The comeback: Flanagan et al. include an intraindividual analysis in their interpretation approach. They justify it by making clear that the ipsative approach is only one part of a larger guide to interpretation, and that the use of ipsative scores can be warranted when accompanied by theory, research and psychometrically defensible interpretive procedures.

They suggest the following ways of improving the practice of ipsative analysis:

  • To interpret test data within the context of a well-validated theory, e.g. CHC theory.
  • To use composites or clusters, rather than subtests, in intraindividual analysis.
  • To check whether a score that is a relative weakness for the individual is also a weakness in comparison to other people, e.g. if an individual obtains a score that is low compared to the individual’s other scores but falls within the average range compared to other people, the weakness cannot be said to be clinically meaningful.
  • To keep in mind that a lack of stability in an individual’s scaled score profile over an extended period of time (e.g. 3 years) does not imply that ipsative analysis is invalid. It is not unusual for changes to occur over long periods of time, e.g. due to effects of intervention, developmental changes or regression towards the mean.

But always remember that diagnosis should not be made solely based on intraindividual/ipsative analysis!
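The third suggestion above (checking a relative weakness against the normative range as well) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the index scores are invented, the 10-point gap and the cut-off of 85 (one SD below the mean of 100) are placeholder values rather than the critical values published in the book, and `classify` is a hypothetical helper name.

```python
from statistics import mean

# Hypothetical index scores on the standard-score metric (mean 100, SD 15).
scores = {"VCI": 112, "PRI": 108, "WMI": 93, "PSI": 103}

# Assumed cut-offs for illustration only; NOT the book's published critical values.
PERSONAL_GAP = 10     # points below the child's own mean score
NORMATIVE_FLOOR = 85  # scores below this fall under the average range (100 - 1 SD)


def classify(scores, personal_gap=PERSONAL_GAP, normative_floor=NORMATIVE_FLOOR):
    """Label each score as a personal and/or normative weakness."""
    own_mean = mean(scores.values())
    labels = {}
    for name, score in scores.items():
        personal = own_mean - score >= personal_gap  # low relative to own profile
        normative = score < normative_floor          # low relative to other people
        if personal and normative:
            labels[name] = "personal and normative weakness"
        elif personal:
            labels[name] = "personal weakness only"
        elif normative:
            labels[name] = "normative weakness only"
        else:
            labels[name] = "no weakness flagged"
    return labels


for name, label in classify(scores).items():
    print(f"{name}: {label}")
```

With these invented scores the child's own mean is 104, so WMI (93) is flagged as a personal weakness only: it is low for this child but still within the average range, which is exactly the case the notes describe as not clinically meaningful on its own.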



Structure of the WISC-IV

WISC-IV consists of 15 subtests: 10 core-battery subtests and 5 supplemental subtests.

Structural changes from WISC-III to WISC-IV:

  • Deleted subtests = Picture Arrangement, Object Assembly and Mazes.
  • New subtests = Word Reasoning, Matrix Reasoning, Picture Concepts, Letter-Number Sequencing and Cancellation.
  • VIQ and PIQ dropped and replaced by 4 indexes: WMI, PSI, VCI and PRI.
    • WHY? Because the discrepancy between the two was overused, and its meaningfulness and clinical utility were never made clear in the literature.
  • FSIQ has changed dramatically in content and concept: only 5 of the 10 subtests that traditionally made up the FSIQ are retained (Similarities, Comprehension, Vocabulary, Block Design and Coding).
  • Norms updated.
  • Items added to improve floors and ceilings.


Standardization sample = 2,200 children resembling the 2002 Census data on the variables of age, gender, geographic region, ethnicity and socioeconomic status. The sample was divided into 11 age groups of 200 children each, and each group was split equally between boys and girls.

The WISC-IV has also been standardized in a range of other countries.



Average internal consistency:

  • of subtests: Ranges from 0.72 (Coding, for ages 6-7) to 0.94 (Vocabulary, for age 15).
  • of indexes: VCI = 0.94, PRI = 0.92, WMI = 0.92 and PSI = 0.88.
  • of Full Scale IQ: FSIQ = 0.97.
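To make these coefficients more concrete, the sketch below converts them into a standard error of measurement using the classical test theory formula SEM = SD × sqrt(1 − r). It assumes the usual composite metric (mean 100, SD 15); the formula is general psychometrics rather than something taken from these notes, and the values published in the test manual may differ slightly because they are averaged across age groups.

```python
import math

SD = 15  # Wechsler composite scores: mean 100, SD 15

# Average internal consistency coefficients quoted in the list above.
reliability = {"VCI": 0.94, "PRI": 0.92, "WMI": 0.92, "PSI": 0.88, "FSIQ": 0.97}


def sem(r, sd=SD):
    """Classical standard error of measurement: sd * sqrt(1 - r)."""
    return sd * math.sqrt(1 - r)


for name, r in reliability.items():
    print(f"{name}: r = {r:.2f}, SEM = {sem(r):.2f} points")
```

The higher the reliability, the smaller the SEM: under these assumptions the FSIQ comes out at roughly 2.6 points and the PSI at roughly 5.2 points.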

Average test-retest coefficients:

  • VCI = 0.93, PRI = 0.89, WMI = 0.89, PSI = 0.86 and FSIQ = 0.93.

Practice effects:

  • In general, practice effects are greatest for ages 6-7 and become smaller with increasing age. Coding and Symbol Search showed the largest gains (ages 6-7).


g-loading = an important indicator of the degree to which a subtest measures general intelligence.

  • VCI subtests generally have the highest g-loadings at every age, followed by the PRI, WMI and PSI subtests, except Arithmetic, which loads more like the VCI subtests.

Floors, ceilings and item gradients

Floors and ceilings for all WISC-IV subtests are excellent, which means that the WISC-IV can be used with confidence in testing individuals who are functioning in either the gifted or the mentally retarded range.

Item gradients refer to the spacing between items on a subtest. Generally these range from good to excellent at all ages in the WISC-IV. This means that the spacing between items is generally small enough to allow for reliable discrimination between individuals on the latent trait measured by the subtest.

Structural validity


The structural validity of the WISC-IV is supported by factor-analytic studies described in the WISC-IV Technical and Interpretive Manual.

Investigations (Keith et al., 2006) of whether the WISC-IV measures the same constructs across its 11-year age span (children from 6-16) produced positive results.

They also investigated the nature of these constructs and concluded that the WISC-IV measures Crystallized Ability, Visual Processing, Fluid Reasoning, Short-Term Memory and Processing Speed.

Flanagan et al., along with other authors, have developed 8 new clinical clusters, which can be used in what they call Planned Clinical Comparisons to gain information about a child’s cognitive capabilities beyond the four indexes and the FSIQ:

  1. Fluid Reasoning
  2. Visual Processing
  3. Nonverbal Fluid Reasoning
  4. Verbal Fluid Reasoning
  5. Lexical Knowledge
  6. General Information
  7. Long-Term Memory
  8. Short-Term Memory

Relationship to other Wechsler Scales

The validity of WISC-IV is supported by factor analysis and content-validity research, but also by correlations with scores on other measures of cognitive ability (for both normal and special group samples).

Correlations with Full Scale IQ:

  • Not surprisingly, the WISC-IV FSIQ is highly correlated with the FSIQs of other Wechsler scales (e.g. WAIS-IV 0.90, WISC-III 0.89, WASI (Wechsler Abbreviated Scale of Intelligence) 0.86, WPPSI-III 0.89, WNV (Wechsler Nonverbal Scale of Ability) 0.76).
  • The substantial coefficient of 0.89 between the WISC-III and WISC-IV suggests continuity in the construct measured by the Full Scale, which is notable because the WISC-IV FSIQ is dramatically different from its predecessors: it has only 5 subtests in common with previous measures of FSIQ.

Convergent-Discriminant Validity Coefficients


The WISC-IV also shows good to excellent convergent-discriminant validity evidence, e.g. for VCI and PRI.

Relationship to WIAT-II

WIAT-II = Wechsler Individual Achievement Test – second edition.

The validity of the WISC-IV was investigated further through an examination of its relationship to academic achievement, by correlating the FSIQ and indexes of the WISC-IV with WIAT-II scores.

  • Conclusions: FSIQ is a robust predictor of academic achievement in normal and clinical samples. There are indications that FSIQ explains 56-60% of the variance in domains such as Oral Language, Reading and Math.
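
As a rough reading aid, the share of variance explained can be translated back into a correlation coefficient. The sketch below assumes the percentages are squared Pearson correlations (r²), which is the usual meaning of "variance explained" but is not stated explicitly in these notes; `correlation_from_variance` is a hypothetical helper name.

```python
import math


def correlation_from_variance(proportion):
    """Return r for a given proportion of explained variance (r squared)."""
    return math.sqrt(proportion)


# Range of explained variance quoted above: 56% to 60%.
for proportion in (0.56, 0.60):
    r = correlation_from_variance(proportion)
    print(f"{proportion:.0%} of variance explained -> r = {r:.2f}")
```

Under that assumption, 56-60% of variance corresponds to correlations of roughly 0.75 to 0.77 between FSIQ and the achievement composites.
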
Ethnic Differences on the WISC-IV


Historically, Whites have scored about one SD higher than African Americans on Wechsler’s scales (e.g. FSIQ differing by 14.9 points on the WISC-III) and about 2/3 SD higher than Hispanics (differing by 9.9 points). On the WISC-IV, the FSIQ difference between Whites and African Americans is reduced to about ¾ SD (= 11.5 points), while there remains a 10-point difference between Whites and Hispanics.

Socioeconomic status (e.g. conceptualised as parental education) has been shown to account for 18.8% of the variance between Whites and African Americans. This is substantially larger than the variance accounted for by race alone (4.7%).
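
The SD fractions in this paragraph follow from dividing each point difference by the FSIQ standard deviation of 15. A minimal check, using only the point differences quoted in the paragraph (the dictionary labels and the `in_sd_units` helper are made up for this illustration):

```python
SD = 15  # FSIQ metric: mean 100, SD 15

# Point differences quoted in the paragraph above.
differences = {
    "WISC-III, Whites vs. African Americans": 14.9,
    "WISC-III, Whites vs. Hispanics": 9.9,
    "WISC-IV, Whites vs. African Americans": 11.5,
    "WISC-IV, Whites vs. Hispanics": 10.0,
}


def in_sd_units(points, sd=SD):
    """Express a score difference in standard deviation units."""
    return points / sd


for label, points in differences.items():
    print(f"{label}: {points} points = {in_sd_units(points):.2f} SD")
```

This gives about 0.99, 0.66, 0.77 and 0.67 SD respectively, in line with the "one SD", "2/3 SD" and "¾ SD" wording.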



A far-reaching influence

Wechsler’s intelligence scales have made substantial contributions to the science of intellectual assessment.

Practical, clinical values

Wechsler’s primary motivation for constructing his tests was to create an efficient, easy-to-use tool for clinical purposes. Grounding the tests in a specific theory was not of central importance.

Future directions

However, results from contemporary intelligence research cannot be overlooked, and the Wechsler scales must adopt the fourth-wave approach, which integrates contemporary theory, research and measurement principles.


