This material is in early beta: over 300 suggestions and corrections are waiting to be folded in, some quite significant. Changes should be in place by July 2018, at which time printed copies and downloadable electronic copies will be made available.

Cognitive Load

After reading this chapter, you will be able to

  • Define cognitive load and explain how consideration of it can be used to shape instruction.
  • Explain what faded examples are and construct faded examples for use in programming workshops.
  • Explain what Parsons Problems are and construct Parsons Problems for use in programming workshops.
  • Describe ways they differ from their own students and what effect those differences have on instruction.

In 2006, Kirschner, Sweller and Clark wrote:

Although unguided or minimally guided instructional approaches are very popular and intuitively appealing… these approaches ignore both the structures that constitute human cognitive architecture and evidence from empirical studies over the past half-century that consistently indicate that minimally guided instruction is less effective and less efficient than instructional approaches that place a strong emphasis on guidance of the student learning process. The advantage of guidance begins to recede only when learners have sufficiently high prior knowledge to provide “internal” guidance. ([Kirs2006])

Their paper set off a minor academic firestorm, because beneath the jargon the authors were claiming that allowing learners to ask their own questions, set their own goals, and find their own path through a subject, as they would when solving problems in real life, doesn’t actually work very well. This approach—called inquiry-based learning—is intuitively appealing, but Kirschner and colleagues argued that it overloads learners by requiring them to master a domain’s factual content and its problem-solving strategies at the same time.

More specifically, cognitive load theory posits that people have to deal with three things when they’re learning:

Intrinsic load

is what people have to keep in mind in order to absorb new material. In a programming class, this might be understanding what a variable is, or understanding how assignment in a programming language is different from creating a reference to a cell in a spreadsheet.

Germane load

is the (desirable) mental effort required to link new information to old, which is one of the things that distinguishes learning from memorization. An example might be remembering that a loop variable is assigned a new value each time the loop executes.

Extraneous load

is everything else that distracts or gets in the way, such as knowing that tabs look like multiple characters but only count as one character when indenting Python code.
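
The three examples above can be made concrete in a few lines of Python. This is only an illustrative sketch: the variable names and the snippet being compiled are invented here and do not come from any particular lesson.

```python
# Intrinsic load: assignment copies a value; it does not create a live
# link the way a reference to a spreadsheet cell does.
price = 10
total = price * 2   # total is 20
price = 50          # changing price afterwards does not update total
assert total == 20

# Germane load: the loop variable is assigned a new value on each pass.
seen = []
for word in ["red", "green", "blue"]:
    seen.append(word)
assert seen == ["red", "green", "blue"]

# Extraneous load: a tab is displayed as several columns but is one character.
assert len("\t") == 1
assert "\tx".expandtabs(8) == " " * 8 + "x"

# Python 3 rejects a block indented with spaces on one line and a tab on
# the next, even though the two lines may look aligned in an editor.
mixed = "if True:\n        a = 1\n\tb = 2\n"
try:
    compile(mixed, "<demo>", "exec")
    mixed_ok = True
except TabError:
    mixed_ok = False
assert not mixed_ok
```

The last case is the kind of detail that consumes a novice's attention without teaching them anything about programming itself.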

According to cognitive load theory, searching for a solution strategy is an extra burden on top of actually applying that strategy. We can therefore accelerate learning by giving learners worked examples that show them a problem and a detailed step-by-step solution, followed by a series of faded examples. The first example presents a nearly-complete use of the same problem-solving strategy just demonstrated, but with a small number of blanks for the learner to fill in. The next problem is of the same type, but has more blanks, and so on until the learner is asked to solve the entire problem. The material that isn’t blank is often referred to as scaffolding, since it serves the same purpose as the scaffolding set up temporarily at a building site.

For example, someone teaching Python might start by explaining how to calculate the total length of a list of words:

# total_length(["red", "green", "blue"]) => 12
def total_length(words):
    total = 0
    for word in words:
        total += len(word)
    return total

then ask learners to fill in the blanks in:

# word_lengths(["red", "green", "blue"]) => [3, 5, 4]
def word_lengths(words):
    lengths = []
    for ____ in ____:
        ____
    return lengths

The next problem might be:

# join_all(["red", "green", "blue"]) => "redgreenblue"
def join_all(words):
    result = ____
    for ____ in ____:
        ____
    return result

Learners would finally be asked to write an entire function on their own:

# acronymize(["red", "green", "blue"]) => "RGB"
def acronymize(words):

Faded examples work because they introduce the problem-solving strategy piece by piece: at each step, learners have one new problem to tackle, which is less intimidating than a blank screen or a blank sheet of paper (Section 9.10). They also encourage learners to think about the similarities and differences between various approaches, which helps create the linkages in their mental models that help retrieval.

The key to constructing a good faded example is to think about the problem-solving strategy it is meant to teach. For example, the problems in the series above are all examples of the accumulator pattern, in which the results of processing items from a collection are repeatedly added to a single variable in some way to create the final result.
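
One way for an instructor to check that a series really does exercise a single strategy is to write the pattern once, with the starting value and the update step passed in. The sketch below is meant for instructors rather than novices; `accumulate` and its parameter names are invented for this illustration.

```python
def accumulate(items, start, update):
    """Generic accumulator: fold each item into a running result."""
    result = start
    for item in items:
        result = update(result, item)
    return result

words = ["red", "green", "blue"]

# Each problem in the series differs only in its starting value
# and in how an item is combined with the running result.
total = accumulate(words, 0, lambda acc, w: acc + len(w))
lengths = accumulate(words, [], lambda acc, w: acc + [len(w)])
joined = accumulate(words, "", lambda acc, w: acc + w)
acronym = accumulate(words, "", lambda acc, w: acc + w[0].upper())

print(total, lengths, joined, acronym)
```

If a candidate problem cannot be expressed as a starting value plus an update step, it probably belongs in a different series.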

Parsons Problems

Another kind of exercise that can be explained in terms of cognitive load theory is called a Parsons Problem. If you are teaching someone to speak a new language, you could ask them a question, and then give them the words they need to answer the question, but in jumbled order. Their task is to put the words in the right order to answer the question grammatically, which frees them from having to think simultaneously about what to say and how to say it.

Similarly, when teaching people to program, you can give them the lines of code they need to solve a problem, and ask them to put them in the right order. This allows them to concentrate on control flow and data dependencies, i.e., on what has to happen before what, without being distracted by variable naming or trying to remember what functions to call. Multiple studies have shown that Parsons Problems take less time for learners to do, but produce equivalent educational outcomes [Eric2017].
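
A Parsons Problem can be produced mechanically from a working solution. The following is a minimal sketch under some assumptions: the helper names `make_parsons_problem` and `is_correct_order` are hypothetical, indentation is stripped so that layout does not give the order away, and the checker accepts only the original order even though some problems have more than one valid ordering.

```python
import random

def solution_lines(solution):
    """Split a solution into non-blank lines with indentation removed."""
    return [line.strip() for line in solution.splitlines() if line.strip()]

def make_parsons_problem(solution, seed=None):
    """Return the solution's lines, unindented and in shuffled order."""
    shuffled = solution_lines(solution)
    random.Random(seed).shuffle(shuffled)
    return shuffled

def is_correct_order(solution, attempt):
    """Check a learner's ordering against the original line order."""
    return [line.strip() for line in attempt] == solution_lines(solution)

SOLUTION = """
def total_length(words):
    total = 0
    for word in words:
        total += len(word)
    return total
"""

problem = make_parsons_problem(SOLUTION, seed=1)
for line in problem:
    print(line)
```

Passing a fixed `seed` means every learner in a workshop sees the lines in the same jumbled order, which makes it easier to discuss the exercise as a group.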

Labelled Subgoals

[Marg2016,Morr2016] both found that students with labelled subgoals solved Parsons Problems for learning loops better than students without, i.e., that giving the steps names helps students learn them. The same benefit is seen in other problem domains [Marg2012], and can also be explained by cognitive load theory: naming the steps reduces the germane load of figuring out what to do next.
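
As an illustration of what labelled subgoals might look like in code, the sketch below attaches a named step to each part of a simple loop. The labels are invented for this example and are not the ones used in the cited studies.

```python
def total_length(words):
    # Subgoal 1: initialize the accumulator
    total = 0
    # Subgoal 2: visit each item in the collection
    for word in words:
        # Subgoal 3: update the accumulator
        total += len(word)
    # Subgoal 4: report the result
    return total

print(total_length(["red", "green", "blue"]))
```

The same labels can be reused across several problems so that learners come to recognize the steps regardless of the surface details.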

While faded examples take cognitive load into account in a scalable way, a much older model of learning uses the same ideas on a more personal scale. Cognitive apprenticeship emphasizes the process of a master passing on skills and insights situationally to an apprentice; the master provides models of performance and outcomes, then supports novices as they take their first steps by explaining what they’re doing and why [Coll1991,Casp2007]. The apprentice reflects on their own problem solving, e.g., by thinking aloud or critiquing their own work, and eventually explores problems of their own choosing.

This model tells us that we should have at least a second example when presenting a new idea so that learners can see what to generalize in their schema, and that we should vary the form of the problem to make it clear what are and aren’t superficial features (because learners get hung up on those). We should also induce self-explanation, which is discussed in Section 5.1.

Split Attention

Research by Mayer and colleagues on the split-attention effect is closely related to cognitive load theory [Maye2003]. Linguistic and visual input are processed by different parts of the human brain, and linguistic and visual memories are stored separately as well. This means that correlating linguistic and visual streams of information takes cognitive effort: when someone reads something while hearing it spoken aloud, their brain can’t help but check that it’s getting the same information on both channels.

Learning is therefore more effective when redundant information is not presented simultaneously in two different channels. For example, people find it harder to learn from a video that has both narration and on-screen captions than from one that has either the narration or the captions but not both.

The key word in the previous paragraph is “redundant”. It turns out that it’s more effective to draw a diagram piece by piece while teaching rather than to present the whole thing at once. If parts of the diagram appear at the same time as things are being said, the two will be correlated in the learner’s memory. Pointing at part of the diagram later is then more likely to trigger recall of what was being said when that part was being drawn.

The split-attention effect does not mean that learners shouldn’t try to reconcile multiple incoming streams of information; after all, this is something they have to do in the real world [Atki2000]. Rather, it means that instruction shouldn’t require it while people are mastering unit skills: using multiple sources of information simultaneously should be treated as a separate learning task.

Not All Graphics Are Created Equal

[Sung2012] presents an elegant study that distinguishes seductive graphics (which are highly interesting but not directly relevant to the instructional goal), decorative graphics (which are neutral but not directly relevant to the instructional goal), and instructive graphics (directly relevant to the instructional goal). Students who received any kind of graphic gave significantly higher satisfaction ratings to material than those who didn’t get graphics, but only students who got instructive graphics actually performed better.

Similarly, [Stam2013,Wies2014] found that having more information can actually lower performance. They showed children pictures, pictures and numbers, or just numbers for two tasks: fraction equivalence and fraction addition. For equivalence, having pictures or pictures and numbers outperformed having numbers only. For addition, however, having pictures outperformed pictures and numbers, which outperformed just having numbers.

Pattern Recognition

Section 3.2 said that short-term memory can only store 7 ± 2 items at a time, and recent research has suggested that its actual size might be as low as 4 ± 1 items [Dida2016]. In order to handle larger information sets, our minds create chunks. For example, most of us remember words as single items, rather than as sequences of letters. Similarly, the pattern made by five spots on cards or dice is remembered as a whole rather than as five separate pieces of information.

One key finding in cognition research is that experts have more and larger chunks than non-experts, i.e., experts “see” larger patterns, and have more patterns to match things against. This allows them to reason at a higher level, and to search for information more quickly and more accurately. However, chunking can also mislead us if we mis-identify things: newcomers really can sometimes see things that experts have looked at and missed.

Given how important chunking is to thinking, it is tempting to try to teach patterns directly. In fact, supporting this is one of the reasons programmers have been so enthusiastic about design patterns. In practice, though, most pattern catalogs are too large to flick through and too dry to memorize directly. Giving names to a small number of patterns, though, does seem to help with teaching, primarily by giving the learners a richer vocabulary to think and communicate with [Kuit2004,Byck2005,Saja2006]. We will return to this in Section 7.6.

Minimal Manuals

The most extreme use of cognitive load theory may be the “minimal manual” method introduced in [Carr1987]. Its starting point is a quote from a user: “I want to do something, not learn how to do everything.” Carroll and colleagues therefore redesigned training to present every idea as a single-page self-contained task: a title describing what the page was about, step-by-step instructions for doing something really simple (like deleting a blank line in a text editor), and then several notes on how to recognize and debug common problems.

Carroll and colleagues found that rewriting training materials this way made them shorter overall, and that people using them learned faster. Later studies like [Lazo1993] confirmed that this approach outperformed the traditional approach regardless of prior experience with computers.

Looking back, [Carr2014] summarized this work by saying:

Our “minimalist” designs sought to leverage user initiative and prior knowledge, instead of controlling it through warnings and ordered steps. It emphasized that users typically bring much expertise and insight to this learning, for example, knowledge about the task domain, and that such knowledge could be a resource to instructional designers. Minimalism leveraged episodes of error recognition, diagnosis, and recovery, instead of attempting to merely forestall error. It framed troubleshooting and recovery as learning opportunities instead of as aberrations.

He goes on to say that at the time, instruction decomposed skills into sub-skills hierarchically and then drilled people on the sub-skills. However, this meant context was lost: the goals weren’t apparent until people had learned the pieces. Since people want to dive in and do real tasks, well-designed instruction should help them do that. Interestingly, this follow-up also reports that people progressed more rapidly when the system rejected errors without doing anything (i.e., left them in the pre-error state).

A Final Thought

Cognitive load theory has been criticized as being unfalsifiable: since there’s no way to tell in advance of an experiment whether something is germane or not, any result can be justified after the fact by labelling things that hurt performance as extraneous and things that don’t as germane. However, there is no doubt that instruction based on these principles is effective: for example, [Maso2016] redesigned a conventional introduction to databases course to remove split attention and redundancy effects, and provide worked examples and sub-goals. The new course reduced exam failure rate by 34% on an identical final exam and increased student satisfaction.

Part of the problem is deciding what we mean by “learning”, which turns out to be pretty complicated once you start looking beyond the standardized Western classroom. Within the broad scope of educational psychology, two specific perspectives have primarily influenced my teaching. The first is cognitivism, which focuses on things like pattern recognition, memory formation, and recall. It is good at answering low-level questions, but generally ignores larger issues like, “What do we mean by ‘learning’?” and, “Who gets to decide?” The second is situated learning, which focuses on bringing people into a community, and recognizes that teaching and learning are always rooted in who we are and who we aspire to be. We will discuss it in more detail in Chapter 13.

The Learning Theories website and [Wibu2016] have good summaries of these and other perspectives. Besides cognitivism, those encountered most frequently include behaviorism (which treats education as stimulus/response conditioning), constructivism (which considers learning an active process during which learners construct knowledge for themselves), and connectivism (which emphasizes the social aspects of learning, particularly those made possible by the Internet). It would help if their names were less similar, but setting that aside, none of them can tell us how to teach on their own because in real life, several different teaching methods might be consistent with what we currently know about how learning works. We therefore have to try those methods in the class, with actual learners, in order to find out how well they balance the different forces in play.

Doing this is called instructional design. If educational psychology is the science, instructional design is the engineering. For example, there are good reasons to believe that children will learn how to read best by starting with the sounds of letters and working up to words. However, there are equally good reasons to believe that children will learn best if they are taught to recognize entire simple words like “open” and “stop”, so that they can start using their knowledge sooner.

The first approach is called “phonics”, and the second, “whole language”. The whole language approach may seem upside down, but more than a billion people have learned to read and write Chinese and similar ideogrammatic languages in exactly this way. The only way to tell which approach works best for most children, most of the time, is to try them both out. These studies have to be done carefully, because so many other variables can have an impact on results. For example, the teacher’s enthusiasm for the teaching method may matter more than the method itself, since children will model their teacher’s excitement for a subject. (With all of that taken into account, phonics does seem to be better than other approaches [Foor1998].)

As frustrating as the maybes and howevers in education research are, this kind of painstaking work is essential to dispel myths that can get in the way of better teaching. One well-known myth characterizes learners as visual, auditory, or kinesthetic according to whether they like to see things, hear things, or do things. This scheme is easy to understand, but as [DeBr2015] explains, it is almost certainly false. Unfortunately, that hasn’t stopped a large number of companies from marketing products based on it to parents and school boards.

Similarly, the learning pyramid that shows we remember 10% of what we read, 20% of what we hear, and so on? Myth. The idea that “brain games” can improve our intelligence, or at least slow its decline in old age? Also a myth, as are the claims that the Internet is making us dumber or that young people read less than they used to. Just as we need to clear away our learners’ misconceptions in order to help them learn, we need to clear away our own about teaching if we are to teach more effectively.


Create a Faded Example (pairs/30 minutes)

It’s very common for programs to count how many things fall into different categories: for example, how many times different colors appear in an image, or how many times different words appear in a paragraph of text.

  1. Create a short example (no more than 10 lines of code) that shows people how to do this, and then create a second example that solves a similar problem in a similar way, but has a couple of blanks for learners to fill in. How did you decide what to fade out? What would the next example in the series be?
  2. Define the audience for your examples. For example, are these beginners who only know some basic programming concepts? Or are these learners with some experience in programming but not in Python?
  3. Show your example to a partner, but do not tell them what level it is intended for. Once they have filled in the blanks, ask them what level they think it is for.

If there are people among the trainees who don’t program at all, try to place them in different groups, and have them play the part of learners for those groups. Alternatively, choose a different problem domain and develop a faded example for it.

Create a Parsons Problem (pairs/20 minutes)

Write five or six lines of code that do something useful, jumble them, and ask your partner to put them in order. If you are using an indentation-based language like Python, do not indent any of the lines; if you are using a curly-brace language like Java, do not include any of the curly braces. Again, if your group includes people who aren’t programmers, try using a different problem domain, such as making guacamole.

Minimal Manuals (individual/20 minutes)

Write a one-page guide to doing something simple that your learners might encounter in one of your classes, such as centering text horizontally or printing a number with a certain number of digits after the decimal point. Try to list at least three or four incorrect behaviors or outcomes the learner might see, and include a one- or two-line explanation of why each happens and how to correct it (i.e., go from symptoms to cause to fix).

Critiquing Graphics (individual/15 minutes)

[Maye2009] presents six principles for designing good instructional graphics. As summarized in [Mill2016a], they are:


Signalling:

visually highlight the most important points that you want students to retain so that they stand out from less-critical material.

Spatial contiguity:

if using captions or other text to accompany graphics, place them as close to the graphics as practical to offset the cost of shifting between the two. If using diagrams or animations, place captions right next to relevant components instead of putting them in one big block of text.

Temporal contiguity:

present spoken narration and graphics as close in time as practical—presenting both at once is better than presenting them one after another.


Segmenting:

when presenting a long sequence of material or when students are inexperienced with the subject, break up the presentation into shorter segments and let students control how quickly they advance from one part to the next.


Pre-training:

if students don’t know the major concepts and terminology used in your presentation, set up a module just to teach those concepts and terms and make sure they complete that module beforehand.


Modality:

students learn better from pictures plus audio narration than from pictures plus text, unless there are technical words or symbols, or the students are non-native speakers.

Choose a lesson you have recently taught (or recently been taught) that uses slides or other static presentations, and rate its graphics as “low”, “medium”, or “high” according to these six criteria.

Cognitive Apprenticeship (pairs/15 minutes)

Pick a small coding problem (something you can do in two or three minutes) and think aloud as you work through it while your partner asks questions about what you’re doing and why. As you work, do not just comment on what you’re doing, but also on why you’re doing it, how you know it’s the right thing to do, and what alternatives you’ve considered but discarded. When you are done, swap roles with your partner and repeat the exercise.