This material is in early beta: over 300 suggestions and corrections are waiting to be folded in, some quite significant. Changes should be in place by July 2018, at which time printed copies and downloadable electronic copies will be made available.

Pedagogical Content Knowledge

After reading this chapter, you will be able to:

  • Summarize research reporting how well (or poorly) students are doing in introductory computing classes today.
  • Explain at least three factors that influence how well or how quickly people learn how to program, and give examples of each.
  • Define the term “notional machine” and give an example of one.
  • Summarize shortcomings in the ways that students typically test their software and the pros and cons of using unit tests to grade student programs.
  • Explain when and how to use program visualization to teach introductory programming courses.

FIXME (medium): Caution about limits to research, including Henr2010.

We don’t know as much about how people learn to program as we do about how they learn to read, play a sport, or do basic arithmetic. But we do know some things, and this chapter attempts to sum them up and explain their practical implications.

Most of what this chapter presents comes from studying school children and computer science undergraduates at university, both because those are the populations that researchers have easiest access to, and because those are the ages at which people most often learn to program. (As a reminder of this, we use the word “student” in this chapter instead of “learner”.) Much less is known about how adults learn to program in free-range settings or about people who aren’t intending to be computer scientists, but what we do know is reported here.

[Ihan2016] summarizes the methods most often used to mine and analyze data in these studies. As in all empirical research, it is important to remember that correlation is not causation, and that theories may change as more and better data becomes available.


Like any specialty, computing education research has its jargon. The term CS1 is often used to mean an introductory semester-long programming course in which students meet variables, loops, and functions for the first time, while CS2 refers to a second semester-long course that covers basic data structures like stacks and queues. A CS1 course is often useful for undergraduates in other disciplines, though as we will discuss below, it’s more effective if it uses relevant examples. A CS2 course designed for computer science students is usually less relevant for artists, ecologists, and other end-user programmers, but is sometimes the only next step available.

What’s Our Baseline?

How hard is it to learn to program? It’s easy to ask students how much they have learned—universities do it all the time—but study after study has shown that students’ teaching evaluations don’t correlate with actual learning outcomes [Star2014]. Instead, we have to turn to data.

A decade apart, [Benn2007] and [Wats2014] sought to answer this question by looking at how many students pass their first computer science course. They got very similar answers: two-thirds of post-secondary students pass their introductory course, with some variations depending on class size and so on. There were no significant differences over time or based on language (although failure rates were highest for courses using C and C++).

The more important question is, How well are they learning? [McCr2001] was a multi-site international study of how well students can program after their introductory course, replicated more than a decade later by [Utti2013]. The original study reported, “the disappointing results suggest that many students do not know how to program at the conclusion of their introductory courses.” More specifically, “For a combined sample of 216 students from four universities, the average score was 22.89 out of 110 points on the general evaluation criteria developed for this study.”

Switching domains, [Park2015] collected data from an online HTML editor during an introductory web page development course. Nearly all students made syntax errors that remained unresolved weeks into the course. 20% of these errors related to the relatively complex rules that dictate when it is valid for HTML elements to be nested in one another, while 35% related to the simpler tag syntax determining how HTML elements are nested.

How much does prior experience matter? [Wilc2018] compared the performance and confidence of students with and without prior programming experience in CS1 and CS2. They found that students with prior experience outscored students without by 6% on exams and 10% in CS1, but those differences disappeared by the end of CS2. Female students with prior exposure outperformed their male peers in all areas, but were consistently less confident in their abilities.

How do newcomers think about programming? [Simo2006] asked students in introductory CS and economics classes to explain how they would sort a list of numbers. The majority of the CS students could describe a plausible algorithm, while less than a third of other students could. However, fewer CS students provided a correct answer after their first course because they were trying to put in too many code-level details.

Are they mastering concepts or just mechanics? [Muhl2016] analyzed 350 concept maps drawn by students and compared those who had done a CS course and those who had not, and found that the maps drawn by those who had looked more like the maps experts would draw. For example, “program” was a central concept in both sets of concept maps, but the next most central concepts for those with prior CS exposure were “class” and “data structure”, while for those without, they were “processor” and “data”.

What Do Students Misunderstand?

The biggest misconception novices have—sometimes called the “superbug” in coding—is the belief that the computer understands intention the way that a human being would [Pea1986]. As paradoxical as it sounds, it’s crucial to teach people that programs are meaningless, i.e., that calling a variable “cost” doesn’t guarantee that it actually contains a cost.

A short and immediately applicable summary of novice misconceptions is [Sorv2018]. It presents over 40 specific misconceptions, many of which are also discussed in [Qian2017]’s lengthier survey. One common misconception is the belief that variables in programs work the same way they do in spreadsheets, i.e., that after executing:

grade = 65
total = grade + 10
grade = 80

the value of total will be 90 rather than 75 [Kohn2017]. This is an example of the way in which novices construct a plausible-but-wrong mental model by making analogies (Chapter 2), so lessons should include formative assessments early on to detect and correct this.
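
One simple formative check is to ask students to predict the final values before running the code. The Python sketch below reruns the same three statements and prints both variables; the prediction exercise is an illustration of the idea, not a procedure taken from [Kohn2017].

```python
# The same three statements as above, followed by a look at each variable.
grade = 65
total = grade + 10   # total is computed once, from the value grade has right now
grade = 80           # rebinding grade does not recompute total

print(total)  # 75, not 90
print(grade)  # 80
```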

Another misconception is the belief that a program’s correctness is the sum of the correctness of its parts. [Koli2008] found that novices rarely consider programs absolutely incorrect, but are instead much more likely to consider them “partially correct” if they have any correct operations. This is probably a result of thinking in terms of grading schemes, and again, exercises should be given early to show that a program is only as correct as its least correct part.

Other misconceptions that should be addressed early include:

  • A variable holds the history of all the values it has been assigned.
  • Two objects with the same value for a name or id attribute are guaranteed to be the same object.
  • Functions are executed as they are defined, or are executed in the order in which they are defined.
  • A while loop’s condition is constantly evaluated, and the loop stops as soon as it becomes false. Conversely, the conditions in if statements are also constantly evaluated, and their statements are executed as soon as the condition becomes true, no matter where the flow of control is at the time.
  • Assignment moves values, i.e., after a = b, the variable b is empty.
  • Pretty much anything to do with variable scope.
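
Several of these can be turned into short predict-the-output exercises. The Python sketch below is one possibility covering two of the items above; the variable names are invented for the example.

```python
# Misconception: a while loop stops the instant its condition becomes false.
# In fact the condition is only checked at the top of each iteration.
count = 0
seen = []
while count < 3:
    count = 5            # the condition is already false here...
    seen.append(count)   # ...but the rest of the body still runs
print(seen)              # [5]

# Misconception: assignment moves values, leaving the source empty.
b = 42
a = b
print(a, b)              # 42 42
```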

[Qian2017] lists factors that contribute to these misconceptions, including:

  • Task complexity and cognitive load.
  • Students being confused by the special technical meaning of jargon terms (such as “or” meaning “either or both” rather than “one or the other”).
  • Variable assignment looking like an algebraic expression (although [Alta2015] found this to be less important than many educators think).
  • Inadequate patterns and strategies, characterized by phrases like, “I don’t know where to start” or “I can’t think at that level of abstraction”.
  • Confusing syntax, such as using + for addition and concatenation.
  • Teachers’ explanations. For example, saying that a variable is like a box may imply that many things can be put in it.

Some of their recommendations for fixing this will be familiar by now: present lots of examples, teach programming strategies explicitly, and provide tools for visualizing program execution.

What Are We Teaching Them Now?

What topics do introductory courses cover? [Luxt2017] surveyed the topics included in a variety of introductory programming courses, and analyzed those courses’ assessments to see which bits of syntax and semantics students actually had to master. After identifying and counting dozens of individual topics, they broke them down into a dozen categories:

Topic                            Number of Courses (%)
Programming Process              90 (87%)
Abstract Programming Thinking    65 (63%)
Data Structures                  41 (40%)
Object-Oriented Concepts         37 (36%)
Control Structures               34 (33%)
Operations & Functions           27 (26%)
Data Types                       24 (23%)
Input/Output                     18 (17%)
Libraries                        15 (15%)
Variables & Assignment           14 (14%)
Recursion                        10 (10%)
Pointers & Memory Management     5 (5%)

But this paper does more than catalog concepts: it presents dependency graphs showing how they are connected. For example, it’s impossible to explain how operator precedence works without first explaining a few operators, and hard to explain those in a meaningful way without first introducing variables (because otherwise you’re comparing constants in expressions like 5<3, which is confusing).

Similarly, [Rich2017] reviewed a hundred articles to find learning trajectories for computing classes in elementary and middle schools, and presented results for sequencing, repetition, and conditionals. These are essentially collective concept maps, as they combine and rationalize the implicit and explicit thinking of many different educators.

FIXME (medium): reproduce Rich2017 diagrams

Do computing courses have to include programming? No. [Shel2017] reports that having students work in small groups on computational creativity exercises improves grades at several levels. Some of these exercises include:

  • Identify an everyday object (such as a nail clipper, a paper clip, or Scotch tape) and describe the object in terms of its inputs, outputs and functions.
  • Devise a three-step encoding scheme to transfer the alphabet letters into digits and encode questions for other teams to compete to decode.
  • Design a calendar for a planet with two suns, four different cultural groups with different resource constraints and industrial needs.

Each exercise comes with an explicit list of insights (which the authors call “light bulb moments”) and some prompts for reflection after the exercise is done, such as:

  • Did your group have to redesign segments in order to meet the testing requirements?
  • If so, identify the reasons for why the initial designs failed; if not, identify the reasons why the initial designs succeeded.
  • What considerations were (or should have been) considered during the initial designs in order to meet the testing requirements?

As Silicon Valley finally (grudgingly) acknowledges that insight and reasoning ability matter more than mastery of obscure technical details for programmers at all stages of their careers, non-coding work like this will become more important.

Do Languages Matter?

The short answer is “yes”: novices learn to program faster using blocks-based tools like Scratch that make syntax errors impossible. Scratch’s interface also encourages exploration in a way that typing text does not; like all good tools, it can be learned accidentally.

Scratch works well because it has been designed and refined over many years with usability and learnability as its primary goals [Malo2010]. We know a great deal about how it is used and misused; for example, [Aiva2016] analyzed over 250,000 Scratch projects and found that:

  • Most are small, but it’s hard to say whether this is because the authors are young or because of the medium.
  • There is very little use of abstractions like custom blocks (the equivalent of procedures), but programs that use them at all tend to use a fair number of them (11 to 12), which could be a signal that they are being used by people who have already mastered abstraction. (The fact that most procedures are only called once may be another sign of this.)
  • Simple “stack” blocks (i.e., linear control flow) are by far the most common. Cloned scripts within a single project are also common, and about 28% of projects have some unreachable code (i.e., something that will never run because it’s never called or triggered). The authors hypothesize that users may be using them as a scratchpad to keep bits of code they don’t (yet) want to throw away.

[Mlad2017] studied 207 students learning about loops in Scratch, Logo, and Python, and found that misconceptions about loops are minimized when using a block-based language rather than a text-based language. What’s more, as tasks become more complex (such as using nested loops) the differences become larger.

Blocks are not a panacea. [Grov2017] studied 100 middle-school children and found that it was easier for them to assemble programs with blocks than with text. However, hard concepts are still hard. Two in particular remained difficult: repeat-until loops with variables that change value inside the loop (rather than the loop doing exactly the same thing each time), and the logical operator or (which is often interpreted as “one or the other” rather than “one or both”).

What about the transition from blocks to text? [Wein2017] studied students using a tool that allowed them to switch between blocks and text for programming. They found that students tend to migrate from blocks to text over time, but there are interesting exceptions. In two-thirds of the cases where learners shifted from text to blocks, their next action was to add a new type of command; this may be because browsing available commands is easier in blocks mode, or because syntax errors with unfamiliar new commands are not possible. Learners also shifted from text to blocks when adding complex control (e.g., an if with an else), either because syntax errors are harder to make in blocks mode, or because the flow of control is immediately visible. The authors say, “While it is often claimed that blocks-based programming environments offer the advantage of reducing syntax errors, our findings suggest that blocks also offer information about what is possible in the space and provide a low-stakes means of exploring unfamiliar code.”

What about explicit typing? Programmers argue a lot about whether variables’ data types should have to be declared or not. One recent non-educational finding is [Gao2017], which selected fixed bugs from public JavaScript projects, checked out the code from version control just prior to the fix, manually added type annotations to the buggy code, and then tested whether strongly-typed variants of JavaScript reported an error. They found that about 15% of bugs are caught, which is either high or low depending on what answer you wanted in the first place.

However, programming and learning to program are different activities, and results from the former don’t necessarily apply to the latter. [Endr2014] and other studies show that declared types do add some complexity to programs, but it pays off fairly quickly by acting as documentation hints for a method’s use, in particular by preventing questions about what we have and what we can do with it.

What about object-oriented programming? Objects and classes are power tools for experienced programmers, but power tools aren’t always suitable for beginners. [Mill2016b] found that most students had difficulty with the self object in Python (which refers to “this object”): they omitted it in method definitions, failed to use it when referencing object attributes, or both. Object reference errors were also more common than other errors; the authors speculate that this is partly due to the difference in syntax between obj.method(param) and def method(self, param).
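
The asymmetry between definition and call is easy to see in a small example. The following Python sketch uses invented class names (Counter and BrokenCounter) and is not taken from [Mill2016b]; it shows a correct method next to the "omitted self" mistake described above.

```python
class Counter:
    def __init__(self):
        self.count = 0           # attributes are reached through self

    def add(self, amount):       # self is listed explicitly in the definition...
        self.count += amount
        return self.count


c = Counter()
c.add(3)                         # ...but is not written in the call
print(c.count)                   # 3


class BrokenCounter:
    def __init__(self):
        self.count = 0

    def add(amount):             # mistake: self omitted from the definition
        return amount


# Python still passes the object as the first argument, so the call fails.
raised = False
try:
    BrokenCounter().add(3)
except TypeError:
    raised = True
print(raised)                    # True
```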

[Rago2017] found something similar in a study of 86 high school students. Only 45% of students understood when to use this, only 60% understood when not to, only 24% could define it clearly, and these figures probably overestimate understanding because respondents might simply be reciting memorized answers. They also looked at 48 high school teachers, and drily observe that they, “expressed a considerable lack of clarity in accurately characterizing the correctness of students’ answers.” And [Mill2014] found that novice programmers often refer to an object when they mean an attribute or property of that object or vice versa. Students seem to make errors of this kind more often for identifying attributes (like “Canada”) than for descriptive attributes (like “Canadian”), probably because they think of nouns and adjectives as things and pointers to things respectively.

Does it have to hurt this much? No. [Stef2013] has shown that programming language designers needlessly make programming languages harder to learn by not doing basic usability testing. For example, “the three most common words for looping in computer science, for, while, and foreach, were rated as the three most unintuitive choices by non-programmers.” More fundamentally, their work shows that C-style syntax (as used in Java and Perl) is just as hard for novices to learn as a randomly-designed syntax, but that the syntax of other languages such as Python and Ruby is significantly easier to learn, and the syntax of their own language, Quorum, is easier still, because they are testing each new feature before adding it to the language.

Is it going to get better? Not any time soon. [Pere2013] compared Git’s actual operation with its users’ conceptual model, highlighting and explaining the many errors and confusion that result from the differences. [Pere2016] then used that work to design a more user-friendly alternative to Git. The result? “In sharing our research with colleagues we have discovered a significant polarization. Experts, who are deeply familiar with the product, have learned its many intricacies, developed complex, customized workflows, and regularly exploit its most elaborate features, are often defensive and resistant to the suggestion that the design has flaws. In contrast, less intensive users, who have given up on understanding the product, and rely on only a handful of memorized commands, are so frustrated by their experience that an analysis like ours seems to them belaboring the obvious.”

Does Variable Naming Style Matter?

[Kern1999] says, “Programmers are often encouraged to use long variable names regardless of context. This is a mistake: clarity is often achieved through brevity.” Lots of programmers believe this, but is it true? Early studies like [Lawr2006] tried to find out, but didn’t distinguish between short and long names. More recently, [Hofm2017] found that using full words in variable names led to an average of 19% faster comprehension compared to letters and abbreviations, with no significant difference in speed between single letters and abbreviations, but didn’t look at which names were abbreviated.

For that, we have to turn to [Beni2017], which found that using single-letter variable names doesn’t affect novice programmers’ ability to modify code. This may be because novices’ programs are shorter than professionals’, but it may also be because some single-letter variable names have implicit types and meanings: most programmers assume i, j, and n are integers, and s is a string, while x, y, and z are either floating-point numbers or integers more or less equally. (e doesn’t have a strong implicit meaning, but “exception” wasn’t one of the options in the study.)

How important is this? [Bink2012] reported a series of studies that found that reading and understanding code is fundamentally different from reading prose: “the more formal structure and syntax of source code allows programmers to assimilate and comprehend parts of the code quite rapidly independent of style. In particular, beacons and program plans play a large role in comprehension.” It also found that experienced developers are relatively unaffected by identifier style (although again, they didn’t explore which variables), and that beginners found CamelCase easier to read than pothole_case. This is surprising because word spacing improves readability in conventional tasks. Digging deeper, “camel casing produces more accurate results. However, this correctness comes at a cost as the camel-case style significantly increases the time needed to correctly detect the correct identifier.”

More recently, [Floy2017] reports an fMRI study of 29 people that found that the same brain regions are involved in reading code and prose, and that while the two are distinct activities, they become more similar as people develop expertise.

How Do Students Program?

[Solo1984,Solo1986] pioneered the exploration of novice and expert programming strategies. The key finding is that experts have both the ability to plan a program and sufficient syntactic knowledge to implement it. Novices lack both, but we often mistakenly focus on gaps in the latter. For example, bugs are often related to planning errors (i.e., lack of a strategy for solving the problem) rather than to lack of knowledge about the language. Teachers should therefore emphasize the “how” of program construction as much as the “what”. Having lots of plans or goals when programming isn’t always a good thing—as [Spoh1985] found, merging plans and/or goals can yield bugs because of goals being dropped or fragmented—but not having plans is always harmful.

Harder or Less Familiar?

[Solo1986] introduced the Rainfall Problem, which is simple to state: write a program that repeatedly reads in positive integers until it reads the integer 99999. After seeing 99999, the program should print out the average of the numbers seen.

The Rainfall Problem has been used in many subsequent studies of programming. For example, [Fisl2014] found that students made fewer low-level errors when solving the problem in a pure functional language, but [Sepp2015] still found that success rates were disappointingly low: very few studies found even half of students able to solve it correctly. However, [Simo2013] argues that the Rainfall Problem is harder for novices than it used to be because they’re not used to handling keyboard input, and “run until you see a sentinel” isn’t a pattern today’s novice programmers are familiar with. Direct comparison with past cohorts may therefore be unfair.
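
For reference, here is one possible solution to the problem as stated above, sketched in Python. To keep it testable, the hypothetical function rainfall_average takes any iterable of integers instead of reading from the keyboard, and returning None when no values precede the sentinel is an assumption, since the problem statement does not say what should happen in that case.

```python
SENTINEL = 99999


def rainfall_average(readings):
    """Average the values that appear before the sentinel.

    `readings` is any iterable of integers, standing in for keyboard input.
    Returns None if no values precede the sentinel (an assumption: the
    problem statement does not cover this case).
    """
    total = 0
    count = 0
    for value in readings:
        if value == SENTINEL:
            break            # "run until you see a sentinel"
        total += value
        count += 1
    if count == 0:
        return None          # avoid dividing by zero
    return total / count


print(rainfall_average([10, 20, 30, 99999, 40]))  # 20.0
```

Even this short version combines several plans (reading until a sentinel, summing, counting, and guarding the division), which is part of why the problem is a useful probe.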

The most important recommendation in this chapter is therefore to teach solution patterns, i.e., show learners over and over again how to tackle problems. [Mull2007b] is just one of many studies demonstrating the benefits of this.

One of the most useful (and sadly under-used) pieces of work for describing programming plans to novices is the set of single-variable design patterns in [Kuit2004,Byck2005,Saja2006]. Consistent with everything we know about worked examples and subgoals, they found that labelling the parts of students’ programs gave students a vocabulary to think with, and implicitly a set of programming plans for constructing code of their own. Their patterns are:

Fixed value:

A data item that does not get a new proper value after its initialization.

Stepper:

A data item stepping through a systematic, predictable succession of values.

Walker:

A data item traversing in a data structure.

Most-recent holder:

A data item holding the latest value encountered in going through a succession of unpredictable values, or simply the latest value obtained as input.

Most-wanted holder:

A data item holding the best or otherwise most appropriate value encountered so far.

Gatherer:

A data item accumulating the effect of individual values.

Follower:

A data item that gets its new value always from the old value of some other data item.

One-way flag:

A two-valued data item that cannot get its initial value once the value has been changed.

Temporary:

A data item holding some value for a very short time only.

Organizer:

A data structure storing elements that can be rearranged.

Container:

A data structure storing elements that can be added and removed.
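
To make a few of these concrete, the Python sketch below labels the variables in a short function using the definitions above. The function summarize and its variable names are invented for illustration and do not come from the cited papers.

```python
def summarize(scores):
    """Return (highest, total, any_negative) for a list of numbers."""
    limit = len(scores)      # fixed value: never reassigned after initialization
    highest = None           # most-wanted holder: best value encountered so far
    total = 0                # accumulates the effect of individual values
    any_negative = False     # one-way flag: once True, it never goes back
    for index in range(limit):   # index steps through a predictable succession
        latest = scores[index]   # most-recent holder: latest value encountered
        if highest is None or latest > highest:
            highest = latest
        total += latest
        if latest < 0:
            any_negative = True
    return highest, total, any_negative


print(summarize([3, -1, 7, 2]))  # (7, 11, True)
```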

The Roles of Variables website has examples of all of these, and I use these terms frequently in my own teaching. (I also wonder what a tool like Scratch would look like if users had to create variables with these roles, rather than plain variables.)

Does step size matter? Maybe. [Blik2014] found that, “more experienced students were more likely to adopt an incremental coding strategy (trying to debug and advance their code without external help through myriad trial-and-error attempts), whereas novices would update their code in larger batches, copying and adapting code from sample programs and other external sources.” However, when they looked at whether the amount of tinkering correlated with course performance, the answer was negative despite repeated re-slicing of the data to try to find an effect. They then looked at changes in the frequency of updates rather than update size, hypothesizing that low-performing students wouldn’t change their update patterns but high-performing students would, and again struck out.

On the other hand, [Cart2017] showed that students at different levels approach programming tasks differently, and that these differences can be detected automatically. Their model categorizes student activity in a two-dimensional space where one axis is the student’s current activity (e.g., editing or debugging) and the other is the correctness of the student’s most recently compiled program. This gives states like “editing syntactically correct code, last debug successful”, and allows them to construct activity sequences like, “Running a semantically incorrect program outside of debug mode.” The authors caution that a given sequence of state transitions could correspond to several different problem-solving activities, but found that high-performing students spent a lot of time in testing modes, while low-performing students spent much more time working on code with errors.

[Kaze2017] analyzed character-level edit and execution data from participants in an undergraduate course to see if incremental development and procrastination correlate with solution correctness, completion time, or total work time. Students who started editing earlier were more likely to submit their projects earlier and to earn higher scores for correctness, and starting to write tests earlier was also associated with higher correctness scores. However, the authors found no significant relationship between incremental test writing or incremental checking of work and higher scores.

Does order matter? Probably. [Ihan2011] describes a tool for 2D Parsons Problems (i.e., ones in which code can be dragged horizontally as well as vertically). They found that experienced programmers often drag the method signature to the beginning, then add the majority of the control flow (i.e., loop statements, assignments, conditional statements), and only then add details like variable initialization and handling of corner cases. This out-of-order authoring is foreign to novices, who read and write code in the order it’s presented on the page; one of the benefits of live coding (Section 8.3) is that it gives them a chance to see the sequence that more advanced programmers actually use.

What Mistakes Do Learners Make?

The short answer is, “Teachers don’t know as much as many of us think we do.” [Brow2014] looked at eighteen types of errors, from mismatched parentheses to discarding the result of a non-void method. They found that “educators formed only a weak consensus about which mistakes are most frequent, that their rankings bore only a moderate correspondence to the students in the data, and that educators’ experience had no effect on this level of agreement.” For example, mistaking = (assignment) and == (equality) in loop condition tests wasn’t nearly as common as most teachers believed.

[Alta2015] then looked at what errors novices actually make in Java. Unsurprisingly, mistakes that produce compiler errors are fixed much faster than ones that don’t. Mismatched quotes and parentheses are the most common type of error, but also the easiest to fix, while some mistakes (like putting the condition of an if in {} instead of ()) are most often made only once. However, some mistakes are made many times, like invoking methods with the wrong arguments (e.g., passing a string instead of an integer). Interestingly, another common error is reaching the end of a non-void method without returning a value. Python and some other languages permit this (Python returns None if no value is explicitly specified); so far as I know, nobody has done a study to see if this default behavior masks faults in novices’ mental models.
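
The Python behavior mentioned above is easy to demonstrate. In this minimal sketch (the function absolute_value is an invented example, not one from [Alta2015]), one path through the function forgets to return a value, and the program runs without complaint.

```python
def absolute_value(number):
    if number < 0:
        return -number
    # Bug: the non-negative case falls off the end of the function,
    # so Python silently returns None instead of reporting an error.


print(absolute_value(-4))  # 4
print(absolute_value(4))   # None
```

The equivalent Java method would be rejected by the compiler, which is the kind of early feedback the paragraph above says leads to faster fixes.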

These findings aren’t specific to any particular language: [Herm2016] gave 61 novice Scratch programmers a comprehension task, and found that students working with smelly code did not take more time to solve problems, but had a lower correctness rate. Similarly, [Keun2017] looked at code quality issues in students’ Java programs using a subset of [Steg2016a]’s rubric for code quality. They found that students usually don’t fix issues, particularly issues related to modularization. One caution from their work is how important it is to distinguish mistakes from work in progress: for example, an empty if statement or a method that’s defined but not yet used may be a sign of incomplete code rather than an error.

[Edwa2017] studied nearly 10 million static analysis errors in over 500,000 program submissions—things like checking for null after a pointer is used instead of before. They found that formatting and Javadoc issues are the most common, and that coding flaws at any point when developing a solution resulted in significantly lower scores on the assignment. Students produce fewer errors with experience, but the errors that are most frequent are consistent between both computer science majors and non-majors and across experience levels.

How Do Students Test and Debug?

A decade ago, [McCa2008] wrote, “It is surprising how little page space is devoted to bugs and debugging in most introductory programming textbooks.” They describe many reasons for bugs that can be demonstrated in class or checked with formative assessment exercises, including:

  • Using the natural-language meaning of terms, like or meaning “either one or the other” rather than “either or both”.
  • Off-by-one errors when counting or indexing, e.g., looking up from the last element of a list, or down from the first.
  • Putting code in the wrong place, such as placing a print statement inside a loop when it should only be executed once after the loop finishes.
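
Each of these can be demonstrated in a few lines. The Python sketch below is an illustration with invented variable names rather than an example taken from [McCa2008].

```python
# Natural-language "or": in Python, `or` means "either or both".
has_ticket = True
has_pass = True
may_enter = has_ticket or has_pass      # True even though both are True

# Off-by-one: starting the index at 1 silently skips the first element.
values = [4, 8, 15]
skipped_first = [values[i] for i in range(1, len(values))]   # [8, 15]
all_items = [values[i] for i in range(len(values))]          # [4, 8, 15]

# Code in the wrong place: recording a result inside the loop
# instead of once after it finishes.
reports = []
running = 0
for v in values:
    running += v
    reports.append(running)     # runs three times: [4, 12, 27]
final_report = running          # runs once, after the loop: 27
```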

(How) do students debug? [Fitz2008,Murp2008] looked at how undergraduate students debugged their code. Most students who were good debuggers were good programmers, but not all good programmers were good at debugging. Those who were good at debugging traced execution, wrote tests, re-read the spec, and used a debugger. However, tracing was sometimes used ineffectively: for example, a student might put the same print statement in both parts of an if-else. Students would also comment out lines that were actually correct in an attempt to isolate the problem, and didn’t seem to realize when they were stuck.

(How well) do students test their code? [Bria2015] describes a tool that scores a student’s program by how many teacher-provided test cases pass, and conversely scores the student’s test cases by how many deliberately seeded bugs in a model solution they catch. They found that students’ tests often have low coverage (i.e., they don’t test most of the code) and that students misunderstand the “unit” part of unit tests: their tests often exercise many things at once, which makes it hard to pinpoint the causes of errors.

[Edwa2014b] dug a little deeper, with sobering results. They had students write their own software tests, which were graded in part on branch coverage (i.e., how many of the possible paths through the code their tests exercised). They then looked at all of the bugs in all of the students’ code submissions combined and identified those detected by each student-written test suite. The result was that students’ tests had an average of 95.4% branch coverage on their own code, but only detected an average of 13.6% of the faults present in the entire program population. What’s more, 90% of the students’ tests were very similar, which indicates that students mostly write tests to confirm that code is doing what it’s supposed to rather than to uncover situations where it isn’t.
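
The gap between coverage and fault detection that [Edwa2014b] reports is easy to reproduce in miniature. The sketch below is an invented example, not material from the study: `classify`, its seeded fault, and both test functions are hypothetical.

```python
def classify(score):
    """Intended behavior: 'pass' for scores of 50 or more, otherwise 'fail'."""
    if score > 50:          # seeded fault: the comparison should be >=
        return "pass"
    return "fail"

def classify_fixed(score):
    """The same function without the seeded fault."""
    if score >= 50:
        return "pass"
    return "fail"

def confirming_tests(func):
    # These two checks execute both branches (full branch coverage),
    # but they only confirm behavior far from the boundary.
    return func(80) == "pass" and func(20) == "fail"

def probing_tests(func):
    # Adding a check at the boundary is what exposes the fault.
    return confirming_tests(func) and func(50) == "pass"
```

Here `confirming_tests` accepts both versions of the function, while `probing_tests` rejects the faulty one. Showing learners a pair like this is one way to make the difference between “my tests run all the code” and “my tests look for mistakes” concrete.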

[Alqa2017] collected data from 142 novices doing their second programming course to categorize their debugging activities. They picked eight bugs, such as accidentally creating a loop without a body (because of a misplaced semicolon) or an array bounds error, and then created one program for each bug that contained only that bug, and that could be fixed by modifying a single line. Unsurprisingly, students with more experience solved the problems significantly faster, but times varied widely. Overall times ranged from 4.5 minutes to two hours, and 4–10 minutes is a typical range, which means that some learners will need 2–3 times longer than others to get through the same material.

Multiple studies have shown that reading code is the most effective way to find bugs [Basi1987,Keme2009,Bacc2013]. The value of tracing code has been studied less often for professional programmers, but it is a key skill for learners. [List2004] tested students’ ability to predict the output of short pieces of code and to select the correct completion of the code from a set of possibilities when told what it was supposed to do. Many students were weak at these tasks, which suggests that not being able to trace program execution is part of the explanation for poor programming ability.

[List2009] returned to this subject, and found once again that students who perform well at writing code can usually also trace and explain code. [Harr2018] later found that the gap between being able to trace code and being able to write it has largely closed by CS2, but that students who still have a gap (in either direction) are likely to do poorly in the course.

[Chi1989] found that some learners simply halt when they hit an unexplained step (or a step whose explanation they don’t understand) when doing mechanics problems in a physics class. Others pause their “execution” of the example to generate an explanation of what’s going on, and crucially, these people learn faster. This suggests that as well as asking learners to trace the execution of programs, we should show them programs’ output and ask them to explain why it is what it is.

Do Error Messages Matter?

The answer to the question in the section title is “yes”. [Hugh2010] traced student responses to non-literal errors, such as reports of missing parentheses that are in fact caused by missing + for string concatenation. They found that 8% of compilation errors and 100% of runtime exceptions in novices’ first Java programs were caused by string formatting problems exacerbated by poor error reports about non-literal errors. The problem is that misleading error messages erode novices’ trust: poor error messages are effectively behavioral conditioning to ignore all error messages. A common response from students is to move on and revisit the problem later, which is occasionally effective.

Can we do better? [Beck2016] tried writing better error messages for the Java compiler, so that instead of:

C:\stj\ error: cannot find symbol
        public static void main(string[ ] args){
1 error
Process terminated ... there were problems.

students would see:

Looks like a problem on line number 2.
If "string" refers to a datatype, capitalize the 's'!

Novices given these enhanced messages made fewer errors overall and fewer repeated errors.
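
To make the idea concrete, here is a minimal sketch of a message rewriter in Python. It is not how the tool in [Beck2016] works: the `HINTS` table, the `enhance` function, and the matching rule are all assumptions made for this illustration, and only the wording of the friendlier message comes from the example above.

```python
import re

# Hypothetical table pairing a pattern in the offending source line with a hint.
HINTS = [
    (re.compile(r"\bstring\b"),
     'If "string" refers to a datatype, capitalize the \'s\'!'),
]

def enhance(raw_message, source_line, line_number):
    """Return a friendlier message, or the raw one if no hint applies."""
    if "cannot find symbol" in raw_message:
        for pattern, hint in HINTS:
            if pattern.search(source_line):
                return (f"Looks like a problem on line number {line_number}.\n"
                        f"{hint}")
    # Fall back to the compiler's own wording rather than guessing.
    return raw_message
```

Falling back to the original message when no rule matches is a deliberate choice in this sketch: a wrong hint would be another misleading error message.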

[Beck2018b] measured the effect of enhanced compiler error messages by having students remove syntax errors from non-compiling code they did not write. They found a significant positive effect on the overall number of errors rectified, as well as the number of certain specific error types, but no significant effect on the number of non-compiling submissions or student scores. These results suggest that the apparently contradictory findings of other recent studies are not actually in conflict: enhanced error messages may be effective, but the signal is relatively weak.

[Bari2017] went further and used eye tracking to show that developers really do read error messages—people spend 13–25% of their time on task doing this. Doing so is as difficult as reading source code, and how difficult it is to read the error messages strongly predicts task performance. The inescapable conclusion is that it really is important to get this right, and that error messages should be usability tested the same way as any other interface.

Since interpreting and responding to error messages is important, instructors should give learners exercises that do just that. [Marc2011] has a rubric for responses to error messages that can be useful in grading such assignments:

  1. Learner deletes the problematic code wholesale.
  2. Learner makes a change that is unrelated to the error message and does not help.
  3. Learner makes a change that is unrelated to the error message but correctly addresses a different error or makes progress in some other way.
  4. Learner’s change shows that they have understood the error message (though perhaps not wholly) and is trying to take an appropriate action (though perhaps not well).
  5. Learner’s change fixes the actual error (though other errors might remain).

[Sirk2012] catalogued the errors that students made when using a simple execution visualization tool, several of which echo the findings of [Mill2014,Mill2016b,Rago2017] discussed above. These mistakes can all motivate the design of particular exercises, and are presented in one of the challenges below.

Does Program Visualization Matter?

The idea of visualizing programs is perennially popular, but that doesn’t mean it’s effective. We have known for over 20 years that people learn more from constructing visualizations of algorithms than they do from viewing visualizations constructed by others [Stas1998,Ceti2016]. That said, [Guo2013] (a web-based tool for visualizing the execution of Python programs) and Loupe (which shows how JavaScript’s event loop works) are both great teaching aids.

Does drawing pictures help learners understand programs? Yes. [Cunn2017] replicates an earlier study of the kinds of sketching students do when tracing code execution, and of how the different kinds correlate with success. Different rates of sketching for different problems indicated that students use it to externalize cognition; as the study says, “Students frequently re-wrote pieces of code from questions on their scratch sheets, perhaps to manage issues related to the split attention effect.” Not sketching at all correlates with lower success, while tracing changes to variables’ values by writing new values near their names as they change was the most effective strategy.

One possible confounding effect they checked was time: since sketchers take significantly more time to solve problems, do they do better just because they think for longer? The answer is no: there was no correlation between the time taken and the score achieved. This is once again the kind of study that should lead to tool improvements: to the best of my knowledge, nobody provides a debugger that shows variables in rows with successive values laid out in columns using the horizontal axis for time.
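
A rough version of such a display can be put together with Python’s standard `sys.settrace` hook. The sketch below is only a classroom-scale illustration, not a real debugger: `trace_variables`, `format_table`, and `total_up` are names invented here, and the table simply prints one row per variable with one column per recorded step.

```python
import sys

def trace_variables(func, *args):
    """Run func(*args), recording a snapshot of its local variables at each step."""
    history = []

    def tracer(frame, event, arg):
        # Only record events from the function being studied, not from its callees.
        if frame.f_code is func.__code__ and event in ("line", "return"):
            history.append(dict(frame.f_locals))
        return tracer

    previous = sys.gettrace()
    sys.settrace(tracer)
    try:
        func(*args)
    finally:
        sys.settrace(previous)      # restore whatever tracing was active before
    return history

def format_table(history):
    """One row per variable; successive values run left to right (time)."""
    names = []
    for snapshot in history:
        for name in snapshot:
            if name not in names:
                names.append(name)
    width = max((len(n) for n in names), default=0)
    rows = []
    for name in names:
        cells = [str(snapshot.get(name, "")) for snapshot in history]
        rows.append(name.ljust(width) + " | " + " | ".join(cells))
    return "\n".join(rows)

def total_up(values):
    total = 0
    for v in values:
        total += v
    return total
```

Calling `print(format_table(trace_variables(total_up, [1, 2])))` produces a table much like the one a learner would sketch by hand, with empty cells where a variable does not exist yet.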


One often-overlooked result from [Scan1989] is that students understand flowcharts better than pseudocode if both are equally well structured. Earlier work showing that pseudocode outperformed flowcharts used structured pseudocode and tangled flowcharts; when the playing field was levelled, novices did better with the graphical representation.

How Can We Help Them?

[Viha2014] examined the average improvement in pass rates of various kinds of intervention in programming classes. As they themselves point out, there are many reasons to take their findings with a grain of salt: the pre-change teaching practices are rarely stated clearly, the quality of change is not judged, and only 8.3% of studies reported negative findings, so either there is positive reporting bias or the way we’re teaching right now is almost the worst way possible and anything would be an improvement. It’s also worth remembering that like almost all of the studies discussed in this chapter, they were only looking at university classes: their findings may not generalize to other groups.

With all those caveats in mind, they found ten things instructors can do to improve outcomes. (The figures after each intervention are the number of studies reported and the average improvement they report.)

Collaboration (20/34%):

Activities that encourage student collaboration either in classrooms or labs.

Content Change (36/34%):

At least part of the teaching material was changed or updated.

Contextualization (17/40%):

Activities where course content and activities were aligned towards a specific context such as games or media.

CS0 (7/43%):

Creation of a preliminary course to be taken before the introductory programming course; could be organized only for some (e.g., at-risk) students.

Game Theme (9/18%):

A game-themed component was introduced to the course.

Grading Scheme (11/29%):

A change in the grading scheme; the most common change was to increase the number of points awarded for programming activities, while reducing the weight of the course exam.

Group Work (7/45%):

Activities with increased group work commitment such as team-based learning and cooperative learning.

Media Computation (10/48%):

Activities explicitly declaring the use of media computation (Chapter 10).

Peer Support (23/34%):

Support by peers in the form of pairs, groups, hired peer mentors or tutors.

Other Support (9/33%):

An umbrella term for all support activities, e.g. increased teacher hours, additional support channels, etc.

This list highlights the importance of cooperative learning. [Beck2013] looked at this specifically over three academic years in courses taught by two different instructors, and found significant benefits overall and for many subgroups: students who learned cooperatively not only had higher grades, they also left fewer questions blank on the final exam, which indicates greater self-efficacy and willingness to try to debug things.

Pair programming is the most obvious kind of cooperative learning in a coding class, but there are lots of other things learners can do together. If they are writing programs to draw things using turtle graphics, for example, one can play the part of the loop, the second can be the conditional, and the third can be the turtle that is moving (or not) based on the instructions given by the other two.

One way to help learners is to give them consistent feedback that points them in the right direction. The most useful resource I have found for this is the code quality rubric developed in [Steg2014,Steg2016a] (which is online at [Steg2016b]). Its categories include variable names, comments, code layout, control flow, modularization, and so on; two examples of its levels are:


Names

  1. names appear unreadable, meaningless, or misleading
  2. names accurately describe the intent of the code, but can be incomplete, lengthy, misspelled or inconsistent use of casing
  3. names accurately describe the intent of the code, and are complete, distinctive, concise, correctly spelled and consistent use of casing
  4. all names in the program use a consistent vocabulary

Control Flow

  1. there is deep nesting; code performs more than one task per line; unreachable code is present
  2. flow is complex or contains many exceptions or jumps; parts of code are duplicated
  3. flow is simple and contains few exceptions or jumps; duplication is very limited
  4. in the case of exceptions or jumps, the most common path through the code is clearly visible

How Should We Design Lessons?

What is our goal? The term computational thinking is bandied about a lot, in part because people can agree it’s important while meaning very different things by it. I find it more useful to think in terms of getting learners to understand a notional machine. The term was introduced in [DuBo1986], and means an abstraction of the structure and behavior of a computational device. According to [Sorv2013], a notional machine:

  • is an idealized abstraction of computer hardware and other aspects of the runtime environment of programs;
  • serves the purpose of understanding what happens during program execution;
  • is associated with one or more programming paradigms or languages, and possibly with a particular programming environment;
  • enables the semantics of program code written in those paradigms or languages (or subsets thereof) to be described;
  • gives a particular perspective to the execution of programs; and
  • correctly reflects what programs do when executed.

For example, my notional machine for Python is:

  1. Running programs live in memory, which is divided between a call stack and a heap.
  2. Memory for data is always allocated from the heap.
  3. Every piece of data is stored in a two-part structure: the first part says what type the data is, and the second part is the actual value.
  4. Atomic data like Booleans, numbers, and character strings are stored directly in the second part. These values are never modified after they are created.
  5. The scaffolding for collections like lists and sets is also stored in the second part, but it stores references to other data rather than storing those values directly. The scaffolding may be modified after it is created, e.g., a list may be extended or new key/value pairs may be added to a dictionary.
  6. When code is loaded into memory, Python parses it and converts it to a sequence of instructions that are stored like any other data. (This is why it’s possible to alias functions and pass them as parameters.)
  7. When code is executed, Python steps through the instructions, doing what each tells it to in turn.
  8. Some instructions make Python read data, operate on it, and create new data.
  9. Other instructions make Python jump to other instructions instead of executing the next one in sequence; this is how conditionals and loops work.
  10. Yet another instruction tells Python to call a function, which means temporarily switching from one blob of instructions to another.
  11. When a function is called, a new stack frame is pushed on the call stack.
  12. Each stack frame stores variables’ names and references to data. (Function parameters are just another kind of variable.)
  13. When a variable is used, Python looks for it in the top stack frame. If it isn’t there, it looks in the bottom (global) frame.
  14. When the function finishes, Python erases its stack frame and switches from its blob of instructions back to the blob that called it. If there is no caller to go back to, the program has finished.

I don’t try to explain all of this at once, but I draw on this mental model over and over again as I draw pictures, trace execution, and so on. After about 25 hours of class and 100 hours of work on their own time, I expect adult learners to be able to understand most of it.
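
Several points in this model can be checked directly in a few lines of Python. The snippet below is a small illustration with invented names (`double`, `apply_to`, `scaled`, and the variables); it exercises points 4, 5, 6, and 11 through 13, and like the model itself it ignores details such as nested function scopes.

```python
# Points 4 and 5: atomic values are never modified, but collections can be.
name = "abc"
shout = name.upper()        # creates a new string; `name` still refers to "abc"

first = [1, 2]
second = first              # copies the reference, not the list
second.append(3)            # so the change is visible through both names

# Point 6: a function is stored like other data, so it can be aliased
# and passed as a parameter.
def double(x):
    return 2 * x

alias = double

def apply_to(func, value):
    return func(value)

# Points 11 to 13: each call gets its own stack frame, and a name that is
# not found there is looked up in the global frame.
scale = 10

def scaled(value):
    local_copy = value          # exists only in this call's frame
    return local_copy * scale   # `scale` is not local, so the global is used
```

After this runs, `first` and `second` refer to the same three-element list, `apply_to(alias, 4)` gives the same result as `double(4)`, and `local_copy` is not visible outside `scaled`.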

[Sorv2014] lays out three cognitively plausible frameworks for the design of a first programming course, all of which lead to something like the notional machine described above. The first is Motivate-Isolate-Practice-Integrate:


Motivate:

start with a “large” project (by the learner’s standards).

Isolate:

highlight necessary skills.

Practice:

work on individual skills.

Integrate:

bring skills to bear on the project.

The second, Head Straight for Objects, emphasizes object-oriented programming early: the teacher introduces ideas that objects can easily represent, like shapes or people’s roles, then shows how code can capture these ideas. The third approach, Explicit Program Dynamics, teaches programming by explaining how programs run. Copying values vs. copying references, stack frames, and other parts of the program’s state are first-class concepts, and visualization is typically used extensively and in varied ways.

Final Thoughts

As the introduction said, we don’t know as much about how people learn to program as we do about how they learn other things, and much of what we do know may only apply to certain groups. But that’s not the same as saying that we don’t know anything, and good lessons and teaching should draw on what knowledge we have. Conferences like SIGCSE, ITiCSE and ICER are home to a steady stream of rigorous, insightful studies with immediate practical application; while much of that work is sadly hidden behind exclusionary paywalls, sites like doi2bib and Sci-Hub can help you find what you need.


Checking for Common Errors (individual/20 minutes)

This list of common errors is taken from [Sirk2012]. Pick three, and write an exercise to check that learners aren’t making that mistake.

Inverted assignment:

The student assigns the value of the left-hand variable to the right-hand side variable, rather than the other way around.

Wrong branch:

Even though the conditional evaluates to False, the student jumps to the then clause.

Wrong False:

As soon as the conditional evaluates to False, the student returns False from the function.

Executing function instead of defining it:

The student believes that a function is executed as it is defined.

Unevaluated parameters:

The student believes the function starts running before the parameters have been evaluated.

Parameter evaluated in the wrong frame:

The student creates parameter variables in the caller’s frame, not in the callee’s.

Failing to store return value:

The student does not assign the return value in the caller.

Assignment copies object:

The student creates a new object rather than copying a reference.

Method call without subject:

The student tries to call a method from a class without first creating an instance of the class.
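
As a starting point, here is one possible shape for such exercises, written as short Python programs whose results learners predict before running them. Both are invented for this illustration rather than taken from [Sirk2012]; the names `log`, `greet`, and `add_tax` are hypothetical.

```python
# Exercise A (executing a function instead of defining it):
# "In what order are the three strings added to `log`?"
log = []

def greet():
    log.append("inside greet")

log.append("after def")
greet()
log.append("after call")

# Exercise B (failing to store the return value):
# "What are the values of `unchanged` and `price` at the end?"
def add_tax(price):
    return price * 2

price = 10
add_tax(price)              # the return value is discarded
unchanged = price           # so `price` has not changed here
price = add_tax(price)      # now the result is stored
```

A learner who believes that functions run when they are defined will put "inside greet" first in Exercise A, and one who forgets to store return values will expect `unchanged` to differ from 10 in Exercise B.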

Mangled Code (pairs/15 minutes)

[Chen2017] describes exercises in which students reconstruct code that has been mangled by removing comments, deleting or replacing lines of code, moving lines, inserting extra unneeded lines, and so on. Student performance on these correlates strongly with performance on assessments in which students write code (i.e., whatever traditional assignments are measuring, these are measuring as well), but these questions require less (in-person) work to mark. Take the solution to a programming exercise you’ve created in the past, mangle it in two different ways, and swap with a partner.
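
If your solutions are in Python, a few lines of code can do some of the mangling for you. The sketch below is one assumed approach, not the procedure used in [Chen2017]: the `mangle` function, its parameters, and the default distractor line are invented here, and it only handles whole-line comments.

```python
import random

def mangle(source, seed=0, distractor="    pass  # extra line"):
    """Drop blank lines and whole-line comments, add one unneeded line, shuffle."""
    rng = random.Random(seed)       # seeded so that each variant is repeatable
    lines = [
        line for line in source.splitlines()
        if line.strip() and not line.strip().startswith("#")
    ]
    lines.append(distractor)
    rng.shuffle(lines)
    return "\n".join(lines)
```

Calling `mangle(solution, seed=1)` and `mangle(solution, seed=2)` gives two different manglings of the same solution, which is what the exercise asks for; check each one by hand before handing it out, since a shuffle can occasionally leave the lines in their original order.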

The Rainfall Problem (pairs/10 minutes)

Solve the Rainfall Problem in the programming language of your choice in two different ways. Compare your solutions with those of your partner.

Rate Your Tools (individual/15 minutes)

[Koll2016] proposes a set of heuristics to be used in evaluating programming systems for novices, based in large part on the authors’ work developing several generations of such systems [Koll2015]. The heuristics are listed below; rate the programming system you are using in your teaching as “low”, “medium”, “high”, or “not applicable” for each one.


Engagement:

The system should engage and motivate the intended audience of learners. It should stimulate learners’ interest or sense of fun.


Non-threatening:

The system should not appear threatening in its appearance or behaviour. Users should feel safe in the knowledge that they can experiment without breaking the system, or losing data.

Minimal language redundancy:

The programming language should minimize redundancy in its language constructs and libraries.

Learner-appropriate abstractions:

The system should use abstractions that are at the appropriate level for the learner and task. Abstractions should be driven by pedagogy, not by the underlying machine.


Consistency:

The model, language and interface presentation should be consistent, both internally and with each other. Concepts used in the programming model should be represented in the system interface consistently.


Visibility:

The user should always be aware of system status and progress. It should be simple to navigate to parts of the system displaying other relevant data, such as other parts of a program under development.

Secondary notations:

The system should automatically provide secondary notations where this is helpful, and users should be allowed to add their own secondary notations where practical.


Clarity:

The presentation should maintain simplicity and clarity, avoiding visual distractions. This applies to the programming language and to other interface elements of the environment.

Human-centric syntax:

The program notation should use human-centric syntax. Syntactic elements should be easily readable, avoiding terminology obscure to the target audience.

Edit-order freedom:

The interface should allow the user freedom in the order they choose to work. Users should be able to leave tasks partially finished, and come back to them later.

Minimal viscosity:

The system should minimize viscosity in program entry and manipulation. Making common changes to program text should be as easy as possible.


Error-avoidance:

Preference should be given to preventing errors over reporting them. If the system can prevent, or work around an error, it should.


Feedback:

The system should provide timely and constructive feedback. The feedback should indicate the source of a problem and offer solutions.

Roles of Variables (pairs/15 minutes)

Take a short program you have written (5–15 lines) and classify each of its variables using the categories defined by Sajaniemi et al. Compare your classifications with those of a partner: where did you agree? When you disagreed, did you understand each other’s view?

Choose Your Own Adventures (individual/10 minutes)

Which of the three approaches described in [Sorv2014] (Section 7.3) do you use when teaching? Or is your approach best described in some other way?

What Are You Teaching? (individual/10 minutes)

Compare the topics you teach to the list developed in [Luxt2017] (Section 7.3). Which topics do you cover? What extra topics do you cover that aren’t in their list?

Beneficial Activities (individual/10 minutes)

Look at the list of interventions developed by [Viha2014] (Section 7.11). Which of these things do you already do in your classes? Which ones could you easily add? Which ones are irrelevant?

Visualizations (individual/10 minutes)

What visualization do you most like to use when teaching? Is it a static image or an animation? Do you show it to your learners, do they discover it on their own, or something in between?

Misconceptions and Challenges (small groups/15 minutes)

The Professional Development for CS Principles Teaching site includes a detailed list of student misconceptions and challenges. Working in small groups, choose one section (such as data structures or functions) and go through their list. Which of these misconceptions do you remember having when you were a learner? Which do you still have? Which have you seen in your learners?