Speed and accuracy of performance are central to many theoretical accounts of cognitive processing. In recent years, several integrated performance measures have been proposed. A comparative study of the available measures (Vandierendonck, 2017) evaluated them by means of Monte Carlo simulations; the present study extends that comparison by applying the measures to existing task-switching data sets.

Performance on cognitive tasks can be expressed in terms of speed and accuracy of execution, typically measured by reaction time (RT) and proportion of errors (PE). Several measures that combine speed and accuracy into a single measure have been proposed (e.g., the inverse efficiency score, Townsend & Ashby, 1978; the rate correct score, Woltz & Was, 2006; the bin score, Hughes, Link, Bowles, Koeth, & Bunting, 2014; the linear integrated speed-accuracy score, Vandierendonck, 2017).

The oldest of the integrated measures and the only one available until the beginning of the twenty-first century is the inverse efficiency score (IES; Townsend & Ashby, 1978), defined as

IES = RT_c / (1 − PE),

where RT_c is the mean RT of the correct responses in a condition and PE the proportion of errors in that condition.

This measure is an estimate of RT adjusted for the frequency of incorrect responses. For example, an average correct RT of 600 ms obtained while committing a proportion of 0.10 errors produces an IES of 667 (ms). Experience with this measure has yielded mixed results (e.g., Bruyer & Brysbaert, 2011).
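The computation can be sketched in a few lines; this is a minimal illustration of the formula above, with the function name chosen for readability rather than taken from any published code.

```python
def inverse_efficiency_score(mean_correct_rt_ms, prop_errors):
    """Inverse efficiency score: mean correct RT divided by the
    proportion of correct responses (1 - PE); assumes prop_errors < 1."""
    return mean_correct_rt_ms / (1.0 - prop_errors)

# The worked example from the text: 600 ms at PE = 0.10 gives ~667 ms.
print(round(inverse_efficiency_score(600.0, 0.10)))  # → 667
```

Note that the score grows steeply as PE approaches 1, which is one reason the measure behaves poorly at high error rates.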

Three decades after the introduction of IES, Woltz and Was (2006) proposed the rate correct score (RCS), which expresses performance as the number of correct responses per unit of time spent responding:

RCS = N_c / Σ RT_a,

where N_c is the number of correct responses and the summation of RT_a runs over the RTs of all trials in the condition, correct and incorrect alike. RCS can thus be read as the number of correct responses per second of task activity.
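As a minimal sketch of this definition (with illustrative variable names and toy numbers, not taken from the original study), RCS divides the count of correct trials by the summed RTs of all trials:

```python
def rate_correct_score(errors, rts_seconds):
    """Rate correct score: number of correct responses divided by the
    total time spent responding across all trials (correct and error).

    errors      : 1 for an error trial, 0 for a correct trial
    rts_seconds : RTs of all trials, in seconds
    """
    n_correct = sum(1 for err in errors if not err)
    return n_correct / sum(rts_seconds)

# Toy condition: 10 trials of 500 ms each, one of them an error:
# 9 correct responses in 5 s of responding -> 1.8 correct per second.
errors = [0] * 9 + [1]
rts = [0.5] * 10
print(rate_correct_score(errors, rts))  # → 1.8
```

Because the denominator includes the RTs of error trials, RCS differs from a simple inverse of IES; this detail becomes relevant in the RCS-c analysis reported later in the paper.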

Still more recently, Hughes, Link, Bowles, Koeth and Bunting (2014) proposed a bin score. In this procedure, the RTs of the correct responses are partitioned into a number of bins, each correct response is awarded points according to the bin its RT falls into, and each error receives a fixed penalty, so that errors weigh heavily on the composite score.

A further integrated measure was proposed based on the consideration that there is a need for a combined measure in which the two performance aspects are weighted so as to contribute equally to the composite score (Vandierendonck, 2017). This linear integrated speed-accuracy score (LISAS) is defined as

LISAS = RT_c + (S_RT / S_PE) × PE,

where RT_c refers to the mean correct RT, S_RT refers to the standard deviation of the correct RTs, and S_PE to the standard deviation of PE; all these values are calculated per condition. The ratio S_RT/S_PE rescales PE to the RT scale, so that both components contribute with comparable weight. Note that when S_PE and/or PE are 0, LISAS is simply equal to RT_c.
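The definition above can be sketched as follows; this is a toy illustration under the assumption that S_PE is the within-condition standard deviation of the trialwise error indicator, with made-up trial values:

```python
from statistics import mean, pstdev

def lisas(rts, errors):
    """Linear integrated speed-accuracy score for a single condition.

    rts    : reaction times of the trials (ms), in trial order
    errors : matching flags, 1 for an error trial, 0 for a correct trial

    The mean correct RT is incremented by PE rescaled to the RT scale
    through the ratio of the two standard deviations.
    """
    correct_rts = [rt for rt, err in zip(rts, errors) if not err]
    rt_c = mean(correct_rts)
    pe = mean(errors)
    s_rt = pstdev(correct_rts)
    s_pe = pstdev(errors)
    if pe == 0 or s_pe == 0:   # error-free condition: LISAS reduces to RT_c
        return rt_c
    return rt_c + (s_rt / s_pe) * pe

# Toy condition: three correct trials (500, 600, 700 ms) and one error.
rts = [500, 600, 700, 450]
errors = [0, 0, 0, 1]
print(round(lisas(rts, errors), 1))  # → 647.1
```

With no errors the second term vanishes and the score equals the mean correct RT, matching the boundary case noted in the text.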

Whether or not such integrated speed-accuracy measures are useful depends on theoretical and methodological considerations. Given the assumptions about the relationship between speed and accuracy of performance, a number of cases can be distinguished. A first possibility is that RT and PE are assumed to have different origins, prohibiting a meaningful combination of both. In categorisation tasks, for instance, as trials progress, RT becomes faster, reflecting confidence in acquired knowledge, whereas errors result from not knowing the correct categorisation rule (e.g.,

Obviously, the question about the utility of integrated measures of RT and PE is confined to the latter case, where RT and PE effects occur in the same direction. However, methodological considerations also play a role in deciding whether or not combined speed-accuracy measures are useful. Acceptance of an integrated measure depends on how well it is able to detect at least the same effects that are detected by speed and accuracy measures, and on the extent to which it is more statistically powerful than the composing measures. Furthermore, integrated scores that assign a much larger weight to one component (e.g., accuracy) at the expense of the other component are very efficient in detecting effects that rely on the strongly weighted component, but are deficient in their ability to detect effects that rely more on the other component. For that reason, useful integrated measures will combine the two aspects in a fair or balanced way. In this vein, a first and important consideration concerns the measurement scales of RT and PE. Whereas RT is typically based on a fine-grained millisecond scale, PE is usually based on a coarse scale, not only when all-or-none (correct or incorrect) scoring is used, but also when error is expressed as a degree of error. Because of this difference between the RT and PE scales, the statistical power of the RT measurements outweighs that of the PE scores, so that in practice RT usually yields more reliable and more dependable observations than PE. On this basis, some researchers may decide to rely only or most heavily on RT. Note, though, that if the theory's prediction concerns both measurement aspects, the latter choice may be disputable.

Second, the balance between speed and accuracy depends on cognitive control. As a consequence, a person can modify the importance assigned to each performance aspect at any time during a series of actions (e.g.,

Third, irrespective of SAT, effects of instructions or task variations may result in different effects in the two measures: they may both increase, they may both decrease, or one measure may grow while the other shrinks.

A series of Monte Carlo simulations that tested the properties and the utility of IES, RCS, LISAS, and four variations of the bin measure showed that the utility of the bin measures is rather limited due to the arbitrary penalisation of PE (Vandierendonck, 2017).

Monte Carlo simulation provides an excellent method to examine the question of whether integrated speed-accuracy measures are useful. However, simulations typically require numerous assumptions that might not be representative of the cognitive processes being modelled. Therefore, as a follow-up on the earlier simulation studies (Vandierendonck, 2017), the present study applies the integrated measures to existing empirical data sets.

The integrated measures were studied in two sets of data that were collected in the context of research projects on task switching in the author's lab. In task switching research, RT and PE differences are expected between task repetition and task switching conditions, and these differences are expected to be in the same direction, at least when there are no strong speed-accuracy trade-offs. In the remainder of the paper, these data sets, which include both published and unpublished data, will be used to assess the utility of the three integrated measures that were found to be acceptable in the comparative Monte Carlo simulations of Vandierendonck (2017).

In line with the approach taken in that study, the focus is on the efficiency with which integrated measures are able to detect effects by picking up information from the two components, speed and accuracy. The rationale is the following. The integrated measures combine RT and PE scores into a single measure, so that the combined score reflects information present in both components. These integrated measures can replace the two component measures if the information present in these components is represented equally well or even better than when only RT and PE would be used. In other words, if RT and PE reveal effects in the same direction, it is expected that the integrated measures show the combined effect, and the larger the observed effect on RT and PE, the larger the observed effect on the integrated measures is expected to be. Conversely, if the observed effect on RT and PE tends to be small, the effect on the integrated measure is also expected to be rather small. It is important to note that these relationships are situated at the level of the data sample.

Switching from one task to another involves a performance cost that has been attributed to the requirement to configure another task set and shield it from interference due to overlap with the previously relevant task set on switch trials as compared to trials where the same task is repeated (for reviews, see, e.g., Monsell, 2003; Kiesel et al., 2010; Vandierendonck, Liefooghe, & Verbruggen, 2010).

Example of a stimulus consisting of digits (2, 3, 5, or 6) in an arrangement where 2, 3, 5, or 6 digits are shown in a playing card pattern.

Figure 1 shows an example of such a stimulus.

All participants in the experiments gave written consent, and were either first-year psychology students who participated for course requirements and credit or paid volunteers recruited from the subject panel of the Department of Experimental Psychology at Ghent University. Table 1 shows the specifications of these experiments.

Specifications of the experiments in Data Set 1: number of participants, design, and specific features of the experiment.

Experiment | N | Design | Features |
---|---|---|---|
1 | 20 (14) | T × D | CSI = 0; CCI = 0 |
2 | 20 (17) | S × T × D | CSI = 300 or 1000 ms; CCI = 0 |
3 | 22 (18) | S × T × D | Same as Exp. 2; global-local stimuli |
4 | 22 (19) | S × T × D | Same as Exp. 2; both tasks with same hand |
5 | 48 (36) | C × S × T × D | Cue: Dimension-Left; Dimension first |
6 | 45 (41) | C × S × T × D | Cue: Dimension-Left; Task first |
7 | 48 (43) | C × S × T × D | Cue: Dimension-Centre; Dimension first |
8 | 47 (36) | C × S × T × D | Cue: Dimension-Centre; Task first |

The second data set includes experiments that were performed in the context of a research project regarding the differences between explicit cues (signalling which task has to be performed) and transition cues (signalling whether the task remains the same or changes) in task switching. Five experiments were conducted, three of which were published (

Although Experiment 1 was designed to simply compare the two types of cueing, it involved one further within-subject factor, namely response congruency. When the two tasks require the same response to the present stimulus, the response is congruent. For example, with the response mappings small-left (large-right) and odd-left (even-right), the digit 3 requires a left response for both tasks (congruent), whereas the digit 4 requires a different response for the magnitude task than for the parity task. Experiments 2–5 used the so-called double registration procedure (

All participants in the experiments were first-year psychology students who participated for course requirements and credit. More details about the experiments are shown in Table 2.

Specifications of the experiments in Data Set 2: number of participants, design, and specific features of the experiment.

Experiment | N | Design | Features |
---|---|---|---|
1 | 24 (15) | C × T × I | I refers to response (in)congruency (do tasks require same response or not) |
2 | 24 (20) | C × T | 500 ms delay between indication response and task stimulus |
3 | 20 (19) | C × T | Same as Exp. 2 with no delay |
4 | 19 (16) | R × T × I | Indication: choice or simple manual response |
5 | 18 (17) | R × T × I | Indication: choice or simple vocal response |

In all experiments of both data sets, data analyses were performed on each of five measures, namely RT_c (the mean correct RT), PE, and the three integrated measures LISAS, IES, and RCS.

In Data Set 1, two analyses were performed on each measure: an ANOVA involving all the factors of the design and their interactions as shown in Table 1, and an analysis based on a set of planned contrasts.

In Data Set 2, an ANOVA tested the main effects and interactions of the design as specified in Table 2.

Because all the designs in the present paper involve within-subject factors, all analyses were performed by means of the multivariate general linear model with contrasts in the dependent variables to avoid problems with the sphericity assumption of the ANOVA model. For all the effects tested in all these analyses, the effect sizes were collected for further comparison.

The utility of the integrated measures was assessed by comparing the results of the data analyses on these measures with the outcomes of the composing measures. A first analysis examined the degree to which effects in these measures are recovered by each of the three integrated measures. Considering that RT is a statistically more sensitive measure than PE and typically has larger effect sizes than PE, plotting the effect sizes of the integrated speed-accuracy (ISA) measures against the RT effect sizes allows for an examination of the consistency between the measures. Figures 2 and 3 display the resulting scatterplots.

Effect sizes of all the effects in both Data Sets in LISAS, IES, and RCS, plotted against the corresponding RT effect sizes.

Effect sizes of all the effects in both Data Sets in LISAS, IES, and RCS, plotted against the corresponding PE effect sizes.

Product-moment correlations and partial correlations (holding either PE or RT constant) of the integrated measures with RT and PE, and their 95% confidence intervals.

ISA | r(RT) | 95% CI | r(PE) | 95% CI | r(RT·PE) | 95% CI | r(PE·RT) | 95% CI |
---|---|---|---|---|---|---|---|---|
LISAS | 0.981 | [0.976, 0.985] | 0.738 | [0.681, 0.786] | 0.972 | [0.965, 0.978] | 0.592 | [0.513, 0.662] |
IES | 0.956 | [0.945, 0.965] | 0.765 | [0.713, 0.808] | 0.930 | [0.913, 0.944] | 0.592 | [0.512, 0.661] |
RCS | 0.911 | [0.890, 0.929] | 0.719 | [0.659, 0.770] | 0.835 | [0.797, 0.867] | 0.371 | [0.268, 0.465] |

Note: r(RT·PE) denotes the partial correlation with RT holding PE constant; r(PE·RT) the partial correlation with PE holding RT constant.

Both figures further indicate that the variability of the data points around the linear regression line tended to be smaller in LISAS than in IES, and was larger in RCS than in the two other measures. In order to test the strength of these observed differences, the RT effect sizes were partitioned into deciles. The averages and the standard deviations of the corresponding ISA effect sizes were calculated per decile for each of the two data sets separately. These data were subjected to a 2 (Data Set) × 3 (Effect Level: deciles 1–3, 4–6, and 7–10) × 3 (ISA measures) orthogonal ANOVA with repeated measures on the last factor. The analysis based on the means revealed no differences except for Effect Level. The ANOVA on the standard deviations per decile revealed a significant main effect of Data Set (M = 0.054 and 0.120 for Sets 1 and 2, respectively),

A rather striking feature in Figures 2 and 3 is that, at the lower end of the RT effect size continuum, RCS produced a substantial number of rather large effect sizes, raising the question whether these reflect genuine effects or artefacts.

Per experiment, the entire set of effects was partitioned on the basis of the effect sizes of RT and PE into a category with small effects and one with large effects. The small category contained the effects for which the effect sizes of RT and PE were both smaller than a predefined criterion. The other effects (either RT or PE or both at least as large as the criterion) were assigned to the large category. The choice of a criterion is to some extent arbitrary, but considering that the integrated measures are based on the same information as RT and PE, it may be expected that these measures report a reliable effect when either the RT or the PE effect or both are reliable. For that reason, the significance threshold was used, and the criterion value was defined to correspond to the size of an effect with one degree of freedom at the significance level of 0.05.

For each experiment, Table 4 displays the proportions of hits and false alarms achieved by each integrated measure, together with the corresponding sensitivity (d′).

Proportions of effects detected by LISAS, IES, and RCS when a critical effect was detected by either RT or PE (hits, H) and when neither RT nor PE detected a critical effect (false alarms, FA), together with the corresponding sensitivity (d′), in both datasets.

Exp | Effects (N) | Large | Small | LISAS H | LISAS FA | LISAS d′ | IES H | IES FA | IES d′ | RCS H | RCS FA | RCS d′ |
---|---|---|---|---|---|---|---|---|---|---|---|---|
Data Set 1 | | | | | | | | | | | | |
1 | 7 | 4 | 3 | 0.90 | 0.13 | 2.43 | 0.90 | 0.13 | 2.43 | 0.90 | 0.38 | 1.60 |
2 | 15 | 6 | 9 | 0.79 | 0.05 | 2.44 | 0.79 | 0.15 | 1.83 | 0.79 | 0.15 | 1.83 |
3 | 15 | 5 | 10 | 0.92 | 0.05 | 3.07 | 0.92 | 0.05 | 3.07 | 0.92 | 0.32 | 1.86 |
4 | 15 | 7 | 8 | 0.81 | 0.06 | 2.48 | 0.81 | 0.06 | 2.48 | 0.69 | 0.06 | 2.08 |
5 | 35 | 15 | 20 | 0.91 | 0.07 | 2.78 | 0.78 | 0.07 | 2.24 | 0.84 | 0.17 | 1.98 |
6 | 35 | 14 | 21 | 0.70 | 0.02 | 2.52 | 0.77 | 0.11 | 1.94 | 0.70 | 0.30 | 1.06 |
7 | 35 | 15 | 20 | 0.53 | 0.07 | 1.54 | 0.78 | 0.07 | 2.24 | 0.78 | 0.07 | 2.24 |
8 | 35 | 11 | 24 | 0.79 | 0.02 | 2.87 | 0.79 | 0.02 | 2.87 | 0.71 | 0.38 | 0.85 |
Data Set 2 | | | | | | | | | | | | |
1 | 16 | 13 | 3 | 0.82 | 0.13 | 2.07 | 0.96 | 0.13 | 2.95 | 0.89 | 0.63 | 0.92 |
2.1 | 7 | 6 | 1 | 0.79 | 0.25 | 1.47 | 0.93 | 0.25 | 2.14 | 0.93 | 0.75 | 0.79 |
2.2 | 7 | 5 | 2 | 0.75 | 0.17 | 1.64 | 0.75 | 0.17 | 1.64 | 0.75 | 0.17 | 1.64 |
3.1 | 7 | 6 | 1 | 0.79 | 0.25 | 1.47 | 0.79 | 0.25 | 1.47 | 0.93 | 0.75 | 0.79 |
3.2 | 7 | 4 | 3 | 0.70 | 0.13 | 1.67 | 0.90 | 0.13 | 2.43 | 0.90 | 0.38 | 1.60 |
4.1 | 17 | 11 | 6 | 0.71 | 0.07 | 2.01 | 0.79 | 0.07 | 2.28 | 0.71 | 0.07 | 2.01 |
4.2 | 17 | 14 | 3 | 0.97 | 0.13 | 2.98 | 0.90 | 0.13 | 2.43 | 0.90 | 0.13 | 2.43 |
5.1 | 13 | 9 | 4 | 0.85 | 0.10 | 2.32 | 0.75 | 0.10 | 1.96 | 0.75 | 0.10 | 1.96 |
5.2 | 13 | 8 | 5 | 0.83 | 0.08 | 2.35 | 0.83 | 0.08 | 2.35 | 0.83 | 0.08 | 2.35 |
Total | 296 | 153 | 143 | 0.82 | 0.02 | 3.03 | 0.86 | 0.04 | 2.85 | 0.84 | 0.20 | 1.82 |
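Assuming the third value reported per measure in the table is the signal-detection sensitivity index d′ = z(H) − z(FA), it can be reproduced from the hit and false alarm rates; a minimal sketch using only the Python standard library:

```python
from statistics import NormalDist

def d_prime(hit_rate, fa_rate):
    """Signal-detection sensitivity: d' = z(hit rate) - z(false alarm rate),
    where z is the inverse of the standard normal CDF."""
    z = NormalDist().inv_cdf
    return z(hit_rate) - z(fa_rate)

# From the rounded rates of the first row (H = 0.90, FA = 0.13) one gets
# about 2.41; the tabled 2.43 was presumably computed from unrounded rates.
print(round(d_prime(0.90, 0.13), 2))  # → 2.41
```

Small discrepancies with the tabled values are expected because the displayed H and FA are rounded to two decimals.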

A 2 (Data Set) × 3 (Measures) ANOVA with repeated measures on the last factor was performed on the d′ values. The false alarm rate was larger in RCS than in the other two measures; a χ² statistic confirmed this difference, χ²(1) = 41.41.

If it is the case that RCS effect sizes tend to be larger than those obtained with the other measures, it is possible that the larger number of false alarms in RCS results from a tendency of RCS to produce higher effect sizes than the composing measures, with the consequence that the higher effect size just exceeds the critical value for the large category. If so, the false alarms would be the by-product of a positive feature of RCS. Is it indeed the case that the occurrences of false alarms constitute such an artefact?

In order to test this speculation, the RCS effect sizes in the hit, miss, false alarm and correct rejection categories were compared to the maximum effect size of RT and PE. Table 5 shows the mean effect sizes in each of these categories.

Mean effect size (standard errors between brackets) of RCS and the maximum of RT and PE as a function of hits, misses, false alarms and correct rejections in RCS.

Measure | Hit | Miss | False Alarm | Correct Rejection |
---|---|---|---|---|
Max (RT, PE) | 0.586 (0.019) | 0.217 (0.027) | 0.058 (0.007) | 0.057 (0.005) |
RCS | 0.576 (0.020) | 0.065 (0.013) | 0.191 (0.017) | 0.039 (0.004) |

The question may be raised why the performance of RCS deviates from that of the other two measures. One factor which may contribute to these differences is the fact that RCS is based on all RTs, whereas IES and LISAS are calculated from correct RTs only. In order to examine whether this factor accounts for the observed difference, a new version of RCS (RCS-c) was calculated, based on the correct RTs only and computed in completely the same way as the other two measures. The scatterplots of the effect sizes against the effect size of RT are shown in panel D of Figure 2.

On average, the performance of RCS-c did not differ reliably from that of RCS, χ²(1) = 0.35.

The 4 (Detection Category) × 2 (Measures) ANOVA was repeated with RCS-c instead of RCS. If anything, this analysis revealed the same significant effects as the analysis on RCS, except that most of the effects were even stronger. Measures interacted with the contrast of the hit and miss categories,

All these analyses confirm that the problems with RCS (and the throughput measure) are not due to the fact that all RTs instead of only correct RTs are used to calculate the score. Using a score which is effectively the inverse of IES did not remove the variability in the generated effect sizes, nor did it eliminate the occurrence of rather deviant false alarm effects (on average three times the size of the maximum of RT and PE).

The findings can be summarised as follows. First, the effect sizes of the integrated speed-accuracy measures showed a strong linear relationship with RT effect sizes and a somewhat less strong relationship with PE effect sizes. Although the correlations were high for all three measures, they were different across the three integrated measures. Second, the scatter plots revealed important differences in effect size variability among the three measures, with the smallest degree of variability in LISAS and the largest in RCS, with IES in between. Moreover, at the lower end of the RT effect size continuum, RCS produced an important number of rather large effects. Third, the signal detection analysis confirmed that the sensitivity (d′) was high and comparable for LISAS and IES, but clearly lower for RCS due to its larger proportion of false alarms.

These findings confirm that at least some of the integrated speed-accuracy measures are useful in situations where speed and accuracy effects are more or less pointing in the same direction. Yet, not all integrated measures that have been proposed thus far appear to be recommendable. Previous work has already elucidated the drawbacks of the binning procedure for the variant proposed by Hughes et al. (2014).

With respect to IES, the present findings are somewhat mixed. On the one hand, it is clear that the effect sizes of IES correlated less strongly with the RT effects than those of LISAS did, while correlating equally strongly with the PE effects. On the other hand, in the signal detection analysis IES performed at the same level as LISAS. In the simulations reported by Vandierendonck (2017), IES also proved to be more variable than LISAS.

So far, this discussion seems to suggest that usage of RCS (and throughput) is best avoided, and that for IES one should take care not to use it with error proportions above 0.10. Although such issues have not been raised thus far for LISAS, the advice formulated by Bruyer and Brysbaert (2011) to always check the RT and PE effects before interpreting an integrated measure seems equally relevant for LISAS.

The present application of the integrated measures to existing data sets suggests that such analyses, in addition to or in follow-up of simulation studies, are useful. Note that the shortcomings noted here with respect to RCS remained under the radar of earlier simulation work (Vandierendonck, 2017).

In recent studies of integrated measures (except for the study by

Clearly, further research regarding the influence of SAT and the utility of integrated measures in correlational research is more than welcome. As the usage of integrated measures has recently started to increase, these questions do not seem to be merely academic concerns but may result in new methodological approaches that help shape the field of cognitive research.

Of the integrated measures available for use, at present only LISAS and IES are sufficiently reliable and efficient for use, with the provision that IES is best avoided when error proportions exceed 0.10, while RCS introduces too much variability, resulting in spurious effects. In view of the variability of effects present in RT and PE, it remains recommendable to check the RT and PE effects before applying integrated measures. However, one should also bear in mind that (a) integrated measures tend to yield larger effect sizes than RT and PE, and (b) if integrated measures are used, there is no need to separately report RT and PE analyses, which reduces the number of statistical tests by 50 percent.

The data used in the present article are accessible at

If the number of observations per condition is too small, more stable estimates of the standard deviation can be obtained by grouping conditions, or by taking the standard deviation per subject. In the present paper, the standard deviations are calculated per condition.

Although the situation considered here raises the same problems as a speed-accuracy trade-off, it is different because opposing effects in speed and accuracy may occur apart from strategical attempts to adapt speed and accuracy. The problem of speed-accuracy trade-off will be addressed below.

Four planned contrasts were used, namely 3 orthogonal contrasts testing complete repetition v. one or more changes (loose structure), change in one v. two components (componential view), task switch v. dimension switch, and a fourth contrast for the hierarchical view contrasting a complete switch with a task switch only.

The present analyses are based on real data for which it is not known whether an effect is true or not. This may be considered an important drawback. However, it should be noted that although the true effect is known in analyses based on simulations, the analysis of the data generated in these simulations faces the same problems, namely that in some samples the RT and/or PE effect is not reliable although it is known to be present. In order to evaluate whether an integrated measure adequately integrates the RT and PE information available in the sample, the effect sizes of the integrated measures have to be compared to those of RT and PE. In these circumstances, a relatively large and reliable effect size in an integrated measure, while there is no trace of an effect in RT and PE, would no doubt be considered suspect. In other words, although the real effect is known, the analyses based on the data in the samples have to take into account whether the effect was detected by the composing measures, speed and accuracy, just like in situations where the real effect is not known.

The author has no competing interests to declare.

The data reported in this article were collected by Evelien Christiaens and Bjørn Van Loy in the context of research projects of the author.