Individual Differences in Inhibitory Control: A latent Variable Analysis

ABSTRACT Inhibitory control represents a central component of executive functions and focuses on the ability to actively inhibit or delay a dominant response to achieve a goal. Although various tasks exist to measure inhibitory control, correlations between these tasks are rather small, partly because of the task impurity problem. To alleviate this problem, a latent variable approach has been previously applied and two closely related yet separable functions have been identified: prepotent response inhibition and resistance to distractor interference. The goal of our study was a) to replicate the proposed structure of inhibitory control and b) to extend previous literature by additionally accounting for speed-accuracy trade-offs, thereby potentially increasing explained variance in the investigated latent factors. To this end, 190 participants completed six inhibitory control tasks (antisaccade task, Stroop task, stop-signal task, flanker task, shape-matching task, word-naming task). Analyses were conducted using standard scores as well as inverse efficiency scores (combining response times and error rates). In line with previous studies, we generally found low zero-order correlations between the six tasks. By applying confirmatory factor analysis using standard reaction time difference scores, we were not able to replicate a satisfactory model with good fit to the data. By using inverse efficiency scores, a two-related-factor and a one-factor model emerged that resembled previous literature, but only four out of six tasks demonstrated significant factor loadings. Our results highlight the difficulty in finding robust inter-correlations between commonly used inhibitory control tasks, even when applying a latent variable analysis and accounting for speed-accuracy trade-offs.

Journal of Cognition

INTRODUCTION
Inhibitory control represents a central component of executive functions. Although various terms and taxonomies exist, a common working definition is that inhibitory control is the ability to actively inhibit or delay a dominant response to achieve a goal (Friedman & Miyake, 2004; Miyake et al., 2000; Nigg, 2000). Importantly, research has shown that inhibitory control represents a core ability that is associated with other executive functions, e.g., working memory updating and shifting (Miyake & Friedman, 2012). Not surprisingly, the construct is widely used in numerous research domains and has been proposed as an underlying mechanism implicated in different skills and cognitive achievements, for example attention (Friedman et al., 2007), working memory span and reading comprehension (De Beni et al., 1998; Gernsbacher, 1993), problem solving (Pasolunghi et al., 1999), general cognitive ability (Dempster & Corkill, 1999) as well as emotion regulation (Tabibnia et al., 2011). Deficient inhibition-related processes have been postulated in several forms of psychopathology and mental disorders, for example rumination (De Lissnyder et al., 2010) and depression (Joormann, 2010), externalizing behavior (Young et al., 2009), ADHD (Barkley, 1997; Nigg, 2001), substance use disorders (Nigg et al., 2006), schizophrenia (Westerhausen et al., 2011), autism (Geurts et al., 2014), and obsessive-compulsive disorder (van Velzen et al., 2014).
Despite the high relevance of the construct, some commonly used tasks to measure inhibitory control, such as the Stroop task or the stop-signal task, often show low construct validity (Rabbitt, 1997; for a meta-analysis, see Duckworth & Kern, 2011) and poor reliability (Enkavi et al., 2019; recent review in Hedge et al., 2018). Furthermore, although a number of tasks have been used to tap inhibitory control, quite often only a single task is used per study (albeit for obvious reasons such as limited time and resources). Given that no task is a pure measure of inhibitory control, it remains unclear whether the observed effects reflect idiosyncratic task requirements rather than inhibitory control. This well-known task impurity problem (which applies to all executive functions) means that, since any target inhibitory control process must be embedded in a specific context, part of the systematic variance is attributable to non-inhibitory abilities (Miyake et al., 2000). This and random measurement error make it difficult to isolate inhibitory control variance. Consequently, low and often non-significant zero-order correlations between commonly used inhibitory control tasks have been reported and likely result from these problems (e.g., Enge et al., 2014; Singh et al., 2018).
As several studies have pointed out, using multiple tasks and applying a latent variable analysis provides a more fruitful and reliable measurement of inhibitory control (e.g., Aichert et al., 2012; Miyake et al., 2000; Stahl et al., 2014). By extracting the variance that is shared by all tasks, latent variables provide purer measures, thereby reducing measurement error and the task impurity problem. In their seminal study regarding the unity and diversity of inhibition-related functions, Friedman and Miyake (2004) investigated the structure of three inhibition-related functions: prepotent response inhibition, resistance to distractor interference and resistance to proactive interference. By using structural equation modeling, the authors demonstrated that prepotent response inhibition and resistance to distractor interference are closely related to each other (r = .67) but separable from resistance to proactive interference. A study by Kane and colleagues (2016) confirmed the pattern that individual differences in inhibition-related functions represent distinguishable yet empirically related constructs. They found a robust association between attention restraint (e.g., antisaccade and Stroop task) and attention constraint abilities (e.g., flanker task) with a correlation (.60) similar to that reported by Friedman and Miyake (.67), but also that these skills were distinguishable and not identical.
The first goal of this study was to replicate the finding by Friedman and Miyake (2004) of two latent variables ('prepotent response inhibition' and 'resistance to distractor interference') using a latent variable approach. We were especially interested in whether prepotent response inhibition and resistance to distractor interference are in fact closely related, given that several studies emphasize conceptual differences between the two types of inhibitory control (Dempster, 1995; Harnishfeger, 1995; Nigg, 2000) or even suggest that the two constructs are empirically independent (Tiego et al., 2018). Specifically, it has been postulated that resistance to distractor interference relates to an initial perceptual stage of information processing and focuses on the selection of relevant vs. irrelevant information. In contrast, prepotent response inhibition has been associated with a later stage of information processing, focusing on the inhibition of motor responses and behavioral impulses. Because of several methodological problems with measures of resistance to proactive interference (e.g., the dependent variable in the respective tasks represents a difference score that results from only one measurement value, cf. Friedman & Miyake, 2004), we focused only on the relationship between prepotent response inhibition and resistance to distractor interference.
The second goal of the study was to account for the speed-accuracy trade-off that is inherent in all tasks whose instructions emphasize responding as fast and as accurately as possible. Given the inverse relationship between speed and accuracy in both animals and humans (Bogacz, 2013; Wickelgren, 1977), performance measures based on either reaction times (e.g., Stroop effect) or error rates alone may be difficult to interpret (cf. Enge et al., 2014). For example, Draheim et al. (2019) extensively discussed the problems of, and alternatives to, standard reaction time difference scores in differential and developmental research in their recent review. Because previous work on integrated speed-accuracy measures has been based on simulated data or applied in more experimental paradigms (Heitz, 2014; Vandierendonck, 2017), this study investigates whether one such integrated measure empirically improves on conventional reaction time (RT) difference scores in individual differences research. Therefore, we combined error rates and reaction times into inverse efficiency scores (IES; Bruyer & Brysbaert, 2011; Townsend & Ashby, 1983) by dividing the mean RT of correct responses by the proportion of correct responses. This was done in a previous study that applied a latent variable approach to executive function tasks (Wolff et al., 2016) and has the advantage that reaction time and accuracy are combined into a single performance measure. Specifically, we expected higher correlations and a better fit for estimated models with the IES compared to the standard outcome measures.
In sum, by using a latent variable analysis on six inhibitory control tasks, we aimed at replicating the general pattern of two closely related latent factors (cf. Friedman & Miyake, 2004): prepotent response inhibition (antisaccade task, Stroop task, stop-signal task) and resistance to distractor interference (Eriksen flanker task, shape-matching task, word-naming task; see below). In addition, we tried to extend previous literature by considering speed-accuracy trade-offs using inverse efficiency scores, thereby potentially increasing explained variance in the investigated latent factors.

METHODS
We report how we determined our sample size, all data exclusions (if any), all manipulations, and all measures in the study (Simmons et al., 2012). Data and analysis routines can be found at: https://osf.io/2fwm4.

PARTICIPANTS
The sample comprised 190 healthy adults (97 female; age = 18-39 years, M = 23.8 years, SD = 4.7 years) recruited at the TU Dresden. This sample size permitted a participant-to-parameter ratio of more than five in all models (as recommended by Hatcher & O'Rourke, 2014; see also Kline, 2016). Furthermore, this sample approximately meets the minimum sample size for a model structure with two latent and six observed variables (cf. Cohen, 1988; Westland, 2010), at an alpha level of 0.05, a power (1-beta) of 0.80, and an anticipated effect size of 0.38 according to the initial model of inhibition-related functions in Friedman and Miyake (2004; calculated with the online calculator by Soper, 2018).
In a semi-structured interview for psychiatric and neurological disorders or treatment, none of the participants reported any current or past (in the last year) medical, neurological or psychiatric illness or treatment that might influence cognition or motor performance. All participants were non-smokers, reported German as their mother tongue, had normal or corrected-to-normal vision and no color blindness, and reported no regular substance or alcohol use. The study design was approved by the ethics committee of the TU Dresden (EK 357092014). The study was conducted in accordance with the Declaration of Helsinki and followed the ethical

PROCEDURE
Upon arrival, participants were briefly familiarized with the laboratory setting, informed about the upcoming experiment, and asked to provide demographic information and ratings of their current mood. Afterwards, participants performed six inhibitory control tasks (see below) in randomized order. Finally, participants were debriefed, reimbursed and thanked. The session lasted approximately 90 minutes. To ensure undisturbed testing, the sessions were carried out in testing booths. Participants were allowed breaks of self-chosen duration inside the testing booth following completion of each task. To prevent breaks from being skipped entirely, participants were instructed to pause by leaving the testing booth for 5 minutes after completing the first three tasks. Since circadian variation might affect cognitive performance (Bratzke et al., 2012; Hasher et al., 1999; Schmidt et al., 2007), all sessions were conducted between 9 am and 5 pm. Because the study was part of a larger project, all participants returned for a second session investigating general emotion regulation ability. These data are not reported here. A complete list of all measures in the larger project can be found at https://osf.io/2fwm4.

Inhibitory control battery
The task battery comprised six computerized reaction time tasks, three for prepotent response inhibition (antisaccade task, Stroop task, stop-signal task) and three for resistance to distractor interference (Eriksen flanker task, shape-matching task, word-naming task). Since we followed the approach taken by other authors in previous work on individual differences in inhibitory control, the tasks were adapted from Friedman and Miyake (2004) and Enge and colleagues (2014), respectively. Whereas most of the tasks were identical to the tasks by Friedman and Miyake (2004), our implementation differed slightly with regard to the Stroop task (where we used a color-word conflict instead of a number-denotation conflict) and the stop-signal task (where we used a standard response format with button presses instead of an auditory version). However, as can be seen in the work by Enge et al. (2014), these tasks are equally suitable for measuring inhibitory control. All tasks were preceded by written on-screen instructions and at least 20 practice trials. A QWERTZ-layout keyboard or a microphone with audio cable, depending on the task, was used to enter responses. In each task, both error rate and response time were recorded.
Antisaccade task. During each trial of the antisaccade task (described in Friedman & Miyake, 2004), a fixation cross appeared in the middle of a white screen with a jitter of 1500-3500 ms in 250 ms intervals, followed by a visual cue on one side of the screen for 175 ms, followed by a target stimulus (an arrow inside an open box) on the opposite side of the screen for 150 ms, followed by a gray mask that remained on the screen until the participant indicated the direction of the previously shown leftward, rightward or downward pointing arrow per button press (leftward, rightward, and downward pointing arrows on the keyboard, respectively). After 22 practice trials, participants received 90 target trials.
Stroop task. During each trial of the classical color Stroop task (described in Enge et al., 2014), a fixation cross was presented for 500 ms on a white screen, followed by different color names ("GREEN", "RED", "BLUE") or a neutral stimulus ("+ + + +") in varying font colors (green, red, or blue) for up to 1000 ms. Participants were instructed to identify the color of the presented stimulus by button press (red: leftward pointing arrow; green: downward pointing arrow; blue: rightward pointing arrow). Three types of trials were administered: congruent trials (matched font color and word meaning), incongruent trials (mismatched font color and word meaning), and neutral trials (neutral stimulus presented in one of the font colors). The three conditions were presented intermixed in a fixed random order. After 24 practice trials, participants received 240 target trials (80 per condition).
Stop-signal task. During the stop-signal task (described in Enge et al., 2014), a fixation cross was presented for 500 ms on a white screen, followed by a series of black capital letters for up to 1000 ms. Participants were instructed to discriminate between vowels and consonants by button press (go trial; vowels: leftward pointing arrow, consonants: rightward pointing arrow). On a minority of trials (25%), a letter appeared in red font color or changed its color after a few milliseconds from black to red (stop signal). Here, participants were instructed to suppress their response (stop trial). The delay between the stimulus and the stop signal (stop-signal delay, SSD) varied from 0 to 500 ms in 100 ms intervals (resulting in six steps that varied randomly). We assessed the stop-signal reaction time (SSRT) as the estimated time at which the stopping process finishes. As recommended by Logan (1994) and also pursued by Friedman and Miyake (2004), we used the common estimation method based on the horse-race model with the SSRT assumed to be a constant: For each SSD, all RTs for the go trials were rank ordered. Then, the SSD was subtracted from the nth RT, where n was the number of go-trial RTs multiplied by the probability of responding at that delay. After 40 practice trials, participants received 440 target trials.
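The rank-order estimation described above can be illustrated with a minimal Python sketch. The function names, the rounding of n up to the next integer, and the averaging of SSRT estimates across delays are assumptions made for illustration only; the text does not specify how n was rounded or how the per-delay estimates were combined.

```python
import math


def ssrt_for_delay(go_rts_ms, ssd_ms, p_respond):
    """Estimate SSRT (ms) at one stop-signal delay.

    go_rts_ms: RTs from go trials; ssd_ms: stop-signal delay;
    p_respond: probability of (erroneously) responding at that delay.
    """
    if not go_rts_ms:
        raise ValueError("need at least one go-trial RT")
    if not 0.0 < p_respond <= 1.0:
        raise ValueError("p_respond must lie in (0, 1]")
    ranked = sorted(go_rts_ms)  # rank-order all go RTs
    # n = number of go RTs times P(respond | SSD); rounding up is an assumption.
    n = math.ceil(round(len(ranked) * p_respond, 9))
    n = min(max(n, 1), len(ranked))
    return ranked[n - 1] - ssd_ms  # nth RT minus the SSD


def mean_ssrt(go_rts_ms, p_respond_by_ssd):
    """Average the per-delay estimates (assumed aggregation rule)."""
    estimates = [
        ssrt_for_delay(go_rts_ms, ssd, p)
        for ssd, p in p_respond_by_ssd.items()
        if 0.0 < p <= 1.0  # delays without any response give no estimate
    ]
    return sum(estimates) / len(estimates)


# Illustrative (made-up) data: ten go RTs and two delays.
go_rts = [400, 450, 500, 550, 600, 650, 700, 750, 800, 850]
print(ssrt_for_delay(go_rts, ssd_ms=200, p_respond=0.5))  # 5th RT (600) - 200 = 400
print(mean_ssrt(go_rts, {100: 0.2, 200: 0.5}))            # (350 + 400) / 2 = 375.0
```

Because the estimate depends only on the rank position in the go-RT distribution, isolated extreme RTs have little influence on it.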
Eriksen flanker task. During each trial of the Eriksen flanker task (described in Friedman & Miyake, 2004), a blank white screen was presented for 1000 ms, followed by a fixation cross for 500 ms, followed by a centrally presented letter. Participants were instructed to indicate by button press whether the target letter was H or K (CTRL right) or S or C (CTRL left). The letter was presented alone (no-noise condition, "H") or flanked by three noise letters on each side, resulting in another three conditions: 1) noise same as target ("HHHHHHH"), 2) noise compatible ("KKKHKKK"), and 3) noise incompatible ("SSSHSSS"). The stimuli remained on the screen until the participant responded. The four conditions were presented intermixed in a fixed pseudorandom order (no more than three successive trials of the same condition). After 20 practice trials, participants received 160 target trials (40 per condition).
Shape-matching task. During each trial of the shape-matching task (described in Friedman & Miyake, 2004; without negative priming trials), a fixation cross was presented on a black screen for 500 ms, followed by a green target shape on the left for 3000 ms (maximum), followed by a gray mask for 100 ms. Participants were instructed to indicate by button press as fast and accurately as possible whether the target shape matched (rightward pointing arrow) or mismatched (leftward pointing arrow) a white shape on the right, ignoring the red distractor shape overlapping the target shape when present (distractor trial vs. no-distractor trial). A third of the trials (56) were no-distractor trials; the other 112 were distractor trials. The stimuli were a set of eight abstract shapes, exactly the same as those used by Friedman and Miyake (2004). Targets appeared equally often in each position. After 24 practice trials, participants received 168 target trials.
Word-naming task. During each trial of the word-naming task (described in Friedman & Miyake, 2004; without negative priming trials), a fixation cross was presented on a black screen for 500 ms, followed by a green target word at the top or bottom of the screen for 225 ms, followed by a gray mask for 100 ms, and a black screen until the participant responded. Participants were instructed to name the target word aloud and to ignore the red distractor word in the opposite position (top or bottom) when present (distractor vs. no-distractor trial). A third of the trials (56) were no-distractor trials; the other 112 were distractor trials. Following the protocol by Friedman and Miyake (2004), the words were selected from eight German four-letter nouns ("TREE", "HOUSE", "SAND", "RING", "SONG", "DOG", "POT", "CLOTH", "SHIRT"), were matched in frequency and did not rhyme. Targets appeared equally often in each position. After individual voice-key calibration and 24 practice trials, participants received 168 target trials.

STATISTICAL PROCEDURES
Data trimming and outlier analysis
To adhere as closely as possible to the original analysis protocol, data trimming and outlier analysis fully followed the steps by Friedman and Miyake (2004), based on the recommendations by Wilcox and Keselman (2003) for robust data analysis. For the RT-based measures, all RTs from errors (voice key or other) and all RTs less than 200 ms were eliminated. The percentage of trials eliminated was less than 12.5% in all of these tasks. To prevent extreme RTs from unduly influencing the means of each participant, RTs were trimmed in the following way: First, following the trimming procedure by Friedman and Miyake (2004), the following lower and upper criteria were used for each task, and any values exceeding those criteria were replaced with the respective criterion: 400 ms and 2000 ms for the Stroop task, 200 ms and 1000 ms for the word-naming task, 200 ms and 2000 ms for the shape-matching, stop-signal and antisaccade tasks, and 200 ms and 1500 ms for the flanker task. This procedure affected no more than 10% of observations for any task except the word-naming task (33%). Second, for each participant and each task, RTs farther than 3 SD from the mean for each condition were replaced with the respective value 3 SD above/below the mean (see Wilcox & Keselman, 2003). This procedure affected no more than 2% of observations for any task. Data for the stop-signal task were not subject to this procedure because the dependent measure was not influenced by extreme RTs. Afterwards, all between-participant distributions were examined for extreme scores. For each variable used in further analyses, observations farther than 3 SD from the group mean were replaced with the value 3 SD above/below the group mean. This final trimming procedure affected no more than 2.5% of observations for any task. This data trimming procedure was set up before data analysis and aimed at closely replicating the procedure by Friedman and Miyake (2004).
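The within-participant trimming steps can be sketched in Python as follows. This is an illustrative sketch and not the authors' analysis routine (which is available at the OSF link given above); the function names are hypothetical, and the use of the sample standard deviation (n - 1 in the denominator) is an assumption.

```python
from statistics import mean, stdev


def clamp(values, lower, upper):
    """Replace values beyond the task-specific criteria with the criteria."""
    return [min(max(v, lower), upper) for v in values]


def winsorize_sd(values, n_sd=3.0):
    """Replace values farther than n_sd SDs from the mean with that boundary."""
    if len(values) < 2:
        return list(values)
    m, s = mean(values), stdev(values)  # sample SD is an assumption
    return clamp(values, m - n_sd * s, m + n_sd * s)


def trim_condition_rts(correct_rts_ms, lower, upper, min_rt=200.0):
    """Trim the correct-trial RTs of one participant in one condition."""
    kept = [rt for rt in correct_rts_ms if rt >= min_rt]  # drop RTs < 200 ms
    return winsorize_sd(clamp(kept, lower, upper))         # criteria, then 3 SD


# Illustrative (made-up) Stroop RTs with the criteria 400 ms and 2000 ms.
print(trim_condition_rts([150, 450, 500, 550, 2500], lower=400, upper=2000))
# -> [450, 500, 550, 2000]
```

The same `winsorize_sd` step, applied to the per-participant outcome scores instead of single-trial RTs, corresponds to the final between-participant trimming.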
Tables 1 and 2 (see "Results") depict the descriptive statistics of the outcome measures. To further ensure that extreme values did not influence the results, we checked for outliers and influential cases using leverage, studentized residuals, and Cook's D values. These values assess the influence of a single observation on the correlations. Extreme values were defined by leverage values > .05, absolute studentized residuals > 3.00, and Cook's D values much larger than those of the remaining observations. Although some observations were flagged as extreme values, the correlations did not change when these observations were removed. In addition, we report robust Spearman rank correlations because this test does not rely on assumptions about the distribution of the data, thereby providing a more conservative measure of potential associations. In all tasks, lower scores indicate better performance.
Participants received the standard instruction to respond as fast and as accurately as possible (Enge et al., 2014; Friedman & Miyake, 2004). The dependent variables for the analyses with the standard RT differences were: 1) the proportion of errors in the antisaccade task, 2) the reaction time difference between incongruent and congruent trials in the Stroop task, 3) the SSRT in the stop-signal task, 4) the reaction time difference between the noise incompatible and the no-noise condition in the Eriksen flanker task, and 5) and 6) the reaction time difference between the distractor and the no-distractor condition in the shape-matching and word-naming task, respectively.
Because of the related speed-accuracy trade-off, error rates (ERs) and RTs were combined into inverse efficiency scores (IES; Bruyer & Brysbaert, 2011; Townsend & Ashby, 1983) by dividing the mean RT of correct responses by the proportion of correct responses, $\mathrm{IES} = \mathrm{RT}/(1-\mathrm{ER})$. In the antisaccade task, the mean RT of correct responses in the target trials was divided by the proportion of correct responses during these target trials. In the Stroop task, the mean RT of correct responses in the incongruent trials was divided by the proportion of correct responses during incongruent trials, the mean RT of correct responses in congruent trials was divided by the proportion of correct responses during congruent trials, and the second quotient was subtracted from the first, i.e., $\mathrm{RT}_{\mathrm{inc}}/(1-\mathrm{ER}_{\mathrm{inc}}) - \mathrm{RT}_{\mathrm{con}}/(1-\mathrm{ER}_{\mathrm{con}})$. In the Eriksen flanker task, the mean RT of correct responses during noise incompatible trials was divided by the proportion of correct responses during noise incompatible trials, the mean RT of correct responses in no-noise trials was divided by the proportion of correct responses during no-noise trials, and the second quotient was subtracted from the first, i.e., $\mathrm{RT}_{\mathrm{inc}}/(1-\mathrm{ER}_{\mathrm{inc}}) - \mathrm{RT}_{\text{no-noise}}/(1-\mathrm{ER}_{\text{no-noise}})$. In the shape-matching and word-naming task, the mean RT of correct responses in the distractor trials was divided by the proportion of correct responses during distractor trials, the mean RT of correct responses in no-distractor trials was divided by the proportion of correct responses during no-distractor trials, and the second quotient was subtracted from the first, i.e., $\mathrm{RT}_{\mathrm{dis}}/(1-\mathrm{ER}_{\mathrm{dis}}) - \mathrm{RT}_{\text{no-dis}}/(1-\mathrm{ER}_{\text{no-dis}})$. Because RTs are expressed in milliseconds (ms) and divided by proportions, IES are likewise expressed in ms. IES were not used for the stop-signal task, because the SSRT already accounts for accuracy (cf. Logan et al., 2014 for further details).
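As a brief illustration of the IES computation, the following Python sketch implements RT/(1-ER) and the condition difference used for the Stroop, flanker, shape-matching and word-naming tasks. The function names and the example values are hypothetical and chosen only to make the arithmetic easy to follow.

```python
def ies(mean_rt_correct_ms, error_rate):
    """Inverse efficiency score: mean correct RT / proportion correct (in ms)."""
    if not 0.0 <= error_rate < 1.0:
        raise ValueError("error_rate must lie in [0, 1)")
    return mean_rt_correct_ms / (1.0 - error_rate)


def ies_difference(rt_interference, er_interference, rt_baseline, er_baseline):
    """IES of the interference condition minus IES of the baseline condition."""
    return ies(rt_interference, er_interference) - ies(rt_baseline, er_baseline)


# Made-up example: incongruent trials 600 ms with 25% errors,
# congruent trials 500 ms without errors.
print(ies(600, 0.25))                       # 600 / 0.75 = 800.0
print(ies_difference(600, 0.25, 500, 0.0))  # 800.0 - 500.0 = 300.0
```

In this example the plain RT difference would be 100 ms, whereas the IES difference is 300 ms, because the higher error rate in the interference condition inflates its score.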
Descriptive statistics of the IES outcome measures, reaction times, error rates, and IES per condition and per task are given in Tables 1 and 2. All analyses were conducted using both standard response time outcomes as well as IES to examine possible differences between both measures.

Model estimation
Models were estimated with AMOS (Arbuckle, 2014) using maximum likelihood (ML) estimation based on the covariance matrix (cf. Friedman & Miyake, 2004). As a prerequisite of ML estimation, we checked multivariate normality with Mardia's coefficient and Mahalanobis $d^2$. Mardia's coefficient of multivariate skewness and kurtosis was significant and several multivariate outliers were indicated by significant Mahalanobis $d^2$ values. The results were the same when these outliers were removed. For this reason, all subjects were included in further analyses. Nevertheless, to critically evaluate the stability of parameter estimates, we bootstrapped the data 5000 times non-parametrically with replacement. This has been shown to generate less biased estimates compared to standard ML estimation for sample sizes around N = 200 (Nevitt & Hancock, 2001) with only moderate skewness (≤ 2) and kurtosis (≤ 7) (Gao et al., 2008). Bias-corrected standard errors and p-values were obtained by bootstrapping with N = 5000 samples (see Supplementary Table A1).
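The logic of the non-parametric bootstrap can be sketched in Python as below. This is a simplified illustration under stated assumptions: a Pearson correlation stands in for a model parameter, no bias correction is applied, and the function names and data are hypothetical. It does not reproduce the AMOS bootstrap procedure.

```python
import random
from statistics import mean, stdev


def pearson_r(x, y):
    """Pearson correlation; returns None if either variable has zero variance."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return None
    return sxy / (sxx * syy) ** 0.5


def bootstrap_se(statistic, x, y, n_boot=5000, seed=0):
    """Bootstrap standard error: resample participants with replacement."""
    rng = random.Random(seed)
    n = len(x)
    estimates = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]  # one resample of size n
        value = statistic([x[i] for i in idx], [y[i] for i in idx])
        if value is not None:  # skip degenerate resamples
            estimates.append(value)
    return stdev(estimates)  # SD of the bootstrap distribution


# Made-up scores of 30 participants on two tasks.
task_a = [float(i) for i in range(30)]
task_b = [v + ((i * 7) % 5) for i, v in enumerate(task_a)]
print(bootstrap_se(pearson_r, task_a, task_b, n_boot=1000))
```

Resampling whole participants (rather than single scores) keeps the dependence between tasks intact in every bootstrap sample.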
Model fit was evaluated using multiple indices according to the recommendation of Hu and Bentler (1999): the chi-square statistic, the standardized root mean square residual (SRMR), the root mean square error of approximation (RMSEA), Bentler's comparative fit index (CFI), and the normed fit index (NFI). In addition, Akaike's information criterion (AIC) was examined (Burnham & Anderson, 2003). The chi-square statistic measures the "badness of fit" of the model compared with a saturated model, that is, the degree to which the covariances predicted by the model differ from the observed covariances (small values indicate no statistically meaningful differences and are therefore preferable). In contrast to chi-square, the AIC takes model complexity into account and was used to compare different models in order to determine the most adequate one (models yielding the lowest AIC are preferred). SRMR is an index of the average of standardized residuals between the observed and the predicted covariance matrices; lower values indicate closer fit, values less than .08 indicate fair fit and values less than .05 indicate good fit. RMSEA is an index of the difference between the observed covariance matrix per degree of freedom and the hypothesized covariance matrix which denotes the model (Chen, 2007). It also takes model complexity into account; lower values indicate closer fit, values less than .08 indicate an acceptable fit, less than .05 good fit, and less than .01 excellent fit. The CFI quantifies the extent to which the model is better than a baseline model (e.g., with covariances set to 0), and values above .95 indicate good fit, although .90 is also commonly used. The NFI measures the discrepancy between the chi-square value of the hypothesized model and the chi-square value of the null model; values above .95 indicate good fit. All analyses used an alpha level of .05.
Table 1 depicts the descriptive statistics for response times, error rates and IES for single conditions of the six inhibitory control tasks.

Gärtner and Strobel, Journal of Cognition, DOI: 10.5334/joc.150

As shown in Table 2, the reliability estimates for the outcome measures of the six inhibitory control tasks were only moderate (high for antisaccade and shape-matching task, moderate for stop-signal and Stroop task, and low for flanker and word-naming task).
Inter-correlations of reaction times and error rates for the six inhibitory control tasks are depicted in Table 3. There were mostly significant positive correlations between mean reaction times and error rates among the tasks. For example, Stroop mean RT correlated with stop-signal reaction time (SSRT) and antisaccade mean RT, and Stroop error rate correlated with stop-signal error rate and antisaccade error rate (see Table 3).
Bivariate zero-order correlations between the tasks are shown in Table 4. The magnitudes of these correlations were generally low (.29 or smaller). Using standard reaction time scores, there were significant positive correlations between performance in the antisaccade task and the stop-signal task and between performance in the Stroop task and the shape-matching task. Furthermore, there were correlations with p < .10 between performance in the Stroop task and the stop-signal task and between performance in the flanker task and the shape-matching task. Using IES, there were still correlations between performance in the antisaccade task and the stop-signal task and between performance in the Stroop task and the shape-matching task, and a correlation with p < .10 between performance in the Stroop task and the stop-signal task. In contrast to the standard reaction time scores, there were now significant correlations between performance in the antisaccade task and the shape-matching task as well as between performance in the antisaccade task and the word-naming task. Furthermore, there was a correlation with p < .10 between performance in the stop-signal task and the word-naming task.

Table 3. Inter-correlations of reaction times and error rates for the six inhibitory control tasks.

THE TWO-FACTOR MODEL OF THE INHIBITION-RELATED FUNCTIONS
We constructed the measurement model of the two inhibition-related functions for both RT scores (Figure 1A) and IES (Figure 1B). Table 5 also presents the fit of the null model (all covariances among the tasks are hypothesized to equal zero, but variances of the tasks are allowed to vary freely). Given the low zero-order correlations between the tasks, one might speculate that there is not much to be modelled and that the fit of any model would be adequate. However, the fit of the null model was poor, χ²(15, N = 190) = 40.12, p < .001, RMSEA = .094, SRMR = .090, CFI < .01, AIC = 64.12, NFI < .01. Therefore, the covariances are substantial enough to support model-fitting procedures. As shown in Table 5, the fit of the depicted model with two related factors (prepotent response inhibition and resistance to distractor interference) was poor, χ²(8, N = 190) = 12.65, p > .05, RMSEA = .055, SRMR = .053, CFI = .82, AIC = 50.65, NFI = .685. Furthermore, only three tasks demonstrated significant factor loadings (antisaccade task, Stroop task, shape-matching task) and the two factors were not significantly related to each other. Table 5 also presents the fit statistics for alternative theoretical models that we considered (two factors unrelated, one factor). The fit statistics of these models were comparable (one factor) or even worse (two factors unrelated). Supplementary Table A1 contains the bootstrapped p-values and standard errors of the depicted model in Figure 1. Note that the fit of all models was evaluated according to the fit criteria reported in Table 5.
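The RMSEA and NFI values reported for the RT-based models can be checked directly from the reported χ² statistics. The following is a minimal sketch, assuming common textbook definitions (RMSEA with N - 1 in the denominator, NFI computed relative to the null model). The function names are illustrative, and the authors' software may use slightly different conventions.

```python
import math


def rmsea(chi2: float, df: int, n: int) -> float:
    """Root mean square error of approximation (N - 1 convention, an assumption)."""
    return math.sqrt(max(chi2 - df, 0.0) / (df * (n - 1)))


def nfi(chi2_model: float, chi2_null: float) -> float:
    """Normed fit index: proportional reduction in chi-square relative to the null model."""
    return (chi2_null - chi2_model) / chi2_null


N = 190
null_chi2, null_df = 40.12, 15    # null model, RT scores (values from the text)
model_chi2, model_df = 12.65, 8   # two related factors, RT scores (values from the text)

print(round(rmsea(null_chi2, null_df, N), 3))    # 0.094
print(round(rmsea(model_chi2, model_df, N), 3))  # 0.055
print(round(nfi(model_chi2, null_chi2), 3))      # 0.685
```

Under these assumed definitions, the computed values agree with the reported RMSEA of .094 (null model) and .055 (two related factors) and with the reported NFI of .685.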

Figure 1
The two-factor model of inhibition-related functions using RT scores (A) and inverse efficiency scores (B), completely standardized solution. Numbers on the leftwards single-headed arrows are standardized factor loadings. Numbers on the rightwards smaller arrows depict error variances for each task, attributable to idiosyncratic task requirements and measurement error. The number on the curved double-headed arrow is the correlation between the latent variables. Bold-face type indicates significance at the .05 level.

As mentioned earlier, we constructed the same measurement models with IES and expected better model fit by taking reaction time and accuracy for each task into account. The fit for the model with two related factors was mediocre, χ²(8, N = 190) = 11.62, p > .05, RMSEA = .049, SRMR = .049, CFI = .87, AIC = 49.62, NFI = .730. The fit was preferable over the model with unrelated factors but comparable to the model with one factor. However, there were still two tasks that demonstrated no significant factor loadings (stop-signal task, flanker task).

DISCUSSION
In this study, we examined the relationship between six commonly used inhibitory control tasks and aimed at replicating the general pattern of two closely related latent variables (prepotent response inhibition, resistance to distractor interference). In addition, well-known speed-accuracy trade-offs were taken into account by considering inverse efficiency scores (IES). In line with previous studies (Aichert et al., 2012; Cheung et al., 2004; Enge et al., 2014; Enticott et al., 2006), we found generally low and non-significant zero-order correlations between the six tasks. By using standard reaction time difference scores, we were not able to replicate a satisfactory latent variable model with good fit to the data. In contrast, by using IES, both a two-related-factor and a one-factor model with the latent variable response-distractor inhibition indicated mediocre fit to the data and resembled previous literature (Friedman & Miyake, 2004), although only four out of six tasks demonstrated significant factor loadings. The results highlight the difficulty in finding robust inter-correlations between inhibitory control tasks, even when accounting for speed-accuracy trade-offs, thereby possibly reflecting the consequence of the task impurity problem.
The magnitudes of the correlations between the six inhibitory control tasks were generally low (.29 or smaller), but are consistent with the results of previous studies and seem not to be restricted to college samples, but also present in samples with a wider age range and across different levels of intellectual abilities (Cheung et al., 2004; Enge et al., 2014; Enticott et al., 2006; Friedman & Miyake, 2004; Miyake et al., 2000; Shilling, Chetwynd, & Rabbitt, 2002; Singh et al., 2018; Wolff et al., 2016). This is why we applied a latent variable analysis: By extracting common variance that is shared by all tasks, latent variables provide purer measures, thereby reducing measurement error and the task impurity problem. Indeed, the fit for the null model (assuming that the covariances among all tasks are zero) was poor, indicating that although we found only low and mostly non-significant zero-order correlations, the covariances did support model-fitting procedures. However, applying the commonly used reaction time difference scores for the measurement model, we were not able to find a satisfactory fit for the two-factor model (prepotent response inhibition and resistance to distractor interference) or the alternative one-factor model (response-distractor inhibition) based on the findings of Friedman and Miyake (2004). Only three tasks demonstrated significant factor loadings (antisaccade task, Stroop task, shape-matching task).
Given that participants are generally instructed to respond as fast and accurately as possible when conducting these or similar executive function tasks, speed-accuracy trade-offs are likely to be expected. Indeed, negative correlations between mean reaction times and error rates were observed in our study, indicating enhanced speed at the expense of accuracy (see Table 3 for further details). Therefore, in a second step we computed a composite score combining reaction times and error rates in a single score (IES). Using IES, inter-correlations between tasks remained mostly the same (positive correlation between performance in the shape-matching and the Stroop task, as well as between the stop-signal and the antisaccade task) and two additional correlations were observed (positive correlations between the antisaccade task and the shape-matching and word-naming task, respectively). When constructing the measurement model, the fit for the model with two related factors was only moderate and two tasks still demonstrated non-significant factor loadings (stop-signal task, flanker task).
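To illustrate how a composite score can absorb a speed-accuracy trade-off, the following minimal sketch assumes the conventional definition of the inverse efficiency score (mean correct reaction time divided by the proportion of correct responses). The function name and the two participants are hypothetical, and the exact scoring pipeline used in this study may differ.

```python
def inverse_efficiency(mean_rt_ms: float, error_rate: float) -> float:
    """IES = mean correct RT / proportion correct (conventional definition, an assumption)."""
    if not 0.0 <= error_rate < 1.0:
        raise ValueError("error_rate must be in [0, 1)")
    return mean_rt_ms / (1.0 - error_rate)


# Hypothetical participants: B responds faster than A but makes more errors.
ies_a = inverse_efficiency(600.0, 0.05)
ies_b = inverse_efficiency(550.0, 0.15)
print(round(ies_a, 1), round(ies_b, 1))  # 631.6 647.1
```

In this made-up example, participant B looks better on reaction time alone, but the higher error rate yields a larger (worse) IES than that of participant A.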
The failure to extract a satisfactory inhibitory control factor using latent variable analysis is consistent with a line of other studies (e.g., Friedman & Miyake, 2017; Huizinga et al., 2006; Logan et al., 2014; Singh et al., 2018; van der Sluis et al., 2007). Given the generally low zero-order correlations, low factor loadings and high amount of unexplained variance (77-95%), one might conclude that the task measures for inhibitory control used in our and in other studies make it difficult to reliably measure a latent factor. This likely reflects the task impurity problem, that is, the fact that systematic variance is attributable to non-inhibitory abilities (e.g., specific task demands, differing task properties, measurement error). However, similar results regarding low zero-order correlations and factor loadings have also been found for tasks measuring working memory updating and shifting ability (Friedman & Miyake, 2017; Huizinga et al., 2006; van der Sluis et al., 2007), but these studies were able to successfully apply a latent variable approach and found higher factor loadings for the respective tasks. Therefore, another interpretation might be that, in contrast to updating and shifting, inhibitory control represents no common process. This assumption is supported by recent studies from Rey-Mermet and colleagues (2018) and Morra and colleagues (2018), emphasizing that the inhibition construct may need to be separated into different subtypes (see also Noreen & MacLeod, 2015). Instead, studies investigating inhibitory control as a latent variable often found that most of the variance can be accounted for by another factor, that is, basic naming speed (a non-executive processing demand in verbal tasks) and goal maintenance, respectively (Singh et al., 2018; van der Sluis et al., 2007; Friedman & Miyake, 2017). Although goal maintenance is a crucial prerequisite in all executive function tasks, it may be particularly important for inhibition tasks in which the main requirement is avoiding strong prepotent responses or conflicting information. This mechanism could explain why inhibitory control tasks often load on a common executive function factor, but not on an additional inhibition-specific factor (Friedman & Miyake, 2017). 
However, the issue is likely more complicated, given that we found no latent factor (representing goal maintenance) for all tasks. Clearly, more research is needed to disentangle the effects of specific task demands (e.g., by using multiple versions of the same task), inhibitory control ability, and other involved processes like attention and basic naming speed.
A comprehensive study by Stahl et al. (2014) investigated behavioral components of impulsivity, among them resistance to distractor interference and prepotent response inhibition (which they called "stimulus interference" and "behavioral inhibition", respectively). In contrast to our findings, they were able to find two latent factors for both constructs using a structural equation modeling approach, but they were not significantly correlated (as opposed to Friedman & Miyake, 2004). The authors argued that this could be attributable to the applied tasks: In their view, the Stroop task and the flanker task involve both distractor- and response-related interference, which might possibly reduce the amount of ability-specific variance in the respective latent factors. This is also in line with the study by Tiego et al. (2018), who classify the Stroop task alongside the flanker and shape-matching tasks as measures of distractor interference. Following this line of reasoning, the proposed unitary nature of the response-distractor inhibition factor might be artefactual and possibly reflects a failure to use appropriate tasks, or task modifications, to circumvent the task impurity problem. Indeed, this interpretation seems to be partly supported by our data, with the strongest zero-order correlation observed between performance in the Stroop task and the shape-matching task (r = .29, p < .001), a correlation that would be expected if both were measuring resistance to distractor interference. Interestingly, similar correlations were found in the studies of Stahl et al. (2014; r = .21, p < .05) and Tiego et al. (2018; r = .299, p < .01). Furthermore, the study by Tiego et al. (2018) demonstrated that resistance to distractor interference and prepotent response inhibition were empirically unrelated when individual differences in working memory capacity were taken into account. 
Although the study was carried out in a developmental sample, it shows that the empirical overlap of both inhibitory control concepts might at least partly be explained by their common reliance on a limited-capacity attentional resource.
It should be noted that although we found generally low zero-order correlations between the tasks, there were mostly positive significant correlations between mean reaction times and error rates among the tasks. For example, Stroop mean RT correlated with stop-signal reaction time and antisaccade mean RT, and Stroop error rate correlated with stop-signal error rate and antisaccade error rate (see Table 3 for further information). The fact that error rate and mean RT were correlated in nearly all tasks provides support that the errors reflect an inability to inhibit prepotent responses and distractors, respectively, and that the mean RTs reflect general impulsivity. In contrast, the difference scores (e.g., Stroop effect, flanker effect) were not correlated. This is in line with research showing that difference scores are generally lower in reliability than their components. For example, Hedge and colleagues (2018) demonstrated that the total amount of variance is reduced in difference scores often by a factor of 3 or 4 relative to their components. Therefore, the authors concluded that "robust experimental effects do not necessarily translate to optimal methods of studying individual differences" (p. 17), partly because experimental designs have been developed for providing robust effects, which means low between-participant variance (Hedge et al., 2018; see also Draheim, Mashburn, Martin, & Engle, 2019; and Liesefeld & Janczyk, 2019). Furthermore, the reliance on IES has also been debated, as Bruyer and Brysbaert showed that IES increase the variability of the measure when the respective error rate of the task exceeds 10 percent. This has a critical impact on the power of the experiment (Bruyer & Brysbaert, 2011). 
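The point that difference scores tend to be less reliable than their components can be made concrete with the classical test theory formula for the reliability of a difference between two measures. The sketch below assumes equal variances of the two component scores. The function name and the input values are hypothetical and are not taken from this study.

```python
def difference_score_reliability(r_xx: float, r_yy: float, r_xy: float) -> float:
    """Reliability of D = X - Y under classical test theory, assuming equal variances.

    r_xx, r_yy: reliabilities of the two component scores
    r_xy: correlation between the two component scores
    """
    if r_xy >= 1.0:
        raise ValueError("r_xy must be below 1")
    return (0.5 * (r_xx + r_yy) - r_xy) / (1.0 - r_xy)


# Hypothetical example: two highly reliable, highly correlated condition means
# (e.g., incongruent and congruent RTs).
print(round(difference_score_reliability(0.90, 0.90, 0.80), 2))  # 0.5
```

In this hypothetical case, two components with reliabilities of .90 that correlate at .80 produce a difference score with a reliability of only .50, because most of the stable between-participant variance is shared by both conditions and is subtracted out.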
It remains to be seen whether current alternative statistical and methodological approaches, for example, reliance on accuracy-based measures (Draheim, Tsukahara, Martin, Mashburn, & Engle, 2019) or accounting for trial-by-trial variability (Rouder & Haaf, 2019), will prove promising. For example, Draheim et al. (2019) found that accuracy-based measures improve reliability and validity of attention measures. Using a hierarchical regression model, Rouder and Haaf (2019) showed improved reliability (but not validity). Similarly, Rey-Mermet et al. (2019) attempted to reduce variance associated with general processing speed when using difference scores.

LIMITATIONS AND FUTURE DIRECTIONS
Although all inhibitory control tasks were adopted from Friedman and Miyake (2004), there were some variations compared to their study (e.g., Stroop task with color-word conflict instead of number-denotation conflict; stop-signal task with standard response format per button press instead of an auditory version and without tracking method). At least regarding the Stroop task, this might explain why our mean Stroop effect was approximately 100 ms smaller (147 vs. 48 ms; stop-signal reaction time was comparable with 370 vs. 332 ms). However, as we have shown previously, the tasks are equally suitable for measuring inhibitory control (Enge et al., 2014) as they still provide meaningful interference effects. Therefore, these differences in implementation are unlikely to have had a substantial effect on the results. However, a limitation might arise from the comparatively low reliabilities of the word-naming and flanker tasks (.31 and .47, respectively). Although we used the same number of trials as Friedman and Miyake (2004) and wanted to stay as close as possible to their protocol, 40 trials per condition are few and might have contributed to the non-significant factor loadings. The word-naming task had more trials (168), but many had to be excluded during the trimming procedure (mostly due to technical artifacts with the microphone). Therefore, further studies should include a sufficiently large (as large as possible) number of trials to enhance reliability of the tasks.
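As a rough illustration of how much additional trials might help, the Spearman-Brown prophecy formula projects reliability when a test is lengthened. The following is a minimal sketch, assuming that added trials behave like parallel measurements, which may not hold for difference scores or under fatigue and practice effects. The function names and the target of .80 are illustrative, and only the two starting reliabilities (.31 and .47) come from the text.

```python
def spearman_brown(reliability: float, length_factor: float) -> float:
    """Projected reliability when the number of trials is multiplied by length_factor."""
    return length_factor * reliability / (1.0 + (length_factor - 1.0) * reliability)


def required_length_factor(reliability: float, target: float) -> float:
    """Factor by which the trial count would need to grow to reach the target reliability."""
    return target * (1.0 - reliability) / (reliability * (1.0 - target))


# Reported reliabilities: word-naming task .31, flanker task .47
for name, rel in [("word-naming", 0.31), ("flanker", 0.47)]:
    doubled = spearman_brown(rel, 2.0)
    factor = required_length_factor(rel, 0.80)
    print(name, round(doubled, 3), round(factor, 1))
# word-naming 0.473 8.9
# flanker 0.639 4.5
```

Under these assumptions, doubling the number of usable trials would raise the projected reliabilities only to about .47 and .64, and reaching .80 would require roughly nine and four to five times as many trials, respectively.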
A further limitation related to the antisaccade task is that because no eye-tracker was used in the study, we cannot rule out that direction errors were missed or wrongly detected. Furthermore, the visual angle was only about 2°. A study by Kane et al. (2001) has shown that a larger visual angle (around 11°) produces more reliable results. However, at least the general error rate is comparable to other studies (e.g., Friedman & Miyake, 2004).
With a sample size of 190, the present study also meets stricter criteria for a case-to-parameter ratio of 10-20:1 instead of 5:1 (Kline, 2016). However, this sample size may still be insufficient when applying χ² difference tests to decide between competing models with few degrees of freedom (Kenny et al., 2015). Although we wanted to stay as close as possible to Friedman and Miyake's latent variable analyses, further studies might apply Monte Carlo simulations (e.g., Muthén & Muthén, 2002) for determining adequate sample sizes for model comparisons. A larger sample size (>250) would also benefit the examination of robust intercorrelations (e.g., see Schönbrodt & Perugini, 2013).
Another general limitation of studies like ours regards sample composition. By investigating young healthy adults in an academic setting (students), it is possible that their general cognitive control ability is already in the upper range compared to the general population or clinical samples (e.g., patients with ADHD), resulting in relatively homogeneous inhibitory control performance. This could make it even more difficult to find reliable interindividual differences and could lead to an underestimation of effect sizes. In contrast, it is reasonable to speculate that individual differences in inhibition could be found in clinical samples, or could be used to distinguish between clinical and non-clinical samples. Further studies should compare different samples, for example adults of the general population and clinical patients, in order to enhance heterogeneity in the cognitive control measures (but see Rey-Mermet et al., 2018, who studied inhibitory control in young and old adults but still found only weak evidence for inhibition as a psychometric construct).

CONCLUSION
In sum, our inhibition measures correlated only weakly. By accounting for speed-accuracy trade-offs using inverse efficiency scores, we were able to extract a two-related-factor and a one-factor model, respectively, but only four out of six tasks demonstrated significant factor loadings in these models. Together, these results add to the growing body of research that calls into question whether individual differences in inhibitory control can be measured reliably and validly with the existing tasks. Future studies need to generate and test specific predictions on task demands, and consider measures other than difference scores when investigating individual differences, or develop new tasks that are able to tap more inhibition-related variance. Otherwise, the concept of inhibitory control as a common process may no longer hold up (cf. Noreen & MacLeod, 2015).

DATA ACCESSIBILITY STATEMENT
The dataset analyzed for this study and the analysis code can be found at the Open Science Framework [https://osf.io/2fwm4].

ETHICS AND CONSENT
The study design was approved by the ethics committee of the TU Dresden (EK 357092014). The study was conducted in accordance with the Declaration of Helsinki and followed the ethical guidelines of the German Psychological Association. All participants provided written informed consent.

FUNDING INFORMATION
Open Access Funding by the Publication Fund of the TU Dresden.

COMPETING INTERESTS
The authors have no competing interests to declare.