The Danger of Testing by Selecting Controlled Subsets, with Applications to Spoken-Word Recognition

When examining the effects of a continuous variable x on an outcome y, a researcher might choose to dichotomize on x, dividing the population into two sets (low x and high x) and testing whether these two subpopulations differ with respect to y. Dichotomization has long been known to incur a cost in statistical power, but there remain circumstances in which it is appealing: an experimenter might use it to control for confounding covariates through subset selection, by carefully choosing a subpopulation of Low and a corresponding subpopulation of High that are balanced with respect to a list of control variables, and then comparing the subpopulations’ y values. This “divide, select, and test” approach is used in many papers throughout the psycholinguistics literature, and elsewhere. Here we show that, despite their apparent innocuousness, these methodological choices can lead to erroneous results, in two ways. First, if the balanced subsets of Low and High are selected in certain ways, it is possible to conclude that there is a relationship between x and y that is not present in the full population. Specifically, we show that previously published conclusions drawn from this methodology, about the effect of a particular lexical property on spoken-word recognition, do not in fact appear to hold. Second, if the balanced subsets of Low and High are selected randomly, this methodology frequently fails to show a relationship between x and y that is present in the full population. Our work uncovers a new facet of an ongoing research effort: to identify and reveal the implicit freedoms of experimental design that can lead to false conclusions.

Integer Linear Programs to choose subsets of words
In the main text, we describe an algorithm to identify two sets A and B that differ with respect to an explanatory variable x (A comes from the part of the population with "low" x, and B from the "high" x subset) and that are balanced with respect to a given list of control variables. Here we describe how we compute the balanced subsets A and B for Figure 2, using an integer linear program (ILP). For an introduction to integer linear programming, see Papadimitriou and Steiglitz (1982). For Figure 2, the objective function of the ILP is defined by randomly chosen weights a and b. The idea is that, after generating a random weight a_i for each "low" word and b_i for each "high" word, we select the balanced sets of low and high words that are lightest with respect to the chosen weights. The weights a and b describe the "cost" of selecting particular words. Because the weights are drawn afresh in each run of our ILP algorithm, different words have high cost in different runs, and so the balanced pairs of subsets we compute differ from run to run.
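The random-weight selection described above can be illustrated on a toy instance. The following is a minimal sketch that replaces the ILP solver with exhaustive search, so it is only usable for very small inputs. The function names, the toy data, and the balance criterion (per-dimension means differing by at most a tolerance `delta`) are assumptions made for illustration, not the authors' implementation.

```python
import itertools
import random


def is_balanced(sub_a, sub_b, delta):
    """Assumed criterion: in every control dimension, the two subset
    means differ by at most delta."""
    dims = len(sub_a[0])
    for i in range(dims):
        mean_a = sum(v[i] for v in sub_a) / len(sub_a)
        mean_b = sum(v[i] for v in sub_b) / len(sub_b)
        if abs(mean_a - mean_b) > delta:
            return False
    return True


def lightest_balanced_pair(low, high, k, delta, rng):
    """Brute-force stand-in for the ILP: draw random weights, then return
    the index tuples of the balanced size-k pair of minimum total weight,
    or None if no balanced pair exists."""
    a = [rng.random() for _ in low]   # cost of selecting each "low" item
    b = [rng.random() for _ in high]  # cost of selecting each "high" item
    best, best_cost = None, float("inf")
    for ia in itertools.combinations(range(len(low)), k):
        for ib in itertools.combinations(range(len(high)), k):
            sub_a = [low[i] for i in ia]
            sub_b = [high[j] for j in ib]
            if not is_balanced(sub_a, sub_b, delta):
                continue
            cost = sum(a[i] for i in ia) + sum(b[j] for j in ib)
            if cost < best_cost:
                best, best_cost = (ia, ib), cost
    return best


# Hypothetical toy data: one control dimension per item.
low = [(1.0,), (2.0,), (3.0,), (4.0,)]
high = [(1.5,), (2.5,), (3.5,), (4.5,)]

# Re-drawing the weights (here, via a new seed) can change which
# balanced pair is the lightest.
pairs = {
    lightest_balanced_pair(low, high, 2, 0.5, random.Random(seed))
    for seed in range(20)
}
```

Each element of `pairs` is a pair of index tuples into `low` and `high`. The search is exponential in the input size, which is why a real computation would hand the same constraints to an ILP solver.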
In Figure 1, we instead choose the balanced sets so as to maximize the apparent effect of the explanatory variable, using the response variable rather than randomly chosen weights as the guide: we seek the balanced sets of low and high words that are most different with respect to the response variable. To do so, we use a very similar ILP, but with a different objective function, defined in terms of x(word_i), the response-variable value for word i. The remainder of the calculation is exactly as described for Figure 2.
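One way the two ILPs could be written is sketched below. This is an illustration under assumptions, not the authors' exact formulation: the binary indicator variables y_i (for "low" words) and z_j (for "high" words) are hypothetical names, the subset size k and tolerance δ are taken from the BalancedSubset definition, c_ℓ(·) denotes the value of a word in control dimension ℓ, and balance is assumed to mean that the subset means differ by at most δ (equivalently, that the sums differ by at most kδ, since both subsets have size k).

```latex
% Random-weight ILP (assumed form), with weights a_i and b_j:
\begin{align*}
\text{minimize}\quad
  & \sum_{i \in \mathrm{Low}} a_i\, y_i \;+\; \sum_{j \in \mathrm{High}} b_j\, z_j \\
\text{subject to}\quad
  & \sum_{i \in \mathrm{Low}} y_i = k, \qquad \sum_{j \in \mathrm{High}} z_j = k, \\
  & -k\delta \;\le\; \sum_{i \in \mathrm{Low}} c_\ell(\mathrm{word}_i)\, y_i
      - \sum_{j \in \mathrm{High}} c_\ell(\mathrm{word}_j)\, z_j \;\le\; k\delta
      \quad \text{for each control dimension } \ell, \\
  & y_i \in \{0, 1\}, \quad z_j \in \{0, 1\}.
\end{align*}

% Response-guided variant (assumed form): same constraints, but the objective
% uses the response-variable values x(word_i). The sign convention, i.e. which
% group is pushed up and which down, is an assumption.
\begin{align*}
\text{maximize}\quad
  & \sum_{j \in \mathrm{High}} x(\mathrm{word}_j)\, z_j
    \;-\; \sum_{i \in \mathrm{Low}} x(\mathrm{word}_i)\, y_i
\end{align*}
```

Writing the balance requirement as a pair of linear inequalities keeps the program linear, and fixing both subset sizes to k is what allows a bound on means to be restated as a bound on sums.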

Computational complexity of finding balanced subsets
We claim that finding balanced sets A and B is an intractable problem, in general.
Here is a precise statement and an outline of a proof, using a reduction from SubsetSum, a standard NP-complete problem (Garey & Johnson, 1979; Kleinberg & Tardos, 2005). We therefore would not expect to find an efficient algorithm that solves the balanced subset problem in general, which makes an Integer Linear Program an appropriate approach to solving it.
The specific algorithmic problem that we wish to solve (as described in the main text) is the following:
Definition 1. The BalancedSubset problem is defined as follows.
Input: two sets A ⊆ R^d and B ⊆ R^d of d-dimensional vectors, a positive integer k ∈ Z, and a tolerance δ ≥ 0.
Output: do there exist subsets A′ ⊆ A and B′ ⊆ B, with |A′| = |B′| = k, such that, for every dimension i ∈ {1, 2, ..., d}, the values c_i(x) of the points x in A′ and in B′ are balanced to within the tolerance δ? (Here c_i(x) denotes the value of point x in control dimension i.)
To demonstrate the hardness of the general BalancedSubset problem, we will prove the hardness of a special case of it. Specifically, we consider the EqualHalves problem, which is the special case of BalancedSubset in which:
• d = 1: there is only one control dimension.
• δ = 0: the tolerance is zero (so we have to find subsets that match exactly in that one dimension).
• c_1(x) > 0 for all x: the values of all points in that one control dimension are strictly positive.
• |A| = |B| = 2k: the given sets are identical in cardinality, precisely twice that of the desired subsets.
Here is the formal definition of EqualHalves:
Definition 2. The EqualHalves problem is defined as follows.
Input: two sets of positive integers A, B with |A| = |B| = n = 2k. (We permit duplicates in A and B.)
Output: do there exist subsets A′ ⊂ A and B′ ⊂ B, both with size k and with equal sums?
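To make the definition concrete, here is a minimal sketch of an exhaustive decision procedure for EqualHalves. The function name `equal_halves` is hypothetical, and the search takes exponential time, so it only illustrates what a yes-instance looks like on small inputs.

```python
from itertools import combinations


def equal_halves(A, B):
    """Decide EqualHalves by exhaustive search (exponential time).

    A and B are lists of integers (duplicates allowed) with
    len(A) == len(B) == 2k. Returns True if some k elements of A
    and some k elements of B have the same sum.
    """
    if len(A) != len(B) or len(A) % 2 != 0:
        raise ValueError("A and B must have the same even length")
    k = len(A) // 2
    # combinations() picks by position, so duplicate values are handled.
    sums_b = {sum(chosen) for chosen in combinations(B, k)}
    return any(sum(chosen) in sums_b for chosen in combinations(A, k))
```

For example, `equal_halves([1, 2, 3, 4], [2, 2, 5, 6])` is a yes-instance because 1 + 3 = 2 + 2, whereas `equal_halves([1, 1, 1, 1], [2, 2, 2, 2])` is a no-instance.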
We will prove the hardness of EqualHalves via reduction from SubsetSum; from this fact, we conclude the hardness of its generalization BalancedSubset.
Theorem 3. The EqualHalves problem is NP-complete.
Proof. Via reduction from SubsetSum. An instance of the SubsetSum problem consists of a set of positive integers X = {x_1, x_2, ..., x_m} and a target sum W. The goal is to determine whether there is an X′ ⊆ X whose sum is W. Without loss of generality, we can assume that Σ_{x∈X} x > W. SubsetSum is well known to be an NP-complete problem (Garey & Johnson, 1979; Kleinberg & Tardos, 2005). (We permit duplicates in X, which doesn't affect the hardness of the problem.) Given such an instance ⟨X, W⟩ of SubsetSum, construct an EqualHalves instance as follows. Define A to be the union of X and m − 2 zeroes. Define B to contain W and 2m − 3 zeroes. We claim that ⟨X, W⟩ is a yes-instance of SubsetSum if and only if ⟨A, B⟩ is a yes-instance of EqualHalves.
(⇐=) Suppose A′ and B′ form a solution to ⟨A, B⟩. The set A′ must have a positive sum because strictly fewer than half of the elements of A are zero (A contains only m − 2 zeroes, while A′ has m − 1 elements), and thus B′ must contain W. Therefore the sum of the elements in A′ is W. Removing any zeroes from A′ yields a subset of X whose sum is W.
(=⇒) Suppose X′ is a solution to ⟨X, W⟩. Generate A′ by adding zeroes to X′ until |A′| = m − 1. This is possible because Σ_{x∈X} x > W implies X′ ≠ X, so |X′| ≤ m − 1, and because X′ is nonempty whenever W > 0, so at most m − 2 zeroes are needed. Let B′ be W plus m − 2 zeroes. Both sets have size m − 1 and sum W.
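The construction in the proof can be checked mechanically on small instances. The sketch below builds A and B from ⟨X, W⟩ as described and compares brute-force answers for the two problems. All function names are hypothetical, and the preconditions (m ≥ 2 and 0 < W < Σ x) are assumptions matching the "without loss of generality" step above.

```python
from itertools import combinations


def subset_sum(X, W):
    """Brute force: does some sub-collection of X sum to W?"""
    return any(
        sum(chosen) == W
        for r in range(len(X) + 1)
        for chosen in combinations(X, r)
    )


def equal_halves(A, B):
    """Brute force: do some k elements of A and some k elements of B
    have equal sums, where len(A) == len(B) == 2k?"""
    k = len(A) // 2
    sums_b = {sum(chosen) for chosen in combinations(B, k)}
    return any(sum(chosen) in sums_b for chosen in combinations(A, k))


def reduce_to_equal_halves(X, W):
    """Build the instance from the proof: A is X plus m - 2 zeroes,
    B is W plus 2m - 3 zeroes, so that |A| = |B| = 2(m - 1)."""
    m = len(X)
    # Assumed preconditions, mirroring the proof's setup.
    if m < 2 or not (0 < W < sum(X)):
        raise ValueError("need m >= 2 and 0 < W < sum(X)")
    A = list(X) + [0] * (m - 2)
    B = [W] + [0] * (2 * m - 3)
    return A, B


# The two answers should agree on every valid target for this toy X.
X = [3, 5, 7]
agreement = all(
    subset_sum(X, W) == equal_halves(*reduce_to_equal_halves(X, W))
    for W in range(1, sum(X))
)
```

With X = [3, 5, 7] and W = 8, the construction gives A = [3, 5, 7, 0] and B = [8, 0, 0, 0]; the halves {3, 5} and {8, 0} both sum to 8, matching the SubsetSum solution {3, 5}.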