We study the problem of testing whether an unknown $n$-variable Boolean function is a $k$-junta in the distribution-free property testing model, where the distance between functions is measured with respect to an arbitrary and unknown probability distribution over $\lbrace 0,1\rbrace ^n$. Our first main result is that distribution-free $k$-junta testing can be performed, with one-sided error, by an adaptive algorithm that uses $\tilde{O}(k^2)/\epsilon$ queries (independent of $n$). Complementing this, our second main result is a lower bound showing that any non-adaptive distribution-free $k$-junta testing algorithm must make $\Omega (2^{k/3})$ queries even to test to accuracy $\epsilon =1/3$. These bounds establish that while the optimal query complexity of non-adaptive $k$-junta testing is $2^{\Theta (k)}$, for adaptive testing it is $\mathrm{poly}(k)$, and thus show that adaptivity provides an exponential improvement in the distribution-free query complexity of testing juntas.
ACM Reference format:
Zhengyang Liu, Xi Chen, Rocco A. Servedio, Ying Sheng, and Jinyu Xie. 2018. Distribution-free Junta Testing. ACM Trans. Algorithms 15, 1, Article 1 (September 2018), 23 pages. https://doi.org/10.1145/3264434
Property testing of Boolean functions was first considered in the seminal works of Blum et al. (1993) and Rubinfeld and Sudan (1996) and has developed into a robust research area at the intersection of sub-linear algorithms and complexity theory. Roughly speaking, a property tester for a class $\mathcal {C}$ of functions from $\lbrace 0,1\rbrace ^n$ to $\lbrace 0,1\rbrace$ is a randomized algorithm that is given some form of access to the (unknown) input Boolean function $f$ and must with high probability distinguish the case that $f \in \mathcal {C}$ versus the case that $f$ is $\epsilon$-far from every function $g \in \mathcal {C}$. In the usual (uniform-distribution) property testing scenario, the testing algorithm may access $f$ by making black-box queries on inputs $x \in \lbrace 0,1\rbrace ^n$, and the distance between two functions $f$ and $g$ is measured with respect to the uniform distribution on $\lbrace 0,1\rbrace ^n$; the goal is to develop algorithms that make as few queries as possible. Many different classes of Boolean functions have been studied from this perspective (Blum et al. 1993; Alon et al. 2005; Bhattacharyya et al. 2010; Parnas et al. 2002; Diakonikolas et al. 2007; Goldreich et al. 2000; Fischer et al. 2002; Chakrabarty and Seshadhri 2013; Chen et al. 2014, 2015; Khot et al. 2015; Belovs and Blais 2016; Khot and Shinkar 2016; Chakrabarty and Seshadhri 2016; Baleshzar et al. 2016; Chen et al. 2017b, 2017c; Matulef et al. 2010, 2009; Blais et al. 2011; Blais and Kane 2012; Gopalan et al. 2011), and see other works referenced in the surveys by Ron (2008, 2010) and Goldreich (2010). Among these, the class of $k$-juntas—Boolean functions that depend only on (an unknown set of) at most $k$ of their $n$ input variables—is one of the best-known and most intensively investigated such classes (Fischer et al. 2004; Chockler and Gutfreund 2004; Blais 2008, 2009; Buhrman et al. 2013; Servedio et al. 2015), with ongoing research on junta testing continuing right up to the present (Chen et al. 2017a).
The query complexity of junta testing in the uniform distribution framework is now well understood. Improving on $\mathrm{poly}(k)/\epsilon$-query algorithms given in Fischer et al. (2004) (which introduced the junta testing problem), in Blais (2008), Blais gave a non-adaptive algorithm that makes $\tilde{O}(k^{3/2})/\epsilon$ queries, and in Blais (2009), Blais gave an $O(k\log k + k/\epsilon)$-query adaptive algorithm. On the lower bounds side, Fischer et al. (2004) initially gave an $\Omega (\sqrt {k})$ lower bound for non-adaptively testing $k$-juntas, which also implies an $\Omega (\log k)$ lower bound for adaptive testing. Chockler and Gutfreund improved the adaptive lower bound to $\Omega (k)$ in Chockler and Gutfreund (2004), and very recently Chen et al. (2017a) gave an $\tilde{\Omega }(k^{3/2})/\epsilon$ non-adaptive lower bound. Thus, in both the adaptive and non-adaptive uniform distribution settings, the query complexity of $k$-junta testing has now been pinned down to within logarithmic factors.
Distribution-free property testing. This work studies the junta testing problem in the distribution-free property testing model that was first introduced in Goldreich et al. (1998). In this model, the distance between Boolean functions is measured with respect to a distribution $\mathcal {D}$ over $\lbrace 0,1\rbrace ^n$, which is arbitrary and unknown to the testing algorithm. Since the distribution is unknown, in this model the testing algorithm is allowed (in addition to making black-box queries) to draw random labeled samples $(\mathbf {x},f(\mathbf {x}))$, where each $\mathbf {x}$ is independently distributed according to $\mathcal {D}$. The query complexity of an algorithm in this framework is the worst-case total number of black-box oracle calls plus random labeled samples that are used, across all possible distributions. (It follows that distribution-free testing of a class $\mathcal {C}$ requires at least as many queries as testing $\mathcal {C}$ in the standard uniform-distribution model.)
Distribution-free property testing is in the spirit of similar distribution-free models in computational learning theory such as Valiant's celebrated PAC learning model (Valiant 1984). Such models are attractive because of their minimal assumptions; they are well motivated both because in many natural settings the uniform distribution over $\lbrace 0,1\rbrace ^n$ may not be the best way to measure distances, and because they capture the notion of an algorithm dealing with an unknown and arbitrary environment (modeled here by the unknown and arbitrary distribution $\mathcal {D}$ over $\lbrace 0,1\rbrace ^n$ and the unknown and arbitrary Boolean function $f: \lbrace 0,1\rbrace ^n \rightarrow \lbrace 0,1\rbrace$). Researchers have studied distribution-free testing of a number of Boolean function classes, including monotone functions, low-degree polynomials, dictators (1-juntas) and $k$-juntas (Halevy and Kushilevitz 2007), disjunctions and conjunctions (monotone and non-monotone), decision lists, and linear threshold functions (Glasner and Servedio 2009; Dolev and Ron 2011; Chen and Xie 2016). Since depending on few variables is an appealingly flexible “real-world” property in comparison with more highly structured syntactically defined properties, we feel that junta testing is a particularly natural task to study in the distribution-free model.
Prior results on distribution-free junta testing. Given how thoroughly junta testing has been studied in the uniform distribution model, surprisingly little was known in the distribution-free setting. The adaptive $\Omega (k)$ and non-adaptive $\tilde{\Omega }(k^{3/2})/\epsilon$ uniform-distribution lower bounds from Chockler and Gutfreund (2004) and Chen et al. (2017a) mentioned earlier trivially extend to the distribution-free model, but no other lower bounds on distribution-free junta testing were known prior to this work. On the positive side, Halevy and Kushilevitz (2007) showed that any class $\mathcal {C}$ that has (i) a one-sided error uniform-distribution testing algorithm and (ii) a self-corrector, has a one-sided error distribution-free testing algorithm. As $\mathrm{poly}(k)/\epsilon$-query one-sided junta testers were given already in Fischer et al. (2004), and $k$-juntas have $O(2^{k})$-query self-correctors (Alon and Weinstein 2012), this yields a one-sided non-adaptive distribution-free junta tester with query complexity $O(2^k/\epsilon)$. No other results were known.
Thus, prior to this work there were major gaps in our understanding of distribution-free $k$-junta testing: Is the query complexity of this problem polynomial in $k$, exponential in $k$, or somewhere in between? Does adaptivity confer an exponential advantage, a sub-exponential advantage, or no advantage at all? Our results, described below, answer both these questions.
Our main positive result is a $\mathrm{poly}(k)/\epsilon$-query one-sided adaptive algorithm for distribution-free $k$-junta testing:
For any $\epsilon \gt 0$, there is a one-sided distribution-free adaptive algorithm for $\epsilon$-testing $k$-juntas with $\tilde{O}(k^2)/\epsilon$ queries.
Theorem 1.1 shows that $k$-juntas stand in interesting contrast with many other well-studied classes of Boolean functions in property testing such as conjunctions, decision lists, linear threshold functions, and monotone functions. For each of these classes, distribution-free testing requires dramatically more queries than uniform-distribution testing: for the first three classes the separation is $\mathrm{poly}(1/\epsilon)$ queries in the uniform setting (Parnas et al. 2002; Matulef et al. 2010) versus $n^{\Omega (1)}$ queries in the distribution-free setting (Glasner and Servedio 2009; Chen and Xie 2016); for $n$-variable monotone functions $\mathrm{poly}(n)$ queries suffice in the uniform setting (Goldreich et al. 2000; Khot et al. 2015), whereas Halevy and Kushilevitz (2007) show that $2^{\Omega (n)}$ queries are required in the distribution-free setting. In contrast, Theorem 1.1 shows that for $k$-juntas the query complexities of uniform-distribution and distribution-free testing are polynomially related (indeed, within at most a quadratic factor of each other).
Complementing the strong upper bound that Theorem 1.1 gives for adaptive testers, our main negative result is an $\Omega (2^{k/3})$-query lower bound for non-adaptive testers:
For $k \le n/200$, any non-adaptive algorithm that distribution-free $\epsilon$-tests $k$-juntas over $\lbrace 0,1\rbrace ^n$, for $\epsilon =1/3,$ must have query complexity $\Omega (2^{k/3}).$
Theorems 1.1 and 1.2 together show that adaptivity enables an exponential improvement in the distribution-free query complexity of testing juntas. This is in sharp contrast with uniform-distribution junta testing, where the adaptive and non-adaptive query complexities are polynomially related (with an exponent of only 3/2). To the best of our knowledge, this is the first example of an exponential separation between adaptive and non-adaptive distribution-free testers.
The algorithm. As a first step toward our $\tilde{O}(k^2)/\epsilon$-query algorithm, in Section 3, we first present a simple one-sided adaptive algorithm, which we call SimpleDJunta, that distribution-free tests $k$-juntas using $O({(k/ \epsilon)} + k \log n)$ queries. SimpleDJunta uses binary search and is an adaptation to the distribution-free setting of the $O({(k/ \epsilon)} + k \log n)$-query uniform-distribution algorithm, which is implicit in Blais (2009). The algorithm maintains a set $I$ of relevant variables: a string $x\in \lbrace 0,1\rbrace ^n$ has been found for each $i\in I$ such that $f(x)\ne f(x^{(i)})$ (we use $x^{(i)}$ to denote the string obtained by flipping the $i$th bit of $x$), and the algorithm rejects only when $|I|$ becomes larger than $k$. In each round, the algorithm samples a string $\mathbf {x}\leftarrow \mathcal {D}$ and a subset $\mathbf {R}$ of $\overline{I}:=[n]\setminus I$ uniformly at random. A simple lemma, Lemma 3.2, states that if $f$ is far from every $k$-junta with respect to $\mathcal {D}$, then $f(\mathbf {x})\ne f(\mathbf {x}^{(\mathbf {R})})$ with at least some moderately large probability as long as $|I|\le k$, where we use $\mathbf {x}^{(\mathbf {R})}$ to denote the string obtained from $\mathbf {x}$ by flipping every coordinate in $\mathbf {R}$. With such a pair $(\mathbf {x},\mathbf {x}^{(\mathbf {R})})$ in hand, it is straightforward to find a new relevant variable using binary search over coordinates in $\mathbf {R}$ (see Figure 1), with at most $\log n$ additional queries.
To achieve a query complexity that is independent of $n$, clearly one must employ a more efficient approach than binary search over $\Omega (n)$ coordinates (since most likely the set $\mathbf {R}$ has size $\Omega (n)$ for the range of $k$ we are interested in). In the uniform-distribution setting this is accomplished in Blais (2009) by first randomly partitioning the variable space $[n]$ into $s=\mathrm{poly}(k/\epsilon)$ disjoint blocks $B_1,\ldots ,B_s$ of variables and carrying out binary search over blocks (see Figure 2) rather than over individual coordinates; this reduces the cost of each binary search to $\log (k/\epsilon)$ rather than $\log n$. The algorithm maintains a set of relevant blocks: two strings $x,y\in \lbrace 0,1\rbrace ^n$ have been found for each such block $B$ that satisfy $f(x)\ne f(y)$ and $y=x^{(S)}$ with $S\subseteq B$, and the algorithm rejects when more than $k$ relevant blocks have been found. In each round the algorithm samples two strings $\mathbf {x},\mathbf {y}$ uniformly at random conditioned on their agreeing with each other on the relevant blocks that have already been found in previous rounds; if $f(\mathbf {x})\ne f(\mathbf {y}),$ then the binary search over blocks is performed to find a new relevant block. To establish the correctness of this approach, Blais (2009) employs a detailed and technical analytic argument based on the influence of coordinates and the Efron-Stein orthogonal decomposition of functions over product spaces. This machinery is well suited for dealing with product distributions, and indeed the analysis of Blais (2009) goes through for any product distribution over $\lbrace 0,1\rbrace ^n$ (and even for more general finite domains and ranges). However, it is far from clear how to extend this machinery to work for the completely unstructured distributions $\mathcal {D}$ that must be handled in the distribution-free model.
Our main distribution-free junta testing algorithm, denoted MainDJunta, draws ideas from both SimpleDJunta (mainly Lemma 3.2) and the uniform distribution tester of (Blais 2009). To avoid the $\log n$ cost, the algorithm carries out binary search over blocks rather than over individual coordinates, and maintains a set of disjoint relevant blocks $B_1,\ldots ,B_\ell$, i.e., for each $B_j$ a pair of strings $x^j$ and $y^j$ have been found such that they agree with each other over $\overline{B_j}$ and satisfy $f(x^j)\ne f(y^j)$. Let $w^j$ be the projection of $x^j$ (and $y^j$) over $\overline{B_j}$ and let $g_j$ be the Boolean function over $\lbrace 0,1\rbrace ^{B_j}$ obtained from $f$ by setting variables in $\overline{B_j}$ to $w^j$. For clarity, we assume further that every function $g_j$ is very close to a literal (i.e., for some $\tau \in \lbrace x_{i_j},\overline{x_{i_j}}\rbrace$, we have $g_j(x)=\tau$ for all $x\in \lbrace 0,1\rbrace ^{B_j}$ for some $i_j\in B_j$) under the uniform distribution. (To justify this assumption, we note that if $g_j$ is far from every literal under the uniform distribution, then it is easy to split $B_j$ further into two relevant blocks using the uniform distribution algorithm of (Blais 2009).) Let $I=\lbrace i_j:j\in [\ell ]\rbrace$. Even though the algorithm does not know $I$, there is indeed a way to draw uniformly random subsets $\mathbf {R}$ of $\overline{I}$. First, we draw a partition of $B_j$ into $\mathbf {P}_j$ and $\mathbf {Q}_j$ uniformly at random, for each $j$. Since $g_j$ is close to a literal, it is not difficult to figure out whether $\mathbf {P}_j$ or $\mathbf {Q}_j$ contains the hidden $i_j$, say it is $\mathbf {P}_j$ for every $j$. Then the union of all $\mathbf {Q}_j$’s together with a uniformly random subset of $\overline{B_1\cup \cdots \cup B_\ell }$, denoted by $\mathbf {R}$, turns out to be a uniformly random subset of $\overline{I}$. With $\mathbf {R}$ in hand, Lemma 3.2 implies that $f(\mathbf {x})\ne f(\mathbf {x}^{(\mathbf {R})})$ with high probability when $\mathbf {x}\leftarrow \mathcal {D}$, and when this happens, one can carry out binary search over blocks to increase the number of relevant blocks by one. In Section 4.1, we explain the intuition behind the main algorithm in more detail.
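For concreteness, the following Python sketch (ours, for illustration only; the precise pseudocode appears in Figure 4) shows how such a set $\mathbf {R}$ can be assembled once, for every $j$, the side of the partition containing the hidden variable $i_j$ has been identified; the helper `side_with_literal` is a hypothetical stand-in for the subroutine (called WhereIsTheLiteral in Section 4) that makes this determination.

```python
import random

def random_partition(block):
    """Split a block (a set of coordinates) into two disjoint parts uniformly at random."""
    p, q = set(), set()
    for i in block:
        (p if random.random() < 0.5 else q).add(i)
    return p, q

def draw_R(n, blocks, side_with_literal):
    """Assemble a uniformly random subset R of the complement of I = {i_1, ..., i_l}.

    blocks            -- the disjoint relevant blocks B_1, ..., B_l (sets of coordinates)
    side_with_literal -- callable(j, P, Q) returning P or Q, reporting which side of the
                         partition of B_j contains the hidden relevant variable i_j
                         (a hypothetical stand-in for the WhereIsTheLiteral subroutine)
    """
    R, covered = set(), set()
    for j, B in enumerate(blocks):
        covered |= B
        P, Q = random_partition(B)
        if side_with_literal(j, P, Q) is P:
            R |= Q                     # the side *not* containing i_j goes into R
        else:
            R |= P
    # coordinates outside all blocks are included independently with probability 1/2
    R |= {i for i in range(n) if i not in covered and random.random() < 0.5}
    return R
```

Conditioned on all the sides being identified correctly, every coordinate outside $I$ lands in $\mathbf {R}$ independently with probability $1/2$, so $\mathbf {R}$ is distributed exactly as a uniformly random subset of $\overline{I}$, which is what Lemma 3.2 needs.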
The lower bound. As we explain in Section 2, a $q$-query non-adaptive distribution-free tester is a randomized algorithm $A$ that works as follows. When $A$ is run on an input pair $(\phi ,\mathcal {D})$, it is first given the results $(\mathbf {y}^1,\phi (\mathbf {y}^1)),\dots ,(\mathbf {y}^q,\phi (\mathbf {y}^q))$ of $q$ draws from the sampling oracle. Based on these, it queries the black-box oracle $q$ times on strings $\mathbf {z}^1,\dots ,\mathbf {z}^q.$ The $\mathbf {z}^j$’s may depend on the random pairs $(\mathbf {y}^i,\phi (\mathbf {y}^i))$ received from the sampling oracle, but the $j$th black-box query string $\mathbf {z}^j$ may not depend on the responses $\phi (\mathbf {z}^1),\dots ,\phi (\mathbf {z}^{j-1})$ to any of the $j-1$ earlier black-box queries.
As is standard in property testing lower bounds, our argument employs a distribution $ {\mathcal {YES}}$ over yes-instances and a distribution $ {\mathcal {NO}}$ over no-instances. Here, ${\mathcal {YES}}$ is a distribution over (function, distribution) pairs $(\mathbf {f},\boldsymbol {\mathcal {D}})$ in which $\mathbf {f}$ is guaranteed to be a $k$-junta; $ {\mathcal {NO}}$ is a distribution over pairs $(\mathbf {g},\boldsymbol {\mathcal {D}})$ such that with probability $1-o(1)$, $\mathbf {g}$ is $1/3$-far from every $k$-junta with respect to $\boldsymbol {\mathcal {D}}.$ To prove the desired lower bound against non-adaptive distribution-free testers, it suffices to show that for $q=2^{k/3}$, any deterministic non-adaptive algorithm $A$ as described above is roughly equally likely to accept whether it is run on an input drawn from ${\mathcal {YES}}$ or from $ {\mathcal {NO}}.$
Our construction of the ${\mathcal {YES}}$ and $ {\mathcal {NO}}$ distributions is essentially as follows. In making a draw either from ${\mathcal {YES}}$ or from $ {\mathcal {NO}}$, first $m=\Theta (2^k \log n)$ strings are selected uniformly at random from $\lbrace 0,1\rbrace ^n$ to form a set $\mathbf {S}$, and the distribution $\boldsymbol {\mathcal {D}}$ in both ${\mathcal {YES}}$ and $ {\mathcal {NO}}$ is set to be the uniform distribution over $\mathbf {S}$. Also, in both ${\mathcal {YES}}$ and $ {\mathcal {NO}}$, a “background” $k$-junta $\mathbf {h}$ is selected uniformly at random by first picking a set $\mathbf {J}$ of $k$ variables at random and then a random truth table for $\mathbf {h}$ over the variables in $\mathbf {J}$. We view the variables in $\mathbf {J}$ as partitioning $\lbrace 0,1\rbrace ^n$ into $2^k$ disjoint “sections” depending on how they are set.
In the case of a draw from ${\mathcal {YES}}$, the Boolean function $\mathbf {f}$ that goes with the above-described $\boldsymbol {\mathcal {D}}$ is simply the background junta $\mathbf {f}=\mathbf {h}$. In the case of a draw from $ {\mathcal {NO}}$, the function $\mathbf {g}$ that goes with $\boldsymbol {\mathcal {D}}$ is formed by modifying the background junta $\mathbf {h}$ in the following way (roughly speaking; see Section 5.1 for precise details): for each $z \in \mathbf {S}$, we toss a fair coin $\mathbf {b}(z)$ and set the value of all the strings in $z$’s section that lie within Hamming distance $0.4n$ from $z$ (including $z$ itself) to $\mathbf {b}(z)$ (see Figure 7). Note that the value of $\mathbf {g}$ at each string in $\mathbf {S}$ is a fair coin toss, which is completely independent of the background junta $\mathbf {h}.$ Using the choice of $m$ it can be argued (see Section 5.1) that with high probability $\mathbf {g}$ is $1/3$-far from every $k$-junta with respect to $\boldsymbol {\mathcal {D}}$ as $(\mathbf {g},\boldsymbol {\mathcal {D}})\leftarrow {\mathcal {NO}}$.
The rough idea of why a pair $(\mathbf {f},\boldsymbol {\mathcal {D}}) \leftarrow {\mathcal {YES}}$ is difficult for a ($q=2^{k/3}$)-query non-adaptive algorithm $A$ to distinguish from a pair $(\mathbf {g},\boldsymbol {\mathcal {D}}) \leftarrow {\mathcal {NO}}$ is as follows. Intuitively, in order for $A$ to distinguish the no-case from the yes-case, it must obtain two strings $x^1,x^2$ that belong to the same section but are labeled differently. Since there are $2^k$ sections but $q$ is only $2^{k/3}$, by the birthday paradox it is very unlikely that $A$ obtains two such strings among the $q$ samples $\mathbf {y}^1,\dots ,\mathbf {y}^q$ that it is given from the distribution $\boldsymbol {\mathcal {D}}$. In fact, in both the yes and no cases, writing $(\boldsymbol {\phi },\boldsymbol {\mathcal {D}})$ to denote the (function, distribution) pair, the distribution of the $q$ pairs $(\mathbf {y}^1,\boldsymbol {\phi }(\mathbf {y}^1)),\dots ,(\mathbf {y}^q,\boldsymbol {\phi }(\mathbf {y}^q))$ will be statistically very close to $(\mathbf {x}^1,\mathbf {b}_1),\dots ,(\mathbf {x}^q,\mathbf {b}_q)$, where each pair $(\mathbf {x}^j,\mathbf {b}_j)$ is independently drawn uniformly from $\lbrace 0,1\rbrace ^n \times \lbrace 0,1\rbrace$. Intuitively, this translates into the examples $(\mathbf {y}^i,\boldsymbol {\phi }(\mathbf {y}^i))$ from the sampling oracle “having no useful information” about the set $\mathbf {J}$ of variables that the background junta depends on.
What about the $q$ strings $\mathbf {z}^1,\dots ,\mathbf {z}^q$ that $A$ feeds to the black-box oracle? It is also unlikely that any two elements of $\mathbf {y}^1,\dots ,\mathbf {y}^q,\mathbf {z}^1,\dots ,\mathbf {z}^q$ belong to the same section but are labeled differently. Fix an $i \in [q]$; we give some intuition here as to why it is very unlikely that there is any $j$ such that $\mathbf {z}^i$ lies in the same section as $\mathbf {y}^j$ but has $\boldsymbol {\phi }(\mathbf {z}^i) \ne \boldsymbol {\phi }(\mathbf {y}^j)$ (via a union bound, the same intuition handles all $i \in [q]$). Intuitively, since the random examples from the sampling oracle provide no useful information about the set $\mathbf {J}$ defining the background junta, the only thing that $A$ can do in selecting $\mathbf {z}^i$ is to choose how far it lies, in terms of Hamming distance, from the points in $\mathbf {y}^1,\dots ,\mathbf {y}^q$ (which, recall, are uniformly random). Fix $j \in [q]$; if $\mathbf {z}^i$ is within Hamming distance $0.4n$ of $\mathbf {y}^j$, then even if $\mathbf {z}^i$ lies in the same section as $\mathbf {y}^j$ it will be labeled the same way as $\mathbf {y}^j$, whether we are in the yes-case or the no-case. However, if $\mathbf {z}^i$ is farther than $0.4n$ in Hamming distance from $\mathbf {y}^j$, then it is overwhelmingly likely that $\mathbf {z}^i$ will lie in a different section from $\mathbf {y}^j$ (since it is very unlikely that all of the more than $0.4n$ flipped coordinates avoid the $k$-element set $\mathbf {J}$). We prove Theorem 1.2 in Section 5 via a formal argument that proceeds somewhat differently from, but is informed by, the above intuitions.
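As a rough back-of-the-envelope version of this last point (our own informal estimate, not the calculation used in the formal proof of Section 5): if $\mathbf {z}^i$ and $\mathbf {y}^j$ differ on a set $D$ of more than $0.4n$ coordinates and $\mathbf {J}$ is a uniformly random $k$-subset of $[n]$, then
$$\Pr \big [\mathbf {J}\cap D=\emptyset \big ] \;=\; \binom{n-|D|}{k}\Big /\binom{n}{k} \;\le \; \Big (1-\tfrac{|D|}{n}\Big )^{k} \;\le \; (0.6)^{k} \;\le \; 2^{-0.7k},$$
so even after a union bound over all $q^2 = 2^{2k/3}$ pairs $(i,j)$, the probability that some black-box query at Hamming distance more than $0.4n$ from some sample still lands in that sample's section is at most $2^{2k/3-0.7k}=2^{-\Omega (k)}$.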
Organization. In Section 2, we define the distribution-free testing model and introduce some useful notation. In Section 3 we present SimpleDJunta and prove Lemma 3.2 in its analysis. In Section 4, we present our main algorithm MainDJunta and prove Theorem 1.1, and in Section 5, we prove Theorem 1.2.
Notation. We use $[n]$ to denote $\lbrace 1,\ldots ,n\rbrace$. We use $f$ and $g$ to denote Boolean functions, which are maps from $\lbrace 0,1\rbrace ^n$ to $\lbrace 0,1\rbrace$ for some positive integer $n$. We use the calligraphic font (e.g., $\mathcal {D}$ and ${\mathcal {NO}}$) to denote probability distributions, boldface letters such as $\mathbf {x}$ to denote random variables, and write “$\mathbf {x}\leftarrow \mathcal {D}$” to indicate that $\mathbf {x}$ is a random variable drawn from a distribution $\mathcal {D}.$ We write $\mathbf {x}\leftarrow \lbrace 0,1\rbrace ^n$ to denote that $\mathbf {x}$ is a string drawn uniformly at random. Given $S\subseteq [n]$, we also write $\mathbf {R}\leftarrow S$ to indicate that $\mathbf {R}$ is a subset of $S$ drawn uniformly at random, i.e., each index $i\in S$ is included in $\mathbf {R}$ independently with probability $1/2$.
Given a subset $B \subseteq [n]$, we use $\overline{B}$ to denote its complement with respect to $[n]$, and $\lbrace 0,1\rbrace ^B$ to denote the set of all binary strings of length $|B|$ with coordinates indexed by $i\in B$. Given an $x \in \lbrace 0,1\rbrace ^n$ and a $B \subseteq [n]$, we write $x_B\in \lbrace 0,1\rbrace ^B$ to denote the projection of $x$ over coordinates in $B$ and $x^{(B)}\in \lbrace 0,1\rbrace ^n$ to denote the string obtained from $x$ by flipping coordinates in $B$. Given $x\in \lbrace 0,1\rbrace ^B$ and $y\in \lbrace 0,1\rbrace ^{\overline{B}}$, we write $x\circ y\in \lbrace 0,1\rbrace ^n$ to denote their concatenation, a string that agrees with $x$ over coordinates in $B$ and agrees with $y$ over $\overline{B}$. (As an example of the notation, given $x,y\in \lbrace 0,1\rbrace ^n$ and $B\subseteq [n]$, $x_B\circ y_{\overline{B}}$ denotes the string that agrees with $x$ over $B$ and with $y$ over $\overline{B}$.) Given $x,y \in \lbrace 0,1\rbrace ^n$, we write $d(x,y)$ to denote the Hamming distance between $x$ and $y$, and $\mathsf {diff}(x,y)\subseteq [n]$ to denote the set of coordinates $i\in [n]$ with $x_i\ne y_i$.
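For concreteness, the small Python helpers below (our own illustrative code; strings are represented as tuples of bits indexed by $0,\ldots ,n-1$) implement this notation and are reused in later sketches.

```python
def flip(x, B):
    """x^(B): flip the coordinates of x (a tuple of bits) indexed by the set B."""
    return tuple(1 - b if i in B else b for i, b in enumerate(x))

def project(x, B):
    """x_B: the projection of x onto the coordinates in B, as a dict {i: x_i}."""
    return {i: x[i] for i in B}

def concat(xB, yBbar, n):
    """x o y: the n-bit string agreeing with xB on B and with yBbar on the complement of B."""
    return tuple(xB[i] if i in xB else yBbar[i] for i in range(n))

def hamming(x, y):
    """d(x, y): the Hamming distance between x and y."""
    return sum(a != b for a, b in zip(x, y))

def diff(x, y):
    """diff(x, y): the set of coordinates on which x and y disagree."""
    return {i for i, (a, b) in enumerate(zip(x, y)) if a != b}
```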
Given $f,g:\lbrace 0,1\rbrace ^n\rightarrow \lbrace 0,1\rbrace$ and a probability distribution $\mathcal {D}$ over $\lbrace 0,1\rbrace ^n$, we write $\mathsf {dist}_{\mathcal {D}}(f,g):=\Pr _{\mathbf {x}\leftarrow \mathcal {D}}[f(\mathbf {x})\ne g(\mathbf {x})]$ to denote the distance between $f$ and $g$ with respect to $\mathcal {D}$, and we say that $f$ is $\epsilon$-far from a class $\mathcal {C}$ of Boolean functions with respect to $\mathcal {D}$ if $\mathsf {dist}_{\mathcal {D}}(f,g)\ge \epsilon$ for every $g\in \mathcal {C}$.
We will often work with restrictions of Boolean functions. Given $f: \lbrace 0,1\rbrace ^n \rightarrow \lbrace 0,1\rbrace$, $B \subseteq [n]$, and a string $z\in \lbrace 0,1\rbrace ^B$, the restriction of $f$ over $B$ by $z$ is the Boolean function $g: \lbrace 0,1\rbrace ^{\overline{B}} \rightarrow \lbrace 0,1\rbrace$ defined by $g(x) = f(x\circ z)$ for all $x\in \lbrace 0,1\rbrace ^{\overline{B}}$.
Distribution-free property testing. Now, we can define distribution-free property testing:
We say an algorithm $A$ has oracle access to a pair $(f,\mathcal {D})$, where $f:\lbrace 0,1\rbrace ^n\rightarrow \lbrace 0,1\rbrace$ is an unknown Boolean function and $\mathcal {D}$ is an unknown probability distribution over $\lbrace 0,1\rbrace ^n$, if it can (1) access $f$ via a black-box oracle that returns $f(x)$ when a string $x\in \lbrace 0,1\rbrace ^n$ is queried, and (2) access $\mathcal {D}$ via a sampling oracle that, upon each request, returns a pair $(\mathbf {x},f(\mathbf {x}))$, where $\mathbf {x}\leftarrow \mathcal {D}$ independently.
Let $\mathcal {C}$ be a class of Boolean functions. A distribution-free testing algorithm $A$ for $\mathcal {C}$ is a randomized algorithm that, given as input a distance parameter $\epsilon \gt 0$ and oracle access to a pair $(f,\mathcal {D})$, accepts with probability at least $2/3$ if $f\in \mathcal {C}$ and rejects with probability at least $2/3$ if $f$ is $\epsilon$-far from $\mathcal {C}$ with respect to $\mathcal {D}$. We say $A$ is one-sided if it always accepts when $f\in \mathcal {C}$. The query complexity of a distribution-free testing algorithm is the number of queries made on $f$ plus the number of samples drawn from $\mathcal {D}$.
One may assume without loss of generality that a distribution-free testing algorithm consists of two phases: In the first phase, the algorithm draws a certain number of sample pairs $(\mathbf {x},f(\mathbf {x}))$ from $\mathcal {D}$; in the second phase, it makes black-box queries to $f$. In general, a query $x\in \lbrace 0,1\rbrace ^n$ made by the algorithm in the second phase may depend on sample pairs it receives in the first phase (e.g. it can choose to query a string that is close to a sample received in the first phase) and results of queries to $f$ made so far. In Section 5, we will prove lower bounds on non-adaptive distribution-free testing algorithms. An algorithm is said to be non-adaptive if its black-box queries made in the second phase do not depend on results of previous black-box queries, i.e., all queries during the second phase can be made in a single batch (though we emphasize that they may depend on samples the algorithm received in the first phase).
Juntas and literals. We study the distribution-free testing of the class of $k$-juntas. Recall that a Boolean function $f$ is a $k$-junta if it depends on at most $k$ variables. More precisely, $f$ is a $k$-junta if there exists a list $1 \le i_1 \lt \cdots \lt i_{k} \le n$ of $k$ indices and a Boolean function $g: \lbrace 0,1\rbrace ^{k} \rightarrow \lbrace 0,1\rbrace$ over $k$ variables such that $f(x_1,\dots ,x_n) = g(x_{i_1},\dots ,x_{i_{k}})$ for all $x \in \lbrace 0,1\rbrace ^n$.
We say that a Boolean function $f$ is a literal if $f$ depends on exactly one variable, i.e., for some $i \in [n]$, we have that either $f(x) = x_i$ for all $x$ or $f(x)= \overline{x_i}$ for all $x$. Note that the two constant (all-1 and all-0) functions are 1-juntas but are not literals.
We often use the term “block” to refer to a nonempty subset of $[n]$, which should be interpreted as a nonempty subset of the $n$ variables of a Boolean function $f:\lbrace 0,1\rbrace ^n\rightarrow \lbrace 0,1\rbrace$. The following definition of distinguishing pairs and relevant blocks will be heavily used in our algorithms.
Given $x,y\in \lbrace 0,1\rbrace ^n$ and a block $B \subseteq [n]$, we say that $(x,y)$ is a distinguishing pair for $B$ if $x_{\overline{B}} = y_{\overline{B}}$ and $f(x) \ne f(y)$. We say $B$ is a relevant block of $f$ if such a distinguishing pair exists for $B$ (or equivalently, the influence of $B$ in $f$ is positive).
When $B=\lbrace i\rbrace$ is a relevant block, we simply say that the $i$th variable is relevant to $f$.
As will become clear later, all our algorithms reject a function $f$ only when they have found $k+1$ pairwise disjoint blocks $B_1,\ldots ,B_{k+1}$ and a distinguishing pair for each $B_i$. When this occurs, it means that $B_1,\ldots ,B_{k+1}$ are pairwise disjoint relevant blocks of $f$, which implies that $f$ cannot be a $k$-junta. As a result, our algorithms are one-sided. To prove their correctness, it suffices to show that they reject with probability at least $2/3$ when $f$ is $\epsilon$-far from $k$-juntas with respect to $\mathcal {D}$.
For the standard property testing model under the uniform distribution, Blais (2009) obtained a nearly optimal algorithm:
There exists a one-sided, $O((k/\epsilon) + k\log k)$-query algorithm UniformJunta $(f,k,\epsilon)$ that rejects $f$ with probability at least $2/3$ when it is $\epsilon$-far from $k$-juntas under the uniform distribution. Moreover, it rejects only when it has found $k+1$ pairwise disjoint blocks and a distinguishing pair of $f$ for each of them.
Binary Search. The standard binary search procedure (see Figure 1) takes as input two strings $\smash{x,y\in \lbrace 0,1\rbrace ^n}$ with $\smash{f(x)\ne f(y)}$, makes $\smash{O(\log n)}$ queries on $f$, and returns a pair of strings $\smash{x^{\prime },y^{\prime }} \in \lbrace 0,1\rbrace ^n$ with $f(x^{\prime })\ne f(y^{\prime })$ and $x^{\prime }=y^{\prime (i)}$ for some $i\in \mathsf {diff}(x,y)$, i.e., a distinguishing pair for the $i$th variable for some $i\in \mathsf {diff}(x,y)$.
However, we cannot afford to use the standard binary search procedure directly in our main algorithm due to its query complexity of $O(\log n)$; recall, our goal is to have the query complexity depend on $k$ only. Instead, we will employ a blockwise version of the binary search procedure, as described in Figure 2. It takes as input two strings $x,y\in \lbrace 0,1\rbrace ^n$ with $f(x)\ne f(y)$ and a sequence of pairwise disjoint blocks $B_1,\ldots ,B_r$ such that $\mathsf {diff}(x,y)\subseteq B_1\cup \cdots \cup B_r$; using $O(\log r)$ queries, it returns a distinguishing pair of $f$ for one of the blocks $B_j$, $j\in [r]$.
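The following Python sketch shows one way to implement this blockwise search (our own rendering of the idea; the precise procedures are given in Figures 1 and 2). Running it with the singleton blocks $\lbrace i\rbrace$ for $i\in \mathsf {diff}(x,y)$ recovers the standard coordinate-wise binary search.

```python
def block_binary_search(f, x, y, blocks):
    """Given f(x) != f(y) and pairwise disjoint blocks (a list of sets of coordinates)
    whose union contains diff(x, y), return a pair (x', y') with f(x') != f(y') such
    that x' and y' differ only inside a single block, using O(log(len(blocks)))
    queries to f.  With singleton blocks this is the usual coordinate-wise search.
    Strings are tuples of bits."""
    assert f(x) != f(y)
    while len(blocks) > 1:
        half = blocks[:len(blocks) // 2]
        moved = set().union(*half)
        # z agrees with y on the blocks in `half` and with x everywhere else
        z = tuple(y[i] if i in moved else x[i] for i in range(len(x)))
        if f(z) != f(x):
            y, blocks = z, half                        # the disagreement is inside `half`
        else:
            x, blocks = z, blocks[len(blocks) // 2:]   # f(z) = f(x) != f(y): second half
    return x, y
```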
As a warmup, we present in this section a simple, one-sided distribution-free algorithm for testing $k$-juntas (SimpleDJunta, where the capital letter $D$ is a shorthand for distribution-free). It uses $O((k/ \epsilon) + k \log n)$ queries, where $n$ as usual denotes the number of variables of the function being tested. The idea behind SimpleDJunta and its analysis (Lemma 3.2 below) will be useful in the next section where we present our main algorithm to remove the dependency on $n$.
The algorithm SimpleDJunta maintains a set $I\subset [n]$, which is such that a distinguishing pair has been found for each $i\in I$ (i.e., $I$ is a set of relevant variables of $f$ discovered so far). The algorithm sets $I=\emptyset$ at the beginning and rejects only when $|I|$ reaches $k+1$, which implies immediately that the algorithm is one-sided. SimpleDJunta proceeds round by round. In each round it draws a pair of random strings $\mathbf {x}$ and $\mathbf {y}$ with $\mathbf {x}_I=\mathbf {y}_I$. If $f(\mathbf {x})\ne f(\mathbf {y})$, then the standard binary search procedure is used on $\mathbf {x}$ and $\mathbf {y}$ to find a distinguishing pair for a new variable $i\in \overline{I}$, which is then added to $I$.
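A minimal Python sketch of this loop is given below (Figure 3 contains the precise pseudocode); the oracle name `sample_D`, the exact round count, and the reuse of `block_binary_search` from the sketch in Section 2 are choices of the sketch rather than part of the formal algorithm.

```python
import math
import random

def simple_d_junta(f, sample_D, n, k, eps):
    """One-sided distribution-free k-junta tester following the loop described above.

    f         -- black-box oracle, f(x) in {0, 1} for an n-bit tuple x
    sample_D  -- returns a fresh draw x from the unknown distribution D
    Reuses block_binary_search from the sketch in Section 2 (run here with
    singleton blocks, i.e., the standard binary search of Figure 1).
    """
    I = set()                                        # relevant variables found so far
    for _ in range(math.ceil(8 * (k + 1) / eps)):    # the round count used in the analysis
        x = sample_D()
        R = {i for i in range(n) if i not in I and random.random() < 0.5}
        y = tuple(1 - b if i in R else b for i, b in enumerate(x))    # y = x^(R)
        if f(x) != f(y):
            # binary search over R = diff(x, y) yields a new relevant variable
            u, v = block_binary_search(f, x, y, [{i} for i in R])
            (i,) = {j for j in range(n) if u[j] != v[j]}
            I.add(i)
            if len(I) >= k + 1:
                return "reject"
    return "accept"
```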
The description of the algorithm can be found in Figure 3. The following theorem establishes its correctness.
(i) The algorithm SimpleDJunta makes $O(({k}/{\epsilon })+k\log n)$ queries and always accepts when $f$ is a $k$-junta. (ii) It rejects with probability at least $2/3$ if $f$ is $\epsilon$-far from $k$-juntas with respect to $\mathcal {D}$.
For part (i), note that the algorithm only runs binary search (and spends $O(\log n)$ queries) when $f(\mathbf {x})\ne f(\mathbf {y})$ and this happens at most $k+1$ times (even though the algorithm has $O(k/\epsilon)$ rounds). The rest of part (i) is immediate from the description of the algorithm.
For part (ii), it suffices to show that when $|I|\le k$ at the beginning of a round, a new relevant variable is discovered in this round with at least a modestly large probability. For this purpose, we use the following simple but crucial lemma and note the fact that $\mathbf {x}$ and $\mathbf {y}$ on line 3 can be equivalently drawn by first sampling $\mathbf {x}\leftarrow \mathcal {D}$ and $\mathbf {w}\leftarrow \lbrace 0,1\rbrace ^n$ and then setting $\mathbf {y}=\mathbf {x}_I\circ \mathbf {w}_{\overline{I}}$ (the way we draw $\mathbf {x}$ and $\mathbf {y}$ in Figure 3 via $\mathbf {R}\leftarrow \overline{I}$ makes it easier to connect with the main algorithm in the next section).
If $f$ is $\epsilon$-far from $k$-juntas with respect to $\mathcal {D}$, then for any $I\subset [n]$ of size at most $k$, we have
$$\Pr _{\mathbf {x}\leftarrow \mathcal {D},\ \mathbf {R}\leftarrow \overline{I}}\big [f(\mathbf {x})\ne f(\mathbf {x}^{(\mathbf {R})})\big ]\ \ge \ \frac{\epsilon }{2}. \qquad (1)$$
Before proving Lemma 3.2, we use it to finish the proof of part (ii). Assuming Lemma 3.2 and that $f$ is $\epsilon$-far from $k$-juntas with respect to $\mathcal {D}$, for each round in which $|I| \le k$ the algorithm finds a new relevant variable with probability at least $\epsilon /2$. Using a coupling argument, the probability that the algorithm rejects $f$ (i.e., $|I|$ reaches $k+1$ during the $8(k+1)/\epsilon$ rounds) is at least the probability that a sum of $8(k+1)/\epsilon$ independent indicator random variables, each equal to $1$ with probability $\epsilon /2$, is at least $k+1$; since this sum has expectation $4(k+1)$, a Chernoff bound shows that this probability is at least $2/3$.
Let $I$ be a subset of $[n]$ of size at most $k$. To prove Equation (1) for $I$, we use $I$ to define the following Boolean function $h:\lbrace 0,1\rbrace ^n\rightarrow \lbrace 0,1\rbrace$ over $n$ variables: for each $x\in \lbrace 0,1\rbrace ^n$, we set
Furthermore, we have
In this section, we present our main $ {\tilde{O}(k^2)/\epsilon }$-query algorithm for the distribution-free testing of $k$-juntas. We start with some intuition behind the algorithm.
Recall that the factor of $\log n$ in the query complexity of SimpleDJunta from the previous section is due to the use of the standard binary search procedure. To avoid it, one could choose to terminate each call to binary search early, but this ends up giving us relevant blocks of variables instead of relevant variables. To highlight the challenge, imagine that the algorithm has so far found $\ell \le k$ pairwise disjoint relevant blocks $B_j$, $j\in [\ell ]$; i.e., it has found a distinguishing pair for each block $B_j$. By definition, each $B_j$ must contain at least one relevant variable $i_j\in B_j$. However, we do not know exactly which variable in $B_j$ is $i_j$, and thus it is not clear how to draw a set $\mathbf {R}$ from $\overline{I}$ uniformly at random, where $I=\lbrace i_j:j\in [\ell ]\rbrace$, as on line 3 of SimpleDJunta, to apply Lemma 3.2 to discover a new relevant block. It seems that we are facing a dilemma when trying to improve SimpleDJunta to remove the $\log n$ factor: unless we pin down a set of relevant variables, it is not clear how to draw a random set from their complement, but pinning down a single relevant variable using the standard binary search procedure would already cost $\log n$ queries.
To explain the main idea behind our $ {\tilde{O}(k^2)/\epsilon }$-query algorithm, let us assume again that $\ell \le k$ many disjoint relevant blocks $B_j$ have been found so far, with a distinguishing pair $(x^{[j]},y^{[j]})$ for each $B_j$ (satisfying that $\mathsf {diff}(x^{[j]},y^{[j]})\subseteq B_j$ and $f(x^{[j]})\ne f(y^{[j]})$ by definition). Let
To make progress, we draw a random two-way partition of each $B_j$ into $\mathbf {P}_j$ and $\mathbf {Q}_j$, i.e., each $i\in B_j$ is added to $\mathbf {P}_j$ or $\mathbf {Q}_j$ with probability $1/2$ (so they are disjoint and $B_j=\mathbf {P}_j\cup \mathbf {Q}_j$). We make three simple but crucial observations to increase the number of disjoint relevant blocks by one.
Coming back to the assumption we made earlier, although $g_j$ is very unlikely to be a literal, it must fall into one of the following three cases: (1) close to a literal; (2) close to a (all-0 or all-1) constant function; or (3) far from 1-juntas. Here in all cases “close” and “far” mean with respect to the uniform distribution over $\lbrace 0,1\rbrace ^{B_j}$. As we discuss in more detail in the rest of the section, with some more careful probability analysis the above arguments generalize to the case in which every $g_j$ is only close to (rather than exactly) a literal. However, if one of the blocks $B_j$ is in case (2) or (3), then (using the fact that we have a distinguishing pair for $B_j$) it is easy to split $B_j$ into two blocks and find a distinguishing pair for each of them. (For example, for case (3) this can be done by running Blais's uniform distribution junta testing algorithm.) As a result, we can always make progress by increasing the number of pairwise disjoint relevant blocks by one. Our algorithm basically keeps repeating these steps until the number of such blocks reaches $k+1$.
Our algorithm MainDJunta $(f,\mathcal {D},k,\epsilon)$ is described in Figure 4. It maintains two collections of blocks $V=\lbrace B_1,\ldots ,B_v\rbrace$ ($V$ for “verified”) and $U=\lbrace C_1,\ldots ,C_u\rbrace$ ($U$ for “unverified”) for some nonnegative integers $v$ and $u$. They are set to be $\emptyset$ at initialization and always satisfy the following properties:
The algorithm rejects only when the total number of blocks satisfies $v+u\ge k+1$, so it is one-sided.
Throughout the algorithm and its analysis, we set a key parameter $ \gamma := 1/(8k).$ Blocks in $V$ are intended to be those that have been “verified” to satisfy the condition that $g_j$ is $\gamma$-close to a literal (for some unknown variable $i_j\in B_j$) under the uniform distribution, while blocks in $U$ have not been verified yet so they may or may not satisfy the condition. More formally, at any point in the execution of the algorithm we say that the algorithm is in good condition if its current collections $V$ and $U$ satisfy conditions (A), (B), and
The algorithm MainDJunta $(f,\mathcal {D},k,\epsilon)$ starts with $V=U=\emptyset$ and proceeds round by round. For each round, we consider two different types that the round may have: type 1 is that $u=0$, and type 2 is that $u\gt 0$. In a type-1 round (with $u=0$), we follow the idea sketched in Section 4.1 to increase the total number of disjoint relevant blocks by one. We prove the following lemma for this case in Section 4.3.
Assume that MainDJunta is in good condition at the beginning of a round, with $u=0$ and $v\le k$. Then it must remain in good condition at the end of this round. Moreover, letting $V^{\prime }$ and $U^{\prime }$ be the two collections of blocks at the end of this round, we have either $|V^{\prime }|=v$ and $|U^{\prime }|=1$, or $|V^{\prime }|=v-1$ and $|U^{\prime }|=2$ with probability at least $\epsilon /4$.
In a type-2 round (with $u\ge 1$), we pick an arbitrary block $C$ from $U$ and check whether $g_C$ is close to a literal under the uniform distribution. The following lemma, which we prove in Section 4.4, shows that with high probability, either $C$ is moved to collection $V$ and the algorithm remains in good condition, or the algorithm finds two disjoint subsets of $C$ and a distinguishing pair for each of them so that $V$ stays the same but $|U|$ goes up by one (we add these two blocks to $U$, since they have not yet been verified).
Assume that MainDJunta is in good condition at the beginning of a round, with $u\gt 0$ and $v+u\le k$. Then with probability at least $1-1/{(64k)}$, one of the following two events occurs at the end of this round (letting $V^{\prime }$ and $U^{\prime }$ be the two collections of blocks at the end of this round):
Assuming Lemmas 4.1 and 4.2, we are ready to prove the correctness of MainDJunta.
(i) The algorithm MainDJunta makes $ {\tilde{O}(k^2)/\epsilon }$ queries and always accepts $f$ when it is a $k$-junta. (ii) It rejects with probability at least $2/3$ when $f$ is $\epsilon$-far from every $k$-junta with respect to $\mathcal {D}$.
MainDJunta is one-sided, since it rejects $f$ only when it has found $k+1$ pairwise disjoint relevant blocks of $f$. Its query complexity is
For this purpose, we introduce a simple potential function $F$ to measure the progress:
Note that $F$ is 0 at the beginning ($V=U=\emptyset$) and that we must have $|U|+|V|\ge k+1$ (and thus, the algorithm rejects) when the potential function $F$ reaches $ {3(k+1)}$ or above. As a result, a necessary condition for the algorithm to accept is that one of the following two events occurs:
By a union bound, the probability of $E_1$ is at most
We start with a lemma for the subroutine WhereIsTheLiteral, which is described in Figure 5.
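Since Figure 5 is not reproduced here, the following Python sketch shows one natural way such a check can be carried out (a simplified variant of our own, so its queries and constants need not match the actual procedure).

```python
import random

def where_is_the_literal(g, B, P, Q):
    """Guess which side of the partition (P, Q) of the block B contains the literal's variable.

    g is treated as a black box on {0,1}^B (here: it reads only the coordinates in B of an
    n-bit tuple).  If g is close to a literal x_i or its negation, then flipping the side
    containing i flips g's value on most inputs, while flipping the other side does not.
    Returns ("P", pair) or ("Q", pair), where pair is a distinguishing pair of g for that
    side, or None if this sample was inconclusive (the caller may retry).
    """
    n = max(B) + 1
    x = tuple(random.randint(0, 1) if i in B else 0 for i in range(n))
    xP = tuple(1 - b if i in P else b for i, b in enumerate(x))   # x with P flipped
    xQ = tuple(1 - b if i in Q else b for i, b in enumerate(x))   # x with Q flipped
    if g(x) != g(xP):
        return ("P", (x, xP))   # flipping P changed the value: the literal appears to be in P
    if g(x) != g(xQ):
        return ("Q", (x, xQ))   # flipping Q changed the value: the literal appears to be in Q
    return None
```

If $g$ is $\gamma$-close to a literal on a variable in, say, $Q$, then each of $x$, $x^{(P)}$, and $x^{(Q)}$ above is uniform over $\lbrace 0,1\rbrace ^{B}$, so with probability at least $1-3\gamma$ none of them falls in the set of disagreements and the sketch returns a distinguishing pair for $Q$. This is only meant to convey the idea; the lemma below analyzes the actual procedure of Figure 5.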
Assume that $g:\lbrace 0,1\rbrace ^B\rightarrow \lbrace 0,1\rbrace$ is $\gamma$-close (with respect to the uniform distribution) to a literal $x_i$ or $\overline{x_i}$ for some $i\in B$. If $i\in P$, then WhereIsTheLiteral $(g,P,Q)$ returns a distinguishing pair of $g$ for $P$ with probability at least $1-4\gamma$; if $i\in Q$, then it returns a distinguishing pair of $g$ for $Q$ with probability at least $1-4\gamma$.
Let $K$ be the set of strings $x\in \lbrace 0,1\rbrace ^B$ such that $g(x)$ disagrees with the literal to which it is $\gamma$-close (so $|K|\le \gamma \cdot 2^{|B|}$). We work on the case when $i\in Q$; the case when $i\in P$ is similar.
By the description of WhereIsTheLiteral, it returns a distinguishing pair for $Q$ if
We are now ready to prove Lemma 4.1.
First, it is easy to verify that if the algorithm starts a round in good condition, then it ends it in good condition. This is because whenever a block is added to $U$, it is disjoint from other blocks, and we have found a distinguishing pair for it.
Next it follows directly from Lemma 4.4 and a union bound that, for any sequence of partitions $P_j$ and $Q_j$ of $B_j$ picked on line 6, the probability that the for-loop correctly sets $\mathbf {S}_j$ to be the one that contains the special variable $i_j$ for all $j\in [v]$ is at least (recalling that $\gamma =1/(8k)$)
First, it follows from the description of the subroutine Literal $(g)$ in Figure 6 that it either returns “true” or a pair of nonempty disjoint subsets $C^{\prime },C^*$ of $C$ and a distinguishing pair of $g$ for each of them (see Theorem 2.3). Next, let $C \in U$ be the block picked in line 20. If $g$ is $\gamma$-close to a literal, then it is easy to verify that one of the two events described in Lemma 4.2 must hold (using the property of Literal $(g)$ above). So, we focus on the other two cases in the rest of the proof: $g$ is $\gamma$-far from 1-juntas or $g$ is $\gamma$-close to a (all-1 or all-0) constant function. In both cases, we show below that the second event happens with high probability.
When $g$ is $\gamma$-far from 1-juntas under the uniform distribution, we have that one of the $ {\log k+6}$ calls to UniformJunta in Literal rejects with probability at least
When $g$ is $\gamma$-close to a constant function (say the all-1 function), we have that either string $x$ or $y$ in the distinguishing pair for $C$ disagrees with the function (say $g(x)=0$, since $g(x)\ne g(y)$). Let $K$ be the set of strings in $\lbrace 0,1\rbrace ^C$ that disagree with the all-1 function. Then line 7 of Literal $(g)$ does not hold only when one of $x^{(\mathbf {C}^{\prime })}$ or $x^{(\mathbf {C}^*)}$ lies in $K$. As both strings are distributed uniformly over $\lbrace 0,1\rbrace ^C$ by themselves, this happens with probability at most $2\gamma$ by a union bound. Therefore, the probability that line 7 holds at least once is at least
This finishes the proof of Lemma 4.2.
In this section, we prove the $\Omega (2^{k/3})$ lower bound for the non-adaptive distribution-free testing of $k$-juntas that was stated as Theorem 1.2. We start with some notation. Given a sequence $Y=(y^i:$ $i\in [q])$ of $q$ strings in $\lbrace 0,1\rbrace ^n$ and a Boolean function $\phi :\lbrace 0,1\rbrace ^n\rightarrow \lbrace 0,1\rbrace$, we write $\phi (Y)$ to denote the $q$-bit string $\alpha$ with $\alpha _i=\phi (y^i)$ for each $i\in [q]$. We also write $\mathbf {Y}=(\mathbf {y}^i:i\in [q])\leftarrow \mathcal {D}^q$ to denote a sequence of $q$ independent draws from the same probability distribution $\mathcal {D}$.
Let $k$ and $n$ be two positive integers that satisfy $k\le n/200$. We may further assume that $k$ is at least some absolute constant $C$ (to be specified later), since otherwise, the claimed $\Omega (2^{k/3})$ lower bound on query complexity holds trivially due to the constant hidden behind the $\Omega$. Let $q=2^{k/3}.$ For convenience, we refer to an algorithm as a $q$-query algorithm if it makes $q$ sample queries and $q$ black-box queries each. Such algorithms are clearly at least as powerful as those that make $q$ queries in total. Our goal is then to show that there exists no $q$-query non-adaptive (randomized) algorithm for the distribution-free testing of $k$-juntas over Boolean functions of $n$ variables, even when the distance parameter $\epsilon$ is $1/3$.
By Yao's minimax principle, we focus on $q$-query non-adaptive deterministic algorithms. Such an algorithm $A$ (which consists of two deterministic maps $A_1$ and $A_2$ as discussed below) works as follows. Upon an input pair $(\phi ,\mathcal {D})$, where $\phi :\lbrace 0,1\rbrace ^n\rightarrow \lbrace 0,1\rbrace$ and $\mathcal {D}$ is a probability distribution over $\lbrace 0,1\rbrace ^n$, the algorithm receives in the first phase a sequence $Y=(y^{i}:i\in [q])$ of $q$ strings (which should be thought of as samples from $\mathcal {D}$) and a binary string $\alpha =\phi (Y)$ of length $q$. In the second phase, the algorithm $A$ uses the first map $A_1$ to obtain a sequence of $q$ strings $Z=(z^{i}:i\in [q])=A_1(Y,\alpha)$ and feeds them to the black-box oracle. Once the query results $\beta =\phi (Z)$ are back, $A_2(Y,\alpha ,\beta)$ returns either 0 or 1 (notice that we do not need to include $Z$ as an input of $A_2$, since it is determined by $Y$ and $\alpha$), in which cases the algorithm $A$ either rejects or accepts, respectively. A randomized algorithm $T$ works similarly and consists of two similar maps $T_1$ and $T_2$ but both are randomized.
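In code form, this two-phase structure can be pictured as follows (an illustrative Python sketch; `A1` and `A2` are the two deterministic maps and `phi`, `sample_D` model the hidden input pair).

```python
def run_nonadaptive_tester(A1, A2, phi, sample_D, q):
    """Run a q-query non-adaptive deterministic tester (A1, A2) on a hidden pair (phi, D).

    Phase 1: draw q labeled samples from the sampling oracle.
    Phase 2: A1 chooses all q black-box queries at once from (Y, alpha); A2 then decides
             from (Y, alpha, beta), so no query may depend on earlier black-box answers.
    Returns True for accept (A2 outputs 1) and False for reject (A2 outputs 0).
    """
    Y = [sample_D() for _ in range(q)]       # the samples y^1, ..., y^q drawn from D
    alpha = [phi(y) for y in Y]              # their labels phi(y^i)
    Z = A1(Y, alpha)                         # all q black-box query strings, in one batch
    beta = [phi(z) for z in Z]               # the black-box answers phi(z^j)
    return A2(Y, alpha, beta) == 1
```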
Given the description above, unlike typical deterministic algorithms, whether $A$ accepts or not depends not only on $(\phi ,\mathcal {D})$ but also on the sample strings $\mathbf {Y}\leftarrow \mathcal {D}^q$ it draws. Formally, we have
The plan of the rest of the section is as follows. We define in Section 5.1 a pair of probability distributions ${\mathcal{YES}}$ and $\mathcal {NO}$ over pairs $(\phi ,\mathcal {D})$, where $\phi$ is a Boolean function over $n$ variables and $\mathcal {D}$ is a distribution over $\lbrace 0,1\rbrace ^n$. For clarity, we use $(f,\mathcal {D})$ to denote pairs in the support of ${\mathcal{YES}}$ and $(g,\mathcal {D})$ to denote pairs in the support of $\mathcal {NO}$. We show that (1) Every $(f,\mathcal {D})$ in the support of ${\mathcal{YES}}$ satisfies that $f$ is a $k$-junta (Lemma 5.2); (2) With probability $1-o_k(1)$, $(\mathbf {g},\boldsymbol {\mathcal {D}})\leftarrow \mathcal {NO}$ satisfies that $\mathbf {g}$ is $1/3$-far from every $k$-junta with respect to $\boldsymbol {\mathcal {D}}$ (Lemma 5.3). To obtain Theorem 1.2, it suffices to prove the following main technical lemma, which informally says that any $q$-query non-adaptive deterministic algorithm must behave similarly when it is run on $(\mathbf {f},\boldsymbol {\mathcal {D}}) \leftarrow {\mathcal{YES}}$ versus $(\mathbf {g},\boldsymbol {\mathcal {D}}) \leftarrow \mathcal {NO}$:
Any $q$-query deterministic algorithm $A$ satisfies
Assume for a contradiction that there exists a $q$-query non-adaptive randomized algorithm $T$ for the distribution-free testing of $k$-juntas over $n$-variable Boolean functions when $\epsilon =1/3$. Then it follows from Lemmas 5.2 and 5.3 that
Given $J\subseteq [n]$, we partition $\lbrace 0,1\rbrace ^n$ into sections (with respect to $J$) where the $z$-section, $z\in \lbrace 0,1\rbrace ^J$, consists of those $x\in \lbrace 0,1\rbrace ^n$ that have $x_J=z$. We write $\mathcal {JUNTA}_J$ to denote the uniform distribution over all juntas over $J$. More precisely, a Boolean function $\mathbf {h}:\lbrace 0,1\rbrace ^n\rightarrow \lbrace 0,1\rbrace$ drawn from $\mathcal {JUNTA}_J$ is generated as follows: For each $z\in \lbrace 0,1\rbrace ^J$, a bit $\mathbf {b}(z)$ is chosen independently and uniformly at random, and for each $x \in \lbrace 0,1\rbrace ^n$ the value of $\mathbf {h}(x)$ is set to $\mathbf {b}(x_J)$. Let
We start with ${\mathcal{YES}}$. A pair $(\mathbf {f},\boldsymbol {\mathcal {D}})$ drawn from ${\mathcal{YES}}$ is generated as follows:
For technical reasons that will become clear in Section 5.2 we use ${\mathcal{YES}}^*$ to denote the probability distribution supported over triples $(f,\mathcal {D},J)$, with $(\mathbf {f},\boldsymbol {\mathcal {D}},\mathbf {J})\leftarrow {\mathcal{YES}}^*$ being generated by the same two steps above (so, the only difference is that we include $\mathbf {J}$ in elements of ${\mathcal{YES}}^*$).
The following observation is straightforward from the definition of ${\mathcal{YES}}$.
The function $f$ is a $k$-junta for every pair $(f,\mathcal {D})$ in the support of ${\mathcal{YES}}$.
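The following Python sketch (our own rendering of the construction summarized in Section 1.2; the parameter `m` is the $\Theta (2^k\log n)$ value fixed above) generates a pair from ${\mathcal{YES}}$ together with the set $\mathbf {J}$, i.e., a draw from ${\mathcal{YES}}^*$.

```python
import random

def sample_yes_pair(n, k, m):
    """Sample (f, D, J) roughly as in the YES* distribution described above.

    D is represented by the list S of m uniform strings (D is uniform over S), and
    f is the background junta h drawn from JUNTA_J, realized via a lazily filled
    random truth table over the k coordinates in J.
    """
    J = sorted(random.sample(range(n), k))            # random set of k relevant variables
    table = {}                                        # truth table b(z), filled on demand
    def f(x):
        z = tuple(x[i] for i in J)                    # the section of x, i.e., x_J
        if z not in table:
            table[z] = random.randint(0, 1)           # a fresh uniform bit b(z)
        return table[z]
    S = [tuple(random.randint(0, 1) for _ in range(n)) for _ in range(m)]
    return f, S, J
```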
We now describe $\mathcal {NO}.$ A pair $(\mathbf {g},\boldsymbol {\mathcal {D}})$ drawn from $\mathcal {NO}$ is generated as follows:
Similarly, we let $\mathcal {NO}^*$ denote the distribution supported on triples $(g,\mathcal {D},J)$ as generated above.
See Figure 7 for an illustration of a function drawn from $\mathcal {NO}$. To gain some intuition, we first note that for about half of the strings $z\in S$, $g(z)$ disagrees with the value of the background junta on the section in which $z$ lies. With such a string $z$ in hand (from one of the samples received in the first phase), an algorithm may attempt to find a string $w$ that lies in the same section as $z$ but satisfies $g(z)\ne g(w)$. If such a string is found, then the algorithm knows for sure that $(g,\mathcal {D})$ is from the $\mathcal {NO}$ distribution. However, finding such a $w$ is not easy, because one must flip more than $0.4n$ bits of $z$, and without knowing the variables in $J$ it is hard to keep $w$ in the same section as $z$ after flipping this many bits.
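Continuing the sketch used for ${\mathcal{YES}}$, a draw from $\mathcal {NO}^*$ can be generated by additionally tossing a coin $\mathbf {b}(z)$ for each $z\in \mathbf {S}$ and overriding the background junta on the part of $z$'s section within Hamming distance $0.4n$ of $z$. The Python sketch below is ours and deliberately informal; Section 5.1 gives the precise construction.

```python
import random

def sample_no_pair(n, k, m):
    """Sample (g, D, J) roughly as in the NO* distribution (an informal sketch; it ignores
    the corner cases, such as overlapping balls, that the formal construction treats precisely)."""
    J = sorted(random.sample(range(n), k))
    table = {}
    def h(x):                                         # the background junta h ~ JUNTA_J
        z = tuple(x[i] for i in J)
        if z not in table:
            table[z] = random.randint(0, 1)
        return table[z]
    S = [tuple(random.randint(0, 1) for _ in range(n)) for _ in range(m)]
    coin = {s: random.randint(0, 1) for s in S}       # the coins b(z) for z in S
    def g(x):
        for s in S:                                   # is x in some overridden region?
            same_section = all(x[i] == s[i] for i in J)
            close = sum(a != b for a, b in zip(x, s)) <= 0.4 * n
            if same_section and close:
                return coin[s]                        # value forced to b(s)
        return h(x)                                   # otherwise g agrees with the background junta
    return g, S, J
```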
Next, we prove that with high probability, $(\mathbf {g},\boldsymbol {\mathcal {D}})\leftarrow \mathcal {NO}$ satisfies that $\mathbf {g}$ is $1/3$-far from every $k$-junta with respect to $\boldsymbol {\mathcal {D}}$:
With probability at least $1-o_k(1)$, $(\mathbf {g},\boldsymbol {\mathcal {D}})\leftarrow \mathcal {NO}$ is such that $\mathbf {g}$ is $1/3$-far from every $k$-junta with respect to the distribution $\boldsymbol {\mathcal {D}}$.
Fix a $k$-junta $h$, i.e., any set $I \subset [n]$ with $|I|=k$ and any $2^k$-bit truth table over variables in $I$. We have that $\mathsf {dist}_{\boldsymbol {\mathcal {D}}}(\mathbf {g},h)$ is precisely the fraction of strings $z \in \mathbf {S}$ such that $\boldsymbol {\gamma }(z) \ne h(z).$ Since each bit $\boldsymbol {\gamma }(z)$ is drawn independently and uniformly at random, we have that
Given Lemmas 5.2 and 5.3, to prove Theorem 1.2 it remains only to prove Lemma 5.1.
The following definitions will be useful. Let $Y=(y^i:i\in [q])$ be a sequence of $q$ strings in $\lbrace 0,1\rbrace ^n$, $\alpha$ be a $q$-bit string, and $J\subset [n]$ be a set of variables of size $k$. We say that $(Y,\alpha ,J)$ is consistent if
To prove Lemma 5.1, we first derive from $A$ a new randomized algorithm $A^{\prime }$ that works on triples $(\phi ,\mathcal {D},J)$ from the support of either ${\mathcal{YES}}^*$ or $\mathcal {NO}^*$. Again for clarity we use $\phi$ to denote a function from the support of ${\mathcal{YES}}$/${\mathcal{YES}}^*$ or $\mathcal {NO}$/$\mathcal {NO}^*$, $f$ to denote a function from ${\mathcal{YES}}$/${\mathcal{YES}}^*$ and $g$ to denote a function from $\mathcal {NO}$/$\mathcal {NO}^*$.
In addition to being randomized, $A^{\prime }$ differs from $A$ in two important ways:
A detailed description of the randomized algorithm $A^{\prime }$ running on $(Y,\alpha ,J)$ is as follows:
From the description of $A^{\prime }$ above, whether it accepts a triple $(\phi ,\mathcal {D},J)$ or not depends on both the randomness of $\mathbf {Y}$ and $\mathbf {h}^{\prime }$. Formally, we have
Lemma 5.1 follows immediately from the following three lemmas (note that the marginal distribution of $(\mathbf {f},\boldsymbol {\mathcal {D}})$ in ${\mathcal{YES}}^*$ (or $(\mathbf {g},\boldsymbol {\mathcal {D}})$ in $\mathcal {NO}^*$) is the same as ${\mathcal{YES}}$ (or $\mathcal {NO}$)). In all three lemmas, we assume that $A$ is a $q$-query non-adaptive deterministic algorithm while $A^{\prime }$ is the randomized algorithm derived from $A$ as described above.
We have
We have
We have
We start with the proof of Lemma 5.4, which says that a limited algorithm such as $A^{\prime }$ cannot effectively distinguish between a draw from ${\mathcal{YES}}^*$ versus $\mathcal {NO}^*$:
Since $A^{\prime }$ runs on $(Y,\alpha ,J)$, it suffices to show that the distributions of $(\mathbf {Y},\boldsymbol {\alpha },\mathbf {J})$ induced from ${\mathcal{YES}}^*$ and $\mathcal {NO}^*$ have small total variation distance. For this purpose, we first note that the distributions of $(\mathbf {Y},\mathbf {J})$ induced from ${\mathcal{YES}}^*$ and $\mathcal {NO}^*$ are identical: In both cases, $\mathbf {Y}$ and $\mathbf {J}$ are independent; $\mathbf {J}$ is a random subset of $[n]$ of size $k$; $\mathbf {Y}$ is obtained by first sampling a subset $\mathbf {S}$ of $\lbrace 0,1\rbrace ^n$ of size $m$ and then drawing a sequence of $q$ strings from $\mathbf {S}$ with replacement.
Fix a pair $(Y,J)$ in the support of $(\mathbf {Y},\mathbf {J})$. We say $Y$ is scattered by $J$ if $y^{i}_J\ne y^{j}_J$ for all $i\ne j\in [q]$. In particular, this implies that no string appears more than once in $Y$. The following claim, whose proof we defer, shows that $\mathbf {Y}$ is scattered by $\mathbf {J}$ with high probability.
We have that $\mathbf {Y}$ is scattered by $\mathbf {J}$ with probability at least $1 - O(2^{- k/3}).$
Fix any $(Y,J)$ in the support of $(\mathbf {Y},\mathbf {J})$ such that $Y$ is scattered by $J$. We claim that the distributions of $\boldsymbol {\alpha }$ conditioning on $(\mathbf {Y},\mathbf {J})=(Y,J)$ in the ${\mathcal{YES}}^*$ case and the $\mathcal {NO}^*$ case are identical, from which it follows that the total variation distance between the distributions of $(\mathbf {Y},\boldsymbol {\alpha },\mathbf {J})$ in the two cases is at most $O(2^{- k/3})\le 1/8$ when $k$ is sufficiently large. Indeed, $\boldsymbol {\alpha }$ is uniform over strings of length $q$ in both cases. This is trivial for $\mathcal {NO}^*$. For ${\mathcal{YES}}^*$ note that $\boldsymbol {\alpha }$ is determined by the random $k$-junta $\mathbf {f}\leftarrow \mathcal {JUNTA}_J$; the claim follows from the assumption that $Y$ is scattered by $J$.
We fix $J$ and show that $\mathbf {Y}$ is scattered by $J$ with high probability. As strings of $\mathbf {Y}$ are drawn one by one, the probability of $\mathbf {y}^i$ colliding with one of the previous samples is at most $(i-1)/m\le q/m$. By a union bound, all strings in $\mathbf {Y}$ are distinct with probability at least
Next, we prove Lemma 5.5.
The first expectation in Equation (5) is equal to the probability that
To show that these two probabilities are equal, we first note that the distributions of $(\mathbf {Y},\boldsymbol {\alpha },\mathbf {J})$ are identical. Fixing any triple $(Y,\alpha ,J)$ in the support of $(\mathbf {Y},\boldsymbol {\alpha },\mathbf {J})$, which must be consistent, we claim that the distribution of $\mathbf {f}$ conditioning on $(\mathbf {Y},\boldsymbol {\alpha },\mathbf {J})=(Y,\alpha ,J)$ is exactly $\mathcal {JUNTA}_{Y,\alpha ,J}$. This is because, for each $z\in \lbrace 0,1\rbrace ^J$, if $y^i_J=z$ for some $y^i$ in $Y$, then we have $\mathbf {f}(x)=\alpha _i$ for all strings $x$ with $x_J=z$; otherwise, we have $\mathbf {f}(x)=\mathbf {b}(z)$ for all $x$ with $x_J=z$, where $\mathbf {b}(z)$ is an independent and uniform bit. This is the same as how $\mathbf {h}^{\prime }\leftarrow \mathcal {JUNTA}_{Y,\alpha ,J}$ is generated. It follows directly from this claim that the two probabilities are the same. This finishes the proof of the lemma.
Finally, we prove Lemma 5.6, the most difficult among the three lemmas:
Similar to the proof of Lemma 5.5, the first expectation is the probability of
The following definition is crucial. We say a tuple $(Y,\alpha ,J,\mathcal {D})$ in the support of $(\mathbf {Y},\boldsymbol {\alpha },\mathbf {J},\boldsymbol {\mathcal {D}})$ is good if it satisfies the following three conditions ($S$ below is the support of $\mathcal {D}$):
We delay the proof of the following claim to the end.
We have that $(\mathbf {Y},\boldsymbol {\alpha },\mathbf {J},\boldsymbol {\mathcal {D}})$ is good with probability at least $7/8$.
Fix any good $(Y,\alpha ,J,\mathcal {D})$ in the support and let $Z=A_1(Y,\alpha)$. We finish the proof by showing that the distribution of $\mathbf {g}(Z)$, a binary string of length $q$, conditioning on $(\mathbf {Y},\boldsymbol {\alpha },\mathbf {J},\boldsymbol {\mathcal {D}})=(Y,\alpha ,J,\mathcal {D})$ is the same as that of $\mathbf {h}^{\prime }(Z)$ with $\mathbf {h}^{\prime }\leftarrow \mathcal {JUNTA}_{Y,\alpha ,J}$. This combined with Claim 5.8 implies that the difference of the two probabilities has absolute value at most $1/8$.
To see this is the case, we partition strings of $Z$ into $Z_w$, where each $Z_w$ is a nonempty set that contains all $z$ in $Z$ with $z_J=w\in \lbrace 0,1\rbrace ^J$. For each $Z_w$, we consider the following two cases:
So the conditional distribution of $\mathbf {g}(Z)$ is identical to that of $\mathbf {h}^{\prime }(Z)$ with $\mathbf {h}^{\prime }\leftarrow \mathcal {JUNTA}_{Y,\alpha ,J}$. This finishes the proof of the lemma.
We bound the probabilities of $(\mathbf {Y},\boldsymbol {\alpha },\mathbf {J},\boldsymbol {\mathcal {D}})$ violating each of the three conditions $E_0$, $E_1$ and $E_2$ and apply a union bound. By Claim 5.7, $E_0$ is violated with probability $O(2^{-k/3})$.
For $E_1$, we fix a pair $(Y,\alpha)$ in the support and let $\ell \le q$ be the number of distinct strings in $Y$ and $Z=A_1(Y,\alpha)$. Conditioning on $\mathbf {Y}=Y$, $\mathbf {S}\setminus \mathbf {Y}$ is a uniformly random subset of $\lbrace 0,1\rbrace ^n\setminus Y$ of size $m-\ell$. Instead of working with $\mathbf {S}\setminus \mathbf {Y}$, we let $\mathbf {T}$ denote a set obtained by making $m-\ell$ draws from $\lbrace 0,1\rbrace ^n$ uniformly at random (with replacement).
On the one hand, the total variation distance between $\mathbf {S}\setminus \mathbf {Y}$ and $\mathbf {T}$ is exactly the probability that either (1) $\mathbf {T}\cap Y$ is nonempty or (2) $|\mathbf {T}|\lt m-\ell$. By two union bounds, (1) happens with probability at most $(m-\ell)\cdot (\ell /2^n)\le mq/2^n$ and (2) happens with probability at most $(m/2^n)\cdot m$. As a result, the total variation distance is at most $(mq+m^2)/2^n$. On the other hand, the probability that one of the strings of $\mathbf {T}$ has distance at most $0.4n$ with one of the strings of $Z$ is at most $ mq\cdot \exp (-n/100)$ by a Chernoff bound followed by a union bound. Thus, the probability of violating $E_1$ is at most (using the assumption that $k\le n/200$)
For $E_2$, we fix a pair $(Y,\alpha)$ in the support and let $Z=A_1(Y,\alpha)$. Because $\mathbf {J}$ is independent of $(\mathbf {Y},\boldsymbol {\alpha })$, it remains a subset of $[n]$ of size $k$ drawn uniformly at random. For each pair $(y,z)$ with $y$ from $Y$ and $z$ from $Z$ that satisfy $d(y,z)\gt 0.4n$, the probability of $y_\mathbf {J}=z_\mathbf {J}$ is at most
Finally, the lemma follows from a union bound when $k$ (and thus, $n$) is sufficiently large.
This article has been accepted by STOC 2018.
This work is supported by the NSF CCF under Grants No. 1703925, No. 1420349, and No. 1563155.
Authors’ addresses: Z. Liu, Shanghai Jiao Tong University, 800 Dongchuan Road, Minhang District, Shanghai, China; email: lzy5118@sjtu.edu.cn; X. Chen, R. A. Servedio, Y. Sheng, and J. Xie, Columbia University, Department of Computer Science, New York, NY, 10027, USA; emails: xichen@cs.columbia.edu, rocco@cs.columbia.edu, ys2982@cs.columbia.edu, jinyu@cs.columbia.edu.
Publication History: Received February 2018; accepted July 2018