ACM Trans. Algorithms, Vol. 15, No. 1, Article 3, Publication date: September 2018.
DOI: https://doi.org/10.1145/3275444
We continue developing the information theory of structured data. In this article, we study models generating $d$-ary trees ($d\ge 2$) and trees with unrestricted degree. We first compute the entropy, which gives us the fundamental lower bound on compression of such trees. Then we present efficient compression algorithms based on arithmetic encoding that achieve the entropy within a constant number of bits. A naïve implementation of these algorithms has a prohibitive time complexity of $O(n^d)$ elementary arithmetic operations (each corresponding to a number $f(n, d)$ of bit operations), but our efficient algorithms use only $O(n^2)$ such operations, where $n$ is the number of nodes. It turns out that extending source coding (i.e., compression) from sequences to advanced data structures such as degree-unconstrained trees is mathematically quite challenging and leads to recurrences that find ample applications in the information theory of general structures (e.g., to analyze the information content of degree-unconstrained non-plane trees).
ACM Reference format:
Zbigniew Gołębiewski, Abram Magner, and Wojciech Szpankowski. 2018. Entropy and Optimal Compression of Some General Plane Trees. ACM Trans. Algorithms 15, 1, Article 3 (September 2018), 23 pages. https://doi.org/10.1145/3275444
Advances in sensing, communication, and storage technologies have created a state of the art in which our ability to collect data from richly instrumented environments has far outpaced our ability to process, understand, and analyze this data in a (provably) rigorous manner. A significant component of this complexity arises from the highly structured nature of data. This poses significant challenges for theoretical characterization of limits of information storage, content, and transmission and methods that achieve these limits. While heuristic approaches are often currently used, critical issues regarding their performance remain. These challenges have motivated our recent research program [5, 6, 21] and others [1, 18, 26], which aims to characterize limits on information content of non-sequential data, such as trees and graphs. Quite generally, these problems consist of two complementary parts: fundamental achievable limits on data compression (where the Shannon entropy of the probability distribution from which the data arises gives the minimum possible expected code length), and, conversely, efficient encoding/decoding algorithms that achieve those limits.
Briefly, the entropy of a discrete random variable $X$ taking values in a domain $\mathcal {X}$ is, properly speaking, a functional of the distribution of $X$. It is denoted by $H(X)$ and is given by $H(X) = -\sum _{x \in \mathcal {X}} P(X = x) \log P(X = x)$, where $\log$ denotes the base-2 logarithm throughout.
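This definition can be evaluated directly; the following helper is our own illustration (not from the article) and uses base-2 logarithms, so entropy is measured in bits:

```python
import math

def entropy(probs):
    """Shannon entropy (in bits) of a probability distribution,
    given as a list of probabilities summing to 1."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# A uniform distribution on 4 outcomes carries exactly 2 bits.
print(entropy([0.25, 0.25, 0.25, 0.25]))  # -> 2.0
```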
As a start to understanding structured data in an information-theoretic setting, we focused on graphs [5] and trees with vertex (correlated) names [21]. In 1990, Naor proposed an efficiently computable representation for unlabeled graphs (solving Turán's open question) that is optimal up to the first two leading terms of the entropy when all unlabeled graphs are equally likely. Naor's result is asymptotically a special case of recent work of Choi and Szpankowski [5], who extended Naor's result to general Erdős-Rényi graphs. In particular, in [5] the entropy and an optimal compression algorithm (up to two leading terms of the entropy) for Erdős-Rényi graph structures were presented. Recently, these results were extended to preferential attachment graphs in [20] and to uniform random intersection graphs in [15]. Furthermore, in [22] an automata approach was used to design an optimal graph compression scheme. For unlabeled binary plane increasing trees rigorous information-theoretic results were obtained in [18], complemented by a universal grammar-based lossless coding scheme [26]. Finally, in recent work [21] (see also [6]) the authors study binary trees with structure-correlated vertex names, as well as non-plane binary increasing trees, and design optimal compression schemes based on arithmetic encoding.
In this article, we study three different models generalizing unlabeled binary plane increasing trees: $d$-ary plane increasing trees without correlated vertex names (i.e., trees with degree $d\ge 2$), $m$-ary search trees, and degree-unconstrained trees without any restriction on vertex degrees, giving both precise entropy calculations and information-theoretically optimal and efficient compression algorithms. In particular, we study their unlabeled versions. It is well known (see [2]) that for binary trees the binary search tree model is equivalent to a model in which a leaf is selected uniformly at random and expanded by attaching two additional nodes (new leaves). This equivalence allowed one to analyze the entropy of such trees by solving a relatively simple recurrence, and then to construct an optimal compression algorithm. Unfortunately, the equivalence fails for $d\ge 3$, and we develop here a different methodology to overcome this problem.
The tree models that we consider here have a long history, and various aspects of their asymptotic properties have been studied using a variety of techniques (for instance, generating functions and complex analytic techniques, Pólya urn models). See, for example, [3, 8, 9, 13, 17, 23] for some representative works. One of the most relevant ones to our work is [23], which studies the asymptotic distribution and moments of additive functionals on $d$-ary plane increasing trees. The results in that work apply to our case, since the entropy of a random tree may be written as the expected value of an additive functional—namely, $\log p(T)$, where $p(T)$ denotes the probability assigned to the tree in question under our model. Thus, our exact and asymptotic entropy results (but not our optimal compression results, which are novel) for $d$-ary plane increasing trees are direct consequences of their framework (similarly, the asymptotics in [25] apply for plane increasing trees, and those in [4, 12] apply for $m$-ary search trees). We include alternative, elementary derivations for the sake of completeness.
Another relevant work is [13], which studies the asymptotic number of occurrences of given subtree patterns in binary plane increasing trees. They include results on the size of a directed, acyclic graph representation of such trees (which captures the repeated occurrence of subtree patterns in a given tree). This is intimately related to more recent works on universal compression of binary trees [26].
In the present work, we shall show that for the models under consideration that produce $d$-ary trees $T_n$ on $n$ internal nodes (such as the random unlabeled $d$-ary plane increasing tree and $m$-ary search tree models described in the present article), the entropy $H(T_n)$ satisfies
In Section 3, we show that for unlabeled $d$-ary plane increasing trees we need to analyze a recurrence of the following form (see Lemma 3.4):
After describing in Section 2 the three generalizations of the binary tree model from [21], we present our main results in Section 3. We first provide in Corollary 3.2 the entropy rate for $m$-ary search trees. Then we consider $d$-ary plane increasing trees and in Theorem 3.5 give our expression for the entropy of such trees. We extend it to degree-unconstrained plane increasing trees in Theorem 3.9. After establishing these fundamental lower bounds on tree compression, we provide efficient compression algorithms, which are optimal in expected code length to within a constant number of bits. The algorithms are a variation of the arithmetic coding technique [7], wherein a tree $T$ to be compressed is mapped to a subinterval of $[0, 1]$ (in such a way as to avoid any overlap with the subintervals corresponding to other trees of the same size), whose length is equal to the probability of $T$. The binary codeword for $T$ is then a truncation of the binary expansion of the midpoint of the interval of $T$. A naive implementation of this idea yields an algorithm whose time complexity scales like $\Theta (n^d f(d, n))$ in the worst case for $d$-ary trees, for some positive function $f$. However, by careful rearrangement of expressions involved in interval calculations, we are able to reduce this complexity to $\Theta (n^2 f(d, n))$, a substantial improvement. We furthermore give experimental results which show that the typical compressed code lengths for our codes are very close to their expected values—within four bits for trees with 500 internal nodes.
We note that the present article is an expanded journal version of [16].
In this section, we describe the concepts of unlabeled plane trees with and without restrictions on the nodes’ out-degrees. This will allow us to introduce three models of tree generation.
We call a rooted tree a plane tree when we distinguish the left-to-right order of the successors of each node; i.e., we distinguish all different embeddings of the tree into the plane (see [10]). Degree-unconstrained unlabeled plane trees are rooted plane trees with no restriction on the number of children of the nodes. On the other hand, unlabeled $d$-ary plane trees are rooted plane trees in which each internal (i.e., non-leaf) node has exactly $d$ children (either internal or external). Since they are unlabeled, they can be seen as objects that encode the structure of a possible labeled (in the graph-theoretic sense) plane tree.
Search trees are plane trees built from a set of $n$ distinct keys taken from some totally ordered set, for instance, a random permutation of the numbers $\lbrace 1,2,\ldots ,n\rbrace$. An $m$-ary search tree is a tree in which each node has at most $m$ children; moreover, each node stores up to $m-1$ keys. We define the size of a search tree as the number of keys $n$. The construction of $m$-ary search trees can be described as follows [10].
If $n = 0$, the tree is empty. If $1 \le n \le m-1$, the tree consists of a root only, with all keys stored in the root. If $n \ge m$, we select $m-1$ keys that are called pivots. The pivots are stored in the root. The $m-1$ pivots split the set of remaining $n-m+1$ keys into $m$ sublists $I_1, \ldots , I_m$: if the pivots are $p_1 \lt p_2 \lt \cdots \lt p_{m-1}$, then $I_1 := (x: x \lt p_1), I_2 := (x: p_1 \lt x \lt p_2), \ldots , I_m := (x: p_{m-1} \lt x)$, where $x$ ranges over the non-pivot keys. We then recursively construct a search tree for each of the lists $I_i$ of keys. In order to obtain an unlabeled search tree of size $n$, we remove the keys from a search tree of size $n$ (see Figure 1).
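The construction can be sketched in a few lines. The nested-tuple representation below is our own convention, and we fix the deterministic pivot choice (the first $m-1$ inserted keys) described in the standard probability model:

```python
def msearch_tree(keys, m):
    """Build an m-ary search tree from keys in insertion order.
    Pivots are the first m-1 keys inserted. Returns None for an empty
    tree, else a pair (sorted pivot/key list, list of m subtrees)."""
    keys = list(keys)
    if not keys:
        return None
    if len(keys) <= m - 1:
        return (sorted(keys), [None] * m)
    pivots = sorted(keys[: m - 1])
    rest = keys[m - 1:]
    # Split the remaining keys into the m intervals the pivots define,
    # preserving arrival order (it determines the subtree shapes).
    buckets = [[] for _ in range(m)]
    for x in rest:
        j = sum(x > p for p in pivots)  # index of the interval containing x
        buckets[j].append(x)
    return (pivots, [msearch_tree(b, m) for b in buckets])
```

For instance, `msearch_tree([4, 1, 6, 2, 3, 5, 7], 3)` stores pivots 1 and 4 in the root, sends keys 2, 3 to the middle subtree and 6, 5, 7 to the rightmost one.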
The standard probability model assumes that every permutation of the keys $\lbrace 1,\ldots , n\rbrace$ is equally likely. The choice of pivots can then be deterministic. For example, one always chooses the first $m-1$ keys. After removing node labels from an $m$-ary search tree generated by a uniformly random permutation, we obtain a random unlabeled search tree, where the probability distribution of the resulting tree on the set of unlabeled search trees is non-uniform.
We consider the following generation model of an unlabeled random $d$-ary plane increasing tree [2, 10]. The process starts with an empty tree, that is, with just an external node (leaf). The first step in the growth process is to replace this external node by an internal one with $d$ successors that are external nodes (see Figure 2). Then with probability $\frac{1}{d}$ each, one of these $d$ external nodes is selected and again replaced by an internal node with $d$ successors. In each subsequent step one of the external nodes is replaced (with equal probability) by an internal node with $d$ successors.
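The evolution process just described can be simulated directly. The nested-list representation (with `None` for an external node) is our own convention for illustration:

```python
import random

def grow_dary_tree(n, d, rng=random):
    """Simulate n steps of the d-ary plane increasing tree evolution.
    A tree is None (an external node) or a list of d children."""
    tree = None     # start with a single external node
    leaves = [()]   # paths (child-index tuples) of current external nodes
    for _ in range(n):
        # Choose an external node uniformly at random...
        path = leaves.pop(rng.randrange(len(leaves)))
        # ...and replace it by an internal node with d external successors.
        new_node = [None] * d
        if path == ():
            tree = new_node
        else:
            node = tree
            for i in path[:-1]:
                node = node[i]
            node[path[-1]] = new_node
        leaves.extend(path + (i,) for i in range(d))
    return tree
```

Each step removes one external node and adds $d$ of them, so after $n$ steps the tree has $n$ internal and $n(d-1)+1$ external nodes.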
This evolution process (if we had labeled each node with the time index at which it arrived) produces a random $d$-ary plane increasing tree (see [2, 10]). By definition, $d$-ary plane increasing trees are rooted plane trees with labels on internal nodes and where each node has exactly $d$ successors (either internal or external). The root is labeled by 1 and the labels of all internal successors of any node $v$ are larger than the label of $v$. We define the size of the $d$-ary plane increasing tree by the number of internal nodes.
Let $\mathcal {F}$ denote the set of all unlabeled $d$-ary plane trees and, for each integer $n \ge 0$, let $\mathcal {F}_n$ be the subset of $\mathcal {F}$ consisting of all trees that contain exactly $n$ internal nodes. Let $F_n$ be a random variable supported on $\mathcal {F}_n$, constructed by the generation process just described.
Similarly, let $\mathcal {G}$ and $\mathcal {G}_n$ be the analogous sets of labeled $d$-ary plane increasing trees. Throughout the article, we will denote a given unlabeled $d$-ary plane tree of size $n$ as $\mathtt {f}_n$ and a given labeled $d$-ary plane increasing tree of size $n$ as $\mathtt {g}_n$.
Observe that the above described evolution process (choosing an external node to replace with uniform distribution among all external nodes) generates every $d$-ary plane increasing tree of size $n$ in a unique way and with uniform distribution. However, after removing labels from a $d$-ary plane increasing tree $\mathtt {g}_n$ we obtain an unlabeled $d$-ary plane tree $\mathtt {f}_n$ with a non-uniform distribution. For example, there are $ {\binom{3}{2,0,1}} {\binom{1}{0,1,0}} = 3$ ways to obtain the resulting tree from Figure 2, but there is only one way (exactly $ {\binom{3}{3,0,0}} {\binom{2}{2,0,0}} {\binom{1}{1,0,0}}=1$) to obtain a tree where every internal node is a leftmost child of a parent node.
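These counts are easy to verify numerically (the multinomial helper below is our own):

```python
from math import factorial

def multinomial(n, parts):
    """Multinomial coefficient n! / (parts[0]! * parts[1]! * ...)."""
    assert sum(parts) == n
    out = factorial(n)
    for p in parts:
        out //= factorial(p)
    return out

# Number of labelings of the tree from Figure 2:
print(multinomial(3, (2, 0, 1)) * multinomial(1, (0, 1, 0)))  # -> 3
# A tree in which every internal node is a leftmost child has one labeling:
print(multinomial(3, (3, 0, 0)) * multinomial(2, (2, 0, 0))
      * multinomial(1, (1, 0, 0)))  # -> 1
```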
We consider the following generation model of unlabeled plane trees. Suppose that the process starts with the root node carrying a label 1. Then we add a child node with label 2 to the root. The next step is to attach a node with label 3; there are three possibilities: we may attach it to the root, as a left or right sibling of the node labeled 2, or we may attach it to the node labeled 2 itself. One proceeds further in the same way, as shown in Figure 3. At the end, we remove the labels from the nodes of the tree. Observe that if a node already has out-degree $k$ (where the descendants are ordered), then there are $k+1$ possible ways to add a new node (this time we do not distinguish between external and internal nodes). Hence, if a plane tree already has $j-1$ nodes, then there are precisely $2j-3$ possibilities to attach the $j$th node (see Figure 3). More precisely, the probability of choosing a node of out-degree $k$ equals $(k+1)/(2j-3)$. We define the size of the tree as the number of nodes.
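A direct, if inefficient, simulation of this process rebuilds the list of insertion gaps at every step; the dictionary representation of nodes is our own convention:

```python
import random

def gaps_of(node, out):
    """A node with k ordered children exposes k+1 insertion gaps."""
    for i in range(len(node['children']) + 1):
        out.append((node, i))
    for c in node['children']:
        gaps_of(c, out)

def grow_plane_tree(n, rng=random):
    """Plane increasing tree process: node j is attached into a gap chosen
    uniformly among the 2j-3 gaps of the current (j-1)-node tree."""
    root = {'label': 1, 'children': []}
    for label in range(2, n + 1):
        gaps = []
        gaps_of(root, gaps)
        assert len(gaps) == 2 * label - 3  # j-1 nodes expose 2j-3 gaps
        parent, pos = gaps[rng.randrange(len(gaps))]
        parent['children'].insert(pos, {'label': label, 'children': []})
    return root
```

Choosing a gap uniformly gives a node of out-degree $k$ (with its $k+1$ gaps) probability $(k+1)/(2j-3)$, matching the model above.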
In this section, we present our main results. In particular, we briefly address the entropy of $m$-ary search trees. Then we present our derivation of the recurrence for the entropy of unlabeled $d$-ary plane increasing trees that leads us to a formula for the entropy rate. Finally, we derive the entropy rate for degree-unconstrained trees. In the next section, we give efficient and nearly optimal compression algorithms, which achieve an expected code length within a small constant number of bits of the entropies derived here.
In all our models, the probability distribution on the relevant set of trees is non-uniform, and subtrees of the root are conditionally independent given their respective sizes. Indeed, let $T_n$ be a random variable representing a tree $\mathtt {t}_n$ on $n$ internal nodes. Assume now that at the root we split $\mathtt {t}_n$ into $d$ subtrees of size $ {k_1}, \ldots , {k_d}$, respectively, where $k_1 + \cdots + k_d=n-1$. Then the probability of generating tree $\mathtt {t}_n$ in all our models satisfies
We shall use the following notation. Let $\mathbf {k}^{(d)} = (k_1,\ldots ,k_d)$ denote a $d$-dimensional vector of non-negative integers and $\Vert \mathbf {k}^{(d)}\Vert = |k_1|+\cdots +|k_d|$ be its $L^1$ norm. Let $(k, \mathbf {k}^{(d-1)}) = (k, k_2\ldots ,k_d)$ denote a $d$-dimensional vector with the first coordinate equal to $k$. We often write $\mathbf {k}$ for $\mathbf {k}^{(d)}$ when the vector dimension is obvious.
Here we summarize the main properties of entropy that we use in this article. For more information and definitions, see [7].
The main concepts that we will need are joint and conditional entropy of pairs of random variables defined on a common probability space. The following properties will be used throughout the article:
Let $U_n$ denote a random unlabeled $m$-ary search tree with $n$ keys, generated according to the process described earlier. We write $\mathtt {u}_n$ for an arbitrary fixed $m$-ary (unlabeled) search tree with $n$ keys.
We describe the splitting of keys at the root of the search tree by the random vector $\mathbf {Y}_n^{(m)} = (Y_{n,1}, \ldots , Y_{n,m})$, where $Y_{n,j} = |I_j|$ is the number of keys that go into the $j$th subtree of the root. If $n \ge m-1$, we have $Y_{n,1} + \cdots + Y_{n,m} = n-m+1$ and
Suppose that the tree $\mathtt {u}_n$ has subtrees $\mathtt {u}_{k_1}, \ldots , \mathtt {u}_{k_m}$ of sizes $k_1, \ldots , k_m$. Then by (2),
Let us establish the initial conditions of the entropy of $m$-ary search trees. If $n = 0$, we have an empty tree, and $H(U_0) = 0$. Moreover, if $1 \le n \le m-1$, all keys are stored in one node, and $H(U_{n}) = 0$. For $n \gt m-1$, there is a bijection between a tree $U_n$ and a tuple $(\mathbf {Y}_n^{(m)}, U_{Y_{n,1}}, \ldots , U_{Y_{n,m}})$, which is an immediate consequence of the inductive definition of $m$-ary search trees. Therefore, for $n \gt m-1$, we have
For $n \ge m-1$ and $1 \le j \le m$, the random variables $Y_{n,j}$ are identically distributed, and for $0 \le k \le n-1$,
To finish the calculation of $H(U_n)$, we need to calculate $H(\mathbf {Y}_n^{(m)})$. From (6) we have
Finally, we arrive at the following recurrence for the entropy of unlabeled $m$-ary search trees:
Let
The asymptotics of our recurrence can also be derived using the continuous master theorem in [24], which is a rather more general tool.
Using Theorem 3.1 and the fact that $H(\mathbf {Y}_n^{(m)}) = \log {\binom{n}{m-1}} = o(n)$, we conclude this section with the following result.
The entropy rate $h^{u}_{m} = \lim _{n \rightarrow \infty } H(U_n)/n$ of the unlabeled $m$-ary trees, generated according to the $m$-ary search tree model, is given by
It can be checked numerically that the entropy rate for unlabeled $m$-ary search trees for $m=2,3,4,5$ is $h^u_{2} \approx 1.73638$, $h^u_{3} \approx 1.50084$, $h^u_{4} \approx 1.3569$, $h^u_{5} \approx 1.25723$, respectively. Notice that the entropy rate decreases as we increase $m$. Intuitively, this is to be expected: consider the extreme case, with $m=n$ (though, in the rest of the article, we always consider constant $m$). In this case, every pattern of insertions into an $m$-ary search tree results in the same unlabeled tree (since the $m$ slots of the root are never filled), so the entropy is 0.
Observe that if $m = 2$ the number of nodes of an $m$-ary search tree equals $n$, the number of inserted keys; but for $m \gt 2$ the number of nodes is a random variable $S_{n,m}$. Knuth [19] was the first to show that $\mathbb {E}[S_{n,m}] \sim \phi _m n$ with $\phi _m = \frac{1}{2(H_m - 1)}$, where $H_m = \sum _{i=1}^{m} 1/i$ is the $m$th harmonic number.
In order to compare the entropy rate $h^u_{m}$ of $m$-ary search trees with that for $d$-ary plane increasing trees (discussed next), it makes sense to normalize the constant $h^u_{m}$ as $\hat{h}^u_{m} = h^u_{m}/\phi _m$. Then numerically $\hat{h}^u_{2} \approx 1.73638, \hat{h}^u_{3} \approx 2.5014, \hat{h}^u_{4} \approx 2.93994, \hat{h}^u_{5} \approx 3.22688$.
Let $g_n = |\mathcal {G}_n|$ be the number of $d$-ary plane increasing trees with $n$ internal nodes. Trivially, for $d = 2$, we have $g_n = n!$. Moreover, for $d\gt 2$ we have
Let $\mathcal {G}_{\mathtt {f}_n}$ denote the subset of $\mathcal {G}_{n}$ of trees that have the same structure as the unlabeled tree $\mathtt {f}_n \in \mathcal {F}_n$; that is, $\mathcal {G}_{\mathtt {f}_n}$ is the set of labeled representatives of $\mathtt {f}_n$. Moreover, let $g_{\mathtt {f}_n} = |\mathcal {G}_{\mathtt {f}_n}|$ be the number of $d$-ary plane increasing trees that have the same structure as a tree $\mathtt {f}_n$. Observe that the probability that the $d$-ary plane increasing tree source generates a given unlabeled tree $\mathtt {f}_n \in \mathcal {F}_n$ is
Let us establish the initial conditions of the entropy of our source. If $n = 0$, we have an empty tree, and $H(F_0) = 0$. If $n=1$, we have one fixed tree and $H(F_1) = 0$. By the definition of the trees, for $n \gt 1$ there is a bijection between a tree $F_n$ and a tuple $(\mathbf {V}_n^{(d)}, F_{V_1}, \ldots , F_{V_d})$. Therefore, for $n \gt 1$, we have
For $k = 0, \ldots , n-1$, let $p_{n,k}$ be the probability that one specified subtree in a $d$-ary plane increasing tree is of size $k$, that is,
The next lemma, proved in Section 4.1, gives an explicit formula for $p_{n,k}$.
For $k = 0, \ldots , n-1$ and $d\gt 1$, let $\alpha = \frac{d}{d-1}$, then
Observe that for $d=2$, we have $\alpha = 2$ and $p_{n,k} = \frac{1}{n}$. It does not depend on $k$, which greatly simplifies computations as shown in [21]. Moreover, it equals the corresponding split probability $\frac{1}{n}$ of binary search trees. Therefore, the two models, one for binary plane increasing trees and the other for binary search trees, are equivalent. For $d \gt 2$, this is not the case. For instance, for $d=3$, we have $\alpha =\frac{3}{2}$ and
The recurrence presented in (15) is the one that we need to solve. In the lemma below, we propose a general solution for recurrences of this form. It is proved in Section 4.2.
For constant $\alpha , x_0$ and $x_1$, the recurrence
Observe again that for $d=2$ and $x_0=x_1=0$ and $a_n = o(n)$, we have $\alpha = 2$ and
This leads us to our first main result.
The entropy of an unlabeled $d$-ary plane tree, generated according to the $d$-ary plane increasing tree model, is given by (see Figure 4)
We conclude this section with the following result.
The entropy rate $h^f_{d} = \lim _{n \rightarrow \infty } H(F_n)/n$ of the unlabeled $d$-ary plane trees, generated according to the model of $d$-ary plane increasing trees, is given by
With Theorem 3.5 in mind, we just need to prove that $H(\mathbf {V}_n^{(d)}) = o(n)$. Let us recall that $\mathbf {V}_n^{(d)}: \mathcal {G}_n \rightarrow \lbrace 0,\ldots ,n-1\rbrace ^d$. Since the entropy of a random variable is upper bounded by the logarithm of the cardinality of its support, we have
Taking a closer look at $H(\mathbf {V}_n^{(d)})$, we find
Let $r_n = |\mathcal {R}_n|$, the number of labeled plane increasing trees with $n$ nodes. We know that there are
As in the case of the $d$-ary plane increasing trees, let $\mathcal {R}_{\mathtt {t}_n}$ denote the subset of trees in $\mathcal {R}_{n}$ that have the same structure as a given unlabeled tree $\mathtt {t}_n \in \mathcal {T}_n$ (i.e., $\mathcal {R}_{\mathtt {t}_n}$ is the set of labeled representatives of $\mathtt {t}_n$); moreover, let $r_{\mathtt {t}_n} = |\mathcal {R}_{\mathtt {t}_n}|$ be the number of such trees. Observe that
Let $D_n$ denote the random variable representing the number of subtrees of the root. Observe that $\Pr [D_n = d] = r_n^{(d)}/r_n$, where $r_n^{(d)} = |\mathcal {R}_n^{(d)}|$ is the number of plane increasing trees with root degree equal to $d$. Suppose that the tree $\mathtt {t}_n$ has $d$ subtrees $\mathtt {t}_{k_1}, \ldots , \mathtt {t}_{k_d}$ of sizes $k_1, \ldots , k_d$. Then by (2) and the fact that $k_1 + \cdots + k_d = n-1$ by assumption,
The initial conditions for the entropy are as follows. If $n = 1$, we have just a root node, so $H(T_1) = 0$. Similarly, if $n=2$, we have one fixed tree, so $H(T_2) = 0$. We now give the recurrence for $n \gt 2$. For $k = 1, \ldots , n-1$, let $q^{(d)}_{n,k}$ be defined as the probability that the root of a plane increasing tree has degree $d$ and that one specified root subtree is of size $k$. Then
We need an expression for the probability $q_{n,k}^{(d)}$ which we present in the next lemma proved in Section 4.3.
For $k = 1, \ldots , n-1$, we have
The recurrence found in (24) is another one that we need to analyze. Its general solution is presented next and proved in Section 4.4.
For constant $y_1$ and $y_2$, the recurrence
This leads us to our second main result.
The entropy of an unlabeled degree-unconstrained plane tree, generated according to the model of plane increasing trees, is given by
We conclude this section with the following result.
The entropy rate $h_t = \lim _{n \rightarrow \infty } H(T_n)/n$ of the unlabeled degree-unconstrained plane trees, generated according to the model of plane increasing trees, is given by
From Theorem 3.9, we just need to prove that $H(D_n) + H(\mathbf {W}_n^{(D_n)} \mid D_n) = o(n)$. Let us recall that the random vector $\mathbf {W}_n^{(d)}: \mathcal {R}_n^{(d)} \rightarrow \lbrace 1,\ldots ,n-d\rbrace ^d$ describes the split at the root of a tree: given that the root degree equals $d$, its subtrees are of sizes $W_{n,1}^{(d)}, \ldots , W_{n,d}^{(d)}$. Since the entropy of a random variable is upper bounded by the logarithm of the cardinality of its support, we have
Taking a closer look at $H(\mathbf {W}_n^{(D_n)} | D_n)$, we find that
Here we address optimal compression of the trees generated by the sources under consideration. We will use a variant of the arithmetic coding method. We explain in detail the generalization for compression of unlabeled $d$-ary plane increasing trees (the case of degree-unconstrained plane increasing trees can be handled along analogous lines).
In a nutshell, we first define a total order $\lt$ on the set of such trees with a given size. Having fixed this, we must show how to efficiently compute two quantities for a given tree $t$: the probability of all trees $t^{\prime } \lt t$, as well as the probability of $t$ itself. Granted efficient procedures for computing these, we produce a subinterval $I(t)$ of $[0,1]$, unique to $t$, whose length $|I(t)|$ equals the probability of $t$ and whose left endpoint is the probability of all trees $t^{\prime } \lt t$. Arithmetic coding then prescribes that the code word corresponding to $t$ be given by the binary expansion of the midpoint of the interval assigned to $t$, truncated to a length of $\lceil \log (1/|I(t)|)\rceil + 1$ bits. The expected length of this code is then at most the entropy of the source plus 2 bits.
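The interval-to-codeword step can be sketched as follows. This is a simplified float-based illustration of our own; a real implementation would use exact arithmetic on the interval endpoints:

```python
import math

def codeword(left, prob):
    """Binary codeword for a tree whose interval is [left, left + prob):
    the binary expansion of the midpoint, truncated to
    ceil(log2(1/prob)) + 1 bits, which guarantees the truncated value
    still lies inside the interval and so identifies it uniquely."""
    length = math.ceil(-math.log2(prob)) + 1
    mid = left + prob / 2
    bits = []
    for _ in range(length):  # peel off binary digits of the midpoint
        mid *= 2
        bit = int(mid)
        bits.append(str(bit))
        mid -= bit
    return ''.join(bits)

print(codeword(0.5, 0.25))  # interval [0.5, 0.75), midpoint 0.625 -> '101'
```

A tree of probability $p$ thus gets a codeword of about $\log (1/p)$ bits, which is why the expected code length tracks the entropy.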
We now define the total order on the set $\mathcal {F}_n$ of unlabeled $d$-ary trees: we denote by $\prec$ the lexicographic order on tuples of non-negative integers, and then we have the following definition.
The relation $\lt$ on $\mathcal {F}$ is defined as follows: let $\mathtt {f}_1, \mathtt {f}_2 \in \mathcal {F}$ with subtree sizes $(s_1, \ldots , s_d)$ and $(k_1, \ldots , k_d)$, respectively; then $\mathtt {f}_1 \lt \mathtt {f}_2$ if and only if one of the following holds:
It is simple to check that this is a total order. Next, we present an algorithm which computes the subinterval corresponding to an input tree $\mathtt {f} \in \mathcal {F}$ (see Algorithm 1).
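A comparator consistent with a size-lexicographic order of this kind might look like the following sketch; the recursive tie-breaking rule (recurse on the first differing subtree) is our assumption:

```python
def size(f):
    """Number of internal nodes; None encodes an external node,
    an internal node is a list of its d children."""
    return 0 if f is None else 1 + sum(size(c) for c in f)

def less(f1, f2):
    """f1 < f2 for trees of equal size: compare root subtree-size tuples
    lexicographically; on a tie, recurse on the first differing subtree
    (assumed tie-break)."""
    if f1 is None or f2 is None:
        return False  # two external nodes are equal
    s = tuple(size(c) for c in f1)
    k = tuple(size(c) for c in f2)
    if s != k:
        return s < k  # Python tuple comparison is lexicographic
    for a, b in zip(f1, f2):
        if less(a, b):
            return True
        if less(b, a):
            return False
    return False
```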
This does exactly what was intuitively described above: it implements a depth-first search of the input tree $\mathtt {f}$, refining the current interval at each step based on the split of vertices among the root subtrees of the current node.
Now, we explain more precisely the procedures CalculateSplitProbability and CalculateIntervalBegin. The former simply calculates the probability that a $d$-ary tree of size $n$ has root subtrees of sizes $k_1, \ldots, k_d$ (giving the length of the next subinterval). This is illustrated in Figure 5.
The latter gives the probability that such a $d$-ary tree has a root subtree size tuple lexicographically less than $(s_1, \ldots, s_d)$. That is, it computes the expression
Observe that a naive implementation of this calculation generates all $\Theta (n^d)$ compositions of the number $n-1$ into $d$ non-negative parts and calculates the split probability for each of them. To reduce the time complexity, we rewrite the sum as follows:
This leads us to the time complexity of the algorithm: for each vertex $v$ of an input tree $\mathtt {f}$ of size $n$, the procedure Explore calls each of the procedures CalculateSplitProbability and CalculateIntervalBegin once, so both procedures are called $O(n)$ times in total. Since CalculateIntervalBegin performs $O(n)$ calculations, each of the same time complexity as one call to CalculateSplitProbability, the compression algorithm runs in time $O(n^2 \cdot f(d, n))$, where $f(d, n)$ denotes the number of bit operations in the CalculateSplitProbability procedure (it is not too difficult to see that $f(d, n)$ can be bounded by a small polynomial function of $n$ alone, by taking into account cancellations of factors in the expression for the split probabilities).
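The effect of the rearrangement can be illustrated on any split distribution whose prefix marginals have a closed form. Below we use an equal-weight multinomial as a stand-in (the article's actual $d$-ary split probabilities differ) and check the coordinate-wise prefix-marginal sum against brute-force enumeration:

```python
from math import comb
from itertools import product

D, N = 3, 6  # arity d and tree size n; root splits sum to N - 1

def split_prob(ks):
    """Stand-in split distribution: multinomial(N-1; 1/D, ..., 1/D)."""
    if sum(ks) != N - 1:
        return 0.0
    p, rem = (1 / D) ** (N - 1), N - 1
    for k in ks:
        p *= comb(rem, k)
        rem -= k
    return p

def prefix_marginal(ks):
    """P(K_1 = ks[0], ..., K_j = ks[j-1]), remaining coordinates summed out."""
    j, s = len(ks), sum(ks)
    if s > N - 1:
        return 0.0
    p, rem = 1.0, N - 1
    for k in ks:
        p *= comb(rem, k)
        rem -= k
    return p * (1 / D) ** s * ((D - j) / D) ** (N - 1 - s)

def begin_naive(s):
    """Theta(n^d): enumerate every d-tuple lexicographically below s."""
    return sum(split_prob(k) for k in product(range(N), repeat=D) if k < s)

def begin_fast(s):
    """O(n d): P(K < s) = sum_j P(K_1..K_{j-1} = s_1..s_{j-1}, K_j < s_j)."""
    return sum(prefix_marginal(s[:j] + (k,))
               for j in range(D) for k in range(s[j]))
```

The fast version replaces the $d$-fold sum by $d$ one-dimensional sums of closed-form marginals, which is exactly the saving from $\Theta (n^d)$ to $O(n)$ arithmetic operations per call.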
Finally, standard arithmetic coding arguments show that the algorithm is optimal up to a small constant number of bits, in expectation. Experimental results confirming this statement can be seen in Table 1.
$n$         | 50     | 100    | 150    | 200    | 250      | 300      | 350      | 400      | 450      | 500
$d=3$
$H(F_n)$    | 115.63 | 238.17 | 361.12 | 484.24 | 607.46   | 730.73   | 854.04   | 977.39   | 1,100.75 | 1,224.14
code length | 117.04 | 239.46 | 362.07 | 484.94 | 608.70   | 731.56   | 853.72   | 978.67   | 1,103.61 | 1,223.66
$d=8$
$H(F_n)$    | 192.63 | 394.08 | 595.95 | 797.99 | 1,000.12 | 1,202.31 | 1,404.54 | 1,606.80 | 1,809.09 | 2,011.39
code length | 193.85 | 396.05 | 599.79 | 799.56 | 1,001.24 | 1,205.07 | 1,404.22 | 1,609.57 | 1,810.42 | 2,012.90
It is instructive to compare the performance of our algorithm with a natural compact representation (which we think of as an uncompressed representation) of the trees in question. In particular, in the parenthesis representation, a plane tree is encoded as a bit string via a depth-first search. When a node is first encountered (and pushed), a 0 is appended. When the same node is popped (i.e., the search leaves that node's subtree), a 1 is appended.
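The parenthesis encoding just described can be sketched as follows (with `None` for an external node and a list of children for an internal one, a convention of our own):

```python
def paren_bits(tree):
    """DFS parenthesis encoding: append '0' when a node is pushed
    and '1' when it is popped; external nodes are encoded too."""
    if tree is None:
        return '01'
    return '0' + ''.join(paren_bits(c) for c in tree) + '1'

print(paren_bits([None, None]))  # one internal binary node -> '001011'
```

Every node, internal or external, contributes exactly two bits.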
In such a representation, the number of bits needed per vertex of a $d$-ary tree is approximately $2d$. In Figure 7 we highlight the gain obtained by using our optimal compression algorithm. Although we observe that $h^f_{d} = \Theta (d)$, the difference in constants is significant.

Using (13), we can rewrite (14) as
Let us multiply both sides of the recurrence by the normalizing factor $\frac{\Gamma (n+\alpha -1)}{n!}$. Define also
If $d=1$, then the root has a single subtree containing all other nodes, so its size must be $n-1$ and
Using Lemma 3.7, for $n\gt 2$, we find
In this article, we focused on finding entropies of various trees: namely, $m$-ary search trees, $d$-ary plane increasing trees, and degree-unconstrained plane increasing trees. In the course of deriving these entropies, we encountered recurrences that we showed how to solve in a general setting. These recurrences find ample applications in analyzing variations on such general trees. For example, as in [21], the next natural question is to find the entropy of non-plane $d$-ary increasing trees and degree-unconstrained trees. For arbitrary $d$, such a problem is quite challenging, as one can see from the approach of [6] for the binary case; generalization to the $d$-ary case is highly non-trivial.
Finally, we presented optimal, polynomial-time (with small exponents) compression algorithms that achieve these entropies via an arithmetic coding approach.
This work was supported in part by NSF Center for Science of Information (CSoI) Grant CCF-0939370, by NSF Grant CCF-1524312, NIH Grant 1U01CA198941-01, and by Polish NCN Grant 2013/09/B/ST6/02258.
Authors’ addresses: Z. Gołębiewski, Wrocław University of Science and Technology, Faculty of Fundamental Problems of Technology, Wrocław, Poland; email: zbigniew.golebiewski@pwr.edu.pl; A. Magner, Center for the Science of Information, Purdue University, West Lafayette, IN; email: abram10@gmail.com; W. Szpankowski, Purdue University, Computer Science, West Lafayette, IN; email: spa@cs.purdue.edu.
©2018 Association for Computing Machinery. 1549-6325/2018/09-ART3 $15.00
Publication History: Received September 2017; revised July 2018; accepted August 2018