ACM Comput. Surv., Vol. 52, No. 1, Article 13, Publication date: January 2019.

DOI: https://doi.org/10.1145/3291044

Bayesian (machine) learning has long played a significant role in machine learning due to its particular ability to embrace uncertainty, encode prior knowledge, and endow interpretability. On the back of Bayesian learning's great success, Bayesian nonparametric learning (BNL) has emerged as a force for further advances in this field due to its greater modelling flexibility and representation power. Instead of playing with the fixed-dimensional probabilistic distributions of Bayesian learning, BNL creates a new “game” with infinite-dimensional stochastic processes. BNL has long been recognised as a research subject in statistics, and, to date, several state-of-the-art pilot studies have demonstrated that BNL has a great deal of potential to solve real-world machine-learning tasks. However, despite these promising results, BNL has not created a huge wave in the machine-learning community. Esotericism may account for this. The books and surveys on BNL written by statisticians are overcomplicated and filled with tedious theories and proofs. Each is certainly meaningful but may scare away new researchers, especially those with computer science backgrounds. Hence, the aim of this article is to provide a plain-spoken, yet comprehensive, theoretical survey of BNL in terms that researchers in the machine-learning community can understand. It is hoped this survey will serve as a starting point for understanding and exploiting the benefits of BNL in current scholarly endeavours. To achieve this goal, we have collated the extant studies in this field and aligned them with the steps of a standard BNL procedure: from selecting the appropriate stochastic processes, through manipulating those processes, to executing the model inference algorithms. At each step, past efforts are thoroughly summarised and discussed.
In addition, we have reviewed the common methods for implementing BNL in various machine-learning tasks along with its diverse applications in the real world as examples to motivate future studies.

ACM Reference format:

Junyu Xuan, Jie Lu, and Guangquan Zhang. 2019. A Survey on Bayesian Nonparametric Learning. *ACM Comput. Surv.* 52, 1, Article 13 (January 2019), 36 pages. https://doi.org/10.1145/3291044

The Bayesian paradigm for machine learning, also known as Bayesian (machine) learning, applies probability theory and techniques to represent and learn knowledge from data. Bayesian learning has played a significant and irreplaceable role in the machine-learning area ever since the pioneering works of Judea Pearl. Some renowned examples are Bayesian networks, Gaussian mixture models, hidden Markov models [134], Markov random fields, conditional random fields, and latent Dirichlet allocation [11]. Compared to other learning paradigms, Bayesian learning has distinctive advantages. First, it embraces uncertainty by explicitly representing, manipulating, and mitigating it based on the solid theoretical foundation of probability, which makes Bayesian learning more robust when facing real-world intelligent systems [96]. Second, Bayesian learning can smoothly encode prior knowledge about the problem under study in parallel with the knowledge learned from observed data, an asset that is also useful for overcoming the overfitting issues that tend to arise with limited amounts of data, because priors can be treated as regularisers of the data. Last, Bayesian learning has inherent interpretability thanks to its clear and meaningful probabilistic structure; beyond the ability to make predictions, this interpretability becomes an additional desired learning outcome. In a standard Bayesian learning procedure, the goal is to build a model composed of a set of parameters (or random variables) to suit the target problem and then infer the posterior of these parameters given observed data. The model's parameters, which are normally defined to satisfy known fixed-dimensional probabilistic distributions (e.g., Gaussian, Dirichlet, gamma, and multinomial distributions), control the data's goodness-of-fit and the model's complexity.
Thus, inappropriate model parameters may lead to underfitting (i.e., poor goodness-of-fit despite low model complexity) or overfitting (i.e., good goodness-of-fit but excessive model complexity). Since the building blocks of Bayesian learning are fixed-dimensional probabilistic distributions, the number of parameters must be finite; hence the name, Bayesian parametric learning. Determining or learning the number of these parameters relies on human labour or on restarting the algorithm several times to find the optimal setting, a process that is time-consuming and does not scale to large volumes of unfamiliar data.

Bayesian nonparametric learning (BNL) advances Bayesian learning in terms of representation power and modelling flexibility. First, it is necessary to clarify that *nonparametric* does not mean “there are no parameters.” In fact, quite the opposite is true: in theory, there are an infinite number of parameters in Bayesian nonparametric models. *Nonparametric* therefore means “there is no need to predefine the dimensionality of the parameters.” Studies on Bayesian nonparametrics began with two papers written by Thomas S. Ferguson in 1973 [57] and Kjell Doksum in 1974 [43]. However, the nonparametric paradigm did not attract much attention from computer scientists until 2005, when a conference paper titled “Sharing Clusters among Related Groups: Hierarchical Dirichlet Processes” applied Bayesian nonparametrics to machine learning [166]. Due to the great success of this work, computer scientists began to pay attention to Bayesian nonparametrics, giving rise to BNL. From then onward, BNL became an interdisciplinary subject for statisticians and computer scientists. Rather than playing with fixed-dimensional probabilistic distributions, BNL is a “game” played with infinite-dimensional stochastic processes (e.g., Dirichlet, Gaussian, Poisson, gamma, and negative binomial processes). The benefits are twofold. First, stochastic processes are not restricted by predetermined assumptions about the dimensionality of the data, which provides a far greater ability to represent the data accurately. Second, BNL builds probabilistic models with flexible structures that can autonomously adapt to new incoming observations (also known as “letting the data speak”). A typical scenario is document modelling: BNL can not only learn the summarised topics in a set of documents but also adapt the number of learned topics according to the documents in the set. Further, when new documents arrive, BNL can change the number and content of the topics effortlessly.
However, there is no such thing as a free lunch. While BNL does bring powerful representations and highly flexible models to the learning table, model inference still faces great challenges. Fortunately, statistical inference techniques have advanced at a similar pace as BNL has progressed. These advancements include sampling-based inference algorithms (e.g., slice sampling and Hamiltonian Monte Carlo), optimisation-based inference algorithms (e.g., variational inference), and hybrid inference algorithms (e.g., stochastic gradient Markov chain Monte Carlo). Together with the growing computational power of hardware (e.g., multi-core CPUs, GPUs, and distributed computing platforms), these efficient inference algorithms make it possible to apply BNL to large-scale real-world tasks.

Only a small number of surveys and books in the literature focus on BNL, and most were written by statisticians. Among these, the pioneering review [58] and three books [74, 88, 123] stand out. While thorough, these works each express a perspective on BNL from a statistician's point of view: applications are based on statistical scenarios and supported by deep, detailed theoretical analyses and property proofs in those contexts. Although profound and highly necessary for developing new models or investigating consistency and asymptotic behaviours, all are difficult to understand for those with a computer science background. Further, they lack a roadmap for applying these techniques to machine-learning tasks or real-world data-driven scenarios. Outside of statistics, there are also two general introductions to BNL [70, 96] and one survey specifically on non-exchangeable priors [60]. However, these either present very early research progress or focus on one aspect of BNL. A substantial amount of progress has been made in this field over the past 10 years, which justifies a comprehensive and updated review.

This survey aims to provide a good starting point for researchers who are interested in BNL, primarily those in the computer science community. To achieve this goal, we have organised the work in this field to align with the standard BNL procedure. That is, (a) select appropriate stochastic processes, (b) manipulate those processes, and (c) execute the model inference. In presenting the stochastic processes, we have mainly focused on the ability of each to model different kinds of data, rather than on detailed theoretical definitions and proofs. These definitions and proofs have been purposely omitted; corresponding references are provided for further reading. Our purpose is to highlight the potential pathways for studying BNL so researchers can choose their own points of interest as a launch pad for further explorations, for example, developing a new stochastic process for a specific data structure or designing a more efficient inference algorithm. The Bayesian nonparametric extensions of current machine-learning algorithms and models are reviewed as motivating examples for researchers who already have knowledge in machine learning; the goal here is to explain BNL's merits to those who intend to extend other algorithms or models. Additionally, we present a selection of real-world applications across a diverse range of domains to show the practical value of BNL and to encourage applied researchers to consider BNL as an analytical tool in their future studies.

The remainder of this work is organised as follows. The definitions of BNL are introduced in Section 2. Section 3 presents the basic ingredients (i.e., various stochastic processes) of BNL followed by their manipulations summarised in Section 4. Section 5 reviews the model inference techniques used in this community. Sections 6 and 7 discuss the use of BNL in machine-learning tasks and real-world applications. Section 8 concludes this survey and provides our visions for the future.

One closely related definition of Bayesian nonparametrics is given by statisticians as

Bayesian nonparametrics are models and methods characterised by (a) big parameter spaces (unknown density and regression functions, link and response functions, etc.) and (b) construction of probability measures over these spaces.

This definition is used to distinguish four important concepts in statistics: frequentist parametrics, Bayesian parametrics, frequentist nonparametrics, and Bayesian nonparametrics. Another simpler definition of Bayesian nonparametric models comes from the *Encyclopedia of Machine Learning* as

A Bayesian nonparametric model (BNP) is a Bayesian model on an infinite-dimensional parameter space. The parameter space is typically chosen as the set of all possible solutions for a given learning problem.

Definition 2 was formulated for researchers in the machine-learning field, so it concentrates more on the infinite-dimensional characteristic and the learning ability of BNP models. Additionally, the ability to build models on infinite-dimensional parameter spaces is an important distinction in this definition, as it points to stochastic processes as an alternative to probabilistic distributions. Compared to the traditional use of stochastic processes for modelling time-series events, one major contribution of Bayesian nonparametric learning to the machine-learning community is the introduction of stochastic processes for partitioning two- (or higher-) dimensional spaces or for modelling more complex data structures. BNL's strengths from a computer science perspective are its flexible data structures and manipulations. Therefore, we argue it would be better to define BNL according to those strengths, to promote a better understanding of BNL among computer scientists and enhance their willingness to adopt it in their studies. An alternative definition for BNL follows.

Bayesian nonparametric learning (BNL) builds and infers probabilistic models for specific learning tasks based on stochastic processes and their manipulations.

In the above definition, the core of BNL is to define and manipulate stochastic processes according to the target task. As an analogy, building a Bayesian nonparametric model for a specific learning task is just like using Lego bricks to build a robot, where the stochastic processes are bricks of different shapes, and the manipulation is like assembling the bricks to form more complex objects, e.g., a robot. This definition vividly demonstrates the core of BNL and the standard procedure for building a model. In the next section, we review the different bricks, i.e., stochastic processes, currently used in this field, followed by the different manipulation techniques.

In probability theory, a stochastic process is a set of (indexed) random variables, where the indexes are derived from a (possibly countably infinite) index/parameter space and the variables take values in a state space. Statistics defines a number of different stochastic processes according to the different properties of these random variables; we have organised some popular ones in BNL into a relational diagram in Figure 1. Not all of these processes have been used in the machine-learning field (yet), but the most popular and representative ones are reviewed below.

—*Gaussian Process (GP).* A GP [151] is a specific set of random variables with the property that any finite subset of these variables follows a multivariate Gaussian distribution. A GP prior is represented as

\begin{equation} \begin{aligned} g(x) \sim \text{GP}(f_m(x), k_c(x, x^{\prime })), \end{aligned} \end{equation}

where $f_m(x)$ is the mean function and $k_c(x, x^{\prime })$ is the covariance (kernel) function. When used in machine learning, $g(x)$ models an underlying function, e.g., the mapping between the features and labels of data. Hence, the GP is a good prior for latent functions and is especially useful when no knowledge of the functional form exists. It has been proven that GPs have deep relationships with traditional machine-learning algorithms, e.g., support vector machines [178].
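As an illustration of how such a prior is used in practice, the following sketch draws sample paths $g(x)$ from a GP prior over a one-dimensional grid. The zero mean function and squared-exponential kernel (with its length scale) are our own illustrative assumptions; the survey does not fix particular choices of $f_m$ and $k_c$.

```python
import numpy as np

def gp_prior_samples(x, n_samples=3, length_scale=1.0, jitter=1e-6, seed=0):
    """Draw sample paths g(x) from a zero-mean GP prior with an RBF kernel."""
    rng = np.random.default_rng(seed)
    # Squared-exponential covariance k_c(x, x') = exp(-(x - x')^2 / (2 l^2))
    d = x[:, None] - x[None, :]
    K = np.exp(-0.5 * (d / length_scale) ** 2)
    # Any finite subset of GP variables is jointly Gaussian, so a sample path
    # over the grid x is one draw from N(0, K); jitter keeps K numerically PSD.
    L = np.linalg.cholesky(K + jitter * np.eye(len(x)))
    return L @ rng.standard_normal((len(x), n_samples))

x = np.linspace(0.0, 5.0, 50)
paths = gp_prior_samples(x)   # shape (50, 3): three random smooth functions
```

Each column of `paths` is one realisation of the latent function over the grid; conditioning on observed data would turn this prior into a GP posterior.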

One large group of popular stochastic processes in BNL falls within the scope of Lévy processes [148], which are (informally) defined as stochastic processes with stationary and independent increments. According to the Lévy-Itô decomposition, a Lévy process can be roughly decomposed into two components: one continuous component and one discrete component. When there is no continuous component and the discrete jumps are positive, the Lévy process is called a pure-jump increasing Lévy process and is characterised by its Lévy measure [177]. One merit of pure-jump increasing Lévy processes is their close relationship with Poisson processes: each pure-jump increasing Lévy process corresponds to a Poisson process with the Lévy measure as its mean measure. Because of this relationship, a realisation of a pure-jump increasing Lévy process can be viewed as a (countably infinite) set of points in the product space defined by its index/parameter space and state space, where each point is a pair <a parameter, a variable>. These points/pairs are all that is obtained from the stochastic process. Based on these points/pairs, a measure on the parameter space can be constructed as $M = \sum _{k=1}^{\infty } \text{variable}_k \,\delta _{\text{parameter}_k}$, where $M$ is a measure, $k$ is the point index, and $\delta$ is the Dirac measure. The constructed measure is called a random measure since the realisation of the process is random. Furthermore, an important property of the Poisson process is its equivalence with completely random measures (CRMs): any measure $M$ constructed from a Poisson process is a completely random measure [100]. A number of stochastic processes in this area have been proven to be special cases of CRMs, and these processes can be manipulated thanks to this property, as discussed in more detail in the following section. Figure 2 illustrates these concepts using three examples.
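The realisation-as-points view can be made concrete in a few lines of code: a (truncated) draw is just a list of <parameter, variable> pairs, and the random measure $M$ evaluated on a subset sums the weights of the atoms falling inside it. The atom locations and jump sizes below are illustrative placeholders, not draws from any particular Lévy measure.

```python
import numpy as np

# A truncated realisation of a pure-jump process: pairs <parameter_k, variable_k>.
params  = np.array([0.10, 0.35, 0.62, 0.80])   # atoms theta_k in the parameter space [0, 1]
weights = np.array([0.50, 0.20, 0.25, 0.05])   # jump sizes (the "variables")

def measure(a_low, a_high):
    """M(A) = sum_k weight_k * delta_{param_k}(A) for an interval A = [a_low, a_high]."""
    inside = (params >= a_low) & (params <= a_high)
    return weights[inside].sum()

m_half = measure(0.0, 0.5)   # sums the weights of the atoms lying in [0, 0.5]
```

Here `measure(0.0, 0.5)` returns 0.50 + 0.20, since only the first two atoms fall inside the subset.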
Next, we review some representative examples in this group in more detail.

—*Poisson Process (PP).* A PP [102] over the product space is represented as $\Pi \sim \text{PP}(\Pi _0)$, where $\Pi _0$ is a base measure over $\Theta$. Note that, in theory, the base measure $\Pi _0$ could be any measure, but the PP is normally used as the likelihood, so $\Pi _0$ is often a discrete measure from a prior, such as $\Pi _0=\sum _{k=1}^\infty \lambda _k\delta _{\theta _k}$. A realisation $\Pi$ is a set of points and is represented as

\begin{equation} \begin{aligned} \Pi = \sum _k N_k \delta _{\theta _k}, \end{aligned} \end{equation}
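For intuition, the simplest case, a homogeneous PP on the unit interval with a constant rate, can be sampled by the standard two-step recipe: draw the total number of points from a Poisson distribution, then scatter the points uniformly. The rate value below is an illustrative choice.

```python
import numpy as np

def sample_poisson_process(rate, seed=0):
    """Sample a homogeneous Poisson process on [0, 1] with the given constant rate."""
    rng = np.random.default_rng(seed)
    n = rng.poisson(rate)                      # total count: N ~ Poisson(rate * |[0, 1]|)
    return np.sort(rng.uniform(0.0, 1.0, n))   # given N, points fall i.i.d. uniformly

points = sample_poisson_process(rate=20.0)     # one realisation: a random set of points
```

With a discrete base measure, as in the text above, sampling instead draws an independent Poisson count $N_k$ at each atom $\theta_k$ with mean $\lambda_k$.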

—*Gamma Process (GaP).* A GaP [145] $\Gamma \sim \text{GaP}(c, \Gamma _0)$ is a Lévy process whose Lévy measure depends on two parameters: the concentration parameter $c$ and the base measure $\Gamma _0$,

\begin{equation} \begin{aligned} \nu (d\theta , dr) = cr^{-1}e^{-cr}\mathrm{d}r\Gamma _0(\mathrm{d}\theta). \end{aligned} \end{equation}

A realisation $\Gamma$ from the GaP is the random measure

\begin{equation} \begin{aligned} \Gamma = \sum _k r_k \delta _{\theta _k}, \end{aligned} \end{equation}

—*Beta Process (BP).* A BP [87, 170] $B \sim BP(c, B_0)$ is a Lévy process whose Lévy measure depends on two parameters: $c$ and $B_0$. $c$ is a positive concentration parameter, and $B_0$ is a base measure on $\Theta$. If $B_0$ is continuous, then the Lévy measure on $\Theta \times [0, 1]$ of the BP is

\begin{equation} \begin{aligned} \nu (\mathrm{d}\theta , \mathrm{d}b) = cb^{-1}(1-b)^{c-1}\mathrm{d}b B_0(\mathrm{d}\theta). \end{aligned} \end{equation}

A realisation $B$ from the BP is the random measure

\begin{equation} \begin{aligned} B = \sum _k b_k \delta _{\theta _k}, \end{aligned} \end{equation}

—*Negative-binomial Process (NBP).* An NBP [196, 200] $X \sim \text{NBP}(\Gamma _0, B_0)$ is another process over the product space, parameterised by two parts: a shape measure $\Gamma _0 = \sum _k r_k \delta _{\theta _k}$ and a probability measure $B_0 = \sum _k b_k \delta _{\theta _k}$. A realisation $X$ from the NBP is a set of points, and the obtained measure is

\begin{equation} \begin{aligned} X = \sum _k \kappa _k \delta _{\theta _k}, \end{aligned} \end{equation}

—*Bernoulli Process (BeP).* A BeP [102] is a Lévy process $I \sim \text{BeP}(I_0)$, where $I_0 = \sum _k b_k \delta _{\theta _k}$ is a hazard measure. A realisation of the BeP is

\begin{equation} \begin{aligned} I = \sum _k \zeta _k \delta _{\theta _k}, \end{aligned} \end{equation}
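Given a discrete hazard measure, a BeP realisation simply flips one coin per atom: $\zeta_k \sim \text{Bernoulli}(b_k)$. The truncated hazard measure below uses illustrative atoms and probabilities, not draws from a particular prior.

```python
import numpy as np

# Truncated hazard measure I_0 = sum_k b_k * delta_{theta_k} (illustrative values)
thetas = np.array([0.1, 0.4, 0.7, 0.9])    # atoms theta_k
b      = np.array([0.9, 0.5, 0.3, 0.05])   # per-atom probabilities b_k

rng = np.random.default_rng(1)
zeta = rng.binomial(1, b)                  # zeta_k ~ Bernoulli(b_k)
active_atoms = thetas[zeta == 1]           # I = sum_k zeta_k delta_{theta_k} keeps these atoms
```

Repeating this draw for many "customers" with a BP-distributed hazard measure yields the binary feature matrices of the Indian buffet process.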

There are also some processes that are not Lévy processes but can be derived from them. For example, the Dirichlet process is a normalisation of the GaP, and the Mondrian process is a multidimensional extension of the PP.

—*Dirichlet Process (DP).* The DP [57, 162] is the pioneer and foundation of BNL. It is defined as follows: a DP, specified by a base measure $H$ on a measurable space $\Theta$ and a concentration parameter $\alpha$, is a set of countably infinite random variables that can be viewed as a (probability) measure on a random infinite partition $\lbrace \Omega _k\rbrace _{k=1}^{\infty }$ of $\Theta$. An explicit representation of a DP using the stick-breaking process [162] is

\begin{equation} \begin{aligned} G = \sum _k \pi _k \delta _{\theta _k}, \end{aligned} \end{equation}

\begin{equation} \begin{aligned} (G(\Omega _1), G(\Omega _2), \ldots , G(\Omega _K)) \sim \text{Dir}(\alpha H(\Omega _1), \alpha H(\Omega _2), \ldots , \alpha H(\Omega _K)), \end{aligned} \end{equation}

—*Mondrian Process (MoP).* Let $\Theta$ be a box with positive linear dimensions. The Mondrian process [143] on $\Theta$, denoted MP($\Theta$), is a temporal stochastic process $(M_t)_{t\ge 0}$ that takes values in the guillotine partitions of $\Theta$; its distribution is specified by the generative process Mondrian($\Theta$): the random variable $M_t$ is the guillotine partition of $\Theta$ formed by cuts/nodes with birth time $t_b \le t$.

Although the measure is not a widely used concept in computer science, we have included it in this article because the measure view of these stochastic processes is essential for developing aspects of BNL, such as new manipulations and posterior inference algorithms. The following example illustrates the concept: the bottom row of Figure 2 shows a sample (i.e., an infinite number of points) $\lbrace (b_1,\theta _1),(b_2,\theta _2), \ldots ,(b_k,\theta _k)\rbrace$ from a beta process with a 2-simplex as the index space and (0, 1) as the state space. The measure composed of these points is $M=\sum _{k} b_k \delta _{\theta _k}$, which is defined as a measure on the index/parameter space, i.e., it maps a subset of this space to a nonnegative value. For a subset of the 2-simplex (i.e., $A$ in the figure), the value of the measure $M$ on $A$ is $M(A)=\sum _{k} b_k \delta _{\theta _k}(A) = b_1 + b_2$. From this example, we can see that the summation in the measure definitions sums the weights of the points within the subset $A$. To simplify further, particularly for computer scientists, a measure can simply be considered a set of points, e.g., $\lbrace (b_1,\theta _1),(b_2,\theta _2), \ldots ,(b_k,\theta _k)\rbrace$ as shown in Figure 2. Such points are, in fact, what we obtain and operate on in BNL. A concrete example of how the points from a measure definition are used follows: suppose we have a number of data points, i.e., $\mathbf {x}$, in two-dimensional real space, as shown in Figure 3. We can use the points from a Dirichlet process, i.e., $\lbrace (\pi _1,\theta _1 = \langle \mu _1, \Sigma _1\rangle),(\pi _2,\theta _2= \langle \mu _2, \Sigma _2\rangle), \ldots ,(\pi _k,\theta _k= \langle \mu _k, \Sigma _k\rangle)\rbrace$, to model these data as $p(\mathbf {x})=\sum _k \pi _k \mathcal {N}(x; \mu _k, \Sigma _k)$, where $\mathcal {N}(x; \mu _k, \Sigma _k)$ denotes a Gaussian probability density function parameterised by a mean vector $\mu _k$ and covariance matrix $\Sigma _k$.
One component of a point, $\theta$, supplies the parameters of a mixture component, and the other component, $\pi$, supplies its weight. After posterior inference, an appropriate number of points is selected to optimally model the data. As shown in Figure 3, three points/mixture components (i.e., ellipses) are chosen for modelling/representing the data. The above model is well known as the infinite Gaussian mixture model [137].
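The role of the points $(\pi_k, \langle\mu_k, \Sigma_k\rangle)$ can be sketched by evaluating the resulting mixture density for a finite (truncated) set of components. The weights and component parameters below are illustrative, standing in for the result of posterior inference.

```python
import numpy as np

def dp_mixture_density(x, weights, means, covs):
    """Evaluate p(x) = sum_k pi_k * N(x; mu_k, Sigma_k) for 2-D points x."""
    p = np.zeros(len(x))
    for pi_k, mu_k, cov_k in zip(weights, means, covs):
        diff = x - mu_k
        inv, det = np.linalg.inv(cov_k), np.linalg.det(cov_k)
        quad = np.einsum('ni,ij,nj->n', diff, inv, diff)  # Mahalanobis terms
        p += pi_k * np.exp(-0.5 * quad) / (2 * np.pi * np.sqrt(det))  # 2-D normaliser
    return p

# Three illustrative points (pi_k, <mu_k, Sigma_k>) from a truncated DP draw
weights = [0.5, 0.3, 0.2]
means   = [np.zeros(2), np.array([3.0, 3.0]), np.array([-3.0, 2.0])]
covs    = [np.eye(2)] * 3
density = dp_mixture_density(np.array([[0.0, 0.0], [3.0, 3.0]]), weights, means, covs)
```

In a full infinite Gaussian mixture model, inference would additionally decide how many of the infinitely many points carry non-negligible weight.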

Some important properties of the above stochastic processes are summarised in Table 1, including stick-breaking construction methods and power-law extensions. Note that these stochastic processes are only a very small portion of all stochastic processes. In addition to the above processes, a large number of stochastic processes with special properties exist in the statistics literature, many of which may be valuable for modelling different data structures or resolving different learning tasks. We believe these processes are a huge undiscovered treasure that deserves more attention from the machine-learning community.

Stochastic Process | Stick-breaking | CRM | NRM | Power-law
---|---|---|---|---
Poisson Process | $\surd$ | | |
Dirichlet Process | $\surd$ [162] | $\surd$ | $\surd$ [167] |
Beta Process | $\surd$ [13] | $\surd$ | |
Bernoulli Process | $\surd$ | | |
Gamma Process | $\surd$ [145] | $\surd$ | |
Negative-binomial Process | $\surd$ | $\surd$ | $\surd$ [14] |
Hierarchical Dirichlet Process | $\surd$ [176] | $\surd$ | $\surd$ [12] |
Indian Buffet Process | $\surd$ [164] | | |

“Stick-breaking” denotes whether the process has a stick-breaking constructive representation; “CRM” denotes whether the process is a completely random measure; “NRM” denotes whether the process is a normalised random measure; “Power-law” denotes whether there is a version with a power-law (cluster number to data number) phenomenon.

Recall that the definition of a process only states its properties but does not provide a way to construct its realisations. Discovering a way to construct the realisation of a stochastic process (also known as a representation) is crucial for posterior analysis. The above subsection only gives the definitions of the different processes, from which we can judge whether a measure is from a specific stochastic process. To obtain an instance/sample from a process and perform model inference, an explicit construction method is needed. Based on the properties of the Beta distribution, stick-breaking is commonly accepted as an efficient construction methodology. Several stick-breaking methods have been proposed for different processes. These are summarised below.

—*Stick-breaking for the Dirichlet Process [152].* Stick-breaking repeatedly breaks a stick of unit length and constructs the weights from these breaks. For the DP, the procedure is as follows:

\begin{equation*} \begin{aligned} G = \sum _k^{\infty } \pi _k \delta _{\theta _k} \hspace{28.45274pt} \pi _k = \nu _k \prod _{j=1}^{k-1} (1-\nu _j) \hspace{28.45274pt} \nu _k \sim \text{Beta}(1, \alpha) \hspace{28.45274pt} \theta _k \sim H, \end{aligned} \end{equation*}
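A truncated version of this procedure is straightforward to implement: draw the break proportions $\nu_k$, turn them into weights $\pi_k$, and pair each weight with an atom from the base measure. The Uniform(0, 1) base measure $H$ and the parameter values are illustrative assumptions.

```python
import numpy as np

def stick_breaking_dp(alpha, truncation, seed=0):
    """Truncated stick-breaking draw from DP(alpha, H), with H = Uniform(0, 1)."""
    rng = np.random.default_rng(seed)
    nu = rng.beta(1.0, alpha, truncation)                 # nu_k ~ Beta(1, alpha)
    # pi_k = nu_k * prod_{j < k} (1 - nu_j): the remaining stick before break k
    remainder = np.concatenate(([1.0], np.cumprod(1.0 - nu)[:-1]))
    pi = nu * remainder
    theta = rng.uniform(0.0, 1.0, truncation)             # atoms theta_k ~ H
    return pi, theta

pi, theta = stick_breaking_dp(alpha=2.0, truncation=1000)
```

With a large truncation level, the leftover stick mass is negligible, so the weights sum to one up to numerical precision.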

—*Stick-breaking for the Hierarchical Dirichlet Process [167, 176].* There are two versions of stick-breaking for the HDP: Teh's version [167] and Sethuraman's version [176]. Sethuraman's version is

\begin{equation*} \begin{aligned} G_0 &= \sum _k^{\infty } \pi _{0,k} \delta _{\theta _k} \hspace{28.45274pt}& \pi _{0,k}&=\nu _{0,k} \prod _{j=1}^{k-1} (1-\nu _{0,j}) \hspace{28.45274pt}& \nu _{0,k} &\sim \text{Beta}(1, \alpha _0) \\ G_d &= \sum _t^{\infty } \pi _{d,t} \delta _{\theta _{d,t}} & \pi _{d,t}&=\nu _{d,t} \prod _{j=1}^{t-1} (1-\nu _{d,j}) & \nu _{d,t} &\sim \text{Beta}(1, \alpha _d) \\ \theta _k &\sim H & \theta _{d,t} &= \theta _{z_{d,t}} & z_{d,t} &\sim \pi _0, \end{aligned} \end{equation*}
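Sethuraman's construction can likewise be sketched with a truncated sampler. The key point visible in the code is that the document-level atoms $\theta_{d,t}$ are copies $\theta_{z_{d,t}}$ of the global atoms, so atoms are shared across documents. The Uniform(0, 1) base measure $H$ and the parameter values are illustrative assumptions.

```python
import numpy as np

def hdp_sethuraman(alpha0, alpha_d, truncation, seed=0):
    """Truncated Sethuraman draw: G_0 from DP(alpha0, H), then G_d from DP(alpha_d, G_0)."""
    rng = np.random.default_rng(seed)

    def stick(alpha, size):
        nu = rng.beta(1.0, alpha, size)
        return nu * np.concatenate(([1.0], np.cumprod(1.0 - nu)[:-1]))

    pi0 = stick(alpha0, truncation)                  # global weights pi_{0,k}
    theta = rng.uniform(0.0, 1.0, truncation)        # global atoms theta_k ~ H
    pi_d = stick(alpha_d, truncation)                # document-level weights pi_{d,t}
    z = rng.choice(truncation, truncation, p=pi0 / pi0.sum())  # z_{d,t} ~ pi_0
    theta_d = theta[z]                               # theta_{d,t} = theta_{z_{d,t}}: shared atoms
    return pi0, theta, pi_d, theta_d

pi0, theta, pi_d, theta_d = hdp_sethuraman(alpha0=1.0, alpha_d=1.0, truncation=500)
```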

—*Stick-breaking for the Indian Buffet Process [164].* To obtain $Z \sim \text{IBP}(\alpha)$, the stick-breaking procedure is

\begin{equation*} \begin{aligned} z_{i,k} \sim \text{Bernoulli}(b_k) \hspace{28.45274pt} b_{k} = \nu _{k} b_{k-1} = \prod _{l=1}^k \nu _{l} \hspace{28.45274pt} \nu _{k} \sim \text{Beta}(\alpha , 1), \end{aligned} \end{equation*}
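In a truncated sketch, the cumulative products $b_k$ form a decreasing sequence of feature probabilities, and each row of the binary matrix $Z$ is drawn by independent coin flips against them. The parameter values below are illustrative.

```python
import numpy as np

def ibp_stick_breaking(alpha, n_rows, truncation, seed=0):
    """Truncated stick-breaking draw of a binary feature matrix Z ~ IBP(alpha)."""
    rng = np.random.default_rng(seed)
    nu = rng.beta(alpha, 1.0, truncation)   # nu_k ~ Beta(alpha, 1)
    b = np.cumprod(nu)                      # b_k = prod_{l<=k} nu_l: decreasing probabilities
    return rng.binomial(1, b, (n_rows, truncation))  # z_{i,k} ~ Bernoulli(b_k)

Z = ibp_stick_breaking(alpha=3.0, n_rows=10, truncation=50)
```

Because the $b_k$ decay geometrically, only the first few columns of `Z` tend to be active, which is exactly the "rich get richer" feature-sparsity of the IBP.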

—*Stick-breaking for the Beta Process [13].* The stick-breaking construction for $B \sim \text{BP}(c, B_0)$ differs from that of the IBP in that there is more than one unit stick. The procedure is

\begin{equation*} \begin{aligned} B = \sum _{i=1}^\infty \sum _{j=1}^{C_i} \nu _{i,j}^{(i)}\prod _{l=1}^{i-1} \big(1-\nu _{i,j}^{(l)}\big) \delta _{\theta _{i,j}} \hspace{17.07182pt} \nu _{i,j}^{(l)} \sim \text{Beta}(1, c) \hspace{17.07182pt} C_i \sim \text{Poisson}(B_0(\Omega)) \hspace{17.07182pt} \theta _{i,j} \sim B_0, \end{aligned} \end{equation*}

The aim of BNL is to use the above stochastic processes to model data and then solve specific tasks. However, the modelling ability of any single stochastic process is limited, so in more complicated situations, two or more stochastic processes may need to be manipulated in different ways. A summary of state-of-the-art manipulation designs follows; their relationships to the most popular stochastic processes used in BNL are illustrated in Figure 5.

The base measure of a stochastic process can be either continuous or discrete, and the realisation of a stochastic process is a random measure. Hence, it is possible to use a random measure from one stochastic process as the base measure of another, sharing statistical strength between the processes. Take document modelling as an example: the random measure in the upper layer could be the topic pool, and the lower measure at the document level could be the topics in a document. The existing Bayesian nonparametric priors based on this manipulation are summarised as follows.

—*Hierarchical Dirichlet Process (HDP).* The HDP [167] is built by piling one DP on top of other DP(s) to transfer statistical strength from the top layer to the bottom layer. For example, suppose a top measure $G_0$ over $\Theta$ has a DP ($\alpha$, $H$) prior and $G_d$ is a measure in the bottom layer with a DP ($\alpha _d, G_0$) prior; then we have

\begin{equation} \begin{aligned} G_d \sim \text{DP}(\alpha _d, G_0), \,\,\,G_0 \sim \text{DP}(\alpha , H). \end{aligned} \end{equation}

—*Hierarchical Pitman-Yor Processes (HPYP).* Similar to the HDP, the HPYP [110] piles one PYP on top of other PYP(s). Unlike the HDP, the HPYP has two parameters: a concentration parameter $\alpha$ and a discount parameter $\beta$.

\begin{equation} \begin{aligned} Y_d \sim \text{PYP}(\alpha _d, \beta _d, Y_0), \,\,\,Y_0 \sim \text{PYP}(\alpha , \beta , H). \end{aligned} \end{equation}

—*Hierarchical Beta Process (HBP).* Akin to HDP, HBP [170] was proposed to make the different BPs share the same discrete measure (an infinite number of probabilities) from a global BP through

\begin{equation} \begin{aligned} B_d \sim \text{BP}(c_d, B_0), \,\,\,B_0 \sim \text{BP}(c, H). \end{aligned} \end{equation}

—*Gamma-Negative Binomial Process (GNBP).* In contrast to the HDP and HBP, each of which layers two processes of the same type (DPs and BPs, respectively), the GNBP [197] is composed of two different processes, a GaP and an NBP, as follows:

\begin{equation} \begin{aligned} I_d \sim \text{NBP}(p, \Gamma _0), \,\,\,\Gamma _0 \sim \text{GaP}(c, H). \end{aligned} \end{equation}

—*Beta-Negative Binomial Process (BNBP).* Analogous to a negative-binomial distribution, an NBP has two parameters. While the GNBP places a GaP on one parameter, the BNBP layering manipulation [14] places a BP on the other, as follows:

\begin{equation} \begin{aligned} I_d \sim \text{NBP}(B_0, r), \,\,\,B_0 \sim \text{BP}(c, H). \end{aligned} \end{equation}

—*Enriched Dirichlet Process (enDP).* The enDP [173] can be viewed as a special case of the HDP, which is defined as

\begin{equation} \begin{aligned} G_{\theta } \sim \text{DP}(\alpha , H_{\theta }), \,\,\, G_{\phi |\theta } \sim \text{DP}(\alpha _{\phi |\theta }, G_{\theta }) \end{aligned} \end{equation}

Superposition is used to combine two or more random measures together, like a “plus” operation. If layering is considered to be a form of *vertical* manipulation, then the superposition is a *horizontal* manipulation.

—*Superposition of Poisson Processes.* According to the Superposition Theorem in Reference [102], combining a set of independent Poisson processes yields a new PP with a mean measure that is the sum of the mean measures of the individual processes:

\begin{equation} \begin{aligned} \Pi _1 \oplus \cdots \oplus \Pi _n \sim \text{PP}(\mu _1 + \cdots + \mu _n) \end{aligned} \end{equation}
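For homogeneous PPs, the theorem is easy to visualise: merging the point sets of independent processes gives one process whose rate is the sum of the individual rates. The rates below are illustrative choices.

```python
import numpy as np

def sample_pp(rate, rng):
    """Homogeneous PP on [0, 1]: N ~ Poisson(rate), locations i.i.d. uniform."""
    return rng.uniform(0.0, 1.0, rng.poisson(rate))

rng = np.random.default_rng(0)
pi1 = sample_pp(10.0, rng)
pi2 = sample_pp(5.0, rng)
merged = np.sort(np.concatenate([pi1, pi2]))  # superposition: a PP with rate 10 + 5 = 15
```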

—*Superposition of Dirichlet Processes.* Inspired by the superposition of PP, the authors of Reference [112] proposed the superposition of DP as follows:

\begin{equation} \begin{aligned} (c_1, c_2, \ldots , c_n) &\sim \text{Dir}(\mu _1(\Omega), \mu _2(\Omega), \ldots , \mu _n(\Omega))\\ c_1G_1 \oplus c_2G_2 \oplus \cdots \oplus c_nG_n &\sim \text{DP}(\mu _1 + \mu _2 + \cdots + \mu _n), \end{aligned} \end{equation}

—*Superposition of Normalised Random Measures (SNRM).* Since the DP is a special case of an NRM, the superposition manipulation of DPs has been extended to the more general NRMs [28] as follows: the superposition of $n$ independent normalised random measures $\lbrace \mu _j\rbrace ^n_{j=1}$ on $\Theta$ is

\begin{equation} \begin{aligned} \mu _1 \oplus \cdots \oplus \mu _n &= c_1\mu _1 + \cdots + c_n\mu _n, \end{aligned} \end{equation}
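A minimal numerical illustration of this convex combination (the toy measures and fixed masses are our own stand-ins; in Reference [28] the coefficients $c_j$ are random and tied to the unnormalised total masses):

```python
def normalise(weights):
    """Rescale a list of non-negative weights to sum to one."""
    s = sum(weights)
    return [w / s for w in weights]

def superpose_nrms(measures, masses):
    """Superpose normalised random measures: a convex combination whose
    coefficients are derived from the (unnormalised) total masses,
    taken here as fixed constants for determinism."""
    c = normalise(masses)
    combined = {}
    for cj, measure in zip(c, measures):
        for atom, weight in measure.items():
            combined[atom] = combined.get(atom, 0.0) + cj * weight
    return combined

# two toy discrete probability measures over named atoms
mu1 = {"a": 0.5, "b": 0.5}
mu2 = {"b": 0.25, "c": 0.75}
mu = superpose_nrms([mu1, mu2], masses=[2.0, 6.0])
```

The result `mu` is again a probability measure: its weights sum to one.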

Subsampling (randomly) selects a subset of the infinitely many components of a random measure drawn from a stochastic process. When certain conditions are satisfied, the selected components form a new random measure from an underlying process.

—*Thinned Poisson Process [112].* Based on the Marking Theorem [102], let $\Pi \sim \text{PP}(\mu)$ be a PP on the space $\Theta$, and let $q : \Theta \rightarrow [0, 1]$ be a measurable function. If we independently draw $z_{\theta } \in \lbrace 0, 1\rbrace$ for each $\theta \in \Pi$ with $P(z_{\theta } = 1) = q(\theta)$, then a new PP is

\begin{equation} \begin{aligned} \Pi _t &= \lbrace \theta \in \Pi : z_{\theta } = 1\rbrace \sim \text{PP}(q\mu), \end{aligned} \end{equation}
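Independent thinning is easy to simulate; the sketch below (our own illustrative code, with a constant $q$ for simplicity) keeps each point with probability $q(\theta)$:

```python
import random

def thin(points, q, rng):
    """Independent thinning: keep each point θ with probability q(θ).
    If the originals come from PP(μ), the survivors come from PP(qμ)."""
    return [theta for theta in points if rng.random() < q(theta)]

rng = random.Random(1)
points = [rng.random() for _ in range(1000)]     # stand-in point set on [0, 1]
kept = thin(points, q=lambda t: 0.3, rng=rng)    # constant retention probability
# kept is a subset of points, roughly 30% of them
```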

—*Thinned Completely Random Measures (TCRM) [59].* Let $\Pi = \sum _{k=1}^{\infty }\pi _k \delta _{\theta _k}$ be a CRM on $\Theta$, and let $q : \Theta \rightarrow [0, 1]$ be a measurable function. For each point $(\theta _k, \pi _k)$, define a Bernoulli variable $r_k$ with $P(r_k = 1) = q(\theta _k)$, independent of the other $\lbrace r_k\rbrace$; then

\begin{equation} \begin{aligned} TCRM &= \sum _{k=1}^{\infty }r_k\pi _k \delta _{\theta _k}, \end{aligned} \end{equation}

—*Thinned Dirichlet Process [112].* Let $G \sim \text{DP}(\mu)$ be a DP on $\Theta$, represented as $G = \sum _{k=1}^{\infty } \pi _k \delta _{\theta _k}$, and let $q : \Theta \rightarrow [0, 1]$ be a measurable function. For each $k$, independently draw $r_k$ with $P(r_k = 1) = q(\theta _k)$; then

\begin{equation} \begin{aligned} G_t &= \sum _{k:r_k=1} \pi _k^{\prime } \delta _{\theta _k} \sim \text{DP}(q\mu), \end{aligned} \end{equation}

—*Thinned Normalised Random Measures (TNRM) [28, 29, 112].* Given an NRM $\mu = \sum _{k=1}^{\infty } \pi _k \delta _{\theta _k}$ on $\Theta$ and Bernoulli variables $r_k \in \lbrace 0, 1\rbrace$ with $P(r_k = 1) = q(\theta _k)$, the TNRM is

\begin{equation} \begin{aligned} TNRM &= \sum _{k:r_k=1} \pi _k^{\prime } \delta _{\theta _k}, \end{aligned} \end{equation}
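The renormalisation step (the $\pi_k^{\prime}$ above) can be sketched as follows, using truncated stick-breaking weights as a stand-in for a realisation and a constant $q$ for simplicity (illustrative code of our own):

```python
import random

def stick_breaking(alpha, num_sticks, rng):
    """Truncated stick-breaking weights of a DP(alpha)."""
    weights, remaining = [], 1.0
    for _ in range(num_sticks):
        b = rng.betavariate(1.0, alpha)
        weights.append(b * remaining)
        remaining *= 1.0 - b
    return weights

def thin_and_renormalise(weights, q, rng):
    """Keep component k with probability q, then renormalise the
    surviving weights so they again sum to one (the π′_k above)."""
    kept = [w for w in weights if rng.random() < q]
    total = sum(kept)
    return [w / total for w in kept]

rng = random.Random(2)
pi = stick_breaking(alpha=1.0, num_sticks=50, rng=rng)
pi_t = thin_and_renormalise(pi, q=0.5, rng=rng)
```

After thinning, roughly half the components survive, and their renormalised weights again form a probability vector.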

Point-transition moves the points of a random measure according to an underlying probabilistic transition.

—*Point-transition of Poisson Process.* Based on the Transition Theorem [102], let $\Pi \sim \text{PP}(\mu)$ be a PP on the space $\Theta$, and let $T: \Theta \times \mathcal {F}_{\Theta } \rightarrow [0, 1]$ be a probabilistic transition. From Reference [112], the transformed measure is

\begin{equation} \begin{aligned} \Pi _p &= \lbrace T(\theta): \theta \in \Pi \rbrace = \text{PP}(T\mu). \end{aligned} \end{equation}

—*Point-transition of Dirichlet Process [112].* Let $G = \sum _{k=1}^{\infty } \pi _k \delta _{\theta _k} \sim \text{DP}(\mu)$, and its point-transition is

\begin{equation} \begin{aligned} G_p &= \sum _{k=1}^{\infty } \pi _k \delta _{T(\theta _k)} \sim \text{DP}(T\mu). \end{aligned} \end{equation}

—*Point-transition of Normalised Random Measures (PNRM) [28].* Given an NRM $\mu = \sum _{k=1}^{\infty } \pi _k \delta _{\theta _k}$ on $\Theta$, the point-transition of $\mu$ draws atoms $\theta _k^{\prime } = T(\theta _k)$ from a transformed base measure to yield a new NRM:

\begin{equation} \begin{aligned} PNRM(\mu) &= \sum _{k} \pi _k \delta _{T(\theta _k)}, \end{aligned} \end{equation}
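A point-transition simply relocates atoms while preserving their weights; a toy sketch (illustrative, with a Gaussian perturbation standing in for the transition $T$):

```python
import random

def point_transition(measure, T):
    """Move every atom θ_k to T(θ_k) while keeping its weight π_k,
    merging any atoms that land on the same point."""
    moved = {}
    for atom, weight in measure.items():
        new_atom = T(atom)
        moved[new_atom] = moved.get(new_atom, 0.0) + weight
    return moved

rng = random.Random(3)
G = {1.0: 0.5, 2.0: 0.3, 3.0: 0.2}   # toy discrete probability measure
G_p = point_transition(G, T=lambda t: t + rng.gauss(0.0, 0.1))
# the atoms have moved, but the total mass is still one
```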

Given a partition of a space by one process (e.g., a DP), nesting further partitions each area through another process. In other words, each component of a random measure is attached to an additional random measure.

—*Nested Dirichlet Process (nDP) [8].* The root partition is created by a DP, and each component in that partition is further attached to another DP. Repeating this procedure forms a tree of infinite width and depth. Each datum (e.g., a document) is associated with a path of this tree.

—*Nested Hierarchical Dirichlet Process (nHDP) [129].* This process generates a nested structure similar to a nDP but associates a datum with a new revised copy of the generated tree structure rather than a simple path.

According to de Finetti's Theorem [2], an exchangeable sequence can be obtained by marginalising out the de Finetti mixing measure as follows:

\begin{equation} \begin{aligned} P(X_1, X_2, \ldots , X_n) = \int \prod _{i=1}^n P(X_i|\theta) P(\mathrm{d}\theta), \end{aligned} \end{equation}
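A concrete instance of this construction is the beta-Bernoulli model: mixing coin flips that are i.i.d. given $\theta$ over a $\text{Beta}(a,b)$ measure yields an exchangeable sequence whose marginal probability depends only on the counts of ones and zeros. A Monte Carlo sketch (illustrative, not from the survey):

```python
import random

def marginal_prob(sequence, a=1.0, b=1.0, num_samples=100_000, rng=None):
    """Monte Carlo estimate of P(X_1..X_n) = ∫ ∏ P(X_i|θ) dP(θ) with a
    Beta(a, b) mixing measure: draw θ, average the product of likelihoods."""
    rng = rng or random.Random(4)
    total = 0.0
    for _ in range(num_samples):
        theta = rng.betavariate(a, b)
        p = 1.0
        for x in sequence:
            p *= theta if x == 1 else 1.0 - theta
        total += p
    return total / num_samples

# exchangeability: reordering a sequence with the same 0/1 counts
# leaves its marginal probability unchanged
p1 = marginal_prob([1, 1, 0])
p2 = marginal_prob([0, 1, 1])
```

For the uniform $\text{Beta}(1,1)$ mixing measure, the exact marginal of two ones and one zero is $B(3,2)/B(1,1) = 1/12$, which the estimates approach.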

—*Chinese Restaurant Process (CRP).* The CRP is obtained by marginalising out a DP:

\begin{equation} \begin{aligned} CRP(X) = \int P(X|G) \mathrm{d} G,\,\, G \sim DP, \end{aligned} \end{equation}
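The resulting predictive rule (customer $i$ joins an occupied table with probability proportional to its occupancy $n_t$, or a new table with probability proportional to $\alpha$) is easy to simulate; the sketch below is illustrative:

```python
import random

def crp(num_customers, alpha, rng):
    """Chinese Restaurant Process: customer i joins table t with
    probability n_t / (alpha + i), or a new table with alpha / (alpha + i)."""
    tables = []        # tables[t] = number of customers seated at table t
    assignments = []   # table index chosen by each customer
    for i in range(num_customers):
        r = rng.random() * (alpha + i)
        for t, n_t in enumerate(tables):
            if r < n_t:
                tables[t] += 1
                assignments.append(t)
                break
            r -= n_t
        else:
            tables.append(1)                    # open a new table
            assignments.append(len(tables) - 1)
    return assignments, tables

rng = random.Random(5)
assignments, tables = crp(num_customers=100, alpha=2.0, rng=rng)
```

The number of occupied tables grows roughly as $\alpha \log n$, the "rich get richer" behaviour induced by the DP.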

—*Chinese Restaurant Franchise (CRF) [167].* CRF is a marginalisation of HDP. Based on the CRP, the metaphor used to understand this process is as follows: There are $D$ Chinese restaurants with a shared menu. The $i$th customer walks into the $d$th restaurant and picks an occupied table at which to sit with probability $\frac{n_{d,t}}{\alpha _d+i-1}$ or a new table with probability $\frac{\alpha _d}{\alpha _d+i-1}$, where $n_{d,t}$ is the number of customers sitting at table $t$ in the $d$th restaurant. If this customer picks an occupied table, then she simply eats the dish already on that table; if she picks a new table, then she needs to order a new dish from the menu according to its popularity: the new dish is the same as dish $\theta _k$ served on other tables with probability $\frac{T_k}{\alpha +\sum _k T_k}$, and it is a brand-new dish with probability $\frac{\alpha }{\alpha +\sum _k T_k}$, where $T_k$ is the number of tables serving the same dish $\theta _k$. As a result, $\theta _{d,t}$ is the dish on table $t$ of restaurant $d$, and $\theta _{d,i}$ is the dish eaten by customer $i$ in restaurant $d$.

—*Nested Chinese Restaurant Process (nCRP) [141].* nCRP is a marginalisation of nDP. A metaphor based on the CRP can be used to understand this process. There are infinitely many restaurants in a city, and each restaurant has infinitely many tables. On each table, there is a card with the address of another restaurant; as such, all restaurants are organised into an infinite tree structure. A customer visiting this city dines first at the root restaurant and chooses a table using the CRP strategy. She notes the name of the restaurant on the card on her table, and the next night she dines there, again choosing a table using the CRP. This procedure repeats infinitely many times.

—*Nested Chinese Restaurant Franchise (nCRF) [1].* Analogous to nCRP and nDP, nCRF is a marginalisation of nHDP. The metaphor, based on the CRP and CRF, is slightly different: There are multiple cities, and each city has infinitely many restaurants with infinitely many tables. As with nCRP, the card on each table names another restaurant in the same city. However, there is also an infinite number of menus organised using nCRP; the tree structures for all cities and menus are the same. A customer visiting a city dines first at the root restaurant and chooses a table using nCRP, but this time the meal is ordered from the menu (at the same position in the infinite tree) according to popularity as determined by the CRF. Hence, in nCRF, the CRF helps to share the menus between restaurants, and nCRP helps to build the hierarchy.

—*Indian Buffet Process (IBP).* The IBP is obtained by marginalising out a BP:

\begin{equation} \begin{aligned} IBP(X) = \int P(X|B) \mathrm{d} B,\,\, B \sim BP, \end{aligned} \end{equation}
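The standard IBP predictive rule (customer $i$ samples each previously tasted dish $k$ with probability $m_k/i$, then tries $\text{Poisson}(\alpha/i)$ new dishes) can be simulated directly; the following illustrative sketch produces the binary feature matrix:

```python
import math
import random

def ibp(num_customers, alpha, rng):
    """Indian Buffet Process: returns a binary matrix with one row per
    customer and one column per dish (feature)."""
    dish_counts = []   # m_k: how many customers have tried dish k so far
    rows = []
    for i in range(1, num_customers + 1):
        row = []
        for k, m_k in enumerate(dish_counts):
            taste = 1 if rng.random() < m_k / i else 0
            dish_counts[k] += taste
            row.append(taste)
        # Poisson(alpha / i) brand-new dishes, sampled via Knuth's method
        L, new, p = math.exp(-alpha / i), 0, 1.0
        while True:
            p *= rng.random()
            if p <= L:
                break
            new += 1
        dish_counts.extend([1] * new)
        row.extend([1] * new)
        rows.append(row)
    # pad earlier (shorter) rows with zeros to a rectangular matrix
    width = len(dish_counts)
    return [row + [0] * (width - len(row)) for row in rows]

rng = random.Random(6)
Z = ibp(num_customers=10, alpha=2.0, rng=rng)
```

Unlike the CRP, each row of `Z` can have several ones, so a customer (datum) owns multiple features at once.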

—*Nested Indian Buffet Process (nIBP) [31].* Similar to nCRP, a nested version of the IBP builds a hierarchy where each layer is composed of a number of IBPs (one per feature in the IBPs of the layer above). Continuing the metaphor, there is an infinite number of Indian buffet restaurants in a city; each has an infinite number of dishes, with a card next to each dish giving the address of another restaurant in the same city. Hence, the restaurants in this city are organised into an infinite tree structure. A customer visits the root restaurant in this city and chooses dishes using the IBP strategy but uses a layer-dependent probability $\text{Poisson}(\frac{\alpha /\ell }{i})$ to choose any new dishes. The next day, the customer visits all the restaurants on the cards next to her chosen dishes. Note that a customer can go to multiple restaurants under nIBP, rather than only a single one as in nCRP.

After building an appropriate Bayesian nonparametric model for a specific task, the next step is to infer the latent variables defined in the model (more accurately, the posterior joint distribution of the latent variables), given an amount of observed data. There are two main methodologies for obtaining the posterior distribution: sampling-based (i.e., Markov chain Monte Carlo) and optimisation-based (i.e., variational inference). The most straightforward candidate for Bayesian posterior inference is Markov chain Monte Carlo (MCMC) (e.g., Gibbs sampling), which has also been widely adopted for Bayesian nonparametric models. The basic idea of MCMC [3] is to approximate a (posterior) distribution from its samples. To obtain the samples, a Markov chain is constructed whose stationary distribution is the desired (posterior) distribution. Figure 6(a) illustrates this idea, where the bins count the samples in the corresponding support areas. Clearly, more samples means a more accurate approximation. Although MCMC can, in theory, derive the exact posterior distribution of the latent variables, it is inefficient. As illustrated in Figure 6(b), an alternative for BNL is variational inference [10]. This methodology uses a set of (often simpler, independent, and parameterised) variational distributions to approximate the real posterior distribution, transforming a posterior inference problem into a high-dimensional optimisation problem. With the help of gradients, this method can efficiently explore the parameter space of the variational distributions to approximate the desired posterior distribution as closely as possible.
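To make the MCMC idea concrete, here is a minimal illustrative sketch (not from the survey): a random-walk Metropolis sampler for the bias of a coin under a uniform prior, whose post-burn-in samples approximate the $\text{Beta}(1+\text{heads}, 1+\text{tails})$ posterior:

```python
import math
import random

def log_posterior(theta, heads, tails):
    """Unnormalised log posterior of a coin bias under a uniform prior."""
    if not 0.0 < theta < 1.0:
        return float("-inf")
    return heads * math.log(theta) + tails * math.log(1.0 - theta)

def metropolis(heads, tails, num_samples, step=0.1, rng=None):
    """Random-walk Metropolis: propose θ' = θ + ε and accept with
    probability min(1, p(θ')/p(θ)); the chain's stationary distribution
    is the posterior."""
    rng = rng or random.Random(7)
    theta, samples = 0.5, []
    for _ in range(num_samples):
        proposal = theta + rng.gauss(0.0, step)
        log_ratio = (log_posterior(proposal, heads, tails)
                     - log_posterior(theta, heads, tails))
        if math.log(rng.random()) < log_ratio:
            theta = proposal
        samples.append(theta)
    return samples

samples = metropolis(heads=30, tails=10, num_samples=20_000)
posterior_mean = sum(samples[5_000:]) / len(samples[5_000:])
# the exact posterior is Beta(31, 11), whose mean is 31/42 ≈ 0.738
```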
In general, MCMC approximates the posterior more accurately than variational inference once a sufficient number of samples have been obtained, because the variational distributions introduce additional approximation errors that do not exist in MCMC. In the case of big data, however, variational inference is more efficient than MCMC, because obtaining sufficient samples in MCMC is very time-consuming and hard to evaluate, whereas gradient-based variational inference can explore the parameter space efficiently. The existing works on inference for the basic processes in the BNL literature are summarised below and in Table 2.

| Stochastic Process | Slice | VI | SMC | EP | Scalable |
|---|---|---|---|---|---|
| Dirichlet Process | $\surd$ [97] | $\surd$ [9] | $\surd$ [20] | $\surd$ [113] | |
| Beta Process | $\surd$ [14] | $\surd$ [19] | | | |
| Gamma Process | | $\surd$ [145] | | | |
| Hierarchical Dirichlet Process | | $\surd$ [176] | | | $\surd$ [155] |
| Indian Buffet Process | | $\surd$ [44] | $\surd$ [182] | $\surd$ [42] | |

Slice is short for slice sampling; VI for variational inference; SMC for sequential Monte Carlo; EP for expectation propagation.

While the (countably) infinite nature of BNL enables great modelling power, it also poses a challenge for model inference: an infinite number of factors and weights makes the posterior inference of the latent variables much harder. One commonly accepted solution in BNL is the truncation method [62, 181], which sets the component number so large that the given data would only adopt a subset of the components; however, truncation introduces an approximation error. Another successful technique for this problem is *data/variable augmentation* [160, 172], also known as *slice sampling* [39, 126]. The existing Bayesian nonparametric models that use slice sampling for model inference are summarised below.

—*Slice Sampling for DP.* A realisation from a DP can be represented as $G=\sum _{k=1}^{\infty } \pi _k \delta _{\theta _k}$, where the $\pi _k$ are a countably infinite number of stick weights. Slice sampling for DP [97] introduces an auxiliary variable $u_i \sim \text{Unif}(0, \pi _{k_i})$, which functions as an adaptive truncation for datum $i$: only the components with $\pi _k \gt u_i$ need to be sampled for datum $i$.
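The effect of the auxiliary variable can be seen in a few lines (illustrative sketch of our own; the stick weights come from a truncated stick-breaking construction, and the datum is assumed to be assigned to component 0):

```python
import random

def stick_breaking(alpha, num_sticks, rng):
    """Truncated stick-breaking weights of a DP(alpha)."""
    weights, remaining = [], 1.0
    for _ in range(num_sticks):
        b = rng.betavariate(1.0, alpha)
        weights.append(b * remaining)
        remaining *= 1.0 - b
    return weights

rng = random.Random(8)
pi = stick_breaking(alpha=1.0, num_sticks=200, rng=rng)

# slice step for one datum i currently assigned to component k_i = 0:
k_i = 0
u_i = rng.uniform(0.0, pi[k_i])
active = [k for k, w in enumerate(pi) if w > u_i]  # finite set to sample over
```

Because the stick weights decay towards zero, the "active" set above is finite even though the underlying measure has (countably) infinitely many components.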

—*Slice Sampling for BP.* Instead of the fixed range used in DP, a positive decay function $f$ with $\lim _{k \rightarrow \infty } f(\pi _k) = 0$ controls the support of the auxiliary variable $u_k \sim \text{Unif}(0, f(\pi _k))$ [14]. Compared to the fixed range used in DP, the decay function is more flexible and applicable to many other models.

Using variational inference involves two important components: setting the variational distributions and choosing the optimisation method. Carefully setting the variational distributions can significantly reduce the additional approximation error, and appropriate optimisation methods can boost the inference efficiency and adapt to complex scenarios, e.g., large-scale and streaming data. Existing work on variational inference for BNL is summarised below.

—*Ordinary Variational Inference.* Variational inference for BNL is based on a stick-breaking representation whose latent variables include stick weights, atom parameters, and data assignment indexes. Since stochastic processes such as DP [9, 106] and GaP [145] can produce an infinite number of atoms, a truncation method is adopted to ensure that only a finite number of atoms need to be approximated by a finite number of mean-field (latent variable) distributions. Similar methods have been applied to HDP [9], IBP [44], and nCRP [174].
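Under the mean-field assumption, with independent $\text{Beta}(\gamma_{1,k}, \gamma_{2,k})$ variational distributions on the sticks and truncation level $T$, the expected mixture weights have a simple closed form; a sketch (our own illustrative code, with a vague initialisation):

```python
def expected_stick_weights(gamma1, gamma2):
    """Expected mixture weights under independent mean-field
    Beta(γ1_k, γ2_k) stick distributions: because the sticks are
    independent, E[π_k] = E[b_k] · ∏_{j<k} (1 − E[b_j])."""
    weights, remaining = [], 1.0
    for g1, g2 in zip(gamma1, gamma2):
        e_b = g1 / (g1 + g2)        # mean of a Beta(g1, g2) stick
        weights.append(e_b * remaining)
        remaining *= 1.0 - e_b
    return weights

# a truncation level of T = 20 components, initialised at the prior
T, alpha = 20, 1.0
gamma1 = [1.0] * T
gamma2 = [alpha] * T
pi_hat = expected_stick_weights(gamma1, gamma2)
# the T weights sum to slightly under one; the gap is the truncated tail
```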

—*Collapsed Variational Inference.* As stated, there are normally three groups of latent variables in BNL (i.e., stick weights, atom parameters, and data assignment indexes). Sometimes the stick weights can be marginalised out to make the inference of DP [105] and HDP [147, 168] more accurate and efficient.

—*Online/Stochastic Variational Inference.* Coordinate-ascent optimisation is inefficient for large datasets because a full pass over all the data is required at each iteration, so stochastic optimisation is used to update the variational parameters instead [15, 176]. At each iteration, a mini-batch of data is sampled from the entire dataset before updating the variational parameters. This improves the inference efficiency because the estimated noisy gradients of the variational objective have been proven to be natural gradients of the Kullback-Leibler divergence objective [90].

—*Truncation-free Variational Inference.* One problem with the above variational inference methods is that truncation is needed, which brings additional approximation errors. These can be resolved with a locally collapsed method, in which the global latent variables are marginalised and the distribution of the local variables is sampled [175]; this method can be viewed as a combination of sampling and variational inference. Another idea for avoiding truncation is variational DP [106], where a tying assumption is made: the variational distributions of components beyond $T$ are set to their priors, $T$ is adaptively increased as the optimisation objective decreases, and a KD-tree-based data organisation is introduced to accelerate inference.

—*Streaming Variational Inference.* Streaming variational inference algorithms were proposed to handle data that arrive sequentially in the form of a data stream. These methods do not require the data to be preserved in memory: data can be discarded after processing. The idea is to decompose the joint posterior into a recursive form where the distribution of the incoming datum is derived from the posterior distribution of the formerly observed data. Different techniques can then be used for approximation, such as a designed variational stochastic process with independent factors for DP [111], or more general extensions based on assumed density filtering [159] or component identification [16].

Considering the exponentially increasing amount of data in many areas, the ability to handle big data is also a research direction for BNL. The idea is to extend current inference algorithms, particularly those designed for a single processor/machine, into parallel versions for multiple processors/machines. Existing inference algorithms for this problem mainly fall into the following two groups: MCMC and variational.

—*Parallel MCMC.* The authors of Reference [155] proposed a parallel MCMC for HDP using an asynchronous method. The advantages of this approach are that it is easy to incorporate new data and processors, and it is extremely fault-tolerant; the disadvantage is the additional approximation. To overcome this disadvantage, further parallel MCMC methods for DP, HDP, and PYP were proposed [48, 113, 179] based on the inverse-superposition of DPs, where each processor or machine handles one supercluster (i.e., one of the DPs). Generating the superclusters based on the data further improves efficiency [23]. Although the above methods can use marginalised representations of DP or HDP to avoid truncation, a slice sampling-based parallel MCMC [68] has also been developed to explicitly sample the stick weights in DP and HDP in an elegant way.

—*Parallel Variational Inference.* In this form of inference, the posterior distribution is decomposed for streaming and parallel inference, and a component identification algorithm resolves the mismatch problem when merging variational posteriors from different processing nodes [16]. Instead of obtaining an exact variational approximation of the real posterior by combining subposteriors (from different processing nodes), a Markov chain can be designed to collect samples from the variational approximation [127], which again can be viewed as a combination of variational and sampling methods.

Some other inference methods that have been applied to BNL are as follows:

—*Sequential Monte Carlo.* Also known as particle filtering, this approach approximates the posterior distribution through a large collection of samples (i.e., particles) that are propagated over time and updated by sequential importance sampling. The sequential Monte Carlo method has been applied to DP mixtures with time-varying components [20], beta-binomial DP mixtures [116], general conjugate DP mixtures [56], and nonparametric Bayesian matrix factorisation [182].

—*Power Expectation Propagation [121].* This method generalises expectation propagation and variational inference using a flexible $\alpha$-divergence. An example application is nonparametric Bayesian matrix factorisation [42].

BNL is an efficient mechanism that has been used to resolve many tasks in machine learning. Indeed, many famous models in machine learning already have a Bayesian nonparametric counterpart, as summarised in Table 3. This section presents a selection of studies on the application of BNL arranged according to task. Each category explores how BNL has been used as a solution to the task and the additional benefits BNL has brought. Note that these state-of-the-art works not only show the wide applicability of BNL but also showcase how the BNL procedure has been used in practice. Some popular stochastic processes in BNL that suit different machine-learning tasks are summarised in Table 4.

| Classical machine-learning models | Nonparametric extensions |
|---|---|
| Gaussian Mixture Model (GMM) | [137] |
| Latent Dirichlet Allocation (LDA) | [167] [4] |
| Hidden Markov Model (HMM) | [167] [64] [65] |
| Linear Dynamic Systems (LDS) | [21, 61] |
| Support Vector Machine (SVM) | [202] |
| Nonnegative Matrix/Tensor Factorisation (NMF/NTF) | [42, 182] [108, 189, 192] |
| Mixed Membership Stochastic Block models (MMSB) | [54, 122] |
| Partially-Observable Reinforcement Learning (PORL) | [46] |
| Conditional Random Field (CRF) | [95, 135] |
| Markov Random Field (MRF) | [25] |
| Author Topic Model (ATM) | [190] |
| Principal Component Analysis (PCA) | [51] |
| Bayesian Inverse Reinforcement Learning (BIRL) | [32, 33] |

| Machine-learning tasks | Processes | Real-world tasks | Processes |
|---|---|---|---|
| Supervised learning | DP, HDP, GP, BP, CRP, IBP | Text mining | DP, HDP, IBP, PYP, GaP |
| Factor analysis | IBP, BP, GaP | Natural language processing | CRP, DP, HDP |
| Transfer learning | HDP, HBP, PYP, DP, nCRP | Computer vision | BP, IBP, DP |
| Tree structure learning | nCRP, HDP, DP, PP, GaP | Biology | DP, HDP, PYP |
| Relational learning | DP, IBP, PP | Music analysis | HDP, GP |
| Reinforcement learning | HDP, PYP, GP, IBP, BP | Robots | DP, HDP, GP |
| Causal inference | GP, DP | | |
| Metric learning | BP | | |

Sometimes data have labels (also known as responses), such as emotion tags, the GDP of countries, network relations, and so on. Supervised learning models the relationships between data and these responses/covariates to make predictions—a basic task in machine learning. Supervised BNL tends to fall into two categories.

—*Generalized linear model (GLM)-based*, where the data are assigned to a number of clusters, and the responses (for classification) of the data are modelled by multinomial logistic models (also called “softmax”) [82, 153]. A DP is used as a prior for the cluster weights $\lbrace \pi _k\rbrace _{k=1}^{\infty }$, and the parameters $\lbrace \theta _k\rbrace _{k=1}^{\infty }$ are used in the multinomial logit models. Although the data and the responses in each cluster have a linear relationship, multiple clusters make it possible to capture non-linear relationships. This idea has been further extended to grouped data using HDP [36, 193].

—*Covariate space-based.* The relationship between the covariate space and the data is captured by dependent stochastic processes, by (1) setting the base measure as a special stochastic process, such as a single-variable stochastic process [114] or a multi-variable Gaussian process [69]; (2) varying the stick-breaking procedure, such as linking stick weights through a stochastic process [115], a kernel function [50], or a permuted stick-breaking order [34, 115]; (3) introducing an auxiliary PP in the covariate space, such as the kernel beta process [140] and correlated normalised random measures [76]; and (4) revising the seating mechanisms in CRP or IBP, such as ddCRP [7] and ddIBP [71]. More discussion of this aspect can be found in Reference [60].

Factor analysis describes or captures the variability of observed, correlated data with the help of a collection of unobserved variables called factors, which is useful for denoising or storage [171]. The mathematical definition is $Y=\Phi X+E$, where $Y$ is the data, $\Phi$ is the factor matrix, $X$ is the factor loading matrix, and $E$ is the error. Traditionally, the dimensionality of $\Phi$ needs to be given in advance, but BNL relaxes this requirement. Since IBP defines a distribution over infinite (in columns) binary matrices, it has become a significant cog in this field [77]. Below, we summarise the studies in two important branches of factor analysis: principal component analysis and non-negative matrix/tensor factorisation.

—*Principal component analysis (PCA)*. PCA is a very popular tool for dimension reduction, but selecting the number of significant components requires strong background knowledge that is normally unavailable. Unlike general factor analysis, PCA aims to project the data onto a space spanned by orthonormal vectors. BNP-PCA [51] is defined as

\begin{equation} \begin{aligned} Y = P(X \odot Z) + E \end{aligned} \end{equation}
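The generative side of this equation can be sketched numerically (illustrative; the dimensions, the QR-based construction of an orthonormal $P$, and the i.i.d. Bernoulli mask are our own stand-ins, since BNP-PCA infers $Z$ under an IBP prior and constrains $P$ to be orthonormal):

```python
import numpy as np

rng = np.random.default_rng(9)

D, K, N = 6, 3, 50   # observed dimension, latent components, data points

# P: orthonormal directions, here obtained from the QR decomposition
# of a random Gaussian matrix
P, _ = np.linalg.qr(rng.normal(size=(D, K)))

Z = rng.random(size=(K, N)) < 0.5     # binary mask (IBP-distributed in [51])
X = rng.normal(size=(K, N))           # component scores
E = 0.01 * rng.normal(size=(D, N))    # isotropic observation noise

Y = P @ (X * Z) + E                   # the BNP-PCA generative equation
```

The binary mask `Z` switches individual components on or off per datum, which is how the model lets the data decide the effective number of components.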

—*Non-negative matrix/tensor factorisation (NMF/NTF)*. As with PCA, the nonparametric versions of NMF/NTF are also mainly based on the IBP or BP. The basic idea is to factorise the data matrix or tensor into two matrices: a binary (mask) matrix and a factor matrix, with an infinite prior, such as IBP [42, 182] or BP [108], placed on the binary matrix,

\begin{equation} \begin{aligned} Y = X Z, \end{aligned} \end{equation}

\begin{equation} \begin{aligned} Y = X \Gamma M, \end{aligned} \end{equation}

Transfer learning [131] is a learning scheme that extracts transferable knowledge from a source domain (or domains) and reuses this knowledge in a target domain to resolve the problem of insufficient data in traditional machine-learning schemes. The core of transfer learning is to extract transferable knowledge that generalises from the source to the target domain. BNL is good at generalisation because of its ability to build a flexible prior and has hence been applied to transfer learning. The existing works in this area are summarised below according to the different forms of transferable knowledge.

—*Transfer by sharing factors.* Factors are real-valued vectors and can also be viewed as high-level semantic descriptions of data; this allows them to be transferred from one domain to another. In Bayesian nonparametric joint factor analysis (NJFA) [79], domains are jointly factorised with shared factors and domain-specified factors. In contrast to simply sharing factors between two domains through a matrix product and summation, factor sharing in NJFA relies on an ingenious HBP prior. HBP is used to control the assignment of factors between domains, and its advantages are (1) factors are shared by domains through hierarchically dependent beta processes and (2) the number of both shared and domain-specified factors does not need to be prefixed but rather automatically learned from the data.

—*Transfer by sharing topics.* Different from factors, topics are a set of latent variables characterised by unit summation and are used as transferable statistical strengths between domains. BNL makes these transferable statistical strengths more flexible by giving them Bayesian nonparametric priors. In Clustered Naive Bayes [142], a number of Naive Bayes models share topics (named parameters in Reference [142]) from a DP; tasks within the same group share the same parameters so that well-trained tasks can help to train insufficiently-trained tasks. As we discussed in Section 4.2, HDP defines a way to share statistical strengths, which has also been applied to transfer learning [18], where categories share the topics/clusters. HPYP can also benefit transfer learning, analogously to HDP. For example, a set of basic topics is first generated for sharing, and each domain uses these topics as the base measure of a transformed PYP, which transforms them into another set of topics using a domain-specific transformation matrix [27]. Unlike HDP, the activated topics are controlled by an additional IBP; the advantage over HDP is that the sharing of topics and their weights can be decoupled, which makes it possible to share low-weight topics between domains [53].

—*Transfer by sharing trees.* Knowledge is inherently multi-granular, and the more general the knowledge in the source domain, the larger the probability it will be reusable in a (related) target domain, compared to specific knowledge. Hence, sharing trees has emerged as an efficient method for transfer learning. In terms of tree construction, BNL can contribute a rather flexible tree prior. One example is transfer Hierarchical LDA (thLDA) [98], which transfers the knowledge in an existing tree to a target domain: a path of the tree from the root to a bottom node is sampled for each document in the target domain through nCRP. A second example is Reference [146], where a tree structure with a fixed three-layer depth and unbounded width is built through nCRP. In this model, domains are treated as nodes in the bottom layer (named “level 1”), the parent nodes of the domains are named *supercategories* in the middle layer (named “level 2”), and the root node in the top layer (named “level 3”) is a set of two variables. Each domain is characterised by a Gaussian distribution, and domains are expected to share similar parameters with their parent nodes (i.e., supercategories or root nodes). A third example treats the parameters in the tree as transferable knowledge to assist the classification of classes with less labelled data [156], where a tree structure with leaf nodes as labels/domains is built with an nCRP prior. Another example is the polylingual tree-based topic model [91], which uses a method similar to nCRP: a path through the tree is assigned with probability equal to the product of the weights of the branches in the path, where the tree is separately built from the source domain.
Without resorting to nCRP, the doubly Hierarchical Pitman-Yor Language Model [183] builds a latent HPYLM as the transferable knowledge; simultaneously, two separate HPYLMs are built for the two domains, with the latent HPYLM as part of the base measures of the PYPs in the two HPYLMs.

Tree structures play an important role in machine learning because they are pervasively applied and reflect the human habit of organising information. Thus, learning a hierarchical structure from plain data has attracted a great deal of attention from researchers in the Bayesian nonparametric field. Compared to other approaches to this task, Bayesian nonparametric models have the advantage of a more flexible hierarchical structure: the lack of bounds on the structure's width and depth makes it much easier to incorporate new data.

—*nCRP-based.* Here, the nCRP views a tree as a nested sequence of partitions. A space is first partitioned by a CRP, and each area in this partition is further partitioned into several areas, generating a tree of potentially infinite depth and branching. A datum (e.g., a document) is associated with a path in the tree using DP or Flexible Martingale [157] priors with the nCRP [8]. The datum can also be associated with a subtree of the generated tree, using HDP priors in nHDP [129], instead of with a path.

—*Stick-breaking-based.* With this method, an iterative stick-breaking process is used to construct a Pólya Tree (PT) [133] in a nested fashion, and a datum is associated with a leaf node of the generated tree. The traditional stick-breaking process has also been revised to generate breaks within a tree structure, resulting in Tree-Structured Stick Breaking (TSSB) [72], where a datum is attached to a node of the generated tree.

—*Diffusion-based.* Both Kingman's coalescent [101, 163, 165] and the Dirichlet Diffusion Tree (DDT) [125] define a prior over infinite (binary) trees. The idea is that the data are generated by a diffusion procedure with several divergences along the way; an additional time-varying continuous stochastic process (i.e., a Markov process) is needed to control the divergences. A datum is placed at the end of a branch of the diffusion. DDT has been extended to a more general structure with multifurcating branches by the Pitman-Yor Diffusion Tree (PYDT) [103], and to a feature hierarchy by the Beta Diffusion Tree (BDT) [84].

—*Others.* Motivated by the deep belief network (DBN) [86], the Poisson gamma belief network (PGBN) [199] learns a hierarchical structure where nodes have nonnegative real-valued weights, rather than the binary-valued weights of a DBN, and the width of each layer is flexible rather than fixed. Each layer's nodes can be viewed as an abstract feature expression of the input data.

The aim of relational learning (also known as stochastic relational learning) is to analyse, model, and predict the relationships between entities. Many data contain relationships by nature, e.g., social networks, traffic networks, protein interactions, and user-item relationships. Hence, a typical and practical application of relational learning is predicting potential future connections using a trained model. In traditional Bayesian learning, the key to better predictions is learning the hidden generative procedures in a network, given a prior that encodes different understandings and explanations; BNL holds the potential to provide a much more flexible prior [149]. The existing research in this direction is summarised below.

—*Latent class models.* The idea of these models is to group nodes into a number of classes based on shared behaviours. For example, all the nodes in one class may have the same probability of being related to all the nodes in another class [188] or to all the nodes that sit outside their own class [30]. DP is usually the main machinery behind latent class modelling due to its flexibility regarding the number of latent classes. Reference [188] was the first to model relationships using BNL. By Xu et al.’s definition, the probability of observing a link $l_{a, b}$ is modelled as $p(l_{a, b} | A_a, A_b) = \theta _{a,b}$, where $A_a$ and $A_b$ are the attributes of the nodes linked by $l_{a,b}$, and $\theta _{a,b}$ is the link probability, given a DP prior to make it more flexible. Further, the attributes of the nodes are generated according to latent variables $Z_a$, i.e., $p(A_a|Z_a)$, and the link $p(l_{a, b} | Z_a, Z_b, W)$ is modelled using these latent variables rather than directly using the node attributes, where $W$ is a model (connectivity) parameter. These latent variables can be understood as the “interests” or hidden features (i.e., latent classes) of the nodes that determine the formation of the links. Using a DP prior for $Z$ means the data itself determines the optimal number of states for the variables [99, 187], which has proven effective for social network analysis [186]. These link patterns between groups are also shared across multiple networks in Reference [93] for multi-relational data. Rather than modelling the link probability between groups, the authors of Reference [188] model the probability that a node belongs to a class, and nodes with similar patterns are grouped into classes.

—*Latent feature models.* The authors of References [120, 122] model a link as $p(l_{a, b} | Z_a, Z_b, W)$, where $Z_a$ is a binary vector, as in Reference [187], and IBP is used as the prior for $Z$. This means that each node is characterised by a series of hidden features or, equivalently, belongs to a series of clusters. A DP can also be used to supplement the IBP by creating sub-clusters within each feature/cluster [130].

—*Random network models.* In contrast to the above models, which focus on a given network with a fixed number of nodes and links, random network models try to model the generation of both links and nodes so the network can grow as more nodes join. Another advantage of random network models is their ability to model sparse networks (i.e., where the number of links is $o(n^2)$ for $n$ nodes). Essentially, a network is represented as a collection of links $Z = \sum _{i,j} z_{i,j} \delta _{(\theta _i, \theta _j)}$, where the $\theta _i$ for each node come from a CRM or NCRM, e.g., PP [22] and DP [180]. Additionally, with a simple transformation, the same idea can also be applied to binary, integer, and multi-relational networks.
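To make the latent class idea concrete, here is a hypothetical toy generator (illustrative only, not the exact model of [188]): given class labels, which the full models would draw from a CRP/DP prior, each unordered class pair gets one Beta-distributed link probability that drives all links between those two classes.

```python
import random

def sample_links(classes, seed=0):
    """Sample links given latent class labels, one theta per class pair."""
    rng = random.Random(seed)
    theta, links = {}, set()
    n = len(classes)
    for a in range(n):
        for b in range(a + 1, n):
            pair = tuple(sorted((classes[a], classes[b])))
            if pair not in theta:            # lazily draw theta_{k,l} ~ Beta(1,1)
                theta[pair] = rng.betavariate(1.0, 1.0)
            if rng.random() < theta[pair]:   # link with probability theta_{k,l}
                links.add((a, b))
    return theta, links

# Seven nodes in three latent classes; same-class-pair nodes share theta.
theta, links = sample_links([0, 0, 1, 1, 2, 0, 1])
```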
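The random network construction can likewise be sketched with an ad hoc finite truncation, in the spirit of the sparse-graph model of [22] but not its exact CRM construction: each node gets a positive "sociability" weight, and a link between nodes $i$ and $j$ appears with probability $1 - \exp (-2 w_i w_j)$.

```python
import math
import random

def sample_sparse_graph(n_nodes, shape=0.2, scale=1.0, seed=0):
    """Finite truncation of a weight-based random graph (illustration only)."""
    rng = random.Random(seed)
    # Sociability weight per node; a small gamma shape makes most weights
    # tiny, which is what keeps the resulting graph sparse.
    w = [rng.gammavariate(shape, scale) for _ in range(n_nodes)]
    edges = set()
    for i in range(n_nodes):
        for j in range(i + 1, n_nodes):
            if rng.random() < 1.0 - math.exp(-2.0 * w[i] * w[j]):
                edges.add((i, j))
    return w, edges

w, edges = sample_sparse_graph(100)  # far fewer edges than the 4950 pairs
```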

Reinforcement learning (RL) is a category of sequential decision-making problems in which an agent that interacts with an uncertain environment seeks the optimal policy for maximising its long-term profit. RL is normally modelled through a Markov Decision Process (MDP), usually represented as the tuple $\langle S, A, T, R\rangle$, in which $S$ denotes the set of environment states, $A$ denotes the set of the agent's actions, $T(s^{\prime }|s,a)$ is the state transition function from state $s$ to $s^{\prime }$ after action $a$, and $R(r|a, s^{\prime })$ represents the reward function that defines the profit from action $a$. The agent needs to find the optimal policy $\pi (a|s)$ for attaining a long-term reward through interaction with the environment. While RL has many applications, it is typically used in robotics and games, such as the famous AlphaGo program [154]. In RL, BNL could provide a principled way to tackle the core exploration-exploitation problem [73] and the ability to model the underlying environment dynamics in a flexible way. The existing studies on BNL in RL are summarised below according to whether the target is a policy or a reward.
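To make the MDP tuple concrete, here is a minimal value-iteration sketch on a hypothetical two-state, two-action MDP (illustrative only; value iteration itself is a classical, non-Bayesian method):

```python
# Hypothetical 2-state, 2-action MDP: T[s][a][s'] = transition probability,
# R[s][a] = expected reward.
T = {0: {0: {0: 0.9, 1: 0.1}, 1: {0: 0.2, 1: 0.8}},
     1: {0: {0: 0.5, 1: 0.5}, 1: {0: 0.0, 1: 1.0}}}
R = {0: {0: 0.0, 1: 1.0}, 1: {0: 0.0, 1: 2.0}}
gamma = 0.9  # discount factor for long-term reward

# Value iteration: V(s) <- max_a [ R(s,a) + gamma * sum_s' T(s'|s,a) V(s') ]
V = {0: 0.0, 1: 0.0}
for _ in range(500):
    V = {s: max(R[s][a] + gamma * sum(p * V[s2] for s2, p in T[s][a].items())
                for a in (0, 1))
         for s in (0, 1)}

# Greedy policy pi(s) = argmax_a of the same quantity.
pi = {s: max((0, 1), key=lambda a: R[s][a] +
             gamma * sum(p * V[s2] for s2, p in T[s][a].items()))
      for s in (0, 1)}
```

Here both states prefer action 1, and the value of state 1 converges to $2 / (1 - 0.9) = 20$.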

—*Forward reinforcement learning.* Sometimes, the state of an environment cannot be observed or can only be partially observed, providing incomplete information (termed an observation $o$). Determining the optimal policy in such situations is known as partially-observable RL and is modelled as a Partially-Observable MDP (POMDP). Here, the hidden states or environment models need to be inferred from the historical (interaction) data; in other words, the transition, observation, and reward functions need to be learned from a history. GP [37, 40] is a natural and powerful prior for unknown functions and is particularly efficient in a continuous state space. In discrete state spaces, a Bayesian nonparametric approach based on HDP-HMM [45–47] models the hidden state transitions as stick weights from an HDP, with a base measure composed of two parts: one is a function for reward evaluation; the other generates observations. Another form of forward reinforcement learning is based on PYP [46], which extends the classical Deterministic Markov Models (DMM) [117] in the same way.

—*Inverse reinforcement learning.* In contrast to forward RL, the target of inverse reinforcement learning is the reward function, which is learned given the other elements of an MDP and the trajectories/demonstrations (i.e., state-action pairs) produced by experts, under the assumption that the experts made decisions with knowledge of the underlying environment model. Classical methods model this reward function as a linear combination of handcrafted features, which clearly limits the ability to capture complex behaviours, but this constraint is relaxed using GP [107]. Instead of improving the learning capability of a single function, a different way to resolve complex function learning is to partition the observations [118, 119] or divide the experts [32] into groups and learn a relatively simple reward function for each group based on DP. A similar idea was used to resolve switched MDPs using a sticky HDP-HMM [158] as the prior, with the additional ability to partition the data based on their temporal properties, which is beneficial for representing rich behaviours. Another IBP-based or BP-based idea [33, 136] is to learn composite features as conjunctions of the original features and then learn the reward function over the composite features rather than the original ones.

Causal inference and metric learning are other interesting applications of BNL.

—*Causal inference*, which estimates the effect of an action, such as a medical treatment or a sales policy, given some observational data. Consider the following basic scenario. Suppose there are a number of patients $X$, a treatment regime $A$, and observed medical history data $\langle X, Y, A\rangle$. $X$ takes $A$, which leads to the result $Y$, noting that each $X$ can only take $A=1$ or $A=0$ at any given time. The goal is to estimate the effect of the treatment regime, i.e., $\mathbb {E}[Y|A=1] - \mathbb {E}[Y|A=0]$ or its conditional counterpart $\mathbb {E}[Y|X, A=1] - \mathbb {E}[Y|X, A=0]$. In Reference [85], GP is used to approximate the conditional distributions $P(Y | X, A=0)$ and $P(Y | X, A=1)$ separately. Another strategy is to approximate the joint distribution $P(X, Y, A)$ and then obtain the conditional expectations through marginalisation [144], while a counterfactual GP [150] is designed for continuous-time scenarios.

—*Metric learning*, which aims to learn a similarity measure that forces similar data to be close and dissimilar data to be further apart. A Bayesian nonparametric model for this task is based on BP [5], where the data likelihood is defined as $p(X|H) = f(H)$, with $H$ being the latent features of the data, decomposed as $H=S \odot Z$. BP is used as the prior for $Z$, and an additional regulariser is introduced into the target variational inference optimisation function to control the similar and dissimilar data pairs.
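The two-regressor strategy for causal inference can be sketched with off-the-shelf GP regression on synthetic data. This is an illustration of the idea in [85], not its implementation; the data generator below is hypothetical, and the true treatment effect is a constant +1 by construction:

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor

rng = np.random.default_rng(0)
n = 200
X = rng.uniform(-2, 2, size=(n, 1))   # patient covariates
A = rng.integers(0, 2, size=n)        # observed treatment assignment
# Synthetic outcome: true effect of A is +1 on top of a nonlinear baseline.
Y = np.sin(X[:, 0]) + 1.0 * A + 0.1 * rng.standard_normal(n)

# Fit E[Y | X, A=0] and E[Y | X, A=1] with two separate GPs
# (alpha acts as the observation-noise variance).
gp0 = GaussianProcessRegressor(alpha=0.01).fit(X[A == 0], Y[A == 0])
gp1 = GaussianProcessRegressor(alpha=0.01).fit(X[A == 1], Y[A == 1])

# Estimated conditional effect E[Y|X,A=1] - E[Y|X,A=0] on a covariate grid.
X_test = np.linspace(-2, 2, 50).reshape(-1, 1)
effect = gp1.predict(X_test) - gp0.predict(X_test)
```

The estimated effect should hover near the true value of 1 across the grid.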

The first study on BNL in the machine-learning community concerned document modelling [167]. However, as BNL has developed over the years, it has attracted attention from researchers in other fields, such as computer vision and robotics. In turn, the increasing number of scenarios to which BNL is being applied is prompting further development of theories and techniques in this area. A summary of the main fields of application for BNL is given in this section and Table 4.

The goal of text mining is to understand a document corpus. Below, we have summarised the various text mining applications for BNL according to the different types of document corpora to be mined.

—*Single corpus.* This task seeks to learn the knowledge shared between different documents in a corpus. HDP [167] was designed to deal with such tasks by assigning a random measure to each document using a generated global random measure as the base measure. However, with an HDP approach, the word vocabulary needs to be fixed in advance, which means new words cannot be incorporated if subsequent documents are added to the corpus. Latent IBP Compound Dirichlet Allocation [4] relaxes this constraint through a four-parameter IBP. Further, by using dependent GaPs or a mixed GaP-NBP, additional information from the corpus can be introduced into a model: the authors of Reference [191] incorporated links between documents (e.g., citations between scientific papers) using dependent GaPs, and authorship information was incorporated using a mixed GaP-NBP [190].

—*Multiple corpora.* This task aims to learn the knowledge shared across different corpora. The first attempt at such a task was undertaken by a three-level HDP [167] that assigned each corpus a random measure using a generated global random measure as the base measure. Through this hierarchical method, the topics shared by different corpora can be inferred. Moving beyond shared topics, the differences between topics across corpora are also learned by Differential Topic Models [27] based on PYP.

—*Multiple time-varying corpora.* This task aims to learn the shared knowledge across time-varying corpora. Beyond the idea of HDP, an Evolutionary HDP [194] was proposed to deal with this task, which added another time-varying dependency between the random measures for different corpora. The base measure of a corpus is the (convex) combination of two measures: the global measure and the measure at the previous time stamp.

The main applications of BNL in the field of natural language processing are detailed below.

—*Word segmentation.* This task is to identify word boundaries in continuous speech. CRP is used to capture the word sequence generation process [75] with two options: if a novel lexical item arrives, generate a phonemic form; if not, choose an existing lexical form. This idea has subsequently been extended to HDP, which performs better than DP.

—*Phrase alignment.* This task aims to find frequent phrase pairs from bilingual texts for the benefit of phrase-based translation systems. Since bilingual texts do not come already segmented, and the number of aligned phrase pairs is unknown, DP and HDP are used as priors to identify each aligned phrase pair in a probabilistic model for this task [41].

—*Unsupervised part-of-speech (PoS) tagging.* PoS tagging marks each word in a text with its corresponding part of speech, which is the basis of text analysis. An infinite HMM is used to model the word sequence, with the number of hidden states (PoS tags) left unbounded with the help of HDP [66].
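The "reuse or create" choice in the word segmentation model above is exactly a CRP draw. A toy sketch of the lexicon-generation step, where the three-phoneme generator is a hypothetical stand-in for the model's phonemic form distribution:

```python
import random

def crp_lexical_generator(n_words, alpha=1.0, seed=0):
    """Generate a word sequence: reuse an existing lexical item with
    probability proportional to its count, or coin a new phonemic form
    with probability proportional to alpha."""
    rng = random.Random(seed)
    lexicon, counts, sequence = [], [], []
    for i in range(n_words):                  # i = number of words so far
        r = rng.random() * (i + alpha)
        if r < i:                             # reuse: pick k proportional to counts
            acc, k = 0.0, 0
            for k, c in enumerate(counts):
                acc += c
                if r < acc:
                    break
            counts[k] += 1
        else:                                 # novel item: generate a phonemic form
            form = "".join(rng.choice("ptkaiu") for _ in range(3))
            lexicon.append(form)
            counts.append(1)
            k = len(lexicon) - 1
        sequence.append(lexicon[k])
    return lexicon, sequence

lexicon, sequence = crp_lexical_generator(50)
```

Frequent items get reused more often (the rich-get-richer property), while `alpha` controls how readily new lexical items appear.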

In the field of computer vision (i.e., image processing and video processing), BNL has been successfully applied to the following problems:

—*Image interpolation.* This is one of the most common image processing tasks and mainly includes image resizing and remapping. The goal is to factor an image as a linear, but finite, combination of dictionary atoms [83], which are represented as a matrix. BP and IBP [198] are used as the prior for this matrix to avoid the need to predetermine the number of dictionary atoms, with an additional sparsity property. Zhou et al. [201] proposed a dependent HBP to incorporate a constraint on the patch positions in an image.

—*Motion capture segmentation.* This task aims to identify a finite number of hidden dynamic behaviours in multiple time series given that each time series may contain multiple dynamic behaviours. The mapping relationship between time series and dynamic behaviours is expressed as a matrix, which is given a BP prior to make it infinite [63].

—*Background subtraction.* This task aims to delineate background information in video streams. The background is modelled by a probability density, and a multi-mode probability density performs better with dynamic backgrounds. The appropriate number of modes in the distribution tends to depend on the target video stream. Hence, the authors of Reference [81] proposed a DP-based Gaussian Mixture Model that allows the number of modes to be determined by the data.
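scikit-learn ships a truncated variational DP mixture that matches this idea. A sketch on synthetic pixel intensities with two background modes (a hypothetical setup, not the model of [81]):

```python
import numpy as np
from sklearn.mixture import BayesianGaussianMixture

rng = np.random.default_rng(0)
# Synthetic intensity history of one pixel with two background modes
# (e.g., a waving tree branch alternating with sky).
pixel_history = np.concatenate([
    rng.normal(50, 3, 500),     # dark mode
    rng.normal(180, 5, 500),    # bright mode
]).reshape(-1, 1)

# Truncated variational DP mixture: the data decide how many of the
# 10 candidate modes actually receive weight.
dpgmm = BayesianGaussianMixture(
    n_components=10,            # truncation level, not the final mode count
    weight_concentration_prior_type="dirichlet_process",
    random_state=0,
).fit(pixel_history)

active_modes = int(np.sum(dpgmm.weights_ > 0.05))
# Classify a new pixel value: high likelihood under the model => background.
score_bg = dpgmm.score_samples(np.array([[52.0]]))[0]   # near the dark mode
score_fg = dpgmm.score_samples(np.array([[120.0]]))[0]  # between modes
```

On this data, only around two of the ten candidate components retain noticeable weight.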

The field of biology is fortunate to boast a massive amount of data, much of which has been derived from clinical trials and personalised medicine. BNL provides an efficient paradigm within a fully model-based probabilistic framework that is highly flexible and adaptable [49]. Therefore, BNL has the potential to be extremely useful in biology, where the knowledge needed to establish a parametric model is often lacking.

—*Brain MRI tissue classification.* MRIs are an effective and routine diagnostic tool, and, as such, the accurate automatic classification of brain MRI can assist doctors’ diagnoses. Mixture model clustering algorithms have dominated this task for some time due to their relatively good performance. However, more recently, nonparametric extensions to these algorithms, such as DP [35] and HDP [94], have performed even better.

—*Positive selection detection.* This task aims to detect positive natural selection from alignments of protein-coding DNA. Traditional methods treat the non-synonymous/synonymous rate ratio at a site as a random variable that follows an underlying distribution. DP is used as the prior for the non-synonymous/synonymous rate ratios of site clusters [92].

—*Expressed sequence tag (EST) analysis.* EST analysis plays a crucial role in gene analysis in molecular biology, such as gene identification in organisms. One aim of this task is to predict the number of new genes in a newly arriving sample, which benefits the experimental design and the redundancy measurement of an EST library [49, 109]. PYP [109] is used to estimate the gene proportions in the EST library and the number of new genes, while the probability of discovering new genes given additional samples is estimated in Reference [55].

Music analysis is another interesting application scenario for BNL. Such applications include teaching music, analysing the human perception of sounds, and designs for music searches [138, 169]. Music data is stored as acoustic waveforms. Among a number of methodologies for music analysis, Bayesian techniques have been found to be very effective, which has paved the way for further advancements using BNL. The following applications are highlighted.

—*Musical similarity computation.* This task is to estimate the timbral similarity between recorded songs, which can be further applied to music retrieval. HDP is used for this task by representing each song as a (stick weight) posterior distribution. The similarity between two songs can then be evaluated by the symmetrised KL divergence between the two corresponding distributions. Compared to the traditional single Gaussian-based or GMM-based approaches, HDP-based approaches perform better [89]. A similar idea was also adopted by the authors of Reference [124], who learned a two-dimensional tree structure to represent each song.

—*Blind source separation.* This task is to separate different types of sounds, e.g., instruments or voices, from an audio clip. One challenge for this task is that the number of sound sources is unknown; however, Blei et al. [6] proposed a GaP-based NMF to resolve this problem.
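The similarity computation itself is a one-liner over the two songs' weight distributions. A sketch with hypothetical stick-weight posteriors over four shared components:

```python
import math

def symmetrised_kl(p, q, eps=1e-12):
    """D(p||q) + D(q||p) for two discrete distributions on the same support."""
    kl = lambda a, b: sum(x * math.log((x + eps) / (y + eps))
                          for x, y in zip(a, b))
    return kl(p, q) + kl(q, p)

# Hypothetical per-song posteriors; smaller divergence = more similar timbre.
song_a = [0.70, 0.20, 0.05, 0.05]
song_b = [0.65, 0.25, 0.05, 0.05]
song_c = [0.05, 0.05, 0.20, 0.70]

d_similar = symmetrised_kl(song_a, song_b)   # small
d_different = symmetrised_kl(song_a, song_c) # large
```

Symmetrisation matters here because plain KL divergence is asymmetric, whereas a similarity score for retrieval should not depend on the order of the two songs.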

Using BNL to teach robots how to complete specific tasks is arguably one of the most pioneering applications of current machine learning. Some examples are given below:

—*Robot teaching.* This task aims to train robots, such as autonomous quadrotor flights or self-driving cars, to learn actions from hand-held demonstrations [24, 119]. Experts demonstrate good or perfect actions for a specific task, and the robots learn from these demonstrations through reinforcement learning (also known as imitation learning).

—*Robot object identification.* This task aims to enhance a robot's ability to identify (often complex) objects, such as clothes [104] and tactile surfaces [38].

—*Robot navigation.* This task requires a robot to build a map, locate itself, and find a route to a target according to its sensor data, which has been modelled as a regression task using GP in References [95, 132].

—*Robot introspection.* This task helps a robot understand what it is doing by identifying the actions and sub-tasks it executes [184], enabling it to react to unstructured environments.

BNL is becoming a hot topic in machine learning due to its unique characteristics. BNL offers a strong theoretical foundation and the ability to generate powerful models in a highly flexible setting. Starting from the basic motivations and definitions, this article has reviewed the latest state-of-the-art research in this field following the standard BNL procedure, which comprises two steps: model construction and inference. Model construction can be likened to playing with Lego, where basic stochastic processes are the bricks and the model is built by manipulating those bricks, while model inference is a parameter adjustment procedure according to the observed data. Recent advances in both steps have been reviewed, including the popular stochastic processes and their manipulations, as well as the sampling-based and optimisation-based inference algorithms. The major applications of BNL in machine learning, e.g., relational learning and transfer learning, and in real-world tasks, e.g., biology and computer vision, have also been summarised. From this survey, we find that BNL is still in its development stage, with a large gap between the current theories and techniques and the demand for solutions that address more complicated real-world tasks. However, rather than seeing this gap as a challenge to current developments, we see it as a rich source of possibilities deserving of further study. A selection is highlighted below.

—*Truncation-free variational inference.* Most variational inference algorithms for BNL require truncation, which not only introduces errors into the posterior approximation but also discards the asymptotic properties of the original Bayesian nonparametric model. Some scholars have attempted to resolve this issue [106, 175], but a general truncation-free variational inference method for Bayesian nonparametric models has yet to be developed. Integrating the state-of-the-art models into a unified, general truncation-free method would be especially significant for practical platforms.

—*Deep Bayesian nonparametric model.* Inspired by deep learning, stacking stochastic processes in more layers would pave the way for deep Bayesian nonparametric models. The challenges such an innovation would bring, in both constructing such a model and to the model's inference, represent great opportunities for further study and progress in machine learning for real-world scenarios.

—*High-dimensional data modelling.* High dimensionality is a common challenge in data mining and machine learning. BNL normally assumes the features of this type of data are exchangeable, which is inappropriate for some tasks. Hence, it would be interesting to study ways of building reasonable models for high-dimensional data without sacrificing too many correlations.

—*New stochastic processes and their manipulations.* BNL's powerful modelling ability depends on an abundance of stochastic processes and on how those processes are manipulated. As the complexity of data increases, new and ingenious stochastic processes and manipulations will be in greater demand. One motivating example is the Hawkes process [52], which introduces a dependency between the index (e.g., time) and the process rate, two quantities that are independent in a PP. This dependency may be valuable for some machine-learning tasks, e.g., modelling the relationships between data and their labels.
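For intuition on this dependency, the self-exciting intensity of a univariate Hawkes process can be simulated by Ogata's thinning algorithm. This sketch uses the standard exponential-kernel form, which is one common choice rather than a construction from [52]:

```python
import math
import random

def simulate_hawkes(mu, alpha, beta, horizon, seed=0):
    """Ogata thinning for lambda(t) = mu + alpha * sum_{t_i < t} exp(-beta (t - t_i)).

    Unlike a Poisson process, each event raises the intensity, so events
    cluster: the index (time) and the rate are no longer independent.
    """
    rng = random.Random(seed)
    events, t = [], 0.0
    while t < horizon:
        # The intensity decays between events, so its current value is an
        # upper bound until the next event arrives.
        lam_bar = mu + alpha * sum(math.exp(-beta * (t - ti)) for ti in events)
        t += rng.expovariate(lam_bar)          # candidate inter-arrival time
        if t >= horizon:
            break
        lam_t = mu + alpha * sum(math.exp(-beta * (t - ti)) for ti in events)
        if rng.random() <= lam_t / lam_bar:    # accept with prob lambda(t)/bound
            events.append(t)
    return events

events = simulate_hawkes(mu=0.5, alpha=0.8, beta=1.2, horizon=100.0)
```

With `alpha / beta < 1` the process is stable; plotting the inter-event times shows the characteristic bursts that a homogeneous PP cannot produce.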

- Amr Ahmed, Linagjie Hong, and Alexander J. Smola. 2013. Nested Chinese restaurant franchise processes: Applications to user tracking and document modeling. In
*Proceedings of the 30th International Conference on International Conference on Machine Learning (ICML’13)*. JMLR.org, III–1426–III–1434. - David J. Aldous. 1985.
*Exchangeability and Related Topics*. Springer. - Christophe Andrieu, Nando de Freitas, Arnaud Doucet, and Michael I. Jordan. 2003. An introduction to MCMC for machine learning.
*Mach. Learn.*50, 1 (2003), 5–43. - Cédric Archambeau, Balaji Lakshminarayanan, and Guillaume Bouchard. 2015. Latent IBP compound Dirichlet allocation.
*IEEE Trans. Pattern Anal. Mach. Intell.*37, 2 (2015), 321–333. - Behnam Babagholami M., Seyed M. Roostaiyan, Ali Zarghami, and Mahdieh S. Baghshah. 2014. Multi-modal distance metric learning: A Bayesian non-parametric approach. In
*Proceedings of the 13th European Conference on Computer Vision Workshops (ECCV’14)*. 63–77. - David M. Blei, Perry R. Cook, and Matthew Hoffman. 2010. Bayesian nonparametric matrix factorization for recorded music. In
*Proceedings of the 27th International Conference on Machine Learning (ICML’10)*. 439–446. - David M. Blei and Peter I. Frazier. 2011. Distance dependent Chinese restaurant processes.
*J. Mach. Learn. Res.*12, 8 (2011), 2461–2488. - David M. Blei, Thomas L. Griffiths, and Michael I. Jordan. 2010. The nested Chinese restaurant process and Bayesian nonparametric inference of topic hierarchies.
*J. ACM*57, 2 (2010), 7. - David M. Blei and Michael I. Jordan. 2006. Variational inference for Dirichlet process mixtures.
*Bayes. Anal.*1, 1 (2006), 121–143. - David M. Blei, Alp Kucukelbir, and Jon D. McAuliffe. 2017. Variational inference: A review for statisticians.
*J. Am. Stat. Assoc.*112, 518 (2017), 859–877. - David M. Blei, Andrew Y. Ng, and Michael I. Jordan. 2003. Latent Dirichlet allocation.
*J. Mach. Learn. Res.*3, 1 (2003), 993–1022. - Phil Blunsom and Trevor Cohn. 2011. A hierarchical Pitman-Yor process HMM for unsupervised part of speech induction. In
*Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies-Volume 1 (ACL’11)*. 865–874. - Tamara Broderick, Michael I. Jordan, and Jim Pitman. 2012. Beta processes, stick-breaking and power laws.
*Bayes. Anal.*7, 2 (2012), 439–476. - Tamara Broderick, Lester Mackey, John Paisley, and Michael I. Jordan. 2015. Combinatorial clustering and the beta negative binomial process.
*IEEE Trans. Pattern Anal. Mach. Intell.*37, 2 (2015), 290–306. - Michael Bryant and Erik B. Sudderth. 2012. Truly nonparametric online variational inference for hierarchical Dirichlet processes. In
*Proceedings of the 26th Annual Conference on Neural Information Processing Systems (NIPS’12)*. 2699–2707. - Trevor Campbell, Julian Straub, John W. Fisher III, and Jonathan P. How. 2015. Streaming, distributed variational inference for Bayesian nonparametrics. In
*Proceedings of the 29th Annual Conference on Neural Information Processing Systems (NIPS’15)*. 280–288. - Kevin R. Canini and Thomas L. Griffiths. 2011. A nonparametric Bayesian model of multi-level category learning. In
*Proceedings of the 25th AAAI Conference on Artificial Intelligence (AAAI’11)*. 307–312. - Kevin R. Canini, Mikhail M. Shashkov, and Thomas L. Griffiths. 2010. Modeling transfer learning in human categorization with the hierarchical Dirichlet process. In
*Proceedings of the 27th International Conference on Machine Learning (ICML’10)*. 151–158. - Lawrence Carin, David M. Blei, and John W. Paisley. 2011. Variational inference for stick-breaking beta process priors. In
*Proceedings of the 28th International Conference on Machine Learning (ICML’11)*. 889–896. - Francois Caron, Manuel Davy, and Arnaud Doucet. 2007. Generalized polya urn for time-varying Dirichlet process mixtures. In
*Proceedings of the 23rd Conference on Uncertainty in Artificial Intelligence (UAI’07)*. 33–40. - FranÇois Caron, Manuel Davy, Arnaud Doucet, Emmanuel Duflos, and Philippe Vanheeghe. 2008. Bayesian inference for linear dynamic models with Dirichlet process mixtures.
*IEEE Trans. Sign. Process.*56, 1 (2008), 71–84. - FranÇois Caron and Emily B. Fox. 2017. Sparse graphs using exchangeable random measures.
*J. Roy. Stat. Soc. B*79, 5 (2017), 1295–1366. - Jason Chang and John W. Fisher III. 2013. Parallel sampling of DP mixture models using sub-cluster splits. In
*Proceedings of the 27th Annual Conference on Neural Information Processing Systems (NIPS’13)*. 620–628. - Sotirios P. Chatzis, Dimitrios Korkinof, and Yiannis Demiris. 2012. A nonparametric Bayesian approach toward robot learning by demonstration.
*Robot. Auton. Syst.*60, 6 (2012), 789–802. - Sotirios P. Chatzis and Gabriel Tsechpenakis. 2010. The infinite hidden markov random field model.
*IEEE Trans. Neur. Netw.*21, 6 (2010), 1004–1014. - Bo Chen, Gungor Polatkan, Guillermo Sapiro, Lawrence Carin, and David B. Dunson. 2011. The hierarchical beta process for convolutional factor analysis and deep learning. In
*Proceedings of the 28th International Conference on Machine Learning (ICML’11)*. 361–368. - Changyou Chen, Wray Buntine, Nan Ding, Lexing Xie, and Lan Du. 2015. Differential topic models.
*IEEE Trans. Pattern Anal. Mach. Intell.*37, 2 (2015), 230–242. - Changyou Chen, Nan Ding, and Wray L. Buntine. 2012. Dependent hierarchical normalized random measures for dynamic topic modeling. In
*Proceedings of the 29th International Conference on Machine Learning (ICML’12)*. - Changyou Chen, Vinayak Rao, Wray Buntine, and Yee W. Teh. 2013. Dependent normalized random measures. In
*Proceedings of the 30th International Conference on Machine Learning (ICML’13)*. 969–977. - Yi Chen, X L Wang, Xin Xiang, Buzhou Tang, and Junzhao Bu. 2015. Network structure exploration via Bayesian nonparametric models.
*J. Stat. Mech.: Theory Exp.*2015, 10 (2015), P10004. - Jen-Tzung Chien. 2018. Bayesian nonparametric learning for hierarchical and sparse topics.
*IEEE/ACM Trans. Aud. Speech. Lang. Process.*26, 2 (2018), 422–435. - Jaedeug Choi and Kee-Eung Kim. 2012. Nonparametric Bayesian inverse reinforcement learning for multiple reward functions. In
*Proceedings of the 26th Annual Conference on Neural Information Processing Systems (NIPS’12)*. 305–313. - Jaedeug Choi and Kee-Eung Kim. 2013. Bayesian nonparametric feature construction for inverse reinforcement learning. In
*Proceedings of the 23rd International Joint Conference on Artificial Intelligence (IJCAI’13)*. 1287–1293. - Yeonseung Chung and David B. Dunson. 2009. The local Dirichlet process.
*Ann. Inst. Stat. Math.*63, 1 (2009), 59–80. - Adelino R. Ferreira da Silva. 2007. A Dirichlet process mixture model for brain MRI tissue classification.
*Med. Image Anal.*11, 2 (2007), 169–182. - Andrew M. Dai and Amos J. Storkey. 2015. The supervised hierarchical Dirichlet process.
*IEEE Trans. Pattern Anal. Mach. Intell.*37, 2 (2015), 243–255. - Patrick Dallaire, Camille Besse, Stephane Ross, and Brahim Chaib-draa. 2009. Bayesian reinforcement learning in continuous POMDPs with Gaussian processes. In
*Proceedings of the IEEE International Conference on Intelligent Robots and Systems (IROS’09)*. 2604–2609. - Patrick Dallaire, Philippe Giguère, Daniel Émond, and Brahim Chaib-draa. 2014. Autonomous tactile perception: A combined improved sensing and Bayesian nonparametric approach.
*Robot. Autonom. Syst.*62, 4 (2014), 422–435. - P. Damlen, John Wakefield, and Stephen Walker. 1999. Gibbs sampling for Bayesian non-conjugate and hierarchical models by using auxiliary variables.
*J. Roy. Stat. Soc. Ser. B*61, 2 (1999), 331–344. - Marc P. Deisenroth, Carl E. Rasmussen, and Jan Peters. 2008. Model-based reinforcement learning with continuous states and actions. In
*Proceedings of the European Symposium on Artificial Neural Networks (ESANN’08)*. 19–24. - John DeNero, Alexandre Bouchard-Côté, and Dan Klein. 2008. Sampling alignment structure under a Bayesian translation model. In
*Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP’08)*. 314–323. - Nan Ding, Rongjing Xiang, Ian Molloy, and Ninghui Li. 2010. Nonparametric Bayesian matrix factorization by Power-EP. In
*Proceedings of the 13th International Conference on Artificial Intelligence and Statistics (AISTATS’10)*. 169–176. - Kjell Doksum. 1974. Tailfree and neutral random probabilities and their posterior distributions.
*Ann. Probab.*2, 2 (1974), 183–201. - Finale Doshi, Kurt Miller, Jurgen V. Gael, and Yee W. Teh. 2009. Variational inference for the Indian buffet process. In
*Proceedings of the 12th International Conference on Artificial Intelligence and Statistics (AISTATS’09)*. 137–144. - Finale Doshi-velez. 2009. The infinite partially observable Markov decision process. In
*Proceedings of the 23rd Annual Conference on Neural Information Processing Systems (NIPS’09)*. 477–485. - Finale Doshi-Velez, David Pfau, Frank Wood, and Nicholas Roy. 2015. Bayesian nonparametric methods for partially-observable reinforcement learning.
*IEEE Trans. Pattern Anal. Mach. Intell.*37, 2 (2015), 394–407. - Finale Doshi-Velez, David Wingate, Nicholas Roy, and Joshua B. Tenenbaum. 2010. Nonparametric Bayesian policy priors for reinforcement learning. In
*Proceedings of the 24th Annual Conference on Neural Information Processing Systems (NIPS’10)*. 532–540. - Kumar Dubey, Sinead Williamson, and Eric P. Xing. 2014. Parallel Markov chain Monte Carlo for Pitman-Yor mixture models. In
*Proceedings of the 30th Conference on Uncertainty in Artificial Intelligence (UAI’14)*. 142–151. - David B. Dunson. 2010. Nonparametric Bayes applications to biostatistics. In
*Bayesian Nonparametrics*. Cambridge University Press. 223–273. - David B. Dunson and Ju-Hyun Park. 2008. Kernel stick-breaking processes.
*Biometrika*95, 2 (2008), 307–323. - Clément Elvira, Pierre Chainais, and Nicolas Dobigeon. 2017. Bayesian nonparametric principal component analysis.
*arXiv preprint arXiv:1709.05667*(2017). - Paul Embrechts, Thomas Liniger, and Lu Lin. 2011. Multivariate Hawkes processes: An application to financial data.
*J. Appl. Probab.*48A (2011), 367–378. - Ali Faisal, Jussi Gillberg, Gayle Leen, and Jaakko Peltonen. 2013. Transfer learning using a nonparametric sparse topic model.
*Neurocomputing*112 (2013), 124–137. - Xuhui Fan, Longbing Cao, and Richard D.Y. Xu. 2015. Dynamic infinite mixed-membership stochastic blockmodel.
*IEEE Trans. Neur. Netw. Learn. Syst.*26, 9 (2015), 2072–2085. - Stefano Favaro, Antonio Lijoi, and Igor Prünster. 2012. A new estimator of the discovery probability.
*Biometrics*68, 4 (2012), 1188–1196. - Paul Fearnhead. 2004. Particle filters for mixture models with an unknown number of components.
*Stat. Comput.*14, 1 (2004), 11–21. - Thomas S. Ferguson. 1973. A Bayesian analysis of some nonparametric problems.
*Ann. Stat.*1, 2 (1973), 209–230. - Thomas S. Ferguson, Eswar G. Phadia, and Ram C. Tiwari. 1992. Bayesian nonparametric inference.
*Lect. Not Monogr. Ser.*17 (1992), 127–150. - Nicholas J. Foti, Joseph D. Futoma, Daniel N. Rockmore, and Sinead Williamson. 2013. A unifying representation for a class of dependent random measures. In
*Proceedings of the 16th International Conference on Artificial Intelligence and Statistics (AISTATS’13)*. 20–28. - Nicholas J. Foti and Sinead A. Williamson. 2015. A survey of non-exchangeable priors for bayesian nonparametric models.
*IEEE Trans. Pattern Anal. Mach. Intell.*37, 2 (2015), 359–371. - Emily Fox, Erik B. Sudderth, Michael I. Jordan, and Alan S. Willsky. 2011. Bayesian nonparametric inference of switching dynamic linear models.
*IEEE Trans. Sign. Process.*59, 4 (2011), 1569–1585. - Emily B. Fox. 2009.
*Bayesian Nonparametric Learning of Complex Dynamical Phenomena*. Ph.D. Dissertation. Massachusetts Institute of Technology. - Emily B. Fox, Michael C. Hughes, Erik B. Sudderth, and Michael I. Jordan. 2014. Joint modeling of multiple time series via the beta process with application to motion capture segmentation.
*Ann. Appl. Stat.*8, 3 (2014), 1281–1313. - Emily B. Fox, Erik B. Sudderth, Michael I. Jordan, and Alan S. Willsky. 2011. A sticky HDP-HMM with application to speaker diarization.
*Ann. Appl. Stat.*5, 2A (2011), 1020–1056. - Jurgen V. Gael, Yee W. Teh, and Zoubin Ghahramani. 2008. The infinite factorial hidden Markov model. In
*Proceedings of the 22nd Annual Conference on Neural Information Processing Systems (NIPS’08)*. 1697–1704. - Jurgen V. Gael, Andreas Vlachos, and Zoubin Ghahramani. 2009. The infinite HMM for unsupervised PoS tagging. In
*Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP’09)*. 678–687. - Zekai Gao, Yangqiu Song, Shixia Liu, Haixun Wang, Hao Wei, Yang Chen, and Weiwei Cui. 2011. Tracking and connecting topics via incremental hierarchical Dirichlet processes. In
*Proceedings of the 11th IEEE International Conference on Data Mining (ICDM’11)*. 1056–1061. - Hong Ge, Yutian Chen, Moquan Wan, and Zoubin Ghahramani. 2015. Distributed inference for Dirichlet process mixture models. In
*Proceedings of the 32nd International Conference on Machine Learning (ICML’15)*. 2276–2284. - Alan E. Gelfand, Athanasios Kottas, and Steven N. MacEachern. 2005. Bayesian nonparametric spatial modeling with Dirichlet process mixing.
*J. Am. Stat. Assoc.*100, 471 (2005), 1021–1035. - Samuel J. Gershman and David M. Blei. 2012. A tutorial on Bayesian nonparametric models.
*J. Math. Psychol.*56, 1 (2012), 1–12. - Samuel J. Gershman, Peter I. Frazier, and David M. Blei. 2015. Distance dependent infinite latent feature models.
*IEEE Trans. Pattern Anal. Mach. Intell.*37, 2 (2015), 334–345. - Ryan P. Adams, Zoubin Ghahramani, and Michael I. Jordan. 2010. Tree-structured stick breaking for hierarchical data. In
*Proceedings of the 24th Annual Conference on Neural Information Processing Systems (NIPS’10)*. 19–27. - Mohammad Ghavamzadeh, Shie Mannor, Joelle Pineau, Aviv Tamar, et al. 2015. Bayesian reinforcement learning: A survey.
*Found. Trends Mach. Learn.*8, 5-6 (2015), 359–483. - Jayanta K. Ghosh and R.V. Ramamoorthi. 2002.
*Bayesian Nonparametrics*. Springer. - Sharon Goldwater, Thomas L. Griffiths, and Mark Johnson. 2009. A Bayesian framework for word segmentation: Exploring the effects of context.
*Cognition*112, 1 (2009), 21–54. - Jim E. Griffin, Michalis Kolossiatis, and Mark F.J. Steel. 2013. Comparing distributions by using dependent normalized random-measure mixtures.
*J. Roy. Stat. Soc. Ser. B*75, 3 (2013), 499–529. - Thomas L. Griffiths and Zoubin Ghahramani. 2005. Infinite latent feature models and the Indian buffet process. In
*Proceedings of the 19th Annual Conference on Neural Information Processing Systems (NIPS’05)*. 475–482. - Thomas L. Griffiths and Zoubin Ghahramani. 2011. The Indian buffet process: An introduction and review.
*J. Mach. Learn. Res.*12, 4 (2011), 1185–1224. - Sunil K. Gupta, Dinh Phung, and Svetha Venkatesh. 2012. A Bayesian nonparametric joint factor model for learning shared and individual subspaces from multiple data sources. In
*Proceedings of the 12th International Conference on Data Mining (SDM’12)*. 200–211. - Sunil K. Gupta, Dinh Q. Phung, and Svetha Venkatesh. 2012. A slice sampler for restricted hierarchical beta process with applications to shared subspace learning. In
*Proceedings of the 28th Conference on Uncertainty in Artificial Intelligence (UAI’12)*. 316–325. - Tom S. F. Haines and Tao Xiang. 2014. Background subtraction with Dirichlet process mixture models.
*IEEE Trans. Pattern Anal. Mach. Intell.*36, 4 (2014), 670–683. - Lauren A. Hannah, David M. Blei, and Warren B. Powell. 2011. Dirichlet process mixtures of generalized linear models.
*J. Mach. Learn. Res.*12, 6 (2011), 1923–1953. - Li He, Hairong Qi, and Russell Zaretzki. 2013. Beta process joint dictionary learning for coupled feature spaces with application to single image super-resolution. In
*Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR’13)*. 345–352. - Creighton Heaukulani, David A. Knowles, and Zoubin Ghahramani. 2014. Beta diffusion trees. In
*Proceedings of the 31st International Conference on Machine Learning (ICML’14)*. 1809–1817. - Jennifer L. Hill. 2011. Bayesian nonparametric modeling for causal inference.
*J. Comput. Graph. Stat.*20, 1 (2011), 217–240. - Geoffrey E. Hinton, Simon Osindero, and Yee W. Teh. 2006. A fast learning algorithm for deep belief nets.
*Neur. Comput.*18, 7 (2006), 1527–1554. - Nils L. Hjort. 1990. Nonparametric Bayes estimators based on beta processes in models for life history data.
*Ann. Stat.*18, 3 (1990), 1259–1294. - Nils L. Hjort, Chris Holmes, Peter Müller, and Stephen G. Walker. 2010.
*Bayesian Nonparametrics*. Vol. 28. Cambridge University Press. - Matthew D. Hoffman, David M. Blei, and Perry R. Cook. 2008. Content-based musical similarity computation using the hierarchical Dirichlet process. In
*Proceedings of the 9th International Conference on Music Information Retrieval (ISMIR’08)*. 349–354. - Matthew D. Hoffman, David M. Blei, Chong Wang, and John Paisley. 2013. Stochastic variational inference.
*J. Mach. Learn. Res.*14, 1 (2013), 1303–1347. - Yuening Hu, Ke Zhai, Vladimir Eidelman, and Jordan L. Boyd-Graber. 2014. Polylingual tree-based topic models for translation domain adaptation. In
*Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (ACL’14)*. 1166–1176. - John P. Huelsenbeck, Sonia Jain, Simon W. D. Frost, and Sergei L. Kosakovsky Pond. 2006. A Dirichlet process model for detecting positive selection in protein-coding DNA sequences.
*Proc. Natl. Acad. Sci. U.S.A.*103, 16 (2006), 6263–6268. - Tomoharu Iwata, James R. Lloyd, and Zoubin Ghahramani. 2016. Unsupervised many-to-many object matching for relational data.
*IEEE Trans. Pattern Anal. Mach. Intell.*38, 3 (2016), 607–617. - Saad Jbabdi, Mark Woolrich, and Timothy E.J. Behrens. 2009. Multiple-subjects connectivity-based parcellation using hierarchical Dirichlet process mixture models.
*NeuroImage*44, 2 (2009), 373–384. - Yun Jiang and Ashutosh Saxena. 2013. Infinite latent conditional random fields for modeling environments through humans. In
*Proceedings of Robotics: Science and Systems IX*. Berlin, Germany. - Michael I. Jordan. 2010. Bayesian nonparametric learning: Expressive priors for intelligent systems.
*Heuristics, Probability and Causality: A Tribute to Judea Pearl*11 (2010), 167–185. - Maria Kalli, Jim E. Griffin, and Stephen G. Walker. 2011. Slice sampling mixture models.
*Stat. Comput.*21, 1 (2011), 93–105. - Jeon-Hyung Kang, Jun Ma, and Yan Liu. 2012. Transfer topic modeling with ease and scalability. In
*Proceedings of the 12th SIAM International Conference on Data Mining (SDM’12)*. 564–575. - Charles Kemp, Joshua B. Tenenbaum, Thomas L. Griffiths, Takeshi Yamada, and Naonori Ueda. 2006. Learning systems of concepts with an infinite relational model. In
*Proceedings of the 21st National Conference on Artificial Intelligence—Volume 1 (AAAI’06)*. 381–388. - John F. C. Kingman. 1967. Completely random measures.
*Pac. J. Math.*21, 1 (1967), 59–78. - John F. C. Kingman. 1982. The coalescent.
*Stochast. Process. Appl.*13, 3 (1982), 235–248. - John F. C. Kingman. 1992.
*Poisson Processes*. Vol. 3. Oxford University Press. - David A. Knowles and Zoubin Ghahramani. 2015. Pitman Yor diffusion trees for Bayesian hierarchical clustering.
*IEEE Trans. Pattern Anal. Mach. Intell.*37, 2 (2015), 271–289. - Nishanth Koganti, Tomoya Tamei, Kazushi Ikeda, and Tomohiro Shibata. 2017. Bayesian nonparametric learning of cloth models for real-time state estimation.
*IEEE Trans. Robot.*33, 4 (2017), 916–931. - Kenichi Kurihara, Max Welling, and Yee W. Teh. 2007. Collapsed variational Dirichlet process mixture models. In
*Proceedings of the 20th International Joint Conference on Artificial Intelligence (IJCAI’07)*, Vol. 7. 2796–2801. - Kenichi Kurihara, Max Welling, and Nikos A. Vlassis. 2006. Accelerated variational Dirichlet process mixtures. In
*Proceedings of the 20th Annual Conference on Neural Information Processing Systems (NIPS’06)*. 761–768. - Sergey Levine, Zoran Popovic, and Vladlen Koltun. 2011. Nonlinear inverse reinforcement learning with Gaussian processes. In
*Proceedings of the 25th Annual Conference on Neural Information Processing Systems (NIPS’11)*, J. Shawe-Taylor, R. S. Zemel, P. L. Bartlett, F. Pereira, and K. Q. Weinberger (Eds.). 19–27. - Dawen Liang, Matthew D. Hoffman, and Daniel P.W. Ellis. 2013. Beta process sparse nonnegative matrix factorization for music. In
*Proceedings of the 14th International Society for Music Information Retrieval Conference (ISMIR’13)*. 375–380. - Antonio Lijoi, Ramsés H. Mena, and Igor Prünster. 2007. A Bayesian nonparametric method for prediction in EST analysis.
*BMC Bioinf.*8, 1 (2007), 1–10. - Kar W. Lim, Wray Buntine, Changyou Chen, and Lan Du. 2016. Nonparametric Bayesian topic modelling with the hierarchical Pitman-Yor processes.
*Int. J. Approx. Reason.*78, C (2016), 172–191. - Dahua Lin. 2013. Online learning of nonparametric mixture models via sequential variational approximation. In
*Proceedings of the 27th Annual Conference on Neural Information Processing Systems (NIPS’13)*. 395–403. - Dahua Lin, Eric Grimson, and John W. Fisher III. 2010. Construction of dependent Dirichlet processes based on Poisson processes. In
*Proceedings of the 24th Annual Conference on Neural Information Processing Systems (NIPS’10)*. 1396–1404. - Dan Lovell, Jonathan Malmaud, Ryan P. Adams, and Vikash K. Mansinghka. 2013. ClusterCluster: Parallel Markov chain Monte Carlo for Dirichlet process mixtures.
*arXiv preprint arXiv:1304.2302*(2013). - Steven N. MacEachern. 1999. Dependent nonparametric processes. In
*Proceedings of the Section on Bayesian Statistical Science*. American Statistical Association, 50–55. - Steven N. MacEachern. 2000.
*Dependent Dirichlet Processes*. Technical Report. Department of Statistics, The Ohio State University. - Steven N. MacEachern, Merlise Clyde, and Jun S. Liu. 1999. Sequential importance sampling for nonparametric Bayes models: The next generation.
*Can. J. Stat.*27, 2 (1999), 251–267. - M. Mahmud. 2010. Constructing states for reinforcement learning. In
*Proceedings of the 27th International Conference on Machine Learning (ICML’10)*. 727–734. - Bernard Michini and Jonathan P. How. 2012. Bayesian nonparametric inverse reinforcement learning. In
*Proceedings of the Joint European Conference on Machine Learning and Knowledge Discovery in Databases (ECML’12)*. Bristol, UK, 148–163. - Bernard Michini, Thomas J. Walsh, Ali-Akbar Agha-Mohammadi, and Jonathan P. How. 2015. Bayesian nonparametric reward learning from demonstration.
*IEEE Trans. Robot.*31, 2 (2015), 369–386. - Kurt Miller, Michael I. Jordan, and Thomas L. Griffiths. 2009. Nonparametric latent feature models for link prediction. In
*Proceedings of the 23rd Annual Conference on Neural Information Processing Systems (NIPS’09)*. 1276–1284. - Thomas Minka. 2004.
*Power EP*. Technical Report. Microsoft Research, Cambridge. - Morten Mørup, Mikkel N. Schmidt, and Lars K. Hansen. 2011. Infinite multiple membership relational modeling for complex networks. In
*Proceedings of the 21st IEEE International Workshop on Machine Learning for Signal Processing (MLSP’11)*. 1–6. - Peter Müller, Fernando A. Quintana, Alejandro Jara, and Tim Hanson. 2015.
*Bayesian Nonparametric Data Analysis*. Springer. - Masahiro Nakano, Yasunori Ohishi, Hirokazu Kameoka, Ryo Mukai, and Kunio Kashino. 2012. Bayesian nonparametric music parser. In
*Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP’12)*. 461–464. - Radford M. Neal. 2003. Density modeling and clustering using Dirichlet diffusion trees.
*Bayes. Stat.*7 (2003), 619–629. - Radford M. Neal. 2003. Slice sampling.
*Ann. Stat.*31, 3 (2003), 705–767. - Willie Neiswanger, Chong Wang, and Eric Xing. 2014. Embarrassingly parallel variational inference in nonconjugate models. In
*Workshop on Advanced Variational Inference, Proceedings of the 28th Annual Conference on Neural Information Processing Systems (NIPSW’14)*. 1–18. - Peter Orbanz and Yee Whye Teh. 2010. Bayesian nonparametric models. In
*Encyclopedia of Machine Learning*. Springer US, Boston, MA, 81–89. - John Paisley, Chong Wang, David M. Blei, and Michael I. Jordan. 2015. Nested hierarchical Dirichlet processes.
*IEEE Trans. Pattern Anal. Mach. Intell.*37, 2 (2015), 256–270. - Konstantina Palla, David A. Knowles, and Zoubin Ghahramani. 2015. Relational learning and network modelling using infinite latent attribute models.
*IEEE Trans. Pattern Anal. Mach. Intell.*37, 2 (2015), 462–474. - Sinno J. Pan and Qiang Yang. 2010. A survey on transfer learning.
*IEEE Trans. Knowl. Data Eng.*22, 10 (2010), 1345–1359. - Christian Plagemann, Kristian Kersting, Patrick Pfaff, and Wolfram Burgard. 2007. Gaussian beam processes: A nonparametric Bayesian measurement model for range finders. In
*Proceedings of the Robotics: Science and Systems (RSS'07), Atlanta, Georgia*. - R. Daniel Mauldin, William D. Sudderth, and S. C. Williams. 1992. Polya trees and random distributions.
*Ann. Stat.*20, 3 (1992), 1203–1221. - Lawrence R. Rabiner. 1989. A tutorial on hidden Markov models and selected applications in speech recognition.
*Proc. IEEE*77, 2 (1989), 257–286. - Natraj Raman and S.J. Maybank. 2016. Non-parametric hidden conditional random fields for action classification. In
*Proceedings of the International Joint Conference on Neural Networks (IJCNN’16)*. 3256–3263. - Pravesh Ranchod, Benjamin Rosman, and George Konidaris. 2015. Nonparametric Bayesian reward segmentation for skill discovery using inverse reinforcement learning. In
*Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS’15)*. 471–477. - Carl E. Rasmussen. 1999. The infinite Gaussian mixture model. In
*Proceedings of the 13th Annual Conference on Neural Information Processing Systems (NIPS’99)*. 554–560. - Lu Ren, David Dunson, Scott Lindroth, and Lawrence Carin. 2010. Dynamic nonparametric Bayesian models for analysis of music.
*J. Am. Stat. Assoc.*105, 490 (2010), 458–472. - Lu Ren, David B. Dunson, and Lawrence Carin. 2008. The dynamic hierarchical Dirichlet process. In
*Proceedings of the 25th International Conference on Machine Learning (ICML’08)*. 824–831. - Lu Ren, Yingjian Wang, Lawrence Carin, and David B. Dunson. 2011. The kernel beta process. In
*Proceedings of the 25th Annual Conference on Neural Information Processing Systems (NIPS’11)*. 963–971. - Abel Rodriguez, David B. Dunson, and Alan E. Gelfand. 2008. The nested Dirichlet process.
*J. Am. Stat. Assoc.*103, 483 (2008), 1131–1154. - Daniel M. Roy and Leslie Pack Kaelbling. 2007. Efficient Bayesian task-level transfer learning. In
*Proceedings of the 20th International Joint Conference on Artificial Intelligence (IJCAI’07)*, Vol. 7. 2599–2604. - Daniel M. Roy and Yee W. Teh. 2008. The mondrian process. In
*Proceedings of the 22nd Annual Conference on Neural Information Processing Systems (NIPS’08)*. 1377–1384. - Jason Roy, Kirsten J. Lum, Bret Zeldow, Jordan Dworkin, Vincent Lo Re III, and Michael J. Daniels. 2017. Bayesian nonparametric generative models for causal inference with missing at random covariates.
*Biometrics*. DOI: 10.1111/biom.12875 - Anirban Roychowdhury and Brian Kulis. 2015. Gamma processes, stick-breaking, and variational inference. In
*Proceedings of the 18th International Conference on Artificial Intelligence and Statistics (AISTATS’15)*. 800–808. - Ruslan Salakhutdinov, Joshua B. Tenenbaum, and Antonio Torralba. 2011. One-shot learning with a hierarchical nonparametric Bayesian model. In
*Workshop on Unsupervised and Transfer Learning—Proceedings of the 28th International Conference on Machine Learning (ICMLW’11)*. 195–206. - Issei Sato, Kenichi Kurihara, and Hiroshi Nakagawa. 2012. Practical collapsed variational Bayes inference for hierarchical Dirichlet process. In
*Proceedings of the 18th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD’12)*. 105–113. - Ken-iti Sato. 1999.
*Lévy Processes and Infinitely Divisible Distributions*. Cambridge University Press. - Mikkel N. Schmidt and Morten Mørup. 2013. Nonparametric Bayesian modeling of complex networks: An introduction.
*IEEE Sign. Process. Mag.*30, 3 (2013), 110–128. - Peter Schulam and Suchi Saria. 2017. Reliable decision support using counterfactual models. In
*Proceedings of the 31st Annual Conference on Neural Information Processing Systems (NIPS’17)*. 1697–1708. - Matthias Seeger. 2004. Gaussian processes for machine learning.
*Int. J. Neur. Syst.*14, 02 (2004), 69–106. - Jayaram Sethuraman. 1994. A constructive definition of Dirichlet priors.
*Stat. Sinica*4, 2 (1994), 639–650. - Babak Shahbaba and Radford Neal. 2009. Nonlinear models using Dirichlet process mixtures.
*J. Mach. Learn. Res.*10, 8 (2009), 1829–1850. - David Silver, Julian Schrittwieser, Karen Simonyan, Ioannis Antonoglou, Aja Huang, Arthur Guez, Thomas Hubert, Lucas Baker, Matthew Lai, Adrian Bolton,
et al. 2017. Mastering the game of Go without human knowledge. *Nature*550, 7676 (2017), 354. - Padhraic Smyth, Max Welling, and Arthur U. Asuncion. 2009. Asynchronous distributed learning of topic models. In
*Proceedings of the 23rd Annual Conference on Neural Information Processing Systems (NIPS’09)*. 81–88. - Nitish Srivastava and Ruslan R. Salakhutdinov. 2013. Discriminative transfer learning with tree-based priors. In
*Proceedings of the 27th Annual Conference on Neural Information Processing Systems (NIPS’13)*. 2094–2102. - Jacob Steinhardt and Zoubin Ghahramani. 2012. Flexible martingale priors for deep hierarchies. In
*Proceedings of the 15th International Conference on Artificial Intelligence and Statistics (AISTATS’12)*. 1108–1116. - Amit Surana and Kunal Srivastava. 2014. Bayesian nonparametric inverse reinforcement learning for switched Markov decision processes. In
*Proceedings of the 13th IEEE International Conference on Machine Learning and Applications (ICMLA’14)*. 47–54. - Alex Tank, Nicholas Foti, and Emily Fox. 2015. Streaming variational inference for Bayesian nonparametric mixture models. In
*Proceedings of the 18th International Conference on Artificial Intelligence and Statistics (AISTATS’15)*. 968–976. - Martin A. Tanner and Wing H. Wong. 2010. From EM to data augmentation: The emergence of MCMC Bayesian computation in the 1980s.
*Stat. Sci.*25, 4 (2010), 506–516. - Yee W. Teh. 2006. A hierarchical Bayesian language model based on Pitman-Yor processes. In
*Proceedings of the 21st International Conference on Computational Linguistics and the 44th Annual Meeting of the Association for Computational Linguistics (ACL’06)*. 985–992. - Yee W. Teh. 2010. Dirichlet process. In
*Encyclopedia of Machine Learning*. Springer US, Boston, MA, 280–287. - Yee W. Teh, Charles Blundell, and Lloyd Elliott. 2011. Modelling genetic variations using fragmentation-coagulation processes. In
*Proceedings of the 25th Annual Conference on Neural Information Processing Systems (NIPS’11)*. 819–827. - Yee W. Teh, Dilan Görür, and Zoubin Ghahramani. 2007. Stick-breaking construction for the Indian buffet process. In
*Proceedings of the 11th International Conference on Artificial Intelligence and Statistics (AISTATS’07)*. 556–563. - Yee W. Teh, Hal Daume III, and Daniel M. Roy. 2007. Bayesian agglomerative clustering with coalescents. In
*Proceedings of the 21st Annual Conference on Neural Information Processing Systems (NIPS’07)*. 1473–1480. - Yee W. Teh, Michael I. Jordan, Matthew J. Beal, and David M. Blei. 2005. Sharing clusters among related groups: Hierarchical Dirichlet processes. In
*Proceedings of the 19th Annual Conference on Neural Information Processing Systems (NIPS’05)*. 1385–1392. - Yee W. Teh, Michael I. Jordan, Matthew J. Beal, and David M. Blei. 2006. Hierarchical Dirichlet processes.
*J. Am. Stat. Assoc.*101, 476 (2006), 1566–1581. - Yee W. Teh, Kenichi Kurihara, and Max Welling. 2007. Collapsed variational inference for HDP. In
*Proceedings of the 21st Annual Conference on Neural Information Processing Systems (NIPS’07)*. 1481–1488. - David Temperley. 2007.
*Music and Probability*. MIT Press. - Romain Thibaux and Michael I. Jordan. 2007. Hierarchical beta processes and the Indian buffet process. In
*Proceedings of the 11th International Conference on Artificial Intelligence and Statistics (AISTATS’07)*. 564–571. - Bruce Thompson. 2004.
*Exploratory and Confirmatory Factor Analysis: Understanding Concepts and Applications*. American Psychological Association. - David A. Van Dyk and Xiao-Li Meng. 2001. The art of data augmentation.
*J. Comput. Graph. Stat.*10, 1 (2001), 1–50. - Sara Wade, Silvia Mongelluzzo, and Sonia Petrone. 2011. An enriched conjugate prior for Bayesian nonparametric inference.
*Bayes. Anal.*6, 3 (2011), 359–385. - Chong Wang and David M. Blei. 2009. Variational inference for the nested Chinese restaurant process. In
*Proceedings of the 23rd Annual Conference on Neural Information Processing Systems (NIPS’09)*. 1990–1998. - Chong Wang and David M. Blei. 2012. Truncation-free online variational inference for Bayesian nonparametric models. In
*Proceedings of the 26th Annual Conference on Neural Information Processing Systems (NIPS’12)*. 413–421. - Chong Wang, John W. Paisley, and David M. Blei. 2011. Online variational inference for the hierarchical Dirichlet process. In
*Proceedings of the 14th International Conference on Artificial Intelligence and Statistics (AISTATS’11)*. 752–760. - Yingjian Wang and Lawrence Carin. 2012. Lévy measure decompositions for the beta and gamma processes. In
*Proceedings of the 29th International Conference on Machine Learning (ICML’12)*. 499–506. - Christopher K. I. Williams and Carl Edward Rasmussen. 2006.
*Gaussian Processes for Machine Learning*. Vol. 2. MIT Press. - Sinead Williamson, Avinava Dubey, and Eric Xing. 2013. Parallel Markov chain Monte Carlo for nonparametric mixture models. In
*Proceedings of the 30th International Conference on Machine Learning (ICML’13)*. 98–106. - Sinead A. Williamson. 2016. Nonparametric network models for link prediction.
*J. Mach. Learn. Res.*17, 1 (2016), 7102–7121. - Alan S. Willsky, Erik B. Sudderth, Michael I. Jordan, and Emily B. Fox. 2009. Nonparametric Bayesian learning of switching linear dynamical systems. In
*Proceedings of the 23rd Annual Conference on Neural Information Processing Systems (NIPS’09)*. 457–464. - Frank Wood and Thomas L. Griffiths. 2006. Particle filtering for nonparametric Bayesian matrix factorization. In
*Proceedings of the 20th Annual Conference on Neural Information Processing Systems (NIPS’06)*. 1513–1520. - Frank Wood and Yee W. Teh. 2009. A hierarchical nonparametric Bayesian approach to statistical language model domain adaptation. In
*Proceedings of the 12th International Conference on Artificial Intelligence and Statistics (AISTATS’09)*. 607–614. - Hongmin Wu, Hongbin Lin, Yisheng Guan, Kensuke Harada, and Juan Rojas. 2017. Robot introspection with Bayesian nonparametric vector autoregressive hidden Markov models. In
*Proceedings of the 17th IEEE-RAS International Conference on Humanoid Robotics (Humanoids'17)*. 882–888. - Tianbing Xu, Zhongfei Zhang, Philip S. Yu, and Bo Long. 2008. Evolutionary clustering by hierarchical Dirichlet process with hidden Markov state. In
*Proceedings of the 8th IEEE International Conference on Data Mining (ICDM’08)*. 658–667. - Zhao Xu, Volker Tresp, Achim Rettinger, and Kristian Kersting. 2010. Social network mining with nonparametric relational models. In
*Proceedings of the 2nd International Workshop on Advances in Social Network Mining and Analysis (SNAKDD’08)*. 77–96. - Zhao Xu, Volker Tresp, Kai Yu, and Hans-Peter Kriegel. 2006. Infinite hidden relational models. In
*Proceedings of the 22nd Conference on Uncertainty in Artificial Intelligence (UAI’06)*. 544–551. - Zhao Xu, Volker Tresp, Kai Yu, Shipeng Yu, and Hans-Peter Kriegel. 2005. Dirichlet enhanced relational learning. In
*Proceedings of the 22nd International Conference on Machine Learning (ICML’05)*. 1004–1011. - Zenglin Xu, Feng Yan, and Yuan Qi. 2015. Bayesian nonparametric models for multiway data analysis.
*IEEE Trans. Pattern Anal. Mach. Intell.*37, 2 (2015), 475–487. - Junyu Xuan, Jie Lu, Guangquan Zhang, Richard Y.D. Xu, and Xiangfeng Luo. 2015. Infinite author topic model based on mixed gamma-negative binomial process. In
*Proceedings of the 15th IEEE International Conference on Data Mining (ICDM’15)*. Atlantic City, New Jersey, USA, 489–498. - Junyu Xuan, Jie Lu, Guangquan Zhang, Richard Y.D. Xu, and Xiangfeng Luo. 2017. Bayesian nonparametric relational topic model through dependent Gamma processes.
*IEEE Trans. Knowl. Data Eng.*29, 7 (2017), 1357–1369. - Junyu Xuan, Jie Lu, Guangquan Zhang, Richard Y.D. Xu, and Xiangfeng Luo. 2018. Doubly nonparametric sparse nonnegative matrix factorization based on dependent Indian buffet processes.
*IEEE Trans. Neur. Netw. Learn. Syst.*29, 5 (2018), 1835–1849. - Cheng Zhang, Carl Henrik Ek, Xavi Gratal, Florian T. Pokorny, and Hedvig Kjellstrom. 2013. Supervised hierarchical Dirichlet processes with variational inference. In
*Proceedings of the IEEE International Conference on Computer Vision Workshops (ICCVW’13)*. 254–261. - Jianwen Zhang, Yangqiu Song, Changshui Zhang, and Shixia Liu. 2010. Evolutionary hierarchical Dirichlet processes for multiple correlated time-varying corpora. In
*Proceedings of the 16th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD’10)*. 1079–1088. - Jiangchuan Zheng, Siyuan Liu, and Lionel M. Ni. 2014. Effective mobile context pattern discovery via adapted hierarchical Dirichlet processes. In
*Proceedings of the 15th IEEE International Conference on Mobile Data Management (MDM’14)*, Vol. 1, 146–155. - Mingyuan Zhou and Lawrence Carin. 2012. Augment-and-conquer negative binomial processes. In
*Proceedings of the 26th Annual Conference on Neural Information Processing Systems (NIPS’12)*. 2546–2554. - Mingyuan Zhou and Lawrence Carin. 2015. Negative binomial process count and mixture modeling.
*IEEE Trans. Pattern Anal. Mach. Intell.*37, 2 (2015), 307–320. - Mingyuan Zhou, Haojun Chen, Lu Ren, Guillermo Sapiro, Lawrence Carin, and John W. Paisley. 2009. Non-parametric Bayesian dictionary learning for sparse image representations. In
*Proceedings of the 23rd Annual Conference on Neural Information Processing Systems (NIPS’09)*. 2295–2303. - Mingyuan Zhou, Yulai Cong, and Bo Chen. 2016. Augmentable gamma belief networks.
*J. Mach. Learn. Res.*17, 163 (2016), 1–44. - Mingyuan Zhou, Lauren Hannah, David B. Dunson, and Lawrence Carin. 2012. Beta-negative binomial process and Poisson factor analysis. In
*Proceedings of the 15th International Conference on Artificial Intelligence and Statistics (AISTATS’12)*. 1462–1471. - Mingyuan Zhou, Hongxia Yang, Guillermo Sapiro, David B. Dunson, and Lawrence Carin. 2011. Dependent hierarchical beta process for image interpolation and denoising. In
*Proceedings of the 22nd International Joint Conference on Artificial Intelligence (IJCAI’11)*. 883–891. - Jun Zhu, Ning Chen, and Eric P. Xing. 2011. Infinite latent SVM for classification and multi-task learning. In
*Proceedings of the 25th Annual Conference on Neural Information Processing Systems (NIPS’11)*. 1620–1628.

This work is supported by the Australian Research Council (ARC) under Discovery Grant DP170101632.

Authors’ addresses: J. Xuan, J. Lu, and G. Zhang, University of Technology Sydney, Centre for Artificial Intelligence, P. O. Box 123 Broadway, Sydney, 2007 Australia; emails: Junyu.Xuan@uts.edu.au, Jie.Lu@uts.edu.au, Guangquan.Zhang@uts.edu.au.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

©2019 Copyright held by the owner/author(s). Publication rights licensed to ACM.

0360-0300/2019/01-ART13 $15.00

DOI: https://doi.org/10.1145/3291044

Publication History: Received May 2018; revised October 2018; accepted October 2018