Clustering Directions Based on the Estimation of a Mixture of Von Mises-Fisher Distributions

Objective: In this paper we propose a dynamic clusters type algorithm based on the estimation of the parameters of a mixture of von MisesFisher distributions for clustering directions, and we compare this algorithm with the Estimation-Maximization algorithm. We also define the between-groups and within-groups variability measures to compare the solutions obtained with the algorithms through these measures.


INTRODUCTION
Clustering data in the unit sphere is an important task in modern data analysis, for example, in clustering text documents when analysing textual data.
One approach to address such issue is the spherical k-means clustering.This technique was proposed by Dhillon and Modha (2001) [1] and implemented in a R package, called skmeans by Hornik et al. (2012) [2], and it is based on the cosine similarity to obtain a partition of term weight representation of the documents.
Other works that have appeared in the literature for clustering directional data are based on model-based clustering methods.For instance, Peel et al. (2001) [3] used the Kent distribution [4] to form groups of fracture data through a model-based clustering and Dortet-Bernadet and Wicker (2008) [5] supposed a model-based clustering of data that lies on a unit sphere and applied this clustering method to gene expression profiles.Banerjee et al. (2005) [6] applied a model-based clustering of directional data to text analysis.These authors considered the estimation of a mixture of von Mises-Fisher distributions using two variants of the Estimation-Maximization EM algorithm, denoted by soft-movMF and hard-movMF algorithms.Another variant of the EM algorithm, denoted by stochastic EM was given by Celeux and Govaert (1992) [7].Banerjee et al. (2005) [6] showed that the spherical k-means algorithm may be obtained as a variant of the EM algorithm for the maximum likelihood estimation of the mean direction parameters of a mixture of von Mises-Fisher distributions with common concentration parameter κ, using hard-max classification E-step.Figueiredo and Gomes (2015) [8] proposed an algorithm based on the dynamic clusters algorithm proposed by Diday and Schroeder (1976) [9] for the estimation of the parameters of a mixture of Watson distributions defined on the hypersphere and compared it with the EM algorithm, proposed by Dempster et al. (1977) [10] for problems of incomplete data.Similarly in this paper, to estimate the parameters of a mixture of k von Mises-Fisher distributions and obtain a partition of the sample into clusters, we propose a dynamic clusters type algorithm and we compare it with the EM algorithm.This proposed algorithm has the advantage of converging quickly to a local optimum, while the EM algorithm may converge slowly to the local optimum.On the other hand, the EM algorithm provides strongly consistent estimators with asymptotic normal distribution (Redner and Walker, 1984) [11].For comparing the solutions obtained in both algorithms, we define between-groups and within-groups variability measures.Then, for several generated samples and a real data set we compare the solutions obtained with these algorithms.
In Section 2, we recall the von Mises-Fisher distribution and the maximum likelihood estimators of the parameters of this distribution.In Section 3, we describe the EM algorithm and we propose the dynamic clusters type algorithm for the estimation of a mixture of k von Mises-Fisher distributions.In Section 4, we define the variability measures and we compare the algorithms through these measures, using simulated data from von Mises-Fisher populations and a real data set.In Section 5 we present some concluding remarks.

VON MISES-FISHER DISTRIBUTION
The von Mises-Fisher distribution is one of the most used distributions in the statistical analysis of directional data.It is usually denoted by M p (µ, κ) and has probability density function defined by where the normalising constant is given by , and I v (.) denotes the modified Bessel function of the first kind and order ν and S p−1 denotes the unit sphere in .This distribution is called von Mises distribution for circular data and Fisher distribution for spherical data.The parameter µ is the vector of the mean direction and κ is the concentration parameter around µ.This distribution is rotationally symmetric about µ.
Let (x 1 , x 2 , ...,x n ) be a random sample of size n from the von Mises-Fisher distribution, M p (µ, κ) .Let be the resultant length mean of the sample defined where is the sample mean vector of (x 1 , x 2 , ...,x n ) defined by .The maximum likelihood estimator of µ is the sample mean direction, i.e., and the maximum likelihood estimator of κ is the solution of the equation where the function A p (κ) is defined by For more details about this distribution, see for instance, Mardia and Jupp (2000), p. 198.[12]

ESTIMATION OF A MIXTURE OF K VON MISES-FISHER DISTRIBUTIONS
A mixture of k von Mises-Fisher components C 1 ,...,C k has probability density function given by where and is the density function of the C j component, i.e., x the density of M p (µ j , κ j ) distribution.The parameters π j , j = 1, .., k with 0 < π j < 1 and are the proportions of the mixture, Q = (V, θ), with V = (π 1 ,π 2 ,.....π k ) and θ = (θ 1 ,....,θ k ) is the vector of unknown parameters of the mixture.
For the estimation of the parameters of the mixture, we review the EM algorithm and its variants (soft-movMF, hard-movMF and stochastic EM) in Subsection 3.1.and we propose a dynamic clusters type algorithm in Subsection 3.2.

EM Algorithm
The Estimation-Maximization (EM) algorithm is used to obtain the maximum likelihood estimates of the parameters of the mixture and can be briefly described as follows.
Let (x 1 , x 2 , ...,x n ) be a random sample from the mixture and let Z = (z 1 , z 2 , ....,z n ) be the be the missing data, where the indicator vector z i = indicates the component of the mixture for x i .The expected log-likelihood as associated with the complete sample (x 1 , ....,x n , Z), derived in Appendix A, is given by where t j (x i ) is the a-posteriori probability of x i belonging to C j defined by So (3) may be written as To estimate the vector of unknown parameters Q, the EM algorithm uses iteratively the two steps: Estimation (E) and Maximization (M).
The algorithm starts with an initial solution: or with an initial partition into k groups, and then determine the estimates Q 0 based on the partition.In the mth iteration (m ≥ 1) the steps are:

E -Step
For j = 1,...,k, i = 1,...,n, calculate the a-posteriori probability of x i belonging to the j th component of the mixture .

M -Step
Use estimates t j (m) (x i ) to maximize L 1 (Q|x 1 , ...,x n , Z) subject to the constraint µ T j µ j = 1 and L 2 (Q|x 1 , ..,x n , Z) subject to the constraint The estimators obtained, derived in Appendix B, are the following: • The maximum likelihood estimator of µ j in the (m + 1)th iteration, is given by (8) • The maximum likelihood estimator of k j in the (m + 1)th iteration, is the solution of the equation where R j is the length of the vector , that is .
• The maximum likelihood estimator of π j in the (m + 1)th iteration, is given by In the particular case of components with the same concentration parameter k, the estimates of µ j and π j , j = 1, ..., k, are given by the expressions ( 8) and (10), and the estimate of the common concentration parameter κ, derived in Appendix C, is the solution of the equation where R j is defined as before.
The EM algorithm is assumed to have converged if the relative change in the log-likelihood values is smaller than a threshold or if the relative absolute change in the parameters is smaller than a threshold.A partition (P 1 , ..., P k ) of the sample is obtained assigning x i to the component for which the aposteriori probability is the largest, that is, and when t j This algorithm is denoted by soft-movMF algorithm by Banerjee et al. (2005Banerjee et al. ( , p. 1357) ) [6].These authors also proposed the hard-movMF algorithm (p.1358), which is a modification of the soft-movMF by adding a hardening step (H -step) between E -step and M -step.This step is: Replace the aposteriori probabilities by assigning each observation with probability 1 to the component for which its aposteriori probability is maximum.
Celeux and Govaert (1992) [7] denoted the previous algorithm by Classification EM algorithm and proposed another variant of the EM algorithm, the stochastic EM, where instead of the hardening step, a stochastic step (S -step) is added between E -step and M -step.This step is:

S -Step
Assign at random each observation to one component with probability equal to its aposteriori probability.
These three variants of the EM algorithm are implemented in a R package called movMF [13].

Dynamic Clusters Type Algorithm
Let E be a finite sample.The aim is to determine a partition P = (P 1 , P 2 , ..., P k ) of E into k classes, so that for every j (1 ≤ j ≤ k), P j may be considered as a sample from a population with density f θ .

Let
be the family of probability densities, from which the distributions of the different components belong: θ is a vectorial parameter and L its definition space: The two following functions f and g are successively applied until obtaining stable elements of L and P: where L = (θ 1 , ...,θ k ) and P = (P 1 , ..., P k ), so that for is the set of observations, which are less distant from the distribution f θ i than from others.Then, it is important to define a function D, which measures the distance from an observation to a distribution f θ : The distance is defined by where C is a constant defined by Then and each group P i is defined by The function is such that for satisfies the condition where card (P i ) is the number of observations of P i .So the parameters θ I are estimated based on the k classes P i : the function g defines the estimation by the maximum likelihood method and the function f enables us to define again k new classes P i and then, evaluate again the value of the criterion W.

COMPARISON OF THE ALGORITHMS
For comparing the solutions obtained with the algorithms, we define next the between-groups and within-groups variability measures, in the decomposition of the total variability used to test the null hypothesis of a common mean vector across k von Mises-Fisher populations with concentration parameters not necessarily equal.This test was considered in the literature for the particular case of equal concentration parameters, for the circle or the sphere, see for instance, Mardia and Jupp (2000, pp. 222-226) [12], Watson (1956) [14], Watson and Williams (1956) [15] and Harrison et al. (1986) [16].
Let x i1 , ...,x in i (i = 1, ..., k) be k independent random samples of sizes n 1 , ..., n k from populations M p (µ i , κ i ) , with vectors of the mean direction µ i and concentration parameters κ i , i = 1, ..., k.Let n = n 1 + ... + n k be the global sample size.The null hypothesis of interest is against the alternative hypothesis that at least one of the equalities is not satisfied.
Next we consider the concentration parameters κ i unknown, but if these parameters are unknown, we have to estimate them through their maximum likelihood estimates for instance.Let's consider the following identity The optimum value of θ i is the maximum likelihood estimator of θ i associated with P i and the optimum criterion is function of the partition P * and L * obtained in convergence: where C is the constant previously defined and the maximum likelihood estimators of µ i and κ i respectively, based on the sample P i .Then, Summing from i = 1 to k, j = 1 to n i and replacing µ and µ i by their maximum likelihood estimates, the following identity is obtained where The previous identity can be written as and the hypothesis H 0 is rejected for large values of F .When all concentration parameters κ i are equal to κ, the statistic (20) reduces to the following statistic given in Mardia and Jupp (2000, pp.222-223): where R i is resultant length of the ith sample and R is the resultant length of the global sample.The F -statistic has under H 0 approximately the F (k−1)(p−1),(n−k)(p−1) distribution for large κ.

Simulation Study
We generated samples of size n from a mixture of equal proportions of two Fisher distributions F (e 3 , κ) and F (µ, κ), with a common concentration parameter κ.We considered without loss of generality, e 3 = (0, 0, 1) T and µ = (0, (1 − For generating observations from the Fisher distribution, we used the method given in Wood (1994) [17].We supposed that the parameters of the mixture are unknown and we estimated these parameters based on each generated sample, using the three variants of the EM algorithm (soft-movMF, hard-movMF and stochastic EM) described in Subsection 3.1, and the dynamic clusters type algorithm described in Subsection 3.2.We obtained the estimates the concentration parameters and the angle between the estimated mean directions, indicated in the Table 1. for the sample size of 20 and concentration parameters of the components equal to 5 or 10 and in Table 2 for the sample size of 40 and concentration parameters of the components equal to 5. In these tables, we also present for each sample, the classification results (confusion matrix), the sizes of the groups and within-groups and between-groups variability measures for the final solution, which were obtained by the expressions given in the previous subsection, where the concentration parameters were replaced by their maximum likelihood estimates.We note that when the angle θ = 30 o , or equivalently, where and R i is the resultant length of the i th sample.This identity is the decomposition of total variability into within-groups variability and between-groups variability .The test statistic is defined by , i.e., the components are poorly separated, the stochastic EM algorithm did not converge for any run.When the components are reasonably or well-separated.i.e., θ = 90° or θ = 150°, all algorithms lead to the same solution, in general.
From the results indicated in Tables 1-2, we conclude the following: The algorithms gave the same solution when the two components are well separated, that is, when θ = 90° for κ = 5 and when θ = 90° or θ = 150° for κ = 10.Therefore, in these cases, the confusion matrix is the same and the variability measures coincide for all the algorithms, as well as the estimates of the concentration parameters and the estimate of the angle between the mean directions.When the concentration of the components increases, the rate of misclassified observations decreases (or remains equal) and the between-groups variability increases, along with the F -statistic for components with moderate or large separation.For each algorithm, the rate of misclassified observations decreases as the separation between the components increases, and for well-separated components, this error rate is equal to 0. This error rate decreases or is equal to 0 when the sample size increases.For each algorithm, the between-groups variability increases as the separation between the two components increases and for well separated components, the between-groups variability exceeds largely the within-groups variability.The F -statistic also increases when the angle between the mean directions of the components increases.When the sample size increases, the between-groups variability and F -statistic increase for moderate or large separation of the components of the mixture.

Example
We used the spherical data given in Wood (1982) [18], which consist of a set of 33 estimates of a previous magnetic pole position of the earth obtained using palaeomagnetic techniques.Each estimate is associated with a different site, the 33 sites being spread over a large of Tasmania.As the data appear to fall into two main groups, Wood (1982) [18] estimated the parameters of a bimodal model for the data.
We obtained a partition of these data into two groups based on the estimation of a mixture of two Fisher distributions through the three variants of the EM algorithm (soft-movMF, hard-movMF and stochastic EM) described in Subsection 3.1 and dynamic clusters type algorithm described in Subsection 3.2.For obtaining the final solutions of the variants of the EM algorithm, we used the R package, movMF.For the dynamic clusters type algorithm, as it depends on the initial solution, we considered several initial partitions randomly chosen for the algorithm and for all initial partitions, the algorithm converged and the final solution obtained was the same.The final solutions obtained with the algorithms are given in the Table 3.The solutions obtained with the several algorithms do not coincide, probably because in this case the components are not well-separated (the estimated angle between the mean directions is around 30 o ).But, the solutions obtained with soft-movMF, stochastic EM and dynamic clusters algorithm are rather similar, as we may observe for these solutions, a large number of observations is stable in the partitions, i.e., 87.8% of the observations stay always together in the same group.We compared the solutions obtained in the algorithms through the between-groups variability measure and Fstatistic, where we estimated the concentration parameters.(See Table 4).
The solution obtained in stochastic EM is preferable in what concerns to the between-groups variability, but considering the F -statistic, the solution obtained with dynamic clusters algorithm is preferable.

CONCLUDING REMARKS
The simulations revealed that only for poorly or moderately separated components, the variants of the EM algorithm and the dynamic clusters type algorithm lead to different solutions in general.For very well separated components, the algorithms seem to originate the same result.Additionally, the larger concentration parameters associated with the components, greater is the tendency to obtain the same solution for the algorithms.
For each algorithm, as expected, the between-groups variability and the F -statistic increase when the separation between components increases or when the concentration of components increases, since these components are not badly separated (i.e., the angle θ is 30 o ).

CONSENT FOR PUBLICATION
Not applicable.

CONFLICT OF INTEREST
The author declares no conflict of interest, financial or otherwise.

ACKNOWLEDGEMENTS
• The author is grateful to the referees of this journal for their helpful comments.
• This work is financed by the ERDF European Regional Development Fund through the COMPETE Programme (operational programme for competitiveness) and by National Funds through the FCT Fundação para a Ciência e a Tecnologia (Portuguese Foundation for Science and Technology) within project FCOMP-01-0124-FEDER-037281.

Derivation of the Expected Log-Likelihood of the Complete Sample
The vectors z i are independent and have multinomial distribution with parameters (1, π 1 ,...π k ) and the probability density function is given by The density function of x i |z i is given by Then, the density function of (x i , z i ) is defined by the product Replacing the densities g (z i |Q) and l (x i |z i , Q) in the previous expression, we obtain The complete data log-likelihood of (x 1 , x 2 , ..., x n , Z) is given by The expected value of Z ij |x i is given by the expression This expected value is the aposteriori probability of x i belonging to C j , which we denote by t j (x i ) .Then, the expected complete data log-likelihood may be written as

Derivation of the Maximum Likelihood Estimators
First, consider the function L 1 (Q) subject to the constraint µ T j µ j = 1 where λ 1 is a Lagrange multiplier and t j (x i ) is defined by (7).The maximum likelihood estimator of µ j is the solution of the following equation As µ T j µ j = 1; then the Lagrange multiplier is given by L(Q|x 1 , x 2 , ..., x n , Z) t j (x i ) ln c p (κ j ) + κ j µ T j x i − λ 1 (µ T j µ j − 1),

( 13 )
Let P k be the set of partitions of E into k classes and let be the set of vectors of dimension k of .The method starts with an initial partition of E or starts with a vector of dimension k of values of the unknown parameter vector of dimension k of values of the unknown parameter .

where θ is the angle between µ and e 3 .
Should other mean directions have been used, which form an angle θ, the same results would have been obtained.We considered two sample sizes n = 20, 40, several angles of separation between the two components, θ = 30 o , 90 o , 150 o and two values of the common concentration parameter κ = 5, 10.

Table 1 .
Confusion matrices, size groups, estimates of the parameters, variability measures and withingroups) and F -statistic for the EM algorithm (soft-movMF, hard-movMF, stochastic EM) and dynamic clusters algorithm (DC), for the sample size of 20 and concentrations equal to 5 or 10 (*: the results for the other three methods are equal).
h(x i , z i |Q) = k j=1 π z ij j f (x i |θ j ) z ij .L(Q|x 1 , x 2 , ..., x n , Z) = ln n i=1 h(x i , z i |Q)and replacing h (x i , z i |Q), we obtain the expression The density function of z i |x i is given by Replacing the densities h and ψ, we obtain