mapclust: Divisive Hierarchical Clustering using Spatial Patches

SPARTAAS | mapclust v1.0
SPARTAAS [Bellanger,Coulon,Husi]

Introduction

Hierarchical classification can also be considered by introducing constraints that allow, for example, geographical proximity to be taken into account.

Spatialpatches

The spatial distribution of data can be heterogeneous and present local aggregations or spatial patches. P. Petitgas proposed an algorithm to identify them in the context of fish population density (WOILLEZ et al., 2009). The inputs are the geographical coordinates (X, Y) and a variable of interest (var).

The position of a patch is then determined by its center of gravity. The algorithm starts with the highest value of var and then considers each observation in decreasing order of value of var. The highest value initiates the first patch. Then, the observation considered is assigned to the nearest patch, provided that its distance from the center of gravity of the patch is smaller than the threshold distance dlim. Otherwise, the observation forms a new patch. The results on spatial patches are of course influenced by the choice of the dlim threshold and the location of the highest values of var.
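The assignment rule described above can be sketched in a few lines of R. This is only an illustration, not the package's implementation: the function name spatial_patches is hypothetical, and the centre of gravity is assumed to be the var-weighted mean of the coordinates of the patch members (var is assumed strictly positive).

```r
# Minimal sketch of the spatial patches assignment rule.
# Assumption: a patch's centre of gravity is the var-weighted mean of its points.
spatial_patches <- function(x, y, var, dlim) {
  ord <- order(var, decreasing = TRUE)   # visit observations from highest var
  patch <- integer(length(var))          # patch id of each observation
  cx <- numeric(0)                       # centres of gravity (x) of existing patches
  cy <- numeric(0)                       # centres of gravity (y) of existing patches
  for (i in ord) {
    if (length(cx) > 0) {
      d <- sqrt((cx - x[i])^2 + (cy - y[i])^2)
      k <- which.min(d)                  # nearest existing patch
    }
    if (length(cx) == 0 || d[k] >= dlim) {
      # first observation, or too far from every patch: start a new patch
      cx <- c(cx, x[i])
      cy <- c(cy, y[i])
      patch[i] <- length(cx)
    } else {
      # join the nearest patch and update its centre of gravity
      patch[i] <- k
      m <- patch == k
      cx[k] <- weighted.mean(x[m], var[m])
      cy[k] <- weighted.mean(y[m], var[m])
    }
  }
  patch
}

# Hypothetical data: two groups of points far apart.
x   <- c(0, 1, 0.5, 10, 11)
y   <- c(0, 0, 1, 10, 10)
var <- c(5, 3, 4, 6, 2)
spatial_patches(x, y, var, dlim = 3)
```

With a large dlim the two groups each form one patch; lowering dlim splits them into more patches, which is the sensitivity to dlim noted above.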

mapclust

In order to better understand the spatial distribution of individuals, spatial patch construction is applied hierarchically top-down by varying the maximum acceptable distance (dlim) between observation points and the centre of gravity of a patch. This approach leads to a Hierarchical Top-down Classification (EVERITT et al., 2001) based at each step on the Spatialpatches algorithm: each node (parent group) of the hierarchy is divided into two patches (child nodes). Spatialpatches was developed to describe the spatial distribution patterns of a fish population based on density data, so the variable of interest is assumed to be positive. var can therefore be a count, a frequency or a positive real variable.

When the variable of interest var is real, meaning that it can take negative values, the mapclust classification algorithm had to be adapted. We no longer work with the values of var but with the values of the probability density f associated with var. In the same way, the method was adapted to the case where var is multidimensional by working on a kernel density estimate.
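As an illustration of the density transformation in the univariate case, the sketch below replaces each value of var by an estimate of f(var). It assumes a Gaussian kernel with R's default bandwidth and uses simulated data; the estimator actually used by mapclust may differ.

```r
# Sketch: replace a real-valued var by its estimated density f(var).
# Assumptions: Gaussian kernel, default bandwidth, simulated (hypothetical) data.
set.seed(1)
var <- c(rnorm(50, mean = -2), rnorm(50, mean = 3))  # contains negative values

kde   <- density(var)                          # kernel density estimate on a grid
f_var <- approx(kde$x, kde$y, xout = var)$y    # interpolate f at each observation

# f_var is positive, so it can play the role of the variable of interest.
summary(f_var)
```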

Authors:


L. Bellanger

mail: <lise.bellanger@univ-nantes.fr>

P. Husi

mail: <philippe.husi@univ-tours.fr>

A. Coulon

Maintainer:


A. Coulon

mail: <arthur.coulon@univ-tours.fr>

Contributor:

B. Desachy

B. Martineau

Get started with the mapclust application

The App


Run (first time)

The application requires, in a single table, the geographical coordinates (longitude and latitude, coord) and the variable(s) of interest (var).

For a first run you can use one of the built-in datasets. The first one is called datarcheo: select it on the mapclust tab (it is selected by default) and run. You can also try the other dataset, datacancer.

Print label option

Located just below the dataset selector, this option controls whether labels are displayed on the dendrogram.

Select dataset and print label option

Import your data

You can import your own data. Since they may contain one or several variables of interest, you must adjust the import parameters accordingly.

Let's take a look at the interface. It has three main parts. The first is the import area, which contains a single import button. The second is the parameter area, where you configure the import tool according to your data. The third is the preview area, which shows both how your data should look and how they actually look. In addition, a concordance indicator lets you quickly check whether your data match the expected layout.

CSV Format and write.table


A csv file: a data.frame with 3 columns, with headers, separated by semicolons ";".

The expected input format is ".csv"; ".txt" files are also supported if they are formatted as csv.

In R you can export your data frame into a csv file using write.csv2 or write.table. In a csv you can choose a character to separate the columns. In the same way, you can define the character to indicate the decimal point.

write.table(data,file="path/to/name_file.csv",sep=";",dec=".",row.names=FALSE,quote=FALSE)
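To check an export before importing it into the application, you can read the file back with the same separator and decimal settings. The sketch below uses a small made-up data frame and a temporary file; the column names are illustrative only.

```r
# Hypothetical example data: two coordinates, one variable of interest, labels last.
data <- data.frame(
  longitude = c(0.68, 0.70, 0.72),
  latitude  = c(47.39, 47.40, 47.38),
  var       = c(12.5, 3.0, 7.25),
  label     = c("site_A", "site_B", "site_C"),
  stringsAsFactors = FALSE
)

path <- tempfile(fileext = ".csv")
write.table(data, file = path, sep = ";", dec = ".", row.names = FALSE, quote = FALSE)

# Read the file back with the same separator, decimal mark and quote settings.
check <- read.table(path, header = TRUE, sep = ";", dec = ".", quote = "",
                    stringsAsFactors = FALSE)
str(check)
```

If str(check) shows a single column, the separator does not match the file; if numeric columns come back as character, the decimal mark is probably wrong.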

In Excel you can save as CSV format in order to import your data frame.

The import interface lets you set these values with the "header", "decimal", "separator" and "quote" options.

To process your data correctly, the application needs some information. First, whether your data contain labels; if they do, the labels must be in the last column. Second, how many variables of interest you have in addition to the two coordinate variables.

If the settings do not match your actual data frame, you will not be able to import anything: change either the data or the parameters. When the settings match, a green check icon appears; otherwise you will see a red cross.
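The same kind of check can be done in R before importing. The helper below is hypothetical (it is not part of mapclust) and only encodes the layout described here: two coordinate columns, then the variables of interest, then an optional label column in last position.

```r
# Hypothetical helper: does a data frame match the declared layout?
# Expected columns: 2 coordinates + n_var variables of interest (+ 1 label, last).
matches_layout <- function(df, n_var = 1, has_label = TRUE) {
  expected <- 2 + n_var + as.integer(has_label)
  if (ncol(df) != expected) return(FALSE)
  # coordinates and variables of interest must all be numeric
  numeric_cols <- df[, seq_len(2 + n_var), drop = FALSE]
  all(vapply(numeric_cols, is.numeric, logical(1)))
}

df <- data.frame(x = c(1, 2), y = c(3, 4), v = c(0.5, 0.6), lab = c("a", "b"))
matches_layout(df, n_var = 1, has_label = TRUE)   # layout as declared
matches_layout(df, n_var = 2, has_label = TRUE)   # one variable too many declared
```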

Label

Yes/no option: do your data contain labels? If so, they must be in the last column.

Header

Yes/no option: do your columns have headers?

Separator

Choose the character used to separate the columns.

Quote

Choose the quote character used to delimit strings.

Decimal

Choose the character used as the decimal mark.

Univariate data

Set the slider to "Univariate", then import your data. If you have labels, leave the label option checked; if not, uncheck it.

Configure all the settings for the csv format: the column separator, the decimal symbol, the quotation mark used for character strings, and whether your columns have a header.

Multivariate data

The only difference is that you set the slider to "Multivariate". Doing so gives you access to a second slider, where you indicate the number of variables of interest. Warning: do not count the two coordinate variables.


Evaluation Plot


It is essential to be able to evaluate the different partitions of the hierarchy in order to identify the one(s) that is (are) most relevant. The number K of classes of the partition to be retained is based in our case on several indicators calculated for different values of K: the total within-class sum of squares (WSS) and the global average of the silhouette widths.

While running the mapclust method you have to make a choice: where to cut the dendrogram. This operation selects the partition. To compare the possibilities, you can look at the evaluation plots (WSSPlot and AveSilPlot). These two plots assess the relative quality of each partition.

Within Sum of Square Plot (WSSPlot)

This plot shows the within-group sum of squares against the number of clusters. The within sum of squares decreases as the number of clusters increases. The best partition is the one beyond which adding one or more clusters no longer decreases the WSS value appreciably. This is called the elbow method.

Example:

On this graph we start by looking at the value for the lowest number of groups: 2. If we add a third group, the WSS value decreases (from 0.09 to 0.03). If we add another group, the value decreases again (from 0.03 to 0.01). After that, adding a group has little or no effect on the value. Adding a group is therefore not worthwhile, and we keep a partition with 4 groups.
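For reference, the total within-class sum of squares of a partition can be computed as in the sketch below. The function name total_wss is hypothetical, and which columns mapclust actually feeds into the WSS is not stated here, so the data passed in are an assumption.

```r
# Total within-class sum of squares for one partition.
# data: numeric vector, matrix or data frame (one row per observation)
# cluster: class label of each observation
total_wss <- function(data, cluster) {
  data <- as.matrix(data)
  groups <- split(seq_len(nrow(data)), cluster)   # row indices per class
  sum(vapply(groups, function(idx) {
    # squared deviations from the class mean, summed over all variables
    centred <- scale(data[idx, , drop = FALSE], center = TRUE, scale = FALSE)
    sum(centred^2)
  }, numeric(1)))
}

# Hypothetical one-dimensional data with two obvious groups.
v <- c(1, 2, 3, 10, 11, 12)
total_wss(v, rep(1, 6))            # everything in one class
total_wss(v, c(1, 1, 1, 2, 2, 2))  # two classes: much smaller WSS
```

Computing this value for each candidate number of classes and plotting it against K gives the kind of curve read in the example above.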

Average silhouette Plot

This graph shows the average silhouette width of each partition (ROUSSEEUW 1987). The silhouette width is an index bounded between -1 and 1, calculated for each observation. The closer the value is to 1, the better the observation is classified. We look for the partition whose average value is closest to 1.

Example:

On this graph we look for the maximum value. The best score corresponds to the division into 7 groups, and the second best to the partition into 4 groups. Although the silhouette index is higher for 7 groups, we select the partition into 4 groups, because the WSSPlot also pointed to 4 groups.
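The silhouette width of an observation i is s(i) = (b - a) / max(a, b), where a is the mean distance from i to the other members of its class and b is the smallest mean distance from i to the members of another class. The base R sketch below follows this definition with Euclidean distances; the function name is hypothetical, and in practice the silhouette function of the cluster package is the usual tool.

```r
# Silhouette width of each observation (Euclidean distance, at least 2 classes).
silhouette_widths <- function(data, cluster) {
  d <- as.matrix(dist(data))
  vapply(seq_len(nrow(d)), function(i) {
    own <- cluster == cluster[i]
    if (sum(own) == 1) return(0)              # singleton class: s(i) = 0 by convention
    a <- sum(d[i, own]) / (sum(own) - 1)      # d[i, i] = 0, so divide by size - 1
    others <- setdiff(unique(cluster), cluster[i])
    b <- min(vapply(others, function(k) mean(d[i, cluster == k]), numeric(1)))
    (b - a) / max(a, b)
  }, numeric(1))
}

# Hypothetical data: two well-separated classes on a line.
pts <- matrix(c(0, 1, 10, 11), ncol = 1)
s <- silhouette_widths(pts, c(1, 1, 2, 2))
mean(s)   # average silhouette width of the partition
```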


Output


The selection of a partition is done by clicking on the dendrogram at the desired height.

You can change it at any moment.


Map


Silhouette

This plot shows the silhouette index of each observation in the selected partition. The observations are grouped by cluster and sorted in decreasing order within each cluster.


Additional information

Above the map you can see a short description of each cluster, including the average value of each variable.

Below the silhouette plot you have a summary of the different values for each partition: the dlim, the number of clusters, the WSS value, the average silhouette width and, in the univariate case, the Moran index (MORAN, 1950). The Moran index is a measure of spatial autocorrelation. It is calculated for each cluster of the partition, and the average over the clusters is used to characterize the partition.
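As a rough illustration of the Moran index, the sketch below computes I = (n / S0) * sum_ij w_ij z_i z_j / sum_i z_i^2, where z is var centred on its mean and S0 is the sum of the weights. The inverse-distance weights are an assumption made for this example; the weighting scheme used by mapclust is not described here. Points with identical coordinates would give infinite weights and are not handled.

```r
# Moran's I with inverse-distance weights (assumed weighting scheme).
moran_index <- function(x, y, var) {
  d <- as.matrix(dist(cbind(x, y)))
  w <- 1 / d
  diag(w) <- 0                      # an observation has no weight with itself
  z <- var - mean(var)
  n <- length(var)
  (n / sum(w)) * sum(w * outer(z, z)) / sum(z^2)
}

# Hypothetical points on a line.
x <- 1:6
y <- rep(0, 6)
moran_index(x, y, c(1, 2, 3, 4, 5, 6))   # smooth gradient: positive autocorrelation
moran_index(x, y, c(1, 6, 1, 6, 1, 6))   # alternating values: negative autocorrelation
```

Following the description above, the value for a partition would be obtained by applying such a function to each cluster separately and averaging the results.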


References


EVERITT, B.S., S. LANDAU and M. LEESE (2001). "Cluster Analysis". London, UK: Arnold.

MORAN, P.A.P. (1950). "Notes on Continuous Stochastic Phenomena". Biometrika 37 (1), p. 17–23. DOI: https://doi.org/10.2307/2332142. JSTOR 2332142.

ROUSSEEUW, P.J. (1987). "Silhouettes: a graphical aid to the interpretation and validation of cluster analysis". J. Comput. Appl. Math. 20, p. 53–65. DOI: https://doi.org/10.1016/0377-0427(87)90125-7.

WOILLEZ, M., J. RIVOIRARD and P. PETITGAS (2009). "Notes on survey-based spatial indicators for monitoring fish populations". Aquat. Living Resour. 22, p. 155–164.