hclustcompro: Hierarchical Agglomerative Clustering method by compromise

SPARTAAS | hclustcompro v1.0
SPARTAAS [Bellanger,Coulon,Husi]

Introduction

hclustcompro is a hierarchical agglomerative clustering (HAC) method taking into account two sources of information associated with the same objects. This method, called HAC by compromise (hclustcompro), provides a compromise between the hierarchies obtained from each source taken separately. A convex combination of the dissimilarities associated with each of the sources is used to modify the dissimilarity measure in the classical HAC algorithm. The choice of the mixing parameter is the key point of the method. We propose an objective function to be minimized, based on the absolute difference of the correlations between the initial dissimilarities and the cophenetic distances, as well as a resampling procedure to ensure the robustness of the choice of the mixing parameter.

The data

hclustcompro uses two types of input data. The first is a contingency table (counts of ceramics or other artefacts). The second can be either stratigraphic connections or dates (time ranges) for the same individuals as the first source.

See the example below.

Go to the wiki for more details.

First data source (contingency table)


Second data source (Stratigraphic connection or Time range data)


Stratigraphic connection
Time range data


Select Alpha

Select the number of groups in the partition


Dendrogram


Warning: you must first set up the clustering in the clustering settings tab.



Seriograph




Download the contingency table


Timerange plot




Authors:


L. Bellanger

mail: <lise.bellanger@univ-nantes.fr>

P. Husi

mail: <philippe.husi@univ-tours.fr>

A. Coulon

Maintainer:


A. Coulon

mail: <arthur.coulon@univ-tours.fr>

Contributor:

B. Desachy

B. Martineau

Get started with the hclustcompro application

Table of contents


Introduction


Hierarchical bottom-up classification method with constraints. The method uses two sources of information.

hclustcompro

The merging of the two data sources is done by a parameter (alpha) which weights each source.

The first one is a contingency table. Rows must be individuals (archaeological sites, ...) and columns must be categories (types, ...). The second concerns the same individuals and can be either stratigraphic connections or time range data. The stratigraphic connection object is a data frame with two columns. The first column contains the network elements (as many as there are rows in D1) and the second column contains the list of all the other elements connected to each of them. The list is a string composed of the names of the elements separated by commas.

The data frame for time range data contains the same first column. The second column contains the lower temporal bound and the third column contains the upper temporal bound.
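As an illustration, the stratigraphic connection table can be parsed from a csv file as follows (a minimal Python sketch with hypothetical element names; the application itself reads this format directly):

```python
import csv
import io

# Hypothetical stratigraphic connection table: column 1 is the element,
# column 2 a comma-separated list of the elements connected to it.
connections_csv = "element;connected_to\nus1;us2\nus2;us1,us3\nus3;us2\n"

rows = list(csv.reader(io.StringIO(connections_csv), delimiter=";"))
header, body = rows[0], rows[1:]

# Parse each connection list into a set of element names.
links = {elem: set(conn.split(",")) for elem, conn in body}
print(links["us2"])  # the two elements connected to us2
```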

Construction of D1

Our first source of information is a contingency table. We need to construct a distance matrix from this table. We perform a correspondence analysis (CA) on the contingency table and then use the distances of the components (chi-square metric).

You must choose the number of axes (CA components) to be used to construct the distance matrix.

Examination of the eigenvalues makes it possible to determine the number of principal axes to be considered. The eigenvalues correspond to the amount of information retained by each axis.

A heuristic method is often used: one constructs the graph of eigenvalues sorted in descending order and retains the eigenvalues (and associated principal components) that precede the first "elbow".

Another approach is based on maintaining an approximation quality of the initial data, measured by the explained inertia rate (e.g. 85%). As many axes as necessary are chosen so that the sum of the corresponding eigenvalues exceeds the targeted inertia rate.
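The inertia-rate rule can be sketched as follows (illustrative Python with made-up eigenvalues; the application computes the CA and its eigenvalues for you):

```python
# Choose the number of CA axes so that the cumulative explained inertia
# reaches a target rate (e.g. 85%). Eigenvalues are assumed sorted in
# descending order.
def n_axes(eigenvalues, target=0.85):
    total = sum(eigenvalues)
    cumulated = 0.0
    for k, ev in enumerate(eigenvalues, start=1):
        cumulated += ev
        if cumulated / total >= target:
            return k
    return len(eigenvalues)

# With these made-up eigenvalues, three axes carry 90% of the inertia.
print(n_axes([0.50, 0.25, 0.15, 0.07, 0.03]))  # -> 3
```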

Construction of D2 for stratigraphic connection

From this dataset we construct an adjacency matrix. Based on this matrix we generate a dissimilarity matrix. The matrix contains only 0s and 1s: 1 if there is no relationship and 0 if there is one.
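A minimal sketch of this construction (Python, hypothetical element names):

```python
# Build the 0/1 dissimilarity matrix D2 from stratigraphic connections:
# D2[i][j] is 0 when elements i and j are connected (or identical) and
# 1 when they are not.
elements = ["us1", "us2", "us3"]
links = {"us1": {"us2"}, "us2": {"us1", "us3"}, "us3": {"us2"}}

D2 = [[0 if a == b or b in links[a] else 1 for b in elements]
      for a in elements]
print(D2)  # -> [[0, 0, 1], [0, 0, 0], [1, 0, 0]]
```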

Construction of D2 for time range data

The overlap index is the ratio between the overlap (or separation) and the cumulative extent of both individuals. We define the cumulative extent as follows: from the minimum of the lower limits of the pair of individuals to the maximum of the upper limits. We define the overlap (or separation) as follows: the minimum of the upper limits minus the maximum of the lower limits.

From this ratio we construct a dissimilarity by scaling the value between 0 and 1, ensuring that the closer the value is to 1, the further apart the time ranges are.
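A sketch of this computation (Python; the exact rescaling used by the package may differ, but (1 - ratio) / 2 is one way to map the ratio, which lies in [-1, 1], onto [0, 1]):

```python
# a and b are (lower, upper) time ranges.
def overlap_dissimilarity(a, b):
    # Cumulative extent: min of the lower limits to max of the upper limits.
    extent = max(a[1], b[1]) - min(a[0], b[0])
    # Overlap (positive) or separation (negative).
    overlap = min(a[1], b[1]) - max(a[0], b[0])
    ratio = overlap / extent
    # Map from [-1, 1] to [0, 1]; closer to 1 means further apart.
    return (1 - ratio) / 2

print(overlap_dissimilarity((0, 10), (0, 10)))  # identical ranges -> 0.0
print(overlap_dissimilarity((0, 1), (9, 10)))   # far apart -> 0.9
```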

Correlation Criterion

A criterion for choosing alpha in [0;1] must be determined by balancing the weights between the two information sources in the final classification. To obtain alpha, we define the following criterion:

CorCrit(alpha) = | Cor(D1, u_alpha) - Cor(D2, u_alpha) |    (1)

where u_alpha denotes the ultrametric (cophenetic) distances obtained from the HAC with alpha fixed. The CorCrit criterion in (1) represents the absolute difference between two cophenetic correlations. (Cophenetic correlation is defined as the correlation between two distance matrices, computed by treating the half distance matrices as vectors; it measures how faithfully a dendrogram preserves the pairwise distances between the original data points.) The first correlation compares D1 with the ultrametric distances from the HAC with alpha fixed, while the second compares D2 with those same ultrametric distances. Then, in order to compromise between the information provided by D1 and D2, we estimate alpha with hat(alpha) such that:

hat(alpha) = argmin over alpha in [0;1] of CorCrit(alpha)
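The grid search over alpha can be sketched as follows (illustrative Python using scipy on toy dissimilarity matrices; the actual package is in R, and its resampling procedure is not shown):

```python
import numpy as np
from scipy.cluster.hierarchy import cophenet, linkage
from scipy.spatial.distance import squareform

rng = np.random.default_rng(0)

def random_dissimilarity(n):
    # Toy symmetric dissimilarity matrix with a zero diagonal.
    m = rng.random((n, n))
    m = (m + m.T) / 2
    np.fill_diagonal(m, 0.0)
    return m

D1, D2 = random_dissimilarity(10), random_dissimilarity(10)

def corcrit(alpha):
    # HAC on the convex combination, then the two cophenetic correlations.
    mixed = alpha * D1 + (1 - alpha) * D2
    coph = cophenet(linkage(squareform(mixed), method="ward"))
    corr1 = np.corrcoef(coph, squareform(D1))[0, 1]
    corr2 = np.corrcoef(coph, squareform(D2))[0, 1]
    return abs(corr1 - corr2)

# Estimate alpha by minimizing the criterion over a grid.
alphas = np.linspace(0.01, 0.99, 99)
alpha_hat = min(alphas, key=corcrit)
print(float(alpha_hat))
```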


The inputs


Alpha

The slider input allows you to choose the value of the mixing parameter. You select the weight given to the first dissimilarity matrix (alpha) and to the second (1 - alpha). By default the value is estimated using the correlation criterion defined above.

Number of clusters

The slider input defines the number of clusters you want. To help with the decision you can look at the WSS plot and the silhouette information.

Aggregation method

The hierarchical bottom-up classification can use several aggregation methods during the construction of the dendrogram. You can choose any of the methods except those with a risk of inversion (if you know what you are doing, you can use the package version for that).

Ward.D2

Ward's method is a criterion applied in hierarchical cluster analysis. Ward's minimum variance method is a special case of the objective function approach originally presented by Joe H. Ward, Jr. Ward suggested a general agglomerative hierarchical clustering procedure, where the criterion for choosing the pair of clusters to merge at each step is based on the optimal value of an objective function.

complete

At each step, the two clusters separated by the shortest distance are combined. In complete-linkage clustering, the link between two clusters contains all element pairs, and the distance between clusters equals the distance between those two elements (one in each cluster) that are farthest away from each other. The shortest of these links that remains at any step causes the fusion of the two clusters whose elements are involved.

single

At each step, the two clusters containing the closest pair of elements not yet belonging to the same cluster are combined. The distance between two clusters is determined by a single element pair, namely the two elements (one in each cluster) that are closest to each other. The shortest of these links remaining at any step causes the fusion of the two clusters whose elements are involved.

average (UPGMA)

UPGMA (unweighted pair group method with arithmetic mean) is a simple agglomerative (bottom-up) hierarchical clustering method. The method is generally attributed to Sokal and Michener. The UPGMA method is similar to its weighted variant, the WPGMA method. Note that the unweighted term indicates that all distances contribute equally to each average that is computed and does not refer to the math by which it is achieved.

At each step, the nearest two clusters are combined into a higher-level cluster. The distance between any two clusters A and B, each of size (i.e., cardinality) | A | and | B |, is taken to be the average of all distances d ( x , y ) between pairs of objects x in A and y in B, that is, the mean distance between elements of each cluster.

mcquitty (WPGMA)

WPGMA (Weighted Pair Group Method with Arithmetic Mean) is a simple agglomerative (bottom-up) hierarchical clustering method, generally attributed to Sokal and Michener. The WPGMA method is similar to its unweighted variant, the UPGMA method.

At each step, the nearest two clusters, say A and B, are combined into a higher-level cluster A ∪ B. Its distance to another cluster C, d(A ∪ B, C), is simply the arithmetic mean of the distances d(A, C) and d(B, C).
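For orientation, these aggregation rules map onto scipy's linkage method names as in the following sketch (Python; scipy's "weighted" corresponds to mcquitty/WPGMA, and its "ward" applies the Ward.D2 criterion to Euclidean distances):

```python
from scipy.cluster.hierarchy import linkage

# Hypothetical condensed distance vector for 4 objects (6 pairwise values).
d = [1.0, 2.0, 3.0, 2.5, 1.5, 0.5]

# The aggregation rules described above, in scipy's naming.
for method in ["ward", "complete", "single", "average", "weighted"]:
    Z = linkage(d, method=method)
    print(method, Z.shape)  # each merge table has n - 1 = 3 rows
```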

Subdivide

You can subdivide a cluster: click the button, select the cluster you want to subdivide, and select the number of subclusters. You can reset the dendrogram with the reset button next to it.

Seriation

This input lets you enable or disable the seriation of the columns. Matrix permutation uses an algorithm called "reciprocal averages". Each row is assigned a rank ranging from 1 to n, the number of rows. A barycentre is calculated for each column by weighting according to the row ranks. Finally, the columns are reorganized by sorting them by their barycentre.
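The reciprocal-averages permutation can be sketched as follows (Python, with a hypothetical contingency table):

```python
# Columns are sorted by their barycentre: the count-weighted mean of the
# row ranks (rows are ranked 1..n in their current order).
table = {"typeA": [12, 0, 3], "typeB": [5, 7, 1], "typeC": [0, 2, 9]}
n_rows = 3

def barycentre(counts):
    ranks = range(1, n_rows + 1)
    return sum(r * c for r, c in zip(ranks, counts)) / sum(counts)

order = sorted(table, key=lambda col: barycentre(table[col]))
print(order)  # -> ['typeA', 'typeB', 'typeC']
```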

Weight color indicator

This input lets you enable or disable the coloring of the weight column, in order to highlight the confidence related to the quantity of data.

Visualization

This input lets you choose which element to plot. There are three options: plot the positive deviation from the average percentage (EPPM, from the French Écart Positif aux Pourcentages Moyens), plot the frequencies, or plot both. The average percentage is calculated for each category (column) over the total number of counts (all classes combined). From the average percentage we recover, for each category and each row, the difference between the percentage of the category in the class and the average percentage. The EPPM corresponds to the notion of deviation from independence (between rows and columns, between categories and time classes) in a chi-square test approach. Although this approach is fundamental in statistical analysis, independence deviations are here purely indicative and are not associated with a p-value that could determine their significance.
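The EPPM computation described above can be sketched as follows (Python, hypothetical counts):

```python
# Row percentages, category averages over the grand total, and the
# positive deviations from those averages (EPPM).
rows = [[12, 5, 0], [0, 7, 2], [3, 1, 9]]

# Percentages within each row (each row sums to 100).
pct = [[100 * v / sum(r) for v in r] for r in rows]

# Average percentage of each category over the grand total of counts.
grand_total = sum(map(sum, rows))
mean_pct = [100 * sum(col) / grand_total for col in zip(*rows)]

# EPPM: only the share of a percentage exceeding the category average.
eppm = [[max(0.0, p - m) for p, m in zip(r, mean_pct)] for r in pct]
print([[round(v, 1) for v in r] for r in eppm])
```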

Sort clusters

The rows are initially in the order of appearance on the dendrogram. It is possible to re-order the classes in a temporal way. In the interface you can drag and drop a class (green) to change the order. You can add a hiatus between two clusters, and you can remove a cluster by dragging and dropping it into the remove section.

Import your data

You can import your data. There are two steps. You first have to upload a csv file for the first source of information. The data must be a contingency table.

The second step consists in uploading the second source. You can choose between two data types (stratigraphic connection or time range data). The stratigraphic connection object is a data frame with two columns. The first column contains the elements (as many as there are rows in the contingency table) and the second column contains the list of all the other elements connected to each of them. The list is a string composed of the names (those in column one) of the elements separated by commas.

The data frame for time range data contains the same first column. The second column contains the lower temporal bound and the third column contains the upper temporal bound.

The settings allow you to import different data frame organizations (header, column separator, ...).

Header

Yes or no option. Do your columns have headers?

Rownames

Yes or no option. Do your rows have rownames?

Separator

Choose the character used to separate the columns.

Quote

Choose the quote character used for strings.

Decimal

Choose the character used to indicate decimals.

CSV Format and write.table

It is a data.frame with columns separated by a comma "," or a semicolon ";".

The input format for importing data is ".csv", but ".txt" files formatted as csv are also supported.

In R you can export your data frame into a csv file using write.csv2 or write.table. In a csv you can choose a character to separate the columns. In the same way, you can define the character to indicate the decimal point.

write.table(data, file = "path/to/name_file.csv", sep = ";", dec = ".", row.names = FALSE, quote = FALSE)

In Excel you can save as CSV format in order to import your data frame.

The import interface allows you to set these values with the "header", "decimal", "separator" and "quote" options.


The outputs


Evaluation Plot

It is essential to be able to evaluate the different partitions of the hierarchy in order to identify the one(s) that is (are) most relevant. The number k of clusters of the partition to be retained is based in our case on several indicators calculated for different values of k: the total within-class sum of squares (WSS) and the global average of the silhouette widths.

During the execution of the hclustcompro method you have to make a choice: where to cut the dendrogram. This operation selects the partition. In order to compare all the possibilities you can look at the evaluation plots (WSSPlot and AveSilPlot). These two plots evaluate the relative quality of the partitions.

Within Sum of Square Plot (WSSPlot)

This is the plot of the within-groups sum of squares against the number of clusters. The within sum of squares decreases as the number of clusters increases. In this plot the best partition is reached when adding one or more clusters no longer decreases the WSS value appreciably. This is called the elbow method.

Example:

On this graph we start by looking at the value for the lowest number of classes: 2. If we add a third class we see that the WSS value decreases (from 2.4 to 1.8). If we add another class the value decreases again (from 1.8 to 1.4). We look for the moment where adding a class is no longer worthwhile. In this case, we can keep a partition with 8 or 9 classes.
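This reading of the curve can be mimicked with a small threshold rule (illustrative Python using the WSS values quoted above; the threshold is an arbitrary choice):

```python
# WSS values for k = 2..10, as read off the example plot.
ks = range(2, 11)
wss = [2.4, 1.8, 1.4, 1.1, 0.9, 0.75, 0.65, 0.62, 0.60]

def elbow(ks, wss, tol=0.05):
    # Keep adding classes while the WSS still drops by at least `tol`.
    for k, prev, cur in zip(list(ks)[1:], wss, wss[1:]):
        if prev - cur < tol:  # adding this class no longer pays off
            return k - 1
    return list(ks)[-1]

print(elbow(ks, wss))  # -> 8, consistent with keeping 8 or 9 classes
```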

Average silhouette Plot

This graph shows the average silhouette width of each partition (ROUSSEEUW 1987). The silhouette width is an index bounded between -1 and 1, which is calculated for each observation. The closer the value is to 1, the better the observation is classified. We look for the partition whose average value is closest to 1.
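The silhouette width itself can be computed by hand on a toy 1-D partition (illustrative Python; in practice it is computed from the distance matrix of the clustering):

```python
# s(i) = (b - a) / max(a, b), where a is the mean distance from point i to
# its own cluster and b the mean distance to the nearest other cluster.
clusters = [[0.0, 1.0], [10.0, 11.0]]

def silhouette(x, own, others):
    a = sum(abs(x - y) for y in own if y != x) / (len(own) - 1)
    b = min(sum(abs(x - y) for y in other) / len(other) for other in others)
    return (b - a) / max(a, b)

widths = [silhouette(x, c, [o for o in clusters if o is not c])
          for c in clusters for x in c]
print(round(sum(widths) / len(widths), 2))  # -> 0.9: a well separated partition
```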

Example:

On this graph we look for the maximum value. The best evaluation corresponds to the division into 9 classes. Looking at the second best partition we identify the one with 10 classes.

Select alpha

This plot shows the correlation criterion for each value of alpha.

Dendrogram

The dendrogram is the result of the clustering process. You can see the clusters and subclusters.

Seriograph

We have chosen the seriograph (DESACHY 2004). This tool makes it possible to highlight the evolution of ceramics over time as well as to understand commercial relations thanks to imported ceramics. The percentages of each category of ceramics per set are displayed. The percentages are calculated independently for each set (row). The display of the percentages allows comparison of the different sets but does not provide information on the differences in numbers. To fill this gap, the proportion of the counts in each class is displayed on the seriograph (weight column).

We have generalized this representation for other contingency data with hclustcompro object.

In order to facilitate the exploitation of the data tables, we propose here a computerised graphic processing tool (EPPM seriograph, for Écart Positif aux Pourcentages Moyens: positive deviation from the average percentage), which does not require specialised statistical skills and is adapted to the case of stratified sites, where the study of the evolution of artefacts can be based on the relative chronology provided by the excavation.

The treatment consists firstly of transforming this table of counts into a table of percentages, the total number in each set (each row) being reduced to 100; it is thus the proportions, or frequencies, of the types in the sets that are compared.

The display of positive deviations from the mean percentages (EPPM) shows in black, on a lighter background, the percentage shares that are higher than the mean percentage of the variable, so as to highlight the most significant part of the values in the table. This display is simply adapted to the seriograph: when a percentage is greater than the average percentage of the type, the excess share (called here EPPM: positive deviation from the average percentage) is shown in black, centred around the axis of the type, on the grey background of the percentage bar.

The table is then transformed into a graphic matrix where these percentages are expressed, for each type, by horizontal bars centred on the column axis. When the rows are ordered chronologically, the silhouette formed by the superposition of these frequency bars makes it possible to visualise the evolution over time of the type concerned.

The processing technique applies to sets whose chronological order is not known; the lines of the graph are to be reorganised so as to obtain boat-shaped silhouettes following the hypothesis of a chronological evolution corresponding to the seriation model.

Timerange plot

Shows the different time ranges of the observations for each cluster (you have to import time range data).

Correspondence analysis

You can see the different planes of the correspondence analysis used in the construction of the first dissimilarity matrix D1.


References



Rousseeuw, P.J. (1987). “Silhouettes: a graphical aid to the interpretation and validation of cluster analysis”. In : J. Comput. Appl. Math. 20, 53–65. ⟨doi:10.1016/0377-0427(87)90125-7⟩.

Ward, J. H., Jr. (1963). "Hierarchical Grouping to Optimize an Objective Function". Journal of the American Statistical Association, 58, 236–244.

Sokal R. and Michener C. (1958). "A statistical method for evaluating systematic relationships". University of Kansas Science Bulletin. 38: 1409–1438.

Desachy B. (2004). "Le sériographe EPPM : un outil informatisé de sériation graphique pour tableaux de comptages". In: Revue archéologique de Picardie, n°3-4, 2004. Céramiques domestiques et terres cuites architecturales. Actes des journées d'étude d'Amiens (2001-2002-2003), pp. 39-56. ⟨doi:10.3406/pica.2004.2396⟩.