Get started with the hclustcompro application
Table of contents
Introduction
Hierarchical bottom-up classification method with constraints. The method uses two sources of information.
hclustcompro
The merging of the two data sources is controlled by a parameter (alpha) that weights each source.
The first one is a contingency table. Rows must be individuals (archaeological sites, ...)
and columns must be categories (types, ...).
The second concerns the same individuals and can be either stratigraphic connections or time range data.
The stratigraphic connection object is a data frame with two columns. The first column contains the network elements
(the same number as the number of rows of D1) and the second column contains a
list of all the other elements connected to it. The list is a string composed of the names of the
elements separated by commas.
The data frame for time range data contains the same first column. The second column contains the
lower temporal bound and the third column contains the upper temporal bound.
Construction of D1
Our first source of information is a contingency table. We need to construct a distance matrix from this table. We perform a correspondence analysis (CA) on the contingency table and then use the distances of the components (chi-square metric).
You must choose the number of axes (CA components) to be used to construct the distance matrix.
Examination of the eigenvalues makes it possible to determine the number of principal axes to be considered. The eigenvalues correspond to the amount of information retained by each axis.
A heuristic method is often used: one constructs the graph of eigenvalues sorted in descending order and retains the eigenvalues (and associated principal components) that precede the first "elbow".
Another approach is based on maintaining an approximation quality of the initial data, measured by the explained inertia rate (e.g. 85%). As many axes as necessary are chosen so that the sum of the corresponding eigenvalues exceeds the targeted inertia rate.
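The explained-inertia rule can be sketched as follows. This is an illustrative Python snippet with made-up eigenvalues, not code from the application, which performs the CA itself:

```python
def n_axes(eigenvalues, target=0.85):
    """Smallest number of leading axes whose cumulative explained
    inertia reaches the target rate (e.g. 85%)."""
    total = sum(eigenvalues)
    cumulative = 0.0
    for k, ev in enumerate(eigenvalues, start=1):
        cumulative += ev
        if cumulative / total >= target:
            return k
    return len(eigenvalues)

# Hypothetical eigenvalues, sorted in descending order:
eigenvalues = [0.48, 0.27, 0.12, 0.08, 0.05]
print(n_axes(eigenvalues))   # 3 axes reach 87% of the inertia
```

The same cumulative sums, read on a scree plot, are what the elbow heuristic inspects visually.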
Construction of D2 for stratigraphic connection
From this dataset we construct an adjacency matrix. Based on this matrix we generate
a dissimilarity matrix.
The matrix contains only 0s and 1s: 1 if there is no relationship and 0 if there is a relationship.
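As a sketch (illustrative Python mirroring the two-column data-frame layout described earlier; not the package's own code):

```python
def strat_dissimilarity(network):
    """network: list of (element, "comma,separated,neighbours") rows,
    as in the two-column stratigraphic data frame."""
    names = [row[0] for row in network]
    index = {name: i for i, name in enumerate(names)}
    n = len(names)
    # Start from "no relationship" everywhere (dissimilarity 1) ...
    D = [[1] * n for _ in range(n)]
    for i, (_, links) in enumerate(network):
        D[i][i] = 0
        # ... then set 0 wherever a stratigraphic connection exists.
        for other in filter(None, (s.strip() for s in links.split(","))):
            j = index[other]
            D[i][j] = D[j][i] = 0
    return D

network = [("A", "B"), ("B", "A,C"), ("C", "B")]
print(strat_dissimilarity(network))   # [[0, 0, 1], [0, 0, 0], [1, 0, 0]]
```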
Construction of D2 for time range data
The overlap index is the ratio between the overlap (or separation) and the cumulative extent of both individuals. We define the
cumulative extent of both as follows: the interval from the minimum of the lower limits of the pair of individuals to the
maximum of the upper limits. We define the overlap (or separation) as follows: the interval from the maximum of
the lower limits to the minimum of the upper limits.
From this ratio we construct a dissimilarity by scaling the value between 0 and 1 and ensuring that the closer the value is to 1, the further apart the time ranges are.
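A minimal sketch of this computation (Python for illustration; the final rescaling to [0, 1) shown here is one plausible choice and not necessarily the exact formula used by the package):

```python
def overlap_dissimilarity(range1, range2):
    lo1, hi1 = range1
    lo2, hi2 = range2
    extent = max(hi1, hi2) - min(lo1, lo2)    # cumulative extent of both
    overlap = min(hi1, hi2) - max(lo1, lo2)   # > 0: overlap, < 0: separation
    ratio = overlap / extent                  # in (-1, 1]
    # Illustrative rescaling (assumption): identical ranges -> 0,
    # far-apart ranges -> close to 1.
    return (1 - ratio) / 2

print(overlap_dissimilarity((0, 100), (0, 100)))   # 0.0 (identical ranges)
print(overlap_dissimilarity((0, 10), (90, 100)))   # 0.9 (far apart)
```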
Correlation Criterion
A criterion for choosing alpha in [0;1] must be determined by balancing the weights between the two
information sources in the final classification. To obtain alpha, we define the following criterion:

CorCrit_alpha = | Cor(coph(alpha), D1) - Cor(coph(alpha), D2) |    (1)

where coph(alpha) denotes the ultrametric (cophenetic) distances of the dendrogram built with alpha fixed.
The CorCrit_alpha criterion in (1) represents the difference in absolute value between two cophenetic
correlations (cophenetic correlation is defined as the correlation between two distance matrices.
It is calculated by considering the half distance matrices as vectors. It measures how faithfully a
dendrogram preserves the pairwise distances between the original unmodeled data points).
The first correlation is associated with the comparison between D1 and the ultrametric distances from the HAC
with alpha fixed, while the second compares D2 and the ultrametric distances from the HAC with alpha fixed.
Then, in order to compromise between the information provided by D1 and D2, we estimate alpha
with hat(alpha) such that:

hat(alpha) = argmin over alpha in [0;1] of CorCrit_alpha
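The grid search over alpha described above (the absolute difference between the two cophenetic correlations, minimized over a grid) can be sketched in pure Python with toy data. This uses a naive average-linkage HAC for illustration; the aggregation method and data are assumptions, not the package's internals:

```python
import math
from itertools import combinations

def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def cophenetic(D, n):
    """Naive average-linkage HAC on the distance dict D
    (frozenset({i, j}) -> distance); returns cophenetic distances."""
    clusters = [frozenset([i]) for i in range(n)]
    coph = {}
    while len(clusters) > 1:
        best = None
        for A, B in combinations(clusters, 2):
            d = sum(D[frozenset({i, j})] for i in A for j in B) / (len(A) * len(B))
            if best is None or d < best[0]:
                best = (d, A, B)
        d, A, B = best
        for i in A:
            for j in B:
                coph[frozenset({i, j})] = d   # merge height = cophenetic distance
        clusters = [c for c in clusters if c not in (A, B)] + [A | B]
    return coph

def select_alpha(D1, D2, n, steps=101):
    pairs = [frozenset(p) for p in combinations(range(n), 2)]
    v1 = [D1[p] for p in pairs]
    v2 = [D2[p] for p in pairs]
    best = None
    for k in range(steps):
        alpha = k / (steps - 1)
        D = {p: alpha * D1[p] + (1 - alpha) * D2[p] for p in pairs}
        coph = cophenetic(D, n)
        vc = [coph[p] for p in pairs]
        crit = abs(pearson(vc, v1) - pearson(vc, v2))   # CorCrit_alpha
        if best is None or crit < best[0]:
            best = (crit, alpha)
    return best[1]

# Toy dissimilarities on 4 individuals (made-up values):
D1 = {frozenset(p): d for p, d in zip(combinations(range(4), 2),
                                      [0.1, 0.5, 0.9, 0.4, 0.8, 0.3])}
D2 = {frozenset(p): d for p, d in zip(combinations(range(4), 2),
                                      [0.7, 0.2, 0.4, 0.6, 0.3, 0.5])}
alpha_hat = select_alpha(D1, D2, 4)
print(0.0 <= alpha_hat <= 1.0)   # True
```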
The inputs
Alpha
The slider input allows you to choose the value of the mixing parameter. You select the weight to give to the
first dissimilarity matrix (alpha) and to the second (1-alpha). By default the value is estimated using the Correlation
Criterion defined above.
Number of classes
The slider input defines the number of clusters you want. To help the decision you can look at
the WSS plot and the silhouette information.
Aggregation method
The hierarchical bottom-up classification can use several aggregation methods during the construction of
the dendrogram. You can choose any of the methods except those with a risk of inversion
(if you know what you are doing, you can use the package version for that).
Ward.D2
Ward's method is a criterion applied in hierarchical cluster analysis. Ward's minimum variance method
is a special case of the objective function approach originally presented by Joe H. Ward, Jr.
Ward suggested a general agglomerative hierarchical clustering procedure, where the criterion for choosing
the pair of clusters to merge at each step is based on the optimal value of an objective function.
complete
At each step, the two clusters separated by the shortest distance are combined. In complete-linkage clustering,
the link between two clusters contains all element pairs, and the distance between clusters equals the distance
between those two elements (one in each cluster) that are farthest away from each other. The shortest of these
links that remains at any step causes the fusion of the two clusters whose elements are involved.
single
At each step, the two clusters that contain the closest pair of elements not yet belonging to the same
cluster are combined. The distance between two clusters is determined by a single element pair, namely those
two elements (one in each cluster) that are closest to each other. The shortest of these links that remains at
any step causes the fusion of the two clusters whose elements are involved.
average (UPGMA)
UPGMA (unweighted pair group method with arithmetic mean) is a simple agglomerative (bottom-up)
hierarchical clustering method. The method is generally attributed to Sokal and Michener.
The UPGMA method is similar to its weighted variant, the WPGMA method. Note that the unweighted term
indicates that all distances contribute equally to each average that is computed and does not refer to the math
by which it is achieved.
At each step, the nearest two clusters are combined into a higher-level cluster. The distance between any
two clusters A and B, each of size (i.e., cardinality) | A | and | B |, is taken to be the average of all
distances d ( x , y ) between pairs of objects x in A and y in B, that is, the mean distance between elements
of each cluster.
mcquitty (WPGMA)
WPGMA (Weighted Pair Group Method with Arithmetic Mean) is a simple agglomerative (bottom-up)
hierarchical clustering method, generally attributed to Sokal and Michener.
The WPGMA method is similar to its unweighted variant, the UPGMA method.
At each step, the nearest two clusters, say A and B, are combined into a higher-level cluster A ∪ B. Then
its distance to another cluster C, d(A ∪ B, C), is simply the arithmetic mean of the distances d(A, C)
and d(B, C).
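The difference between the unweighted (UPGMA) and weighted (WPGMA) distance updates can be made concrete with a short sketch (Python for illustration; n_a and n_b are the sizes of the merged clusters A and B):

```python
def upgma_update(d_ac, d_bc, n_a, n_b):
    """Distance from the merged cluster A ∪ B to C under UPGMA:
    every original pairwise distance contributes equally, so the
    cluster-to-cluster distances are weighted by cluster size."""
    return (n_a * d_ac + n_b * d_bc) / (n_a + n_b)

def wpgma_update(d_ac, d_bc):
    """Distance from A ∪ B to C under WPGMA (mcquitty): the two
    cluster-to-cluster distances are averaged, whatever the sizes."""
    return (d_ac + d_bc) / 2

# With unequal cluster sizes the two schemes differ:
print(upgma_update(1.0, 3.0, n_a=3, n_b=1))   # 1.5
print(wpgma_update(1.0, 3.0))                 # 2.0
```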
Subdivide
You can subdivide a cluster. Click the button, select the cluster you want to subdivide and choose the number of subclusters.
You can reset the dendrogram with the reset button next to it.
Seriation
This input allows you to enable or disable the seriation of the columns. Matrix permutation uses an algorithm
called "reciprocal averaging". Each row is assigned a rank ranging from 1 to n, the number of rows.
A barycentre is then calculated for each column by weighting according to the row ranks. Finally,
the columns are reorganized by sorting them by their barycentre.
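One pass of this column reordering can be sketched as follows (illustrative Python with a made-up table; the app applies the algorithm itself):

```python
def seriate_columns(table):
    """Reorder column indices by their barycentre: each row gets rank
    1..n, and each column's barycentre is its rank-weighted mean."""
    n = len(table)
    ranks = range(1, n + 1)
    barycentres = []
    for j in range(len(table[0])):
        column = [table[i][j] for i in range(n)]
        total = sum(column)
        barycentres.append(
            sum(r * v for r, v in zip(ranks, column)) / total if total else 0.0
        )
    return sorted(range(len(table[0])), key=lambda j: barycentres[j])

# Category 0 dominates early rows, category 1 late rows:
table = [[8, 1],
         [5, 4],
         [1, 9]]
print(seriate_columns(table))   # [0, 1]: column 0 has the lower barycentre
```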
Weight color indicator
This input allows you to enable or disable the coloration of the weight column in order to highlight
the confidence related to the quantity of data.
Visualization
This input lets you choose which element to plot. There are three options: plot
the positive deviation from the average percentage (EPPM in French), plot the frequency, or plot both.
The average percentage is calculated for each category (column) on the total number of counts
(all classes combined).
From the average percentage we recover, for each category and for each row, the difference between
the percentage of the
category in the class and the average percentage. The EPPM corresponds to the notion of independence deviation
(between rows and columns, between categories and time classes) in a chi-square test approach.
Although this approach is fundamental in statistical analysis, independence deviations are here purely indicative
and are not associated with a p-value that could determine the significance of the deviations.
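The EPPM computation described above can be sketched per cell (illustrative Python with a toy contingency table; not the application's code):

```python
def eppm(table):
    """Positive deviation from the average percentage, per cell.
    Row percentages are compared with each column's mean percentage
    computed on the grand total (all classes combined)."""
    grand_total = sum(sum(row) for row in table)
    n_cols = len(table[0])
    column_totals = [sum(row[j] for row in table) for j in range(n_cols)]
    mean_pct = [100 * t / grand_total for t in column_totals]
    result = []
    for row in table:
        row_total = sum(row)
        row_pct = [100 * v / row_total for v in row]
        # Keep only the share exceeding the column's mean percentage.
        result.append([max(0.0, p - m) for p, m in zip(row_pct, mean_pct)])
    return result

table = [[10, 0],
         [0, 10]]
print(eppm(table))   # [[50.0, 0.0], [0.0, 50.0]]
```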
Sort clusters
The rows are initially in the order of appearance on the dendrogram. It is possible to re-order
the classes in a temporal way. In the interface you can drag and drop the classes (green) to change the order.
You can add a hiatus between two clusters and also remove a cluster by dragging and dropping it into the remove section.
Import your data
You can import your data. There are two steps. You have to upload a first csv file for the first source of information.
The data must be a contingency table.
The second step consists of uploading the second source. You can choose between two data types
(stratigraphic connections or time range data). The stratigraphic connection object is a data frame with two columns. The first column contains
the elements (the same number as the number of rows of the contingency table) and the second column
contains a list of all the other elements connected to it.
The list is a string composed of the names (those in column one) of the elements separated by commas.
The data frame for time range data contains the same first column.
The second column contains the lower temporal bound and the third column contains the upper temporal bound.
The settings allow you to import different data frame organizations (header, column separator, ...).
Header
Yes or no option. Do you have headers on your columns?
Rownames
Yes or no option. Do you have row names on your rows?
Separator
Choose the character used to separate the columns.
Quote
Choose the quote character used for strings.
Decimal
Choose the character used to indicate the decimal point.
CSV Format and write.table
It is a data.frame with columns separated by a comma "," or a semicolon ";".
The input format for importing data is the ".csv" format, but the ".txt" format is also supported as a csv file.
In R you can export your data frame into a csv file using write.csv2 or write.table.
In a csv you can choose the character used to separate the columns.
In the same way, you can define the character that indicates the decimal point.
write.table(data, file = "path/to/name_file.csv", sep = ";", dec = ".", row.names = FALSE, quote = FALSE)
In Excel you can save as CSV format in order to import your data frame.
The import interface allows you to set up these values with the "header",
"decimal", "separator" and "quote" options.
The outputs
Evaluation Plot
It is essential to be able to evaluate the different partitions of the hierarchy in order to identify
the one(s) that is (are) most relevant. The number k of clusters of the partition to be retained is based
in our case on several indicators calculated for different values of k: the total within-class sum of squares (WSS)
and the global average of the silhouette widths.
During the execution of the hclustcompro method you have to make a choice: you have to cut the dendrogram.
This operation selects the partition. In order to compare all the possibilities you can look at the evaluation plots
(WSSPlot and AveSilPlot). These two plots evaluate the relative quality of the partitions.
Within Sum of Square Plot (WSSPlot)
This is the plot of the within-groups sum of squares against the number of clusters.
The within sum of squares decreases as the number of clusters increases.
In this plot the best partition is reached when adding one or more clusters no longer decreases the WSS value appreciably.
This is called the elbow method.
Example:
On this graph we start by looking at the value for the lowest number of classes: 2.
If we add a third class we see that the WSS value decreases (from 2.4 to 1.8). If we add another class
the value decreases again (from 1.8 to 1.4). We are looking for the moment where adding a class is
no longer worthwhile. In this case, we can keep a partition with 8 or 9 classes.
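The quantity plotted can be sketched directly (illustrative Python on toy 2-D points; the app computes WSS from its own distances):

```python
from collections import defaultdict

def wss(points, labels):
    """Total within-class sum of squares: squared distances of each
    point to its cluster centroid, summed over clusters."""
    clusters = defaultdict(list)
    for point, label in zip(points, labels):
        clusters[label].append(point)
    total = 0.0
    for members in clusters.values():
        dim = len(members[0])
        centroid = [sum(p[d] for p in members) / len(members) for d in range(dim)]
        total += sum((p[d] - centroid[d]) ** 2 for p in members for d in range(dim))
    return total

points = [(0, 0), (0, 2), (10, 0), (10, 2)]
print(wss(points, [1, 1, 2, 2]))   # 4.0   (two tight clusters)
print(wss(points, [1, 1, 1, 1]))   # 104.0 (one cluster: far higher WSS)
```

Comparing these values for successive numbers of clusters is exactly what the elbow reading does.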
Average silhouette Plot
This graph shows the average silhouette width of each partition (ROUSSEEUW 1987).
The silhouette width is an index bounded between -1 and 1, which is calculated for each observation.
The closer the value is to 1, the better the observation is classified. We look for the partition
whose average value is closest to 1.
Example:
On this graph we look for the maximum value. The best evaluation corresponds to the division into 9 classes.
Looking at the second best partition we identify the one with 10 classes.
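The average silhouette width can be sketched from its definition (illustrative Python on toy 2-D points with Euclidean distances; an assumption, since the app uses its own mixed dissimilarities):

```python
import math

def average_silhouette(points, labels):
    """Mean silhouette width s(i) = (b - a) / max(a, b) over all points,
    where a is the mean distance to the point's own cluster and b the
    smallest mean distance to another cluster."""
    widths = []
    for i, (p, li) in enumerate(zip(points, labels)):
        own = [q for j, (q, lj) in enumerate(zip(points, labels))
               if lj == li and j != i]
        a = sum(math.dist(p, q) for q in own) / len(own)
        b = min(
            sum(math.dist(p, q) for q, lq in zip(points, labels) if lq == l)
            / labels.count(l)
            for l in set(labels) if l != li
        )
        widths.append((b - a) / max(a, b))
    return sum(widths) / len(widths)

# Two well-separated clusters score close to 1:
points = [(0, 0), (0, 1), (10, 0), (10, 1)]
print(round(average_silhouette(points, [1, 1, 2, 2]), 2))   # 0.9
```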
Select alpha
This plot shows the Correlation Criterion for each value of alpha.
Dendrogram
The dendrogram is the result of the clustering process. You can see the clusters and subclusters.
Seriograph
We have chosen the seriograph (DESACHY 2004). This tool makes it possible to highlight the evolution of ceramics over time as well as to understand commercial relations thanks to imported ceramics. The percentages of each category of ceramics per set are displayed. The percentages are calculated independently for each set (row). The display of the percentages allows comparison of the different sets but does not provide information on the differences in counts. To fill this gap, the proportion of the counts in each class is displayed on the seriograph (weight column).
We have generalized this representation for other contingency data with hclustcompro object.
In order to facilitate the exploitation of the data tables, we propose here a computerised graphic processing tool (the EPPM seriograph - for Ecart Positif aux Pourcentages Moyens, positive deviation from the average percentage), which does not require specialised statistical skills and is adapted to the case of stratified sites, where the study of the evolution of artefacts can be based on the relative chronology provided by the excavation.
The treatment consists firstly of transforming this table of counts into a table of percentages, the total number in each set (each row) being reduced to 100; it is thus the proportions, or frequencies, of the types in the sets that are compared.
The display of positive deviations from the mean percentages (EPPM) shows in black, on a lighter background, the percentage shares that are higher than the mean percentage of the variable, so as to highlight the most significant part of the values in the table. This display is simply adapted to the seriograph: when a percentage is greater than the average percentage of the type, the excess share (called here EPPM: positive deviation from the average percentage) is shown in black, centred around the axis of the type, on the grey background of the percentage bar.
The table is then transformed into a graphic matrix where these percentages are expressed, for each type, by horizontal bars centred on the column axis. When the rows are ordered chronologically, the silhouette formed by the superposition of these frequency bars makes it possible to visualise the evolution over time of the type concerned.
The processing technique applies to sets whose chronological order is not known; the lines of the graph are to be reorganised so as to obtain boat-shaped silhouettes following the hypothesis of a chronological evolution corresponding to the seriation model.
Timerange plot
Shows the different time ranges of the observations for each cluster (you have to import time range data).
Correspondences analysis
You can see the different planes of the correspondence analysis used in the construction of the first dissimilarity matrix D1.
References
Rousseeuw, P.J. (1987). “Silhouettes: a graphical aid to the interpretation and validation of cluster analysis”.
In : J. Comput. Appl. Math. 20, 53–65. ⟨doi:10.1016/0377-0427(87)90125-7⟩.
Ward, J. H., Jr. (1963). "Hierarchical Grouping to Optimize an Objective Function".
Journal of the American Statistical Association, 58, 236–244.
Sokal R. and Michener C. (1958). "A statistical method for evaluating systematic
relationships". University of Kansas Science Bulletin. 38: 1409–1438.
Desachy B. (2004). "Le sériographe EPPM : un outil informatisé de sériation graphique pour tableaux de comptages". In: Revue archéologique de Picardie, n°3-4, 2004. Céramiques domestiques et terres cuites architecturales. Actes des journées d'étude d'Amiens (2001-2002-2003). pp. 39-56. ⟨doi:10.3406/pica.2004.2396⟩.