
The problem of taking a set of data and separating it into subgroups, where the elements of each subgroup are more similar to each other than they are to elements outside it, has been extensively studied through the statistical method of cluster analysis. In this paper we discuss the application of this method to the field of education: in particular, we present the use of cluster analysis to separate students into groups that can be recognized and characterized by common traits in their answers to a questionnaire, without any prior knowledge of what form those groups will take (unsupervised classification). We start from a detailed study of the data processing required by cluster analysis. Two methods commonly used in cluster analysis are then described, first from a theoretical point of view and then, in Section 4, through an example of application to data coming from an open-ended questionnaire administered to a sample of university students. In particular, we describe and critically discuss the variables and parameters used to present the results of the cluster analysis methods.

Many quantitative and qualitative research studies involving open- and closed-ended questionnaire analysis have provided instructors/teachers with tools to investigate students’ conceptual knowledge of various fields of physics. Many of these studies examined the consistency of students’ answers in a variety of situations [

The problem of separating a group of students into subgroups, where the elements of each subgroup are more similar to each other than they are to elements not in the subgroup, has been studied through the methods of Cluster Analysis (ClA), but the various available techniques have rarely been examined in depth to reveal their strengths and weaknesses. ClA can separate students into groups that can be recognized and characterized by common traits in their answers, without any prior knowledge of what form those groups would take (unsupervised classification [

ClA, introduced in Psychology by R.C. Tryon in 1939 [

ClA techniques [

Some studies using ClA methods are found in the literature concerning research in education. They group and characterize students’ responses by using open-ended questionnaires [

ClA can be carried out using various algorithms that differ significantly in their notion of what constitutes a cluster and in how to find clusters effectively. Moreover, a deep analysis of the applied ClA procedures is needed, because they often include approximations that strongly influence the interpretation of results. For this reason, in this paper we start from a detailed analysis of the data setting needed by ClA. Then, two methods commonly used in ClA are described, and the variables and parameters involved are outlined and critically discussed. Section IV deals with an example of application of these methods to the analysis of data from an open-ended questionnaire administered to a sample of university students, and discusses the significance and validity of the information that can be obtained by using the two different solutions to clustering. Finally, the results obtained by the two methods are compared in order to assess their coherence.

The application of ClA methods to answers to a closed-ended questionnaire poses no particular difficulty in the classification of student answers, as the categories can be taken to be the answer options themselves.

On the other hand, research in education that uses open-ended questions and aims at performing a quantitative analysis of student answers usually involves the prior development of coding procedures aimed at categorizing student answers in a limited number of “typical” ways to answer each question. However, it is well known that there are inherent difficulties in the classification and coding of student responses. Hammer & Berland [

On the basis of the approach previously described, the logical steps that a researcher can follow to process data coming from student answers to an open- or closed-ended questionnaire can be summarized by the flow chart represented in

discussion among researchers, these themes are then developed and grouped into a number of categories whose definitions take into account, as much as possible, the words, phrases and wording used by students [

At the end of this phase, the whole set of answers given by students to the open-ended questionnaire is grouped into a limited number, M, of typical answers, i.e. the student answering strategies. M is obtained by adding up the numbers of answering strategies used by students in answering each question.

In the case of closed-ended questions, the preliminary analysis described above is often not necessary, as the answers to each question are already "classified" in a limited number of categories, namely the explicit options the respondent selects from.

The next step is common to both kinds of questionnaire and involves the binary coding of student answers, according to the defined categories: each student, i, is described by an array, a_{i}, composed of M components equal to 1 or 0, where 1 means that the student used a given answering strategy/answer option to respond to a question and 0 means that he/she did not use it. Then, an M × N binary matrix (the

| Strategy | S_{1} | S_{2} | … | S_{N} |
|---|---|---|---|---|
| AS_{1} | 0 | 0 | … | 0 |
| AS_{2} | 1 | 0 | … | 1 |
| AS_{3} | 1 | … | … | … |
| AS_{4} | 0 | … | … | … |
| AS_{5} | 1 | … | … | … |
| … | 0 | … | … | … |
| AS_{M} | 0 | 1 | … | 0 |

"matrix of answering strategies") can be built, modeled on the one shown above, whose columns represent the student arrays, a_{i}, and whose rows represent the M components of each array, i.e. the M answering strategies/answer options.

For example, let us say that student S_{1} used answering strategies AS_{2}, AS_{3} and AS_{5} to respond to the questionnaire questions. Therefore, column S_{1} in the matrix contains the value 1 in rows AS_{2}, AS_{3} and AS_{5}, and 0 elsewhere.
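The coding step just described can be sketched in a few lines; the strategy count and the indices below are purely illustrative, not the paper's actual data:

```python
import numpy as np

# Hypothetical illustration of the binary coding step: M = 7 answering
# strategies (AS_1 ... AS_7) and one student, S_1, who used AS_2, AS_3, AS_5.
M = 7
used_strategies = [2, 3, 5]          # 1-based strategy indices, as in the text

a_1 = np.zeros(M, dtype=int)
for s in used_strategies:
    a_1[s - 1] = 1                   # convert to 0-based array index

print(a_1)                           # [0 1 1 0 1 0 0]
```

Stacking one such array per student as columns yields the M × N matrix of answering strategies.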

The matrix depicted in

Particularly, ClA requires the definition of new quantities that are used to build the grouping, like the “similarity” or “distance” indexes. These indexes are defined by starting from the M × N binary matrix discussed above.

In the literature [ ] the similarity between two students, i and j, is often evaluated by defining a distance, d_{ij}, between them (which actually expresses their "dissimilarity", in the sense that a higher value of distance involves a lower similarity).

A distance index can be defined starting from Pearson's correlation coefficient, which allows the researcher to study the correlation between students i and j when the variables describing them are numerical. If these variables are non-numerical (as in our case, where we are dealing with the arrays a_{i} and a_{j} containing a binary symbolic coding of the answers of students i and j, respectively), we need to use a modified form of Pearson's correlation coefficient, R_{mod}, similar to that defined by Tumminello et al. [ ]. We define R_{mod} as

where p(a_{i}), p(a_{j}) are the numbers of properties of a_{i} and a_{j} explicitly present in our students (i.e. the numbers of 1's in the arrays a_{i} and a_{j}, respectively), M is the total number of properties to study (in our case, the answering strategies), and the last quantity appearing in the equation is the number of properties shared by a_{i} and a_{j} (i.e. the number of components equal to 1 in both arrays). Equation (1) thus quantifies the correlation between a_{i} and a_{j}.

By following Equation (1) it is possible to find, for each student, i, the N − 1 correlation coefficients R_{mod} between him/her and the other students (the correlation coefficient with him/herself is, clearly, 1). All these correlation coefficients can be placed in an N × N matrix that contains the information we need to discuss the mutual relationships between our students.

The similarity between students i and j can be defined by choosing a type of metric to calculate the distance d_{ij}. Such a choice is often complex and depends on many factors. If we want two students, represented by arrays a_{i} and a_{j} and negatively correlated, to be more dissimilar than two positively correlated ones, a possible definition of the distance between a_{i} and a_{j}, making use of the modified correlation coefficient, R_{mod}(a_{i}, a_{j}), is:

d_{ij} = √(2(1 − R_{mod}(a_{i}, a_{j})))    (2)

This function defines a Euclidean metric [ ]. A distance d_{ij} between two students equal to zero means that they are completely similar (R_{mod} = 1), while a distance d_{ij} = 2 shows that the students are completely dissimilar (R_{mod} = −1). When the correlation between two students is 0, their distance is √2.

By following Equation (2) we can then build a new N × N matrix, D (the distance matrix), containing all the mutual distances between the students. The main diagonal of D is composed of 0s (the distance between a student and him/herself is zero). Moreover, D is symmetrical with respect to the main diagonal.

Clustering Analysis methods can be roughly divided into Non-Hierarchical (or Centroid-Based) and Hierarchical ones (also known as connectivity-based clustering methods). The first category of methods basically partitions the data space into a structure known as a Voronoi Diagram (a number of regions including subsets of similar data). The second one is based on the core idea of building a binary tree of the data, in which the data are progressively merged into similar groups. This tree is a useful summary of the data, which are connected to form clusters based on their known distances, and it is sometimes referred to as a dendrogram.

A. Non-Hierarchical Clustering Analysis (NH-ClA)

Non-hierarchical clustering analysis is used to generate groupings of a sample of elements (in our case, students) by partitioning it and producing a smaller set of non-overlapping clusters with no hierarchical relationships between them. Among the currently used NH-ClA algorithms, we will consider k-means, which was first proposed by MacQueen in 1967 [

The starting point is the choice of the number, q, of clusters one wants to populate and of an equal number of "seed points", randomly selected in the bi-dimensional Cartesian space representing the data. The students are then grouped on the basis of the minimum distance between them and the seed points. Starting from this initial classification, students are iteratively transferred from one cluster to another or swapped with students from other clusters, until no further improvement can be made. The students belonging to a given cluster are used to find a new point, representing the average position of their spatial distribution. This is done for each cluster Cl_{k} (k = 1, …, q), obtaining its centroid, C_{k}. This process is repeated and ends when the new centroids coincide with the old ones. As we said above, the spatial distribution of the set elements is represented in a 2-dimensional Cartesian space, creating what is known as the k-means graph (see
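The iteration just described can be sketched as a minimal Lloyd-style k-means, assuming the students have already been mapped to 2-D points; the points and seeds below are illustrative, not the paper's data:

```python
import numpy as np

def k_means(points, q, seeds, n_iter=50):
    """Minimal k-means sketch: assign to nearest centroid, recompute centroids."""
    centroids = seeds.astype(float).copy()
    labels = np.zeros(len(points), dtype=int)
    for _ in range(n_iter):
        # assign each point to the nearest current centroid
        d = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # move each centroid to the average position of its cluster
        new = np.array([points[labels == k].mean(axis=0) if np.any(labels == k)
                        else centroids[k] for k in range(q)])
        if np.allclose(new, centroids):   # new centroids coincide with the old ones
            break
        centroids = new
    return labels, centroids

# two well-separated "student" groups in the plane (illustrative data)
pts = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1],
                [5.0, 5.0], [5.1, 5.0], [5.0, 5.1]])
seeds = np.array([[0.0, 0.0], [5.0, 5.0]])
labels, C = k_means(pts, 2, seeds)
```

With these data the algorithm converges after one centroid update, placing the first three points in one cluster and the last three in the other.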

NH-ClA has some weak points, and here we will describe how they can be overcome. The first involves the a priori choice of the initial positions of the centroids. This is usually resolved in the literature [

When the k-means clustering method is applied, in order to choose the number of clusters, q, to be initially used to perform the calculations, the so-called Silhouette Function, S, [

For each selected number of clusters, q, and for each sample student, i, assigned to a cluster k (with 1 ≤ k ≤ q), the silhouette value, S_{i}(q), is calculated as

where the first term of the numerator is the average distance of the i-th student in cluster k to l-th student placed in a different cluster p (

S_{i}(q) gives a measure of how similar student i is to the other students in its own cluster, when compared to students in other clusters. It ranges from −1 to +1: a value near +1 indicates that student i is well-matched to its own cluster, and poorly-matched to neighboring clusters. If most students have a high silhouette value, then the clustering solution is appropriate. If many students have a low or negative silhouette value, then the clustering solution could have either too many or too few clusters (i.e. the chosen number, q, of clusters should be modified).
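A minimal sketch of this computation from a precomputed distance matrix, following the standard silhouette definition S_i = (b_i − a_i)/max(a_i, b_i), where a_i is the mean distance of student i to its own cluster and b_i the smallest mean distance to another cluster (the matrix below is illustrative):

```python
import numpy as np

def silhouette_values(D, labels):
    """Silhouette value for each element, from distance matrix D and labels."""
    n = len(labels)
    S = np.zeros(n)
    for i in range(n):
        own = (labels == labels[i])
        own[i] = False                                  # exclude the student itself
        a = D[i, own].mean() if own.any() else 0.0      # mean distance within own cluster
        b = min(D[i, labels == c].mean()                # smallest mean distance to
                for c in np.unique(labels)              # any other cluster
                if c != labels[i])
        S[i] = (b - a) / max(a, b)
    return S

# tiny example: two tight, well-separated clusters
D = np.array([[0.0, 0.1, 1.0, 1.0],
              [0.1, 0.0, 1.0, 1.0],
              [1.0, 1.0, 0.0, 0.1],
              [1.0, 1.0, 0.1, 0.0]])
labels = np.array([0, 0, 1, 1])
S = silhouette_values(D, labels)      # every value is (1.0 - 0.1)/1.0 = 0.9
```

All four students sit close to +1, signalling that this 2-cluster solution matches the data.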

Subsequently, the values S_{i}(q) can be averaged over each cluster, k, to find the average silhouette value in the cluster,

The k-means results can be plotted in a 2-dimensional Cartesian space containing points that represent the students of the sample, placed in the plane according to their mutual distances. As we said before, for each student, i, we know the N distances, d_{ij}, between that student and all the students of the sample (with d_{ii} = 0). It is then necessary to define a procedure to find two Cartesian coordinates for each student, starting from these N distances. This procedure consists of a linear transformation between an N-dimensional vector space and a 2-dimensional one, and it is well known in the specialized literature as multidimensional scaling [
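Classical (Torgerson) scaling is one standard way to carry out this transformation; the paper does not specify which MDS variant was used, so the following is only a sketch under that assumption:

```python
import numpy as np

def classical_mds(D, dim=2):
    """Classical MDS: coordinates in `dim` dimensions from a distance matrix D."""
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n          # centering matrix
    B = -0.5 * J @ (D ** 2) @ J                  # double-centered squared distances
    w, V = np.linalg.eigh(B)                     # eigenvalues in ascending order
    idx = np.argsort(w)[::-1][:dim]              # keep the `dim` largest
    return V[:, idx] * np.sqrt(np.maximum(w[idx], 0.0))

# points that really live in a plane: MDS should recover their mutual distances
P = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
D = np.linalg.norm(P[:, None] - P[None, :], axis=2)
X = classical_mds(D)
D_rec = np.linalg.norm(X[:, None] - X[None, :], axis=2)
assert np.allclose(D, D_rec)                     # distances are preserved
```

For genuinely N-dimensional data the 2-D coordinates only approximate the distances, which is the usual price of the projection.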

First, three clusters (q = 3) are considered: cluster Cl_{1} is denser and more compact than the other ones.

It is interesting to study how well a centroid geometrically characterizes its cluster. Two parameters affect this: the cluster density and the number of its elements.

For this purpose, we propose a coefficient, r_{k}, defined as the centroid reliability. It is calculated as follows:

where n_{k} is the number of students contained in cluster Cl_{k}. High values of r_{k} indicate that the centroid characterizes the cluster well.

In order to compare the reliability values of different clusters in a given partition, the r_{k} values can be normalized according to the following formula

where ⟨r⟩ and σ_{r} are the mean value and the standard deviation of r_{k} over the different clusters, respectively.

Once the appropriate partition of data has been found, we want to characterize each cluster in terms of the most prominent answering strategies. Such characterizations will help us to compare clusters. To do this, we start by creating a "virtual student" for each of the q clusters, Cl_{k} (with 1 ≤ k ≤ q). Like the student arrays a_{i}, composed of 0 and 1 values for each of the M answering strategies, the array for the virtual student contains 0 for strategies that do not characterize Cl_{k} and 1 for strategies that do characterize Cl_{k}. It is possible to demonstrate that this array coincides with that of the centroid, C_{k}. In fact, since a centroid is defined as the geometric point that minimizes the sum of the distances between it and all the cluster elements, by minimizing this sum the correlation coefficients between the cluster elements and the virtual student are maximized, and this happens when each virtual student has the largest number of common strategies with all the students that are part of its cluster. This is a remarkable feature of the centroid, C_{k}.

Another way to find the array that describes the centroid of a cluster starts from the coordinates of the centroid in the 2-dimensional Cartesian Plane reporting the results of a k-means analysis. We devised a method that consists of repeating the k-means procedure in reverse, by using the iterative method described as follows. For each cluster, Cl_{k}, we define a random array

where _{k}) and

By using an iterative procedure that permutes the values of the random array, it is possible to obtain a good estimate of the array of the real centroid, C_{k}.

B. Hierarchical Clustering Analysis (H-ClA)

Hierarchical clustering is a method of cluster construction that starts from the idea that the elements of a set (again, students in our case) are more related to nearby students than to farther away ones, and tries to arrange the students by representing them as being "above", "below", or "at the same level as" one another. This method connects students to form clusters based on the presence of common characteristics. As it provides a hierarchy of clusters, which merge with each other at certain distances, the term "hierarchical clustering" has been used in the literature.

In H-ClA, which is sometimes used in education to analyze the answers given by students to open- and closed-ended questionnaires (see, for example, [

Among the many linkage methods described in the literature, the following have been taken into account in education studies: Single, Complete, Average and Weighted Average. Each method differs in how it measures the distance between two clusters r and s by means of the definition of a new metric (an “ultrametric”), and consequently influences the interpretation of the word “closest”. Single linkage, also called nearest neighbor linkage, links r and s by using the smallest distance between the students in r and those in s; complete linkage, also called farthest neighbor linkage, uses the largest distance between the students in r and the ones in s; average linkage uses the average distance between the students in the two clusters; weighted average linkage uses a recursive definition for the distance between two clusters. If cluster r was created by combining clusters p and q, the distance between r and another cluster s is defined as the average of the distance between p and s and the distance between q and s.
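The four rules can be summarized by how the distance from the merged cluster r = p ∪ q to another cluster s is updated; the Lance-Williams style helper below is an illustrative sketch, not the authors' implementation:

```python
def merge_distance(d_ps, d_qs, n_p, n_q, method):
    """Distance from r = p U q to s, given d(p,s), d(q,s) and cluster sizes."""
    if method == "single":            # nearest neighbour
        return min(d_ps, d_qs)
    if method == "complete":          # farthest neighbour
        return max(d_ps, d_qs)
    if method == "average":           # size-weighted mean over all student pairs
        return (n_p * d_ps + n_q * d_qs) / (n_p + n_q)
    if method == "weighted":          # simple average of the two distances
        return (d_ps + d_qs) / 2.0
    raise ValueError(method)

# Example: cluster p at distance 1.0 from s, cluster q at distance 3.0 from s,
# with n_p = 3 students and n_q = 1 student.
for m in ("single", "complete", "average", "weighted"):
    print(m, merge_distance(1.0, 3.0, 3, 1, m))
```

Note how average linkage (1.5) and weighted average linkage (2.0) disagree as soon as the merged clusters have different sizes, which is why the two methods only diverge for samples that are not too small.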

To better represent the differences and approximations involved in the various linkages, an example is displayed in

The analysis starts from the mutual distances, d_{ij}, represented in matrix D (see Section 2).

Suppose p and q are existing clusters and r is the cluster formed by merging them (r = p ∪ q). The distances between the elements of r and the elements of another cluster s are defined, for the four linkage methods, as shown below, where n_{r} indicates the number of students in cluster r, n_{s} the number of students in cluster s, x_{ri} is the i-th student in r and x_{sj} is the j-th student in s.

| Number of clusters (q) | Silhouette average value, S(q) (CI) | S(q)_{1} (CI) | S(q)_{2} (CI) | S(q)_{3} (CI) | S(q)_{4} (CI) |
|---|---|---|---|---|---|
| 3 | 0.795 (0.780 - 0.805) | 0.953 (0.951 - 0.956) | 0.79 (0.78 - 0.81) | 0.66 (0.63 - 0.68) | — |
| 4 | 0.729 (0.711 - 0.744) | 0.953 (0.951 - 0.956) | 0.67 (0.64 - 0.69) | 0.77 (0.74 - 0.79) | 0.44 (0.40 - 0.47) |


It is important to note that the differences between dendrograms obtained by using the average and the weighted average methods are evident only when the number of elements is not too low. Here, we report an example for a sample of 7 elements.

Several conditions can determine the choice of a specific linkage method. For instance, when the source data are in binary form (as in our case) the single and complete linkage methods do not give a smooth progression of the distances [

In the specialized literature it is easy to find numerical indexes driving the choice of a specific linkage method, such as the “cophenetic correlation coefficient” [

| Linkage method | Distance between clusters r and s |
|---|---|
| Single linkage | d(r, s) = min_{i,j} d(x_{ri}, x_{sj}) |
| Complete linkage | d(r, s) = max_{i,j} d(x_{ri}, x_{sj}) |
| Average linkage | d(r, s) = (1/(n_{r} n_{s})) Σ_{i=1..n_{r}} Σ_{j=1..n_{s}} d(x_{ri}, x_{sj}) |
| Weighted average linkage | d(r, s) = [d(p, s) + d(q, s)]/2, with r = p ∪ q |

| | A | B | C | D | E | F | G |
|---|---|---|---|---|---|---|---|
| A | 0 | 0.2 | 0.28 | 0.2 | 0.14 | 0.42 | 1.01 |
| B | | 0 | 0.2 | 0.28 | 0.14 | 0.42 | 1.01 |
| C | | | 0 | 0.2 | 0.14 | 0.42 | 1.01 |
| D | | | | 0 | 0.14 | 0.42 | 1.01 |
| E | | | | | 0 | 0.4 | 1 |
| F | | | | | | 0 | 1.4 |
| G | | | | | | | 0 |

The cophenetic correlation coefficient, c_{coph}, gives a measure of the concordance between the two matrices: matrix D of the distances and matrix Δ of the ultrametric distances. It is defined as

c_{coph} = Σ_{i<j} (d_{ij} − d̄)(δ_{ij} − δ̄) / √[Σ_{i<j} (d_{ij} − d̄)² · Σ_{i<j} (δ_{ij} − δ̄)²]

where:

・ d_{ij} is the distance between elements i and j in D.

・ δ_{ij} is the ultrametric distance between elements i and j in Δ, i.e. the height of the link at which the two elements i and j are first joined together.

・ d̄ and δ̄ are the average values of the d_{ij} and of the δ_{ij}, respectively.

High values of c_{coph} indicate that matrix Δ is actually representative of matrix D and, consequently, that the ultrametric distances, δ_{ij}, are representative of the distances, d_{ij}.

Its value is based on the correlation (like the Pearson one [
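A minimal sketch of this computation, taking the Pearson correlation between the two sets of pairwise distances (the matrices below are illustrative):

```python
import numpy as np

def cophenetic_corr(D, Delta):
    """Pearson correlation between original and ultrametric pairwise distances."""
    iu = np.triu_indices_from(D, k=1)        # pairs with i < j only
    d, delta = D[iu], Delta[iu]
    return np.corrcoef(d, delta)[0, 1]

# if the ultrametric distances reproduced the original ones exactly,
# the coefficient would be 1
D = np.array([[0.0, 1.0, 2.0],
              [1.0, 0.0, 1.5],
              [2.0, 1.5, 0.0]])
assert abs(cophenetic_corr(D, D.copy()) - 1.0) < 1e-9
```

In practice Δ comes from the chosen linkage method, and the linkage with the highest c_{coph} distorts the original distances the least.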

Reading a dendrogram and finding clusters in it can be a rather arbitrary process. There is no widely accepted criterion that can be applied in order to determine the distance values to be chosen for identifying the clusters. Different criteria, named stopping criteria, aimed at finding the optimal number of clusters, are discussed in the literature (see, for example, Springuel [

One way to decide if the grouping in a data set is adequate is to compare the height of each link in a cluster tree with the heights of neighboring links below it in the tree. A link that is approximately the same height as the links below it indicates that there are no distinct divisions between the objects joined at this level of the hierarchy. These links are said to exhibit a high level of consistency, because the distance between the objects being joined is approximately the same as the distances between the objects they contain. On the other hand, a link whose height differs noticeably from the height of the links below it indicates that the objects joined at this level in the cluster tree are much farther apart from each other than their components were when they were joined. This link is said to be inconsistent with the links below it.

The relative consistency of each link in a hierarchical cluster tree can be quantified through the inconsistency coefficient, I_{k} [

The inconsistency coefficient compares the height of each link in a cluster tree made of N elements with the heights of the neighboring links below it in the tree.

The calculations of inconsistency coefficients are performed on the matrix of the ultrametric distances, Δ, generated by the chosen linkage method.

We consider two clusters, s and t, whose distance value is reported in matrix Δ, and that converge in a new link, k, (with

where

This formula shows that a link whose height differs noticeably from the heights of the n links below it indicates that the objects joined at this level in the cluster tree are much farther apart from each other than their n components were. Such a link has a high value of I_{k}. On the contrary, if the link k is approximately the same height as the links below it, no distinct divisions between the objects joined at this level of the hierarchy can be identified. Such a link has a low value of I_{k}.

The higher the value of this coefficient, the less consistent the link connecting the students. A link that joins distinct clusters has a high inconsistency coefficient; a link that joins indistinct clusters has a low inconsistency coefficient.
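The comparison can be sketched as a z-score of the link height against the heights of the links included in the calculation; this is a simplified, illustrative version of the coefficient, not the exact formula used in the paper:

```python
import numpy as np

def inconsistency(link_height, lower_heights):
    """z-score of a link's height against itself plus the links just below it."""
    heights = np.array(lower_heights + [link_height])
    sd = heights.std(ddof=1)
    return 0.0 if sd == 0 else (link_height - heights.mean()) / sd

# a link much higher than those below it is inconsistent ...
high = inconsistency(1.8, [0.2, 0.25, 0.3])
# ... while a link at about the same height as those below it is not
low = inconsistency(0.3, [0.2, 0.25, 0.3])
```

Here `high` is well above 1 and `low` well below it, matching the qualitative reading of I_{k} given in the text.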

The choice of the I_{k} value to be considered significant in order to define a threshold is arbitrary, and involves the choice of the significant number of clusters that can describe the whole sample. Moreover, in the specialized literature [ ] the I_{k} value of a given link is considered by also taking into account the ultrametric distance of the link, in order to avoid a too low or too high fragmentation of the sample into clusters. This means that, after having disregarded the links that produce a too low fragmentation, the I_{k} values of the links just below are taken into account.

The variance ratio criterion (VRC) [

For a partition of N elements into q clusters, the VRC value is defined as:

VRC = [BGSS/(q − 1)] / [WGSS/(N − q)]

where WGSS (Within Group Squared Sum) represents the sum of the distance squares between the elements belonging to a same cluster and BGSS (Between Group Squared Sum), defines the sum of the distance squares between elements of a given cluster group and the external ones.

It thus measures the ratio between the dispersion of the elements between different clusters and the dispersion of the elements within the same cluster. The larger the VRC value, the better the clustering.
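A sketch of a variance-ratio style index of this kind (Calinski-Harabasz form, computed on coordinates; the exact squared sums used in the paper are an assumption here, and the data are illustrative):

```python
import numpy as np

def vrc(points, labels, q):
    """Variance-ratio style index: between-group over within-group dispersion."""
    n = len(points)
    grand = points.mean(axis=0)
    wgss = bgss = 0.0
    for k in np.unique(labels):
        cluster = points[labels == k]
        c = cluster.mean(axis=0)
        wgss += ((cluster - c) ** 2).sum()              # within-group squared sum
        bgss += len(cluster) * ((c - grand) ** 2).sum() # between-group squared sum
    return (bgss / (q - 1)) / (wgss / (n - q))

pts = np.array([[0.0, 0.0], [0.2, 0.0], [5.0, 5.0], [5.2, 5.0]])
good = vrc(pts, np.array([0, 0, 1, 1]), 2)   # matches the real structure
bad = vrc(pts, np.array([0, 1, 0, 1]), 2)    # mixes the two groups
assert good > bad
```

The labelling that respects the two obvious groups scores orders of magnitude higher than the one that mixes them.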

It is worth noting that the evaluation of the number of clusters to be considered significant for education-focused research should also be influenced by pedagogic considerations, related to the interpretation of the clusters that are formed. Although it could be desirable to have a fine-grained description of our sample students, this can make the search for common trends in the sample too complicated, and perhaps less interesting, if too many "micro-behaviors" related to the various small clusters are found and have to be explained.

As a final consideration, we want to point out that the comparison of different clustering methods (in our case NH-ClA and H-ClA methods) is a relevant point. As Meila et al. [

In the following sections we will present an application of the described ClA procedures to the analysis of data from an open-ended questionnaire administered to a sample of university students, and we will discuss the results of the application on these data of the two methods of Cluster Analysis we outlined above. Similarly to what we have done in

Each set of clusters that is obtained by means of H-ClA and/or NH-ClA is interpreted on the basis of the answering strategy set (as explained in Section IV) and these interpretations, together with a possible comparison of the results obtained by the two methods, leads us to the final results of the study.

In this Section we want to describe an application of the methods discussed above to the analysis of the answers to a questionnaire composed of 4 open-ended questions, each with 5 possible answering strategies resulting from the preliminary analysis discussed in Section 2. 124 students participated in the survey and completed the questionnaire.

A. Non-hierarchical clustering analysis (NH-ClA)

All the clustering calculations were made using custom software, written in the C language, for NH-ClA (the k-means method) as well as for H-ClA, where the weighted average linkage method was applied. The graphical representations of clusters in both cases were obtained using the well-known software MATLAB [

In order to define the number q of clusters that best partitions our sample, the mean value of the S-function was computed for different values of q. The figure shows that the best partition of our sample is achieved by choosing four clusters, where

The four clusters Cl_{k} (k = 1, …, 4) are each characterized by a centroid, C_{k}. They are the four points in the graph whose arrays,

The figure shows that cluster Cl_{1} is denser than the others, and Cl_{4} is the most spread out. Furthermore, the values of the normalized centroid reliability show that C_{1} best represents its cluster, whereas C_{3} is the centroid that represents its cluster the least.

B. Hierarchical clustering analysis (H-ClA)

In order to apply the H-ClA method to our data, we first had to choose what kind of linkage to use. Since we could not use the single or complete linkages (see Section 3(B)), we calculated the cophenetic correlation coefficient for the average and weighted average linkages, which gives a measure of the agreement between the distances calculated by (2) and the ultrametric distances introduced by the linkage. We obtained the values 0.61 and 0.68 for the average and weighted average linkages, respectively. We chose to use

| Cluster centroid | C_{1} | C_{2} | C_{3} | C_{4} |
|---|---|---|---|---|
| | 1B, 2C, 3B, 4A | 1B, 2B, 3E, 4A | 1C, 2B, 3A, 4A | 1C, 2C, 3B, 4B |
| Number of students | 18 | 19 | 63 | 24 |
| | 0.750, CI = (0.730, 0.763) | 0.62, CI = (0.58, 0.64) | 0.604, CI = (0.590, 0.616) | 0.56, CI = (0.53, 0.58) |
| | 1.40 | −0.02 | −0.92 | −0.46 |

the weighted average link and

In this figure the vertical axis represents the ultrametric distance between two clusters when they are joined; the horizontal axis is divided into 124 ticks, each representing a student. Furthermore, vertical lines represent students or groups of students and horizontal lines represent the joining of two clusters. Vertical lines are always placed in the center of the group of students in a cluster, and horizontal lines are placed at the height corresponding to the distance between the two clusters that they join.

By describing the cluster tree from the top down, as if the clusters were splitting apart, we can see that all the students come together into a single cluster, located at the top of the figure. By using the inconsistency coefficient, I_{k} (see Section III), we can define a specific threshold and neglect some links because they are inconsistent. In fact, this coefficient characterizes each link in a cluster tree by comparing its height with the average height of other links at the same level of the hierarchy. The choice of the threshold is arbitrary and should be limited to the links in a specific range of distances [

If we disregard the higher links (δ ≈ 1.8, shown as black, dashed links in the dendrogram) by setting a threshold I_{k} > 1.6, we can accept all the links just below, including the red, dotted ones (with I_{k} equal to 1.4 and 1.6, respectively). So, we find a partition of our sample into 4 clusters. If, on the other hand, we introduce a lower threshold for the I_{k} value, still not producing a too high fragmentation, for example I_{k} > 1.25, we must disregard the dotted links in the dendrogram in

In order to verify the validity of our choice we also used the VRC (see Section 3).

The comparison of the two partitions shows a splitting of the students grouped in cluster Cl_{3} by NH-ClA into different subgroups, and a redistribution of the students located on the edges of cluster Cl_{4}. Furthermore, the students in cluster Cl_{1} are all located in cluster β and the students in cluster Cl_{2} are all located in cluster γ. This is in accordance with the high values of the normalized centroid reliability for clusters Cl_{1} and Cl_{2} and the low values for clusters Cl_{3} and Cl_{4}.

In conclusion, we can say that although the two partitions of our student sample are different, they are consistent. The characterization via the dendrogram allows us to obtain more detail. This happens, in particular, in the case of cluster Cl_{3}, which turns out to be very extensive, with a large number of students and a low value of

In order to better compare the results obtained by the NH-ClA and H-ClA methods, we applied the variation of information (VI) criterion (see Section III), which measures the amount of information gained and lost when switching from one clustering to another. We calculated the value of VI to compare the 4-cluster results of the k-means method with the 4-cluster, 5-cluster and 6-cluster results of the H-ClA method, and obtained the values 0.34, 0.38 and 0.28, respectively. Since lower VI values indicate closer partitions, we can conclude that the best agreement is found between the 4-cluster results of the k-means method and the 6-cluster results of the H-ClA method.
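For reference, the VI between two partitions U and V can be sketched as VI = H(U|V) + H(V|U), the sum of the two conditional entropies; the implementation below is illustrative, not the authors' code:

```python
from math import log

def variation_of_information(U, V):
    """VI between two labelings of the same n elements (natural log units)."""
    n = len(U)
    vi = 0.0
    for u in set(U):
        for v in set(V):
            p_u = sum(1 for x in U if x == u) / n
            p_v = sum(1 for x in V if x == v) / n
            p_uv = sum(1 for x, y in zip(U, V) if x == u and y == v) / n
            if p_uv > 0:
                # contribution of the pair (u, v) to H(U|V) + H(V|U)
                vi -= p_uv * (log(p_uv / p_u) + log(p_uv / p_v))
    return vi

part_a = [0, 0, 1, 1, 2, 2]
part_b = [0, 0, 1, 1, 2, 2]          # identical partition
part_c = [0, 1, 0, 1, 0, 1]          # very different partition
assert abs(variation_of_information(part_a, part_b)) < 1e-12
assert variation_of_information(part_a, part_c) > 0
```

Identical partitions give VI = 0, and VI grows as the two groupings share less information, which is why the lowest of the three values above (0.28) marks the best agreement.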

| Cluster | α | β | γ | δ | ε | ζ |
|---|---|---|---|---|---|---|
| Most frequently given answers | 1C, 2C, 3B, 4B | 1B, 2C, 3B, 4A/4B | 1B, 2B, 3E, 4A | 1C, 2B, 3D, 4A | 1D, 2C, 3A, 4B | 1A, 2A, 3A, 4D |
| Number of students | 17 | 21 | 29 | 21 | 19 | 17 |
| Characterization of students in cluster by the k-means method (*) | (14)Cl_{4} + (3)Cl_{3} | (18)Cl_{1} + (3)Cl_{4} | (19)Cl_{2} + (10)Cl_{3} | (19)Cl_{3} + (2)Cl_{4} | (14)Cl_{3} + (5)Cl_{4} | (17)Cl_{3} |

(*) i.e. (14)Cl_{4} + (3)Cl_{3} means that cluster α contains 14 students from cluster Cl_{4} (in NH-ClA) and 3 students from cluster Cl_{3}.

The use of cluster analysis techniques is common in many fields of research, such as information technology, biology, medicine, archeology, econophysics and market research. These techniques allow the researcher to locate subsets, or clusters, within a set of objects of any nature that have a tendency to be homogeneous "in some sense". The results of the analysis can reveal a high homogeneity within each group (intra-cluster) and a high heterogeneity between groups (inter-cluster), in line with the chosen criteria. However, only a limited number of examples of application of ClA in the field of education are available, and many aspects of the various available techniques have rarely been examined in depth to reveal their strengths and weaknesses.

In this paper we started from some preliminary considerations about the problems arising from the coding procedures of student answers to closed- and open-ended questionnaires. These procedures are aimed at categorizing student answers into a limited number of "typical" ways to answer each question. We gave some examples of procedures that can be used according to the questionnaire type (closed- or open-ended), and then we presented and discussed two ClA methodologies that can sometimes be found in the education literature. We started by describing a non-hierarchical ClA method, k-means, which allows the researcher in Education to easily separate students into groups that can be recognized and characterized by common traits in their answers to a questionnaire. It is also possible to easily represent these groups in a 2-dimensional Cartesian graph containing points that represent the students of the sample on the basis of their mutual distances, related to the mutual correlations among the students answering the questionnaire. Each of the clusters found by the analysis can be characterized by a point, the "centroid", representing the answers most frequently given by the students comprised in the cluster. Some functions and parameters useful to carefully evaluate the reliability of the results obtained have also been discussed.

Following this, we described a different method of analysis, based on hierarchical clustering, that can also help the researcher to find student groups whose elements (the students) are linked by common traits in their answers to a questionnaire. This method allows the researcher to visualize the clustering results in a graphic tree, called a "dendrogram", that easily shows the links between pairs and/or groups of students on the basis of their mutual distances. Each cluster can be characterized on the basis of the answers most frequently given by the students in it. Again, functions and parameters useful to evaluate the reliability of the results have been discussed.

Finally, an application of these two methods to the analysis of the answers to a real questionnaire has been given, in order to clearly show which choices the researcher must make, and which parameters he/she must use, in order to obtain the best partitions of the whole student group and to check the reliability of the results. In order to study the coherence of the results obtained by using hierarchical and non-hierarchical ClA methods, we compared the results with each other. We found that many of the clusters found by NH-ClA are also present in H-ClA; yet some of the clusters found with NH-ClA are further split, and can thus be better characterized, by means of H-ClA.

We can conclude that the NH-ClA method we discussed here allows the researcher to easily obtain and visualize in a 2-D graph a global view of student behaviour with respect to the answers to a questionnaire, and to obtain a first characterization of student behaviour in terms of the most frequently used answering strategies. The H-ClA method, on the other hand, although producing a graph that is not as easy to read as the one produced by the other method (dendrogram vs. Voronoi diagram), allows the researcher to obtain results coherent with the NH-ClA ones and can offer a finer-grained picture of student behaviour.
