Analysis of regional development disparities in Ukraine with fuzzy clustering technique

Disparities in the development of regions in any country affect the entire national economy. Detecting these disparities can help formulate appropriate economic policies for each region by acting on the factors that slow down economic growth. This study aimed to apply clustering methods to analyse regional disparities based on the economic development indicators of the regions of Ukraine. Fuzzy clustering methods were considered; they generalize partition clustering methods by allowing objects to be partially classified into more than one cluster. The fuzzy clustering technique was applied, using R packages, to data sets of statistical indicators of economic activity in all administrative regions of Ukraine in 2017. Sets of development indicators for different sectors of economic activity, such as industry, agriculture, construction and services, were reviewed and analysed. The study showed that the regional cluster classification results strongly depend on the input development indicators and on the clustering technique used. Considering different partitions into fuzzy clusters opens up new opportunities for developing recommendations on how to differentiate economic policies in order to achieve maximum growth for the regions and the entire country.


Introduction
Economic policies that take into account differences in regional development should be coordinated using scientific approaches to achieve maximum results in each region and for the whole country. This article is dedicated to the problem of clustering Ukrainian regions into groups according to their levels of economic development. The usefulness of such a division is clear. Given a partition into clusters based on economic indicators, a decision maker can elaborate economic policy measures that are specific to each cluster and common to all the regions inside the same cluster. The number of policy options is thus substantially reduced in comparison with the case when a decision is made for each region separately.
Clustering also provides an opportunity to identify groups of regions that are most attractive as objects of domestic and foreign investment. Undoubtedly, the use of cluster analysis for improving regional policy will increase the efficiency of the economic system as a whole, which is especially important for today's Ukraine and is a necessary condition for its economic growth.
A considerable body of research covering a wide range of cluster analysis approaches and tools has been conducted and reported in the literature. Nevertheless, the search for the most suitable clustering methods remains relevant, because every method has its own advantages and disadvantages.
Fuzzy clustering methods permit the gradual assessment of the membership of data elements in a cluster, which is described by a membership function taking values in the real unit interval [0; 1]. Fuzzy clustering thus assumes that the boundaries between groups are not well defined, as is the case in most natural systems. Therefore, fuzzy clustering approaches make it possible to describe and solve real problems, such as estimating regional development disparities, more adequately.
This article presents a study on the application of hard cluster analysis methods and clustering methods based on fuzzy set theory. A new approach to evaluating regional disparities in Ukraine using a fuzzy clustering technique is given. Statistical data on indicators of economic activity in different regions of Ukraine in 2017 were used. The considered methods are especially useful for the case of qualitative economic indicators.
This article consists of six sections. The first one substantiates the background of the conducted research. In the second section, a review of the scientific literature on the research topic is presented. The third part sets out the theoretical basis of the proposed clustering techniques. The course of the study and its main results are presented in the fourth and fifth parts of this paper. The final part contains conclusions based on the research results and discusses areas for further studies in the field of exploring fuzzy clustering methods and adapting them to regional clustering tasks.
[...] differences in regional qualitative and quantitative economic indicators. Economic disparities are generally assessed using indicators such as gross national product (GNP), combined with the analysis of tax revenues, the growth of industry and agriculture, demographic trends, infrastructure and services [1].
The Klassen typology and the developed fuzzy-Klassen model are discussed in [3], together with recommendations on their use in modelling regional development disparities.
The use of clustering techniques for classifying regions by the level of economic indicators is presented in [1,2,6]. It has also been proposed to combine the traditional clustering approaches with fuzzy methods based on the fuzzy set theory of L. Zadeh [10], and much research has been done to apply them in practice.

Research methodology
Clustering is one of the important data mining techniques that enable the discovery of hidden relationships in data [15]. The goal of clustering is to divide the set of data items into a number c of groups, called clusters. The result of any clustering algorithm is a mapping of data items to specific groups.
In general, clustering techniques are divided into two types, hierarchical and partition clustering [22]. Partition clustering algorithms divide the data set into clusters, assigning dissimilar data objects to different clusters.
Hierarchical clustering techniques are generally classified into two types, agglomerative and divisive clustering [22]. These methods form a dendrogram, which represents the nested grouping pattern and the similarity levels in the classification process. Cutting the dendrogram at a certain level breaks it into groups, and cutting it at a different level produces a different grouping of the data. In hierarchical clustering, objects that belong to a child cluster also belong to the parent cluster [13].
Hierarchical clustering methods classify data by the distance between pairs of data points. The classical distance measures are the Euclidean and Manhattan distances, which are defined as follows [19]:

$$d_{euc}(x, y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}, \qquad d_{man}(x, y) = \sum_{i=1}^{n} |x_i - y_i|,$$

where $x$ and $y$ are two vectors of length $n$; $d_{euc}(x, y)$ is the Euclidean distance; $d_{man}(x, y)$ is the Manhattan distance.
There are many other distance measures, but the right choice of measure, which depends on the type of the data and on the research questions, is very important, as it has a strong influence on the clustering results [19].
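The Euclidean and Manhattan distances named above can be illustrated with a minimal Python sketch. It uses only the standard library and is not the R code used in the study; the function names and the two indicator vectors are hypothetical and chosen only for illustration.

```python
import math


def euclidean_distance(x, y):
    """Square root of the summed squared coordinate differences."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def manhattan_distance(x, y):
    """Sum of the absolute coordinate differences."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    return sum(abs(a - b) for a, b in zip(x, y))


# Hypothetical indicator vectors for two regions (illustrative numbers only).
region_a = [1.0, 2.0, 3.0]
region_b = [4.0, 6.0, 3.0]

print(euclidean_distance(region_a, region_b))  # 5.0
print(manhattan_distance(region_a, region_b))  # 7.0
```

The two measures generally rank pairs of objects differently, which is one reason why the choice of distance can change the resulting clusters.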
The conventional (hard or crisp) clustering methods require that each point of the data set belongs to exactly one cluster [14]. Fuzzy set theory, proposed by Zadeh [10] in 1965, gave the idea of describing the uncertainty of belonging to a particular class by a membership function. Applications of fuzzy set theory in cluster analysis were proposed early on in the work of Bellman, Kalaba and Zadeh [23] and of Ruspini [17].
Basic fuzzy clustering techniques include fuzzy clustering based on fuzzy relations, fuzzy clustering based on objective functions, and the fuzzy generalized K-nearest neighbour rule, one of the powerful nonparametric classifiers [14].
For all fuzzy clustering algorithms, it is necessary to specify the number c of clusters in advance, although in general the true number c is unknown [14]. The quality of the classification of data into partitions depends on the value of the parameter c that is provided to the algorithm [15].
Fuzzy clustering is a soft clustering technique for classifying data into groups. In fuzzy clustering each data point belongs to all the clusters with varying memberships and these membership values range between zero and one [15].
Most clustering algorithms follow a similar structure [11]: (1) select initial cluster centers, (2) calculate the distances between all points and all cluster centers, (3) update the partition matrix, repeating until some termination threshold is met. A classification of fuzzy algorithms is given in [11].
The fuzzy c-means (FCM) algorithm is one of the most widely used fuzzy clustering algorithms. It is a soft clustering algorithm that was first studied by Dunn (1973) [28] and generalized by Bezdek (1974; 1981) [29,30]. The algorithm alternates between calculating the cluster centers and assigning points to these centers using the Euclidean distance [13]. The centroid of a cluster is calculated as the mean of all points, weighted by their degree of belonging to the cluster [19]. This process is repeated until the cluster centers stabilize.
This algorithm assigns each data item a membership value in the range from 0 to 1 for each cluster. Thus, it incorporates the fuzzy set concept of partial membership and forms overlapping clusters [13]. Consequently, the data objects closer to the centers of clusters have higher degrees of membership than objects scattered along the borders of clusters [20].
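The alternating steps of fuzzy c-means described above can be sketched in plain Python as follows. This is a minimal illustration, not the ppclust implementation used in the study. It assumes the common fuzzifier value m = 2, the Euclidean distance and user-supplied initial centers; the function name, parameter names and sample points are hypothetical.

```python
import math


def fuzzy_c_means(points, centers, m=2.0, tol=1e-6, max_iter=100):
    """Alternate membership and center updates until the centers stabilize.

    Returns (memberships, centers). memberships[i][j] is the degree to which
    point i belongs to cluster j; each row sums to 1. The memberships come
    from the last pass, i.e. from centers within `tol` of the returned ones.
    """
    centers = [list(c) for c in centers]
    dim = len(points[0])
    memberships = []
    for _ in range(max_iter):
        # Membership update: inverse-distance weighting with exponent 2/(m-1).
        memberships = []
        for p in points:
            dists = [math.dist(p, c) for c in centers]
            if any(d == 0.0 for d in dists):
                # The point coincides with a center: full membership there.
                hits = [1.0 if d == 0.0 else 0.0 for d in dists]
                memberships.append([h / sum(hits) for h in hits])
            else:
                inv = [d ** (-2.0 / (m - 1.0)) for d in dists]
                memberships.append([v / sum(inv) for v in inv])
        # Center update: mean of all points weighted by membership ** m.
        new_centers = []
        for j, old in enumerate(centers):
            weights = [row[j] ** m for row in memberships]
            total = sum(weights)
            if total == 0.0:
                new_centers.append(old)  # degenerate cluster: keep the old center
                continue
            new_centers.append(
                [sum(w * p[k] for w, p in zip(weights, points)) / total
                 for k in range(dim)]
            )
        shift = max(math.dist(a, b) for a, b in zip(centers, new_centers))
        centers = new_centers
        if shift < tol:
            break
    return memberships, centers


# Hypothetical one-dimensional data with two visible groups.
sample = [[0.0], [1.0], [9.0], [10.0]]
u, c = fuzzy_c_means(sample, centers=[[0.0], [10.0]])
for point, row in zip(sample, u):
    print(point, [round(v, 3) for v in row])
```

Points near a center receive a membership close to 1 in that cluster, while points between centers receive intermediate values in several clusters.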
Clustering algorithms can be applied using the R software. The following R packages were used for calculations in our research: (1) cluster, ppclust and fclust for computing fuzzy clustering and (2) factoextra for visualizing clusters [27].
The function hclust() (stats R package) performs a hard hierarchical cluster analysis using a set of dissimilarities for the n objects being clustered. Initially, each object is assigned to its own cluster; the algorithm then proceeds iteratively, at each stage joining the two most similar clusters, until there is just a single cluster. At each stage, the distances between clusters are recomputed according to the particular clustering method being used [26].
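The agglomerative procedure described above can be sketched in a few lines of Python. This is a naive illustration of the idea and not the hclust() implementation; it assumes Euclidean distances and supports only the "single" and "complete" linkage rules. The function name, its parameters and the sample points are hypothetical.

```python
import math


def agglomerative(points, n_clusters, linkage="complete"):
    """Merge the two closest clusters until n_clusters remain.

    Returns a list of clusters, each a list of indices into `points`.
    """
    if linkage not in ("single", "complete"):
        raise ValueError("linkage must be 'single' or 'complete'")
    if not 1 <= n_clusters <= len(points):
        raise ValueError("n_clusters must be between 1 and len(points)")
    # Single linkage uses the closest pair of members, complete the farthest.
    pick = min if linkage == "single" else max
    clusters = [[i] for i in range(len(points))]  # every object starts alone

    def cluster_distance(a, b):
        return pick(math.dist(points[i], points[j]) for i in a for j in b)

    while len(clusters) > n_clusters:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = cluster_distance(clusters[a], clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]  # join the two closest clusters
        del clusters[b]
    return clusters


# Hypothetical two-dimensional points.
sample = [[0, 0], [0, 1], [5, 5], [5, 6], [10, 0]]
print(agglomerative(sample, n_clusters=3))  # [[0, 1], [2, 3], [4]]
```

Stopping the merging at a chosen number of clusters corresponds to cutting the dendrogram at the matching level.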
The function fanny() (cluster R package) can be used to compute fuzzy clustering [26]. It stands for fuzzy analysis clustering and returns an object that includes, among others, the following components: the fuzzy membership matrix containing the degree to which each observation belongs to a given cluster; Dunn's partition coefficient (a low value indicates a very fuzzy clustering, whereas a value close to 1 indicates a near-hard clustering); and the clustering vector containing the nearest hard grouping of observations [19].
The function fcm() (ppclust R package), which applies the fuzzy c-means algorithm, can also be used to compute fuzzy clustering. It returns an object that includes, among others, the following components: the fuzzy membership matrix; the initial and final cluster prototype matrices; Dunn's fuzziness coefficients; and the within-cluster sum of squares by cluster [19].
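As an illustration of how Dunn's partition coefficient summarizes a membership matrix, the Python sketch below assumes the usual definition: the sum of all squared membership coefficients divided by the number of observations, together with a normalized form rescaled to the interval from 0 to 1. The function name and the sample matrices are hypothetical, and the code is not taken from the cluster package.

```python
def dunn_partition_coefficient(memberships):
    """Return (F, normalized F) for a fuzzy membership matrix.

    F is the mean over observations of the summed squared memberships and
    lies between 1/k and 1 for k clusters; the normalized form is
    (F - 1/k) / (1 - 1/k).
    """
    n = len(memberships)
    k = len(memberships[0])
    if k < 2:
        raise ValueError("at least two clusters are required")
    f = sum(u * u for row in memberships for u in row) / n
    normalized = (f - 1.0 / k) / (1.0 - 1.0 / k)
    return f, normalized


# A hard partition and a completely fuzzy one (illustrative matrices).
print(dunn_partition_coefficient([[1.0, 0.0], [0.0, 1.0]]))  # (1.0, 1.0)
print(dunn_partition_coefficient([[0.5, 0.5], [0.5, 0.5]]))  # (0.5, 0.0)
```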

Data set description
The data for our study were taken from the State Statistics Service of Ukraine [31]. We used statistical information on economic activities in 2017 for all regions. We selected some basic indicators of economic activity and divided them into two groups by their meaning. The first group included the indicators of the extraction of aquatic bioresources and of agricultural activities, and the second group included the indicators of retail trade, services and industrial activities. All of them were explored, and their values were used in the clustering analysis of regional development. The list of these indicators and their summary statistics are presented in Tables 1 and 2.
In Table 3, the column "Id" contains the internal identification number of the region, which is used for convenience in all subsequent computations and outputs. We considered the values of these indicators, gathered in 2017, for all 24 administrative regions of Ukraine (Table 3). Two data sets were built, one for each set of indicators; we denote them the First data set and the Second data set. We then used both data sets to cluster the regions based on the different groups of indicators and compared the results.

Clustering results
Before starting the fuzzy clustering analysis, we applied the hierarchical clustering method with the "single" linkage method to both data sets. The results are illustrated by the cluster dendrograms (Fig. 1, 2), where the data points are hierarchically arranged into larger groups depending on the distances between them.

Three clusters
For the number of clusters equal to three (c = 3), we conducted hierarchical clustering with the hclust() function [26], using the "complete" linkage method, and obtained the hard clusters for the two data sets (Fig. 3-4).
The fuzzy clustering methods, applied to both data sets, allowed us to obtain fuzzy clusters characterized by membership coefficients that indicate, for every region, the strength of belonging to a particular cluster. We illustrated the fuzzy clusters by several charts (Fig. 5-6) and by a table with the values of the membership coefficients obtained by the fcm() function [19] (Table 5). The values of the membership coefficients vary from 0 to 1, and the conditional formatting patterns in the table indicate the strength of belonging to each cluster.
The next plot (Fig. 5) shows the overlapping clusters on the set of all data points. It is a scatterplot of the first two principal components derived from the data. In our case, 85.3% (62.8% + 22.5%) of the information about the multivariate data is captured by this plot.
In the following plots (Fig. 5, 6), the data points with the highest values of the membership coefficients are combined into three different clusters to show which data points most likely belong to each cluster. Similar information is shown in the scatterplot (Fig. 7), in which 85.33% of the information about the multivariate data is explained by two principal components. Another fuzzy clustering method, fanny() [26], gave a slightly different result (Fig. 8).
To estimate the goodness of the clustering results, we can plot the silhouette coefficients, which quantify the quality of the clustering achieved. The silhouette plot (Fig. 9) displays a measure of how close each point in one cluster is to points in the neighbouring clusters and helps to determine the optimal number of clusters visually. The plot of silhouette coefficients, built from the last clustering results, shows an average silhouette width of 0.38. This is not a sufficient result, and we can see that some data points are not well separated from the points in the neighbouring clusters. In particular, the points in the third cluster are very close to the decision boundary between two neighbouring clusters or might even have been assigned to the wrong cluster.
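The silhouette width of a single point can be computed as in the Python sketch below. It assumes the standard definition s = (b - a) / max(a, b), where a is the mean distance from the point to the other members of its own cluster and b is the smallest mean distance to the members of any other cluster. The function name and the sample data are hypothetical; this is an illustration rather than the R routine behind the silhouette plots.

```python
import math


def silhouette_widths(points, labels):
    """Silhouette width of every point for a hard labelling.

    Values near 1 mean a point is well inside its cluster, values near 0
    mean it lies between two clusters, and negative values suggest that it
    may have been assigned to the wrong cluster.
    """
    cluster_ids = sorted(set(labels))
    if len(cluster_ids) < 2:
        raise ValueError("at least two clusters are required")
    widths = []
    for i, p in enumerate(points):
        mean_dist = {}
        for c in cluster_ids:
            members = [j for j, lab in enumerate(labels) if lab == c and j != i]
            if members:
                total = sum(math.dist(p, points[j]) for j in members)
                mean_dist[c] = total / len(members)
        own = labels[i]
        if own not in mean_dist:
            widths.append(0.0)  # singleton cluster: width is set to 0
            continue
        a = mean_dist[own]
        b = min(d for c, d in mean_dist.items() if c != own)
        widths.append(0.0 if max(a, b) == 0.0 else (b - a) / max(a, b))
    return widths


# Hypothetical one-dimensional data with two well-separated groups.
sample = [[0.0], [1.0], [10.0], [11.0]]
widths = silhouette_widths(sample, [0, 0, 1, 1])
print([round(w, 3) for w in widths])
print(round(sum(widths) / len(widths), 3))  # average silhouette width
```

The average of these widths over all points is the single figure reported under a silhouette plot.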
A similar analysis was performed for the Second data set (Fig. 10-11). The scatterplot of two principal components derived from the data (Fig. 10) shows the overlapping clusters on the set of all data points; around 96.3% (77.8% + 18.5%) of the information about the multivariate data is explained by these components.
The data points with the highest values of the membership coefficients, combined into three different clusters, are presented in Table 6 and show which points most likely belong to each cluster.
The plot of silhouette coefficients (Fig. 11), built from the results of the fanny() method applied to the Second data set, shows an average silhouette width of 0.56. This is a fairly satisfactory result, and most data points are assigned to the right cluster, although some of them are still misplaced. The summarized results of fuzzy clustering by the fcm() function applied to both data sets are presented in Table 6. Three fuzzy clusters were obtained for each set of economic indicators, and the different partitions of Ukrainian regions show the regional development disparities, which can be analysed and used in the decision-making process concerning economic strategies.
Looking at the fuzziness of these partitions, we can note that the regions with intermediate values of the membership coefficients lie on the boundary between neighbouring clusters, and the strategies for them should be a mixture of the corresponding strategies of the neighbouring clusters.
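One possible way to single out such boundary regions from a membership matrix is sketched below in Python. The rule used here, flagging an object when its two largest membership coefficients differ by less than a chosen margin, is an assumption made for illustration and is not a criterion stated in the study. The function names, the margin value and the sample matrix are hypothetical.

```python
def nearest_hard_cluster(row):
    """Index of the cluster with the largest membership coefficient."""
    return max(range(len(row)), key=lambda j: row[j])


def boundary_objects(memberships, margin=0.2):
    """Indices of objects whose two largest memberships differ by < margin.

    `margin` is a hypothetical threshold chosen for illustration.
    """
    flagged = []
    for i, row in enumerate(memberships):
        top, second = sorted(row, reverse=True)[:2]
        if top - second < margin:
            flagged.append(i)
    return flagged


# Illustrative membership matrix for three objects and three clusters.
u = [
    [0.90, 0.05, 0.05],  # clearly in the first cluster
    [0.45, 0.40, 0.15],  # between the first and second clusters
    [0.10, 0.20, 0.70],  # clearly in the third cluster
]
print([nearest_hard_cluster(row) for row in u])  # [0, 0, 2]
print(boundary_objects(u))  # [1]
```

For a flagged object, the membership coefficients themselves could serve as weights when combining the strategies of the neighbouring clusters.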

Four clusters
A similar clustering analysis (Fig. 12) was conducted for the case of four clusters (c = 4). The results obtained by hierarchical clustering (hclust(), "complete") are given in Table 7.
[Table 7 (fragment): region Ids 2, 4, 5, 6, 8, 11, 12, 16, 18, 20, 23; region Ids 1, 2, 5, 6, 7, 8, 9, 10, 11, 12, 13, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24]
The fuzzy clusters are also described by membership coefficients (the values are omitted here because of their size). These fuzzy clusters are nevertheless well described by the overlapping shapes in Fig. 13, and the plot of two principal components captures around 85.3% of the information about the multivariate data. The fuzzy clusters based on the Second data set are represented by the plot in which the data points with the highest values of the membership coefficients are combined into four different clusters (Fig. 14). Here there are only two big groups of data points, and two data points stand alone in separate clusters. Therefore, further analysis with a larger number of clusters is not reasonable.
The summarized results of fuzzy clustering by the fcm() function applied to both data sets are presented in Table 8. Four fuzzy clusters were obtained for each set of economic indicators. These different classifications of Ukrainian regions show the disparities in regional development, which can be analysed and used in the decision-making process concerning economic strategies. Taking into account the fuzzy nature of the obtained partitions adds a new quality to the formation of economic strategies for different regions.

Results and discussion
The results of fuzzy clustering obtained in this study allow us to consider in more detail the similarities in the economic development levels of the Ukrainian regions assigned to the same cluster and to reveal the dissimilarities between the regions assigned to different clusters. The membership coefficients indicate how far apart the development levels are within and between clusters.
This alternative approach can help determine the regional development disparities according to certain indicators. As we showed in this research, the results of partitioning strongly depend on the indicators selected for the analysis, and any clustering technique should be used only together with a substantive analysis of the subject of interest. Before conducting fuzzy clustering, a thorough analysis of the nature of all economic indicators and of the relationships between them should be carried out to ensure a proper economic interpretation of the clustering results.
In general, fuzzy clustering results are not expected to differ significantly from hard clustering results. This is quite reasonable, and we observed it in practice. Although the concepts of hard and fuzzy clustering are rather different, they have common features, and the clusters obtained by the different methods predominantly overlap.
The main findings of this research are the conclusions about the regional disparities in the levels of different kinds of economic activity in Ukraine in 2017. The analysis of most agricultural indicators shows that, among Ukrainian regions, Zaporizhya is the region whose level differs significantly from the others. The analysis of most industrial indicators shows that the Dnipropetrovsk and Donetsk regions also have levels that differ significantly from the others.

Conclusion
Regional disparities in the level of economic development were analysed in this study by different clustering techniques. We obtained classifications based on two groups of economic indicators observed in 2017 for all Ukrainian regions. We can conclude that the inequalities across Ukrainian regions can be reduced by the right economic policies if information about the actual magnitude of the differences between the regions is available before the decision-making process. The fuzzy clustering methods provide an instrument for estimating these degrees of difference based on the analysis of regional economic activities in target sectors.
We showed that the use of fuzzy clustering methods in the analysis of regional disparities has many advantages, but it needs to be accompanied by a cluster validity process and by a substantive analysis of the economic indicators taken as the basis of the clustering investigation. In further research, the different fuzzy clustering results need to be aggregated in order to develop recommendations on how to differentiate economic policies so as to achieve maximum growth for the regions and the entire country.