The impact of data structure on the classification ability of financial failure prediction models

Prediction models that reveal the threat of financial difficulties in companies are created by applying various multivariate statistical methods. From a global perspective, prediction models serve to classify a company as prosperous or non-prosperous, or to quantify the probability of financial difficulties in the company. In many countries around the world, real financial data about companies are used in developing these prediction models. In Slovakia, standard data from the financial statements and annual reports of Slovak companies are used to create company failure models. Since the resulting data files are generally large, the data must be pre-processed by selected methods before the prediction model is constructed. A database of companies needs to be prepared for the subsequent application of statistical methods, and it is also highly appropriate to focus on the detection of potential extreme and outlying observations. This article therefore focuses on quantifying the impact of the detected data structure, in particular the occurrence of extreme and outlying observations in the data set, on the overall classification ability of the resulting prediction models.


Introduction
Initial data processing for further statistical and econometric analyses is a very important part of an analyst's work. This preparation requires considerable time, analyst experience, and knowledge of the data and of the situation being analyzed. Such data pre-processing is also needed when predicting the financial difficulties of businesses, a topic that has attracted much attention in recent years.
The prediction of the financial situation of companies is a relatively young field of economic research. Its origins date back to the 1930s, but the first prediction models appeared only in the 1960s [1]. The work of Fitzpatrick from 1932 can be considered the first study focused on finding the main differences, based on the analysis of financial ratios, between companies with and without financial problems [2]. Since then, predictive financial analysis has undergone significant development, from one-dimensional and multidimensional discriminant analysis, through logistic regression, to artificial intelligence. At present, experts' views on the various methods of predicting the financial situation of companies differ. Some authors examine whether models developed in the last century can still predict the bankruptcy of present-day companies, which leads to various adjustments and recalculations of the original models. Other authors focus on creating new models using new ratios and new methods [3]. With the development of artificial intelligence, new methods such as machine learning techniques, neural networks, and genetic algorithms are being introduced into predictive financial analysis. Given the differing opinions of experts on the various prediction methods, it can be argued that every method has its advantages, disadvantages, and limitations of use [4]. Nevertheless, continued research in this area confirms the relevance of the topic. In any case, predicting the financial situation of a company remains highly important, not only for the company itself but also for all its stakeholders [5].
The created prediction models are evaluated in terms of their success in correctly classifying companies, especially in correctly predicting financial difficulties. In this article we analyze the impact of data preparation, carried out during the development of prediction models, on classification accuracy. The aim of the paper is to find out whether identifying and removing outlying and extreme values from the data set improves the classification ability of models created by discriminant analysis, logistic regression, and the classification tree method (namely CART). The contribution of the paper is a new view of the prediction ability of the created models, with an emphasis on thorough data preparation.
Our study is organized in four chapters. The first provides an introduction to the issue of bankruptcy prediction and reviews the current state of the field in the form of a literature review. The second chapter briefly describes the methods and data used in this study. The third chapter presents the results of the analysis of the impact of outlier occurrence on the classification ability of prediction models created by the three selected methods. The discussion compares the results of the methods used in this study and addresses the weaknesses and future directions of the work.

Literature review
The prediction of bankruptcy is a topic that economists in many countries of the world have addressed in recent years. The study of Fitzpatrick from 1932 is considered the first to focus on creating a prediction model based on the analysis of differences in financial ratios [6]. Since then, predictive financial analysis has undergone significant development, from one-dimensional and multidimensional discriminant analysis, through logistic regression, to artificial intelligence. Discriminant analysis was used for the first time by Beaver in 1966, whose work formed the basis for prediction models. Building on this research, Altman in 1968 used multivariate discriminant analysis to develop what is probably the most famous bankruptcy prediction model [2]. Ohlson's study from 1980 was the first to use logistic regression to create a model predicting the probability of company failure [7].
At present, authors use different methods to create prediction models: from the older methods of discriminant analysis and logistic regression to more modern methods such as neural networks, genetic algorithms, classification trees, and random forests [8]. Several prediction models have also been created in Slovakia in the last few years. The best known and most frequently used are the models of Gurcik from 2002 and Chrastinova from 1998 [5]. More recently, several authors have created new prediction models under Slovak conditions. Kovacova and Kliestik [9] developed models for bankruptcy prediction of Slovak companies using the logit and probit methods and compared the overall prediction power of the two models. Gavurova et al. [10] analyzed the impact of trend variables on the prediction power of models constructed using discriminant analysis and decision trees, and developed a new model for Slovak companies using the decision tree technique. Mihalovic [11] also dedicated his study to the development of bankruptcy prediction models in the Slovak Republic, the first estimated via discriminant analysis and the second based on logistic regression. Other authors in Slovakia deal with the application of existing models to predict the financial difficulties of companies in Slovakia, for example [12], [13].
Several authors have also dealt in recent years with the occurrence of outliers in data used for bankruptcy prediction models, mostly examining the impact of outliers on the resulting prediction power of the created models. For example, Tsai and Cheng [14] studied the bankruptcy prediction performance achieved after removing different volumes of outliers from datasets. Linares-Mustaros et al. [15] dealt with problems caused by outliers when using cluster analysis to classify firms according to their financial structures. Alrawashdeh et al. [16] tried to eliminate the high sensitivity of linear discriminant analysis to outliers in the data and to improve the classification ability of the created models, including in bankruptcy prediction. Figini et al. [17] described novel approaches to predicting default for SMEs by detecting multivariate outliers. Pawelek et al. [18] carried out an empirical study on the influence of detecting and eliminating outliers on the effectiveness of a bankruptcy prediction logit model for Polish companies; a similar issue is addressed in their subsequent studies [19] and [20].

Materials and Methods
Outlying and extreme observations are observations in the statistical set that are significantly smaller or larger than the other values. They can occur in a data file for various reasons. They may arise as recording errors, most often caused by the human factor, for example when records are manually transcribed into electronic form. An outlier may also appear in the file as a measurement that genuinely differs significantly from the others [18].
Outlying and extreme observations may signal various anomalies in the data that need to be addressed in the data pre-processing phase before more advanced statistical methods are applied. Some methods are very sensitive to the occurrence of such values in the file. In general, it is recommended first to detect extreme (as well as outlying) observations, then to analyze them and consider removing them from the data file [20]. It is advisable to remove those which, from an expert point of view, represent problematic points and distort the parameters of the regression function. The solution, of course, depends on the specific application and the analyst's decision.
A special group of outliers are the so-called multivariate outliers. A multidimensional observation can become an outlier when the combination of its values across multiple variables is unique, differing from the combinations observed for the other units in the set. A suitable metric for identifying multivariate outliers is the Mahalanobis distance, which measures the multidimensional distance of each observation from the group centroid. In this paper, we detect multivariate outliers using the Mahalanobis distance according to [21].
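As an illustration, the screening described above can be sketched as follows. This is a minimal sketch, not the exact procedure of [21]: it flags observations whose squared Mahalanobis distance from the sample centroid exceeds a chi-square cutoff, and the use of NumPy and SciPy is our choice of tooling, not software named in the study.

```python
import numpy as np
from scipy.stats import chi2

def mahalanobis_outliers(X, alpha=0.001):
    """Return a boolean mask flagging rows of X whose squared Mahalanobis
    distance from the sample centroid exceeds the chi-square cutoff
    chi2(1 - alpha, df = number of variables)."""
    X = np.asarray(X, dtype=float)
    diff = X - X.mean(axis=0)
    # pseudo-inverse guards against a singular covariance matrix
    inv_cov = np.linalg.pinv(np.cov(X, rowvar=False))
    d2 = np.einsum("ij,jk,ik->i", diff, inv_cov, diff)  # squared distance per row
    return d2 > chi2.ppf(1 - alpha, df=X.shape[1])

# demo on synthetic "financial ratios": one row is made an obvious outlier
rng = np.random.default_rng(0)
ratios = rng.normal(size=(500, 3))
ratios[0] = [15.0, 15.0, 15.0]
print(mahalanobis_outliers(ratios).sum(), "row(s) flagged")
```

Under approximate normality the cutoff at alpha = 0.001 flags very few clean rows, so nearly everything flagged is genuinely atypical; the analogous screening in the study marked 555 of 45,458 companies.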
Subsequently, we compare the predictive ability of models detecting the financial difficulties of companies in Slovakia. For each of the three methods, discriminant analysis, logistic regression, and the CART binomial tree method, we compare a model built on the data file from which multivariate outliers have been removed with a model built on the data file in which they were left.

Data used in the study
In this study we use a database of financial indicators of Slovak enterprises from 2016 and 2017, containing 45,458 enterprises in total. The data come from Amadeus, a database of comparable financial information for public and private companies across Europe, and contain the values of 21 financial ratios of Slovak enterprises from 2016. The database also contains a variable identifying the financial difficulties of each enterprise in 2017. A more detailed description of the variables, as well as the identification of non-prosperous enterprises, is given in [22]. The following table lists the numbers of prosperous and non-prosperous enterprises in the database. Using the multivariate outlier identification methodology described in [21], we identified a total of 555 outlying companies in the dataset (see the following table); of these, 342 were prosperous and 213 non-prosperous. On this sample of Slovak enterprises we created non-prosperity prediction models by three methods: discriminant analysis, logistic regression, and the CART binomial tree method. We then analyzed the sensitivity of the models' prediction power to the presence of outliers in the data file. The prediction power of the models is assessed on the basis of the classification table, mainly the percentage of correctly identified non-prosperous enterprises, and on the area under the ROC curve (AUC).
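The comparison carried out in the following sections can be sketched end to end on synthetic data. This is a hedged illustration, not the study's actual pipeline or data: the class proportions, ratio values, and scikit-learn estimators (LinearDiscriminantAnalysis, LogisticRegression) are our assumptions, and a few rows are deliberately corrupted to play the role of multivariate outliers.

```python
import numpy as np
from scipy.stats import chi2
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

def outlier_mask(X, alpha=0.001):
    """Flag rows whose squared Mahalanobis distance exceeds the chi-square cutoff."""
    diff = X - X.mean(axis=0)
    inv_cov = np.linalg.pinv(np.cov(X, rowvar=False))
    d2 = np.einsum("ij,jk,ik->i", diff, inv_cov, diff)
    return d2 > chi2.ppf(1 - alpha, df=X.shape[1])

def fit_and_score(model, X, y):
    """Fit on (X, y) and return (overall accuracy, AUC) on the same data."""
    model.fit(X, y)
    return model.score(X, y), roc_auc_score(y, model.predict_proba(X)[:, 1])

# synthetic stand-in for the ratio database:
# 300 "prosperous" (class 0) and 100 "non-prosperous" (class 1) firms, 5 ratios
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0.0, 1.0, (300, 5)), rng.normal(1.0, 1.0, (100, 5))])
y = np.r_[np.zeros(300), np.ones(100)]
X[:5] += 20.0  # corrupt five rows to imitate multivariate outliers

mask = outlier_mask(X)
results = {}
for name, model in [("LDA", LinearDiscriminantAnalysis()),
                    ("logit", LogisticRegression(max_iter=1000))]:
    results[name] = {"with": fit_and_score(model, X, y),
                     "without": fit_and_score(model, X[~mask], y[~mask])}
    print(name, results[name])
```

The dictionary holds, for each method, the accuracy and AUC with and without the flagged rows, mirroring the paired comparisons reported in the tables below.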

Models created by Discriminant analysis
In the first step, we created a model from the original database, without removing the companies marked as multivariate outliers. This first model, created by discriminant analysis, achieved an overall prediction ability of 70.4%, with 74.4% of non-prosperous enterprises correctly classified. The AUC of this model is 0.778.
We then removed the companies identified as multivariate outliers from the database and created the same discriminant analysis model without them. Table 3 compares the classification of the two models. The overall classification ability of the model improved by 1.2% when multivariate outliers were removed; for non-prosperous businesses, however, the correct classification improved by up to 12%. The AUC increased to 0.807.

Models created by Logistic regression
The first logistic regression model, created from the original data set, achieved an overall correct classification of 75%, with 83.4% of non-prosperous enterprises correctly classified; its AUC was 0.841. The second model, created from the dataset with multivariate outliers removed, achieved a better overall classification of 85.8%, with 84.1% of non-prosperous enterprises correctly classified; its AUC is 0.921.
The following table compares the classification of both logistic regression models.

Models created by CART tree
Using the binomial tree method, we also created two models to predict the financial difficulties of a company. The first model, created from the original database, achieved an overall correct classification of 89.1%, with 88.1% of non-prosperous enterprises correctly classified; its AUC is 0.911. The second model, created after the removal of multivariate outliers, achieved an overall correct classification of 88.7%, slightly worse on the test sample than the model built on the original data set. In this case, 87.8% of non-prosperous businesses were correctly classified; the AUC of this model is 0.945.
The following table compares the classification capability of both CART models in a test sample that was 20% of the data set.
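The evaluation protocol used in this section can be illustrated as follows: fit a CART-style tree, hold out 20% of the sample as a test set, and report a classification table and AUC. The synthetic data, the depth limit, and the use of scikit-learn's DecisionTreeClassifier are all our assumptions, not details taken from the study.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import confusion_matrix, roc_auc_score

# synthetic stand-in: 400 "prosperous" and 100 "non-prosperous" firms, 5 ratios
rng = np.random.default_rng(2)
X = np.vstack([rng.normal(0.0, 1.0, (400, 5)), rng.normal(1.0, 1.0, (100, 5))])
y = np.r_[np.zeros(400), np.ones(100)]

# hold out 20% of the sample for testing, as in the study's CART evaluation
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)

tree = DecisionTreeClassifier(max_depth=4, random_state=0).fit(X_tr, y_tr)
cm = confusion_matrix(y_te, tree.predict(X_te))  # rows: actual, columns: predicted
auc = roc_auc_score(y_te, tree.predict_proba(X_te)[:, 1])
print(cm)
print("test AUC:", round(auc, 3))
```

Evaluating on a held-out sample, rather than on the training data, is what makes the CART comparison in this section a fairer test of generalization than in-sample classification tables.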

Discussion
Multivariate outliers were identified in the enterprise database as those enterprises whose combination of financial ratios differs significantly from that of the other enterprises. The classification results in predicting financial difficulties suggest that removing multivariate outliers from the database improves the results achieved.
The discriminant analysis model improved by 1.2% in overall classification, but by as much as 12.1% in the classification of non-prosperous enterprises. Removing outliers thus significantly improved the percentage of correctly classified non-prosperous companies in the discriminant analysis, and therefore has a significant impact on the prediction of a company's non-prosperity. The logistic regression model improved its overall percentage of correct classification by 10.7%, and by 0.7% in the classification of non-prosperous enterprises.
For the model created by the CART binomial tree method, the elimination of outliers from the database does not have a significant impact on the prediction power of the model; after removing the outliers, the CART model achieved almost the same classification results.
A weakness of this study is that the financial ratios used were not analyzed with respect to the other assumptions that the applied methods require. We therefore see a further direction of this study in analyzing the impact of multicollinearity among the variables on the prediction ability of the created models.

Conclusion
In this paper, we focused on the classification ability of models predicting the non-prosperity of Slovak companies. The models were created using three frequently used methods: discriminant analysis, logistic regression, and the CART binomial tree method. We investigated the impact of identifying multivariate outliers in the enterprise database, and removing them, on the resulting prediction power of the models. We assessed both the overall percentage of correct classification and the percentage of correctly classified non-prosperous enterprises. In summary, removing outliers from the database improves the classification ability of both the discriminant model and the logistic regression model, while removing outlying companies does not affect the classification ability of the CART model.