Predicting Global Ranking of Universities Across the World Using Machine Learning Regression Technique

Digital transformation in education plays a significant role, especially when teaching and learning parameters are analysed to predict the global ranking index of universities across the world. Machine learning is a subset of computer science that enables machines to learn from data using various algorithms and predict results. This research explores the Quacquarelli Symonds (QS) approach for evaluating global university rankings and develops machine learning models for predicting global rankings. The research uses exploratory data analysis to analyse the dataset and then evaluates machine learning algorithms using regression techniques for predicting the global rankings. The research also addresses the future scope for evaluating machine learning algorithms that predict outcomes using classification and clustering techniques.


Introduction
Digital transformation enables organisations to understand their data and information and to analyse these data using digital technologies in order to improve operational performance and competitive advantage (Peter et al., 2021). Data science, as a part of digital transformation, enables organisations to generate new business models, develop strategy, create roadmaps and build competitive advantage based on an understanding of the information and patterns contained within the data. In order to establish new benchmarks in the higher education domain, higher educational institutes also need to understand best practices in the global context, the indicators used for measuring performance, and the evaluation criteria (Vitenko et al., 2021).
Machine learning is a subset of artificial intelligence that enables machines to learn from data using algorithms and predict results without being explicitly programmed (Awad & Khanna, 2015). Machine learning helps an organisation to study business operations and customer behaviours, analyse the patterns derived from the data, develop predictions and prepare relevant strategies. Machine learning problems are broadly classified as classification, clustering and regression problems (Sarkar, 2021). Classification and regression are categorised under supervised learning, and clustering is categorised under unsupervised learning (Alloghani et al., 2020). If the output variable is discrete, then classification or clustering can be applied, whereas if the output variable is continuous, then a regression technique can be applied. Hence, machine learning techniques can be used to study and learn from historic global university ranking data, identify the patterns and predict the university rankings (Estrada & Cantu, 2022). Further, because the global ranking of top universities is treated as a continuous variable, a regression technique is the most suitable.

Study of machine learning regression framework
The machine learning framework begins with data gathering and data pre-processing, as shown in Figure 1.

Figure 1. Machine learning regression technique framework
After pre-processing, feature selection is carried out. Exploratory data analysis is performed after feature selection to find the various statistical parameters pertaining to the selected dataset. Exploratory data analysis helps to understand the behaviour of the dataset and its variables, the data attributes and the characteristics of the dataset. Once the exploratory analysis is completed, machine learning algorithms are applied to predict the outcomes. The dataset is split into training and testing datasets before being input to the machine. The machine learns from the dataset using various algorithms and predicts the outcomes.
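The framework steps above can be sketched end to end in Python. This is a minimal illustration under stated assumptions, not the study's implementation: the data are synthetic, scikit-learn is an assumed tool choice, and a plain linear regression stands in for the learning algorithm.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# 1. Data gathering: synthetic stand-in (200 rows, 5 candidate features).
X = rng.normal(size=(200, 5))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.1, size=200)
X[rng.integers(0, 200, size=10), 2] = np.nan  # inject some missing values

# 2. Pre-processing: drop every row that contains a missing value.
keep = ~np.isnan(X).any(axis=1)
X, y = X[keep], y[keep]

# 3. Feature selection: keep the 2 features most associated with the target.
selector = SelectKBest(score_func=f_regression, k=2)
X_sel = selector.fit_transform(X, y)

# 4. Exploratory analysis: basic summary statistics of the selected features.
summary = {"mean": X_sel.mean(axis=0), "std": X_sel.std(axis=0)}

# 5. Split into training and testing data, then fit and evaluate.
X_train, X_test, y_train, y_test = train_test_split(
    X_sel, y, test_size=0.2, random_state=0
)
model = LinearRegression().fit(X_train, y_train)
r2 = model.score(X_test, y_test)  # R^2 on the held-out test data

print("selected feature indices:", selector.get_support(indices=True))
print("test R^2:", round(r2, 3))
```

The order of steps follows the text. In practice, feature selection is often fitted on the training split only, so that no information from the test data influences the choice of features.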

Study of existing world university ranking framework
The world university ranking is an annual ranking of top universities conducted and published by the Quacquarelli Symonds (QS) organisation. The relevant indicators and corresponding weightings are shown in Table 1.
Table 1. Indicators and weightages
The QS rankings comprise three parts: the global overall ranking, the subject-wise global ranking and the region-wise rankings. The QS world rankings are based on performance evaluation of key aspects such as teaching, research, employability, university mission and internationalisation.
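A weighted-indicator scheme of this kind can be sketched as a weighted sum of indicator scores. The indicator names and weights below are assumptions chosen for illustration and are not taken from Table 1. The function name `overall_score` is likewise hypothetical.

```python
# Assumed indicator weights (fractions that sum to 1.0); illustrative only,
# not the values from Table 1.
WEIGHTS = {
    "academic_reputation": 0.40,
    "employer_reputation": 0.10,
    "faculty_student_ratio": 0.20,
    "citations_per_faculty": 0.20,
    "international_faculty_ratio": 0.05,
    "international_student_ratio": 0.05,
}


def overall_score(indicator_scores):
    """Return the weighted sum of indicator scores (each on a 0-100 scale)."""
    return sum(WEIGHTS[name] * indicator_scores[name] for name in WEIGHTS)


# Hypothetical university with made-up indicator scores.
example = {
    "academic_reputation": 90.0,
    "employer_reputation": 80.0,
    "faculty_student_ratio": 70.0,
    "citations_per_faculty": 60.0,
    "international_faculty_ratio": 50.0,
    "international_student_ratio": 40.0,
}
print(round(overall_score(example), 2))
```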

Data collection, pre-processing and exploratory analysis
The top universities world ranking dataset for the year 2022 is used as the basis for predicting the outcomes using machine learning techniques. Twenty-one column variables in the form of indicators are used as features, and one column variable named global rankings is used as the target variable. A total of 1300 rows of data were used for pre-processing. After the removal of missing values, a total of 1225 rows of data were used for analysis. In order to understand the characteristics of the data and the relationships between variables, a correlation test is applied to the dataset (Kumar & Chong, 2018).
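The removal of missing values and the correlation test can be sketched with pandas. The column names and data below are synthetic assumptions, not the QS 2022 dataset, and Pearson correlation is assumed because the text does not name the correlation test used.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
n = 100

# Synthetic stand-in for the ranking dataset (hypothetical column names).
df = pd.DataFrame({
    "academic_reputation": rng.uniform(0, 100, n),
    "citations_per_faculty": rng.uniform(0, 100, n),
})
# Hypothetical target: higher scores give a numerically lower (better) rank.
df["global_rank"] = (
    1000
    - 5 * df["academic_reputation"]
    - 3 * df["citations_per_faculty"]
    + rng.normal(0, 10, n)
)
# Inject a few missing values to mimic incomplete rows.
df.loc[[3, 17, 42], "citations_per_faculty"] = np.nan

# Pre-processing: drop every row that contains a missing value.
clean = df.dropna()
print(f"rows before: {len(df)}, rows after: {len(clean)}")

# Correlation test: pairwise Pearson correlation between all columns.
corr = clean.corr(method="pearson")
rank_corr = corr["global_rank"].drop("global_rank").sort_values()
print(rank_corr)
```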

Result analysis
Application of the regression algorithm to the global university ranking dataset yields 66 trees, and boosting regression is applied to achieve better accuracy (Velthoen et al., 2021). To improve predictive outcomes, a Gaussian loss function is used within the boosting regression (Sigrist, 2021). Relative influence communicates the importance and percentage contribution of each variable to the results. Table 4 shows the variables and their percentage contributions to the outputs. From the evaluation metrics, we observe that the R² value is 0.841. Hence the goodness of fit is 84.1%, which shows that the data fit the regression model well and that a good prediction of global ranking is possible using the selected variables.
The relative influence plot provides a visual illustration of each variable's contribution to the overall results. Figure 4 illustrates the relative influence plot. From the plot, we observe that the scaled score contributes the most to predicting the ranks. We observe that as the number of trees approaches 66, accuracy stabilises and deviance reduces.
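Boosting regression with a Gaussian (squared-error) loss and a relative influence readout can be sketched as follows. This is a sketch under assumptions: scikit-learn's GradientBoostingRegressor is a swapped-in tool (the text does not name the software used), the data and feature names are synthetic, and its impurity-based feature importances are used as an analogue of relative influence.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(2)

# Synthetic data: the first feature dominates, the third is pure noise.
names = ["scaled_score", "indicator_b", "indicator_c"]  # hypothetical names
X = rng.uniform(0, 100, size=(500, 3))
y = 2.0 * X[:, 0] + 0.5 * X[:, 1] + rng.normal(0, 5, 500)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# 66 trees, matching the count reported in the text. The default loss of
# GradientBoostingRegressor is squared error, i.e. the Gaussian case.
model = GradientBoostingRegressor(n_estimators=66, random_state=0)
model.fit(X_train, y_train)

# Feature importances sum to 1; scale to percentages for readability.
relative_influence = 100.0 * model.feature_importances_
for name, pct in sorted(zip(names, relative_influence), key=lambda t: -t[1]):
    print(f"{name}: {pct:.1f}%")

r2 = model.score(X_test, y_test)  # R^2 on the held-out test data
print("test R^2:", round(r2, 3))
```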

Figure 4. Relative influence plot
The out-of-bag (OOB) improvement plot shows the out-of-bag change in Gaussian deviance against the number of trees for the training data. The OOB improvement plot helps to estimate the prediction accuracy of boosting regression. Figure 5 illustrates OOB changes versus number of trees for the training dataset. We observe that as the number of trees approaches 66, accuracy stabilises and deviance reduces.
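The out-of-bag improvement per tree can be reproduced in outline as shown here. The sketch assumes scikit-learn, synthetic data and a subsampling fraction of 0.5; none of these come from the study. scikit-learn records OOB improvement only when each tree is fitted on a subsample, which is why `subsample` is set below 1.0.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(3)
X = rng.uniform(0, 100, size=(500, 3))
y = 2.0 * X[:, 0] + 0.5 * X[:, 1] + rng.normal(0, 5, 500)

# subsample < 1.0 leaves part of the training data out of each tree,
# which makes the out-of-bag estimate available.
model = GradientBoostingRegressor(
    n_estimators=66, subsample=0.5, random_state=0
).fit(X, y)

# oob_improvement_[i] is the reduction in out-of-bag loss from adding tree i.
oob = model.oob_improvement_
cumulative = np.cumsum(oob)

# The tree count with the largest cumulative OOB improvement is a common
# heuristic for choosing how many trees to keep.
best_n_trees = int(np.argmax(cumulative)) + 1
print("first-tree improvement:", round(float(oob[0]), 2))
print("OOB-suggested number of trees:", best_n_trees)
```

Plotting `oob` against the tree index would give a curve of the same kind as the OOB improvement plot described above.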

Figure 5. OOB improvement plot
The predictive performance plot provides a visual representation of predicted values versus observed values. Figure 6 demonstrates that a linear relationship exists between the observed test values and the predicted test values.

Figure 6. Predictive performance plot
The deviance plot provides a graphical representation of Gaussian deviance against the number of trees. Figure 7 illustrates that as the number of trees increases, deviance reduces and greater prediction accuracy can be achieved.
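The relationship between deviance and the number of trees can be traced by evaluating the model after each boosting stage. The sketch below assumes scikit-learn and synthetic data, and uses mean squared error as the Gaussian deviance. It illustrates the idea and does not reproduce the study's figure.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(4)
X = rng.uniform(0, 100, size=(500, 3))
y = 2.0 * X[:, 0] + 0.5 * X[:, 1] + rng.normal(0, 5, 500)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)
model = GradientBoostingRegressor(n_estimators=66, random_state=0)
model.fit(X_train, y_train)

# Training deviance after each tree is stored by the fitted model.
train_deviance = model.train_score_

# Test deviance after each tree: staged_predict yields the prediction of
# the ensemble truncated to 1, 2, ..., 66 trees.
test_deviance = [
    mean_squared_error(y_test, pred) for pred in model.staged_predict(X_test)
]

for n_trees in (1, 10, 33, 66):
    print(n_trees, "trees, test deviance:", round(test_deviance[n_trees - 1], 1))
```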

Figure. Predictive performance plot

The resultant outcome of boosting regression is shown in Table 2. Test results are evaluated using mean square error (MSE) values. A lower mean square error indicates a better test result.
Table 2. Boosting regression result
In regression problems, evaluation metrics are used to check the performance of algorithms or models (Alexei, 2018). With evaluation metrics, test results are compared using values such as root mean square error (RMSE), mean absolute deviation (MAD), mean absolute error (MAE), mean absolute percentage error (MAPE) and R² (R-squared, or coefficient of determination). Table 2 demonstrates the evaluation metrics.
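The evaluation metrics named above can be computed directly from observed and predicted values. The following is a minimal sketch using standard textbook definitions; the function name `regression_metrics` and the sample numbers are hypothetical. MAD is omitted because its definition varies between tools.

```python
import math


def regression_metrics(observed, predicted):
    """Return MSE, RMSE, MAE, MAPE (in percent) and R^2 for two sequences."""
    n = len(observed)
    errors = [o - p for o, p in zip(observed, predicted)]

    mse = sum(e * e for e in errors) / n
    rmse = math.sqrt(mse)
    mae = sum(abs(e) for e in errors) / n
    # MAPE is undefined when an observed value is zero.
    mape = 100.0 * sum(abs(e / o) for e, o in zip(errors, observed)) / n

    mean_obs = sum(observed) / n
    ss_res = sum(e * e for e in errors)                  # residual sum of squares
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)  # total sum of squares
    r2 = 1.0 - ss_res / ss_tot

    return {"MSE": mse, "RMSE": rmse, "MAE": mae, "MAPE": mape, "R2": r2}


# Made-up observed and predicted values for illustration.
metrics = regression_metrics([10, 20, 30, 40], [12, 18, 33, 39])
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

For this made-up sample the MSE is 4.5 and R² is 0.964. A lower MSE, RMSE, MAE or MAPE and an R² closer to 1 indicate a better fit.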

Table 4. Relative influence of variables