Using Random Forest for Future Sea Level Prediction

This research paper presents an investigation into using the random forest algorithm for predicting future sea level. Sea level is a critical indicator of the health of our oceans and coastal areas and is tracked through long-term observational records. The study employs the random forest algorithm, a powerful machine learning technique, to analyze a dataset of sea level observations. The results demonstrate that the random forest algorithm can accurately predict future sea level changes. These findings have important implications for coastal management and adaptation strategies: the approach provides a valuable tool for decision-makers and coastal managers, allowing more informed and proactive planning for sea level rise. Overall, the paper shows that the random forest algorithm is a promising method for sea level prediction and highlights the importance of continued research in this area.


Introduction
Chen, Xiaoyu, et al. [1] published a study in 2017 on the rate of increase in global mean sea level (GMSL) between 1993 and 2014, indicating that the risk of flooding and erosion to coastal areas is increasing. Meier, Mark F., et al. [2] found that the primary contributors to global average sea level rise are the melting of ice in the polar regions and glaciers. In 1987, Wigley, Tom M. L., et al. [3] observed that thermal expansion of the ocean's water mass adds substantial volume to the ocean; warmer water temperatures will likewise cause the ocean to expand. The influence of the coupled ocean-atmosphere system can produce periods in which the ocean warms more slowly or more quickly. According to studies by Cutler, K. B., et al. [4], temperature change is a dynamical system that can vary in unforeseen ways. From this perspective, applications of modern statistical analysis, such as machine learning (ML) techniques, can offer a more efficient method for estimating the future sea level trend. Several machine learning techniques have been used to discover patterns in data that can forecast certain phenomena, and recent years have seen machine learning applied to classify samples and predict diverse outcomes. There is still room for growth in applying machine learning to marine studies: these methods can help identify complex interactions between variables and determine their relative significance. Nieves, Veronica, et al. [5] employed nonlinear Gaussian process and recurrent neural network approaches to measure coastal sea level fluctuation across a range of timescales in distinct locales. Nicolas Guillou et al. [6] evaluated the performance of two types of machine learning algorithms: multiple regression methods based on linear and polynomial regression functions, and an artificial neural network called the multilayer perceptron.
This work will use the Random Forest (RF) algorithm to determine the future sea level. These forecasts were evaluated in light of satellite-observed trends in sea level since 1993.
The remainder of this report is organized as follows. The first section describes the research methodology used in this project, including dataset descriptions and data preprocessing. The second section presents the evaluation of the project, with figures and test results. The final section concludes the study, covering any challenges or problems that arose, and closes with a discussion of possible improvements.

Research Method
Random forest is an ensemble learning method for classification, regression, and other problems that constructs numerous decision trees during training [7]. Decision trees are among the most popular and effective methods for supervised learning due to their clarity and practicality. A tree consists of nodes and leaves connected by edges: data is repeatedly partitioned by posing a question at each node, and the leaves represent the decisions or final results. There are two sorts of trees: classification and regression. If the target is a categorical variable, such as "suitable" or "unsuitable," the structure is a classification tree; if the target variable may take on continuous values, such as real numbers, it is a regression tree.
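As a minimal sketch of the two tree types, scikit-learn provides both variants; the toy data below is illustrative, not the paper's dataset:

```python
# Minimal sketch: the same tree structure acts as a classifier or a
# regressor depending on the type of the target variable (toy data).
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor

X = [[0], [1], [2], [3]]

# Classification tree: categorical target such as "suitable"/"unsuitable".
clf = DecisionTreeClassifier(random_state=0)
clf.fit(X, ["unsuitable", "unsuitable", "suitable", "suitable"])

# Regression tree: continuous target such as sea level in millimeters.
reg = DecisionTreeRegressor(random_state=0)
reg.fit(X, [1.2, 1.9, 3.1, 4.0])

print(clf.predict([[3]]))  # a class label
print(reg.predict([[3]]))  # a real number
```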
Ensemble learning uses many learning algorithms to achieve greater predictive performance than any of the constituent learning algorithms alone; it is a technique that integrates multiple classifiers to solve complicated problems [8].
Compared to a single decision tree, random forests have several advantages. First, a random forest consists of many individual trees, each built on a random sample of the training data, and is often more accurate than any single decision tree. Second, whereas a single decision tree requires pruning to prevent overfitting, the testing performance of a random forest does not decline as the number of trees increases; performance tends to stabilize after a certain number of trees, lowering the risk of overfitting [9].
Extra Trees and Random Forest are both ensemble methods that create multiple decision trees and combine their predictions to make a final prediction. The main difference between the two methods is how the decision trees are constructed.
In Random Forest, the decision trees are constructed using the random subspace method and bagging. The random subspace method involves selecting a random subset of features to consider at each split in the tree, while bagging involves training each decision tree on a different random subset of the data. Additionally, feature importance is taken into account during tree construction in Random Forest.
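The two sources of randomness described above map directly onto arguments of scikit-learn's RandomForestRegressor; the sketch below uses synthetic data with seven features as a stand-in for the real dataset:

```python
# Sketch of bagging and the random subspace method in a random forest.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.RandomState(0)
X = rng.rand(200, 7)                          # seven synthetic features
y = X @ rng.rand(7) + 0.1 * rng.randn(200)    # noisy linear target

model = RandomForestRegressor(
    n_estimators=100,
    bootstrap=True,       # bagging: each tree sees a bootstrap sample of rows
    max_features="sqrt",  # random subspace: random feature subset per split
    random_state=0,
)
model.fit(X, y)
print(model.feature_importances_)  # impurity-based feature importance
```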
On the other hand, in Extra Trees, the decision trees are constructed using a top-down, greedy splitting approach in which the splitting process is randomized: instead of calculating the optimal split point for each feature, a random split point is chosen. This randomization tends to make the individual trees more diverse.
After picking the model, cross-validation can be performed to evaluate its performance. Cross-validation is a statistical method used in machine learning to estimate a model's skill by training it on a portion of the data and testing it on the remaining portion. It is often used to compare and select models for a predictive modeling problem because it is easy to understand and implement, and it typically provides less biased estimates of model skill than other methods. By repeatedly training on known data and evaluating on held-out data, cross-validation assesses the model's ability to make predictions on unseen data and helps to avoid issues such as overfitting.
There are several types of cross-validation; in this research, k-fold cross-validation is used. In k-fold cross-validation, the original sample is randomly divided into k equal-sized subsamples (folds). One fold is used as the validation set, while the remaining k-1 folds are used as the training set. This process is repeated k times, with each fold serving as the validation set exactly once.
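A minimal k-fold sketch with scikit-learn; the fold count, synthetic data, and R² scoring below are illustrative choices, not the paper's configuration:

```python
# 5-fold cross-validation of a random forest regressor on synthetic data.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import KFold, cross_val_score

rng = np.random.RandomState(0)
X = rng.rand(100, 7)
y = X.sum(axis=1) + 0.1 * rng.randn(100)

kfold = KFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(RandomForestRegressor(random_state=0), X, y,
                         cv=kfold, scoring="r2")
print(scores)         # one R^2 score per fold
print(scores.mean())  # average skill estimate across folds
```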
There are several advantages to using k-fold cross-validation. First, it can reduce bias. Second, all data points will be used both as training and validation data at some point, which helps to prevent the model from overfitting to the training data. Third, as k increases, the variance of the resulting model skill estimate decreases. However, there are also some disadvantages to k-fold cross-validation. One disadvantage is that the model must be trained and evaluated k times, which can increase the computational cost. Additionally, the results of k-fold cross-validation may depend on the specific data partitioning, which can introduce some variability in the results.
The model must be tuned over a number of hyperparameters. A parameter is a value that can vary independently in the learning algorithm, whereas a hyperparameter is a value that controls the learning process itself [10]. Parameters are estimated automatically from data, whereas hyperparameters are set explicitly and govern how the model parameters are estimated. Hyperparameters are crucial because they control the overall behavior of a machine learning model: they can have a significant effect on training time, required computational resources, and model accuracy. Tuning hyperparameters amounts to minimizing a predetermined loss function, and setting optimal values is essential for maximizing model performance.
Hyperparameters can be tuned in a variety of ways, including random search, grid search, Bayesian optimization, and Tree-structured Parzen estimators (TPE). The grid search method has been selected for this project. There is no way to determine appropriate hyperparameter values before tuning; candidate values must be tried to find the optimal ones. Grid search selects the best parameters from a provided list of alternatives, a task that would otherwise require significant time and resources to perform manually: it exhaustively evaluates all possible combinations before selecting the optimal one, and it is well known for finding the hyperparameters at which the model achieves its highest accuracy.
A dataset is a collection of data used in the machine learning process; it stores information as a unit for the purposes of analysis and prediction. A good dataset comprises all potentially relevant information, uniformly formatted and normalized. The dataset used in this project was obtained from Kaggle and consists of global sea level statistics from 1993 to 2021. It includes seven features: year, total ocean area in square kilometers, standard deviation of GMSL without Glacial Isostatic Adjustment (GIA) in millimeters, smoothed GMSL without GIA in millimeters, variance of GMSL with GIA with respect to the 20-year TOPEX/Jason collinear mean in millimeters, standard deviation of GMSL with GIA in millimeters, and smoothed GMSL with GIA in millimeters. The label for the project is the GMSL without GIA variance with respect to the 20-year TOPEX/Jason collinear mean in millimeters. GMSL refers to the average height of the entire ocean surface, and GIA refers to the adjustments made for the effects of melting ice on sea level. TOPEX/Jason are NASA satellites used to measure global sea level variations with millimeter accuracy over time. As shown in Fig. 1-2, every measured indicator changed between 1993 and 2021, and sea level continues to rise. Data preprocessing must be performed on the dataset to make it appropriate for the machine learning model. Data preparation is essential in machine learning to improve data quality and facilitate the extraction of valuable insights, and it increases the model's precision and effectiveness.
Data preprocessing in Python requires importing predefined libraries; three in particular serve this purpose: NumPy, Matplotlib, and Pandas. In addition to importing the libraries, the dataset must be imported so that dependent and independent variables can be extracted. Using the sklearn library, the next step is to handle missing numeric data and encode categorical data so that the dataset can be divided into training and testing sets for the model. The next step is feature scaling, a method for normalizing the range of the independent variables or data features. Once all phases are complete, the dataset is ready to fit the model. Fig. 3 depicts a heatmap indicating the correlation between the various features and the label.
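The preprocessing steps above can be sketched as follows; the column names and toy values are assumptions for illustration, not the actual Kaggle dataset:

```python
# Illustrative preprocessing: imputation of missing values, then scaling.
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler

# Stand-in for loading the real CSV with pd.read_csv(...).
df = pd.DataFrame({
    "year": [1993, 1994, 1995, 1996],
    "gmsl_no_gia": [0.0, 3.1, np.nan, 9.8],  # one missing numeric value
})

# Handle missing numeric data with mean imputation.
imputed = SimpleImputer(strategy="mean").fit_transform(df)

# Feature scaling: normalize each feature to zero mean and unit variance.
scaled = StandardScaler().fit_transform(imputed)
print(scaled.mean(axis=0))  # approximately zero per column
```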

Evaluation
Before discussing hyperparameter tuning, a quick review: this is a random forest regression problem. We attempt to predict future sea level using historical data from 1993 to 2021. The ratio of training, testing, and validation data is 75:10:15, and seven features are used to predict the label.
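One way to realize a 75:10:15 split is two successive calls to train_test_split; the paper does not specify its exact procedure, so this is only a plausible sketch on toy data:

```python
# 75:10:15 train/test/validation split via two train_test_split calls.
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(200).reshape(100, 2)  # 100 toy samples
y = np.arange(100)

# Hold out 25% of the data, then divide it 10:15 into test and validation.
X_train, X_rest, y_train, y_rest = train_test_split(
    X, y, test_size=0.25, random_state=0)
X_val, X_test, y_val, y_test = train_test_split(
    X_rest, y_rest, test_size=0.4, random_state=0)  # 0.4 * 25% = 10% test

print(len(X_train), len(X_val), len(X_test))  # 75 15 10
```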
Because the data has already been cleaned, no further preprocessing is needed, and hyperparameter tuning can begin immediately. The available hyperparameters can be examined by inspecting their default settings. In this project, the hyperparameter n_estimators is primarily tuned.
The n_estimators parameter indicates the number of trees in the model's forest. The default value for this argument is 10 (100 in scikit-learn 0.22 and later), so by default the random forest generates that many distinct decision trees. Tuning it with k-fold cross-validation involves multiple iterations, which is a lengthy procedure because numerous training loops must be run. Using the RandomizedSearchCV method from sklearn, however, we can define a grid of hyperparameter ranges and randomly sample from it, running k-fold cross-validation for each sampled combination.
To use RandomizedSearchCV, a parameter grid is generated, and each iteration of the algorithm selects a different combination of values. Random search does not need to test every possible combination; instead, it samples values at random. The best parameters found can then be read off after fitting the random search.
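A sketch of the parameter grid and random sampling with RandomizedSearchCV; the grid values below are illustrative, not those used in the project:

```python
# Random search over a small hyperparameter grid with cross-validation.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV

rng = np.random.RandomState(0)
X = rng.rand(120, 7)
y = X.sum(axis=1) + 0.1 * rng.randn(120)

param_grid = {
    "n_estimators": [10, 50, 100, 200],
    "max_depth": [None, 5, 10],
    "max_features": ["sqrt", "log2", None],
}

# Sample 8 random combinations instead of trying all 4 * 3 * 3 = 36.
search = RandomizedSearchCV(RandomForestRegressor(random_state=0),
                            param_grid, n_iter=8, cv=3, random_state=0)
search.fit(X, y)
print(search.best_params_)
```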
Random search allowed us to narrow the range for each hyperparameter. We can then use GridSearchCV to evaluate every remaining parameter combination and identify the optimal values. The results of hyperparameter tuning are depicted in Fig. 4 and 5.
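The follow-up exhaustive search could then look like this; the narrowed ranges below are assumptions for illustration:

```python
# Exhaustive grid search over a range narrowed by the earlier random search.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

rng = np.random.RandomState(0)
X = rng.rand(120, 7)
y = X.sum(axis=1) + 0.1 * rng.randn(120)

# Every combination in the narrowed grid is evaluated with 3-fold CV.
narrow_grid = {"n_estimators": [80, 100, 120], "max_depth": [8, 10, 12]}
grid = GridSearchCV(RandomForestRegressor(random_state=0), narrow_grid, cv=3)
grid.fit(X, y)
print(grid.best_params_, grid.best_score_)
```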

Conclusion
Using machine learning with the random forest model to forecast sea level rise is a novel approach in marine studies. However, this project still has room for improvement in a few areas. First, instead of projecting global changes in sea level, different oceans such as the Pacific, Indian, and Atlantic should be modeled separately. Second, for greater precision, the model should incorporate additional data, such as other factors connected to sea level rise, rather than relying solely on sea level records.