Possibilities of application of logistic regression in hydrological forecasts (on the example of the mountain river Samur)

The possibilities of constructing regression equations for predicting the runoff of mountain and semi-mountain rivers are considered. A predictive equation has been obtained for the Samur River catchment that links water discharge to its predictors through causal relationships: water levels, air temperature, air humidity, atmospheric pressure, precipitation, dew point temperature, wind direction, and cloudiness. Logistic regressions are obtained, which allow categorical variables to be used as independent variables. The result of a logistic regression forecast is the probability of occurrence or non-occurrence of the predicted value. The advantages and drawbacks of this approach for mountain rivers are identified; they concern the interpretation of the predicted event probability. Actions are proposed for obtaining more reliable forecasts.


Introduction
A feature of logistic regression is the possibility of using it to study the logical relationships between categorical variables. This type of regression is not common in the practice of hydrological forecasts, since hydrological characteristics are mostly quantitative. But when assessing the relationship between hydrological and meteorological quantities, which can be categorical, the establishment of a logistic regression acquires practical relevance.
Several researchers have noted the effective use of regression dependencies for forecasting the hydrological characteristics of mountain rivers. For example, multifactorial dependencies between river flow during the growing season and precipitation have been calculated for some mountain rivers of Uzbekistan [1]. To predict the water level of the Mzymta River (Krasnodar Territory), methods based on regression analysis and on neural network technologies have been proposed, which give approximately equal results [2], as well as a method based on the theory of Markov processes with discrete time [3]. Multiple regression with two predictors, water discharge for the previous period and precipitation, improved forecasts for the Narym River (a right tributary of the Irtysh in the East Kazakhstan region of Kazakhstan) [4]. For the Amyl River (a mountain river in the Krasnoyarsk Territory), a comparative analysis of methods for forecasting maximum water levels showed that one-factor dependencies have higher determination coefficients than the multiple regression model [5].
The mountain rivers of the Caucasus have significant rainfall throughout the warm hydrological year, and solid precipitation exceeds 40-50% of the total. Using the Terek River as an example, [6] presents a dependence for the short-term forecast of water discharges, and for the Gusurchay, Velvelichai, and Zagemchay rivers, multiple linear regression equations for predicting runoff from precipitation were obtained in [7].
The purpose of the study was to test the logistic regression method on the watershed of the Samur River and to evaluate the possibilities and effectiveness of this method for predicting water discharges.

Initial data
The possibilities of using logistic regression for hydrological forecasts were studied in the catchment area of the Samur River (Fig. 1). Discharges and water levels at the station in the village of Usukhchay for 2013, 2014, and 2015 were used [8]. Meteorological data for the Usukhchay station included quantitative variables (air temperature, air humidity, atmospheric pressure, precipitation, and dew point temperature) and categorical variables (wind direction and cloudiness) [9]. Fig. 2 shows examples of chronological graphs of changes in meteorological quantities.
The Samur River is located in southern Dagestan; the border between the Russian Federation and Azerbaijan runs along part of its channel. The river is 213 kilometers long, its total fall is 2910 meters, the average slope is 17.7 ‰, the catchment area is 4990 km², and the average height is 1970 meters. The Main Caucasian Range forms the boundary of the Samur River basin in the southwest, and the northern spurs of the Side Range form it in the northeast. In the lower reaches of the river, the basin boundaries are not clearly expressed. In the annual series of water discharges, the periods of spring flood and rain flood were singled out as those most in demand when issuing hydrological forecasts. Since the Samur is a mountain river with a distinctive runoff formation regime [10], it was difficult to identify these periods in the available initial data: in 2013 and 2015 the spring flood passes smoothly into a rain flood, whereas in 2014 there is a low-water period between the spring flood and the rain flood (see Fig. 3).

Methods
Regression models are a way to study causal relationships, i.e., to assess the interaction of variables. When constructing a logistic regression, the dependent variables (predictands) must be categorical, while the independent variables can be both categorical and quantitative [11]. The dichotomized dependent variable evaluates not the probability of occurrence itself, but the logarithm of the ratio of the probabilities of occurrence and non-occurrence:
$$\mathrm{logit} = \ln\frac{P}{1-P} \quad (1)$$
Here P is the probability of the event, whose occurrence is encoded by the value 1 (100 % probability of occurrence); accordingly, the probability of non-occurrence is 1-P, and P/(1-P) is the odds of occurrence versus non-occurrence.
The logit is calculated by the expression:
$$\mathrm{logit} = B_0 + B_1 x_1 + B_2 x_2 + \dots + B_k x_k \quad (2)$$
where $x_i$ ($i = 1, \dots, k$) are the independent variables; $B_i$ are the coefficients of multiple linear regression (they show how much the logit changes on average when the corresponding independent variable changes), with $B_0$ the free term.
The probability of event occurrence is calculated using the following expression:
$$P = \frac{e^{\mathrm{logit}}}{1 + e^{\mathrm{logit}}} = \frac{1}{1 + e^{-\mathrm{logit}}} \quad (3)$$
Qualitative assessment: the resulting dichotomized dependent variable can characterize possible events: P > 0.5, occurrence; P < 0.5, non-occurrence.
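As a minimal sketch of the calculation chain described above (a linear logit, its conversion to a probability, and dichotomization at 0.5), the following Python fragment may be helpful. The free term `B0`, the coefficient values, and the predictor values are hypothetical and serve only as an illustration; they are not the coefficients obtained for the Samur River.

```python
import math

def logit(p):
    """Log-odds ln(P / (1 - P)) for a probability 0 < p < 1."""
    return math.log(p / (1.0 - p))

def linear_logit(intercept, coeffs, x):
    """Logit as a linear combination B_0 + sum(B_i * x_i)."""
    return intercept + sum(b * xi for b, xi in zip(coeffs, x))

def probability(z):
    """Inverse of the logit: P = 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + math.exp(-z))

def classify(p, threshold=0.5):
    """Return 1 (occurrence) if p exceeds the threshold, else 0.

    The text leaves P = 0.5 undefined; assigning it to
    non-occurrence here is an arbitrary choice.
    """
    return 1 if p > threshold else 0

# Hypothetical coefficients and predictor values (illustration only).
B0, B = -1.0, [0.8, 0.05]
x = [2.0, 10.0]          # e.g. a water level and an air temperature

z = linear_logit(B0, B, x)   # linear predictor (the logit)
p = probability(z)           # probability of occurrence
event = classify(p)          # dichotomized outcome
```

Because the probability is a monotonically increasing function of the logit, any predictor with a positive coefficient raises the estimated probability as its value grows.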

Results and Discussion
In the first stage, the relationships between all hydrometeorological variables were evaluated using the calculated correlation matrix. As expected, the strongest relationships were found between water discharges and water levels, as well as between water discharges and air temperatures and dew points.
For categorical variables, the Spearman rank correlation coefficient was calculated; it is a method of correlation analysis that reflects the relationship between the ranks of variables sorted in increasing order [12]. The Spearman correlation coefficient lies in the range from -1 to +1 and characterizes the direction and strength of the relationship between features measured on a rank scale.
Spearman correlation coefficients were calculated for cloud amount, wind direction, and water discharge. The correlation between water discharge and cloud amount reaches 0.77, and between water discharge and wind direction 0.56.
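The rank correlation described above can be sketched as follows: each series is replaced by its ranks (tied values share the mean rank), and the ordinary correlation coefficient is then computed on the ranks. The cloud amounts (assumed here to be in oktas) and discharges are hypothetical values, not station data. How wind direction was coded numerically is not stated in the source, so it is left out of the example.

```python
import math

def average_ranks(values):
    """Rank values from 1 to n; tied values receive the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values equal to values[order[i]].
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks

def pearson(a, b):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)

def spearman(a, b):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(a), average_ranks(b))

# Hypothetical daily values (illustration only).
cloud_oktas = [0, 2, 5, 8, 7, 3]
discharge = [10.0, 12.5, 20.1, 35.0, 30.2, 14.0]

rho = spearman(cloud_oktas, discharge)
```

In this artificial example the two series are perfectly monotonically related, so the coefficient reaches its upper bound; real series would give intermediate values such as those reported above.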
When constructing the regression equations, the lead time of the forecast was taken into account; it was set to 1 day, since with a longer lead time the connection between the predictand and the predictors is lost.
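One possible way to account for the lead time when preparing the data is to pair the predictors observed on day t with the discharge observed on day t + 1. The sketch below assumes daily series of equal length; the function name `make_lagged_pairs` and the sample values are hypothetical.

```python
def make_lagged_pairs(predictors, discharge, lead_days=1):
    """Pair predictors on day t with discharge on day t + lead_days.

    predictors: list of per-day predictor vectors.
    discharge:  list of per-day discharges, same length as predictors.
    """
    if lead_days < 1:
        raise ValueError("lead_days must be at least 1")
    if len(predictors) != len(discharge):
        raise ValueError("series must have the same length")
    X = predictors[:-lead_days]   # drop the last days (no target available)
    y = discharge[lead_days:]     # drop the first days (no predictors available)
    return X, y

# Hypothetical four-day series: one predictor (water level) per day.
levels = [[1.0], [1.2], [1.5], [1.4]]
flows = [10.0, 12.0, 18.0, 16.0]

X, y = make_lagged_pairs(levels, flows, lead_days=1)
```

With a lead time of 1 day, one pair is lost at the end of the series; a longer lead time shortens the usable sample accordingly.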
Figure 4 shows an example of hydrographs built from actual data and from data calculated using the regression equation, with and without water levels (it is not advisable to exclude the water level from consideration).
As an example, Figure 5 shows the results of forecasting for the spring flood of 2015. The following results were obtained:
- regression equations were built for 2013 and 2014, and the forecast was issued for 2015; for the spring flood the technique is effective, since the shapes of the flood hydrographs are similar in all three years;
- the hydrographs of rain floods are not similar, and the technique showed unsatisfactory results.
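The scheme of fitting on two years and forecasting a third can be illustrated with a small logistic regression fitted by gradient descent. This is only a sketch under stated assumptions: the source does not say how the discharge was dichotomized or which fitting procedure was used, so the event here is assumed to be "discharge exceeds a chosen threshold", and all predictor values (given as hypothetical standardized anomalies of water level and air temperature) are invented for illustration.

```python
import math

def sigmoid(z):
    """Numerically stable logistic function 1 / (1 + exp(-z))."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def fit_logistic(X, y, lr=0.5, epochs=5000):
    """Estimate B_0 and B_i by batch gradient descent on the mean log-loss."""
    n, k = len(X), len(X[0])
    b0, b = 0.0, [0.0] * k
    for _ in range(epochs):
        g0, g = 0.0, [0.0] * k
        for xi, yi in zip(X, y):
            # Difference between predicted probability and observed outcome.
            err = sigmoid(b0 + sum(bj * xj for bj, xj in zip(b, xi))) - yi
            g0 += err
            for j in range(k):
                g[j] += err * xi[j]
        b0 -= lr * g0 / n
        b = [bj - lr * gj / n for bj, gj in zip(b, g)]
    return b0, b

def predict_proba(b0, b, X):
    """Probability of the event for each predictor vector in X."""
    return [sigmoid(b0 + sum(bj * xj for bj, xj in zip(b, xi))) for xi in X]

# Hypothetical pooled training sample ("2013 and 2014"):
# [water-level anomaly, air-temperature anomaly] and the event indicator
# (1 = discharge above the assumed threshold on the following day).
train_X = [[-1.2, -1.0], [-0.8, -0.6], [-0.4, -0.3], [-0.1, -0.5],
           [0.2, 0.4], [0.6, 0.7], [0.9, 0.3], [1.3, 1.1]]
train_y = [0, 0, 0, 0, 1, 1, 1, 1]

# Hypothetical days of the forecast year ("2015").
test_X = [[-1.0, -0.8], [0.95, 0.9]]

b0, b = fit_logistic(train_X, train_y)
p_test = predict_proba(b0, b, test_X)
forecast = [1 if p > 0.5 else 0 for p in p_test]
```

The forecast is reliable here only because the test days resemble the training days, which mirrors the observation above that the technique works when the hydrograph shapes of the training and forecast years are similar.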
Expanding the initial database for constructing regression equations to include various forms of hydrographs, or vice versa, using only an analog year with the corresponding hydrograph form, will increase the efficiency of the approach under consideration.
The use of logistic regression made it possible to estimate the probability of occurrence of each predicted water discharge. However, a regularity follows from expression (3): large discharges are inherently assigned a high probability of occurrence, which can be seen in Fig. 5b.

Conclusion
The use of regression models for prognostic purposes is the simplest and most physically reasonable approach to predicting the characteristics of natural processes. Regression equations allow you to study the causal relationships that characterize the interaction of variables, in which some are causes and others are consequences. In practice, models that use several independent variables are most often used, and multiple regressions are built.
Testing the logistic regression method on the watershed of the Samur River and assessing its capabilities and effectiveness for predicting water discharge revealed the following:
- the logistic regression method is effective for hydrological forecasts if the initial data contain an analog year for the forecast year;
- the approach to assessing the probability of non-occurrence is not reliable in forecasting extreme water discharges;
- the rank correlation method has shown its effectiveness for categorical hydrometeorological variables.
It is planned to develop an approach that would automatically select the years used to build a regression model [13]; the use of artificial neural networks is also possible [14].