Using ARIMA and BP neural network to analyse incidence rate of AIDS in China

To analyse the characteristics of AIDS transmission from incidence, we used ARIMA and BP neural networks to model the incidence of AIDS and to predict it from the fitted models. When the sequence is a small-sample, non-stationary sequence, the input of the BP neural network can be either the raw data or the stationary sequence obtained in the ARIMA procedure. When the stationary sequence of incidence is used as the input of the BP neural network, the output corresponding to the raw data can be obtained by matrix operations. Results show that raw data combined with the stationary sequence as the input of the BP neural network gives better modelling results. Moreover, all the predicted values fall within the 95% CI of the ARIMA model. Although a previous study (reference 14) also used BP to predict the incidence of AIDS, the use of the stationary series as the input of BP is original to this study.


Introduction
Acquired immune deficiency syndrome (AIDS) is a condition caused by infection with the human immunodeficiency virus (HIV) [1,2]. As time goes on, people with AIDS have a growing risk of developing various virus-induced cancers due to progressive failure of the immune system [3,4]. The first reported case of AIDS was a gay man in the United States in 1981 [5]. The first case of AIDS in China was found in 1985 [6].
According to Confronting Inequalities: Lessons for pandemic responses from 40 years of AIDS, issued by UNAIDS in July 2021, the number of people living with HIV rose to about 37.7 million [30.2 million to 45.1 million] in 2020. About 1.5 million [1 million to 2 million] people were newly infected in 2020, and about 680,000 [480,000 to 1 million] died from AIDS-related diseases [7]. In 2021, there were about 61.03 thousand new HIV-infected people and about 20.26 thousand deaths in China, accounting for 89.61 percent of the total deaths from infectious diseases [8].
AIDS seriously threatens human health, and the current Highly Active Antiretroviral Therapy (HAART) still has limitations. Not only can HAART not remove the virus from the body, but it also requires lifelong treatment [9]. Taking medicine for a long time may lead to drug resistance, and the treatment is expensive. According to Confronting Inequalities: Lessons for pandemic responses from 40 years of AIDS, by the end of 2020, low- and middle-income countries had spent about $21.5 billion on measures to control HIV/AIDS, 61% of which came from domestic funds [7].
Analysis of the AIDS epidemic situation is one of the vital branches of epidemiologic study. Currently, various mathematical models are used to forecast the incidence rate of infectious diseases. However, due to the complexity and variability of practical problems, it is difficult to find an appropriate prediction model [10,11]. The objective of the present study is to develop an Autoregressive Integrated Moving Average (ARIMA) model and a Back Propagation (BP) neural network model for analysing the AIDS epidemic from the perspective of incidence, by analyzing actual data and taking the characteristics of AIDS transmission into account.
Applications of artificial neural networks include waveform analysis of biomedical signals, medical image analysis, outcome prediction, analysis of biochemical data, heart sound analysis for valve diagnostics, eye tracking, diagnosis of myocardial infarction, and automatic detection of diabetic retinopathy, nephritis, and heart disease [12]. In most studies, researchers use the raw data as the input of the neural network. In this study, we also use the stationary sequence from the ARIMA model as the input of the BP neural network. The results indicate that raw data combined with the stationary sequence of the raw data as input gives better modeling results.

ARIMA model [13,14]
The ARIMA model is also known as the Box-Jenkins model. Depending on whether the data are stationary and on which parts of the regression are included, the ARIMA model has three basic types: Moving Average (MA), Auto-Regressive (AR), and ARIMA.
A non-seasonal ARIMA model is generally denoted ARIMA(p,d,q). If the series is stationary, the ARIMA model can be expressed as:
$$ \hat{y}_t = \mu + \phi_1 y_{t-1} + \dots + \phi_p y_{t-p} - \theta_1 e_{t-1} - \dots - \theta_q e_{t-q} \quad (1) $$
where p is the number of autoregressive terms, q is the number of lagged forecast errors in the prediction equation, \hat{y}_t is the estimated value at time t, e_t is the forecast error at time t, and \mu, \phi_i and \theta_j are the model coefficients. If the series is non-stationary, d is the number of non-seasonal differences needed for stationarity. Construction of the ARIMA model includes five steps: data stabilization, model identification, parameter estimation, diagnostic testing, and analysis and evaluation of the model prediction results.
First, the data are made stationary, if necessary, by differencing. Second, we calculate the autocorrelation function (ACF) and partial autocorrelation function (PACF) of the d-order difference sequence; the ACF and PACF plots are used to determine p and q of the ARIMA model.
Third, we check the residual diagnostics of the fitted model, particularly the residual ACF and PACF plots.
Finally, any patterns that remain in the residual ACF and PACF may suggest the need for additional AR or MA terms.
ARIMA model construction was performed using the SPSS for Windows software package (ver. 24.0, IBM).
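The differencing and ACF steps described above can be illustrated with a minimal sketch in plain Python. The function names `difference` and `sample_acf` and the toy series are assumptions made for illustration; this is not the SPSS procedure used in the study, and the series is not the AIDS incidence data.

```python
def difference(series, order=1):
    """Apply `order` rounds of first differencing to a list of numbers."""
    out = list(series)
    for _ in range(order):
        out = [b - a for a, b in zip(out, out[1:])]
    return out


def sample_acf(series, max_lag):
    """Sample autocorrelation r_k for lags k = 0..max_lag."""
    n = len(series)
    mean = sum(series) / n
    dev = [x - mean for x in series]
    c0 = sum(d * d for d in dev)  # lag-0 sum of squared deviations
    return [sum(dev[t] * dev[t + k] for t in range(n - k)) / c0
            for k in range(max_lag + 1)]


# Hypothetical upward-trending series (illustration only).
toy = [1.0, 1.3, 1.9, 2.8, 3.6, 4.9, 6.1, 7.8, 9.2, 11.1]
second_diff = difference(toy, order=2)
acf = sample_acf(second_diff, max_lag=3)

# Approximate 95% band for the ACF of white noise; lags whose ACF stays
# inside +/- bound give no evidence for extra AR or MA terms.
bound = 1.96 / len(second_diff) ** 0.5
```

If all non-zero lags of the differenced series stay inside the band, a model with p = 0 and q = 0 on that difference is a reasonable candidate.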

BP neural network [14]
The BP neural network is one of the most widely applied neural network models. The model is a feedforward neural network whose main characteristic is that signals are transmitted forward and errors are transmitted backward. In the learning process, backpropagation is used to update the weights and thresholds of the network so as to minimize the sum of squared errors. Hecht-Nielsen proved that a three-layer feed-forward network with one hidden layer can be used to learn and store the relationship between the input and output.
{X1, X2, …, Xn} are the input values of the BP neural network. Xi preserves the coordinates of feature points on the preoperative model.
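The forward transmission of signals and backward transmission of errors can be sketched as a small one-hidden-layer network in plain Python. This is only an illustrative sketch: the function name `train_bp`, the tanh hidden units, the linear output, the learning rate, and the toy data are all assumptions, and it is not the MATLAB toolbox configuration used in the study.

```python
import math
import random


def train_bp(xs, ys, hidden=4, lr=0.1, epochs=2000, seed=0):
    """Train a 1-input, 1-hidden-layer, 1-output network by backpropagation.

    Hidden units use tanh and the output unit is linear. Returns a predict
    function and the per-epoch sums of squared errors.
    """
    rng = random.Random(seed)
    w1 = [rng.uniform(-0.5, 0.5) for _ in range(hidden)]  # input -> hidden weights
    b1 = [0.0] * hidden                                   # hidden thresholds
    w2 = [rng.uniform(-0.5, 0.5) for _ in range(hidden)]  # hidden -> output weights
    b2 = 0.0                                              # output threshold

    def forward(x):
        h = [math.tanh(w1[j] * x + b1[j]) for j in range(hidden)]
        return h, sum(w2[j] * h[j] for j in range(hidden)) + b2

    history = []
    for _ in range(epochs):
        sse = 0.0
        for x, y in zip(xs, ys):
            h, out = forward(x)      # signal is transmitted forward
            err = out - y            # error at the output
            sse += err * err
            for j in range(hidden):  # error is transmitted backward
                grad_h = err * w2[j] * (1.0 - h[j] ** 2)
                w2[j] -= lr * err * h[j]
                w1[j] -= lr * grad_h * x
                b1[j] -= lr * grad_h
            b2 -= lr * err
        history.append(sse)
    return (lambda x: forward(x)[1]), history


# Hypothetical training data scaled to [0, 1] (illustration only).
xs = [i / 9 for i in range(10)]
ys = [x * x for x in xs]
predict, history = train_bp(xs, ys)
```

The list `history` records the sum of squared errors per epoch, which is the quantity the weight and threshold updates aim to reduce.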

Construction of BP neural network
BP neural network model construction was performed using the Neural Network Toolbox for MATLAB from MathWorks, Inc. [15].
At present, many BP neural networks have the architecture shown in Figure 2(a): the inputs are the raw data (R_1, R_2, R_3, …, R_n) and the outputs are RY_i (i=1,2,3,…,n). In this study, both of the architectures shown in Figure 2(a) and 2(b) are used. If the raw data are a non-stationary series, FD_i (i=2,3,…,n) denotes the first difference of the raw data. If the first difference is also a non-stationary series, a second difference is required, and so on, until the S-order difference (SOD_i, i=S+1, S+2, …, n) is stationary. The S-order difference is then used as the input of the BP neural network, and the output is SY'_i (i=S+1, S+2, …, n).
In this study, the maximum order of difference is 2. In Figure 2(b), FD_i (i=2,3,…,n) is the first difference of the raw data. When S=2, SOD_i (i=3,4,…,n) is the second difference of the raw data. The relationships among FD_i (i=2,3,…,n), SOD_i (i=3,4,…,n) and R_i (i=1,2,…,n) are as follows.
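FD_i (i=2,3,…,n) is the first difference of the raw data:
$$ FD_i = R_i - R_{i-1}, \quad i = 2,3,\dots,n \quad (2) $$
Using FD_i (i=2,3,…,n), R_i (i=2,3,…,n) can be represented as:
$$ R_i = R_1 + \sum_{k=2}^{i} FD_k, \quad i = 2,3,\dots,n \quad (3) $$
Using a matrix to describe formula (3):
$$ \begin{pmatrix} R_2 \\ R_3 \\ \vdots \\ R_n \end{pmatrix} = R_1 \begin{pmatrix} 1 \\ 1 \\ \vdots \\ 1 \end{pmatrix} + \begin{pmatrix} 1 & 0 & \cdots & 0 \\ 1 & 1 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 1 & 1 & \cdots & 1 \end{pmatrix} \begin{pmatrix} FD_2 \\ FD_3 \\ \vdots \\ FD_n \end{pmatrix} \quad (4) $$
SOD_i (i=3,4,…,n) is the second difference of the raw data. Using FD_i (i=2,3,…,n), SOD_i (i=3,4,…,n) can be represented as:
$$ SOD_i = FD_i - FD_{i-1}, \quad i = 3,4,\dots,n \quad (5) $$
Using R_1, FD_i (i=2,3,…,n) and SOD_i (i=3,4,…,n), R_i (i=3,4,…,n) can be represented as:
$$ R_i = R_1 + (i-1)\,FD_2 + \sum_{k=3}^{i} (i-k+1)\,SOD_k, \quad i = 3,4,\dots,n \quad (6) $$
Using a matrix to describe formula (6):
$$ \begin{pmatrix} R_3 \\ R_4 \\ \vdots \\ R_n \end{pmatrix} = R_1 \begin{pmatrix} 1 \\ 1 \\ \vdots \\ 1 \end{pmatrix} + FD_2 \begin{pmatrix} 2 \\ 3 \\ \vdots \\ n-1 \end{pmatrix} + \begin{pmatrix} 1 & 0 & \cdots & 0 \\ 2 & 1 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ n-2 & n-3 & \cdots & 1 \end{pmatrix} \begin{pmatrix} SOD_3 \\ SOD_4 \\ \vdots \\ SOD_n \end{pmatrix} \quad (7) $$
If the raw data are a non-stationary series and FD_i (i=2,3,…,n) is stationary, FD_i is input to the BP neural network, and SY'_i (i=2,3,…,n) is the output corresponding to FD_i. The output corresponding to the raw data is obtained by addition and subtraction, using SY'_i (i=2,3,…,n) in formula (3) instead of FD_i. If FD_i (i=2,3,…,n) is still a non-stationary series and SOD_i (i=3,4,…,n) is stationary, SOD_i is input to the BP neural network, and SY'_i (i=3,4,…,n) is the output corresponding to SOD_i. The output corresponding to the raw data is obtained by addition, subtraction and simple multiplication, using SY'_i (i=3,4,…,n) in formula (6) instead of SOD_i.
In brief, if the first difference is stationary, the output corresponding to the raw data is obtained by addition and subtraction from the output of the BP neural network, the first value of the raw data and the first difference (using the matrix of formula (4)). If the first difference is non-stationary and the second difference is stationary, the output corresponding to the raw data is obtained by addition, subtraction and simple multiplication from the output of the BP neural network, the first value of the raw data, the first value of the first difference and the second difference (using the matrix of formula (7)).
Although the first case of AIDS in China was found in 1985, official records of the AIDS incidence rate were not found until 1992. For the five years from 1992 to 1996, the recorded incidence of AIDS was 0.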
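The recovery of raw-scale values from differenced values, referred to above as formulas (3) and (6), can be checked numerically with a short sketch. It assumes that formula (3) is the cumulative-sum identity R_i = R_1 + FD_2 + … + FD_i and that formula (6) is R_i = R_1 + (i-1)·FD_2 + Σ_{k=3..i} (i-k+1)·SOD_k. The function names and the toy series are hypothetical.

```python
def reconstruct_from_first_diff(r1, fd):
    """Return [R_2, ..., R_n] from R_1 and [FD_2, ..., FD_n] (assumed formula 3)."""
    out, total = [], r1
    for d in fd:
        total += d  # running sum of first differences
        out.append(total)
    return out


def reconstruct_from_second_diff(r1, fd2, sod):
    """Return [R_3, ..., R_n] from R_1, FD_2 and [SOD_3, ..., SOD_n] (assumed formula 6)."""
    n = len(sod) + 2
    return [
        r1 + (i - 1) * fd2
        + sum((i - k + 1) * sod[k - 3] for k in range(3, i + 1))
        for i in range(3, n + 1)
    ]


# Toy series (illustration only): R = 1, 4, 9, 16, 25.
raw = [1, 4, 9, 16, 25]
fd = [b - a for a, b in zip(raw, raw[1:])]   # FD_2..FD_5
sod = [b - a for a, b in zip(fd, fd[1:])]    # SOD_3..SOD_5

from_fd = reconstruct_from_first_diff(raw[0], fd)
from_sod = reconstruct_from_second_diff(raw[0], fd[0], sod)
```

In the setting described above, the network outputs SY'_i would be passed in place of `fd` or `sod` to map the predictions back to the scale of the raw data.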

Data description
The incidence rate (1/100000) of AIDS in China from 2001 to 2020 is shown in Figure 3. The sequence shows an upward trend and is non-stationary.
The first 70% of the data sequence is used for modeling, the fitted model is used to forecast the remaining 30% of the data, and the Mean Absolute Error (MAE) is used as the evaluation criterion. MAE is expressed as formula (8):
$$ MAE = \frac{1}{n} \sum_{t=1}^{n} \left| y_t - \hat{y}_t \right| \quad (8) $$
where y_t is the actual value at time point t, \hat{y}_t is the forecasted value at the same time point, and n is the number of forecast data.
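The 70%/30% split and the MAE criterion can be written as a short sketch. The helper names `split_70_30` and `mae` are assumptions for illustration, and the numbers are placeholders rather than the incidence data.

```python
def split_70_30(series):
    """Split a sequence into the first 70% (modeling) and last 30% (forecast check)."""
    cut = (len(series) * 7) // 10  # integer arithmetic avoids float rounding
    return series[:cut], series[cut:]


def mae(actual, forecast):
    """Mean absolute error between actual and forecast values of equal length."""
    if len(actual) != len(forecast):
        raise ValueError("actual and forecast must have the same length")
    return sum(abs(a - f) for a, f in zip(actual, forecast)) / len(actual)


# Placeholder sequence with 20 points, matching the length of a 2001-2020 series.
series = list(range(20))
train, test = split_70_30(series)

example_error = mae([1, 2, 3], [1, 3, 5])
```

With 20 yearly observations, this split leaves 14 points for modeling and 6 points for evaluating the forecast.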

ARIMA model
The ARIMA and BP neural network models of the incidence rate in China are shown in Figure 4. The first and second differences of the first 70% of the incidence rate data are shown in Figure 4(a). The second difference tends to be stationary, whereas in the first difference the data from 2011 to 2012 are atypical [16]. The ACF and PACF of the second difference are shown in Figure 4(b). The incidence rate is modelled with ARIMA(0,2,0). The residual ACF and PACF plots are shown in Figure 4(c). Figure 4(d) shows the forecast of the remaining 30% of the data produced by the model built on the first 70% of the data. The MAE is 0.278. LCL (Lower Confidence Limit) and UCL (Upper Confidence Limit) denote the lower and upper limits of the 95% confidence interval (CI) of the ARIMA forecast. The incidence rate in 2021 forecast by the established ARIMA(0,2,0) model is 5.15.
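To make the ARIMA(0,2,0) forecast and its 95% limits concrete, the following sketch assumes a model without a constant term, so that the second difference is white noise and the h-step forecast is a straight-line extrapolation of the last two observations. The function name, the variance estimator, and the toy data are assumptions; SPSS may include a constant or estimate the variance differently, so this is not a reproduction of the reported figures.

```python
def arima_020_forecast(series, steps, z=1.96):
    """Forecast an ARIMA(0,2,0) model with no constant term.

    Returns a list of (point, lower, upper) tuples for horizons 1..steps.
    """
    # Second differences play the role of the model's innovations.
    second = [series[t] - 2 * series[t - 1] + series[t - 2]
              for t in range(2, len(series))]
    sigma2 = sum(e * e for e in second) / len(second)  # innovation variance estimate

    last = series[-1]
    slope = series[-1] - series[-2]  # last observed first difference
    out, cum = [], 0
    for h in range(1, steps + 1):
        cum += h * h  # psi weights of (1-B)^-2 are 1, 2, 3, ...; sum their squares
        point = last + h * slope
        half = z * (sigma2 * cum) ** 0.5
        out.append((point, point - half, point + half))
    return out


# Toy series (illustration only).
toy = [1.0, 4.0, 9.0, 16.0]
forecasts = arima_020_forecast(toy, steps=2)
```

The interval widens quickly with the horizon because the forecast error variance grows with the sum of squared horizons, which is why long-range forecasts from a twice-differenced model carry wide confidence limits.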

BP neural network
We used the raw data and its second difference as inputs to the BP neural network separately. The raw data and the modeling and prediction results of the ARIMA model and the BP neural network are shown in Figure 4(d). With either the raw data or the second difference as input, the predicted values of the BP neural network all fall within the 95% CI of the ARIMA(0,2,0) model. The incidence rate in 2021 predicted by the BP neural network is 5.53 using the raw data and 5.24 using the second difference. The MAE of the BP neural network is 0.673 using the raw data and 0.226 using the second difference.

Discussion and conclusions
AIDS has become a chronic rather than an acutely fatal disease. It threatens human health and life and brings a heavy economic burden to the country and the individual. This study showed that an artificial neural network combined with the ARIMA model can be used to analyze the AIDS epidemic in China from incidence. We used the raw data and the stationary sequence of AIDS incidence as inputs of the BP neural network separately. The results show that when the sequence is a small-sample, non-stationary sequence, the stationary sequence can be used as the input of the BP neural network in addition to the raw data. If the input of the BP neural network is raw data and the result is unreasonable, the stationary sequence can be used as the input to see whether the output meets the requirements.
At present, the input of the BP neural network is usually raw data. In this study, we used both the raw data and the stationary sequence as the input of the BP neural network. The stationary sequence is obtained by differencing, and the maximum order of difference is 2. If the first difference is stationary, the output corresponding to the raw data is obtained by addition and subtraction from the output of the BP neural network, the first value of the raw data and the first difference (using the matrix of formula (4)). If the first difference is non-stationary and the second difference is stationary, the output corresponding to the raw data is obtained by addition, subtraction, and simple multiplication from the output of the BP neural network, the first value of the raw data, the first value of the first difference, and the second difference (using the matrix of formula (7)).
When the raw data and the stationary sequence of the raw data are used separately as the input of the BP neural network, all the predicted values fall within the 95% CI of the ARIMA model.
This work is supported by Beijing Natural Science Foundation (7202016).