Business process model with “RandomForest” algorithm

Currently, the areas of application of machine learning are manifold: artificial intelligence, financial applications, bioinformatics, intellectual games, speech and text recognition, natural language processing, medical diagnostics, technical diagnostics, text search and categorization. Machine learning is an area of scientific knowledge concerned with algorithms capable of learning. The use of machine learning methods is explained by the fact that for most complex intelligent tasks (for example, speech recognition) it is almost impossible to devise an explicit algorithm for their solution; however, a computer can be taught to learn how to solve such problems. In this article, we propose a model based on the "RandomForest" machine learning algorithm that recognizes bots by their HTTP sessions. The chosen algorithm offers many advantages: non-iterative learning; high quality of the resulting models (comparable to neural networks and ensembles of neural networks); a small number of adjustable parameters; good tolerance of missing data (accuracy is retained); an internal assessment of the model's ability to generalize; and the ability to work with raw data without preprocessing. The algorithm was trained on a dataset of more than 5000 sessions. The prospects of this direction are clear, since robotic traffic accounts for more than 40% of total Internet traffic.


Introduction
The new digital world assumes a gradual transition of business to the online space. The quarantine measures of 2020 accelerated this process. The concept of "transferring activity online" appeared, and business underwent a digital transformation [1][2][3]. This helped companies that succeeded in restructuring quickly to stay afloat and to capture the audience of less prepared competitors. But has only human activity increased in the online space? After reviewing traffic data, Imperva's Threat Research Lab [1] produced a report for 2020 that shows an increase in the network activity of automated means of interacting with content on websites. Human activity accounts for 62.8% of traffic, good bots for 13.1%, and bad bots for 24.1% [4][5][6]. It should be noted that bad bot traffic has grown significantly and now makes up almost a quarter of all Internet traffic [7][8][9]. The damage is done to both businesses and individuals. In this regard, it is imperative to develop processes for filtering unwanted activity on websites [10,11].
A bot is software designed to simulate the behavior of a real user on online social networks [2]. Many methods have been proposed for detecting and removing them. Research is carried out on the characteristics, detection and assessment of the impact of bots.
There are various techniques for detecting web bots in network traffic: limiting the frequency of requests to a host; checking blacklists of IP addresses; parsing the value of the User-Agent HTTP header; device fingerprinting; CAPTCHA implementation; and behavioral analysis of network activity using machine learning [3].
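One of the simplest techniques from this list, parsing the User-Agent header, can be sketched as follows. The signature list here is illustrative only, not a production blacklist:

```python
import re

# Flag sessions whose User-Agent header matches common bot signatures.
# The pattern below is an illustrative assumption, not an exhaustive list.
BOT_SIGNATURES = re.compile(r"(bot|crawler|spider|curl|wget)", re.IGNORECASE)

def looks_like_bot(user_agent: str) -> bool:
    """Return True if the User-Agent string matches a known bot pattern."""
    return bool(BOT_SIGNATURES.search(user_agent))

print(looks_like_bot("Mozilla/5.0 (compatible; Googlebot/2.1)"))   # True
print(looks_like_bot("Mozilla/5.0 (Windows NT 10.0; Win64; x64)"))  # False
```

Such signature checks are easily evaded by a bot that spoofs its User-Agent, which is precisely why the article turns to behavioral analysis with machine learning.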
Bot traffic analysis tools have become a must for any website.

Materials and Methods
The process of detecting bots by HTTP session. A session is a sequence of requests from one node (a unique combination of the IP address and the User-Agent field in the HTTP request) within a fixed time interval [3,12]. The session will be evaluated offline using the data uploaded to a file, which will be divided into a test set and a training set, see Figure 1. This model uses two key concepts:
- random sampling of examples from the dataset when constructing trees;
- selection of random subsets of features when splitting nodes.
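The offline train/test split described above can be sketched as follows. Synthetic data stands in for the real session dataset, which is not reproduced here; the 11 features and the 25% test share are assumptions based on the feature count and setup mentioned later in the article:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the session dataset (11 features, as in the paper).
X, y = make_classification(n_samples=5000, n_features=11, random_state=42)

# Hold out a test set for offline evaluation, as in Figure 1.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)
print(X_train.shape, X_test.shape)  # (3750, 11) (1250, 11)
```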
"RandomForest" optimization can be done through random search using the Randomized Search CV in Scikit-Learn. Optimization involves finding the best hyper parameters for the model on the current dataset ( Figure 2) [14].  There is a significant difference from other models ; in this case, feature weighting is not used ; therefore any monotonous transformations will not affect the result (up to a numerical error) [4].
The algorithm for constructing a RandomForest consisting of N trees is as follows. For each n = 1, ..., N:
1. Generate a sample Xn with bootstrap (sampling with replacement).
2. Build a decision tree bn from the sample Xn: according to a given criterion, choose the best feature and split the node on it, and so on until the sample is exhausted. The tree is grown until each leaf contains no more than nmin objects or until a certain tree height is reached. At each split, m features are first selected at random from the initial set, and the optimal split of the sample is sought only among them.
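The construction above can be sketched from scratch, delegating the per-split random feature selection to scikit-learn's `DecisionTreeClassifier` via its `max_features` parameter. The function name and defaults are illustrative:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

def build_forest(X, y, n_trees=10, max_features="sqrt", random_state=0):
    """Build N trees, each on a bootstrap sample X_n, trying only a random
    subset of m features at every split (handled by max_features)."""
    rng = np.random.default_rng(random_state)
    trees = []
    n = len(X)
    for _ in range(n_trees):
        idx = rng.integers(0, n, size=n)  # bootstrap: sample n rows with replacement
        tree = DecisionTreeClassifier(
            max_features=max_features,
            random_state=int(rng.integers(1 << 30)),
        )
        tree.fit(X[idx], y[idx])
        trees.append(tree)
    return trees

# Usage on synthetic stand-in data.
X, y = make_classification(n_samples=300, n_features=11, random_state=0)
forest = build_forest(X, y, n_trees=20)
print(len(forest))  # 20
```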

Results and Discussion
The final classifier is a(x) = (1/N) ∑_{i=1}^{N} b_i(x); in simple words, in the classification problem we choose the answer by majority voting, and in the regression problem by averaging [9].
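For binary labels, averaging the trees' votes and thresholding at 1/2 is exactly majority voting. A self-contained sketch on illustrative synthetic data:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data; 15 trees on bootstrap samples.
X, y = make_classification(n_samples=200, n_features=11, random_state=1)
rng = np.random.default_rng(1)
trees = []
for _ in range(15):
    idx = rng.integers(0, len(X), size=len(X))  # bootstrap sample X_n
    trees.append(
        DecisionTreeClassifier(max_features="sqrt", random_state=0)
        .fit(X[idx], y[idx])
    )

# a(x) = (1/N) * sum_i b_i(x): the mean vote, thresholded at 1/2,
# is the majority class for 0/1 labels.
votes = np.mean([t.predict(X) for t in trees], axis=0)
pred = (votes >= 0.5).astype(int)
print(pred[:10])
```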
It is recommended to take m = √n in classification problems and m = n/3 in regression problems, where n is the number of features. It is also recommended, in classification problems, to build each tree until every leaf contains one object, and in regression problems until every leaf contains five objects.
Thus, a RandomForest is bootstrap aggregating (bagging) over decision trees, during the training of which, for each split, features are selected from a random subset of features [5].
Let us create a simulation model for the calculation with the main metrics of the unit economy (Figure 3) [13][14][15]. We use session characteristics as features (Table 1). We take 0 (BOT) and 1 (PERSON) as labels; that is, we build a binary classification model: network activity is created by a web bot (bot label) or by a human (human label) [10].
In this case, we use the quality metrics precision, recall, and F-measure (Table 2, Figure 4).
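These metrics can be computed with scikit-learn. The labels below are illustrative only (0 = bot, 1 = person), not the article's actual results:

```python
from sklearn.metrics import f1_score, precision_score, recall_score

# Illustrative labels: 0 = bot, 1 = person.
y_true = [0, 0, 1, 1, 1, 0, 1, 0]
y_pred = [0, 1, 1, 1, 0, 0, 1, 0]

# Precision: share of sessions predicted "person" that really are persons.
print(precision_score(y_true, y_pred))  # 0.75
# Recall: share of real person sessions the model found.
print(recall_score(y_true, y_pred))     # 0.75
# F-measure: harmonic mean of precision and recall.
print(f1_score(y_true, y_pred))         # 0.75
```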
To test the effectiveness of the model, we consider the results. On the holdout sample, the values of the quality metrics are almost identical to those obtained during training. It follows that a model built on these data can generalize the knowledge gained during training [7,15].
Let us consider type I errors. If these data are labelled by experts, the error matrix changes significantly. This means that some errors were made when labelling the data for the model, but the model was still able to recognize such sessions correctly (Table 3, Figure 5) [15].
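The error matrix discussed above is a confusion matrix; its off-diagonal cells are the misclassified sessions. A sketch on illustrative labels (0 = bot, 1 = person):

```python
from sklearn.metrics import confusion_matrix

# Illustrative labels only, not the article's data.
y_true = [0, 0, 0, 1, 1, 1, 1, 0]
y_pred = [0, 0, 1, 1, 1, 0, 1, 0]

# ravel() flattens the 2x2 matrix into (tn, fp, fn, tp):
# fp = bots classified as persons, fn = persons classified as bots.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(tn, fp, fn, tp)  # 3 1 1 3
```

Comparing this matrix before and after expert re-labelling makes visible exactly which disagreements stem from labelling errors rather than model errors.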

Conclusions
As a result, an approach to the detection of web bots using a machine learning algorithm and statistical features has been described. The "RandomForest" machine learning algorithm was selected as the most appropriate. A prototype of the model for recognizing bots by session was implemented in Python using the scikit-learn machine learning library [6]. A classical measure is used to assess accuracy. Eleven features were used to train the algorithm. The trained algorithm was tested on a random data set, and the F-measure value was 0.98 [8].