Machine Learning as a Method for Bot-Traffic Identification Among Web-Application’s Requests
by Yakovlev, E. A.
Yakovlev EA (2017). Machine Learning as a Method for Bot-Traffic Identification Among Web-Application’s Requests. In Young Scientist USA, Vol. 10 (p. 88). Auburn, WA: Lulu Press.
Abstract. The author compares several machine learning methods (logistic regression, SVM, neural networks, random forest, gradient boosting over decision trees) to find the most effective one for bot-traffic detection.
Keywords: bot-traffic, information security, machine learning, web-application.
As modern web technologies develop, methods for the illegitimate use of web resources and of the content they provide to users appear more and more often. In most cases intruders use bots for these purposes: programs that masquerade as ordinary users of a web resource and perform particular actions. Bots may be used for:
- parsing web pages in order to copy content;
- DDoS attacks to degrade service;
- vulnerability scanning in preparation for an attack, and so on.
There are also "useful" bots, such as search-engine crawlers or availability-monitoring services, but the owners of these bots do not disguise them and identify the bot in the User-Agent field of the request headers sent to the web resource.
Under these conditions, detecting bots among all users is a relevant problem. This work focuses on the applicability of machine learning and data mining methods to bot detection in the overall traffic of web resources.
Detection of bot-traffic
Approaches to detecting bot requests in an application's overall traffic can be divided into groups by the following criteria:
- by the classification methods used (statistical, behavioral, hybrid);
- by the object of analysis (a single request, sessions, groups of sessions);
- by the request attributes used.
The main difficulty in bot detection is that an attacking bot's traffic is deliberately disguised as ordinary users' traffic, and the disguise methods are often quite sophisticated.
The request-generation algorithms of the devices in an attacking botnet can be classified by the following indicators:
- correctness of the request path: only existing paths, random paths, or mixed;
- request content: identical or varying;
- interval between requests: identical or varying.
It should also be noted that the behavior patterns of different bots can differ greatly, and the closer a bot's behavior model is to that of a legitimate user, the harder it is to distinguish "useful" traffic from "parasitic" traffic.
The classical way to limit bots' access to a web resource is to force every user to pass a "not a robot" test: enter text from a picture, recognize distorted images, and so on. This method is effective, but it has side effects: it delays users' access to the service and causes losses for the business when a real user refuses to take the test.
The aim of the work
The aim of this work is to investigate the applicability of machine learning methods to the problem of recognizing bot traffic among all requests to web resources.
The tasks are as follows:
- research and prepare the data for analysis;
- determine metrics for evaluating the quality of bot recognition;
- evaluate the quality of machine learning algorithms, or groups of algorithms, using the chosen metrics.
To obtain a training set of the necessary size, this work proposes using web-server or proxy-server logs collected over some period of time, with subsequent processing. The data source for training is the log of an Nginx 1.10.2 web server in the standard log format, used as a load balancer in a real project.
The log data must be processed before training. In this work the unit of observation is a session: a sequence of requests from one IP address with intervals of no more than 30 minutes between them.
After all requests are grouped into sessions, sessions with fewer than 2 requests are removed from the set as uninformative.
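The session-building step described above can be sketched as follows. This is a minimal illustration in Python; the record format and the helper name `sessionize` are assumptions, since the paper does not show its processing code:

```python
from datetime import datetime, timedelta

# Hypothetical parsed log records: (ip, timestamp) pairs.
# In practice these would come from parsing the Nginx access log.
requests = [
    ("10.0.0.1", datetime(2017, 1, 1, 12, 0, 0)),
    ("10.0.0.1", datetime(2017, 1, 1, 12, 5, 0)),
    ("10.0.0.1", datetime(2017, 1, 1, 13, 0, 0)),  # >30 min gap: new session
    ("10.0.0.2", datetime(2017, 1, 1, 12, 1, 0)),  # single request: dropped
]

SESSION_GAP = timedelta(minutes=30)

def sessionize(records):
    """Group requests into sessions per IP; a gap of more than 30 minutes
    starts a new session; sessions with fewer than 2 requests are dropped."""
    by_ip = {}
    for ip, ts in records:
        by_ip.setdefault(ip, []).append(ts)
    sessions = []
    for ip, times in by_ip.items():
        times.sort()
        current = [times[0]]
        for ts in times[1:]:
            if ts - current[-1] > SESSION_GAP:
                sessions.append((ip, current))
                current = []
            current.append(ts)
        sessions.append((ip, current))
    # drop uninformative sessions (fewer than 2 requests)
    return [(ip, s) for ip, s in sessions if len(s) >= 2]

print(sessionize(requests))  # only the first two 10.0.0.1 requests survive
```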
Then the following numerical characteristics are computed for every session (they serve as features during training):
- number of requests / session duration in seconds;
- number of page requests / session duration in seconds;
- response size (maximum, minimum, and average);
- time between requests (maximum, minimum, and average);
- time between page requests (maximum, minimum, and average).
The total number of features used at this stage is 11.
The data are then scaled to the interval [0, 1] to avoid misleading machine learning algorithms that use distance measures between objects as the classification criterion.
The total number of objects in the set is 11,435, of which 229 are labeled as bots.
The choice of a quality metric
The F-score is chosen as the quality metric in this work.
To calculate the F-score, the notions of "precision" and "recall" are introduced:
precision = TP / (TP + FP) (1)
recall = TP / (TP + FN) (2)
where TP is a true-positive decision, TN a true-negative decision, FP a false-positive decision, and FN a false-negative decision.
The F-score is the harmonic mean of precision and recall:
F = 2 · precision · recall / (precision + recall) (3)
which makes it a convenient classification-quality metric for this work.
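These metrics are available directly in scikit-learn. The labels below are toy values (1 = bot, 0 = human), not the paper's data:

```python
from sklearn.metrics import precision_score, recall_score, f1_score

# Toy ground truth and predictions: 1 = bot, 0 = human.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0, 0, 0]

# TP = 2, FN = 1, FP = 1  ->  precision = 2/3, recall = 2/3, F = 2/3
print(precision_score(y_true, y_pred))
print(recall_score(y_true, y_pred))
print(f1_score(y_true, y_pred))
```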
The comparison of machine learning methods
Based on an analysis of publications on this topic, several promising machine learning methods were selected:
- logistic regression;
- SVM;
- neural networks;
- random forest;
- gradient boosting over decision trees (AdaBoost, xgboost, etc.).
All of the above algorithms are implemented in the scikit-learn library [2] for the Python programming language.
To train and evaluate the classifiers at the same time, cross-validation [3] was used, which makes it possible to estimate a classifier's quality with good accuracy without preparing separate training and test sets.
Then, for every classifier, a search for the optimal combination of hyperparameters was run (a so-called "grid search").
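A sketch of cross-validated grid search with scikit-learn's GridSearchCV is shown below. The data are synthetic, GradientBoostingClassifier is used as a stand-in for xgboost, and the parameter grid is illustrative (the paper does not list its actual grid):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic imbalanced data standing in for the session features
# (11 features, ~5% positive class, like the bot/human imbalance).
X, y = make_classification(n_samples=400, n_features=11,
                           weights=[0.95], random_state=0)

# Grid search over a small hyperparameter grid, scored with the
# F-score under 5-fold cross-validation.
grid = GridSearchCV(
    GradientBoostingClassifier(random_state=0),
    param_grid={"n_estimators": [50, 100], "max_depth": [2, 3]},
    scoring="f1",
    cv=5,
)
grid.fit(X, y)
print(grid.best_params_, round(grid.best_score_, 3))
```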
The quality of the classifiers is compared in the following table:
Table. Comparison of the classification algorithms' quality (the per-algorithm scores did not survive in this copy; the compared models included a multilayer perceptron with 2 layers)
The ensemble algorithm of gradient boosting over decision trees (in its xgboost implementation) showed the highest accuracy, so further work is based on it.
Increasing classification precision
To increase the precision of bot classification, the following techniques were used:
- sampling of the training set [4], applied to reduce class imbalance (objects labeled "bot" are far fewer in the set than objects of the "human" class);
- cleaning the data of outliers and extreme values;
- the RFE (Recursive Feature Elimination) algorithm, for selecting the features most important for classification.
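The imbalance-reduction and feature-selection steps might look roughly as follows. The paper cites SMOTE, which lives in the separate imbalanced-learn package; here plain random oversampling of the minority class is shown as a simpler stand-in, followed by scikit-learn's RFE. The data and all parameter values are illustrative:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

# Synthetic imbalanced data: ~90% "human" (0), ~10% "bot" (1).
X, y = make_classification(n_samples=300, n_features=11, n_informative=4,
                           weights=[0.9], random_state=0)

# Step 1: reduce class imbalance by randomly duplicating minority-class
# objects until the classes are balanced (a simpler stand-in for SMOTE,
# which generates synthetic minority samples instead).
rng = np.random.default_rng(0)
minority = np.flatnonzero(y == 1)
extra = rng.choice(minority, size=np.sum(y == 0) - minority.size, replace=True)
X_bal = np.vstack([X, X[extra]])
y_bal = np.concatenate([y, y[extra]])

# Step 2: Recursive Feature Elimination keeps the most informative
# features by repeatedly dropping the weakest one.
rfe = RFE(LogisticRegression(max_iter=1000), n_features_to_select=6)
rfe.fit(X_bal, y_bal)
print(rfe.support_)  # boolean mask of the retained features
```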
The F-score obtained as a result of the above steps is 0.981.
The aim of this work was to investigate the applicability of machine learning methods to recognizing bot traffic among all requests to web resources. In the course of the work, a data set was prepared for training, a classification-quality metric was chosen, and a comparative analysis of machine learning methods was performed on the prepared data using that metric. Based on the comparison, the best method (by the chosen metric) was selected, after which several steps were taken to improve the classifiers' quality.
An F-score of 0.981 was obtained on cross-validation, which is a positive result and allows the aim of this work to be considered achieved.
1. Bolshev, A. K. Algorithms for converting and classifying traffic for intrusion detection in computer networks: 05.13.11. – St. Petersburg, 2011. – 142 p.
2. Pedregosa, F., et al. Scikit-learn: Machine Learning in Python // Journal of Machine Learning Research. – 2011. – No. 12. – pp. 2825-2830.
3. Arlot, S., et al. A survey of cross-validation procedures for model selection // Statistics Surveys. – 2010. – No. 4. – pp. 40-79.
4. Chawla, N. V., et al. SMOTE: Synthetic Minority Over-sampling Technique // Journal of Artificial Intelligence Research. – 2002. – No. 16. – pp. 321-357.