Huancheng Song, Fanzeng (Alex) Xia, Chunting Xu
1. Introduction
With the growth of computer networks and the increasing variety of programming techniques, conventional malicious file detection methods have become inadequate. Traditional embedded security mechanisms such as distributed IDSs and firewalls are no longer enough to secure the next-generation Internet, because of open-ended concerns over network access control and software verification. Recent research has confirmed the promise of machine learning for many kinds of anomaly detection. Behavior-based malicious file detection identifies malicious files by the behavioral features peculiar to them. To protect computer systems from malicious files that have no known signature, this paper focuses on static and dynamic machine learning methods and describes their pros and cons.
2. Proposed method
File analysis methods can be divided into two categories, static analysis and dynamic analysis, and both can be used in conjunction with machine learning. Our method aims to extract prominent information from the examined file and uses both kinds of analysis. After obtaining signature-unknown files from the Internet, we first recursively extract all embedded files (which may themselves be malicious) so that they can be analyzed as well. We then check the compatibility of these files and send the suitable ones to our detection model, which is based on an SVM with active learning and uses both static and dynamic behavior features. Finally, after several rounds of retraining, the model labels each file as malicious or benign.
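The recursive extraction step can be sketched as follows. This is a minimal illustration, not our actual extractor: it assumes that embedded files are members of ZIP archives, and the names `extract_recursively`, `make_zip`, and `max_depth` are hypothetical. Other container formats would each need their own extractor.

```python
import io
import zipfile


def extract_recursively(name, data, max_depth=5):
    """Return a flat list of (path, bytes) for a file and everything nested in it.

    Assumption: embedded files are ZIP members. The depth limit keeps a
    deliberately deep archive from exhausting the analyzer.
    """
    found = [(name, data)]
    if max_depth == 0 or not zipfile.is_zipfile(io.BytesIO(data)):
        return found
    with zipfile.ZipFile(io.BytesIO(data)) as archive:
        for info in archive.infolist():
            if info.is_dir():
                continue
            child = archive.read(info)
            found.extend(
                extract_recursively(f"{name}/{info.filename}", child, max_depth - 1)
            )
    return found


def make_zip(members):
    """Build an in-memory ZIP from a {filename: bytes} mapping (demo helper)."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        for filename, content in members.items():
            archive.writestr(filename, content)
    return buffer.getvalue()


# Demo: an archive that contains another archive.
inner = make_zip({"payload.bin": b"\x90\x90\xcc"})
outer = make_zip({"readme.txt": b"hello", "inner.zip": inner})
extracted = extract_recursively("sample.zip", outer)
paths = [path for path, _ in extracted]
```

Every entry in `extracted`, including the containers themselves, would then go through the compatibility check before being sent to the detection model.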
2.1 Static Analysis
Static analysis methods extract data from the examined file and analyze it without actually executing the file. Before binary code can be analyzed statically, it must be disassembled, that is, translated to the assembly level. By looking at the file's content and structure, we can extract discriminative behavior features and build general patterns of benign files. Malicious files can then be found by anomaly detection, as files that deviate from these patterns.
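As a rough illustration of extracting data without execution, the sketch below computes a few byte-level features. The choice of features (file size, byte entropy, printable strings) is an assumption made for the example; disassembly-based features would need an external disassembler and are not shown. All function names are hypothetical.

```python
import math
import re
from collections import Counter


def byte_entropy(data):
    """Shannon entropy of the byte distribution, in bits per byte (0.0 to 8.0)."""
    if not data:
        return 0.0
    counts = Counter(data)
    total = len(data)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def printable_strings(data, min_len=4):
    """Runs of at least min_len printable ASCII bytes, like the Unix strings tool."""
    pattern = rb"[\x20-\x7e]{%d,}" % min_len
    return [match.decode("ascii") for match in re.findall(pattern, data)]


def static_features(data):
    """Summarize a file's raw bytes; the file is only read, never executed."""
    return {
        "size": len(data),
        "entropy": byte_entropy(data),
        "strings": printable_strings(data),
    }


# Hypothetical file content used only for the demo.
sample = b"\x00\x01MZ\x90\x00http://example.test/a\x00\xff\xfeLoadLibraryA\x00"
features = static_features(sample)
```

Feature vectors of this kind, collected from known benign files, could serve as the baseline against which anomalies are measured.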
The advantages of static analysis are that it can scrutinize the file's "genes" and that it is usually simple and efficient. Static analysis approaches are easy to implement, monitor, and measure. Compared to dynamic approaches, static analysis is relatively fast, which suits inspection in real-time systems. It is also safe for the user's machine, since the examining machine cannot become infected without executing the files. However, static analysis is vulnerable to obfuscation techniques that can evade it. It also ignores changes made to the code during execution. Because the actual runtime behavior of a file cannot always be predicted from its static content, our proposed file analysis method uses both static and dynamic analysis.
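A toy example shows why obfuscation defeats a purely static check. Here single-byte XOR stands in for real obfuscation, and the signature and file content are hypothetical; the point is only that the same bytes are invisible to a static pattern match until they are decoded at runtime.

```python
def xor_encode(data, key):
    """Single-byte XOR, a deliberately simple stand-in for real obfuscation."""
    return bytes(b ^ key for b in data)


def matches_signature(data, signature):
    """Naive static check: does the raw byte pattern occur in the file?"""
    return signature in data


SIGNATURE = b"DownloadAndRun"          # hypothetical byte signature
plain = b"\x00\x00DownloadAndRun\x00"  # hypothetical file content
obfuscated = xor_encode(plain, 0x5A)

found_in_plain = matches_signature(plain, SIGNATURE)
found_in_obfuscated = matches_signature(obfuscated, SIGNATURE)

# Applying the same XOR again restores the original bytes, which is what
# an obfuscated program would do to itself during execution.
restored = xor_encode(obfuscated, 0x5A)
```

The static check matches `plain` but not `obfuscated`, even though `restored` is byte-for-byte identical to `plain`.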
2.2 Dynamic Analysis
Dynamic analysis is also known as "behavioral analysis". It examines the actions and behavior of the suspected file during runtime. The analysis usually takes place in an isolated environment (sandbox / VM) in order to protect the host machine. The behavior of the executed code can be observed at different levels of abstraction, from the lowest level (the binary code itself) to the highest level (the observable effects it has on the system as a whole). For example, changes made to the file system, Registry keys, or the OS configuration can only be detected during runtime.
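Observing high-level effects on the file system can be sketched as a before-and-after comparison of a directory tree. This is a minimal sketch under stated assumptions: the "sample" is replaced by harmless writes made by the script itself inside a temporary directory, and the helper names `snapshot` and `diff` are hypothetical. A real setup would take the snapshots inside the isolated environment.

```python
import hashlib
import os
import tempfile


def snapshot(root):
    """Map each file's path (relative to root) to the SHA-256 of its content."""
    state = {}
    for folder, _dirs, files in os.walk(root):
        for filename in files:
            path = os.path.join(folder, filename)
            with open(path, "rb") as handle:
                digest = hashlib.sha256(handle.read()).hexdigest()
            state[os.path.relpath(path, root)] = digest
    return state


def diff(before, after):
    """Classify paths as created, deleted, or modified between two snapshots."""
    shared = before.keys() & after.keys()
    return {
        "created": sorted(set(after) - set(before)),
        "deleted": sorted(set(before) - set(after)),
        "modified": sorted(p for p in shared if before[p] != after[p]),
    }


with tempfile.TemporaryDirectory() as root:
    with open(os.path.join(root, "config.ini"), "w") as handle:
        handle.write("safe=1")
    with open(os.path.join(root, "notes.txt"), "w") as handle:
        handle.write("keep me")
    before = snapshot(root)

    # Stand-in for running a sample: harmless changes made by this script.
    with open(os.path.join(root, "config.ini"), "w") as handle:
        handle.write("safe=0")
    with open(os.path.join(root, "dropped.tmp"), "w") as handle:
        handle.write("new file")
    os.remove(os.path.join(root, "notes.txt"))

    changes = diff(before, snapshot(root))
```

The resulting `changes` dictionary is one possible form of high-level behavior feature; Registry or configuration changes could be captured with the same snapshot-and-compare idea.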
To start a dynamic analysis, a clean system is started first, and then a sample (script/code) is loaded into it. The analysis tool(s) are launched and the sample is executed. Afterwards, the report produced can be examined. Finally, the system is reverted to a clean state, and the process is repeated for the next sample.
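The cycle described above (clean start, load, execute, report, revert, repeat) can be sketched with a simulated sandbox. Everything here is hypothetical and nothing is actually executed in isolation: the "system" is a plain dictionary, the samples are ordinary Python functions, and `SimulatedSandbox` and `analyze` are illustrative names rather than part of any real tool.

```python
import copy


class SimulatedSandbox:
    """Toy stand-in for a VM: the 'system' is just a dictionary of settings."""

    def __init__(self, clean_state):
        self._clean_state = copy.deepcopy(clean_state)
        self.state = copy.deepcopy(clean_state)
        self.events = []

    def set_value(self, key, value):
        """Monitored operation: log (key, old value, new value), then apply it."""
        self.events.append((key, self.state.get(key), value))
        self.state[key] = value

    def revert(self):
        """Restore the clean snapshot and clear the event log."""
        self.state = copy.deepcopy(self._clean_state)
        self.events = []


def analyze(sandbox, sample):
    """Run one sample, return its report, and always revert afterwards."""
    try:
        sample(sandbox)              # execute the loaded sample
        return list(sandbox.events)  # report produced for examination
    finally:
        sandbox.revert()             # back to a clean state, even on a crash


def sample_a(box):
    box.set_value("autorun", "sample_a")


def sample_b(box):
    box.set_value("firewall", "off")


clean = {"firewall": "on", "autorun": None}
sandbox = SimulatedSandbox(clean)

# Repeat the cycle for each sample; each run starts from the clean state.
reports = [analyze(sandbox, sample) for sample in (sample_a, sample_b)]
```

Placing the revert in a `finally` clause is a design choice that keeps one misbehaving sample from contaminating the analysis of the next.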
The advantage of dynamic analysis is that it examines the actual behavior of the file, which malicious files cannot hide through code obfuscation, encryption, and similar techniques. The disadvantages are that dynamic analysis is much slower than static analysis and harder to implement; computational complexity, resource demands, and time consumption must all be considered. It is also difficult to simulate the conditions under which the malicious functions of the program are activated (for example, the presence of the vulnerability that the malware exploits). Moreover, the examined file may detect that it is being analyzed and change its behavior.
3. References
Bayer, U., Moser, A., Kruegel, C., & Kirda, E. (2006). Dynamic analysis of malicious code. Journal in Computer Virology, 2(1), 67-77.
Nissim, N., Cohen, A., Glezer, C., & Elovici, Y. (2015). Detection of malicious PDF files and directions for enhancements: a state-of-the art survey. Computers & Security, 48, 246-266.