Abstract
As more and more malware incidents are reported worldwide, malware detection has drawn growing attention as a way to prevent malicious attacks in advance. Malware varies widely across the software platforms people use: for example, XcodeGhost on iOS apps, FakePlayer on Android apps, and WannaCrypt on PCs. Moreover, people often overlook the potential security threats around them while surfing the internet, processing files, or even reading email. The Portable Document Format (PDF), one of the most commonly used file types in the world, can store text, images, multimedia content, and even scripts. However, despite the growing popularity of and demand for PDF files, few people know how easily malware can be concealed in normal-looking PDF files. In this paper, we propose a novel technique that combines malware visualization and image classification to identify which PDF files might be malicious. By extracting data from a PDF file and traversing each object within it, we obtain the file's overall tree-like structure. Then, according to the signatures of the objects in the file, we assign colors derived from SimHash to generate RGB images. Finally, our proposed model, trained with the VGG19 CNN architecture, achieves up to 0.973 accuracy and a 0.975 F1-score in distinguishing malicious PDF files; the approach is easy to implement and viable for personal or enterprise-wide use.
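To make the coloring step concrete, the following is a minimal sketch of how a SimHash fingerprint could be turned into an RGB color. It is an illustration under stated assumptions, not the implementation used in the paper: the abstract does not specify what an object's "signature" consists of, so the sketch assumes it is a list of PDF dictionary key names, and the function names `simhash` and `signature_to_rgb`, the 24-bit width, and the use of MD5 as the per-token hash are all hypothetical choices.

```python
import hashlib
from typing import Iterable, Tuple


def simhash(tokens: Iterable[str], bits: int = 24) -> int:
    """Compute a `bits`-wide SimHash fingerprint over string tokens.

    Assumption: MD5 is used only as a stable, well-mixed per-token hash;
    the paper does not say which hash function it uses.
    """
    weights = [0] * bits
    for token in tokens:
        digest = int.from_bytes(hashlib.md5(token.encode("utf-8")).digest(), "big")
        for i in range(bits):
            # Each token votes +1 for a set bit and -1 for a clear bit.
            weights[i] += 1 if (digest >> i) & 1 else -1

    fingerprint = 0
    for i, weight in enumerate(weights):
        if weight > 0:  # the majority of tokens had this bit set
            fingerprint |= 1 << i
    return fingerprint


def signature_to_rgb(tokens: Iterable[str]) -> Tuple[int, int, int]:
    """Split a 24-bit SimHash into three 8-bit channels (R, G, B).

    Assumption: one color per object, derived from that object's keys.
    """
    fp = simhash(tokens, bits=24)
    return (fp >> 16) & 0xFF, (fp >> 8) & 0xFF, fp & 0xFF


if __name__ == "__main__":
    # Hypothetical signatures for two PDF objects.
    page_object = ["/Type", "/Page", "/Contents", "/Parent"]
    script_object = ["/S", "/JavaScript", "/JS"]
    print(signature_to_rgb(page_object))
    print(signature_to_rgb(script_object))
```

Because SimHash is a locality-sensitive hash, objects whose signatures share most of their tokens tend to receive fingerprints that differ in only a few bits. This is one plausible reason for preferring it over an ordinary hash when the resulting image is meant to reflect structural similarity.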
Copyright information
© 2021 IFIP International Federation for Information Processing
About this paper
Cite this paper
Liu, CY., Chiu, MY., Huang, QX., Sun, HM. (2021). PDF Malware Detection Using Visualization and Machine Learning. In: Barker, K., Ghazinour, K. (eds) Data and Applications Security and Privacy XXXV. DBSec 2021. Lecture Notes in Computer Science, vol 12840. Springer, Cham. https://doi.org/10.1007/978-3-030-81242-3_12
DOI: https://doi.org/10.1007/978-3-030-81242-3_12
Published:
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-81241-6
Online ISBN: 978-3-030-81242-3
eBook Packages: Computer Science, Computer Science (R0)