Document Type
Journal Article
Publication Title
IEEE Access
Volume
12
First Page
13833
Last Page
13859
Publisher
IEEE
School
ECU Security Research Institute / School of Science
Funders
Edith Cowan University
Abstract
The Portable Document Format (PDF) is one of the most widely used file types, and attackers therefore embed malicious code in PDF documents to compromise victims' systems. Conventional detection solutions and identification techniques are often insufficient and may only partially prevent PDF malware because of its versatile nature and their excessive dependence on a fixed, typical feature set. The primary goal of this work is to detect PDF malware efficiently in order to alleviate these difficulties. To accomplish this goal, we first develop a comprehensive dataset of 15,958 PDF samples that covers benign, malicious, and evasive behaviors. Using three well-known PDF analysis tools (PDFiD, PDFINFO, and PDF-PARSER), we extract significant features from the PDF samples of our newly created dataset. In addition, we generate a number of derived features that have been experimentally shown to be helpful in classifying PDF malware. We develop a method to build an efficient and explainable feature set through a proper empirical analysis of the extracted and derived features. We explore different baseline machine learning classifiers and demonstrate an accuracy improvement of approximately 2% for the Random Forest classifier when it uses the selected feature set. Furthermore, we demonstrate the model's explainability by building a decision tree that generates rules suitable for human interpretation. Finally, we compare our work with previous studies and point out some important findings.
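The sketch below is only an illustration of the kind of pipeline the abstract describes (structural keyword counts as features, a Random Forest baseline, and a shallow decision tree for human-readable rules), assuming scikit-learn; the keyword list, file paths, and labels are hypothetical placeholders and do not reflect the authors' actual feature set or dataset.

```python
# Illustrative sketch only: keyword counts (in the spirit of what PDFiD reports)
# feed baseline classifiers; a shallow decision tree yields readable rules.
import re
from pathlib import Path

from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier, export_text

# Hypothetical subset of structural keywords similar to those PDFiD counts.
KEYWORDS = [b"/JS", b"/JavaScript", b"/OpenAction", b"/AA",
            b"/Launch", b"/EmbeddedFile", b"/ObjStm", b"/Encrypt"]

def extract_features(pdf_path):
    """Count occurrences of each keyword in the raw PDF bytes."""
    data = Path(pdf_path).read_bytes()
    return [len(re.findall(re.escape(k), data)) for k in KEYWORDS]

# Placeholder samples and labels (1 = malicious, 0 = benign), not real data.
X = [extract_features(p) for p in ["benign.pdf", "malicious.pdf"]]
y = [0, 1]

# Baseline Random Forest classifier on the extracted features.
rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# A shallow decision tree whose rules a human analyst can inspect.
dt = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)
print(export_text(dt, feature_names=[k.decode() for k in KEYWORDS]))
```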
DOI
10.1109/ACCESS.2024.3357620
Creative Commons License
This work is licensed under a Creative Commons Attribution-Noncommercial-No Derivative Works 4.0 License.
Comments
Hossain, G. M. S., Deb, K., Janicke, H., & Sarker, I. H. (2024). PDF malware detection: Toward machine learning modeling with explainability analysis. IEEE Access, 12, 13833-13859. https://doi.org/10.1109/ACCESS.2024.3357620