Author Identifiers
Matthew Gaber
https://orcid.org/0000-0003-1684-1392
Mohiuddin Ahmed
https://orcid.org/0000-0002-4559-4768
Helge Janicke
Publication Date
2024
Document Type
Dataset
Publisher
Edith Cowan University
Faculty
School of Science
Description
Cyber-attacks continue to evolve, increasing in frequency and sophistication, and Artificial Intelligence (AI) is becoming essential for detecting modern malware. However, the accuracy of AI in malware detection depends on the quality of the features it is trained on. Static and dynamic analysis of malware is limited by the widespread use of obfuscation and anti-analysis techniques: if malware detects an analysis environment, it hides its malicious behavior. Dynamic Binary Instrumentation (DBI), by contrast, allows deep and precise control of the malware sample, thereby facilitating the extraction of authentic features from sophisticated and evasive malware. We developed Peekaboo, a DBI tool that defeats anti-analysis techniques and extracts authentic behavior from live malware samples. We collected 18,527 malware samples across ransomware, spyware, trojans, botnets, worms, Advanced Persistent Threats (APT) and post-exploitation tools; every sample includes type, family, and variant information, for example Ransomware-WannaCry-SHA256. We also collected 1,973 benign software samples for analysis.
This dataset contains the results for each sample. Each sample was run for up to 15 minutes to observe not only the anti-analysis techniques it uses but also its complete behavior. For each malware sample, the network traffic, every opcode executed and every evasive technique used are captured.
Additional Information
There are three main folders in the linked repository.
- The Peekaboo Data folder contains zip files of the timestamped raw json files extracted by Peekaboo for each sample, organised by malware family. There is also a csv file for each family, generated with analysis.py.
- The Peekaboo Network Traffic folder contains zip files of the .pcap files extracted by Peekaboo for every sample, organised by family.
- The Python Scripts folder contains the Python scripts detailed below.
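As a rough illustration of working with this layout, the sketch below unpacks every zip archive found under a local copy of a data folder into one subfolder per archive. The function name `extract_archives` and the directory arguments are assumptions made for illustration; this is not the authors' extract-results.py, and the actual archive naming in the repository may differ.

```python
import zipfile
from pathlib import Path


def extract_archives(data_dir, out_dir):
    """Unpack every .zip under data_dir into out_dir/<archive name>/.

    Hypothetical helper: the folder layout is an assumption, not taken
    from the repository. Returns the list of folders that were written.
    """
    data_dir, out_dir = Path(data_dir), Path(out_dir)
    written = []
    for archive in sorted(data_dir.rglob("*.zip")):
        # One output folder per archive, named after the zip file.
        target = out_dir / archive.stem
        target.mkdir(parents=True, exist_ok=True)
        with zipfile.ZipFile(archive) as zf:
            zf.extractall(target)
        written.append(target)
    return written
```

Calling `extract_archives("Peekaboo Data", "extracted")` on a local download would, under these assumptions, leave the raw json files in `extracted/<archive name>/`.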
DOI
10.25958/85p1-4w32
Research Activity Title
Defeating Evasive Malware with Peekaboo: Extracting Authentic Malware Behavior with Dynamic Binary Instrumentation
Research Activity Description
The accuracy and effectiveness of AI for malware detection depend on the quality and quantity of the features it is trained on. That is, training accurate AI models requires an analysis tool that forces malware to expose its malicious intent and then extracts genuine features, along with large and diverse repositories of malware and benign software. This research had two primary objectives: investigating the evasive techniques used by modern malware, and creating the Peekaboo DBI tool to extract authentic behavior from live malware samples.
Methodology
Dynamic Binary Instrumentation
Start of data collection time period
2024
End of data collection time period
2024
File Format(s)
json, csv, python, pcap
Viewing Instructions
Notes for the Python scripts:
- analysis.py extracts the evasive techniques recorded in the individual data json files, aggregates them by family, and writes the results to a csv file.
- check-json-pcap.py deletes network traffic pcap files that have no corresponding data json file, reconciling the two folders for samples that did not execute.
- count-asm.py counts and categorizes the individual Assembly (ASM) instructions in the data json files.
- count-techniques.py counts the various evasive techniques. For each family csv file generated by analysis.py, it counts how many times each technique was used by the individual samples, that is, the rows in the analysis.csv file.
- extract-results.py is used to extract the data json files from the individual archives in the data folder.
- get-opcodes-threads.py extracts the Opcodes and Threads columns for each analysis csv file in a folder.
- splitjson.py is used to split huge json files into more manageable sizes.
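The reconciliation step described for check-json-pcap.py can be sketched roughly as follows. This is a minimal illustration, not the authors' script: it assumes that a pcap file and its data json file share the same base file name, and the function names and the `dry_run` flag are hypothetical.

```python
from pathlib import Path


def find_orphan_pcaps(json_dir, pcap_dir):
    """Return pcap files whose base name has no matching .json file.

    Assumption: a sample's pcap and json files share the same file stem.
    """
    json_stems = {p.stem for p in Path(json_dir).rglob("*.json")}
    return sorted(
        p for p in Path(pcap_dir).rglob("*.pcap") if p.stem not in json_stems
    )


def remove_orphan_pcaps(json_dir, pcap_dir, dry_run=True):
    """Delete orphaned pcaps; with dry_run=True only report them."""
    orphans = find_orphan_pcaps(json_dir, pcap_dir)
    if not dry_run:
        for pcap in orphans:
            pcap.unlink()
    return orphans
```

Defaulting to `dry_run=True` is a deliberate choice in this sketch so that the list of files can be reviewed before anything is deleted.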
Creative Commons License
This work is licensed under a Creative Commons Attribution-Noncommercial 4.0 License
Contact
Matthew Gaber
Citation
Gaber, M., Ahmed, M., & Janicke, H. (2024). Matthew Gaber: Peekaboo. Edith Cowan University. https://doi.org/10.25958/85p1-4w32