SOFIA: Finding and Profiling Malware Source Code in Public Archives at Scale
Project Overview
Our work is motivated by the following insight: software archives, like GitHub, host a surprisingly large number of publicly accessible malware repositories, which compile into stand-alone malware binaries. We argue that this constitutes a huge missed opportunity: GitHub has more than 32 million public repositories and there are many similar software platforms. Security research could benefit greatly from an extensive database of malware source code, which is currently unavailable. Researchers can use malware source code to develop better mitigation techniques and evaluate their tools systematically.
We propose to develop methods and tools for identifying and profiling malware source code in public software repositories. Note that we use the term malware repositories to refer to repositories that provide the source code for compiling a working malware binary (think of the hacker's software project). We want to develop methods to: (a) systematically mine these repositories, and (b) study and profile the malware and the related ecosystem. The tangible outcomes will be: (a) tools to extract and profile malware source code effectively and at scale, and (b) the largest annotated malware source code database. Our initial efforts show great promise. We have identified more than 7K GitHub malware source code repositories and many highly collaborative communities with hundreds of malware authors.
Intellectual Merit
This project will develop novel techniques that revolve around identifying and profiling malware repositories. The first novelty is that we describe repositories with a comprehensive set of features along three dimensions: (a) metadata, such as title and description, (b) the source code and its structure, and (c) the social context, which captures the interactions among authors and repositories. The second key novelty is algorithmic: we propose to develop new approaches and also to evaluate and adapt (a) state-of-the-art data-mining techniques, such as word embedding, (b) code-specific profiling techniques, such as code2vec, and (c) techniques for describing the interactions of authors, such as tensor decomposition. We elaborate on our algorithmic novelty per task.
Task 1: Identifying malware source code repositories. We propose to develop a systematic approach to identify malware repositories using features across the three dimensions. The scientific challenges lie in: (a) identifying the right feature representations, and (b) constructing an appropriate embedding space, one in which similar repositories are “close” to each other.
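To make the idea of "closeness" in an embedding space concrete, the following is a minimal sketch and not the project's actual method. It assumes a plain bag-of-words vector over repository metadata as a stand-in for a learned word embedding, and it labels a new repository by its nearest labeled neighbor under cosine similarity. The function names (embed, cosine, nearest_label) and the example descriptions are hypothetical.

```python
import math
import re
from collections import Counter


def embed(text):
    """Map repository text to a sparse term-count vector.

    Assumption: a bag-of-words vector stands in for a learned embedding.
    """
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(u, v):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(u[t] * v[t] for t in u.keys() & v.keys())
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def nearest_label(query_text, labeled_repos):
    """Return the label of the labeled repository closest to the query."""
    query = embed(query_text)
    best = max(labeled_repos, key=lambda item: cosine(query, embed(item[0])))
    return best[1]


# Hypothetical repository descriptions, for illustration only.
labeled = [
    ("keylogger for windows that records keystrokes and emails logs", "malware"),
    ("ransomware proof of concept encrypts files", "malware"),
    ("todo list web app built with react", "benign"),
]
label = nearest_label("simple windows keylogger records keystrokes", labeled)
```

A realistic pipeline would replace embed with representations learned from all three feature dimensions (metadata, source code, and social context) and would use a trained classifier rather than a single nearest neighbor.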
Task 2: Profiling malware repositories and their ecosystem. We propose to model the ecosystem dynamics by evaluating and adapting tensor-based and graph-mining techniques. We propose to introduce a hierarchical soft clustering approach that extends the current "one-level" tensor decomposition into a hierarchy using a recursive approach.
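One way the recursive idea could look is sketched below; it illustrates the concept under stated assumptions and is not the algorithm the project will deliver. It assumes a 3-way tensor whose modes are authors, repositories, and time windows (a hypothetical choice), uses a nonnegative CP decomposition fitted with multiplicative updates as the "one-level" step, treats entries with a large factor value as soft members of a component, and then decomposes each component's sub-tensor again. All names, the rank, and the membership threshold are assumptions.

```python
import numpy as np


def nonneg_cp(X, rank, iters=200, seed=0, eps=1e-9):
    """Nonnegative CP decomposition of a 3-way tensor (multiplicative updates)."""
    rng = np.random.default_rng(seed)
    A, B, C = (rng.random((n, rank)) + 0.1 for n in X.shape)
    for _ in range(iters):
        A *= np.einsum("ijk,jr,kr->ir", X, B, C) / (A @ ((B.T @ B) * (C.T @ C)) + eps)
        B *= np.einsum("ijk,ir,kr->jr", X, A, C) / (B @ ((A.T @ A) * (C.T @ C)) + eps)
        C *= np.einsum("ijk,ir,jr->kr", X, A, B) / (C @ ((A.T @ A) * (B.T @ B)) + eps)
    return A, B, C


def soft_members(column, threshold=0.3):
    """Indices whose factor value is at least `threshold` times the column peak.

    Membership is soft: an index may qualify for several components.
    """
    peak = column.max()
    if peak <= 0:
        return np.array([], dtype=int)
    return np.flatnonzero(column >= threshold * peak)


def hierarchical_clusters(X, ids, rank=2, depth=2, min_size=3):
    """Recursively decompose X; `ids` maps positions back to original indices."""
    if depth == 0 or min(X.shape) < min_size:
        return []
    factors = nonneg_cp(X, rank)
    nodes = []
    for r in range(rank):
        members = [soft_members(F[:, r]) for F in factors]
        if any(len(m) == 0 for m in members):
            continue
        sub_ids = [idx[m] for idx, m in zip(ids, members)]
        sub = X[np.ix_(*members)]  # sub-tensor restricted to this component
        nodes.append({
            "authors": sub_ids[0].tolist(),
            "repos": sub_ids[1].tolist(),
            "windows": sub_ids[2].tolist(),
            "children": hierarchical_clusters(sub, sub_ids, rank, depth - 1, min_size),
        })
    return nodes


# Toy tensor with two planted author/repository/time-window communities.
X = np.zeros((6, 6, 6))
X[:3, :3, :3] = 1.0
X[3:, 3:, 3:] = 1.0
tree = hierarchical_clusters(X, [np.arange(6)] * 3)
```

Each node of the returned tree lists the original author, repository, and time-window indices that belong to a component, and its children refine that component one level further. The depth and min_size parameters bound the recursion.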
Using these novel capabilities as building blocks, we will develop approaches to profile and classify repositories, study the evolution and the phylogeny among malware repositories, and model the collaboration dynamics of the hacker ecosystem.
Initial Results
We gathered over 7,500 malware repositories in an initial effort that we call SourceFinder. To view these results, visit the SourceFinder Platform.
This work is supported by NSF CISE SATC Award ID#: 2132642