AAFS 2010 – Changes in Approach to Scalability in Digital Forensic Analysis


This year I attended the American Academy of Forensic Sciences (“AAFS”) conference in Seattle and presented in the digital and multimedia section. The following post is a summary of the oral presentation along with my slide set.

For those who do not know, I hold an M.S. in computer science with concentration in Information Assurance. I am presently a Ph.D. student in Engineering and Applied Sciences at the University of New Orleans. I expect to be ABD by fall of this year when I start law school as a J.D. student. Professionally, I work for a litigation support vendor in New Orleans dealing primarily with the civil side of digital forensics, eDiscovery, and other related areas. I have a somewhat unique perspective on the field by having one foot in academia and the other in industry.

One cannot begin a discussion of future trends and the need for new approaches without first examining the current state. At present the field has three main phases of practice: acquisition, analysis, and reporting. Acquisition originated as dead acquisition, in which the data storage medium, such as a hard drive, is imaged byte for byte with the system powered off to produce an exact duplicate. The duplicate is hashed for later verification after analysis is complete. In a more modern twist, live acquisition involves acquiring data from a system while it is still running. Live acquisition allows for preserving more ephemeral data such as memory contents, active network connections, logged-in users, running programs, etc., which would otherwise be lost by powering the system down for dead acquisition. It also risks triggering anti-forensics tools, inviting malicious commands from still-logged-in users, and damaging the system state.
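To make the dead-acquisition step concrete, here is a minimal sketch in Python of imaging a source byte for byte while computing the verification hash in the same pass. The device and output paths are hypothetical examples, not a recommendation of a particular workflow.

```python
# Minimal dead-acquisition sketch: copy a source device byte for byte and
# record a hash for later verification. Paths are hypothetical examples.
import hashlib

SRC = "/dev/sdb"            # hypothetical evidence drive (accessed read-only)
DST = "evidence.dd"         # output image file
CHUNK = 4 * 1024 * 1024     # read in 4 MiB chunks

sha = hashlib.sha256()
with open(SRC, "rb") as src, open(DST, "wb") as dst:
    while True:
        block = src.read(CHUNK)
        if not block:
            break
        dst.write(block)
        sha.update(block)

# The digest is stored alongside the image; re-hashing the image after
# analysis should produce the same value, showing the copy was not altered.
print("sha256:", sha.hexdigest())
```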

However the storage data is acquired, the image files are transferred to a tool suite where the examiner/analyst/investigator/researcher begins the careful analysis process. Depending on the size of the datasets, analysis may take a very long time to complete. Whether the examiner uses a tool suite or a collection of individual tools, some common tasks will be carried out: files will be hashed to exclude known-good files (operating system libraries, known executables, etc.), suspicious files will be flagged for closer scrutiny (erroneous file extensions, unexpectedly large files, encrypted or password-protected files), and text will be searched for relevant keywords. When the examiner finishes the analysis, a report is drafted summarizing and interpreting the evidence, ready for consumption by lawyers, law enforcement, and the courts.
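As a rough illustration of those common passes, the sketch below walks a mounted image, skips known-good files by hash, flags extension mismatches, and reports keyword hits. The known-good hash set, keyword list, and mount point are hypothetical placeholders.

```python
# Sketch of the common analysis passes: hash files against a known-good set,
# flag mismatched extensions, and search content for keywords.
import hashlib
import os

KNOWN_GOOD = {"d41d8cd98f00b204e9800998ecf8427e"}   # example NSRL-style MD5 set
KEYWORDS = [b"invoice", b"password"]                 # example keyword list
MOUNT = "/mnt/evidence"                              # image mounted read-only

for root, _dirs, files in os.walk(MOUNT):
    for name in files:
        path = os.path.join(root, name)
        try:
            data = open(path, "rb").read()
        except OSError:
            continue
        if hashlib.md5(data).hexdigest() in KNOWN_GOOD:
            continue                                 # skip known OS/application files
        if name.lower().endswith(".jpg") and not data.startswith(b"\xff\xd8"):
            print("extension mismatch:", path)       # flag for closer scrutiny
        hits = [k for k in KEYWORDS if k in data.lower()]
        if hits:
            print("keyword hit:", path, hits)
```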

What problems do we face?

No future-trends presentation would be complete without flogging our arch-nemesis: the dreaded chart of hard drive capacity over time versus bandwidth over time. In just the last five years, the capacity of consumer hard drives has increased 400%, and, more problematically, the expansion of consumer desktop systems into a "media center" role in the household has pushed capacities up even in the entry-level market segment. Bandwidth, on the other hand, has never enjoyed more than a 100% increase between interface generations and in some cases has seen meager gains of less than 20% (ATA-133 to SATA I). The potential sources of data are increasing over time as well. Where once a single computer was shared by an entire family, it is now not unusual for each individual to have one or more systems. Cell phones, music players, gaming consoles, thumb drives, and other devices increase the number of targets in an investigation as well as the overall data set size. Looking at currently advertised systems from major OEM and retail channels in each of the device categories makes targets in excess of a terabyte, and approaching two terabytes, plausible.
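The back-of-the-envelope arithmetic behind that complaint is simple. The capacities and transfer rates below are illustrative rather than measured, but they show how sequential imaging time balloons as capacity outpaces bandwidth.

```python
# Illustrative arithmetic: hours needed for one straight sequential pass over
# a target at a given sustained transfer rate. Figures are examples only.
def image_hours(capacity_gb, mb_per_sec):
    return capacity_gb * 1024 / mb_per_sec / 3600

for cap in (500, 1000, 2000):          # target size in GB
    for rate in (60, 100):             # sustained MB/s, optimistic for the era
        print(f"{cap} GB at {rate} MB/s: {image_hours(cap, rate):.1f} h")
```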

Increasing target sizes would not be prohibitive if the other resources required for an investigation increased proportionally. Putting aside access bandwidth concerns, the primary bottleneck in investigations that cannot be easily remedied is the human component. Bandwidth can be artificially increased by spreading the load over multiple drives, caching techniques, and other technological solutions, but the human factor cannot be accelerated the same way. The graduation rate of computer scientists has fallen at all levels (B.S., M.S., Ph.D.) since 2004, and even among those who graduate, the number focused on digital forensics is a small fraction.

In a nutshell, we face ever-increasing dataset sizes, access speeds that do not keep pace with capacity, and a drop in the production of people with sufficient training to perform analysis. All of these factors increase the turnaround time from the start of an investigation to its conclusion and the delivery of its findings to the next stage.

What are our tools?

Many of the tools we use started out as system administration tools. More specific tools came later, such as foremost and scalpel for file carving and stegdetect for detecting steganography. Many tools are created in an ad hoc manner to handle a specific task or situation encountered in the course of an investigation, and the majority were designed with a single system and a single thread in mind. As the tool sets matured, suites became available that combined single-purpose tools into user-friendly packages with GUIs to guide the investigator through the analysis process and provide rudimentary case management.

Both sets of tools fall prey to scaling issues. They focus on a single investigator, do not provide for collaboration between investigators, and are highly dependent on the investigator to report findings to interested parties further down the chain. They also suffer from update problems: coupling an individual tool into a suite limits the investigator's ability to run the most up-to-date version of that tool between suite releases. Ayers, in the 2009 proceedings of the Digital Forensics Research Workshop (DFRWS), identified a set of features he felt were required for a forensics tool to be considered "second generation." I differ slightly in my requirements, giving greatest precedence to distributed/parallel processing, collaboration, application-specific extensibility, and an increased focus on a pipeline approach to decrease turnaround time.

Where do we fit in?

We must recognize that our role in examining a dataset is one part of a greater process. The end of the process lies with law enforcement, attorneys, and ultimately the courts. Those parties cannot begin their work until they are furnished with ours, but they do not necessarily require all of our work to begin theirs. Our current process is effectively synchronous: everyone waits for us to complete. By moving to a more asynchronous pipeline, our partial results can be fed into the pipe so that law enforcement and attorneys can begin building their cases in the courts. Those interested parties will ultimately require our final results, but a significant amount of preliminary work can be completed before the final report is submitted. Additionally, later parts of the pipeline can provide feedback to the investigator for better or more precisely targeted investigations.
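The sketch below is a toy illustration of that asynchronous hand-off, not Black Friar itself: an analysis stage pushes partial findings onto a queue as they are produced, and a downstream consumer (standing in for the legal team's review) starts work before analysis finishes. All names are illustrative.

```python
# Toy asynchronous pipeline: partial results flow downstream as they appear
# instead of waiting for a single final report.
import queue
import threading
import time

findings = queue.Queue()

def analysis_stage():
    for i in range(5):
        time.sleep(0.1)                        # stand-in for per-file processing
        findings.put(f"partial result {i}")    # emitted as soon as it is ready
    findings.put(None)                         # sentinel: analysis complete

def downstream_stage():
    while (item := findings.get()) is not None:
        print("reviewing", item)               # begins before analysis finishes

threading.Thread(target=analysis_stage).start()
downstream_stage()
```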

Black Friar

Black Friar is an experimental prototype of the distributed/parallel pipelined process. Its aims are to distribute the initial processing, to front-load as much computationally expensive work into the first file read as possible, to provide an architecture that can be extended with processing specific to proprietary file types in either the first pass or a more exhaustive second pass, and to produce usable intermediate results that can be shared among investigators for collaboration or fed into the next stage of the pipeline. While proprietary suites are starting to use distributed processing, this capability is lacking in open source tools. The prototype therefore seeks to leverage existing tools in such a way that the tasks which use them become distributable, while retaining the ability to upgrade the individual tools to their latest versions independently of the framework.

The current prototype used for these results extracts data from drive images using The Sleuth Kit. It distributes the file load over the nodes, producing file hashes, extracting string data, and identifying file types. The document information, along with the extracted string data, is formed into an index using Apache Lucene. The Lucene index produced by each node is fully usable on its own prior to merging, allowing segments of interest to be used immediately, before the overall process is complete.
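The sketch below gives a rough idea of the per-node processing pass, assuming The Sleuth Kit command-line tools (fls, icat) are on the path. The image name, node count, and node id are hypothetical parameters, and the real prototype additionally records file metadata and builds the Lucene index; this is only a simplified illustration of how existing tools can be wrapped into distributable work.

```python
# Simplified per-node pass: list files with TSK's fls, take this node's slice
# of the work, then pull each file with icat, hash it, and extract printable
# strings for later indexing. IMAGE, NODE_ID, and NODES are hypothetical.
import hashlib
import re
import subprocess

IMAGE = "target.dd"       # image available locally or via communal storage
NODE_ID, NODES = 0, 7     # this node's share of the file load

listing = subprocess.run(["fls", "-r", "-p", IMAGE],
                         capture_output=True, text=True).stdout

for line in listing.splitlines():
    m = re.match(r"r/r\s+([0-9-]+):\s+(.*)", line)
    if not m:
        continue                                  # skip directories and deleted stubs
    addr, path = m.group(1), m.group(2)
    if int(addr.split("-")[0]) % NODES != NODE_ID:
        continue                                  # another node handles this file
    data = subprocess.run(["icat", IMAGE, addr], capture_output=True).stdout
    digest = hashlib.sha256(data).hexdigest()
    strings = re.findall(rb"[ -~]{4,}", data)     # crude printable-string extraction
    print(addr, path, digest, len(strings))
```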

The test systems included a quad core workstation with 8 GB of RAM and I/O spread over multiple drives, and a seven-node cluster of older dual-socket, dual-core Xeon systems, each with 2 GB of RAM and a single drive. Results show the performance difference between the quad core system and an individual node at each thread count. In both cases performance gains were significant from one to two and from two to three threads, but not from three to four, where the process again became I/O bound. Experimental results are separated into processing, which generates the hashes, determines file types, extracts string data, and queries file metadata from the image, and indexing, which combines the processing results into a Lucene index. In the reported processing results, the distributed metrics cover two configurations: the image file present on each individual node, and the image file stored on the Gluster File System as communal storage with the systems interconnected via Gigabit Ethernet. The performance decrease using Gluster rather than a local copy on each node was minimal and would presumably disappear with an interconnect faster than local I/O.

On the processing side, the fully distributed run took 18% of the time of a single-thread, single-node run and 26% of the time of a fully threaded single-node run. Indexing on the cluster took 9% of the time required for a single-threaded, single-node run and 14% of a fully threaded single-node run. These results were obtained using the DC3 2009 data set as the target. The performance gains are encouraging.
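Read as speedup factors, under the assumption that "X% of the time" means distributed elapsed time divided by baseline elapsed time, those percentages work out roughly as follows.

```python
# Convert the reported "percent of baseline time" figures into speedup factors.
results = [("processing vs. single-thread, single-node", 18),
           ("processing vs. fully threaded single-node", 26),
           ("indexing vs. single-thread, single-node", 9),
           ("indexing vs. fully threaded single-node", 14)]

for label, pct in results:
    print(f"{label}: about {100 / pct:.1f}x faster")
```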

Black Friar will continue development, and will be released as an open source project under the GPL v3 when it is sufficiently developed to merit release.

AAFS 2010 Slides (1,559.56 KB)