Tuesday, April 26, 2011

MapReduce for simpletons

"Data reduction redux and map-reduce" is the title of my latest post at the SANS Digital Forensics Blog. I mentioned in my previous post there, on using least frequent occurrence in string searching, that there would be a follow-up.

The point of the new post is to sing the praises of @strcpy over on the Twitters. He helped me out by writing a short shell script that is, in essence, map-reduce for simpletons like me. I am constantly amazed by some of the members of the info sec community who will take time to help out near total strangers.

strcpy's script wasn't just helpful, it was educational. I'd read about map-reduce before, but it never really clicked until I saw strcpy's script. The scales have fallen from my eyes and I'm now adapting his script for other kinds of tasks.
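
For anyone who hasn't seen the pattern before, here's a rough sketch of the idea. This isn't strcpy's script, just an illustration; the input file, chunk size and level of parallelism are made up:

# "map": split the input into chunks and sort each chunk in parallel
split -l 1000000 strings.txt chunk_
ls chunk_* | xargs -P 4 -I{} sh -c 'sort {} > {}.sorted'

# "reduce": merge the sorted chunks and collapse them into frequency counts
sort -m chunk_*.sorted | uniq -c | sort -n > string_counts.txt

The map step farms the same job out to every chunk at once, and the reduce step folds the partial results back into a single answer. Once you see it done in shell, the whole idea is a lot less mysterious.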

Check out the post, and if you find it beneficial and ever get to meet strcpy in person, buy him a drink or a meal and tell him thanks. I plan to do the same one day.

Monday, April 25, 2011

Scalpel and Foremost

The crew over at Digital Forensics Solutions announced the release of a new version of Scalpel with some exciting new features. Check out their post for the full details, but here are three I was most interested in:

  • Parallel architecture to take full advantage of multicore processors

  • Beta support for NVIDIA CUDA-based GPU acceleration of header / footer searches

  • An asynchronous IO architecture for significantly faster IO throughput


Digital forensics is time consuming, so any speed gains we can make are welcome ones.

Over the last few days, I've had a chance to play with the new version of scalpel on my 64-bit Ubuntu system with 7GB of RAM. I downloaded the source and followed the directions in the readme to configure and compile the binary.
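
For anyone building it themselves, it's the usual source-tarball routine, something close to this (the version number is illustrative; the README is the authority on the exact steps):

tar xzf scalpel-2.0.tar.gz
cd scalpel-2.0
./configure
make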

I then ran some carves against a 103GB disk image from a recent case. The command line I used was:

scalpel -b -c /etc/scalpel.conf -o scalpel-out/ -q 4096 sda1.dd

The -q option is similar to foremost's -q option in that it tells scalpel to only scan the start of each cluster boundary for header values that match those specified in the config file. In my test, I used the two doc file signatures from the supplied example scalpel config file. The nice thing about scalpel's -q is that you can provide the cluster size. Foremost's -q scans the start of each sector by default, so you have to add -b as well to get similar functionality out of foremost.
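
If you don't already know the cluster size of the file system you're carving, The Sleuth Kit will tell you; fsstat reports it as "Cluster Size" on NTFS and "Block Size" on ext file systems, so something like this does the job:

fsstat sda1.dd | grep -iE 'cluster size|block size'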

I ran scalpel with the Linux time command so I could determine how long the command took to complete. Scalpel carved 6464 items that had byte signatures matching those in the configuration file. According to the time command, this took 52 minutes and 40 seconds.
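
In practice that just means prefixing the carve with time, for example:

time scalpel -b -c /etc/scalpel.conf -o scalpel-out/ -q 4096 sda1.dd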

Manually verifying that all 6464 files are Word docs would be time consuming. In lieu of that, I followed Andrew Case's suggestion and used the following command from within the scalpel-out directory:

for i in $(find . | grep doc$); do file $i; done | grep -i corrupt | wc -l

The result was that 2707 of the 6464 files were found to be "corrupt" according to the file command. This is not an exact measure of the accuracy of scalpel's work, but it gives us a ballpark figure. If my math is correct, that's a false positive rate of roughly 42%. Just remember, these are rough figures, not exactly scientific.
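
If you'd rather not do that arithmetic by hand, something along these lines works from inside the output directory (the .doc extension comes from the signatures in the config file):

total=$(find . -name '*.doc' | wc -l)
corrupt=$(find . -name '*.doc' -exec file {} + | grep -ic corrupt)
echo "scale=1; 100 * $corrupt / $total" | bc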

Next I configured foremost to use the exact same configuration file options and similar command line arguments (recall I had to use -b with foremost) and ran the carve against the same image. The command line I used was:

foremost -c /etc/foremost.conf -i sda1.dd -o foremost-out -q -b 4096

Again, I used the time command to measure how long this took: 47 minutes and 32 seconds later, foremost finished, having carved 6464 files. I used the same measure of accuracy as with scalpel, running the following command from within the foremost-out directory:

for i in $(find . | grep doc$); do file $i; done | grep -i corrupt | wc -l

The result was that 2743 files came back as "corrupt" according to the file command. Interesting. Both tools used the exact same signatures and both carved exactly the same number of files, yet foremost was approximately 1% less accurate. On the other hand, at 47 minutes compared to scalpel's 52, it was almost 10% faster.

Conclusion:
It's hard to draw conclusions from one simple test. I think it's great that scalpel is under active development, and for those who can take advantage of the CUDA support, it could be a huge win in terms of time. And time is against us these days in the digital forensics world.

The other big plus is that it's great to have another tool we can use to test the results of other tools. I will continue to experiment with scalpel, and I look forward to future developments. I thank the developers of both tools for their contributions to the community.

Saturday, April 23, 2011

Forensic string searching

Can the principle of "least frequent occurrence" be applied to digital forensic string searches?

Late last night (or painfully early this morning) I published a new post over at the SANS Digital Forensics Blog. The post is called "Least frequently occurring strings?" and attempts to shed some light on that question.

I've used this approach on a couple of recent cases, one real and one from The Honeynet Project's forensic challenge image, which is the image the post contains data from.

I really knew nothing about the Honeynet challenge case, but in less than half an hour I'd located an IRC bot using the LFO approach to analyzing strings. Of course, the Honeynet case is quite small, so the technique worked well; on larger cases from the real world, I expect it will take longer or maybe not work at all. Nevertheless, LFO is a concept that other practitioners have been applying for some time now.

There are lots of other goodies in the post, like moving beyond just using strings to extract ASCII and Unicode text from disk images. If you have a decent system and a good dictionary file, you can reduce this set of data even further, to lines that actually contain English words.
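
The post has the details, but the rough shape of the pipeline is something like this; the image name, output files and dictionary path are illustrative:

# pull ASCII and little-endian Unicode (Windows) strings out of the image
strings -a sda1.dd > ascii.txt
strings -a -e l sda1.dd > unicode.txt

# least frequent occurrence: rank strings by how rarely they appear
cat ascii.txt unicode.txt | sort | uniq -c | sort -n | head -50

# optional: keep only lines that contain at least one dictionary word
# (a trimmed word list works better; very short words match nearly everything)
grep -Fwf /usr/share/dict/words ascii.txt > english_only.txt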

Check it out; I hope the world finds it useful.
