Monday, December 26, 2011

Check the uids and gids

While working on body-outliers, the Python script I wrote to do statistical analysis on fls bodyfiles in an effort to find malicious files in compromised file systems, one thing I had been ignoring completely, even though it stuck out like a sore thumb when reviewing the data, was the user and group IDs of files in Unix and Linux file systems.

When attackers build the kits they intend to drop on remote hosts as backdoors, packet sniffers, key loggers, etc., they often use tar and gzip to create compressed archives of those files. They can then use a command like wget to download the archive to the compromised host, where they "untar" it and move their malicious binaries into the desired paths on the system.

One of the "features" of tar, as the manpage tells us, is that "by default, newly-created files are owned by the user running tar." This means that if the attacker is logged into his own system as a non-root user and is compiling binaries that will replace legitimate binaries on the target system, those binaries will retain his user and group ID information when they are tar'd up. Of course a careful, thoughtful attacker can take a variety of countermeasures to change this, but many are not so careful.
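
You can see the ownership an archive records without extracting anything. Here's a minimal sketch using Python's tarfile module; the archive name is hypothetical:

#!/usr/bin/env python
# Minimal sketch: print the uid/gid recorded for each member of a tar
# archive without extracting it. The archive name is hypothetical.
import tarfile

archive = tarfile.open("dropped_kit.tgz", "r:gz")
for member in archive.getmembers():
    print("uid: %6d  gid: %6d  %s" % (member.uid, member.gid, member.name))
archive.close()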

As a result, when they install malicious code on target systems, there's a chance those binaries will be installed with user IDs and group IDs (henceforth uid and gid) that don't match other files in those locations. These are obvious outliers. As I was working on the next version of body-outliers, I had written code to calculate the average uid and gid values on a per-directory basis, then calculate the standard deviation and alert on the outliers. But that sort of statistical analysis didn't make sense for uids and gids, because for the most part they are uniform throughout the file system, with a few exceptions like /tmp, /var/spool/cron, /var/spool/mail and many custom software packages. Most system directories like /dev, /bin, /usr, etc. are uid and gid 0, meaning the files are owned by the root account and belong to the root group. In this context, standard deviation didn't make much sense, so I modified my code to do another form of statistical analysis; namely, calculating distributions.

Calculating distributions is just fancy talk for counting the occurrences of a thing, say, how many files are uid 0, how many are uid 1000, and so on, then displaying this information. This type of analysis lends itself well to finding oddball uid and gid files in compromised *nix file systems. On the hacked system I spoke of during my SECTor 2011 talk (video, slides), finding these unusual uid and gid files correlated very well with finding attacker code, for precisely the reasons described above.
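
The counting itself is trivial. Here's a minimal sketch of the idea, not the actual body-ugid-dist.py code; it assumes the TSK 3.x bodyfile layout (MD5|name|inode|mode|UID|GID|size|atime|mtime|ctime|crtime), with the path in the second field and the uid in the fifth, and tallies uid values per directory:

#!/usr/bin/env python
# Minimal sketch of the distribution idea, not body-ugid-dist.py itself.
# Assumes the TSK 3.x bodyfile layout:
# MD5|name|inode|mode|UID|GID|size|atime|mtime|ctime|crtime
import os
import sys
from collections import defaultdict

counts = defaultdict(lambda: defaultdict(int))

for line in open(sys.argv[1]):
    fields = line.rstrip("\n").split("|")
    if len(fields) < 11:
        continue                     # skip lines that aren't bodyfile records
    path, uid = fields[1], fields[4]
    counts[os.path.dirname(path)][uid] += 1

for directory in sorted(counts):
    print("Path:  %s" % directory)
    for uid, count in sorted(counts[directory].items(), key=lambda kv: kv[1]):
        print("Count: %7d  uid: %5s" % (count, uid))
    print("")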

Here's a sample run of the script, which I'm calling body-ugid-dist.py, run against the same bodyfile as the one in the SECTor talk; the output has been trimmed down a bit:
./body-ugid-dist.py --file sda1_bodyfile.txt --meta uid
[+] Checking command line arguments.
[+] sda1_bodyfile.txt may be a bodyfile.
[+] Discarded 0 files named .. or .
[+] Discarded 0 bad lines from sda1_bodyfile.txt.
[+] Added 20268 paths to meta.

...

Path:  /etc/cron.daily
==========================
Count:       1  uid:  1000
Count:       9  uid:     0

...

Path:  /usr/lib
==========================
Count:       1  uid:    10
Count:       1  uid:    37
Count:       1  uid:  1000
Count:    2082  uid:     0

...

In actuality this script returns 499 lines of output, representing about 350 "Counts," most of which were specific to the custom application running on the system. But the overall bodyfile had more than 200 thousand lines, so this is a considerable reduction in data, which is vital to any investigation. What the above output tells us is that of the 10 files in /etc/cron.daily, nine are uid 0 and one is uid 1000. That's a lead that may be worth pursuing, and indeed, in this case, it is malicious code. The next entry shows that /usr/lib contains 2085 files, 2082 of them uid 0 and three that are one-offs and certainly worth looking into. In that case, two of the three are malicious code.

body-ugid-dist.py is available from my github repo. Unfortunately, it's only going to be useful for *nix cases. Running it is quite simple; the usage is shown below:
./body-ugid-dist.py 
usage: body-ugid-dist.py [-h] --file FILENAME [--meta META]

This script parses an fls bodyfile and returns the uid or gid distribution on
a per directory basis.

optional arguments:
  -h, --help       show this help message and exit
  --file FILENAME  An fls bodyfile, see The Sleuth Kit.
  --meta META      --meta can be "uid" or "gid." Default is "uid"

I wrote about this previously for the SANS Digital Forensics Blog. If this kind of analysis interests you, join me for SANS 508: Advanced Computer Forensic Analysis & Incident Response in Phoenix in February of 2012.

Tuesday, November 22, 2011

Fourth Amendment Hard Disk Wipe

Recently I replied to a thread on a mailing list about wiping hard disk drives.

I'd just spent a few hours over a recent weekend playing around with the hdparm command in Linux because it has the ability to use the ATA Secure Erase feature, which is much faster and more comprehensive than software wipe utilities like the trusty Darik's Boot and Nuke. For example, I recently wiped a 500GB drive in just over two hours.
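
For the curious, the hdparm secure erase procedure looks roughly like the sketch below. This is an illustration, not the exact commands from that weekend; the device name and password are hypothetical, and you'll want to confirm from the hdparm -I output that the drive supports the security feature set and reports "not frozen" before trying it.

#!/usr/bin/env python
# Rough sketch of the ATA Secure Erase sequence via hdparm; an
# illustration, not a polished tool. The device name and password are
# hypothetical. Triple check the device; this destroys all data on it.
import subprocess

dev = "/dev/sdX"       # hypothetical target drive
passwd = "wipeit"      # temporary security password, cleared by the erase

# Show the drive's security state; look for "supported" and "not frozen".
subprocess.call(["hdparm", "-I", dev])

# Set a temporary user password, enabling the drive's security feature set.
subprocess.call(["hdparm", "--user-master", "u",
                 "--security-set-pass", passwd, dev])

# Issue the secure erase; the drive's firmware overwrites every sector.
subprocess.call(["hdparm", "--user-master", "u",
                 "--security-erase", passwd, dev])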

I was experimenting with hdparm and secure erase because I wanted to try it out and because I was prepping an old drive to give to a friend. After the secure erase finished and I verified that the drive contained no data, I wrote a little shell script to overwrite the entire thing with the text of the 4th Amendment to the U.S. Constitution, something I was inspired to do after reading about @ioerror overwriting USB sticks with the Bill of Rights.


I mentioned this script on a mailing list and a friend replied that I was "so subversive." Now, I'm almost certain the reply was in jest and that he doesn't honestly feel that way, but I suspect there are folks who do think it's subversive. I think it's a sad commentary on the state of the U.S. collective psyche when we consider Constitutional guarantees as subversive.

A handful of people replied to me that they wanted the script. Well, it's not pretty, nor fast, and Hal Pomeranz and a thousand other Unix beards could probably come up with a better solution, but it works. I've added a measure of protection to it because I imagine some people will screw themselves with this, so be careful and mind your devices.

#!/bin/bash
# This is a hack I wrote to overwrite $1 with the 4th Amendment.
# It's not pretty, it's not fast, but it works.
# If $1 is a device, when it's full, errors will be thrown and not handled.
# If $1 is not a device, the block device that it resides on will eventually
# fill up, if this script is left running.

# The next line will cause the script to exit on any errors, like
# when the device is full. Hey, I said it was a hack.
set -e

echo "This hack overwrites $1 with the text of the 4th Amendment."
echo "ALL DATA WILL BE LOST."

echo "Are you absofrigginlutely sure you want to continue?"
select yn in "Yes" "No"; do
    case $yn in
        Yes ) exec > "$1"
            while : 
                do echo "The right of the people to be secure in their persons, " \
                "houses, papers, and effects, against unreasonable searches and " \
                "seizures, shall not be violated, and no Warrants shall issue, " \
                "but upon probable cause, supported by Oath or affirmation, and " \
                "particularly describing the place to be searched, and the " \
                "persons or things to be seized.";
                done;;
        No ) exit;;
    esac
done

To use this, save it as a shell script on a Linux system and invoke it from the command line as <command name> <device name>. When the device is full, the script will exit with an error. Enjoy.

Sunday, October 23, 2011

Egress Filtering

“It is not what enters into the mouth that defiles the man, but what proceeds out of the mouth, this defiles the man.”
-- Jesus

White Hat Security's Jeremiah Grossman recently tweeted the following quotes from info sec legend Dan Geer:

[Embedded tweets not preserved.]

Geer is a genius, there can be no doubt. However, when I read this, it bothered me. I have worked in large enterprises where knowing everything was nearly impossible and yet default-deny egress filtering was in place and effective at limiting loss.

Certainly implementing a default-deny egress filter without careful planning will be a resume-generating event, but not implementing it due to incomplete knowledge may have the same result.

And as I said in reply to Jeremiah's tweets, implementing a default deny quickly leads to knowledge, but again, you're going to want to do this in a well-communicated and coordinated way, with careful planning throughout the organization and management chain.

Friday, August 19, 2011

Fuzzy Hashing and E-Discovery

Recent work has made me consider an interesting role fuzzy hashes could play in E-Discovery.

In the last year I've worked a few intellectual property theft cases where Company A has sued Company B claiming Company B stole IP from Company A in the form of documents, design drawings, spreadsheets, contracts, etc.

In these cases Company A has requested that Company B turn over all documents that may pertain to Company A or Company A's work product, etc. with specific search terms provided and so on.

Company B argues they can't comply with Company A's request because they have documents relating to Company A and Company A's work product as a result of market research for the purposes of strategic planning and that turning over all of those documents would damage Company B.

In such cases, if Company A is concerned that Company B has stolen specific documents, maybe a better approach would be to request that Company B run ssdeep or another fuzzy hashing tool against all of their documents and turn over the fuzzy hashes.

Company A can then review the fuzzy hash results from Company B without knowing anything about the documents those hashes came from. They can compare the set of hashes provided by Company B against the set of fuzzy hashes generated from their own documents and make an argument to the judge to compel Company B to turn over those documents that match beyond a certain threshold.

For reference, an ssdeep fuzzy hash looks like this:

24:DZL3MxMsqTzquAxQ+BP/te7hMHg9iGCTMyzGVmZWImQjXIvTvT/X7FJf8XLVw:J3oy+x/te7qmNmlYvX/xp8W
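
If both sides produce hashes with ssdeep, the matching step is straightforward to automate. A minimal sketch, assuming the ssdeep Python bindings are installed and that each side supplies a plain listing of its hashes; the file names and threshold here are hypothetical:

#!/usr/bin/env python
# Minimal sketch of the matching step, assuming the ssdeep Python
# bindings are installed. The input files are hypothetical: one ssdeep
# hash per line, hash in the first comma-separated field.
import ssdeep

THRESHOLD = 75   # similarity score (0-100) considered worth pursuing

def load_hashes(filename):
    hashes = []
    for line in open(filename):
        line = line.strip()
        if not line or line.startswith("ssdeep,"):
            continue                      # skip blanks and the header line
        hashes.append(line.split(",")[0]) # the hash is the first field
    return hashes

company_a = load_hashes("company_a_hashes.txt")
company_b = load_hashes("company_b_hashes.txt")

for a in company_a:
    for b in company_b:
        score = ssdeep.compare(a, b)
        if score >= THRESHOLD:
            print("%3d  %s  %s" % (score, a, b))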

Sunday, August 14, 2011

Facebook Artifact Parser

If you have a Facebook account, take a look under the hood some time by viewing the source in your browser while you're logged in. Imagine having to deal with all of that for a digital forensics investigation. It's mind numbing, especially if all you want is who said what and when. I spent the better part of today brushing up on Python's regular expression implementation and put together this Facebook Artifact Parser that does a decent job of parsing through Facebook artifacts found on disk (as of the time of this writing).

In my case, I made use of this by first recovering several MB worth of Facebook artifacts from disk and combining all of those elements into one file. Having done that, I ran the script from the command line, giving the name of the file as its only argument. It works on multiple files as well.
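
For the curious, the guts of the approach are just a few regular expressions run over the recovered data. The sketch below is illustrative only, not the actual parser, and the field names it looks for ("author", "timestamp", "text") are hypothetical stand-ins; inspect your own recovered artifacts and adjust the pattern to match what's actually there.

#!/usr/bin/env python
# Illustrative sketch only, not the actual Facebook Artifact Parser.
# The field names ("author", "timestamp", "text") are hypothetical
# stand-ins; adjust the pattern to whatever the recovered artifacts
# actually contain.
import re
import sys
import time

pattern = re.compile(
    r'"author":"(?P<author>[^"]+)"[^{}]*?'
    r'"timestamp":(?P<ts>\d{10})[^{}]*?'
    r'"text":"(?P<text>[^"]*)"')

for filename in sys.argv[1:]:
    data = open(filename, "rb").read().decode("utf-8", "replace")
    for m in pattern.finditer(data):
        when = time.strftime("%Y-%m-%d %H:%M:%S", time.gmtime(int(m.group("ts"))))
        print("%s  %s: %s" % (when, m.group("author"), m.group("text")))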

Sunday, August 7, 2011

Yahoo! Messenger Decoder Updated

I'm working yet another case that involves Yahoo! Messenger Archives. I tried using JAD Software's excellent Internet Evidence Finder for this and it worked pretty well, but in the interest of double-checking my tools, I dusted off the old yahoo_msg_decoder.py script that I'd written a few years ago. It used to be interactive, meaning it was run with no arguments and would prompt for a username and a filename to parse; this was less than ideal for parsing a large number of files.

I have remedied that situation. The script now takes three arguments, one optional. The first is the username for the archive. Yahoo! Messenger Archives are xor'd with the username. The second argument is the name of the other party to the conversation and the third argument is the name of the dat file to process.
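
The deobfuscation itself is simple once you know the key is the username. Here's a minimal sketch of just that step (it leaves out parsing the .dat record structure around each message, and the input file name is hypothetical):

#!/usr/bin/env python
# Minimal sketch of the deobfuscation step only: message text in a
# Yahoo! Messenger archive is XOR'd with the account's username,
# repeated cyclically. Parsing the surrounding .dat record structure
# (lengths, timestamps, direction flags) is not shown here.
def unxor(data, username):
    out = []
    for i, byte in enumerate(bytearray(data)):
        out.append(chr(byte ^ ord(username[i % len(username)])))
    return "".join(out)

# Hypothetical example: 'blob' would be the raw message bytes pulled
# out of one record in the .dat file.
blob = open("message_blob.bin", "rb").read()
print(unxor(blob, "joebob"))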

The nice thing about this is that you can now create a for loop like the following from a Linux environment and parse multiple files at once:

for i in *.dat; do echo; echo "== Parsing $i =="; yahoo_msg_decoder.py --username=joebob --other_party=billybob --file="$i"; echo "== Finished parsing $i =="; echo; done


The output of this for loop can be redirected to a file.

My script is still not perfect. On some dat files it doesn't properly xor the data and yields garbage. I have not determined why that is the case yet.

As for IEF, I'm not sure why, but running it over the same dat files as my script, it dropped some portions of the conversation. I will be reporting the issue to JAD. But it's yet another reminder of the importance of testing your tools and confirming results.

update: After posting this, I remembered that Jeff Bryner had written a utility for this and it is still vastly superior to my own. I just verified that the link I have to his yim2text still works. Check it out.

Monday, May 30, 2011

Awk regtime bodyfile adjustment

Here's an awk one-liner for adjusting regtime bodyfile time stamps; in this case we're adding 600 seconds:

awk -F'|' 'BEGIN {OFS="|"} {$9=$9+600;print}'


One thing to consider when adjusting time stamps to compensate for clock drift: clocks don't drift all at once, but over days, weeks and months. Adjusting for skew this way shifts everything all at once.

Wednesday, May 18, 2011

Time again

I gave a version of the Time Line Analysis talk at Cyber Guardian earlier this week. Some in the room asked if the slides would be made available. As promised, here is a link to the deck. Enjoy.

Tuesday, April 26, 2011

MapReduce for simpletons

Data reduction redux and map-reduce is the title of my latest post at the SANS Digital Forensics Blog. In my previous post there, on using least frequency of occurrence in string searching, I mentioned that there would be a follow-up.

The point of the new post is to sing the praises of @strcpy over on the Twitters. He helped me out by writing a short shell script that is, in essence, map-reduce for simpletons like me. I am constantly amazed by some of the members of the info sec community who will take time to help out near total strangers.

strcpy's script wasn't just helpful, it was educational. I'd read about map-reduce before, but it never really clicked until I saw strcpy's script. The scales have fallen from my eyes and I'm now adapting his script for other kinds of tasks.

Check out the post, and if you find it beneficial and you ever get to meet strcpy in person, buy him a drink or a meal and tell him thanks; I plan to do the same one day.

Monday, April 25, 2011

Scalpel and Foremost

The crew over at Digital Forensics Solutions announced the release of a new version of Scalpel with some exciting new features. Check out their post for the full details, but here are three I was most interested in:

  • Parallel architecture to take full advantage of multicore processors

  • Beta support for NVIDIA CUDA-based GPU acceleration of header / footer searches

  • An asynchronous IO architecture for significantly faster IO throughput


Digital forensics is time consuming, so any speed gains we can make are welcome ones.

    Over the last few days, I've had a chance to play with the new version of scalpel on my 64-bit Ubuntu system with 7GB of RAM. I downloaded the source and followed the directions in the readme to configure and compile the binary.

    I then ran some carves against a 103GB disk image from a recent case. The command line I used was:

    scalpel -b -c /etc/scalpel.conf -o scalpel-out/ -q 4096 sda1.dd

The -q option is similar to foremost's -q option in that it tells scalpel to scan only the start of each cluster for header values that match those specified in the config file. In my test, I used the two doc file signatures from the supplied example scalpel config file. The nice thing about scalpel's -q is that you can provide the cluster size; with foremost, -q scans the start of each sector by default, so you'll also have to add -b to get similar functionality out of foremost.

    I ran scalpel with the Linux time command so I could determine how long the command took to complete. Scalpel carved 6464 items that had byte signatures matching those in the configuration file. According to the time command, this took 52 minutes and 40 seconds.

    Manually verifying that all 6464 files are Word docs would be time consuming. In lieu of that, I followed Andrew Case's suggestion and used the following command from within the scalpel-out directory:

    for i in $(find . | grep doc$); do file $i; done | grep -i corrupt | wc -l

The result was that 2707 of the 6464 files were found to be "corrupt" according to the file command. This is not an exact measure of the accuracy of scalpel's work, but it gives us a ballpark figure. If my math is correct, that's a false positive rate of roughly 42%. Just remember, these are rough figures, not exactly scientific.

    Next I configured foremost to use the exact same configuration file options and similar command line arguments (recall I had to use -b with foremost) and ran the carve against the same image. The command line I used was:

    foremost -c /etc/foremost.conf -i sda1.dd -o foremost-out -q -b 4096

Again, I used the time command to measure how long this took; 47 minutes and 32 seconds later, foremost finished, having carved 6464 files. I used the same measure of accuracy as with scalpel, running the following command from within the foremost-out directory:

    for i in $(find . | grep doc$); do file $i; done | grep -i corrupt | wc -l

The result was that 2743 files came back as "corrupt" according to the file command. Interesting. Both tools used the exact same signatures and both carved exactly the same number of files, yet foremost's "corrupt" count was slightly higher than scalpel's (roughly 42.4% versus 41.9%), while at 47 minutes compared to scalpel's 52 minutes, it was almost 10% faster.
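
For the record, here's the quick arithmetic behind those percentages; a back-of-the-envelope check, nothing more.

#!/usr/bin/env python
# Back-of-the-envelope check of the figures quoted above.
total = 6464
scalpel_corrupt, foremost_corrupt = 2707, 2743
scalpel_secs = 52 * 60 + 40     # 52m40s
foremost_secs = 47 * 60 + 32    # 47m32s

print("scalpel corrupt rate:  %.1f%%" % (100.0 * scalpel_corrupt / total))   # 41.9%
print("foremost corrupt rate: %.1f%%" % (100.0 * foremost_corrupt / total))  # 42.4%
print("foremost time savings: %.1f%%" %
      (100.0 * (scalpel_secs - foremost_secs) / scalpel_secs))               # 9.7%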

    Conclusion:
It's hard to draw conclusions from one simple test. I think it's great that scalpel is under active development, and for those who can take advantage of the CUDA support, it could be a huge win in terms of time, and time is against us these days in the digital forensics world.

The other big plus is that we now have another tool we can use to test the results of our other tools. I will continue to experiment with scalpel and look forward to future developments, and I thank the developers of both tools for their contributions to the community.

    Saturday, April 23, 2011

    Forensic string searching

    Can the principle of "least frequent occurrence" be applied to digital forensic string searches?

    Late last night (or painfully early this morning) I published a new post over at the SANS Digital Forensics Blog. The post is called "Least frequently occurring strings?" and attempts to shed some light on that question.

I've used this approach on a couple of recent cases, one real and one from The Honeynet Project's forensic challenge image found here; the latter is the image the post contains data from.

I really knew nothing about the Honeynet challenge case, but in less than half an hour I'd located an IRC bot using the LFO approach to analyzing strings. Of course the Honeynet case is quite small, so the technique worked well; on larger cases from the real world, I expect it will take longer or maybe not work at all. Nevertheless, LFO is a concept that other practitioners have been applying for some time now.

There are lots of other goodies in the post, like moving beyond just using strings to extract ASCII and Unicode text from disk images. If you have a decent system and a good dictionary file, you can reduce this set of data even further to lines that actually contain English words.
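
As an illustration of that last idea (my own sketch, not code from the post): keep only those lines of strings output that contain at least one word from a dictionary file such as /usr/share/dict/words.

#!/usr/bin/env python
# Sketch, not code from the SANS post: keep only lines of strings output
# that contain at least one dictionary word of four or more characters.
# Reads strings output on stdin; the dictionary path is argv[1].
import re
import sys

words = set()
for w in open(sys.argv[1]):            # e.g. /usr/share/dict/words
    w = w.strip().lower()
    if len(w) >= 4:
        words.add(w)

for line in sys.stdin:
    tokens = re.findall(r"[A-Za-z]{4,}", line)
    if any(t.lower() in words for t in tokens):
        sys.stdout.write(line)

Something like strings -a image.dd | ./english_lines.py /usr/share/dict/words > english_lines.txt would do it; the file names here are hypothetical.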

    Check it out, I hope the world finds it useful.

    Wednesday, March 16, 2011

    Incident Response Triage

    Your phone rings, it's the Help Desk. They are calling you because they've got a few dozen systems that have been hit with malware that apparently came into the organization via phishing. Unfortunately, your team isn't large enough to respond to all of these systems simultaneously. You've got to quickly prioritize.

    You call the members of your team together and start delegating tasks. One person contacts the email admins and finds out who received the phishing email and compares that list against the one the Help Desk gave you. The email admins remove the offending message from user mailboxes and blacklist the sender. You ask them to send a copy of the message to you so you can dissect it and begin the process of analyzing the malware.

You learn of another dozen potentially compromised hosts from conversations with the email admins. You add them to your list. How do you prioritize your response to these victim systems? Let's say your company is very large, Fortune 100, and has been through a series of mergers and acquisitions over the last several years; nearly all of the names on the list of affected users are unknown to you. On the one hand, this may be good, as it's likely none of these individuals are C-level execs. On the other hand, you've now got to figure out who these people are, what data they have on their systems, what data they have access to, and who their local IT support personnel are.

    What are your next steps? Do you contact each user and survey them, asking what kind of data they deal with and have access to? Do you ask who their IT support person is? How accurate is the information you're going to get? What if some of these systems are multi-user and the user you're talking to is unaware of the special projects and associated data?

    Aside from questioning users, what other information gathering do you need to do? Does your organization have good exfiltration monitoring and logging in place? Do you have the ability to pull those logs and see what, if any, data has left the org? Do you have the ability to rapidly block outbound connections to the malware's command and control networks?

I know I'm asking more questions than I'm answering; partly this is stream-of-consciousness writing, but I'm also soliciting input on IR triage for a project I'm working on. I've started a little IR triage tool I'm calling Windows Automated Incident Triage, or WAIT. Here is the current capability roadmap for WAIT: identify users on a system and their privilege levels, catalog the data those users have recently accessed on their systems, create a list of file shares those users have recently accessed, gather available web history, and collect information about the system's OS revision and a list of installed software.
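
To make that roadmap a little more concrete, here's a rough sketch of how a couple of those items might be gathered on a Windows host. This is not WAIT itself, just an illustration using built-in commands; output parsing and the remaining artifacts are left out.

#!/usr/bin/env python
# Rough illustration of a couple of WAIT roadmap items (local users and
# installed software), not WAIT itself. Runs built-in Windows commands;
# parsing and reporting are left out.
import subprocess

def run(cmd):
    print("== %s ==" % " ".join(cmd))
    subprocess.call(cmd)

# Local accounts and membership in the Administrators group.
run(["net", "user"])
run(["net", "localgroup", "administrators"])

# Mapped drives / file shares in use by the current user.
run(["net", "use"])

# OS revision and general system information.
run(["systeminfo"])

# Installed software, per the Uninstall registry key.
run(["reg", "query",
     r"HKLM\SOFTWARE\Microsoft\Windows\CurrentVersion\Uninstall", "/s"])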

    My hope is that this information will be useful to IR professionals in a situation like that above. I want a tool that can be used to help prioritize IR. What artifacts am I missing that may also be useful?

    And of course the tool will be open source, likely released under a BSD style license.

    Sunday, January 9, 2011

    How to find base64 encoded evidence

    Today I released a post over at the SANS Digital Forensics Blog discussing how to find evidence that may have been base64 encoded and therefore not found by traditional tools that categorize files based on magic numbers.

    The technique is really simple, but I hadn't seen it discussed elsewhere, perhaps because it's so obvious.

    Enjoy.

    Update: Here's a text file containing some magic byte sequences for common image types that have been base64 encoded: http://trustedsignal.com/forensics/b64_enc_img_types.txt.
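
The trick is easy to reproduce for other file types. Here's a small sketch (my own illustration, not the contents of that text file): base64-encode a type's magic bytes and keep only the output characters that are fully determined, i.e. those produced by complete 3-byte input groups, then search for those strings.

#!/usr/bin/env python
# Small sketch of the idea, not the contents of the linked text file:
# base64-encode a file type's magic bytes and keep only the characters
# fully determined by them (each complete 3-byte group of input yields 4
# fixed output characters). This assumes the base64 stream begins at the
# start of the file; data encoded from another 3-byte alignment differs.
import base64

magics = {
    "jpeg":   b"\xff\xd8\xff",
    "png":    b"\x89PNG\r\n\x1a\n",
    "gif87a": b"GIF87a",
    "gif89a": b"GIF89a",
}

for name, magic in sorted(magics.items()):
    fixed = (len(magic) // 3) * 4     # chars unaffected by whatever follows
    signature = base64.b64encode(magic)[:fixed].decode("ascii")
    print("%-7s %s" % (name, signature))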
