1
Digital Forensics
  • Dr. Bhavani Thuraisingham
  • The University of Texas at Dallas
  • Evidence Correlation
  • November 4, 2008

2
Papers to discuss
  • Forensic feature extraction and cross-drive
    analysis
  • http://dfrws.org/2006/proceedings/10-Garfinkel.pdf
  • A correlation method for establishing provenance
    of timestamps in digital evidence
  • http://dfrws.org/2006/proceedings/13-20Schatz.pdf
  • md5bloom: Forensic file system hashing revisited
    (OPTIONAL)
  • http://dfrws.org/2006/proceedings/11-Roussev.pdf
  • Identifying almost identical files using context
    triggered piecewise hashing (OPTIONAL)
  • http://dfrws.org/2006/proceedings/12-Kornblum.pdf

3
Abstract of Paper 1
  • This paper introduces Forensic Feature Extraction
    (FFE) and Cross-Drive Analysis (CDA), two new
    approaches for analyzing large data sets of disk
    images and other forensic data. FFE uses a
    variety of lexigraphic techniques for extracting
    information from bulk data; CDA uses statistical
    techniques for correlating this information
    within a single disk image and across multiple
    disk images. An architecture for these techniques
    is presented that consists of five discrete
    steps: imaging, feature extraction, first-order
    cross-drive analysis, cross-drive correlation,
    and report generation. CDA was used to analyze
    750 images of drives acquired on the secondary
    market; it automatically identified drives
    containing a high concentration of confidential
    financial records as well as clusters of drives
    that came from the same organization. FFE and CDA
    are promising techniques for prioritizing work
    and automatically identifying members of social
    networks under investigation. The authors believe
    they are likely to have other uses as well.

4
Outline
  • Introduction
  • Forensics Feature Extraction
  • Single Drive Analysis
  • Cross-drive analysis
  • Implementation
  • Directions

5
Introduction: Why?
  • Improper prioritization. In these days of cheap
    storage and fast computers, the critical resource
    to be optimized is the attention of the examiner
    or analyst. Today work is not prioritized based
    on the information that the drive contains.
  • Lost opportunities for data correlation. Because
    each drive is examined independently, there is no
    opportunity to automatically "connect the dots"
    in a large case involving multiple storage
    devices.
  • Improper emphasis on document recovery. Because
    today's forensic tools are based on document
    recovery, they have taught examiners, analysts,
    and customers to be primarily concerned with
    obtaining documents.

6
Feature Extraction
  • An email address extractor, which can recognize
    RFC822-style email addresses.
  • An email Message-ID extractor.
  • An email Subject extractor.
  • A Date extractor, which can extract date and
    time stamps in a variety of formats.
  • A cookie extractor, which can identify cookies
    from the Set-Cookie header in web page cache
    files.
  • A US social security number extractor, which
    identifies the patterns ###-##-#### and
    #########, when preceded by the letters SSN and
    an optional colon.
  • A Credit card number extractor.
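A minimal sketch of how such extractors might be built from regular expressions run over bulk data (not the authors' implementation; the patterns below are illustrative assumptions):

    import re

    # Illustrative patterns; the paper's extractors are more elaborate.
    EMAIL_RE = re.compile(rb'[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}')
    # ###-##-#### preceded by the letters SSN and an optional colon.
    SSN_RE = re.compile(rb'SSN:?\s*(\d{3}-\d{2}-\d{4})')

    def extract_features(bulk_data: bytes):
        """Scan raw bulk data and yield (feature_type, value) pairs."""
        for m in EMAIL_RE.finditer(bulk_data):
            yield ('email', m.group(0).decode('ascii', 'replace'))
        for m in SSN_RE.finditer(bulk_data):
            yield ('ssn', m.group(1).decode('ascii'))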

7
Single Drive analysis
  • Extracted features can be used to speed initial
    analysis and answer specific questions about a
    drive image.
  • The authors have successfully used extracted
    features for drive image attribution and to build
    a tool that scans disks to report the likely
    existence of information that should have been
    destroyed under the Fair and Accurate Credit
    Transactions Act.
  • Drive attribution: an analyst might encounter a
    hard drive and wish to determine to whom that
    drive previously belonged. For example, the drive
    might have been purchased on eBay and the analyst
    might be attempting to return it to its previous
    owner.
  • A powerful technique for making this
    determination is to create a histogram of the
    email addresses on the drive (as returned by the
    email address feature extractor).
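A sketch of the histogram technique, assuming the extracted email addresses are already available as a list:

    from collections import Counter

    def owner_candidates(emails, top=5):
        """Histogram email addresses found on a drive; the most frequent
        addresses likely belong to the drive's primary user."""
        return Counter(e.lower() for e in emails).most_common(top)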

8
Cross-drive analysis (CDA)
  • Cross-drive analysis is the term the authors
    coined to describe forensic analysis of a data
    set that spans multiple drives.
  • The fundamental theory of cross-drive analysis is
    that data gleaned from multiple drives can
    improve the forensic analysis of a drive in
    question, both when the multiple drives are
    related to the drive in question and when they
    are not.
  • There are two forms of CDA: first-order, in which
    the results of a feature extractor are compared
    across multiple drives, an O(n) operation; and
    second-order, in which the results are
    correlated, an O(n²) operation.
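A sketch of the two forms, assuming each drive's extracted features are held in a set keyed by drive ID (all names here are illustrative):

    from itertools import combinations

    def first_order(drives, weight):
        """O(n): score each drive independently, e.g. by a weighted
        count of the features its extractors produced."""
        return {d: sum(weight(f) for f in feats)
                for d, feats in drives.items()}

    def second_order(drives):
        """O(n^2): score every pair of drives by the number of
        extracted features they have in common."""
        return {(a, b): len(drives[a] & drives[b])
                for a, b in combinations(drives, 2)}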

9
Implementation
  • 1. Disks collected are imaged into a single
    AFF file. (AFF is the Advanced Forensic Format, a
    file format for disk images that contains all of
    the data accession information, such as the
    drive's manufacturer and serial number, as well
    as the disk contents.)
  • 2. The afxml program is used to extract drive
    metadata from the AFF file and build an entry in
    the SQL database.
  • 3. Strings are extracted with an AFF-aware
    program in three passes, one for 8-bit
    characters, one for 16-bit characters in lsb
    format, and one for 16-bit characters in msb
    format.
  • 4. Feature extractors run over the string files
    and write their results to feature files.
  • 5. Extracted features from newly-ingested drives
    are run against a watch list; hits are reported
    to the human operator.
  • 6. The feature files are read by indexers, which
    build indexes in the SQL server of the identified
    features.
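A sketch of the watch-list check in step 5, assuming feature files are read into (type, value) tuples and the watch list is a set of the same tuples:

    def check_watch_list(feature_records, watch_list):
        """Report newly extracted features that appear on the watch list."""
        hits = [rec for rec in feature_records if rec in watch_list]
        for ftype, value in hits:
            print(f'WATCH LIST HIT: {ftype} = {value}')  # to the operator
        return hits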

10
Implementation
  • 7. A multi-drive correlation is run to see if the
    newly accessioned drive contained features in
    common with any drives that are on a drive watch
    list.
  • 8. A user interface allows multiple analysts to
    simultaneously interact with the database, to
    schedule new correlations to be run in a batch
    mode, or to view individual sectors or recovered
    files from the drive images that are stored on
    the file server.

11
Directions
  • Improve feature extraction
  • Improve the algorithms
  • Develop end-to-end systems

12
Abstract of Paper 2
  • Establishing the time at which a particular event
    happened is a fundamental concern when relating
    cause and effect in any forensic investigation.
    Reliance on computer generated timestamps for
    correlating events is complicated by uncertainty
    as to clock skew and drift, environmental factors
    such as location and local time zone offsets, as
    well as human factors such as clock tampering.
    Establishing that a particular computer's
    temporal behavior was consistent during its
    operation remains a challenge. The contributions
    of this paper are both a description of
    assumptions commonly made regarding the behavior
    of clocks in computers, and empirical results
    demonstrating that real-world behavior diverges
    from the idealized or assumed behavior. The
    authors present an approach for inferring the
    temporal
    behavior of a particular computer over a range of
    time by correlating commonly available local
    machine timestamps with another source of
    timestamps. We show that a general
    characterization of the passage of time may be
    inferred from an analysis of commonly available
    browser records.

13
Outline
  • Introduction
  • Factors to consider
  • Drifting clocks
  • Identifying computer timescales by correlation
    with corroborating sources
  • Directions

14
Introduction
  • Timestamps are increasingly used to relate events
    which happen in the digital realm to each other
    and to events which happen in the physical realm,
    helping to establish cause and effect.
  • A difficulty with timestamps is how to interpret
    and relate the timestamps generated by separate
    computer clocks when they are not synchronized.
  • Current approaches to inferring the real-world
    interpretation of timestamps assume idealized
    models of computer clocks.
  • There is uncertainty about the behavior of a
    suspect's computer clock before seizure.
  • The authors explore two themes related to this
    uncertainty.
  • They investigate whether it is reasonable to
    assume uniform behavior of computer clocks over
    time, and test these assumptions by attempting to
    characterize how computer clocks behave in the
    wild.
  • They investigate the feasibility of automatically
    identifying the local time on a computer by
    correlating timestamps embedded in digital
    evidence with corroborative time sources.

15
Factors
  • Computer timekeeping
  • Real-time synchronization
  • Factors affecting timekeeping accuracy
  • Clock configuration
  • Tampering
  • Synchronization protocol
  • Misinterpretation
  • Usage of timestamps in forensics

16
Drifting clocks behavior
  • Enumerate the main factors influencing the
    temporal behavior of the clock of a computer, and
    then attempt to experimentally validate whether
    one can make informed assumptions about such
    behavior.
  • Authors do this by empirically studying the
    temporal behavior of a network of computers found
    in the wild.
  • The subject of the case study is a network of
    machines in active use by a small business. The
    network consists of a Windows 2000 domain with
    one Windows 2000 server and a mix of Windows XP
    and 2000 workstations.
  • The goal here is to observe the temporal
    behavior. In order to observe this behavior, the
    authors constructed a simple service that logs
    both the system time of a host computer and the
    civil time for the location.
  • The program samples both sources of time and logs
    the results to a file. The logging program was
    deployed on all workstations and the server.
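A sketch of such a logging service; the reference-time source here is an assumption (e.g. an NTP query), not the authors' published code:

    import time

    def log_time_sample(logfile, reference_time_fn):
        """Append one (system time, reference time) pair; clock skew
        and drift show up as a changing difference across samples."""
        system_now = time.time()          # the host's own clock
        civil_now = reference_time_fn()   # trusted source, e.g. NTP
        with open(logfile, 'a') as f:
            f.write(f'{system_now}\t{civil_now}\t'
                    f'{system_now - civil_now}\n')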

17
Correlation
  • An automated approach correlates time-stamped
    events found on a suspect computer with
    time-stamped events from a more reliable,
    corroborating source.
  • Web browser records are increasingly employed as
    evidence in investigations, and are a rich source
    of time-stamped data.
  • The techniques implemented are a click-stream
    correlation algorithm and a non-cached
    correlation algorithm.
  • The authors compare the results of both
    algorithms.
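The paper's two algorithms are not reproduced here, but the underlying idea can be sketched as estimating a clock offset from events observed in both the suspect's records and the corroborating source:

    from statistics import median

    def estimate_clock_offset(local_times, reference_times):
        """Given timestamps for the same events as recorded by the
        suspect machine and by a corroborating source (e.g. web server
        logs), estimate the suspect clock's offset as the median
        pairwise difference."""
        return median(lt - rt
                      for lt, rt in zip(local_times, reference_times))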

18
Directions
  • Need to determine whether the conditions and the
    assumptions of the experiments are realistic
  • What are the most appropriate correlation
    algorithms?
  • Need to integrate with clock synchronization
    algorithms

19
Abstract of Paper 3 (OPTIONAL)
  • Hashing is a fundamental tool in digital forensic
    analysis used both to ensure data integrity and
    to efficiently identify known data objects.
    The authors' objective is to leverage advanced
    hashing techniques in order to improve the
    efficiency and scalability of digital forensic
    analysis. They explore the use of Bloom filters
    as a means to efficiently aggregate and search
    hashing information. They present md5bloom, a
    Bloom filter manipulation tool that can be
    incorporated into forensic practice, along with
    example uses and experimental results.

20
Outline
  • Introduction
  • Bloom filter
  • Applications
  • Directions

21
Introduction
  • The goal is to pick from a set of forensic images
    the one(s) that are most like (or perhaps most
    unlike) a particular target.
  • This problem comes up in a number of different
    variations, such as comparing the target with
    previous/related cases, or determining the
    relationships among targets in a larger
    investigation.
  • The goal is to get a high-level picture that will
    guide the following in-depth inquiry.
  • Already existing problems of scale in digital
    forensic tools are further multiplied by the
    number of targets, which explains the fact that
    in other forensic areas comparison with other
    cases is routine and massive, whereas in digital
    forensics it is the exception.
  • An example is object version detection: the need
    to detect a particular version of an object, not
    just the target object itself.

22
Introduction
  • What needs to be addressed is a way to store a
    set of hashes representing the different
    components of a composite object, as opposed to a
    single hash.
  • For example, hashing the individual routines of
    libraries or executables would enable
    fine-grained detection of changes (e.g. only a
    fraction of the code changes from version to
    version).
  • The problem is that storing more hashes presents
    a scalability problem even for targets of modest
    sizes.
  • Therefore, authors propose the use of Bloom
    filters as an efficient way to store and query
    large sets of hashes.

23
Bloom Filters
  • A Bloom filter B is a representation of a set
    S = {s1, ..., sn} of n elements from a universe
    (of possible values) U. The filter consists of an
    array of m bits, initially all set to 0.
  • The ratio r = m/n is a key design element and is
    usually fixed for a particular application.
  • To represent the set elements, the filter uses k
    independent hash functions h1, ..., hk, each with
    range {0, ..., m - 1}. All hash functions are
    assumed to be independent and to map elements
    from U uniformly over the range of the function.
  • md5bloom: the authors have a prototype
    stream-oriented Bloom filter implementation
    called md5bloom.
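A minimal Bloom filter sketch following the definitions above (m bits, k hash functions); this is an illustration, not the md5bloom tool itself:

    import hashlib

    class BloomFilter:
        def __init__(self, m: int, k: int):
            self.m, self.k = m, k
            self.bits = bytearray((m + 7) // 8)  # m bits, all 0

        def _positions(self, item: bytes):
            # Derive k positions in [0, m-1] from MD5 with per-function
            # salts, approximating k independent hash functions.
            for i in range(self.k):
                digest = hashlib.md5(bytes([i]) + item).digest()
                yield int.from_bytes(digest, 'big') % self.m

        def add(self, item: bytes):
            for p in self._positions(item):
                self.bits[p // 8] |= 1 << (p % 8)

        def __contains__(self, item: bytes):
            # May report false positives, never false negatives.
            return all(self.bits[p // 8] & (1 << (p % 8))
                       for p in self._positions(item))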

24
Application of Bloom Filter in Security
  • Spafford (1992) was one of the first to use
    Bloom filters to support computer security.
  • The OPUS system by Spafford uses a Bloom filter
    which efficiently encodes a wordlist containing
    poor password choices to help users choose strong
    passwords.
  • Bellovin and Cheswick present a scheme for
    selectively sharing data while maintaining
    privacy. Through the use of encrypted Bloom
    filters, they allow parties to perform searches
    against each other's document sets without
    revealing the specific details of the queries.
    The system supports query restrictions to limit
    the set of allowed queries.
  • Aguilera et al. discuss the use of Bloom filters
    to enhance security in a network-attached disks
    (NADs) infrastructure.
  • The authors use Bloom filtering to detect hash
    tampering.

25
Directions
  • Cryptography is a key application for detecting
    evidence tampering
  • Bloom filters are one approach for detecting
    hash tampering
  • Need to compare different cryptographic
    algorithms
  • Relations to correlation need to be determined

26
Abstract of Paper 4
  • Homologous files share identical sets of bits in
    the same order. Because such files are not
    completely identical, traditional techniques such
    as cryptographic hashing cannot be used to
    identify them. This paper introduces a new
    technique for constructing hash signatures by
    combining a number of traditional hashes whose
    boundaries are determined by the context of the
    input. These signatures can be used to identify
    modified versions of known files even if data has
    been inserted, modified, or deleted in the new
    files. The description of this method is followed
    by a brief analysis of its performance and some
    sample applications to computer forensics.

27
Outline
  • Introduction
  • Piece-wise hashing
  • Spamsum algorithms
  • Directions

28
Introduction
  • This paper describes a method for using a context
    triggered rolling hash in combination with a
    traditional hashing algorithm to identify known
    files that have had data inserted, modified, or
    deleted.
  • First, they examine how cryptographic hashes are
    currently used by forensic examiners to identify
    known files and what weaknesses exist with such
    hashes.
  • Next, the concept of piecewise hashing is
    introduced.
  • Finally, a rolling hash algorithm that produces a
    pseudo-random output based only on the current
    context of an input is described.
  • By using the rolling hash to set the boundaries
    for the traditional piecewise hashes, authors
    create a Context Triggered Piecewise Hash (CTPH).

29
Piecewise hashing
  • Piecewise hashing uses an arbitrary hashing
    algorithm to create many checksums for a file
    instead of just one. Rather than generating a
    single hash for the entire file, a hash is
    generated for many discrete fixed-size segments
    of the file. For example, one hash is generated
    for the first 512 bytes of input, another hash
    for the next 512 bytes, and so on.
  • A rolling hash algorithm produces a pseudo-random
    value based only on the current context of the
    input. The rolling hash works by maintaining a
    state based solely on the last few bytes from the
    input. Each byte is added to the state as it is
    processed and removed from the state after a set
    number of other bytes have been processed.
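A sketch of both ideas under stated assumptions (512-byte segments, MD5 as the arbitrary hash, and a plain windowed sum standing in for the rolling hash):

    import hashlib

    def piecewise_hashes(data: bytes, segment: int = 512):
        """One traditional hash per fixed-size segment instead of a
        single hash for the whole file."""
        return [hashlib.md5(data[i:i + segment]).hexdigest()
                for i in range(0, len(data), segment)]

    def rolling_hash_values(data: bytes, window: int = 7):
        """Yield, at each byte, a value that depends only on the last
        `window` bytes of input (a windowed sum, for illustration)."""
        h = 0
        for i, b in enumerate(data):
            h += b
            if i >= window:
                h -= data[i - window]  # drop the byte leaving the window
            yield h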

30
Spamsum
  • Spamsum, an email spam detection tool written by
    Dr. Andrew Tridgell, can identify emails that are
    similar but not identical to samples of known
    spam. The spamsum algorithm was in turn based
    upon the rsync checksum, also by Dr. Tridgell.
  • The spamsum algorithm uses FNV hashes for the
    traditional hashes, which produce a 32-bit output
    for any input. In spamsum, Dr. Tridgell further
    reduced the FNV hash by recording only a base64
    encoding of the six least significant bits (LS6B)
    of each hash value.
  • The algorithm for the rolling hash was inspired
    by the Adler32 checksum.
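A sketch of these pieces: content-triggered segment boundaries driven by the rolling hash from the previous sketch, and an FNV-1 hash reduced to its LS6B as one base64 character (the trigger condition is an illustrative assumption, not spamsum's exact scheme):

    def ctph_segments(data: bytes, window: int = 7, trigger: int = 64):
        """Cut the input wherever the rolling hash hits a trigger
        value, so boundaries depend on content, not fixed offsets."""
        start, segments = 0, []
        for i, h in enumerate(rolling_hash_values(data, window)):
            if h % trigger == trigger - 1:
                segments.append(data[start:i + 1])
                start = i + 1
        segments.append(data[start:])
        return segments

    def fnv1_32(data: bytes) -> int:
        """32-bit FNV-1: multiply by the FNV prime, then XOR each byte."""
        h = 0x811C9DC5                           # FNV-1 32-bit offset basis
        for b in data:
            h = (h * 0x01000193) & 0xFFFFFFFF    # FNV 32-bit prime
            h ^= b
        return h

    B64 = 'ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/'

    def signature(data: bytes) -> str:
        """One LS6B base64 character per content-defined segment."""
        return ''.join(B64[fnv1_32(seg) & 0x3F]
                       for seg in ctph_segments(data))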

31
Directions
  • Many applications in altered document matching
    and partial file matching
  • Improvement to hash algorithms
  • Performance studies