
bulk-extractor

bulk_extractor scans disk images, files, or directories to extract useful information without parsing file system structures. It generates feature files and histograms for easy inspection and analysis.

Description

bulk_extractor is a high-performance C++ program designed for digital forensics that extracts information such as emails, URLs, credit card numbers, and other features directly from disk images or files. It operates without needing to parse the file system, making it efficient for large datasets. Results are stored in feature files that can be inspected manually or processed with automated tools, with histograms highlighting the most common and potentially important features.
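As a rough illustration of processing a feature file with automated tools, the sketch below counts how often each feature occurs. It assumes the feature file is tab-separated (offset, feature, context) with comment lines starting with `#`. The function name `feature_histogram` is hypothetical, and the `n=<count>` output style is only an approximation of a histogram listing, not the tool's own code.

```shell
# Hypothetical helper (not part of bulk_extractor): count how often each
# feature occurs in a feature file and print the most common ones first.
# Assumed line layout: offset <TAB> feature <TAB> context; '#' lines are comments.
feature_histogram() {
  awk -F'\t' '
    /^#/ { next }                 # skip comment/banner lines
    NF >= 2 { count[$2]++ }       # column 2 holds the feature itself
    END { for (f in count) printf "n=%d\t%s\n", count[f], f }
  ' "$1" | LC_ALL=C sort -t= -k2,2nr   # highest count first
}

# Example (assumes an existing output directory named bulk-out):
#   feature_histogram bulk-out/email.txt | head
```

Counting in awk rather than with `sort | uniq -c` keeps features that contain spaces intact.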

Use cases include forensic investigations where rapid extraction of contact information, network artifacts, and sensitive data is needed from unallocated space or fragmented files. It supports multi-threaded processing and provides progress updates during analysis, making it suitable for large disk images.

The tool creates detailed reports on features like CCNs, domains, emails, IP addresses, telephone numbers, and URLs, enabling analysts to quickly identify key evidence without traditional file carving limitations.

How It Works

bulk_extractor processes the input disk image or file in three phases. Phase 1 scans the raw data with every enabled scanner (e.g., ccn, email, url); each scanner applies regular expressions and pattern matching to detect features regardless of file system boundaries. The work is split across threads, and progress is reported by offset together with an estimated completion time. Phase 2 shuts down the scanners, and Phase 3 generates histograms for all detected features. Each feature type is written to its own file in the output directory, with a context window around every match; the hash algorithm and scanner options are configurable. Performance is CPU-bound, so the tool benefits from multi-core systems.
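The idea behind Phase 1, matching patterns in raw bytes and recording where they occur, can be approximated with standard tools. The sketch below is not how bulk_extractor is implemented. It assumes GNU grep, and both the function name `scan_emails` and the simplified email pattern are hypothetical choices made for illustration.

```shell
# Hypothetical illustration of raw-byte pattern scanning (assumes GNU grep).
# Prints "byte offset <TAB> match" for every email-like string found in the
# input, treating the input as plain bytes instead of parsing a file system.
scan_emails() {
  LC_ALL=C grep -a -o -b -E '[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}' "$1" |
    awk -F: '{ print $1 "\t" $2 }'   # grep prints offset:match; make it tab-separated
}

# Example (image_name is a placeholder for a disk image or any file):
#   scan_emails image_name | head
```

This single-threaded sketch handles one pattern only and reports no context window, so it shows the concept rather than replacing the tool.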

Installation

sudo apt install bulk-extractor

Flags

-o                     Specifies the output directory for extracted features
-A, --offset_add       Offset added (in bytes) to feature locations (default: 0)
-b, --banner_file      Path of file whose contents are prepended to the top of all feature files
-C, --context_window   Size of the context window reported, in bytes (default: 16)
-d, --debug            Enable debugging (default: 1)
-D, --debug_help       Help on debugging
-x                     Disable specific scanners (e.g., -x accts)
-e                     Enable specific scanners (e.g., -e base16)
-S                     Set scanner-specific options (e.g., -S ssn_mode=0, -S word_min=6)

Examples

Analyzes the image file (xp-laptop-2005-07-04-1430.img) and writes the extracted features to the output directory (bulk-out)
bulk_extractor -o bulk-out xp-laptop-2005-07-04-1430.img
Displays help information including usage, flags, and scanner options
bulk_extractor -h
Runs analysis on image while disabling the accts scanner
bulk_extractor -x accts image_name
Runs analysis on image while enabling the base16 scanner
bulk_extractor -e base16 image_name
Runs analysis with SSN scanner set to no 'SSN' required mode
bulk_extractor -S ssn_mode=1 image_name
Runs analysis with the wordlist scanner configured for words between 6 and 16 characters long
bulk_extractor -S word_min=6 -S word_max=16 image_name
Runs analysis using SHA1 as the hash algorithm for calculations
bulk_extractor -S hash_alg=sha1 image_name
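After a run such as the first example above, a quick way to see which feature files actually contain results is to count their non-comment lines. The helper below is a minimal sketch: the name `summarize_features` is hypothetical, and it assumes the feature files are `.txt` files in the output directory whose header lines begin with `#`.

```shell
# Hypothetical helper (not part of bulk_extractor): list each .txt file in an
# output directory that holds at least one feature line, with its line count.
summarize_features() {
  for f in "$1"/*.txt; do
    [ -f "$f" ] || continue                  # skip if the glob matched nothing
    n=$(grep -c '^[^#]' "$f" || true)        # count non-empty, non-comment lines
    if [ "$n" -gt 0 ]; then
      printf '%s\t%s\n' "$(basename "$f")" "$n"
    fi
  done
}

# Example (assumes the bulk-out directory from the first example exists):
#   summarize_features bulk-out
```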
Updated 2026-04-16 (kali.org)