Bulk Extractor 1.4.1.   Feature Freeze: 1 DEC.  Beta: 1 JAN. Release: 1 FEB
===========================================================================

New in v1.4.1:

+ Implemented libbulk_extractor as a loadable or linkable library with
  a callback API for feature reporting.

+ Currently the API is single-threaded. In the future, it will be able
  to spawn multiple threads and then kill them at the conclusion. (The
  callback will be serialized so that the caller does not need to
  worry about concurrency.)

- Implement random sampling in the API

+ Implemented modbulk_extractor.py, a Python module that can run
  bulk_extractor on a buffer, a file, or a drive accessed directly.


    typedef int bulk_extractor_handle;
    bulk_extractor_handle han;

- kill the threads when phase1 is over. (Save memory)

- We implemented a thread-safe map (atomic_map) that
  will gradually replace the maps with explicit locks in the feature
  recording system.

- Replace all instances of <map> with <unordered_map> where key ordering is not needed, as hashed lookups are generally faster.

- evaluate how much RAM is used by histogram generation; can we move to one-pass histogram for other feature recorders?
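
The serialized callback described in the threading note above could look like the following minimal sketch: worker threads funnel every report through one lock, so the caller's callback never runs concurrently, and all threads are joined at the end. The names `SerializedCallback` and `run_workers` are hypothetical and are not part of libbulk_extractor; the sketch is in Python only to match the module example later in these notes.

```python
import threading

class SerializedCallback:
    """Wrap a user callback so that only one thread invokes it at a time."""
    def __init__(self, callback):
        self._callback = callback
        self._lock = threading.Lock()

    def __call__(self, *args):
        # The lock serializes calls; the user callback needs no locking of its own.
        with self._lock:
            return self._callback(*args)

def run_workers(callback, chunks):
    """Hypothetical driver: one thread per chunk, all joined at the conclusion."""
    serialized = SerializedCallback(callback)
    threads = [threading.Thread(target=serialized, args=(0x0001, 0, "demo", chunk))
               for chunk in chunks]
    for t in threads:
        t.start()
    for t in threads:
        t.join()

# The caller's callback appends to a plain list without any explicit locking.
seen = []
run_workers(lambda flag, arg, name, feature: seen.append(feature), [b"a", b"b", b"c"])
```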


Timeline for Insider Threat Project:

Jan 1 - Decision Date

Fallback option:

  If we do not have bulk_extractor working with GRR, we will implement a
  year 1 work-around that will:

  - Run the bulk_extractor API in sampling mode from a C++ program and send the histogram to a website as a JSON POST.
  - Host website on Google App Engine with the back-end in Python.
  - Implementation Jan-March
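
A rough sketch of the JSON post in the fallback option is shown below. The notes call for a C++ client; Python is used here only to keep the examples in one language. The URL, the `machine` and `histogram` field names, and the function `build_histogram_post` are all assumptions, since the notes do not define a payload format. The request is built but not sent.

```python
import json
import urllib.request

def build_histogram_post(url, machine_id, histogram):
    """Build (but do not send) a JSON POST carrying one sampled histogram."""
    # Field names are hypothetical; the notes do not specify a schema.
    body = json.dumps({"machine": machine_id, "histogram": histogram},
                      sort_keys=True).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

# Placeholder URL; urllib.request.urlopen(req) would transmit the request.
req = build_histogram_post("https://example.invalid/histogram", "1",
                           {"user@company.com": 3})
```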

GRR Implementation:

  If we can run bulk_extractor from Python and GRR:

  - Create a GRR installer with bulk_extractor bundled.

  - Have the lightweight object run every hour, with a 10% probability of initiating a hunt

  - Hunt does random sampling on 1% of the drive, reports results
    to the GRR Server, and stores them in an AFF hierarchy (which is a MongoDB database).
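
The hourly 10% trigger and the 1% sampling pass described above might be sketched as follows. The helper names, the block-aligned sampling strategy, and the 4096-byte read size are assumptions for illustration; they are not taken from bulk_extractor or GRR.

```python
import random

def should_start_hunt(rng, probability=0.10):
    """Decide, once per hourly run, whether to initiate a hunt."""
    return rng.random() < probability

def sample_offsets(drive_size, read_size, fraction, rng):
    """Pick distinct read_size-aligned offsets covering about `fraction` of the drive."""
    total_blocks = drive_size // read_size
    if total_blocks == 0:
        return []
    wanted = max(1, round(total_blocks * fraction))
    # Sampling without replacement, so no block is read twice.
    blocks = rng.sample(range(total_blocks), wanted)
    return sorted(block * read_size for block in blocks)

rng = random.Random()
if should_start_hunt(rng):
    # Hypothetical 1 GiB drive read in 4096-byte blocks, sampling 1% of it.
    offsets = sample_offsets(1 << 30, 4096, 0.01, rng)
```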

Interfacing to UTSA:
   UTSA needs to be able to get a set of VECTORS (histograms?) by specifying:
    - a date range   (e.g. "2013-01-01 through 2013-05-01")
    - a machine list (e.g. "1,2,3" or "all")
   The query will:
    - Produce a list of vectors as a CSV file that UTSA can use for data mining.
    - Run against either Google App Engine or the GRR interface.
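
One possible shape for the UTSA export is sketched below: filter stored records by date range and machine list, then write one vector per CSV row. The record layout `(day, machine, vector)`, the column names, and the function `vectors_to_csv` are assumptions; the notes leave the vector format open.

```python
import csv
import io
from datetime import date

def vectors_to_csv(records, start, end, machines="all"):
    """Filter (day, machine, vector) records and render them as CSV text."""
    rows = [(d, m, v) for d, m, v in records
            if start <= d <= end and (machines == "all" or m in machines)]
    width = max((len(v) for _, _, v in rows), default=0)
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    # Hypothetical header: one column per vector component.
    writer.writerow(["date", "machine"] + ["v%d" % i for i in range(width)])
    for d, m, v in rows:
        writer.writerow([d.isoformat(), m, *v])
    return out.getvalue()

records = [
    (date(2013, 1, 15), 1, [3, 0]),
    (date(2013, 6, 1), 1, [9, 9]),   # outside the example date range
    (date(2013, 2, 1), 2, [1, 4]),
]
csv_text = vectors_to_csv(records, date(2013, 1, 1), date(2013, 5, 1), machines={1, 2, 3})
```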


==Bulk Extractor API==

behan =  bulk_extractor_open(be_callback cb);
- Returns a valid pointer on success, 0 on failure.
int bulk_extractor_analyze_dev(behan,char *dev,uint64_t start,uint64_t end,uint64_t readsize,float percent);
int bulk_extractor_analyze_dir(behan,char *dir);
int bulk_extractor_analyze_buf(behan,uint8_t *buf,size_t buflen);
int bulk_extractor_close(behan);

  The only histograms you get are the one-pass histograms.


  pure C callback:

    int be_callback(int32_t flag,
                    uint32_t arg,
                    const char *feature_recorder_name,
                    const char *feature,size_t feature_len,
                    const char *context,size_t context_len);

    flag 0x0001 - feature
         0x0002 - histogram (count in arg)
         0x0004 - carved object:
               feature is not null-terminated (use feature_len)
               context is the filename

    return 0 to continue, return -1 to abort further bulk_extractor processing.
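
The flag bits and the return convention above can be illustrated with a small sketch. It is written in Python to match the module example; the constant names, `make_callback`, and the idea of aborting after a fixed number of reports are assumptions for illustration, not part of the API.

```python
# Hypothetical names for the flag values listed above.
FLAG_FEATURE = 0x0001
FLAG_HISTOGRAM = 0x0002
FLAG_CARVED = 0x0004

def make_callback(limit):
    """Return a callback that records reports and aborts once `limit` is reached."""
    reports = []

    def be_callback(flag, arg, feature_recorder_name, feature, context):
        if flag & FLAG_FEATURE:
            reports.append(("feature", feature_recorder_name, feature))
        if flag & FLAG_HISTOGRAM:
            # For histogram entries the count arrives in arg.
            reports.append(("histogram", feature_recorder_name, feature, arg))
        if flag & FLAG_CARVED:
            # For carved objects the context carries the filename.
            reports.append(("carved", context, len(feature)))
        # 0 continues processing; -1 asks bulk_extractor to stop.
        return -1 if len(reports) >= limit else 0

    return be_callback, reports
```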

Python Module Example:
  
import bulk_extractor

def be_callback(flag,count=None,feature_recorder_name=None,feature=None,context=None,filename=None):
    if flag & 0x0001:
        print("{}: Found feature '{}'".format(feature_recorder_name,feature))
    if flag & 0x0002:
        print("{}: {}  n={}".format(feature_recorder_name,feature,count))
    if flag & 0x0004:
        print("CARVE filename {} (len={})".format(context,len(feature)))

be = bulk_extractor.bulk_extractor()
print("Running bulk_extractor on a single email address")
be.analyze_buf(be_callback,b"user@company.com")
print("Running bulk_extractor on current directory")
be.analyze_dir(be_callback,".")


