Document pipeline

The pipeline clones samples from a GitLab repo, sorts the files into PDFs and other documents, and then runs the appropriate tools on the sample files. * Watch the VIDEO

Tools run in the pipeline

ClamAV/PDFiD/PeePDF/JSunpack-n/shellcode/strings/oledump/olevba

Pipeline workflow

concourse-view

The user uploads samples to the Git branch "sample-source". Concourse polls the Git repository, and the pipeline is triggered by changes in it.

Samples are sorted into two analysis lines: PDFs and other documents. All files are scanned for viruses with ClamAV.
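
The sorting step could be approximated as below. This is only a sketch: the source does not say how the pipeline identifies file types, so detection by the `%PDF-` magic bytes, the function names `is_pdf` and `sort_samples`, and the `pdf`/`other` folder names are all assumptions made for illustration.

```python
from pathlib import Path
import shutil

# PDF files carry this marker near the start of the file.
PDF_MAGIC = b"%PDF-"


def is_pdf(path: Path) -> bool:
    """Return True if the PDF marker appears in the first 1024 bytes."""
    with path.open("rb") as fh:
        return PDF_MAGIC in fh.read(1024)


def sort_samples(source: Path, target: Path) -> dict[str, list[str]]:
    """Copy each regular file in `source` to target/pdf or target/other.

    Folder names are hypothetical; the real pipeline may use others.
    """
    sorted_names: dict[str, list[str]] = {"pdf": [], "other": []}
    for sample in sorted(source.iterdir()):
        if not sample.is_file():
            continue
        kind = "pdf" if is_pdf(sample) else "other"
        destination = target / kind
        destination.mkdir(parents=True, exist_ok=True)
        shutil.copy2(sample, destination / sample.name)
        sorted_names[kind].append(sample.name)
    return sorted_names
```

Checking content rather than the file extension matters here, because malicious samples are often renamed.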

The Document pipeline at GitLab

document-pipeline

PDF analysis

After the virus scan, PDFs are analysed with PDFiD and Peepdf. PDFiD searches for certain PDF keywords and reports whether the file contains something suspicious, such as JavaScript, or executes an action when opened.
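
To illustrate the idea of keyword-based triage, here is a minimal sketch that counts a few PDF names commonly associated with active content. It is not PDFiD itself: the keyword list is a small subset chosen for illustration, the function name `count_keywords` is hypothetical, and unlike the real tool it does not handle obfuscated names (for example hex-escaped characters).

```python
import re

# A small, illustrative subset of names that indicate scripts or automatic actions.
SUSPICIOUS_KEYWORDS = (b"/JavaScript", b"/JS", b"/OpenAction", b"/AA", b"/Launch")


def count_keywords(data: bytes) -> dict[str, int]:
    """Count occurrences of each keyword in the raw bytes of a PDF."""
    counts = {}
    for keyword in SUSPICIOUS_KEYWORDS:
        # Reject matches that continue with a letter or digit,
        # so that "/JS" does not also match a longer name such as "/JSFoo".
        pattern = re.escape(keyword) + rb"(?![A-Za-z0-9])"
        counts[keyword.decode()] = len(re.findall(pattern, data))
    return counts
```

A non-zero count for `/OpenAction` or `/AA` together with `/JS` or `/JavaScript` suggests that a script may run when the document is opened, which is the kind of sample worth passing on to the later jobs.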

Peepdf runs basic analysis as well, but it also checks if a sample's hash is found on VirusTotal and recognized as malware.
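
A hash lookup starts with computing the sample's digest. The sketch below shows only that local step and makes no network request. The function name `hash_sample` is hypothetical, and which digest Peepdf actually submits is not stated in this document, so both MD5 and SHA-256 are computed here as an assumption.

```python
import hashlib


def hash_sample(path: str, chunk_size: int = 65536) -> dict[str, str]:
    """Return MD5 and SHA-256 hex digests of a file, read in chunks."""
    md5 = hashlib.md5()
    sha256 = hashlib.sha256()
    with open(path, "rb") as fh:
        # Reading in chunks keeps memory use constant for large samples.
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            md5.update(chunk)
            sha256.update(chunk)
    return {"md5": md5.hexdigest(), "sha256": sha256.hexdigest()}
```

Because only the hash is compared, a sample that has been modified by even one byte will not match a known entry, so a miss does not mean the file is clean.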

Jsunpack-n unpacks JavaScript from the samples and extracts shellcode, if found, to a "shellcode" folder, and converts it to binary format.

Finally, any shellcode found is analysed with Peepdf's sctest. Sctest tries to analyse the shellcode binary and to show what its purpose is.

Document analysis

Other documents are run through "strings" after the virus scan.
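
For readers unfamiliar with "strings", its core behaviour can be sketched in a few lines: find runs of printable ASCII characters of at least a minimum length. This is a simplified approximation and not the tool the pipeline runs; the function name `extract_strings` is hypothetical, and the default minimum length of 4 mirrors the usual default of the command-line utility.

```python
import re


def extract_strings(data: bytes, min_length: int = 4) -> list[str]:
    """Return runs of printable ASCII (0x20-0x7e) of at least `min_length` bytes."""
    pattern = rb"[\x20-\x7e]{%d,}" % min_length
    return [match.decode("ascii") for match in re.findall(pattern, data)]
```

Usage example:

```python
blob = b"\x00\x01Hello World\x00ab\x00cmd.exe\xff"
print(extract_strings(blob))
```

Short runs such as `ab` are dropped, which filters out most of the noise in binary data while keeping URLs, file names and commands.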

After strings, Oledump dumps data streams found in the samples.

The last job, Olevba, parses the OLE/OpenXML files to detect macros and extract their source code. Olevba also detects patterns such as auto-executable macros, VBA keywords, anti-sandboxing and anti-virtualization techniques, and IOCs. It can also decode obfuscation methods like Hex encoding, StrReverse, Base64 and Dridex.
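
Three of the obfuscation methods named above are simple enough to show directly. The following is a sketch of what decoding each one means, not Olevba's implementation; the function names are hypothetical, and the Dridex scheme is left out.

```python
import base64


def decode_hex(text: str) -> str:
    """Decode a hex string such as '636d64' into its characters."""
    return bytes.fromhex(text).decode("latin-1")


def decode_base64(text: str) -> str:
    """Decode a Base64 string, rejecting characters outside the alphabet."""
    return base64.b64decode(text, validate=True).decode("latin-1")


def decode_strreverse(text: str) -> str:
    """Undo VBA's StrReverse by reversing the string again."""
    return text[::-1]
```

Usage example, where all three inputs hide the same string:

```python
for decoded in (
    decode_hex("636d642e657865"),
    decode_base64("Y21kLmV4ZQ=="),
    decode_strreverse("exe.dmc"),
):
    print(decoded)
```

Macro authors use these encodings so that suspicious strings do not appear in plain text, which is why decoding them is a useful step after extracting the macro source.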

Results

All jobs create their own logs, and the final job creates a summary report. These can be viewed in the "results" branch. An example of a summary report is available here.

Setting up the pipeline

The easiest way to set up the pipeline is using the pilot environment instructions.


Job descriptions

job-sort-files

Files are sorted by type into PDF or non-PDF.

job-clamscan

All samples are scanned with ClamAV.
triggered after: job-sort-files

job-strings

Samples are run through "strings".
triggered after: job-sort-files

job-pdfid

Analyses PDF files.
triggered after: job-clamscan

job-peepdf-virustotal-check

Analyses PDF files. Checks if the file's hash is found on VirusTotal.
triggered after: job-clamscan

job-jsunpackn

Analyses PDF files. Possible shellcode is extracted to "results/shellcode".
triggered after: job-pdfid & job-peepdf-virustotal-check

job-sctest

Analyses shellcode from "results/shellcode" extracted by jsunpackn.
triggered after: job-jsunpackn

job-oledump

Dumps data streams of OLE files (doc, xls, ppt...).
triggered after: job-strings
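
As background for this job and the next, the two container formats can be told apart by their first bytes: legacy OLE files (doc, xls, ppt) start with a fixed 8-byte signature, while OpenXML files are ZIP archives. The sketch below is an illustration only; the function name `container_type` is hypothetical, and the document does not say whether the pipeline performs such a check.

```python
# Signature of the OLE2 compound file format used by legacy Office documents.
OLE_MAGIC = bytes.fromhex("d0cf11e0a1b11ae1")
# Local file header signature of a ZIP archive (OpenXML documents are ZIP files).
ZIP_MAGIC = b"PK\x03\x04"


def container_type(data: bytes) -> str:
    """Classify a sample as 'ole', 'zip' or 'unknown' from its leading bytes."""
    if data.startswith(OLE_MAGIC):
        return "ole"
    if data.startswith(ZIP_MAGIC):
        return "zip"
    return "unknown"
```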

job-olevba

Parses OLE and OpenXML files to detect VBA macros and extract their source code.
triggered after: job-oledump

job-generate-report

Creates the final report in the "results" branch.
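
The "triggered after" relations listed above form a dependency graph. As a way to visualise the resulting order, the sketch below groups the jobs into stages that could run in parallel, using Python's standard `graphlib` module (Python 3.9+). It is not how Concourse schedules jobs, and the function name `execution_stages` is hypothetical. job-generate-report is omitted because its trigger is not stated in the descriptions.

```python
from graphlib import TopologicalSorter

# Each job maps to the set of jobs it is triggered after, as listed above.
JOB_DEPENDENCIES = {
    "job-sort-files": set(),
    "job-clamscan": {"job-sort-files"},
    "job-strings": {"job-sort-files"},
    "job-pdfid": {"job-clamscan"},
    "job-peepdf-virustotal-check": {"job-clamscan"},
    "job-jsunpackn": {"job-pdfid", "job-peepdf-virustotal-check"},
    "job-sctest": {"job-jsunpackn"},
    "job-oledump": {"job-strings"},
    "job-olevba": {"job-oledump"},
}


def execution_stages(dependencies: dict[str, set[str]]) -> list[list[str]]:
    """Group jobs into stages; every job in a stage has all its triggers done."""
    sorter = TopologicalSorter(dependencies)
    sorter.prepare()
    stages = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())
        stages.append(ready)
        sorter.done(*ready)
    return stages


for number, stage in enumerate(execution_stages(JOB_DEPENDENCIES), start=1):
    print(number, ", ".join(stage))
```

The output shows that the PDF line (clamscan, pdfid, peepdf, jsunpackn, sctest) and the document line (strings, oledump, olevba) proceed independently once the files have been sorted.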