Document pipeline

The pipeline clones samples from a GitLab repo, sorts the files into PDFs and other documents, and then runs the appropriate tools on the sample files.

* Watch the VIDEO

Tools run in the pipeline


Pipeline workflow


A user uploads samples to the Git branch "sample-source". Concourse polls the Git repository, and the pipeline is triggered by changes in the repository.

Samples are sorted into two analysis lines: PDFs and other documents. All files are scanned for viruses with ClamAV.
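The sorting step can be sketched in a few lines of Python. This is only an illustration: how the pipeline actually recognises PDFs is not described here, so the magic-byte check, the `is_pdf` and `sort_samples` names, and the 1024-byte search window are all assumptions.

```python
from pathlib import Path

# Assumption: a file counts as a PDF if the "%PDF-" marker appears near its start.
PDF_MAGIC = b"%PDF-"


def is_pdf(head: bytes) -> bool:
    """Return True if the PDF marker occurs in the first 1024 bytes."""
    return PDF_MAGIC in head[:1024]


def sort_samples(sample_dir: Path) -> dict[str, list[Path]]:
    """Split the files in sample_dir into the two analysis lines."""
    lines: dict[str, list[Path]] = {"pdf": [], "other": []}
    for path in sorted(sample_dir.iterdir()):
        if not path.is_file():
            continue
        with path.open("rb") as handle:
            head = handle.read(1024)
        lines["pdf" if is_pdf(head) else "other"].append(path)
    return lines
```

Checking the content rather than the file extension means a renamed PDF still ends up in the PDF line.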

The Document pipeline at GitLab


PDF analysis

After the virus scan, PDFs are analysed with Pdfid and Peepdf. Pdfid searches for certain PDF keywords and reports whether the file contains something suspicious, such as JavaScript, or executes an action when opened.
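To give an idea of what a keyword scan looks like, here is a minimal sketch in Python. It is not Pdfid's implementation: the keyword list is a small subset chosen for illustration, `count_keywords` is a hypothetical name, and the sketch ignores obfuscated names (for example hex-escaped characters), which a real tool has to handle.

```python
import re

# Illustrative subset of PDF names that often indicate active content.
SUSPICIOUS_KEYWORDS = ("/JS", "/JavaScript", "/OpenAction", "/AA", "/Launch")


def count_keywords(data: bytes) -> dict[str, int]:
    """Count occurrences of each keyword in the raw bytes of a PDF."""
    counts = {}
    for keyword in SUSPICIOUS_KEYWORDS:
        # The negative lookahead stops "/JS" from matching inside a longer name.
        pattern = re.compile(re.escape(keyword.encode()) + rb"(?![A-Za-z0-9])")
        counts[keyword] = len(pattern.findall(data))
    return counts
```

A non-zero count for `/OpenAction` or `/AA` together with `/JS` or `/JavaScript` is the kind of combination that suggests code running when the document is opened.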

Peepdf runs basic analysis as well, but it also checks if a sample's hash is found on VirusTotal and recognized as malware.
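A hash lookup starts with computing the sample's digest. The sketch below shows only that local step in Python. Which digest Peepdf submits and how it queries VirusTotal are not covered here, so the set of algorithms and the `sample_digests` name are assumptions.

```python
import hashlib


def sample_digests(data: bytes) -> dict[str, str]:
    """Return common hex digests of a sample's contents."""
    return {
        "md5": hashlib.md5(data).hexdigest(),
        "sha1": hashlib.sha1(data).hexdigest(),
        "sha256": hashlib.sha256(data).hexdigest(),
    }
```

Because only the digest is looked up, the sample itself does not have to leave the analysis environment.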

Jsunpack-n unpacks JavaScript from the samples and extracts shellcode, if found, to a "shellcode" folder, and converts it to binary format.
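Shellcode embedded in JavaScript is often written as `%uXXXX` escape sequences, where each sequence is one 16-bit little-endian word. As an illustration of what converting such a string to binary can involve, here is a small Python sketch. It is an assumption that the samples use this encoding, and `percent_u_to_bytes` is a hypothetical helper, not part of Jsunpack-n.

```python
import re

_PERCENT_U = re.compile(r"%u([0-9a-fA-F]{4})")


def percent_u_to_bytes(escaped: str) -> bytes:
    """Convert %uXXXX sequences to raw bytes; everything else is ignored."""
    words = _PERCENT_U.findall(escaped)
    # Each 16-bit value is stored low byte first, as it would be in memory.
    return b"".join(int(word, 16).to_bytes(2, "little") for word in words)
```

For example, `percent_u_to_bytes("%u9090%u9090")` yields four `0x90` bytes, a typical NOP sled fragment.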

Finally, the extracted shellcode is analysed with Peepdf's sctest. Sctest tries to analyse the shellcode binary and show what its purpose is.

Document analysis

Other documents are run through "strings" after the virus scan.
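For readers unfamiliar with "strings", the idea can be sketched in Python: pull out runs of printable ASCII characters of a minimum length. The default of 4 matches GNU strings, but the function below is a simplification (it does not treat tabs as printable, for instance), and `extract_strings` is a hypothetical name.

```python
import re


def extract_strings(data: bytes, min_length: int = 4) -> list[str]:
    """Return runs of printable ASCII that are at least min_length long."""
    pattern = re.compile(rb"[\x20-\x7e]{%d,}" % min_length)
    return [match.decode("ascii") for match in pattern.findall(data)]
```

Even on binary document formats this often surfaces URLs, file paths and macro names worth a closer look.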

After strings, Oledump dumps data streams found in the samples.

The last job, Olevba, parses the OLE/OpenXML files to detect macros and extract their source code. Olevba also detects patterns such as auto-executable macros, suspicious VBA keywords, anti-sandboxing and anti-virtualization techniques, and IOCs. It can also decode several obfuscation methods, including hex encoding, StrReverse, Base64 and Dridex.
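The simpler obfuscation methods mentioned above are easy to illustrate. The Python sketch below decodes hex, reversed and Base64 strings using only the standard library. It does not use or reproduce Olevba's code, it leaves out Dridex decoding entirely, and all function names are hypothetical.

```python
import base64


def decode_hex(text: str) -> str:
    """Decode a hex-encoded string such as "636d64"."""
    return bytes.fromhex(text).decode("latin-1")


def decode_strreverse(text: str) -> str:
    """Undo VBA's StrReverse by reversing the string again."""
    return text[::-1]


def decode_base64(text: str) -> str:
    """Decode strict Base64; raises ValueError on invalid input."""
    return base64.b64decode(text, validate=True).decode("latin-1")


def try_decodings(text: str) -> dict[str, str]:
    """Apply every decoder and keep the ones that succeed."""
    results = {"StrReverse": decode_strreverse(text)}
    for name, decoder in (("Hex", decode_hex), ("Base64", decode_base64)):
        try:
            results[name] = decoder(text)
        except ValueError:  # binascii.Error is a subclass of ValueError
            pass
    return results
```

An analyst would then scan the decoded candidates for recognisable content such as commands or URLs.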


All jobs create their own logs, and the final job creates a summary report. These can be viewed in the "results" branch. An example of a summary report is available here.

Setting up the pipeline

The easiest way to set up the pipeline is using the pilot environment instructions.

Job descriptions


Files are sorted by type to PDF or non-PDF.


All samples are scanned with ClamAV.
triggered after: job-sort-files


Samples run through "strings".
triggered after: job-sort-files


Analyses PDF files.
triggered after: job-clamscan


Analyses PDF files. Checks whether the file's hash is found on VirusTotal.
triggered after: job-clamscan


Analyses PDF files. Possible shellcode is extracted to "results/shellcode".
triggered after: job-pdfid & job-peepdf-virustotal-check


Analyses shellcode from "results/shellcode" extracted by jsunpackn.
triggered after: job-jsunpackn


Dumps data streams of OLE files (doc, xls, ppt...).
triggered after: job-strings


Parses OLE and OpenXML files to detect VBA macros and extract their source code.
triggered after: job-oledump


Creates the final report and writes it to the "results" branch.
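The "triggered after" relations above form a dependency graph. As a rough sketch, they can be written down in Python and ordered with the standard library's `graphlib` (Python 3.9+). Only the job names that appear in the "triggered after" lines are included, and matching each name to its description is inferred from the names themselves. Jobs whose names are not given here are left out.

```python
from graphlib import TopologicalSorter

# Each job maps to the set of jobs it is triggered after.
TRIGGERED_AFTER = {
    "job-sort-files": set(),
    "job-clamscan": {"job-sort-files"},
    "job-strings": {"job-sort-files"},
    "job-pdfid": {"job-clamscan"},
    "job-peepdf-virustotal-check": {"job-clamscan"},
    "job-jsunpackn": {"job-pdfid", "job-peepdf-virustotal-check"},
    "job-oledump": {"job-strings"},
}


def run_order(graph: dict[str, set[str]]) -> list[str]:
    """Return one valid execution order that respects every dependency."""
    return list(TopologicalSorter(graph).static_order())


if __name__ == "__main__":
    for job in run_order(TRIGGERED_AFTER):
        print(job)
```

In the real pipeline Concourse resolves this ordering itself. The sketch only makes the structure of the two analysis lines easier to see.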