Determining the quality of TI feeds - Status report after one week of hacking
Writer: Jyrki Takanen, JAMK University of Applied Sciences
In recent years the number of publicly available threat intelligence (TI) feeds has grown. With so many sources to choose from, it has become increasingly important to determine the quality of individual feeds. There has been some research on the subject, but publicly available practical tools for such analysis are scarce. One such tool is tiq-test.
In this post we explain the first steps we took to analyze some publicly available TI feeds and to automate the data collection and analysis.
Automating TI feed quality analysis
For polling the various threat intelligence feeds, we wrote a simple Python script and a custom JSON file which lists the feeds and describes their properties. Adding a new feed is now as easy as adding it to the JSON file.
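As a rough illustration of this setup, the sketch below loads a feed list from JSON and stores one dated snapshot per feed. The feed names, URLs, field names and directory layout are placeholders invented for the example, not the ones used in our script.

```python
import json
import urllib.request
from datetime import date
from pathlib import Path

# Hypothetical feed list; names, URLs and keys are placeholders.
FEEDS_JSON = """
[
  {"name": "example-ip-feed", "url": "https://feeds.example.org/ips.txt"},
  {"name": "example-csv-feed", "url": "https://feeds.example.org/ips.csv"}
]
"""


def load_feeds(text):
    """Parse the feed list and check that every entry has a name and a URL."""
    feeds = json.loads(text)
    for feed in feeds:
        missing = {"name", "url"} - set(feed)
        if missing:
            raise ValueError(f"feed entry is missing {sorted(missing)}")
    return feeds


def http_get(url):
    """Download the body of `url` as bytes."""
    with urllib.request.urlopen(url, timeout=30) as response:
        return response.read()


def snapshot_path(out_dir, feed, day):
    """Return the path where the snapshot of `feed` for `day` is stored."""
    return Path(out_dir) / feed["name"] / f"{day.isoformat()}.raw"


def poll(feeds, out_dir, day=None, fetch=http_get):
    """Fetch every feed once and store the raw body as a dated snapshot."""
    day = day or date.today()
    written = []
    for feed in feeds:
        path = snapshot_path(out_dir, feed, day)
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_bytes(fetch(feed["url"]))
        written.append(path)
    return written
```

Passing the download function in as `fetch` keeps the polling logic testable without network access. With this layout, adding a feed only requires a new object in the JSON list.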
We set the script to run daily using a cron job.
The problem with parsing different TI feeds is that they come in various formats. Luckily, most of the time they are either lists of lines with fields separated by some delimiter, or JSON files. So the next step was to write a script that converts the various feed formats to a standard JSON format. This was accomplished by using the previously mentioned JSON file (the one listing the feeds) to also describe the format of each feed: whether it is a CSV file or delimited by other characters, whether the file contains comment lines or a banner, etc. The script is then called to convert the original files to JSON. This can also be automated using cron.
Once the data has been pulled and formatted, the question becomes what to do with it all. It is not obvious what determines the quality of a particular feed, and the answer also depends on the use case of the feed.
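The conversion step could look roughly like the sketch below. The format keys (`delimiter`, `comment_prefix`, `skip_lines`, `columns`) are hypothetical names chosen for this example, and the sample data uses documentation-only IP addresses.

```python
import csv
import json


def parse_feed(raw_text, spec):
    """Convert one raw delimited feed into a list of dicts.

    `spec` is the feed's format description. All keys are hypothetical:
    "delimiter", "comment_prefix", "skip_lines" (banner length) and
    "columns" (names for the fields on each line).
    """
    # Drop the banner, then blank lines and comment lines.
    lines = raw_text.splitlines()[spec.get("skip_lines", 0):]
    comment = spec.get("comment_prefix")
    lines = [
        line for line in lines
        if line.strip() and not (comment and line.lstrip().startswith(comment))
    ]
    reader = csv.reader(lines, delimiter=spec.get("delimiter", ","))
    columns = spec["columns"]
    return [dict(zip(columns, (field.strip() for field in row))) for row in reader]


raw = "Example banner\n# comment\n192.0.2.1;malware\n\n198.51.100.7;scanner\n"
spec = {
    "delimiter": ";",
    "comment_prefix": "#",
    "skip_lines": 1,
    "columns": ["ip", "category"],
}
records = parse_feed(raw, spec)
as_json = json.dumps(records, indent=2)  # the standardized output
```

Feeds that are already JSON would need a separate branch that maps their fields onto the same column names.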
Some determining properties one might consider:
* Size
* Dynamism - How often and what portion of entries are added and removed from the feed.
* Timeliness - How early are entries appearing compared to other sources.
* False positives - Does the feed contain a lot of entries which shouldn't be there. This may prohibit usage of the feed for some purposes.
We wanted to get some data about these metrics.
To this end we built a simple Python script which compares each feed to its previous daily snapshot and counts how many entries were added to and removed from the feed. Finally, the same numbers are calculated between the first and last snapshots. The results are then written to a file.
One measure of the quality of a feed is its speed. In other words, are entries added to the feed earlier than to other feeds, and are they removed in a timely manner? Getting meaningful data about this metric is not as easy. As a first approximation, our idea was to check what percentage of the entries seen in multiple feeds are seen earlier in a particular feed than in the others. This led to the following report.
The short interval of sample data limits the usefulness of the report. Currently most of the entries are already in all of the feeds in the first snapshot, which skews the numbers. Limiting the analysis to the subset of the data that has changed after the start of data collection is one option to mitigate this issue.
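The comparison itself comes down to set differences. Below is a minimal sketch, assuming each daily snapshot has already been reduced to a set of entries keyed by an ISO date string. The function names and sample data are made up for the example.

```python
def diff_snapshots(previous, current):
    """Count entries added and removed between two snapshots of one feed."""
    previous, current = set(previous), set(current)
    return {
        "added": len(current - previous),
        "removed": len(previous - current),
        "size": len(current),
    }


def churn_report(snapshots):
    """Diff each day against the previous one, then first against last.

    `snapshots` maps an ISO date string to the entries seen that day.
    """
    days = sorted(snapshots)  # ISO dates sort chronologically
    rows = [
        (f"{a} -> {b}", diff_snapshots(snapshots[a], snapshots[b]))
        for a, b in zip(days, days[1:])
    ]
    if len(days) > 1:
        first, last = days[0], days[-1]
        rows.append((f"{first} -> {last} (overall)",
                     diff_snapshots(snapshots[first], snapshots[last])))
    return rows


snapshots = {
    "2020-01-01": {"192.0.2.1", "192.0.2.2", "192.0.2.3"},
    "2020-01-02": {"192.0.2.2", "192.0.2.3", "192.0.2.4"},
    "2020-01-03": {"192.0.2.3", "192.0.2.4", "192.0.2.5", "192.0.2.6"},
}
for label, counts in churn_report(snapshots):
    print(label, counts)
```

Dividing `added` and `removed` by `size` gives the portion of the feed that changes per day, which is one way to put a number on dynamism.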
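One possible way to compute such a percentage, together with the mitigation mentioned above, is sketched here. It rests on assumptions of this example: an entry counts for a feed only if that feed listed it strictly before every other feed that has it, ties count for nobody, and all names and data are illustrative.

```python
from collections import defaultdict


def first_seen(snapshots_by_feed):
    """Map feed -> {entry: earliest day on which that feed listed it}."""
    result = {}
    for feed, snapshots in snapshots_by_feed.items():
        seen = {}
        for day in sorted(snapshots):  # ISO dates sort chronologically
            for entry in snapshots[day]:
                seen.setdefault(entry, day)
        result[feed] = seen
    return result


def earliest_share(snapshots_by_feed):
    """Per feed: percentage of its shared entries that it listed first."""
    days_by_entry = defaultdict(dict)
    for feed, entries in first_seen(snapshots_by_feed).items():
        for entry, day in entries.items():
            days_by_entry[entry][feed] = day
    shared, wins = defaultdict(int), defaultdict(int)
    for days in days_by_entry.values():
        if len(days) < 2:
            continue  # seen in one feed only, nothing to compare
        earliest = min(days.values())
        leaders = [feed for feed, day in days.items() if day == earliest]
        for feed in days:
            shared[feed] += 1
        if len(leaders) == 1:  # ties are credited to no feed
            wins[leaders[0]] += 1
    return {feed: 100.0 * wins[feed] / shared[feed] for feed in shared}


def drop_initial_entries(snapshots_by_feed):
    """Remove entries that any feed already listed on the first day."""
    first_day = min(d for snaps in snapshots_by_feed.values() for d in snaps)
    initial = set()
    for snaps in snapshots_by_feed.values():
        initial.update(snaps.get(first_day, ()))
    return {
        feed: {day: set(entries) - initial for day, entries in snaps.items()}
        for feed, snaps in snapshots_by_feed.items()
    }


data = {
    "feed-a": {"2020-01-01": {"x", "y"}, "2020-01-02": {"x", "y", "z"},
               "2020-01-03": {"x", "y", "z", "v"}},
    "feed-b": {"2020-01-01": {"x"}, "2020-01-02": {"x", "y", "v"},
               "2020-01-03": {"x", "y", "z", "v"}},
}
unfiltered = earliest_share(data)
filtered = earliest_share(drop_initial_entries(data))
```

In this toy data the share for `feed-b` changes once the entries present on the first day are excluded, which illustrates how the initial snapshot can skew the numbers.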
In the future(?)
In the future the plan is to try to figure out other methods for determining the quality of the feeds and possibly to explore ways to visualize the data. So far most of the work has focused on feeds containing IP addresses. A possible avenue for further work would be to devise methods for analyzing feeds containing other kinds of indicators of compromise, such as domains, URLs and hashes.
- Some hosts don't like their feeds being pulled by scripts and have banned the urllib library's default user-agent. Possible workaround: https://www.scrapehero.com/how-to-fake-and-rotate-user-agents-using-python-3/
- Hosts may have set limits on how often data can be pulled (for example once every 30 minutes).
- JSON files are slow.
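The first two points can be handled in the polling code. The sketch below sets an explicit User-Agent header with `urllib.request.Request` (a plain identifying string, rather than the rotation approach described in the linked article) and tracks a minimum interval per feed. The user-agent string, class name and interval are assumptions made for the example.

```python
import time
import urllib.request

# Hypothetical identifying string; replace with your own contact details.
USER_AGENT = "ti-feed-quality-check/0.1 (contact: admin@example.org)"


def build_request(url, user_agent=USER_AGENT):
    """Build a request that does not use urllib's default user-agent."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


class PollLimiter:
    """Remember when each feed was last pulled and refuse to pull too soon."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self._last = {}

    def allowed(self, name, min_interval_s):
        """Return True and record the pull if enough time has passed."""
        now = self._clock()
        last = self._last.get(name)
        if last is not None and now - last < min_interval_s:
            return False
        self._last[name] = now
        return True


limiter = PollLimiter()
request = build_request("https://feeds.example.org/ips.txt")
if limiter.allowed("example-ip-feed", min_interval_s=30 * 60):
    pass  # here one would call urllib.request.urlopen(request, timeout=30)
```

The limiter keeps its state in memory only, so a script started fresh by cron on every run would need to persist the timestamps, for example in a small file next to the snapshots.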