Fuzzycomparison - a tool for monitoring for changes in webpages
Writer: Ville Kalliokoski - OUSPG / University of Oulu
Fuzzycomparison is a tool for monitoring changes in webpages. Originally meant for monitoring phishing sites, it calculates the ssdeep hash of a page and its Levenshtein distance from previous versions, and reports if there have been significant changes.
Fuzzy comparison of text files
Since webpages are rarely static nowadays, simply checking whether two versions of the same page are identical does not reveal meaningful changes. We must therefore use a fuzzy comparison method that allows for some leeway in the contents of the page.
ssdeep
Ssdeep is a fuzzy hashing algorithm that hashes the input data in chunks and can then compare two hashes, giving a rough estimate of how similar the two data sets are. The comparison value is between 0 and 100, with 100 being an exact copy.
1536:vyZzME2KFCa/J+skYH/JjsdtoRVFNGZPVJB9VZBFVVBhBMopERwoh1QPj9vwoh1t:yNosjHByVJB9VZBFVVBMeOa
1536:vyZzME2KFOa/J+skYH/JjsdtoRVFNGZwVJB9VZBFVVBQeUuvYRXSMEGUg6XSMEGl:yNEsjHBfVJB9VZBFVVBI5Wz
Comparison value: 75
Levenshtein distance
The Levenshtein (edit) distance counts the single-character insertions, deletions, and substitutions needed to turn the input data into the output data. Since the edit distance has no fixed upper bound, fuzzycomparison calculates a relative distance by dividing the edit distance by the length of the older version of the page.
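The relative distance described above can be sketched in plain Python. This is a minimal illustration using the standard Levenshtein definition (insertions, deletions, and substitutions), not fuzzycomparison's actual code; the function names `levenshtein` and `relative_distance` and the handling of an empty older version are assumptions.

```python
def levenshtein(old, new):
    """Count the single-character insertions, deletions and substitutions
    needed to turn `old` into `new`."""
    # Only the previous row of the dynamic-programming table is kept.
    previous = list(range(len(new) + 1))
    for i, old_char in enumerate(old, start=1):
        current = [i]
        for j, new_char in enumerate(new, start=1):
            cost = 0 if old_char == new_char else 1
            current.append(min(
                previous[j] + 1,         # delete old_char
                current[j - 1] + 1,      # insert new_char
                previous[j - 1] + cost,  # substitute, or keep if equal
            ))
        previous = current
    return previous[-1]


def relative_distance(old, new):
    """Edit distance divided by the length of the older version."""
    if not old:
        # Assumption: an empty older version counts as fully changed
        # unless the newer version is empty as well.
        return 0.0 if not new else 1.0
    return levenshtein(old, new) / len(old)


print(levenshtein("kitten", "sitting"))
print(relative_distance("kitten", "sitting"))
```

Note that the relative value can exceed 1 when the newer version is much longer than the older one.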
Edit distance cache
Since the edit distance needs the contents of the previous version, fuzzycomparison keeps a cache of previously downloaded pages. You can configure how many versions of a page are stored. If you only need the comparison values, the default value of 1 is enough; if you wish to store multiple versions, adjust it with the --cache-count parameter. You can also set a minimum time interval between cache updates with the --archive-interval parameter, which takes a string formatted as 00d00h00m00s. The cache is then updated only if the cached version is older than the configured time span.
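To illustrate the 00d00h00m00s interval format, here is a small sketch that parses such a string and decides whether a cached copy is old enough to be replaced. The helper names `parse_interval` and `should_archive` are hypothetical, and the strict pattern below is an assumption; the real parser may accept other variants.

```python
import re
from datetime import timedelta

# Assumed strict form: days, hours, minutes and seconds are all present.
_INTERVAL = re.compile(r"(\d+)d(\d+)h(\d+)m(\d+)s")


def parse_interval(text):
    """Turn a string such as '01d12h00m30s' into a timedelta."""
    match = _INTERVAL.fullmatch(text)
    if match is None:
        raise ValueError(f"expected 00d00h00m00s, got {text!r}")
    days, hours, minutes, seconds = (int(part) for part in match.groups())
    return timedelta(days=days, hours=hours, minutes=minutes, seconds=seconds)


def should_archive(cached_at, now, interval):
    """True when the cached copy (epoch seconds) is older than the interval."""
    return now - cached_at > interval.total_seconds()


interval = parse_interval("00d01h00m00s")
print(should_archive(cached_at=0, now=3601, interval=interval))
```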
As the name suggests, both ssdeep and edit distance give a fuzzy value for the differences, so the meaning of these comparison values always depends on the input data. To get around this, fuzzycomparison needs a few data points to know when there have been significant changes in the website. It calculates the standard deviation of the comparison values and reports changes that fall outside of it. You can also configure a tolerance for the deviation, so that the program reports changes only if the comparison values are, e.g., more than 10% outside of the standard deviation.
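One possible reading of this check is sketched below. It assumes the band is centred on the mean of earlier comparison values, uses the population standard deviation, and widens the band by the tolerance; the function `is_significant` and all three choices are assumptions, not the tool's confirmed behaviour.

```python
from statistics import mean, pstdev


def is_significant(history, latest, tolerance=0.0):
    """Report whether `latest` falls outside the standard deviation of
    `history`, widened by `tolerance` (0.1 means 10 %)."""
    if len(history) < 2:
        # Too few data points to say anything about normal variation.
        return False
    centre = mean(history)
    allowed = pstdev(history) * (1.0 + tolerance)
    return abs(latest - centre) > allowed


# Illustrative ssdeep comparison values from earlier scans.
history = [90, 92, 88, 91, 89]
print(is_significant(history, 75))
print(is_significant(history, 91))
```

The same function works for relative edit distances, since it only looks at how far the latest value is from the earlier ones.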
Basic usage
Fuzzycomparison is designed to take in a list of URLs, scan and parse the webpages in it, and store the results. You can supply the URLs in a text file, on the command line, or through stdin.

The results are stored in a JSON file containing a list of analyzed webpages and their metadata:
[
    {
      "edit distance": integer,
      "relative edit distance": float - proportional edit distance,
      "comparison value": integer - ssdeep comparison value,
      "error code": previous error code received while fetching,
      "hash": ssdeep hash of the latest downloaded version,
      "hash-old": ssdeep hash of the previous downloaded version,
      "human-readable timestamp": ISO timestamp - the time the latest version was downloaded,
      "human-readable timestamp-old": ISO timestamp - the time the previous version was downloaded,
      "previous version": path to previously downloaded version of the page,
      "timestamp": Epoch timestamp - the time the latest version was downloaded,
      "timestamp-old": Epoch timestamp - the time the previous version was downloaded,
      "url": URL of the page
    },
]
This JSON file can then be used as input for the program to rescan the pages for any changes.
If you don't specify an output, fuzzycomparison updates the input file (in case of a JSON input), creates a new JSON file if you have configured a default output, or writes the contents to stdout.
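A results file with the fields listed above can also be read by other scripts. The sketch below pulls out the URL and both comparison values for each entry; the helper `summarize` and the sample values are made up for illustration.

```python
import json


def summarize(results_text):
    """Return (url, ssdeep comparison value, relative edit distance) tuples."""
    entries = json.loads(results_text)
    if not isinstance(entries, list):
        raise ValueError("expected a JSON list of webpage entries")
    summary = []
    for entry in entries:
        summary.append((
            entry["url"],
            entry.get("comparison value"),        # None if not yet computed
            entry.get("relative edit distance"),  # None if not yet computed
        ))
    return summary


# Illustrative data only, not real scan output.
sample = """
[
  {"url": "http://example.com/", "comparison value": 75, "relative edit distance": 0.12}
]
"""
for url, ssdeep_score, rel_distance in summarize(sample):
    print(url, ssdeep_score, rel_distance)
```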
You can also pass hashes and previous versions through the CLI. Note that if you wish to do this for multiple URLs, the hashes and file paths of the previous versions have to be given in the same order as the URLs.
If you wish, you can also use stdin and stdout for input and output. They are disabled by default, but you can enable them in the config file. The default config file is default.cfg in the root folder of the application.
Using previously downloaded pages

By default, fuzzycomparison fetches the provided URLs, but if you have already downloaded the differing versions, you can use those with the -c flag. Fuzzycomparison expects a path to a directory that has a subdirectory for each webpage, each containing at least two different versions of the page. Note that the cached files are sorted by filename (by default, filenames begin with a timestamp).
fuzzycomparison -c alternate_cache_path
alternate_cache_path
    * webpage1
        * 00_version1.txt
        * 01_version2.txt
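The layout above can be walked with a few lines of Python. This sketch sorts each subdirectory by filename and picks the two newest versions for comparison; `latest_two_versions` is a hypothetical helper, and skipping pages with fewer than two versions is an assumption.

```python
from pathlib import Path


def latest_two_versions(cache_root):
    """Map each webpage subdirectory to its (previous, latest) version files."""
    pairs = {}
    for page_dir in sorted(Path(cache_root).iterdir(), key=lambda p: p.name):
        if not page_dir.is_dir():
            continue
        # Filenames start with a timestamp or counter, so sorting by name
        # puts the versions in chronological order.
        versions = sorted(
            (p for p in page_dir.iterdir() if p.is_file()),
            key=lambda p: p.name,
        )
        if len(versions) >= 2:
            pairs[page_dir.name] = (versions[-2], versions[-1])
    return pairs
```

Calling `latest_two_versions("alternate_cache_path")` on the example tree would pair `00_version1.txt` with `01_version2.txt` under the key `webpage1`.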
Configuration

Configuration is stored as a JSON file with the following fields:
{
  "ChangesOutputPath": null, # Output path for report of webpages that have changed beyond thresholds defined below
  "EditDistanceThreshold": 0.9, # Threshold for edit distance (lower is farther from original)
  "SsdeepThreshold": 10, # Threshold for ssdeep difference (lower is farther from original)
  "DefaultOutputFile": null, # Default output file
  "Header": {
    "User-Agent": "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.3"
  }, # Header used when fetching the webpage
  "UseEditDistance": true,
  "UseSsdeep": true,
  "UseStdin": false,
  "UseStdout": true
}
The application creates default.cfg if it doesn't exist, but you can also pass your own configuration files through the CLI.
For more information, documentation and source code, check out the GitLab repository.