itInsight / Writing a web application scanner

Meet wascan

usage: wascan.py [-h] [-u] [-b] [-w path] target

Web Application Scanner

positional arguments:
target        target url to scan

optional arguments:
-h, --help    show this help message and exit
-u, --unique  print only unique content
-b, --brute   bruteforce urls
-w path       set the wordlist to use

Download me from GitHub

I’ve started out to build a web application scanner. The first usable iteration is ``wascan'' (sorry, I’m bad at naming things, if you have a better idea, let me know). The current version can crawl a target url and then by performing a brute-forcing step, it can discover further resources. The goal is to create a scanner which can automatically perform authentication, find and fuzz parameters, detect CSRF tokens, recognize session cookies and discover most of the OWASP top 10 vulnerabilities.

I will post an update describing the new features when they get done.

Features

The current feature list is pretty slim:

Website crawling & directory enumeration
Duplicate detection using simhash

Similarity hashing

Similarity hashing is a technique for deriving short identifiers from data in a way that similar input produces similar output. Simhash is a simple similarity hashing algorithm devised by the researchers at Google for efficiently detecting near-duplicates during web crawling. I am using this technique to identify and group web pages which have dynamically generated parts (timestamps, urls, etc.), but are essentially the same. The output from the scanner shows duplicate items with a yellow hash and unique items with green.

WaScan in action

The simhash algorithm starts by splitting the input into fragments for which a regular hash code is calculated. We can use either cryptographic or non-cryptographic hashes, but since we want better performance I’ve chosen to use murmurhash3. If we write these values below each other, the bits of the values form a matrix. From this matrix the output bits are calculated by counting the number of 1s in each column and determining whether they’re in majority. If there are more 1s than 0s, then a 1 is outputted otherwise we write a zero. At the end of this process we have a hash of the same length as the hashes of the fragments. This looks like the following in Python:

def simhash(text):
    delim = b'; |, |. | |\r\n|\n|<|>|<!--|-->|\x00'

    if type(text) == str:
        text = text.encode('utf-8')

    frag = filter(lambda x: len(x) > 0, re.split(delim, text))
    bins = [0] * 32
    ures = 0

    for f in frag:
        h = mmh3.hash(f, seed=0xcafe)

        for k in range(32):
            bins[k] += (((h & (1 << k)) > 0) * 2 - 1)

    for k in range(32):
        ures |= ((bins[k] > 0) << k)

    return ures

Todo

Skip large files by looking at the Content-Length header
Implement multithreaded mode to speed up scanner
Parameter and form detection, advanced features