Writing a web application scanner
1 2 3 4 5 6 7 8 9 10 11 12
usage: wascan.py [-h] [-u] [-b] [-w path] target Web Application Scanner positional arguments: target target url to scan optional arguments: -h, --help show this help message and exit -u, --unique print only unique content -b, --brute bruteforce urls -w path set the wordlist to use
I’ve started out to build a web application scanner. The first usable iteration is ``wascan'' (sorry, I’m bad at naming things, if you have a better idea, let me know). The current version can crawl a target url and then by performing a brute-forcing step, it can discover further resources. The goal is to create a scanner which can automatically perform authentication, find and fuzz parameters, detect CSRF tokens, recognize session cookies and discover most of the OWASP top 10 vulnerabilities.
I will post an update describing the new features when they get done.
The current feature list is pretty slim:
- Website crawling & directory enumeration
- Duplicate detection using simhash
Similarity hashing is a technique for deriving short identifiers from data in a way that similar input produces similar output. Simhash is a simple similarity hashing algorithm devised by the researchers at Google for efficiently detecting near-duplicates during web crawling. I am using this technique to identify and group web pages which have dynamically generated parts (timestamps, urls, etc.), but are essentially the same. The output from the scanner shows duplicate items with a yellow hash and unique items with green.
The simhash algorithm starts by splitting the input into fragments for which a regular hash code is calculated. We can use either cryptographic or non-cryptographic hashes, but since we want better performance I’ve chosen to use murmurhash3. If we write these values below each other, the bits of the values form a matrix. From this matrix the output bits are calculated by counting the number of 1s in each column and determining whether they’re in majority. If there are more 1s than 0s, then a 1 is outputted otherwise we write a zero. At the end of this process we have a hash of the same length as the hashes of the fragments. This looks like the following in Python:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
def simhash(text): delim = b'; |, |. | |\r\n|\n|<|>|<!--|-->|\x00' if type(text) == str: text = text.encode('utf-8') frag = filter(lambda x: len(x) > 0, re.split(delim, text)) bins =  * 32 ures = 0 for f in frag: h = mmh3.hash(f, seed=0xcafe) for k in range(32): bins[k] += (((h & (1 << k)) > 0) * 2 - 1) for k in range(32): ures |= ((bins[k] > 0) << k) return ures
- Skip large files by looking at the Content-Length header
- Implement multithreaded mode to speed up scanner
- Parameter and form detection, advanced features