Any plans to start using p0fv3?

Question

cmavr8 opened this issue 9 years ago · comments

Now that p0f3 has been out for a couple of years, I wonder if anyone is interested in making use of it in PRADS. The signatures are different, so you can't just replace p0f2 with p0f3. You also wouldn't want to drop the old sigs, because the new ones do not really cover old systems (as far as I can tell).

So, maybe using p0f3 by itself as another module in PRADS could be an option. The other way to go is to start picking features/methods used by p0f3 and adding them to PRADS.

Thoughts, anyone?

Kacper Why · Answer 1 · Mon Apr 20 2015 23:26:09 GMT+0800 (China Standard Time)

Yes, p0fv3 has been interesting from the start. It would require some rewriting of the prads in-memory asset database, though. The old sigs could be used to augment the v3 sigs, or just dropped, because they have limited value for new systems anyway.

Feel free to take a stab at this 👍

Kacper Why · Answer 2 · Mon Apr 20 2015 23:27:37 GMT+0800 (China Standard Time)

FWIW the existing SYN/SYNACK sigs are from p0f, so using p0fv3 as a library somehow would be preferable if possible.

Chris Mavrakis · Answer 3 · Tue Apr 21 2015 19:28:12 GMT+0800 (China Standard Time)

Thanks for the reply. Since I posted here, I did some trials, prads vs p0f3 or 'along with' p0f3, and it's been fun to see how the different programs behave.

I also worked a bit on merging the two sig sets, but using p0fv3's format. v2 sigs are not completely useless, especially in environments where old systems dominate (e.g. Industrial Control Systems!), so it would be nice to have them together.

Anyway, I'm working on fingerprinting so I may or may not come back to prads. I'll let you know if I have something to contribute.

Kacper Why · Answer 4 · Tue Apr 21 2015 20:01:46 GMT+0800 (China Standard Time)

Hia, I agree that the old signatures still have some value.

Regardless of the approach you are taking, within or outside of prads, I would love to hear about it. Stuff like this rekindles my interest in the project.

For instance, how are you merging the two signature sets?

Chris Mavrakis · Answer 5 · Tue Apr 21 2015 20:23:50 GMT+0800 (China Standard Time)

Well, for now I just parse each v2 signature, and write the fields needed for v3, always characterizing it as "s" (and not "g"). But I still need to figure out what to do with olen (options length), wscale and pclass fields. I'll probably leave quirks empty.
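The conversion described above might look roughly like the sketch below. It is not the poster's script. It assumes the p0f v2 line layout `wsize:ttl:df:size:options:quirks:genre:details` and the p0f v3 layout `ver:ittl:olen:mss:wsize,scale:olayout:quirks:pclass`, and the option-name mapping is likewise an assumption. The `olen` and `pclass` values are placeholders for the fields that are still undecided, quirks are left empty, and `convert_v2_sig` and `os_class` are hypothetical names.

```python
# Assumed mapping: p0f v2 option letters -> p0f v3 olayout names.
# "eol+0" is a guess, since v2 does not record the padding after EOL.
V2_TO_V3_OPT = {"N": "nop", "E": "eol+0", "S": "sok", "T": "ts", "T0": "ts"}

OLEN_PLACEHOLDER = "0"    # placeholder: v2 has no IP-options-length field
PCLASS_PLACEHOLDER = "*"  # placeholder: v2 has no payload-class field


def convert_v2_sig(line, os_class):
    """Turn one v2-style SYN signature into a (label, sig) pair in v3-style syntax."""
    wsize, ttl, _df, _size, opts, _quirks, genre, details = line.strip().split(":", 7)

    mss, scale, layout = "*", "*", []
    for opt in opts.split(","):
        if opt == ".":               # v2 marker for "no options"
            continue
        if opt[0] == "M":            # Mnnn or M*: maximum segment size
            mss = opt[1:]
            layout.append("mss")
        elif opt[0] == "W":          # Wnnn or W*: window scale
            scale = opt[1:]
            layout.append("ws")
        elif opt[0] == "?":          # unknown option number, passed through
            layout.append(opt)
        else:
            layout.append(V2_TO_V3_OPT[opt])

    # v2 writes window sizes that are multiples of MSS/MTU as Snn/Tnn.
    if wsize[0] == "S":
        wsize = "mss*" + wsize[1:]
    elif wsize[0] == "T":
        wsize = "mtu*" + wsize[1:]

    label = ":".join(["s", os_class, genre, details])  # always "s", never "g"
    sig = ":".join([
        "*", ttl, OLEN_PLACEHOLDER, mss,
        wsize + "," + scale, ",".join(layout),
        "",                          # quirks intentionally left empty
        PCLASS_PLACEHOLDER,
    ])
    return label, sig


label, sig = convert_v2_sig("S4:64:1:60:M*,S,T,N,W5:.:Linux:2.6", "unix")
print("label =", label)
print("sig   =", sig)
```

The v2 DF flag and packet size are parsed but dropped here, because v3 expresses DF as a quirk and has no total-size field.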

It's just a side project and I haven't worked on it for a few weeks.

My main task now is to use p0fv3 sigs to train a classifier, which outputs a model, and then do recognition (fingerprinting) using that. It started as an effort to "improve p0fv3" by replacing the matching logic with machine learning enabled methods.

I kind of passed on prads (for now at least), as it gave me inconsistent performance; it "changed its mind" too much while processing new packets. p0fv3 was more conservative with its conclusions. Fewer results, but more consistent. I'm sure there's a way to tune prads, so maybe I'll come back to it later.

Kacper Why · Answer 6 · Tue Apr 21 2015 23:12:51 GMT+0800 (China Standard Time)

Your approach seems reasonable. If prads gets a p0fv3 classifier maybe we can extend the sigs to support quirks.
I like your machine learning approach, which I believe has the potential to be more powerful than the existing approaches, and at the very least may give us fresher fingerprints.
As far as the PRADS classifier goes, when you say inconsistent performance, I take it you mean signature match quality rather than processing speed? If so, then this is certainly a weakness of the p0fv2 approach, which just spits out the TCP/UDP signature match for the current packet. Patching prads to support the p0fv3 approach would be the way to go to get more consistency.
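To illustrate why per-packet reporting flaps, here is a minimal sketch of keeping per-host state and only reporting an OS once it has been seen several times. This is a generic vote-counting scheme, not what p0fv3 or PRADS actually implement, and `HostOsTracker` and `min_votes` are hypothetical names.

```python
from collections import Counter, defaultdict


class HostOsTracker:
    """Accumulate per-packet OS guesses and report a verdict per host."""

    def __init__(self, min_votes=3):
        self.min_votes = min_votes
        self.votes = defaultdict(Counter)  # host -> Counter of OS guesses

    def observe(self, host, os_guess):
        """Record one per-packet match and return the current verdict."""
        self.votes[host][os_guess] += 1
        return self.verdict(host)

    def verdict(self, host):
        """Return the leading OS guess, or None until it has enough votes."""
        counts = self.votes.get(host)
        if not counts:
            return None
        os_name, count = counts.most_common(1)[0]
        return os_name if count >= self.min_votes else None


tracker = HostOsTracker(min_votes=2)
for guess in ["Linux", "Windows", "Linux"]:
    print(tracker.observe("10.0.0.1", guess))
```

A single stray match no longer changes the reported OS; the trade-off is that hosts with few packets get no verdict at all.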

Chris Mavrakis · Answer 7 · Wed Apr 22 2015 19:26:55 GMT+0800 (China Standard Time)

Actually, I think I didn't explain it correctly... I didn't mean I'd use p0fv3's classifier (well, it's more of a sig matcher than a full blown classifier). What I'm trying to do is substitute p0fv3's matching logic with a classifier (e.g. decision tree), to improve matching, especially when there are no exact matches. Michal did put some fuzzy matching tricks in his code, but I want to see if a "proper" classifier can be more effective.

Regarding PRADS performance, yeah I wasn't talking about speed. Ok, good insight, I hadn't thought it was the sigs causing it. It was unfair since I was comparing PRADS and p0fv3, not v2.

So now I'm trying to decide which way to go. I can either patch the classifier into p0fv3, and then patch PRADS to use p0fv3 (sounds like a lot of work for a C-non-guru like me), or bypass everything and just go with Python, from packet capture to OS detection.

I'm planning on using Layer 7 info extraction, too (e.g. from SMB), and I believe in software re-use and not having a thousand small projects trying to achieve the same results, so going with PRADS makes sense...

Kacper Why · Answer 8 · Wed Apr 22 2015 23:12:41 GMT+0800 (China Standard Time)

Hi Chris,
I am wondering if your best approach would not be to make a prototype in a high-level language. We originally made PRADS in perl, and only when we were happy with the approach and started looking for speed did we rewrite it all in C. Zalewski is a C guru so for him this is not much of a tradeoff, but even when I write a lot of C code it's far more slow going.

The Perl version got us doing our core algorithm real fast, and allowed us to try out a whole lot of features that later on weren't that essential, but once we got there we quickly saturated the Perl performance model.

One thing that came out of the C version of PRADS was a packet capture scaffolding that had all the bits in place to make stand-alone tools possible, such as Gamelinux' cxtracker, passivedns and my edd, not to mention all the small packet capturing experiments that were never published. It also lets those tools run about as efficiently as a single-threaded program can.

I almost used the word "framework" here but in the interests of simplicity we never made a library of it, we just had a gutted version of PRADS around to handle everything from daemonization through logging to parsing of the lower-level protocols. Pretty much every packet capturing project rewrites those. To tell you the truth I still plan on making libraries out of most of it sometime.

I would of course love to have you contribute some patches to PRADS for such an interesting approach, and that's why I suggest you start with the high-level languages. That way you get to focus on your classifier.

Chris Mavrakis · Answer 9 · Mon May 04 2015 16:38:32 GMT+0800 (China Standard Time)

Hi! I went silent but I wasn't idle...

Thanks for your input on high/low level implementations. I get it now, and I'll probably come back to low level implementations in the future.

I did it in python (scapy is super easy) for this round. The main limitation is that scapy loads the whole pcap to memory at once, which is not practical for anything larger than a hundred MB (on my laptop). So I split my pcaps and just process them serially.
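An alternative to splitting the captures is to stream them one packet at a time. The sketch below does that with only the standard library, assuming classic (non-pcapng) files with microsecond timestamps; `iter_pcap` is a hypothetical helper, not part of scapy. As far as I know, scapy's own `PcapReader` class can also be iterated lazily instead of calling `rdpcap`.

```python
import struct


def iter_pcap(path):
    """Yield (timestamp, raw_bytes) per packet without loading the whole file."""
    with open(path, "rb") as fh:
        header = fh.read(24)  # pcap global header is 24 bytes
        if len(header) < 24:
            raise ValueError("file too short to be a pcap")
        magic = header[:4]
        if magic == b"\xd4\xc3\xb2\xa1":
            endian = "<"      # written on a little-endian machine
        elif magic == b"\xa1\xb2\xc3\xd4":
            endian = ">"      # written on a big-endian machine
        else:
            raise ValueError("not a classic microsecond pcap file")

        while True:
            record = fh.read(16)  # per-packet header: sec, usec, incl_len, orig_len
            if len(record) < 16:
                return            # clean end of file (or truncated header)
            ts_sec, ts_usec, incl_len, _orig_len = struct.unpack(endian + "IIII", record)
            data = fh.read(incl_len)
            if len(data) < incl_len:
                return            # truncated final packet
            yield ts_sec + ts_usec / 1e6, data
```

Each yielded `raw_bytes` value is the link-layer frame, so it can be handed to a parser one packet at a time and memory use stays flat regardless of capture size.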

Anyway, on to the main matter: I'm getting interesting results. I haven't validated any of it 100% scientifically, but using SMB data as ground truth (trusting that windows hosts tell the truth in their SMB session initiation messages), I measured performance for PRADS, p0fv3 and the classifier approach.

So, very preliminary results: out of 416 visible hosts on a network, SYNs are available for 385 of them. PRADS gives a result for 331 of them, p0f for 307, and my tool 332.
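Expressed as coverage of the 385 hosts with SYNs, the counts above work out as follows. This is just arithmetic on the numbers quoted in the post; the variable names are mine.

```python
visible_hosts = 416
hosts_with_syn = 385
results = {"PRADS": 331, "p0f": 307, "classifier": 332}

# Share of SYN-visible hosts for which each tool produced any result.
coverage = {tool: 100.0 * n / hosts_with_syn for tool, n in results.items()}

print(f"hosts with SYNs: {100.0 * hosts_with_syn / visible_hosts:.1f}% of visible hosts")
for tool, pct in coverage.items():
    print(f"{tool}: {pct:.1f}%")
# PRADS: 86.0%, p0f: 79.7%, classifier: 86.2%
```

Coverage only says a tool produced an answer, not that the answer was right.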

Now, I only validated the results for 13 hosts:

PRADS: 6 OK results (correct but wide), 5 wrong
p0f: 2 OK, 4 wrong, 7 precise
The classifier: 2 OK, 2 wrong, 9 precise

And that's not even with using all the features my approach could use (maybe that's why it does better :P). It's using only TTL, options length, win size, win scale and options layout (thanks p0fv3 ;) ). And just a simple decision tree classifier (modified to work with wildcards). Maybe other classifiers (I'm thinking random forest) do even better!
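For readers wondering what "modified to work with wildcards" could mean, here is a toy decision tree in plain Python. It is not the RapidMiner module used here: the wildcard handling (a `*` training row is copied into every branch), the Gini-based split choice, the fallback to the majority label for unseen values, and the training rows are all my own assumptions for illustration.

```python
from collections import Counter

WILDCARD = "*"


def gini(rows):
    """Gini impurity of the labels in rows (0.0 means a pure node)."""
    counts = Counter(label for _, label in rows)
    return 1.0 - sum((n / len(rows)) ** 2 for n in counts.values())


def split(rows, feature):
    """Partition rows by feature value; wildcard rows go into every branch."""
    values = {feats[feature] for feats, _ in rows} - {WILDCARD}
    return {v: [r for r in rows if r[0][feature] in (v, WILDCARD)] for v in values}


def build_tree(rows, features):
    majority = Counter(label for _, label in rows).most_common(1)[0][0]
    if gini(rows) == 0.0 or not features:
        return majority

    def cost(feature):
        branches = split(rows, feature)
        if len(branches) < 2:          # splitting on this feature separates nothing
            return float("inf")
        return sum(len(b) * gini(b) for b in branches.values())

    best = min(features, key=cost)
    if cost(best) == float("inf"):
        return majority
    rest = [f for f in features if f != best]
    return {
        "feature": best,
        "default": majority,           # used when a sample has an unseen value
        "branches": {v: build_tree(b, rest) for v, b in split(rows, best).items()},
    }


def classify(tree, sample):
    while isinstance(tree, dict):
        tree = tree["branches"].get(sample[tree["feature"]], tree["default"])
    return tree


# Made-up training rows in the spirit of v3 signatures (not real fingerprints).
rows = [
    ({"ttl": "64", "wsize": "*", "scale": "7"}, "Linux"),
    ({"ttl": "64", "wsize": "mss*4", "scale": "7"}, "Linux"),
    ({"ttl": "64", "wsize": "65535", "scale": "6"}, "FreeBSD"),
    ({"ttl": "128", "wsize": "8192", "scale": "8"}, "Windows"),
    ({"ttl": "128", "wsize": "*", "scale": "*"}, "Windows"),
]
tree = build_tree(rows, ["ttl", "wsize", "scale"])
print(classify(tree, {"ttl": "64", "wsize": "65535", "scale": "6"}))
print(classify(tree, {"ttl": "64", "wsize": "29200", "scale": "9"}))
```

The second sample matches no training row exactly, yet still gets the majority label of the deepest node it reaches, which is the behaviour an exact signature matcher lacks.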

So, I now need to digest the results, re-implement the classifier in Python (it's running in RapidMiner for now), measure properly, try improvements, etc.

CU!

EDIT: PRADS detects and reports a load balancer in 6 of the 13 cases mentioned above. I need to verify the topology of the network before I trust the results and judge the tools. I'm not sure whether it's PRADS or p0fv3/classifier/myself that is being tricked by the balancer.

Psipher Diaz · Answer 10 · Fri May 08 2015 00:11:31 GMT+0800 (China Standard Time)

Curious whether you're using a released classifier like Weka or writing your own?

Also I wanted to mention the excellent nDPI library for Layer7 stuff which it seems may be fairly simple to integrate into PRADS. I have added pf_ring support to PRADS to enhance performance. Setting packet poll watermark to 128 greatly reduced the load.

Lastly, there was a script that was part of PADS that converted nmap service probes to PADS rules. I have had some success using this after some small modifications. Some signatures would cause PRADS to segfault due to what I think was an issue with an incorrect number of '/' fields. Source is below:

#!/bin/bash
# Convert nmap service probes ($1) into PADS rules, appended to $2.

INFILE=$1
OUTFILE=$2
COMMENT=""
MAXFIELDS=9

echo
echo "# Signatures auto converted from $INFILE"
echo
# Read the input line by line; -r keeps the backslashes in the regexes intact.
while IFS= read -r LINE || [ -n "$LINE" ]; do
    if printf '%s\n' "$LINE" | grep -q '^match'; then
        # comment out lines containing SUBST - FIXME
        # (COMMENT is set here but not yet applied to the output)
        if printf '%s\n' "$LINE" | grep -q SUBST; then
            COMMENT="#"
        fi
        printf '%s\n' "$LINE"
        PROTOCOL=$(printf '%s\n' "$LINE" | awk '{print $2}')
        # p/ = product name, v/ = version, i/ h/ o/ d/ = extra info fields
        S1=$(printf '%s\n' "$LINE" | sed -n 's/.* p\///p' | awk -F'/ ' '{ print $1 }')
        S2=$(printf '%s\n' "$LINE" | sed -n 's/.* v\///p' | awk -F'/ ' '{ print $1 }')
        S3=$(printf '%s\n' "$LINE" | sed -n -E 's/.* (i|h|o|d)\///p' | awk -F'/ ' '{ print $1 }')
        # The match regex sits between "m<delim>" and "<delim> p/"
        GREP=$(printf '%s\n' "$LINE" | sed -n 's/.* m[[:punct:]]//p' | sed -n 's/[[:punct:]] p\/.*$//p')
        #GREP=$(printf '%s\n' "$LINE" | sed -n 's/\.\* m[[:punct:]]//p' | sed -n 's/[[:punct:]]\.\*p\/\.\*$//p')
        SERVICE="${S1}/${S2}/${S3}"
        echo "PROTOCOL: ${PROTOCOL}"
        echo "SERVICE: ${SERVICE}"
        echo "GREP: ${GREP}"
        printf '%s\n' "${PROTOCOL},v/${SERVICE},${GREP}" >> "$OUTFILE"
    elif printf '%s\n' "$LINE" | grep -q '^# '; then
        # Pass through comment lines from the probe file
        printf '%s\n' "$LINE"
    fi
done < "$INFILE"
exit 0

Chris Mavrakis · Answer 11 · Wed Jun 10 2015 19:42:00 GMT+0800 (China Standard Time)

CyberTaoFlow, I used the decision tree module that is implemented in RapidMiner, but I had to modify it to handle wildcards ("*") found in p0fv3's signatures.

Thanks for the software suggestions, but I already have most of my code in Python. Scapy is slow, but OK for a proof of concept (after I've split the pcaps into bite-size pieces to avoid choking scapy).