Production implementation

SyGen899 opened this issue 3 years ago · comments

Hi, first off, this is really cool. I'm a novice coder and, for research, I would like to run this on NetFlow data in real time. The only thing is, I'm unsure how it can be integrated into a live environment rather than run on a local dataset. Maybe it's a dumb question, but how should or could this be implemented?

Rui LIU · Answer 1 · Mon Mar 01 2021 06:15:55 GMT+0800 (China Standard Time)

Hi, thanks for your interest. I'm not very clear about your requirements, though. Could you please give more details?

SyGen899 · Answer 2 · Mon Mar 01 2021 06:58:31 GMT+0800 (China Standard Time)

Thanks for your response. If I'm understanding this correctly, it needs to run constantly to get better at detecting anomalies on a network: stream the data, get a new edge, score it, then classify it. What I don't understand is what happens if the system it runs on goes down. Is there a way to store what the algorithm has learned, as a backup or something? I read about the count-min sketch. Is it only created in memory and lost if a failure happens, or does this not matter?

Rui LIU · Answer 3 · Mon Mar 01 2021 10:33:22 GMT+0800 (China Standard Time)

I think you would want to periodically back up the algorithm state and the CMS (count-min sketch) state to a local file. The current implementation is a fairly minimal version, so everything is in memory except for the outputs. For the backup, I think most variables are worth saving, except for a small index array that carries hashing results back from the CMS.
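To make the backup idea concrete, here is a minimal sketch of writing a sketch's counters to a binary file and reading them back. `SketchState`, `saveSketch`, and `loadSketch` are hypothetical names made up for illustration; they are not part of this repository, and the real classes may hold different fields.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <fstream>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for the persistent part of one count-min sketch.
struct SketchState {
    std::uint32_t rows = 0;
    std::uint32_t cols = 0;
    std::vector<float> counts;  // rows * cols counters, row-major
};

// Writes the dimensions followed by the raw counters. Returns false on I/O error.
bool saveSketch(const SketchState& s, const std::string& path) {
    assert(s.counts.size() == static_cast<std::size_t>(s.rows) * s.cols);
    std::ofstream out(path, std::ios::binary | std::ios::trunc);
    if (!out) return false;
    out.write(reinterpret_cast<const char*>(&s.rows), sizeof s.rows);
    out.write(reinterpret_cast<const char*>(&s.cols), sizeof s.cols);
    out.write(reinterpret_cast<const char*>(s.counts.data()),
              static_cast<std::streamsize>(s.counts.size() * sizeof(float)));
    out.flush();
    return static_cast<bool>(out);
}

// Reads a file written by saveSketch. Leaves `s` untouched on failure.
bool loadSketch(SketchState& s, const std::string& path) {
    std::ifstream in(path, std::ios::binary);
    if (!in) return false;
    SketchState tmp;
    in.read(reinterpret_cast<char*>(&tmp.rows), sizeof tmp.rows);
    in.read(reinterpret_cast<char*>(&tmp.cols), sizeof tmp.cols);
    if (!in) return false;
    tmp.counts.resize(static_cast<std::size_t>(tmp.rows) * tmp.cols);
    in.read(reinterpret_cast<char*>(tmp.counts.data()),
            static_cast<std::streamsize>(tmp.counts.size() * sizeof(float)));
    if (!in) return false;
    s = std::move(tmp);
    return true;
}
```

The counters are only meaningful together with the hash parameters that produced them, so those would need to be saved and restored the same way. The raw binary layout here is also not portable across machines with different endianness.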

SyGen899 · Answer 4 · Mon Mar 01 2021 15:34:20 GMT+0800 (China Standard Time)

OK, I understand a lot better now. Do you know of a way I could do this with the current implementation? How should I go about storing these states?

Rui LIU · Answer 5 · Mon Mar 01 2021 20:35:54 GMT+0800 (China Standard Time)

For example, you can save the states to a local file (in whatever format you prefer) every 10M edges. You don't need to modify the core, since those data structures only use public members. Just add a wrapper, like example/Demo.cpp, and do the saving there.
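The "every N edges" part of such a wrapper could look roughly like the sketch below. `CheckpointEvery` and `onEdge` are hypothetical names, and the save callback is a placeholder for whatever code actually writes the state to disk.

```cpp
#include <cassert>
#include <cstdint>
#include <functional>
#include <utility>

// Hypothetical helper: runs a save callback once every `interval` edges.
class CheckpointEvery {
public:
    CheckpointEvery(std::uint64_t interval, std::function<void()> save)
        : interval_(interval), save_(std::move(save)) {
        assert(interval_ > 0);
    }

    // Call once per processed edge. Returns true if a checkpoint was taken.
    bool onEdge() {
        if (++sinceLast_ < interval_) return false;
        save_();
        sinceLast_ = 0;
        return true;
    }

private:
    std::uint64_t interval_;
    std::uint64_t sinceLast_ = 0;
    std::function<void()> save_;
};
```

In a streaming loop you would score each incoming edge first and then call `onEdge()`. With an interval of 10,000,000 this matches the "every 10M edges" suggestion. Writing to a temporary file and renaming it afterwards is one way to avoid a half-written backup if the process dies mid-save.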

SyGen899 · Answer 6 · Tue Mar 02 2021 21:38:52 GMT+0800 (China Standard Time)

Thanks. Another question: how should a threshold be defined for this? Is there an implementation available?

Rui LIU · Answer 7 · Thu Mar 04 2021 16:07:02 GMT+0800 (China Standard Time)

If you mean a threshold for deciding whether an edge is anomalous, no, the algorithm only gives raw scores. But you can use a small sample of scores as a baseline.
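One possible way to turn a baseline sample into a threshold is to take a high quantile of the sampled scores and flag anything above it. This is only an illustration under that assumption, not something the algorithm prescribes; `baselineThreshold` and `isAnomalous` are hypothetical names, and the choice of quantile is up to you.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Returns the q-quantile (nearest-rank method) of a baseline sample of scores.
// The sample is taken by value because nth_element reorders it.
double baselineThreshold(std::vector<double> sample, double q) {
    assert(!sample.empty() && q >= 0.0 && q <= 1.0);
    std::size_t k = static_cast<std::size_t>(
        std::ceil(q * static_cast<double>(sample.size())));
    if (k > 0) --k;  // convert 1-based rank to 0-based index
    auto kth = sample.begin() + static_cast<std::ptrdiff_t>(k);
    std::nth_element(sample.begin(), kth, sample.end());
    return *kth;
}

// Flags a raw score as anomalous if it exceeds the chosen threshold.
bool isAnomalous(double score, double threshold) {
    return score > threshold;
}
```

For example, `baselineThreshold(scores, 0.99)` would treat roughly the top 1% of baseline scores as the cut-off. The baseline should come from traffic you believe to be mostly normal.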

SyGen899 · Answer 8 · Wed Mar 10 2021 19:45:20 GMT+0800 (China Standard Time)

Sorry, another question: would this be effective on sampled NetFlow data, i.e. aggregated at n intervals?

Rui LIU · Answer 9 · Wed Mar 10 2021 20:52:55 GMT+0800 (China Standard Time)

Sorry, I can't give a clear answer. Maybe you can try it once and see if there's any problem.