csirtgadgets / cif-v5

The FASTEST way to consume threat intel.

Home Page: https://csirtgadgets.com

Large time-based feed calls crash

TheDropZone opened this issue

Do you have a support subscription?

Yes, FortifyData.

Did you check the FAQ (https://github.com/csirtgadgets/cif-v5/wiki/FAQ)?

I made sure that my memory and storage were at acceptable levels according to docker stats.

Expected behavior and actual behavior.

```
curl -X GET "http://localhost:5000/indicators?limit=10000&itype=ipv4&hours=1&nolog=1&confidence=3" -H "accept: application/json"
```

When running the command above, whether through curl, Swagger, or the Python CLI (with the same arguments), the logs show the CIF server crashing while trying to process the data being returned. Querying the database directly, I can see that requesting the past hour should give me ~2k-3k records; the database is sitting at 218M. I would expect those records to be returned.
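For reference, the same call from Python (outside the CIF CLI) looks roughly like this; a minimal sketch assuming the default localhost:5000 endpoint with no auth token, mirroring the curl above:

```python
# Same failing request from Python, mirroring the curl above.
# Assumes cif-httpd is listening on localhost:5000 with no auth token required.
import requests

params = {'limit': 10000, 'itype': 'ipv4', 'hours': 1, 'nolog': 1, 'confidence': 3}
resp = requests.get('http://localhost:5000/indicators',
                    params=params,
                    headers={'accept': 'application/json'},
                    timeout=120)  # allow more than the server's ~60s internal ZMQ timeout
print(resp.status_code)           # 503 when the bug triggers, after roughly 60 seconds
```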

As a workaround, I have found that the /indicators POST endpoint actually works (although I think you lose the benefits of the whitelist that way, right?). The downside of the POST endpoint is that you don't get the automatic "reported_at" processing for hours/days/etc. So if you duplicate the "calc_reported_at_window" logic and generate "reported_at" on the requesting side, it will return the expected ~2k-3k results, which can be verified against the database (a sketch of this follows the curl example below).

```
curl -X POST "http://3.21.225.128:5000/indicators" -H "accept: application/json" -H "Content-Type: application/json" -d "[ { \"itype\": \"ipv4\", \"confidence\": 3, \"limit\": 50000, \"nolog\": 1, \"reported_at\": \"2020-03-24T11:11:20,2020-03-24T12:41:20\" }]"
```
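Roughly what I'm doing on the requesting side (a sketch only; the window calculation below is my own stand-in for the server's calc_reported_at_window, and the localhost host is a placeholder for your cif-httpd address):

```python
# Sketch of the POST workaround: compute the reported_at window on the client
# (standing in for the server-side calc_reported_at_window) and POST it to
# /indicators. The "start,end" window format mirrors the curl example above;
# the host and timeout values are placeholders for your setup.
from datetime import datetime, timedelta, timezone

import requests

def reported_at_window(hours=1):
    now = datetime.now(timezone.utc)
    start = now - timedelta(hours=hours)
    fmt = '%Y-%m-%dT%H:%M:%S'
    return f'{start.strftime(fmt)},{now.strftime(fmt)}'

filters = [{
    'itype': 'ipv4',
    'confidence': 3,
    'limit': 50000,
    'nolog': 1,
    'reported_at': reported_at_window(hours=1),
}]

resp = requests.post('http://localhost:5000/indicators',  # replace with your cif-httpd host
                     json=filters,
                     headers={'accept': 'application/json'},
                     timeout=120)
print(resp.status_code)   # the matching indicators come back in the response body
```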

Steps to reproduce the problem

  1. Spin up your CIF server and let it run for a few days in order to build a large database.
  2. Using the CIF CLI, curl, or Swagger, request a feed of the past hour with itype=ipv4, confidence=3, limit=10000, and nolog=1.
  3. Check the logs

Relevant logs as a result of the actual behavior

```
cif-httpd       | [2020-03-24 12:49:01,870] ERROR in app: Exception on /indicators [GET]
cif-httpd       | Traceback (most recent call last):
cif-httpd       |   File "/usr/local/lib/python3.7/site-packages/cif-5.0b6-py3.7.egg/cif/http/feeds/__init__.py", line 22, in _get
cif-httpd       |     r = client.indicators_search(filters)
cif-httpd       |   File "/usr/local/lib/python3.7/site-packages/cifsdk/client/zeromq/__init__.py", line 22, in indicators_search
cif-httpd       |     decode=decode)
cif-httpd       |   File "/usr/local/lib/python3.7/site-packages/cifsdk/client/zeromq/base.py", line 198, in _send
cif-httpd       |     return self._recv(decode=decode, close=self.autoclose)
cif-httpd       |   File "/usr/local/lib/python3.7/site-packages/cifsdk/client/zeromq/base.py", line 161, in _recv
cif-httpd       |     m = self.socket.recv_msg()
cif-httpd       |   File "/usr/local/lib/python3.7/site-packages/cifsdk/zmq/socket.py", line 24, in recv_msg
cif-httpd       |     m = self.recv_multipart()
cif-httpd       |   File "/usr/local/lib/python3.7/site-packages/zmq/sugar/socket.py", line 475, in recv_multipart
cif-httpd       |     parts = [self.recv(flags, copy=copy, track=track)]
cif-httpd       |   File "zmq/backend/cython/socket.pyx", line 791, in zmq.backend.cython.socket.Socket.recv
cif-httpd       |   File "zmq/backend/cython/socket.pyx", line 827, in zmq.backend.cython.socket.Socket.recv
cif-httpd       |   File "zmq/backend/cython/socket.pyx", line 191, in zmq.backend.cython.socket._recv_copy
cif-httpd       |   File "zmq/backend/cython/socket.pyx", line 186, in zmq.backend.cython.socket._recv_copy
cif-httpd       |   File "zmq/backend/cython/checkrc.pxd", line 20, in zmq.backend.cython.checkrc._check_rc
cif-httpd       | zmq.error.Again: Resource temporarily unavailable
cif-httpd       |
cif-httpd       | During handling of the above exception, another exception occurred:
cif-httpd       |
cif-httpd       | Traceback (most recent call last):
cif-httpd       |   File "/usr/local/lib/python3.7/site-packages/cif-5.0b6-py3.7.egg/cif/http/indicators.py", line 130, in get
cif-httpd       |     rv = get_feed(f)
cif-httpd       |   File "/usr/local/lib/python3.7/site-packages/cif-5.0b6-py3.7.egg/cif/http/feeds/__init__.py", line 60, in get_feed
cif-httpd       |     _get_whitelist(filters)
cif-httpd       |   File "/usr/local/lib/python3.7/site-packages/cif-5.0b6-py3.7.egg/cif/http/feeds/__init__.py", line 80, in _get_whitelist
cif-httpd       |     return aggregate(_get(wl_filters))
cif-httpd       |   File "/usr/local/lib/python3.7/site-packages/cif-5.0b6-py3.7.egg/cif/http/feeds/__init__.py", line 25, in _get
cif-httpd       |     raise ConnectionError
cif-httpd       | ConnectionError
cif-httpd       |
cif-httpd       | During handling of the above exception, another exception occurred:
cif-httpd       |
cif-httpd       | Traceback (most recent call last):
cif-httpd       |   File "/usr/local/lib/python3.7/site-packages/flask/app.py", line 1949, in full_dispatch_request
cif-httpd       |     rv = self.dispatch_request()
cif-httpd       |   File "/usr/local/lib/python3.7/site-packages/flask/app.py", line 1935, in dispatch_request
cif-httpd       |     return self.view_functions[rule.endpoint](**req.view_args)
cif-httpd       |   File "/usr/local/lib/python3.7/site-packages/flask_restplus/api.py", line 325, in wrapper
cif-httpd       |     resp = resource(*args, **kwargs)
cif-httpd       |   File "/usr/local/lib/python3.7/site-packages/flask/views.py", line 89, in view
cif-httpd       |     return self.dispatch_request(*args, **kwargs)
cif-httpd       |   File "/usr/local/lib/python3.7/site-packages/flask_restplus/resource.py", line 44, in dispatch_request
cif-httpd       |     resp = meth(*args, **kwargs)
cif-httpd       |   File "/usr/local/lib/python3.7/site-packages/cif-5.0b6-py3.7.egg/cif/http/indicators.py", line 143, in get
cif-httpd       |     api.abort(503)
cif-httpd       |   File "/usr/local/lib/python3.7/site-packages/flask_restplus/namespace.py", line 141, in abort
cif-httpd       |     abort(*args, **kwargs)
cif-httpd       |   File "/usr/local/lib/python3.7/site-packages/flask_restplus/errors.py", line 31, in abort
cif-httpd       |     flask.abort(code)
cif-httpd       |   File "/usr/local/lib/python3.7/site-packages/werkzeug/exceptions.py", line 772, in abort
cif-httpd       |     return _aborter(status, *args, **kwargs)
cif-httpd       |   File "/usr/local/lib/python3.7/site-packages/werkzeug/exceptions.py", line 753, in __call__
cif-httpd       |     raise self.mapping[code](*args, **kwargs)
cif-httpd       | werkzeug.exceptions.ServiceUnavailable: 503 Service Unavailable: The server is temporarily unable to service your request due to maintenance downtime or capacity problems. Please try again later.
```

Did you attempt to fix the problem and submit a pull request?

I traced the GET call all the way through the code and it seems to be a ZMQ error. When the get_feed -> _get_whitelist -> _get call chain runs, the error is raised in the http.feeds._get method because the underlying ZMQ request fails and is re-raised as a ConnectionError. I'm not super familiar with Python messaging queues, but maybe a bandwidth issue? I'm not sure.
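For what it's worth, the zmq.error.Again in the trace is what pyzmq raises when a receive times out rather than a true connection drop; a toy example (not CIF code) that reproduces the same exception:

```python
# Toy demonstration (not CIF code): a REQ socket with RCVTIMEO set raises
# zmq.error.Again ("Resource temporarily unavailable") when no reply arrives
# in time -- the same exception seen at the top of the cif-httpd trace above.
import zmq

ctx = zmq.Context()
sock = ctx.socket(zmq.REQ)
sock.setsockopt(zmq.RCVTIMEO, 1000)   # 1s for the demo; cif-httpd uses 60000 (60s)
sock.connect('tcp://127.0.0.1:5999')  # nothing is listening here
sock.send(b'ping')                    # queued on the pipe even without a live peer

try:
    sock.recv()
except zmq.error.Again:
    print('timed out waiting for a reply (zmq.error.Again)')
finally:
    sock.close(linger=0)
    ctx.term()
```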

Specifications like the version of the project, operating system, or hardware.
Ubuntu 18.04 LTS, AWS EC2 t2.large, 2 CPUs, 8 GB RAM, 25 GB storage.

How large is your /var/lib/cif.db database?
~230M
commented

Try setting `CIF_HUNT: 0` in the cif-router section of your docker-compose. The hunters can put a lot of pressure on the cif-store thread running on SQLite, which is why they're off by default.
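roughly something like this (a sketch only; merge it into the cif-router service you already have):

```yaml
  # sketch only -- merge into your existing cif-router service definition
  cif-router:
    environment:
      CIF_HUNT: 0   # turn the hunters off to take pressure off the SQLite-backed cif-store
```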

> Try setting `CIF_HUNT: 0` in the cif-router section of your docker-compose. The hunters can put a lot of pressure on the cif-store thread running on SQLite, which is why they're off by default.

Could I solve that by allocating more CPU? More bandwidth? I'm quite partial to the hunters.

commented

does it resolve the issue? if so, how long between the request and when you see those error timestamps... (it should be ~60sec), just want to make sure we're measuring apples to apples here. what does your cif-httpd section in docker-compose look like?

I apologize, I've been out of the office and haven't been able to try this. I can confirm that it is right around ~60 seconds, which I would guess coincides with the ZMQ timeout?
This is my cif-httpd section of docker-compose, which is just pulled from the repo:

```yaml
  # REST interface
  cif-httpd:
    depends_on:
      - cif-router
    image: csirtgadgets/cif-httpd:latest
    container_name: cif-httpd
    volumes:
      - ./datav5/cif:/var/lib/cif
    ports:
      - '5000:5000'
    environment:
      ZMQ_RCVTIMEO: 60000
      CIF_LOGLEVEL: "INFO"
```
Like I mentioned before, I currently have production code using the /indicators POST endpoint with all my filters (confidence, faked hours via reported_at formatting, limit, nolog, itype). So far this setup has been working perfectly, with really fast response times. I have noticed that a few of my returned records (google.com 😆) didn't get whitelist filtered, since that endpoint doesn't do whitelist filtering, but so far that hasn't been a dealbreaker.

commented

try ZMQ_RCVTIMEO: 90000 (i think you have to hard-restart the nodes, or even bring them all down && up)-- there's a few things going on here- sometimes t2 style ec2 boxes start clamping down on IO performance over time (which makes things more sluggish), so testing this on a local box where you control the IO might yield different results.. (eg: since hunters are writing a lot over longer periods of time, as does csirtg-fm, etc). i've run into this from time to time with the ip feeds on ec2, but haven't had a chance to test it locally (over long periods of time) yet either..
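for concreteness, the change amounts to something like this in the cif-httpd environment (a sketch; adjust to your own file), followed by a full `docker-compose down && docker-compose up -d`:

```yaml
    environment:
      ZMQ_RCVTIMEO: 90000   # up from 60000; gives the store more time before the REQ socket gives up
      CIF_LOGLEVEL: "INFO"
```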

Ok, I'll give that a try. I really do like the communication setup you have going with ZMQ, though.

And that makes sense with the AWS instance, since that box has been running for almost two weeks, I think. The problem didn't seem to start until it had been running for a long period of time, like you mentioned, which makes both testing and fixing that much harder.

Thanks for the suggestions. I'll give them a try

commented

fwiw- i did some perf testing, and it also seems like the sqlite joins between indicators and tags get slower with 35k+ ips.. (an artifact of this all being sqlite). in 5.0b11 i made some adjustments to the default feed settings which should at-least stop the errors by default. mostly it just means if you wanna do feeds with sqlite you need to either do them semi-hourly or start building a custom data store plugin kinda like: https://github.com/csirtgadgets/verbose-robot-elasticsearch (the v4 version..).

@wesyoung that definitely makes sense, being that it is SQLite. I'll pull down the latest version and give it a shot!

As far as the custom plugin goes, I had actually thought about asking you about that. When I started using cif-v5, I looked through that Elasticsearch plugin for v4 and thought about doing something similar for v5. My thought was to use something like Redis or another in-memory database for super fast lookups. With Redis (or something similar) you could also take advantage of the built-in pub/sub and wire that into the store plugin for easy pub/sub ingestion on the application side. The only downside is that Redis and the like are key/value stores, so you would have to get creative to be able to look up threat intel by anything other than the indicator itself.
Either way, it would be nice to build a plugin that could make use of a database on the network, so that AWS (or whatever cloud platform is used) could manage the scaling and reliability of the database.
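To sketch the "get creative" part: one way to make feed-style queries work on a key/value store is to keep secondary-index sets next to the indicator records. This is just a hypothetical key layout (not an existing CIF plugin), but it shows the idea:

```python
# Hypothetical key layout for a Redis-backed store (not an existing CIF plugin):
#   indicator:<value>   -> JSON blob for the indicator
#   idx:itype:<itype>   -> set of indicator values of that type
#   idx:confidence:<n>  -> set of indicator values at that confidence
#   idx:reported_at     -> sorted set scored by epoch seconds (for time windows)
import json
import time

import redis

r = redis.Redis(host='localhost', port=6379, decode_responses=True)

def store_indicator(i):
    r.set(f"indicator:{i['indicator']}", json.dumps(i))
    r.sadd(f"idx:itype:{i['itype']}", i['indicator'])
    r.sadd(f"idx:confidence:{i['confidence']}", i['indicator'])
    r.zadd('idx:reported_at', {i['indicator']: time.time()})

def feed(itype, confidence, hours=1):
    # intersect the type/confidence indexes, then filter by the time window
    candidates = r.sinter(f"idx:itype:{itype}", f"idx:confidence:{confidence}")
    cutoff = time.time() - hours * 3600
    recent = {m for m, _ in r.zrangebyscore('idx:reported_at', cutoff, '+inf', withscores=True)}
    return [json.loads(r.get(f"indicator:{v}")) for v in candidates & recent]

store_indicator({'indicator': '192.0.2.1', 'itype': 'ipv4', 'confidence': 3})
print(feed('ipv4', 3))
```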

If I were to start on a plugin like that in the next 3 weeks or so, any guidelines or suggestions for it? Thanks

commented
  • start with the sqlite plugin and work to make it "in memory" only (an sqlite option; it'll give you an easy bridge into writing a plugin, seeing what the functions do, checking performance, etc.) -- see the sketch after this list
  • keep it simple
  • then mirror that work in redis
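for the first bullet, sqlite's in-memory mode is basically a one-line change at connection time; a minimal illustration with the stdlib sqlite3 module (just the storage idea under a made-up schema; a real plugin would wrap this in the CIF store interface):

```python
# Minimal illustration of SQLite's in-memory mode; the table layout here is
# made up for the example, not the actual CIF schema.
import sqlite3

con = sqlite3.connect(':memory:')          # nothing touches disk
con.execute('CREATE TABLE indicators (indicator TEXT, itype TEXT, confidence INTEGER)')
con.execute("INSERT INTO indicators VALUES ('192.0.2.1', 'ipv4', 3)")
rows = con.execute(
    'SELECT indicator FROM indicators WHERE itype = ? AND confidence >= ?',
    ('ipv4', 3),
).fetchall()
print(rows)   # [('192.0.2.1',)]
con.close()
```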

a buddy of mine did something similar (not using CIF, but his own custom tool) with redis:

https://github.com/MattCarothers/threatstash/blob/master/plugins/filter-oil-redis.py

which might be helpful too. i've found when going to k/v storage tools like redis, you're either "just doing single lookups" or getting creative to make feeds work.. you almost have to pick which one you want..

hth, be interested in how you end up solving it.

@wesyoung
That sounds like a good starting point. Thanks for the link as well. Whether or not it ends up being Redis, I do really want to get started on a plugin so that we can manage a scalable database inside of AWS for this data. Thanks.