mozilla-services / hindsight

Hindsight - light weight data processing skeleton

Tail input being ignored

deric opened this issue · comments

I'm running on Hindsight 0.15.5.

After a service restart, Hindsight is unable to continue processing a fairly large log file (11G).

The configuration is pretty much the default; I only increased instruction_limit:

input_defaults = {
  instruction_limit = 1e8,
  restricted_headers = false,
}

utilization.tsv shows 0 processed messages:

Messages Processed      0
% Utilization           -1
% Message Matcher       -1
% Process Message       -1
% Timer Event           -1

plugins.tsv:

Inject Message Count            0
Inject Message Bytes            0
Process Message Count           1232
Process Message Failures        0
Current Memory                  706510
Max Memory                      2199359
Max Output                      0
Max Instructions                4127
Message Matcher Avg (ns)        0
Message Matcher SD (ns)         0
Process Message Avg (ns)        20270
Process Message SD (ns)         8925
Timer Event Avg (ns)            0
Timer Event SD (ns)             0

Hindsight opens the log at a certain offset and then does nothing:

output.kafka-nginx opened file: /var/cache/hindsight/input/54703.log offset: 10889177
output.kafka-nginx opened file: /var/cache/hindsight/analysis/0.log

I'm able to run hindsight ./hindsight.cfg 7 with just a single input and process the whole log file without any issues.

The offset from hindsight.cp:

_G['input.rapi_access'] = 24099852323

That checkpoint is well beyond the end of an 11GB file. A fix was put in about half a year ago to detect this condition: mozilla-services/lua_sandbox_extensions@305355f#diff-eccc21c8229f427c40108995dff20ba0. If you are using the 1.6.7 version of the lfs package you should see a debug message logged and the checkpoint being reset. Please give that a try. I would be more interested in how the checkpoint got out of sync (e.g. did the file roll/change after HS was shut down?). The fix above would at least process the new data, but anything written between the HS shutdown and the roll would not be processed.
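The condition described above (a saved checkpoint that points past the end of the file) can be illustrated with a minimal sketch. This is not the code from the referenced commit; the function name `validate_checkpoint` and the choice to restart from offset 0 are assumptions made for illustration only.

```python
import os


def validate_checkpoint(path, checkpoint):
    """Return a usable byte offset for `path`.

    Hypothetical helper: if the saved checkpoint lies beyond the current
    end of the file, treat it as stale and restart from the beginning.
    """
    size = os.path.getsize(path)
    if checkpoint > size:
        # The file is shorter than the checkpoint, so the checkpoint
        # cannot refer to this file's contents (e.g. the file was rolled).
        return 0
    return checkpoint
```

With the values from this report, a checkpoint of 24099852323 against a file of roughly 11GB would be rejected and reading would restart at 0.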

The server is running luasandbox-lfs 1.6.6; I'll try to update that. How is the offset computed? Can I reset it, e.g. a few thousand lines backwards? Is it based on the number of lines in the log file?

There's a fairly standard logrotate config:

/var/log/nginx/*.log {
        daily
        missingok
        rotate 14
        compress
        delaycompress
        notifempty
        create 0640 www-data adm
        sharedscripts
        prerotate
                if [ -d /etc/logrotate.d/httpd-prerotate ]; then \
                        run-parts /etc/logrotate.d/httpd-prerotate; \
                fi \
        endscript
        postrotate
                invoke-rc.d nginx rotate >/dev/null 2>&1
        endscript
}

The log file was rotated 8 hours prior to the incident. HS was restarted normally (due to a config change on an unrelated input), and no errors were logged:

Apr 01 14:24:18 f02 hindsight[1375]: 1585751058747358709 [info] hindsight stop signal received
...
Apr 01 14:24:19 f02 hindsight[1375]: 1585751059413546198 [info] analysis_plugins exiting thread: 0
Apr 01 14:24:28 f02 hindsight[1375]: 1585751068747553022 [warning] input.udp sandbox did not respond to a clean stop
Apr 01 14:24:30 f02 hindsight[1375]: 1585751070747660316 [warning] input.udp sandbox did not respond to a forced stop
Apr 01 14:24:32 f02 hindsight[1375]: 1585751072813350200 [info] hindsight exiting

After that, HS started with an incorrect offset.

I could have used the offset from my debugging cache, which is obviously much smaller:

# bad offset
24099852323
# debug offset
11071764038

The current log file has about 20655128 lines; the rotated one has 26332716 lines (17GB).
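On the question above about moving the offset back by a few thousand lines: assuming the checkpoint is a plain byte offset into the file (which the comparison against the file size in this thread suggests, but which is not confirmed here), a line count first has to be converted into a byte position. The sketch below does that by scanning backwards for newlines. The function name `offset_n_lines_back` and its parameters are hypothetical, and `start` is assumed to sit on a line boundary.

```python
def offset_n_lines_back(path, start, n_lines, block=64 * 1024):
    """Return the byte offset of the line that begins n_lines before `start`.

    `start` is assumed to be a line boundary (for example the file size of
    a newline-terminated log). Returns 0 if the file has fewer lines.
    """
    remaining = n_lines
    with open(path, "rb") as fh:
        pos = start
        while pos > 0:
            read_from = max(0, pos - block)
            fh.seek(read_from)
            chunk = fh.read(pos - read_from)
            idx = len(chunk)
            # Walk the newlines in this chunk from last to first.
            while True:
                idx = chunk.rfind(b"\n", 0, idx)
                if idx == -1:
                    break
                if remaining == 0:
                    # The wanted line starts right after this newline.
                    return read_from + idx + 1
                remaining -= 1
            pos = read_from
    return 0
```

Reading in fixed-size blocks keeps memory use flat, which matters for files in the 11GB to 17GB range mentioned here.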

commented

I'm currently observing an older hindsight (1.5.3) that is actively writing a wrong offset into hindsight.cp, yet it is working: messages are going through and continue to go through.

I'm quite sure that when I restart the process, the actual CP value is used. Is there a variable shadowed somewhere?

commented

So, these are observations taken at random intervals, comparing the value in hindsight.cp with the file size.

hindsight.cp  size      = difference          δcp      δsize
3203810625 -  456169993 = 2747640632            -          -
3203936962 -  456276076 = 2747660886       126337     106083
3204071389 -  456371900 = 2747699489       134427      95824
3204610804 -  456905065 = 2747705739       539415     533165
3205033767 -  457322539 = 2747711228       422963     417474
3205649477 -  457957044 = 2747692433       615710     634505
3205790490 -  458084431 = 2747706059       141013     127387
3208422875 -  460691948 = 2747730927      2632385    2607517
3212457689 -  464770805 = 2747686884      4034814    4078857
3222118875 -  474404653 = 2747714222      9661186    9633848
3235437704 -  487732047 = 2747705657     13318829   13327394
3248217427 -  500519902 = 2747697525     12779723   12787855
3253038436 -  505339263 = 2747699173      4821009    4819361
3253978942 -  506244767 = 2747734175       940506     905504
3263031441 -  515332714 = 2747698727      9052499    9087947
3296365901 -  548686730 = 2747679171     33334460   33354016
3318135484 -  570410790 = 2747724694     21769583   21724060
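The derived columns of the table can be recomputed from the two raw values. The sketch below does this for the first three samples; the helper name `summarize` is made up for illustration. One reading of the result, not confirmed anywhere in this thread, is that the checkpoint advances at about the same rate as the file grows while carrying a roughly constant surplus: in the samples as listed, the difference stays within about 100 KB of itself.

```python
# (checkpoint from hindsight.cp, file size) pairs copied from the table
observations = [
    (3203810625, 456169993),
    (3203936962, 456276076),
    (3204071389, 456371900),
]


def summarize(samples):
    """Recompute difference, delta-cp and delta-size for each sample."""
    rows = []
    prev_cp = prev_size = None
    for cp, size in samples:
        rows.append({
            "difference": cp - size,
            "delta_cp": None if prev_cp is None else cp - prev_cp,
            "delta_size": None if prev_size is None else size - prev_size,
        })
        prev_cp, prev_size = cp, size
    return rows


for row in summarize(observations):
    print(row)
```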

@giganteous I cannot reproduce what you describe above. Your configuration files would be useful, as would what things looked like before and after a restart and a log roll.

However, I see a case where the checkpoint would not be immediately cleared on a file open error. The current release, v1.6.7, should properly reset it on the next open, unless the new file has already grown past the old checkpoint; that is unlikely, but the data would be skipped/lost in that case. The change below should avoid the problem and the invalid checkpoint warning in this case. This is more in line with the originally reported issue and could explain how an old large checkpoint wedged it due to this line https://github.com/mozilla-services/lua_sandbox_extensions/compare/issue_199?expand=1#diff-eccc21c8229f427c40108995dff20ba0L172 (the move of the old file was detected but the new file wasn't available yet).

https://github.com/mozilla-services/lua_sandbox_extensions/tree/issue_199