NeuraLegion / har

HAR (HTTP Archive) parser in Crystal


Lazily yield entries from IO instead of a string

vladfaust opened this issue · comments

While parsing big HAR files, I may occasionally run into the following error:

    2021-01-09T12:11:24.521623Z   INFO - NeuraLegion::HARParsingService: Processing "incoming/some.har.zip"...
    Too many heap sections: Increase MAXHINCR or MAX_HEAP_SECTS
    Invalid memory access (signal 11) at address 0x0
    [0x557abc4e9396] *Exception::CallStack::print_backtrace:Int32 +118
    [0x557abc49402e] __crystal_sigfault_handler +398
    [0x7fef41040890] ???
    [0x7fef402058f0] abort +560
    [0x557abcc8e0f2] ???
    [0x557abcc8e39a] GC_expand_hp_inner +410
    [0x557abcc8e5fd] GC_collect_or_expand +349
    [0x557abcc9244b] GC_alloc_large +235
    [0x557abcc92b43] GC_generic_malloc +307
    [0x557abcc92cc1] GC_malloc_kind_global +225
    [0x557abcc931f1] GC_realloc +209
    [0x557abc55be61] *GC::realloc<Pointer(Void), UInt64>:Pointer(Void) +49
    [0x557abc47ebe4] __crystal_realloc64 +68
    [0x557abc4df23e] *Pointer(UInt8) +94
    [0x557abc5604d2] *String::Builder#resize_to_capacity<Int32>:Pointer(UInt8) +50
    [0x557abc5603e8] *String::Builder#write<Slice(UInt8)>:Nil +184
    [0x557abc671db2] *Compress::Zip::ChecksumReader +370
    [0x557abcaaeafe] *NeuraLegion::HARParsingService#parse<Compress::Zip::ChecksumReader>:NeuraLegion::HARParsingService::Observation +302
    [0x557abcaa0d89] *NeuraLegion::HARParsingService#run:Bool +70601
    [0x557abc4c4925] ~procProc(Nil) +197
    [0x557abc590120] *Fiber#run:(IO::FileDescriptor | Nil) +208
    [0x557abc49351d] ~proc2Proc(Fiber, (IO::FileDescriptor | Nil)) +29
    [0x0] ???

I suspect that this is the failing piece of code:

    private def parse(io : IO) : Observation
      observation = Observation.new
      parser = Parser.new

      entries = HAR.from_string(io.gets_to_end).entries # This
      entries.each do |entry|

I.e. the io.gets_to_end part.

I think it would be useful if there were an initializer accepting an IO instead of a String, which would yield entries lazily, one by one. For example, HAR.from_io(io) do |entry|.
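A rough sketch of what such a streaming constructor could look like, assuming the shard's Entry includes JSON::Serializable (and therefore has a constructor taking a JSON::PullParser). The from_io name and the exact structure are hypothetical, not part of the shard's current API:

```crystal
require "json"
require "har" # assumption: provides HAR::Entry with JSON::Serializable

module HAR
  # Hypothetical streaming reader: walks the JSON with a pull parser and
  # yields each entry as soon as it is decoded, so the whole file is never
  # held in memory as a single String.
  def self.from_io(io : IO, & : Entry ->)
    pull = JSON::PullParser.new(io)
    pull.read_object do |key|
      if key == "log"
        pull.read_object do |log_key|
          if log_key == "entries"
            pull.read_array { yield Entry.new(pull) }
          else
            pull.skip # ignore version, creator, pages, etc.
          end
        end
      else
        pull.skip
      end
    end
  end
end
```

With that in place, the parsing service could do HAR.from_io(io) { |entry| ... } without ever calling io.gets_to_end. Peak memory then depends on the largest single entry rather than the whole archive.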

However, this would imply an entirely different approach compared to how the shard is implemented now, so this is a long-shot... 🤔

Actually, there's no technical reason preventing the use of an IO directly (Data.from_json(io).log) instead, although I'm not sure whether it would help in this case.
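That change would be minimal, assuming Data includes JSON::Serializable (whose generated .from_json accepts an IO as well as a String). A sketch of the parse method from above with the intermediate String removed; the surrounding names are taken from the snippet in this issue:

```crystal
private def parse(io : IO) : Observation
  observation = Observation.new
  parser = Parser.new

  # Feed the IO straight into the JSON deserializer instead of building
  # one giant String with io.gets_to_end. Note that the resulting object
  # tree is still fully materialized, so this avoids the String::Builder
  # blow-up but does not make parsing truly lazy.
  entries = HAR::Data.from_json(io).log.entries
  entries.each do |entry|
    # ... process each entry as before ...
  end

  observation
end
```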