socketry / async-http

Streaming `Async::HTTP::Internet` response

bruno- opened this issue

Hi,

I'm not reporting a gem issue; I'm struggling to make the code below work.
I'm trying to make asynchronous requests where the response is passed as an IO object to a Nokogiri SAX parser, so the body is parsed as it streams in.

require "async"
require "async/http/internet"
require "nokogiri"

urls = %w(
  https://www.codeotaku.com/journal/2018-11/fibers-are-the-right-solution/index
  https://www.codeotaku.com/journal/2020-04/ruby-concurrency-final-report/index
)

class HtmlDocument < Nokogiri::XML::SAX::Document
  def start_element(name, attrs = [])
    puts name
  end
end

Async do |task|
  internet = Async::HTTP::Internet.new
  parser = Nokogiri::HTML::SAX::Parser.new(HtmlDocument.new)

  urls.each do |url|
    task.async do
      response = internet.get(url)
      puts "#{url} #{response.status}"

      # parser.parse(response.read) # using this line works, but not streaming
      parser.parse_io(response.peer.io) # this line errors 💥
    end
  end
end

The problem line is `parser.parse_io(response.peer.io)`, which fails with:

 1.72s    error: Async::Task [oid=0x2bc] [pid=74458] [2020-06-03 21:42:30 +0200]
               |   Protocol::HTTP2::FrameSizeError: Protocol::HTTP2::Frame (type=112) frame length 6841632 exceeds maximum frame size 1048576!
               |   → /Users/bruno/.rbenv/versions/2.7.1/lib/ruby/gems/2.7.0/gems/protocol-http2-0.14.0/lib/protocol/http2/frame.rb:181 in `read'
               |     /Users/bruno/.rbenv/versions/2.7.1/lib/ruby/gems/2.7.0/gems/protocol-http2-0.14.0/lib/protocol/http2/framer.rb:95 in `read_frame'
               |     /Users/bruno/.rbenv/versions/2.7.1/lib/ruby/gems/2.7.0/gems/protocol-http2-0.14.0/lib/protocol/http2/connection.rb:161 in `read_frame'
               |     /Users/bruno/.rbenv/versions/2.7.1/lib/ruby/gems/2.7.0/gems/async-http-0.52.4/lib/async/http/protocol/http2/connection.rb:106 in `block in read_in_background'
               |     /Users/bruno/.rbenv/versions/2.7.1/lib/ruby/gems/2.7.0/gems/async-1.26.1/lib/async/task.rb:258 in `block in make_fiber'

The #parse_io method needs to be passed an IO, and I haven't been able to figure out how to get a suitable one from the response. Any hints? Thank you very much 🙏

You cannot use the underlying IO for streaming the response because, with HTTP/2 for example, multiple requests are multiplexed over the same IO. In addition, what travels over that IO is the binary framing format, not the raw response data you expect.
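To make the framing point concrete, here is a minimal sketch (not code from async-http or protocol-http2) that decodes a 9-byte HTTP/2 frame header the way RFC 7540 section 4.1 lays it out. The helper name `decode_frame_header` and the sample string are made up for illustration. The numbers in the error above (length 6841632, type 112) are exactly what you get when the ASCII bytes "he p" are read as a frame header, which suggests that once the parser had consumed bytes from the shared IO, the HTTP/2 reader ended up interpreting plain text as framing.

```ruby
# Decode a 9-byte HTTP/2 frame header (RFC 7540, section 4.1):
# 24-bit length, 8-bit type, 8-bit flags, 1 reserved bit + 31-bit stream id.
# `decode_frame_header` is a hypothetical helper, not a library method.
def decode_frame_header(bytes)
  header = bytes.byteslice(0, 9)
  raise ArgumentError, "need at least 9 bytes" unless header && header.bytesize == 9

  # C = 8-bit unsigned, n = 16-bit big-endian, N = 32-bit big-endian.
  length_high, length_low, type, flags, stream_id = header.unpack("CnCCN")

  {
    length: (length_high << 16) | length_low,
    type: type,
    flags: flags,
    stream_id: stream_id & 0x7FFFFFFF # drop the reserved bit
  }
end

# Hypothetical input: ordinary text, not a real frame.
misread = decode_frame_header("he producer")
puts "length=#{misread[:length]} type=#{misread[:type]}"
# prints: length=6841632 type=112
```

Those are the same length and type reported in the FrameSizeError, so the framer was most likely handed a slice of document text rather than a genuine frame.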

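require 'async'
require 'async/http/internet'
# May be needed if your async-http version does not load Body::Pipe automatically:
require 'async/http/body/pipe'

Async do
	internet = Async::HTTP::Internet.new
	
	response = internet.get(...)
	# Wrap the decoded response body so it can be read through an IO.
	pipe = Async::HTTP::Body::Pipe.new(response.body)
	
	# `parse` stands for any consumer that expects an IO, e.g. parser.parse_io.
	parse(pipe.to_io)
end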

This makes an adapter socket around the HTTP data stream. Hopefully this helps you enough to figure it out - let me know if not.
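For anyone curious what such an adapter does conceptually, here is a rough standard-library-only sketch. It is not how `Async::HTTP::Body::Pipe` is implemented: it uses `IO.pipe` and a thread instead of a socket and async tasks, and the `ChunkedBody` and `ChunkPipe` classes are hypothetical names invented for this example. The idea is simply that body chunks are copied into one end of a pipe while the consumer reads plain bytes from the other end.

```ruby
# Hypothetical stand-in for a streaming response body that yields chunks.
class ChunkedBody
  def initialize(chunks)
    @chunks = chunks
  end

  def each(&block)
    @chunks.each(&block)
  end
end

# Hypothetical adapter: copies body chunks into the write end of a pipe
# and exposes the read end as an ordinary IO.
class ChunkPipe
  def initialize(body)
    @reader, @writer = IO.pipe
    @thread = Thread.new do
      begin
        body.each { |chunk| @writer.write(chunk) }
      ensure
        @writer.close # closing the write end signals end-of-file to the reader
      end
    end
  end

  # The IO a consumer (for example a SAX parser's parse_io) would read from.
  def to_io
    @reader
  end

  def close
    @thread.join
    @reader.close unless @reader.closed?
  end
end

body = ChunkedBody.new(["<html><body>", "<p>Hello</p>", "</body></html>"])
pipe = ChunkPipe.new(body)
received = pipe.to_io.read # reads until the writer closes its end
pipe.close
puts received
```

The consumer only ever sees the decoded body bytes, never the HTTP/2 frames, which is why this kind of adapter works where `response.peer.io` does not.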

This was very helpful, thank you very much.