Streaming `Async::HTTP::Internet` response
bruno- opened this issue · comments
Hi,
I'm not reporting a gem issue; I'm struggling to make the code below work.
I'm trying to make asynchronous requests where the response IO object is passed to a Nokogiri SAX parser.
require "async"
require "async/http/internet"
require "nokogiri"

urls = %w(
  https://www.codeotaku.com/journal/2018-11/fibers-are-the-right-solution/index
  https://www.codeotaku.com/journal/2020-04/ruby-concurrency-final-report/index
)

class HtmlDocument < Nokogiri::XML::SAX::Document
  def start_element(name, attrs = [])
    puts name
  end
end

Async do |task|
  internet = Async::HTTP::Internet.new
  parser = Nokogiri::HTML::SAX::Parser.new(HtmlDocument.new)

  urls.each do |url|
    task.async do
      response = internet.get(url)
      puts "#{url} #{response.status}"
      # parser.parse(response.read) # using this line works, but not streaming
      parser.parse_io(response.peer.io) # this line errors 💥
    end
  end
end
The problem line is `parser.parse_io(response.peer.io)`. It fails with this error:
1.72s error: Async::Task [oid=0x2bc] [pid=74458] [2020-06-03 21:42:30 +0200]
| Protocol::HTTP2::FrameSizeError: Protocol::HTTP2::Frame (type=112) frame length 6841632 exceeds maximum frame size 1048576!
| → /Users/bruno/.rbenv/versions/2.7.1/lib/ruby/gems/2.7.0/gems/protocol-http2-0.14.0/lib/protocol/http2/frame.rb:181 in `read'
| /Users/bruno/.rbenv/versions/2.7.1/lib/ruby/gems/2.7.0/gems/protocol-http2-0.14.0/lib/protocol/http2/framer.rb:95 in `read_frame'
| /Users/bruno/.rbenv/versions/2.7.1/lib/ruby/gems/2.7.0/gems/protocol-http2-0.14.0/lib/protocol/http2/connection.rb:161 in `read_frame'
| /Users/bruno/.rbenv/versions/2.7.1/lib/ruby/gems/2.7.0/gems/async-http-0.52.4/lib/async/http/protocol/http2/connection.rb:106 in `block in read_in_background'
| /Users/bruno/.rbenv/versions/2.7.1/lib/ruby/gems/2.7.0/gems/async-1.26.1/lib/async/task.rb:258 in `block in make_fiber'
The `#parse_io` method needs to be passed an IO, and I've been unable to figure out how to get one from the response. Any hints? Thank you very much 🙏
You cannot use the underlying IO to stream the response. With HTTP/2, for example, multiple requests are multiplexed over the same IO. In addition, what travels over that IO is a binary framing format, not the raw response data you expect.
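To make the framing point concrete, here is a minimal sketch of decoding the 9-byte HTTP/2 frame header (24-bit length, 8-bit type, 8-bit flags, 31-bit stream identifier, per RFC 7540 section 4.1). The helper name `parse_frame_header` and the sample bytes are hypothetical, chosen only for illustration. The numbers in the error above happen to decode as ASCII text: a length of 6841632 is the bytes `"he "` and type 112 is `"p"`, which is consistent with the framer reading body text where it expected a frame header after something else consumed bytes from the shared IO.

```ruby
# Decode a 9-byte HTTP/2 frame header (RFC 7540, section 4.1).
# `parse_frame_header` is a hypothetical helper, not part of protocol-http2.
def parse_frame_header(bytes)
  # C = 8-bit unsigned, n = 16-bit big-endian, N = 32-bit big-endian.
  length_high, length_low, type, flags, stream_id = bytes.unpack("CnCCN")

  {
    length: (length_high << 16) | length_low, # 24-bit payload length
    type: type,
    flags: flags,
    stream_id: stream_id & 0x7FFFFFFF # the top bit is reserved
  }
end

# Illustrative input: nine bytes of plain text, as if the framer had landed
# in the middle of an HTML body instead of at a frame boundary.
header = parse_frame_header("he paper!".b)

puts header[:length] # 6841632, the "frame length" from the error message
puts header[:type]   # 112, the frame type from the error message
```

Neither value is a valid frame, which is why the connection fails with `FrameSizeError` instead of handing the parser any HTML.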
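require 'async'
require 'async/http/internet'
require 'async/http/body/pipe'

Async do
  internet = Async::HTTP::Internet.new
  response = internet.get(...)

  # Wrap the response body in a pipe, which exposes the decoded data as an IO.
  pipe = Async::HTTP::Body::Pipe.new(response.body)
  parse(pipe.to_io)
end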
This makes an adapter socket around the HTTP data stream. Hopefully this helps you enough to figure it out - let me know if not.
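As a rough illustration of the adapter idea, using only the Ruby standard library: one end of a pipe is fed with body chunks in the background, and the other end is handed to whatever code expects an IO. This is a sketch of the concept, not a description of how `Async::HTTP::Body::Pipe` is implemented. The `chunks` array is a hypothetical stand-in for a streamed response body, and a thread stands in for an async task.

```ruby
# Hypothetical stand-in for a response body that arrives chunk by chunk.
chunks = ["<html><body>", "<p>Hello</p>", "</body></html>"]

# The consumer gets `reader`; `writer` is fed in the background.
reader, writer = IO.pipe

feeder = Thread.new do
  begin
    chunks.each { |chunk| writer.write(chunk) }
  ensure
    writer.close # Closing the write end signals end-of-stream to the reader.
  end
end

# Any API that expects an IO could consume `reader` here.
# For this sketch, read everything until end-of-stream.
received = reader.read

feeder.join
reader.close

puts received
```

The consumer only ever sees the decoded body bytes, never the framing of the underlying connection, which is the property the raw `response.peer.io` lacks.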
This was very helpful, thank you very much.