Implement `GET /v0/bulk/<job_id>/download` for download email results

Question

Implement `GET /v0/bulk/<job_id>/download` for download email results

amaury1093 opened this issue 3 years ago · comments

Amaury commented 3 years ago

Implement issue for downloading results.

See spec for more info: https://hackmd.io/hE1IW5e0Tmef3_wggY65qQ

I would really prefer to get CSV working on the first run. But if it's really too complicated, we can default to JSON.

Amaury · Answer 1 · Fri Jan 28 2022 18:59:40 GMT+0800 (China Standard Time)

As discussed on Discord, let's do returning a JSON array first in this endpoint.

Ishan Bhanuka · Answer 2 · Wed Feb 16 2022 00:17:35 GMT+0800 (China Standard Time)

I can see two approaches to sending the data in csv format. One is direct and the other is streamed.

Direct - This would mean loading all the rows for a query into memory as a Vec. Convert each element and collect the whole thing into a single String and return it as the response body.
Streamed - This would not load everything into memory at once, instead it would stream some rows, map them into strings and return the mapped stream in the response body and continue doing so until all rows are streamed. warp supports this as discussed here - seanmonstar/warp#38

The advantage of direct is perhaps that it's easier to implement. The advantage of streamed approach is that it's more scalable for larger datasets and more queries. Performance-wise I'm not sure and it'll probably be best to benchmark.

For this to work struct CheckEmailOutput needs the deserialize trait so that the db record can be converted to it. Currently it's a jsonb value so it gets converted to serde_json::Value.

Now the challenge is that CheckEmailOutput is a nested structure and the current serializer converts it into a map. As discussed in BurntSushi/rust-csv#98 serializing maps to csv is not supported right now.

For direct approach the struct needs a custom serializer that converts to a flat structure in the serde data model. Then the vector can be passed to a csv writer which can directly convert the whole vec of results into a csv string body.

For the streamed approach this can be avoided all together. The struct can implement something like a to_csv_string() method that gives it's csv string representation. This function can be use to map the stream from the struct a string. Part reason for this is because BurntSushi/rust-csv does not support converting a single struct to it's csv string equivalent directly. The row has to be passed to a writer which only returns the whole output and not a stream so it won't work for this task.

Ishan Bhanuka · Answer 3 · Sun Feb 20 2022 12:02:44 GMT+0800 (China Standard Time)

I wanted to highlight one of the issues that I faced while adding the csv conversion.

Here's a diagram that summarizes the situation, starting from CheckEmailOutput box -

Currently once deserialized to json it's not possible to get back CheckEmailOutput at all. This might cause problem later, if extra processing needs to be done after retrieving from db as was the case for converting the struct to a csv record.

The green box highlights the approach taken in the current PR to implement conversion to csv. It wraps the json and converts it into a simpler struct that can be converted to csv. This gets the job done but might cause maintenance problems if more fields need to be added or extra processing needs to be done before converting to csv.

The red boxes are the where there are issues with serde - the problems are that currently there is no de-serialize for the struct and the serialize implementation uses maps which does not allow it to be converted to csv.

The yellow boxes are possible ways to fix the underlying issues. There are two approaches I can see for this -

Deserialization issue is because the some of the nested error fields are directly storing errors from standard library like future::TimeoutError. It might be a good idea to convert these errors into domain specific errors like ReacherFutureError(String) which can automatically be serialized and de-serialized.
Taking a page from oop design pattern a DTO might be a possible approach here. That is the domain object CheckEmailOutput gets converted to a Domain Transfer Object (DTO) say CheckEmailClientOutput. The DTO is a simpler structure - easier to serde. For e.g. CheckEmailClientOutput could simplifies the error fields into strings since end users only care about the error message and not really it's structure. This would still keep all the important fields as is to avoid breaking the current API.

Amaury · Answer 4 · Sun Feb 20 2022 20:49:37 GMT+0800 (China Standard Time)

Hey @twitu, thanks for this excellent writeup!

I really like the left yellow path you proposed in the diagram. I believe it corresponds to the 2nd approach using a DTO, correct?

The reason for which I prefer the 2nd approach is that I would still like to keep the domain error types from external libraries (future, async-smtp, trust-dns) inside check-if-email-exists.

I'm sure 99% of the users don't care about this, and the error string should be enough for them. But I'd like to keep the core library correct, and let other libraries (including reacherhq/backend) build on top. The CheckEmailClientOutput seems like a good built-on-top abstraction suited for the REST backend usage of this repo.

2nd approach

If we go with the 2nd approach, how doesCheckEmailClientOutput look like? I can think of:

+ #[derive(Serialize, Deserlize)]
- pub struct CheckEmailOutput {
+ pub struct CheckEmailClientOutput {
	pub input: String,
	pub is_reachable: Reachable,
	pub misc: Result<MiscDetails, MiscError>,
	pub mx: Result<MxDetails, MxError>,
-	pub smtp: Result<SmtpDetails, SmtpError>,
+	pub smtp: Result<SmtpDetails, SmtpErrorClient>, // where we replace future::TimeoutError with ReacherFutureError(String)
	pub syntax: SyntaxDetails,
}

Are there any other changes needed in other deeply nested errors?

Also:

Do you see CheckEmailClientOutput living in check-if-email-exists or in this repo? I'd say the latter, but wanted your opinion.
can the serde_json output of CheckEmailClientOutput be exactly the same as the current REST API output? Ideally without writing a custom serializer? i.e. using #[derive(Serialize)], maybe with some additional #[serde] tags?
can the rust_csv output of CheckEmailClientOutput be exactly the same as the CSV columns defined in the hackmd? (bonus: what about the weird mx.records field?)
We need to implement impl From<CheckEmailOutput> for CheckEmailClientOutput, right? It should be fairly easy. Do you see the need for the other way conversion (I don't think so)?

Next steps?

I saw you did the green path in your PR. I'm totally fine to merge as is, and we can explore the 2nd approach in another PR. What do you think?