Unstructured-IO / unstructured-js-client

A Typescript client for the Unstructured hosted API

How do I supply `Files.contentType`?

omikader opened this issue · comments

How can I provide a value for my file's content type to the partitioning API?

I noticed that the code in unstructured-api determines how to partition each file using the content_type attribute attached to the FastAPI UploadFile. If one is not provided, it tries to infer the file type using the filename extension.
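The fallback described above can be sketched roughly as follows. This is only an illustration in TypeScript, not the actual unstructured-api code (which is Python); the `resolveContentType` helper and the `EXTENSION_TO_MIME` table are hypothetical names made up for this example.

```typescript
// Hypothetical sketch of "explicit content type first, filename extension second".
// The table below is illustrative and far from complete.
const EXTENSION_TO_MIME: Record<string, string> = {
  ".pdf": "application/pdf",
  ".txt": "text/plain",
  ".html": "text/html",
};

function resolveContentType(fileName: string, contentType?: string): string | undefined {
  // 1. An explicitly supplied content type always wins.
  if (contentType) return contentType;

  // 2. Otherwise fall back to the filename extension, if there is one.
  const dot = fileName.lastIndexOf(".");
  if (dot <= 0) return undefined; // no extension, e.g. a bare UUID

  return EXTENSION_TO_MIME[fileName.slice(dot).toLowerCase()];
}
```

With a UUID-style name and no explicit content type, the sketch returns `undefined`, which corresponds to the "File type None" situation; passing a content type skips the extension lookup entirely.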

My file names are arbitrary UUIDs (no filename extension), so when I try to partition them I get this error:

{"detail":"File type None is not supported."}

I would like to manually provide a value for UploadFile.content_type to avoid the fallback behavior but I don't see a way to do that using the JS client. Can we modify the Files definition to include an optional value for contentType, which I presume would be used in the unstructured-api code and result in skipping the fallback path?

export declare class Files extends SpeakeasyBase {
    content: Uint8Array;
    fileName: string;
    // PROPOSING WE ADD THE FOLLOWING LINE
    contentType?: string;
}

Hi there, apologies for the delay. This is certainly something that should be in the client. I can do some digging and get back to you soon. We're also planning to improve the content type checking on the server side in the near term.

Hi @awalker4! Do you have any updates on this front? We'd like to upgrade to the new version of the JS client but we get the infamous {"detail":"File type None is not supported."} error when we try to provide a Blob type for the files argument. Unfortunately, when providing a Blob, you can no longer supply the fileName.

Hi Omar, sorry for the delays! We still need to get content_type in as a client param, and I'd like to get around to that this week. As a workaround, the latest client does still take the files object from before, so you can set the filename. Check out the Typescript tab in the docs here. Let me know if this is sufficient for now, or if you're blocked on needing the content type.

Separately, I have an internal ticket to improve server side file handling. We can address the filetype None issue by actually inspecting the file and not just keying off of the extension.

No problem! Thanks for the quick response! Yes, once we upgrade we can continue to provide the Files object but we'd love to start using the Blob variant to avoid loading the entire file into memory at once.

For now, we've decided to stay on the older client version because we started running into fetch timeout issues at the 5-minute mark, and I believe this issue is related to nodejs/node#46375

Also, when supplying files: new Blob([data], { type: file.mimetype }), splitPdfPage: true doesn't work and the client raises "Given file is not a PDF. Continuing without splitting."

@alimoezzi

You can do this

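import { openAsBlob } from 'node:fs'; // available in Node.js 19.8 and later
const blob = await openAsBlob('path/to/filename.pdf');
const name = 'filename.pdf';
// Wrapping the Blob in a File attaches the file name that a bare Blob lacks.
const file = new File([blob], name);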
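For reference, the File constructor also accepts an options object with a `type` field, so a single object can carry both a name and a MIME type. The sketch below is a minimal illustration using Node's `File` from `node:buffer`; the `toNamedFile` helper is a hypothetical name, and whether the client forwards `File.type` to the API is an assumption not confirmed in this thread.

```typescript
import { File } from "node:buffer";

// Hypothetical helper: wrap raw bytes in a File so that both the file name
// and the MIME type are attached to the same object.
function toNamedFile(content: Uint8Array, fileName: string, contentType: string): File {
  return new File([content], fileName, { type: contentType });
}

// Five bytes spelling "%PDF-", standing in for real file content.
const pdfBytes = Uint8Array.from([0x25, 0x50, 0x44, 0x46, 0x2d]);
const namedFile = toNamedFile(pdfBytes, "filename.pdf", "application/pdf");
```

A plain Blob only has `type` and `size`; the `name` property exists only on File, which is why the workaround above wraps the Blob.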

Hi all, we've merged a fix for the API that removes the naive extension check and does an actual filetype detection. This will get rid of the "File type None is not supported" errors and should cover most of the cases where you'd need to explicitly send a content type. This is deployed in our hosted serverless and free tier APIs.
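To illustrate what content-based detection means in general, here is a minimal sketch that inspects the leading bytes of a file instead of its name. It is not the detection code used by the API; `sniffContentType` and the `SIGNATURES` table are hypothetical names, and the table covers only three well-known signatures.

```typescript
// Hypothetical sketch of magic-byte sniffing. Real detectors handle many more
// formats and look deeper into the file than its first few bytes.
const SIGNATURES: Array<{ mime: string; bytes: number[] }> = [
  { mime: "application/pdf", bytes: [0x25, 0x50, 0x44, 0x46, 0x2d] }, // "%PDF-"
  { mime: "image/png", bytes: [0x89, 0x50, 0x4e, 0x47, 0x0d, 0x0a, 0x1a, 0x0a] },
  { mime: "application/zip", bytes: [0x50, 0x4b, 0x03, 0x04] }, // "PK\x03\x04"
];

function sniffContentType(content: Uint8Array): string | undefined {
  for (const { mime, bytes } of SIGNATURES) {
    // A signature matches when every one of its bytes equals the file's prefix.
    const matches =
      content.length >= bytes.length && bytes.every((b, i) => content[i] === b);
    if (matches) return mime;
  }
  return undefined; // unknown format
}
```

Because the result depends only on the bytes, a file named with a bare UUID is identified the same way as one with a `.pdf` extension.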

We've added the contentType parameter to the SDK, to coincide with the new API param here. Together with the better filetype checking, this should resolve the issue. Apologies for the very long turnaround time on this :/

@alimoezzi I created #100 for the pdf page splitting bug.