Unstructured-IO / unstructured-js-client

A TypeScript client for the Unstructured hosted API

Support for online docs

ndayishimiyeeric opened this issue · comments

I get an error when processing a document hosted by an online provider (uploadthing).

To reproduce

  1. Set up a file hosting service
  2. After the upload, process the file using the unstructured-js-client SDK

Error

 Error: ENOENT: no such file or directory, open 'https://utfs.io/f/"key".pdf'
    at Object.openSync (node:fs:581:18)
    at Object.readFileSync (node:fs:457:35)
    at processFile (./src/data/files.ts:35:52)
    at async handler (./src/actions/file/upload/index.ts:42:51)

Code

const fsData = fs.readFileSync(url);
  usClient.general
    .partition({
      files: {
        content: fsData,
        fileName: url,
      },
    })
    .then((res: PartitionResponse) => {
      if (res.statusCode === 200) {
        console.log("res", res);
        return res;
      }
    })
    .catch((err) => {
      console.log("err", err);
    });

Other options tried

  • Loading the file with a langchain blob loader, then providing the loaded content as the file content. This gives a TypeScript error:
    Type 'string' is not assignable to type 'Uint8Array'.

Is there a way to read a hosted file?

readFileSync returns buffer data, I guess, but it only reads local paths. Fetch the content behind your URL into a buffer instead of passing the URL to readFileSync.
See this answer:

https://stackoverflow.com/a/55665383/5748537

If you need to get a file from the web, you need to use the http/https API, or request or a similar library, to read the contents of the file at that URL.
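
A minimal sketch of that suggestion, assuming Node 18+ where a global fetch is available. The helper names fetchFileBytes and fileNameFromUrl are hypothetical and not part of unstructured-js-client; the idea is only to download the bytes over HTTP(S) rather than handing the URL to fs.

```typescript
// Hypothetical helper: derive a file name from the last path segment of a URL.
function fileNameFromUrl(url: string): string {
  const { pathname } = new URL(url);
  const last = pathname.split("/").filter(Boolean).pop();
  return last ?? "document";
}

// Hypothetical helper: download a remote file into memory as a Uint8Array.
// Assumes Node 18+ (global fetch); fs.readFileSync cannot open http(s) URLs.
async function fetchFileBytes(url: string): Promise<Uint8Array> {
  const res = await fetch(url);
  if (!res.ok) {
    throw new Error(`Download failed: ${res.status} ${res.statusText}`);
  }
  return new Uint8Array(await res.arrayBuffer());
}
```

The resulting bytes could then be passed as `content` (with `fileNameFromUrl(url)` as `fileName`) in the `files` object of the `usClient.general.partition` call from the original post, in place of the `fs.readFileSync(url)` result.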

Thanks @hiepxanh

I've found a stable solution using writeFile and unlink from fs/promises:

code snippet

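import axios from "axios";
import { unlink, writeFile } from "fs/promises";
// Import path may differ between langchain versions.
import { UnstructuredLoader } from "langchain/document_loaders/fs/unstructured";

// Download the hosted file as raw bytes.
const response = await axios.get(url, {
  responseType: "arraybuffer",
});

// Write the bytes (response.data, not the response object) to a temporary file.
const randomName = Math.random().toString(36).substring(7);
const tmpPath = `/tmp/${randomName}.pdf`;
await writeFile(tmpPath, response.data);

// Load the temporary file with langchain's UnstructuredLoader, then delete it.
const loader = new UnstructuredLoader(tmpPath, {
  // loader data using langchain UnstructuredLoader
});
const documents = await loader.load();
await unlink(tmpPath);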
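
One possible hardening of the temp-file approach, sketched under some assumptions: the file is always removed even if loading throws, and the path comes from os.tmpdir() with a random UUID instead of a hard-coded /tmp and Math.random(). The helper names tempFilePath and withTempFile are hypothetical.

```typescript
import { randomUUID } from "node:crypto";
import { unlink, writeFile } from "node:fs/promises";
import { tmpdir } from "node:os";
import { join } from "node:path";

// Hypothetical helper: build a unique path inside the OS temp directory.
function tempFilePath(extension: string): string {
  return join(tmpdir(), `${randomUUID()}${extension}`);
}

// Hypothetical helper: write bytes to a temp file, run `use` with its path,
// and delete the file afterwards whether `use` succeeds or throws.
async function withTempFile<T>(
  bytes: Uint8Array,
  extension: string,
  use: (path: string) => Promise<T>,
): Promise<T> {
  const path = tempFilePath(extension);
  await writeFile(path, bytes);
  try {
    return await use(path);
  } finally {
    await unlink(path);
  }
}
```

With the downloaded bytes in hand, the loader call from the snippet above could be wrapped as `withTempFile(bytes, ".pdf", (path) => new UnstructuredLoader(path, {}).load())`.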

Great, I think this is a good solution <3