Support for saving embeddings bin file in laser_encoder

Question

vmenan opened this issue 8 months ago · comments

Hi everyone,
First and foremost, thank you so much for building laser_encoder support. This is very useful. In the previous embed.sh, the pipeline was able to save the embedding.bin file. Is this support available in laser_encoder?

Thank you so much for the support!

David Dale · Answer 1 · Mon Nov 27 2023 18:39:04 GMT+0800 (China Standard Time)

Hi @vmenan!
The old embed.sh still works (or at least it is supposed to). It now uses laser_encoders under the hood, but that should not affect the results.
So if you prefer, you can still use the embed.sh pipeline that saves the embedding.bin file.

David Dale · Answer 2 · Mon Nov 27 2023 18:41:08 GMT+0800 (China Standard Time)

The new laser_encoders package is intended for users who want to implement their own pre- or post-processing of the data, including the way the embeddings are saved (or used without saving at all).

David Dale · Answer 3 · Mon Nov 27 2023 18:41:22 GMT+0800 (China Standard Time)

@vmenan does this help?

vmenan · Answer 4 · Tue Nov 28 2023 14:12:37 GMT+0800 (China Standard Time)

I apologize for the delayed reply, @avidale. Yes, it does. laser_encoders gives more control to the user, which is brilliant. I was able to solve my issue with a simple function. Anyone can implement it easily, but I'm sharing the code here in case someone wants a quick solution.

import numpy as np

def append_to_bin_file(file_name, numpy_array):
    # Convert NumPy array to bytes
    binary_data = numpy_array.tobytes()
    try:
        # Open the file in binary append mode ('ab')
        with open(file_name, 'ab') as file:
            # Append binary data to the file
            file.write(binary_data)
        print(f"Array appended to {file_name} successfully.")
    except Exception as e:
        print(f"An error occurred: {e}")

This function appends the embeddings to a ".bin" file. I chose to append rather than overwrite so that it also works when the data is loaded in chunks due to RAM limitations.
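
One thing to keep in mind when reading such a file back: tobytes() writes only the raw values, so the file stores neither the dtype nor the shape. Below is a minimal sketch of a write/read round trip. It assumes the embeddings are float32 and 1024-dimensional (please check this against your own encoder output); the names append_embeddings, load_embeddings and EMBEDDING_DIM are made up for this example and are not part of laser_encoders.

```python
import numpy as np

# Assumption: each sentence embedding has 1024 float32 values.
# Adjust this if your encoder output has a different width.
EMBEDDING_DIM = 1024


def append_embeddings(file_name, embeddings):
    """Append a 2-D array of embeddings to a raw binary file as float32."""
    # Force a fixed dtype and a contiguous layout so every chunk is written
    # in the same format, whatever dtype the caller passes in.
    array = np.ascontiguousarray(embeddings, dtype=np.float32)
    with open(file_name, "ab") as f:
        f.write(array.tobytes())


def load_embeddings(file_name, dim=EMBEDDING_DIM):
    """Read the raw float32 file back as an array of shape (n_sentences, dim)."""
    flat = np.fromfile(file_name, dtype=np.float32)
    # reshape raises ValueError if the file size is not a multiple of dim,
    # which usually means the dtype or the dimension is wrong.
    return flat.reshape(-1, dim)
```

Calling append_embeddings once per chunk and then load_embeddings on the finished file should return all rows in the order they were written. Casting to float32 before writing is a design choice in this sketch: it keeps the file consistent even if one chunk happens to arrive as float64.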