Performance issues
pavlexander opened this issue
I would like to use this library to save/append candlestick trading data (OHLCV) into files.
After looking at the advertised performance I tried using this library, but it's just not as fast as simply saving the data to a binary file. I would like to know whether I am misusing the library or it's simply not meant for this use case. The test data set consists of 3_020_871 records.
The test code looks like this:
```csharp
using MemoryStream memoryStream = new MemoryStream();

foreach (var data in dataStructLong)
{
    Memory<byte> bytes = await Hyper.HyperSerializer.SerializeAsync(data);
    await memoryStream.WriteAsync(bytes);
}

File.WriteAllBytes("customHyper.hyper", memoryStream.ToArray());
```
For performance comparison, I am also using the `teafiles` library (which is basically a wrapper for brute-force binary serialization):
```csharp
using (var tf = TeaFile<CandlestickLongStruct>.Create("teaFile.tea"))
{
    foreach (var item in dataStructLong)
    {
        tf.Write(item);
    }
}
```
The results are:
- Tea: 528 ms, File size: 184.37969970703125 mb
- Hyper: 1967 ms, File size: 184.37933349609375 mb
It does seem like `HyperSerializer` produces an output file of almost exactly the same size, but the performance is much worse.
The aim of this post, of course, is not to compare this library to others. I genuinely want to replace the teafiles library and am looking for a better solution. I would appreciate feedback on the performance issue.
For the sake of completeness, here's the serialized data type:
```csharp
public readonly struct CandlestickLongStruct
{
    public long Id { get; init; }
    public Time OpenTime { get; init; }
    public long Open { get; init; }
    public long High { get; init; }
    public long Low { get; init; }
    public long Close { get; init; }
    public long Volume { get; init; }
    public Time CloseTime { get; init; }
}
```
.NET 7, HyperSerializer 1.4.0, Rubble.TeaFiles.Net 2.0.0
Additionally, I would like to know how you would solve the problem of appending data to an existing file and also getting/updating the total number of records in the file.
Few things...

- Your candle is a struct - copying it to `Memory<T>` (let alone onto a `MemoryStream`) when you can use the non-async `Serialize` method to get a stack-based `Span<byte>` will be significantly faster than allocating on the heap.
- You're copying the serialized bytes to the heap and subsequently copying them 3 more times unnecessarily. You pass `Memory<byte>` to `MemoryStream`, which copies data from one array to another for no reason, and then you copy the entire stream again with `ToArray` - iterating over the stream's internal array byte-by-byte and making yet another copy that gets passed to `File.WriteAllBytes`. **If you just use a `FileStream` (e.g. `FileStream fs = File.OpenWrite(path)`) and `Span<byte> bytes = HyperSerializer.Serialize(item)`, and then call `fs.Write(bytes)`, it eliminates all the copying and leverages the Formula 1 speed of `Span` - plus you eliminate the overhead of pointers, copying, and iterating 3 times unnecessarily, etc.**
The most important point is that you don't really need to serialize each array item one by one. Just pass the array to the `Serialize` function, as in the sketch below.
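For instance, a minimal sketch of that pattern (it mirrors the `HyperSerializer_NoLoop_FileStream_10M` benchmark further down; `candles` stands in for your `CandlestickLongStruct[]`):

```csharp
// Serialize the entire array in one call and write the resulting
// Span<byte> straight to a FileStream - no intermediate Memory<byte>,
// MemoryStream, or ToArray copies.
using (var fs = File.OpenWrite("customHyper.hyper"))
{
    Span<byte> bytes = Hyper.HyperSerializer.Serialize(candles);
    fs.Write(bytes);
}
```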
In related news, I copied your code from above into Linqpad, ran some benchmarks, and provided both below for reference. If you use Linqpad, you can just add the NuGet packages, copy the code into a "Program" script, then highlight the functions you want to benchmark and hit Ctrl+Shift+B. I added a class version of a candle for further comparison vs TeaCup (which only supports structs), and HyperSerializer is still 33% faster (see the bottom benchmark line; this is an important point because classes tend to be more suitable for stream processing and ML trading algos).
All benchmarks are 10M randomly generated candle objects (structs and/or classes).
I'm going to write a separate post ASAP to address some of your other questions, which are equally if not more important. Here's the Linqpad script...
Linqpad "Program" script type...add HyperSerializer and Teacup references. Teacup's bits are still .Net Framework Linqpad (and .Net in general) doesn't play well with...
```csharp
async Task Main()
{
}

#region Highlight and CTRL + SHIFT + B
void Tea_ForEach_Struct_10M()
{
    var path = @"e:\temp\teaFile6.tea";
    using (var tf = TeaFile<CandlestickLongStruct>.Create(path))
    {
        foreach (var item in Bars)
        {
            tf.Write(item);
        }
    }
    File.Delete(path);
}
void HyperSerializer_ForEach_Struct_FileStream_10M()
{
    var path = @"e:\temp\hyper.bin";
    using (var fs = File.OpenWrite(path))
    {
        foreach (var item in Bars)
        {
            Span<byte> bytes = Hyper.HyperSerializer.Serialize(item);
            fs.Write(bytes);
        }
    }
    File.Delete(path); // delete so repeated runs don't write over a stale file
}
void HyperSerializer_NoLoop_FileStream_10M()
{
    var path = @"e:\temp\hyper.bin";
    using (var fs = File.OpenWrite(path))
    {
        Span<byte> bytes = Hyper.HyperSerializer.Serialize(Bars);
        fs.Write(bytes);
    }
    File.Delete(path);
}

void HyperSerializer_ForEach_Class_FileStream_10M()
{
    var path = @"e:\temp\hyper.bin";
    using (var fs = File.OpenWrite(path))
    {
        foreach (var item in BarsClass)
        {
            var bytes = Hyper.HyperSerializer.Serialize(item);
            fs.Write(bytes);
        }
    }
    File.Delete(path);
}
#endregion
#region Benchmark Data
Random rand = new Random();

CandlestickLongStruct[] _bars;
public CandlestickLongStruct[] Bars => _bars ??=
    Enumerable.Range(0, 10_000_000).Select(x =>
        new CandlestickLongStruct
        {
            High = rand.Next(),
            Open = rand.Next(),
            Close = rand.Next(),
            Low = rand.Next()
        })
    .ToArray();

CandlestickLongClass[] _barsClass;
public CandlestickLongClass[] BarsClass => _barsClass ??=
    Enumerable.Range(0, 10_000_000).Select(x =>
        new CandlestickLongClass
        {
            High = rand.Next(),
            Open = rand.Next(),
            Close = rand.Next(),
            Low = rand.Next()
        })
    .ToArray();

public struct CandlestickLongStruct
{
    public int High { get; set; }
    public int Low { get; set; }
    public int Close { get; set; }
    public int Open { get; set; }
}

public class CandlestickLongClass
{
    public int High { get; set; }
    public int Low { get; set; }
    public int Close { get; set; }
    public int Open { get; set; }
}
#endregion
```
> Additionally I would like to know how you would solve the problem of appending data to an existing file and also getting/updating the total number of records in the file.
- You can use a standard format like CSV, build your own binary file / database / wire format - or just use something that's already built. CSV lacks functionality and isn't fast enough, rolling your own is a waste of time, and Microsoft FASTER cannot be beat - that said, I wouldn't recommend it to anyone looking for something plug-and-play in a few hours.
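For a sense of scale, the simplest roll-your-own version of append-plus-record-count is only a few lines. This is a sketch, not a recommendation: `freshCandles` is a hypothetical batch of new bars, and it assumes every serialized record has the same size (the matching file sizes above suggest this holds for `CandlestickLongStruct`):

```csharp
// Hypothetical sketch: append fixed-size records and derive the record
// count from the file length. Not a HyperSerializer or FASTER feature.
string path = "customHyper.hyper";
int recordSize = Hyper.HyperSerializer.Serialize(new CandlestickLongStruct()).Length;

using (var fs = new FileStream(path, FileMode.Append))
{
    foreach (var item in freshCandles)
        fs.Write(Hyper.HyperSerializer.Serialize(item));
}

long totalRecords = new FileInfo(path).Length / recordSize;
```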
I have a private repo with a bunch of this stuff implemented. If you send me some detail regarding what you're working on and your goals/objectives (data sources, exchanges, brokers, asset classes, strategy, etc.), we may be able to help each other out.
Hi!
Thank you very much for a very detailed answer!
Actually, the reason I am looking for a `teafiles` replacement is exactly that: I want to use classes, and the extra bit of `struct<->class` mapping that I currently use does not add any value. `HyperSerializer` has an advantage in that regard.
**Note 1**
Regarding your suggested solution to just serialize the whole batch of data - I've had trouble when trying to serialize it in one go:

```csharp
var allData = Hyper.HyperSerializer.Serialize(dataStructLong);
```
Surprisingly, no such issue occurs if `SymbolTick` from the tests is used!

```csharp
Hyper.HyperSerializer.Serialize(ticks);
```
It turns out the issue is with the `IEnumerable` type used! If I use an array, serialization works; if I use a list, it throws, e.g.

```csharp
Hyper.HyperSerializer.Serialize(dataStructLong.Take(1).ToArray()); // no exception
Hyper.HyperSerializer.Serialize(dataStructLong.Take(1).ToList()); // exception
```
**Note 2**
The second issue I've identified is connected to `HyperSerializer` warmup.
For example, the following code runs in 1420 ms in release mode (`teafiles`: 540 ms):
```csharp
var sw = Stopwatch.StartNew();

using (var fs = File.OpenWrite(customHyperFilePath))
{
    foreach (var item in dataStructLong)
    {
        fs.Write(Hyper.HyperSerializer.Serialize(item));
    }
}

sw.Stop();
var elapsedHyper = sw.ElapsedMilliseconds;
```
but with the following line added before the stopwatch, the `HyperSerializer` execution time is only 547 ms (`teafiles` is 553 ms):

```csharp
Hyper.HyperSerializer.Serialize(dataStructLong.First());
```
It means that `HyperSerializer` performs better after it has been "run" previously. Could you comment on that?
**Note 3**
Finally, in regard to your question about the use case: I'm simply collecting OHLC data from popular exchanges (currently only one) and trying to come up with a solution to persist the data. After a bunch of tests I've found that it's easier and better to store the data in binary files by means of libraries such as `teafiles` or `HyperSerializer`.
They perform substantially better than SQL or, god forbid, JSON or CSV. Not only are the reads/writes vastly faster, but all the other metrics are better as well (memory footprint, disk space usage, etc.).
Here are some results from previous tests (they did not include HS at that point)
File write performance (Serialization)
- Json: 203 ms, File size: 7.64947509765625 mb (p.s. ignore this result :) not sure what went wrong with the test there but json serialization usually takes 1+ seconds guaranteed)
- Tea: 510 ms, File size: 163.18914794921875 mb
- Csv: 2901 ms, File size: 219.56466674804688 mb
- Bin: 601 ms, File size: 163.18878173828125 mb
- Protobuf: 1170 ms, File size: 114.9475326538086 mb
- MessagePack: 1582 ms, File size: 134.53640747070312 mb
- MessagePack lz4 compressed: 813 ms, File size: 86.13397216796875 mb
- Parquet: 958 ms, File size: 31.272869110107422 mb
File read performance (De-Serialization)
- sql: 11452 ms
- json: 499 ms
- tea: 340 ms
- csv: 3352 ms
- bin: 575 ms
- protobuf: 1063 ms
- messagePack: 588 ms
- messagePack compressed: 587 ms
- parquet: 878 ms
The use case actually involves saving a huge chunk of data once, then appending fresh data to the same files on a daily basis. Everything is done synchronously, and no frequent read access is required either.
To make a long story short, I just want to drop the tea-files library due to its inability to use classes; this causes some headaches in the backend service I am building :)
Regarding the warmup: the first time HyperSerializer is used to serialize or deserialize a type, it generates a dynamic in-memory assembly containing a type that's optimized to serialize the object. Just create an initialization function in your application startup (Program.cs or wherever) that makes a call using each type you want to serialize.
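For example, a minimal warmup sketch using the types from this thread:

```csharp
// Touch each serializable type once at startup so HyperSerializer's
// dynamic assembly generation happens before any timed code runs.
static void WarmUpSerializers()
{
    Hyper.HyperSerializer.Serialize(new CandlestickLongStruct());
    Hyper.HyperSerializer.Serialize(new CandlestickLongClass());
}
```

Calling `WarmUpSerializers()` once from `Program.cs` moves the one-time code generation out of the measured path, which accounts for the ~1420 ms vs ~547 ms difference you observed.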
Regarding how to store the serialized objects, you have a few options:
- Serialize and deserialize everything to/from a single file each time - this gets messy quickly
- Build your own binary file format and indexing - time consuming
- Use one of the open source options
I highly suggest the last option. I use Microsoft FASTER with HyperSerializer. It's a complicated beast but very powerful, and the fastest KV store I've found by several orders of magnitude. It's a memory-mapped file store and has the flexibility to be used as a multi-value dictionary with a native mechanism that allows for key -> values chaining. In other words, each time you add a value (called an upsert) to the store, you can configure its callback functions to store the new record at a new memory address without overwriting the existing value, and to create a reference chain to all prior values for the same key. It allows you to read, write and update values in terabyte files in microseconds.
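For orientation, the basic FasterKV pattern looks roughly like this - a sketch based on FASTER's public getting-started sample (exact APIs vary across versions; the key -> values chaining described above needs custom callback functions that are not shown here, and in practice the value would be HyperSerializer output rather than a `long`):

```csharp
using FASTER.core;

// Minimal FasterKV sketch: one log device, one session, one upsert.
var log = Devices.CreateLogDevice(@"e:\temp\hlog.log");
var store = new FasterKV<long, long>(1L << 20, new LogSettings { LogDevice = log });

using (var session = store.NewSession(new SimpleFunctions<long, long>()))
{
    long key = 1, value = 42;
    session.Upsert(ref key, ref value);
}

store.Dispose();
log.Dispose();
```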
Using HyperSerializer w/ FASTER, I can write 10 million records to SSD in about 3 seconds. Happy to give you access to my private repo if you want to take a look at how I used it. LMK.