Performance issues
pavlexander opened this issue
I would like to use this library to save/append candlestick trading data (OHLCV) into files.
After looking at the advertised performance I tried using this library, but it's just not as fast as simply saving the data to a binary file. I would like to know whether I am misusing the library or it's simply not meant for this use case. The test data set consists of 3_020_871 records.
The test code looks like this:
```csharp
using MemoryStream memoryStream = new MemoryStream();

foreach (var data in dataStructLong)
{
    Memory<byte> bytes = await Hyper.HyperSerializer.SerializeAsync(data);
    await memoryStream.WriteAsync(bytes);
}

File.WriteAllBytes("customHyper.hyper", memoryStream.ToArray());
```
For performance comparison, I am also using the `teafiles` library (which is basically a wrapper for brute-force binary serialization):
```csharp
using (var tf = TeaFile<CandlestickLongStruct>.Create("teaFile.tea"))
{
    foreach (var item in dataStructLong)
    {
        tf.Write(item);
    }
}
```
The results are:
- Tea: 528 ms, File size: 184.37969970703125 mb
- Hyper: 1967 ms, File size: 184.37933349609375 mb
It does seem like `HyperSerializer` produces an output file of almost exactly the same size, but the performance is much worse.
The aim of this post, of course, is not to compare this library to others. I genuinely want to replace the teafiles library and am looking for a better solution. I would appreciate feedback on the performance issue.
For the sake of completeness, here's the serialized data type:
```csharp
public readonly struct CandlestickLongStruct
{
    public long Id { get; init; }
    public Time OpenTime { get; init; }
    public long Open { get; init; }
    public long High { get; init; }
    public long Low { get; init; }
    public long Close { get; init; }
    public long Volume { get; init; }
    public Time CloseTime { get; init; }
}
```
.NET 7, HyperSerializer 1.4.0, Rubble.TeaFiles.Net 2.0.0
Additionally, I would like to know how you would solve the problem of appending data to an existing file and also getting/updating the total number of records in the file.
Few things...

- Your candle is a struct - copying it to `Memory<T>` (let alone onto a `MemoryStream`) when you can use the non-async `Serialize` method to get a stack-based `Span<byte>` will be significantly faster than allocating on the heap.
- You're copying the serialized bytes to the heap and subsequently copying them 3 more times unnecessarily. You pass `Memory<byte>` to `MemoryStream`, which copies data from one array to another for no reason, and then you copy the entire stream again with `ToArray` - iterating over the stream's internal array byte-by-byte and making yet another copy that gets passed to `File.WriteAllBytes`. **If you just use a `FileStream` (e.g. `FileStream fs = File.OpenWrite(path)`) and `Span<byte> bytes = HyperSerializer.Serialize(item)`, and then call `fs.Write(bytes)`, it eliminates all the copying and leverages the Formula 1 speed of `Span` - plus you eliminate the overhead of pointers, copying, and iterating 3 times unnecessarily, etc.**
The most important point is that you don't really need to serialize each array item one by one. Just pass the array to the `Serialize` function, as in the sketch below.
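For instance, a minimal sketch of that pattern (it mirrors the `HyperSerializer_NoLoop_FileStream_10M` benchmark further down; `candles` stands in for your `CandlestickLongStruct[]`):

```csharp
// Serialize the entire array in one call and write the resulting
// Span<byte> straight to a FileStream - no intermediate Memory<byte>,
// MemoryStream, or ToArray copies.
using (var fs = File.OpenWrite("customHyper.hyper"))
{
    Span<byte> bytes = Hyper.HyperSerializer.Serialize(candles);
    fs.Write(bytes);
}
```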
In related news, I copied your code from above into Linqpad, ran some benchmarks, and provided both below for reference. If you use Linqpad, you can just add the NuGet packages, copy the code into a "Program" script, then highlight the functions you want to benchmark and hit Ctrl+Shift+B. I added a class version of a candle for further comparison vs TeaCup (which only supports structs), and HyperSerializer is still 33% faster (see the bottom benchmark line; this is an important point because classes tend to be more suitable for stream processing and ML trading algos).
All benchmarks are 10M randomly generated candle objects (structs and/or classes).
I'm going to write a separate post ASAP to address some of your other questions, which are equally if not more important. Here's the Linqpad script...
Linqpad "Program" script type...add HyperSerializer and Teacup references. Teacup's bits are still .Net Framework Linqpad (and .Net in general) doesn't play well with...
```csharp
async Task Main()
{
}

#region Highlight and CTRL + SHIFT + B
void Tea_ForEach_Struct_10M()
{
    var path = @"e:\temp\teaFile6.tea";
    using (var tf = TeaFile<CandlestickLongStruct>.Create(path))
    {
        foreach (var item in Bars)
        {
            tf.Write(item);
        }
    }
    File.Delete(path);
}
void HyperSerializer_ForEach_Struct_FileStream_10M()
{
    var path = @"e:\temp\hyper.bin";
    using (var fs = File.OpenWrite(path))
    {
        foreach (var item in Bars)
        {
            Span<byte> bytes = Hyper.HyperSerializer.Serialize(item);
            fs.Write(bytes);
        }
    }
    File.Delete(path); // delete so repeated runs don't write over a stale file
}
void HyperSerializer_NoLoop_FileStream_10M()
{
    var path = @"e:\temp\hyper.bin";
    using (var fs = File.OpenWrite(path))
    {
        Span<byte> bytes = Hyper.HyperSerializer.Serialize(Bars);
        fs.Write(bytes);
    }
    File.Delete(path);
}

void HyperSerializer_ForEach_Class_FileStream_10M()
{
    var path = @"e:\temp\hyper.bin";
    using (var fs = File.OpenWrite(path))
    {
        foreach (var item in BarsClass)
        {
            var bytes = Hyper.HyperSerializer.Serialize(item);
            fs.Write(bytes);
        }
    }
    File.Delete(path);
}
#endregion
#region Benchmark Data
Random rand = new Random();

CandlestickLongStruct[] _bars;
public CandlestickLongStruct[] Bars => _bars ??=
    Enumerable.Range(0, 10_000_000).Select(x =>
        new CandlestickLongStruct
        {
            High = rand.Next(),
            Open = rand.Next(),
            Close = rand.Next(),
            Low = rand.Next()
        })
    .ToArray();

CandlestickLongClass[] _barsClass;
public CandlestickLongClass[] BarsClass => _barsClass ??=
    Enumerable.Range(0, 10_000_000).Select(x =>
        new CandlestickLongClass
        {
            High = rand.Next(),
            Open = rand.Next(),
            Close = rand.Next(),
            Low = rand.Next()
        })
    .ToArray();

public struct CandlestickLongStruct
{
    public int High { get; set; }
    public int Low { get; set; }
    public int Close { get; set; }
    public int Open { get; set; }
}

public class CandlestickLongClass
{
    public int High { get; set; }
    public int Low { get; set; }
    public int Close { get; set; }
    public int Open { get; set; }
}
#endregion
```
> Additionally I would like to know how you would solve the problem of appending data to an existing file and also getting/updating the total number of records in the file.
- You can use a standard format like CSV, build your own binary file / database / wire format - or just use something that's already built. CSV lacks functionality and isn't fast enough, rolling your own is a waste of time, and Microsoft FASTER cannot be beat - that said, I wouldn't recommend it to anyone looking for something plug-and-play in a few hours.
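For a sense of scale, the simplest roll-your-own version of append-plus-record-count is only a few lines. This is a sketch, not a recommendation: `freshCandles` is a hypothetical batch of new bars, and it assumes every serialized record has the same size (the matching file sizes above suggest this holds for `CandlestickLongStruct`):

```csharp
// Hypothetical sketch: append fixed-size records and derive the record
// count from the file length. Not a HyperSerializer or FASTER feature.
string path = "customHyper.hyper";
int recordSize = Hyper.HyperSerializer.Serialize(new CandlestickLongStruct()).Length;

using (var fs = new FileStream(path, FileMode.Append))
{
    foreach (var item in freshCandles)
        fs.Write(Hyper.HyperSerializer.Serialize(item));
}

long totalRecords = new FileInfo(path).Length / recordSize;
```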
I have a private repo with a bunch of this stuff implemented. If you send me some detail regarding what you're working on and your goals/objectives (data sources, exchanges, brokers, asset classes, strategy, etc.), we may be able to help each other out.
Hi!
Thank you very much for a very detailed answer!
Actually, the reason I am looking for a `teafiles` replacement is exactly that: I want to use classes, and the extra bit of `struct<->class` mapping that I currently use does not add any value. `HyperSerializer` has an advantage in that regard.
**Note 1**
Regarding your suggested solution to just serialize the whole batch of data - I've had trouble when trying to serialize it in one go:

```csharp
var allData = Hyper.HyperSerializer.Serialize(dataStructLong);
```
Surprisingly, no such issue occurs if `SymbolTick` from the tests is used!

```csharp
Hyper.HyperSerializer.Serialize(ticks);
```
It turns out the issue is with the `IEnumerable` type used! If I use an array, serialization works; if I use a list, it throws, e.g.

```csharp
Hyper.HyperSerializer.Serialize(dataStructLong.Take(1).ToArray()); // no exception
Hyper.HyperSerializer.Serialize(dataStructLong.Take(1).ToList()); // exception
```
**Note 2**
The second issue I've identified is connected to `HyperSerializer` warmup.
For example, the following code runs in 1420 ms in release mode (`teafiles`: 540 ms):
```csharp
var sw = Stopwatch.StartNew();

using (var fs = File.OpenWrite(customHyperFilePath))
{
    foreach (var item in dataStructLong)
    {
        fs.Write(Hyper.HyperSerializer.Serialize(item));
    }
}

sw.Stop();
var elapsedHyper = sw.ElapsedMilliseconds;
```
but with the following line added before the stopwatch, the `HyperSerializer` execution time is only 547 ms (`teafiles` is 553 ms):

```csharp
Hyper.HyperSerializer.Serialize(dataStructLong.First());
```
It means that `HyperSerializer` performs better after it has been "run" previously. Could you comment on that?
**Note 3**
Finally, in regard to your question about the use case: I'm simply collecting OHLC data from popular exchanges (currently only one) and trying to come up with a solution to persist the data. After a bunch of tests I've found that it's easier and better to store the data in binary files by means of libraries such as `teafiles` or `HyperSerializer`.
They perform substantially better than SQL or, god forbid, JSON or CSV. Not only are the reads/writes vastly faster, but all the other metrics are better as well (memory footprint, disk space usage, etc.).
Here are some results from previous tests (they did not include HS at that point)
File write performance (Serialization)
- Json: 203 ms, File size: 7.64947509765625 mb (p.s. ignore this result :) not sure what went wrong with the test there but json serialization usually takes 1+ seconds guaranteed)
- Tea: 510 ms, File size: 163.18914794921875 mb
- Csv: 2901 ms, File size: 219.56466674804688 mb
- Bin: 601 ms, File size: 163.18878173828125 mb
- Protobuf: 1170 ms, File size: 114.9475326538086 mb
- MessagePack: 1582 ms, File size: 134.53640747070312 mb
- MessagePack lz4 compressed: 813 ms, File size: 86.13397216796875 mb
- Parquet: 958 ms, File size: 31.272869110107422 mb
File read performance (De-Serialization)
- sql: 11452 ms
- json: 499 ms
- tea: 340 ms
- csv: 3352 ms
- bin: 575 ms
- protobuf: 1063 ms
- messagePack: 588 ms
- messagePack compressed: 587 ms
- parquet: 878 ms
The use case actually involves saving a huge chunk of data once, then appending fresh data to the same files on a daily basis. Everything is done synchronously, and no frequent read access is required either.
To make a long story short, I just want to drop the tea-files library due to its inability to use classes; this causes some headaches in the backend service I am building :)
Regarding the warmup: the first time HyperSerializer is used to serialize or deserialize a type, it generates a dynamic in-memory assembly containing a type that's optimized to serialize the object. Just create an initialization function in your application startup (Program.cs or wherever) that makes a call using each type you want to serialize.
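For example, a minimal warmup sketch using the types from this thread:

```csharp
// Touch each serializable type once at startup so HyperSerializer's
// dynamic assembly generation happens before any timed code runs.
static void WarmUpSerializers()
{
    Hyper.HyperSerializer.Serialize(new CandlestickLongStruct());
    Hyper.HyperSerializer.Serialize(new CandlestickLongClass());
}
```

Calling `WarmUpSerializers()` once from `Program.cs` moves the one-time code generation out of the measured path, which accounts for the ~1420 ms vs ~547 ms difference you observed.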
Regarding how to store the serialized objects, you have a few options:
- Serialize and deserialize everything to/from a single file each time - this gets messy quickly
- Build your own binary file format and indexing - time consuming
- Use one of the open source options
I highly suggest the last option. I use Microsoft FASTER with HyperSerializer. It's a complicated beast but very powerful, and the fastest KV store I've found by several orders of magnitude. It's a memory-mapped file store and has the flexibility to be used as a multi-value dictionary with a native mechanism that allows for key -> values chaining. In other words, each time you add a value (called an upsert) to the store, you can configure its callback functions to store the new record at a new memory address without overwriting the existing value, and to create a reference chain to all prior values for the same key. It allows you to read, write and update values in terabyte files in microseconds.
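For orientation, the basic FasterKV pattern looks roughly like this - a sketch based on FASTER's public getting-started sample (exact APIs vary across versions; the key -> values chaining described above needs custom callback functions that are not shown here, and in practice the value would be HyperSerializer output rather than a `long`):

```csharp
using FASTER.core;

// Minimal FasterKV sketch: one log device, one session, one upsert.
var log = Devices.CreateLogDevice(@"e:\temp\hlog.log");
var store = new FasterKV<long, long>(1L << 20, new LogSettings { LogDevice = log });

using (var session = store.NewSession(new SimpleFunctions<long, long>()))
{
    long key = 1, value = 42;
    session.Upsert(ref key, ref value);
}

store.Dispose();
log.Dispose();
```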
Using HyperSerializer w/ FASTER, I can write 10 million records to SSD in about 3 seconds. Happy to give you access to my private repo if you want to take a look at how I used it. LMK.