zrlio / albis

Albis: High-Performance File Format for Big Data Systems

Question about Albis

ArvinDevel opened this issue · comments

commented

Hi, I'm curious about Albis, and I have some questions after reading your paper:

  1. Is a bigger I/O buffer really harmful to cache performance?
    "the continuing use of large I/O buffers is detrimental to the cache behavior and performance. For example, on a 16 core machine with a 128 MB buffer for each task, the memory footprint of a workload would be 2 GB, a much larger quantity than the modern cache sizes."
    When the buffer in memory is larger than the CPU cache, it reduces the number of I/O exchanges between memory and the device. As we all know, there is still an order-of-magnitude latency gap between memory and external devices, so I would think the appropriate buffer size should be larger than the cache size. I hope you can explain this.
  2. Why is the HDFS write bandwidth so poor?

Thanks!

Hi Arvin

  1. The buffer size represents a trade-off between I/O performance (the cost of moving data between DRAM, i.e. the buffer cache, and the I/O device) and CPU cache hit/miss performance. A small buffer improves CPU cache behavior while degrading I/O performance from the device; a large buffer might improve I/O performance from the device while sacrificing CPU cache hits. In the paper we have shown that, specifically for high-end NVMe devices, the trade-off has shifted towards small buffer sizes because NVMe devices are getting very close to DRAM performance. If one is using a slow I/O device like a disk or an old flash SSD, then it still makes sense to use a large buffer of 100s of MB. The right buffer size depends upon the CPU, the cache size, and the speed of the I/O device (the first sketch after this list illustrates the trade-off).

  2. I am not sure, but I suspect it has to do with the longer code path (very inefficient for small I/Os, e.g., header and footer writes) and the multiple copies that happen on the write side. HDFS only has a write(byte[]) interface, so data has to be staged in, and copied out of, on-heap byte arrays (see the second sketch below).
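A minimal sketch of the buffer-size trade-off from point 1 (not Albis code; class and method names are made up). The only knob is the size of the byte[] handed to read(): around 1 MB it stays resident in the CPU cache, around 128 MB it issues fewer reads but evicts most of a typical last-level cache.

```java
import java.io.IOException;
import java.io.InputStream;

public class BufferTradeoff {
    // A small bufferSize (~1 MB) keeps the buffer cache-resident but issues
    // more read() calls; a large one (~128 MB) amortizes per-call and device
    // overheads but evicts most of the CPU's last-level cache.
    static long consume(InputStream in, int bufferSize) throws IOException {
        byte[] buf = new byte[bufferSize];
        long checksum = 0;
        int n;
        while ((n = in.read(buf)) > 0) {
            // Touch every byte to model the CPU-side work of parsing a file format.
            for (int i = 0; i < n; i++) {
                checksum += buf[i];
            }
        }
        return checksum;
    }
}
```

And, for point 2, a hedged illustration of the extra copy on the HDFS write path: because the client-facing stream only accepts on-heap byte arrays, data staged off-heap has to be copied into a byte[] first (the HDFS client may copy it again internally into its own packet buffers). The names here are hypothetical.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import org.apache.hadoop.fs.FSDataOutputStream;

public class HdfsWriteCopy {
    static void flush(ByteBuffer offHeap, FSDataOutputStream out) throws IOException {
        byte[] onHeap = new byte[offHeap.remaining()]; // extra allocation
        offHeap.get(onHeap);                           // copy: off-heap -> on-heap byte[]
        out.write(onHeap);                             // write(byte[]) is the only option here
    }
}
```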

commented

@animeshtrivedi Thank you very much.
I'm still confused about the trade-off; it's a bit abstract. I can't see the exact reason why a small buffer size reduces cache misses. Is the cache miss rate averaged over time? Is it that the small buffer reduces CPU execution time and the time the CPU is stalled, so the cache miss rate decreases?
And for ORC and Parquet, do you have detailed data on the reasons for their cache misses? If we decrease the block size of ORC and Parquet just as Table 3 does, say to 1 MiB like Albis, will they have cache misses as low as Albis?
For the instructions/row in Table 5, I don't understand why ORC has such a large value compared to Albis. In my opinion, the way they retrieve data and metadata is almost the same, and neither has any complex encoding at all.
I would appreciate your help, thanks.

Small buffers reduce CPU cache misses because the buffer can fit completely in the CPU cache. That indeed reduces the number of CPU stalls and the time the CPU has to spend accessing DRAM. Furthermore, the I/O buffer in Albis is used repeatedly, so it stays cached (no new virtual addresses).

I don't have a detailed explanation of why Parquet and ORC are so poor with CPU cache misses. A part of it is definitely their I/O buffer size (~128 MB vs. 1 MB for Albis). You can see that when the I/O buffer size is decreased, the CPU cache profile improves; have a look at Table 3. Performance improves as the I/O buffer size is reduced, but eventually it is bottlenecked by the number of instructions required (see the same table). ORC had very poor performance for buffer sizes below 32 MB; I don't know why. Further factors such as inefficient code, poor buffer management (they don't recycle buffers; see the sketch below), etc. contribute to this situation.
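To make the buffer-recycling point concrete, here is a rough contrast (not taken from the Albis, Parquet, or ORC sources): the first method reuses a single 1 MB buffer, so the same virtual addresses stay warm in the CPU cache; the second allocates a fresh buffer for every read, which means cold cache lines and extra garbage-collection work.

```java
import java.io.IOException;
import java.io.InputStream;

public class BufferRecycling {
    private final byte[] recycled = new byte[1 << 20]; // 1 MB, allocated once

    // Reuses the same buffer on every read: its cache lines stay warm.
    long readRecycled(InputStream in) throws IOException {
        long total = 0;
        int n;
        while ((n = in.read(recycled)) > 0) {
            total += n;
        }
        return total;
    }

    // Allocates a fresh buffer per read (no recycling): new virtual addresses
    // every time, mostly cold CPU cache lines, and extra pressure on the GC.
    long readWithoutRecycling(InputStream in, int bufferSize) throws IOException {
        long total = 0;
        int n;
        do {
            byte[] buf = new byte[bufferSize];
            n = in.read(buf);
            if (n > 0) {
                total += n;
            }
        } while (n > 0);
        return total;
    }
}
```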

As far as I know, ORC still uses some type-specific encoding (RLE for integers). You can see this from the size of the file. You can do this experiment: store 1 billion integers; if the file size is less than 4 GB, then you know ORC is encoding/compressing the data (a small sketch of this check follows below). You can also see this in Table 6: the raw size of the TPC-DS data set is close to 100 GB, which is what Albis shows, while others like Parquet and ORC show much smaller sizes.
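A small, self-contained sketch of the suggested experiment: after writing 1 billion integers with a given format, compare the file size on disk against the raw size (10^9 * 4 bytes = 4 GB). The file path is passed as an argument; everything else is plain JDK.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class EncodingCheck {
    public static void main(String[] args) throws IOException {
        long rows = 1_000_000_000L;
        long rawBytes = rows * Integer.BYTES;   // 4 GB of raw integer data
        Path file = Paths.get(args[0]);         // e.g. the ORC/Parquet output file
        long onDisk = Files.size(file);
        System.out.printf("raw ~%.1f GB, on disk %.1f GB, ratio %.2f%n",
                rawBytes / 1e9, onDisk / 1e9, (double) onDisk / rawBytes);
        // A ratio well below 1.0 means the format is applying encoding
        // (e.g. RLE) or compression to the integers.
    }
}
```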

Does this make sense?

commented

Thanks a lot, that is clearer.
Sorry for more questions.
I think the dataset of the workload was small, e.g., 100 GiB. As far as I know, Spark shuffle first writes the data to local files, so I don't understand which overhead you are trying to avoid. And shouldn't the shuffle overhead be the same for all of the file formats? Could you share results with bigger dataset sizes?

We tried to avoid the overheads associated with writing to disk when shuffling, so we mounted a tmpfs as the shuffle location to keep the shuffle data in DRAM instead of on disk (a configuration sketch follows below). Yes, you are right that the overheads would be equal for all formats, but if that overhead is much larger than the gains on the input side, then it is hard to show end-to-end gains in a workload. Say, for example, the input with Parquet takes 10 seconds and the shuffle overhead is 100 seconds. Even if Albis improves the input performance by 50%, from 10 seconds to 5 seconds, that is only a 4.5% improvement in the end-to-end run (110 secs vs. 105 secs).
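A hedged sketch of that setup: pointing Spark's local scratch space at a tmpfs mount so shuffle files land in DRAM. The mount point /mnt/tmpfs and the app name are assumptions; spark.local.dir is the standard knob for local scratch/shuffle files (cluster managers such as YARN may override it with their own local-dir settings).

```java
import org.apache.spark.SparkConf;

public class TmpfsShuffleConf {
    public static SparkConf build() {
        return new SparkConf()
                .setAppName("albis-benchmark")          // hypothetical app name
                .set("spark.local.dir", "/mnt/tmpfs");  // e.g. mount -t tmpfs -o size=64g tmpfs /mnt/tmpfs
    }
}
```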

Yes, 100 GiB is a small data set. We are working on generating results with bigger data set sizes.

commented

Thank you very much!

commented

Hi, what's the plan for open-sourcing Albis? I hope to see the source code, thanks.

Following up here. Any plans to open source the code?

Hi Laurence - unfortunately I recently switched jobs and have not had much time to work with the code. The plan is still to open source it, but it might take a while. I can try to get a rough-and-dirty version out soon.