Unified notation for representing a packing of tensor data into a set of RLWE ciphertexts
j2kun opened this issue
Based on discussions in #906 (e.g., #906 (comment)), I think it would be prudent to have everyone agree on a notation for packing layouts. As I see it, the options are:
- TileTensor notation
- Fhelipe's index-bitstring (see #800)
- A formalization of @edwjchen's comment #906 (comment)
IMO we would ultimately formalize this as a custom attribute, similar to affine_map.
This would need to support:
- Packing multiple data items into one ciphertext (e.g., the example in Fhelipe of packing an entire matrix into a single ciphertext)
- Splitting one data item across multiple ciphertexts
- Packing multiple data across multiple ciphertexts?
Here are some notes on the different packing abstractions from Fhelipe and Viaduct. At a high level, both abstractions are interchangeable and can support all three use cases listed (packing multiple data into a single ciphertext, splitting data between ciphertexts, packing multiple data across multiple ciphertexts). Fhelipe's abstraction seems more succinct.
Thanks for that! It sounds like Fhelipe is the one we should start with. I'd like to do a quick implementation of the notation in Python, with some unit tests, to ensure I understand it and to use it as a reference for whoever ends up implementing this.
So I added an implementation of Fhelipe notation (without ciphertext-selecting bits!) here: https://github.com/j2kun/fhe-packing/blob/78b1c450b90cf15986f2a063c778a611f68cc63d/fhelipe.py#L120
I was wondering if it's possible to do Halevi-Shoup in this abstraction. I added a FIXME in fhelipe_test.py because I couldn't figure it out and I have to stop to go feed my infant daughter :-P
Here are some notes on the different packing abstractions from Fhelipe and Viaduct. At a high level, both abstractions are interchangeable and can support all three use cases listed (packing multiple data into a single ciphertext, splitting data between ciphertexts, packing multiple data across multiple ciphertexts). Fhelipe's abstraction seems more succinct.
Thanks! I don't think I've fully grokked either representation yet, so I'll need to take a closer look at both papers again.
In the meantime, I wanted to ask if you could elaborate on the (dis)advantages of the TileTensor notation, too?
More generally, given that the Fhelipe and Viaduct syntax seems pretty complex already, I wonder what the advantages of the custom syntax would be for the compiler (i.e., what kind of optimizations do they enable/make easier)?
EDIT: One that I can think of already is that it seems much easier to "compute" the required conversion from one packing to another with the custom syntaxes, rather than arbitrary affine-map-style expressions.
If we want to go for an approach closer to affine_map, we could (as far as I can tell) directly express pretty much every possible packing as a direct translation (given as an affine expression) between the dimensions/indices of a cleartext/application tensor and the corresponding indices in a tensor of ciphertexts and their slots. This would also allow us to use the full hypercube view of packing/rotations (which I'm not sure any of the "custom" approaches currently seem to support?) and would of course also directly allow expressing diagonal packings.
EDIT2: Though I'm not sure how we could represent the "replication" style things that, e.g., TileTensor notation has.
One that I can think of already is that it seems much easier to "compute" the required conversion from one packing to another with the custom syntaxes, rather than arbitrary affine-map-style expressions.
My intuition says that for affine map, once you have the concrete size of the tensor, this would be doable just by expanding the iteration space (just run the loop and write down all the indices you get) for the input and output affine maps, then that gives you a permutation and you can (at worst using the naive method) implement the conversion with mask+shift.
What I don't immediately see is how to compute it using the compact form of the affine map.
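The brute-force version of that idea can be sketched in a few lines of Python. This is a hedged sketch with hand-written layout functions standing in for affine_map expressions (the names `row_major` and `diagonal` are my own, not HEIR code):

```python
# Sketch: derive a slot permutation between two packings by enumerating
# the (small, concrete) iteration space of each index map.
d = 4

def row_major(i, j):
    # [i][j] => [i][j], flattened row-major
    return i * d + j

def diagonal(i, j):
    # [i][j] => [j][(i+j) % d], flattened row-major
    return j * d + (i + j) % d

# Run the loop, record where each element lands under each layout map.
perm = [None] * (d * d)
for i in range(d):
    for j in range(d):
        perm[diagonal(i, j)] = row_major(i, j)

# perm[s] = source slot (in the row-major packing) feeding target slot s;
# this permutation can then be implemented naively with mask+rotate.
```

The compact (symbolic) version of this computation is exactly the open question above.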
This would also allow us to use the full hypercube view of packing/rotations (which I'm not sure any of the "custom" approaches currently seem to support?) and would of course also directly allow expressing diagonal packings.
So one way of writing the diagonalization of a row-major square tensor with side length d using a function on the indices is `[i][j] => [j][(i+j) % d]`. Is this considered an affine transformation?
Though I'm not sure how we could represent the "replication" style things that, e.g., TileTensor notation has.
Replication within a ciphertext can be represented by using empty dimensions in Viaduct. Likewise, there's a replication function to fill gap slots in Fhelipe (though this representation is usually tied with a corresponding computation that requires replication e.g., replicating a vector for matrix-vector multiplication).
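As a toy illustration of the empty-dimension idea (my own sketch of my reading of it, not Viaduct's actual syntax): in a layout like `V[4][j]`, the leading size-4 dimension consumes no index of V, so flattening the layout replicates V across the slots.

```python
# Toy model of an "empty" leading dimension: layout V[4][j] over a
# length-4 vector V, flattened row-major into 16 slots.
# Slot (a, j) holds V[j] for every a, so V appears 4 times.
def pack_with_empty_dim(v, empty_size):
    return [v[j] for _ in range(empty_size) for j in range(len(v))]

slots = pack_with_empty_dim([10, 11, 12, 13], 4)
# slots == [10, 11, 12, 13, 10, 11, 12, 13, ...]
```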
This would also allow us to use the full hypercube view of packing/rotations (which I'm not sure any of the "custom" approaches currently seem to support?) and would of course also directly allow expressing diagonal packings.
So one way of writing the diagonalization of a row-major square tensor with side length d using a function on the indices is `[i][j] => [j][(i+j) % d]`. Is this considered an affine transformation?
Yes: https://mlir.llvm.org/docs/Dialects/Affine/#affine-expressions
The diagonal packing can be directly expressed as an affine_map today. Can this be expressed in Fhelipe?
Edit: MLIR calls them affine but they are quasi-affine. It seems using the floordiv-by-constant capability here, we can also support replication, and with scalar mul by constant we can support striding.
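Concretely, here is a small sketch of those two quasi-affine tricks, using plain Python index arithmetic in place of affine expressions (the sizes and data are made up for illustration):

```python
# Quasi-affine index expressions: floordiv by a constant yields
# replication, multiplication by a constant yields striding.
n = 8                    # number of slots
data = [10, 11, 12, 13]  # cleartext vector

# Replication: slot s holds data[s floordiv 2], repeating each element.
replicated = [data[s // 2] for s in range(n)]
# -> [10, 10, 11, 11, 12, 12, 13, 13]

# Striding: element i lands at slot 2*i (gap slots left as 0 here).
strided = [0] * n
for i, x in enumerate(data):
    strided[2 * i] = x
# -> [10, 0, 11, 0, 12, 0, 13, 0]
```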
In the meantime, I wanted to ask if you could elaborate on the (dis)advantages of the TileTensor notation, too?
I added the TileTensors notation for Fhelipe's LogReg program to the end of the google doc!
One benefit is that it can allow for tiling and replication within a dimension. In the case where tiles are reused, replicating a tile and broadcasting it is fairly cheap, which can help with performance.
One disadvantage is that the notation is unable to represent compaction, a technique introduced in Fhelipe to compress multiple ciphertexts together if there are gap (zeroed out) slots.
The diagonal packing can be directly expressed as an affine_map today. Can this be expressed in Fhelipe?
Out of the box, we can't express diagonalization in Fhelipe. However, if we introduce Rolls (or some representation of the affine_map permutation) to the notation, we can!
edit:
For example, let M be a tensor of shape (4,4) with dimensions i and j, i.e., M[i][j]. The diagonalization of M is `Roll(i,j) (j 3:0; i 3:0)`. Recall that `Roll(i,j)` transforms the index `i => (i+j) % len(i)`. We can then see how this schedule relates back to the equation `[i][j] => [j][(i+j) % d]`!
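To sanity-check that correspondence, here is the data movement of `Roll(i,j)` on a concrete 4x4 tensor. The semantics here are my reading of the comment above (hypothetical, not Fhelipe's actual implementation):

```python
# Hypothetical Roll(i, j) on a 4x4 tensor: index i is remapped to
# (i + j) % d while j is unchanged.
d = 4
M = [[r * d + c for c in range(d)] for r in range(d)]

rolled = [[None] * d for _ in range(d)]
for i in range(d):
    for j in range(d):
        rolled[(i + j) % d][j] = M[i][j]

# Column j of `rolled` is column j of M cyclically shifted down by j,
# which is exactly the movement the map [i][j] => [j][(i+j) % d]
# performs on the i index (before the schedule swaps dimension order).
```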
Does HEIR support lowering affine_map expressions, e.g. diagonalization, into an FHE circuit? If so, how would it reason about the required data movement and rotations?
No, this is what we'd have to implement. Picking the starting representation is the first choice. I'm leaning toward reusing affine_map, but I want to do some more experiments to ensure we can represent everything the other notations can represent, and try to understand what it can do that the others cannot.
Does HEIR support lowering affine_map expressions, e.g. diagonalization, into an FHE circuit? If so, how would it reason about the required data movement and rotations?
Not yet, as Jeremy said, but your question made me realize there are actually two different things going on here:
Analysis / Packing-Decision-Making Phase
The compiler is presented with a "cleartext" program that performs operations on tensors and their elements, primarily falling into three categories for us:
- `tensor` dialect operations like `extract`/`insert`/`slice`/etc. that go between elements and tensors
- `arith` dialect operations like `addi`/`muli`/etc. that can be applied both to entire tensors and to extracted elements
- `linalg` dialect operations (e.g., convolution, matrix multiplications, etc.)

plus, potentially, some even higher-level constructs, though for now it looks like LinAlg can handle most of the ML things we want to express.
Given such an input program, the task of the compiler is to figure out a bunch of different (but related) things:
- re-"vectorize" program structures that are expressed not on entire tensors but on their elements (we currently have the HECO heuristics for this, but they're pretty limited) so they can be performed using mostly SIMD operations
- decide how each input should be encoded/encrypted into ciphertexts (as doing weird packing stuff there is basically "free", compared to changing representation later)
- decide what algorithm (and therefore representation) to use for each instance of the high-level operations (e.g., matmul), trading off the efficiency of the algorithm against the cost of switching representations.
As far as I can see, this is really what most of the academic work on the topic is focused on, and this seems to be where a combination of (sometimes implicit/hidden) assumptions, restrictions on the input language, custom packing abstractions and lots of optimizer-number-crunching come together to get the academic results.
"Execution" / Lowering-to-native-FHE-ops Phase
Once the packing-related decisions have been made (which might be explicitly represented in the IR or merely in-memory as the result of an Analysis, depending on how we implement that phase), we need to actually lower things like "switch ciphertext X from this packing to that packing", or "perform a Halevi-Shoup-style matrix-vector product".
I think this part is relatively straightforward, and I can absolutely see an affine_map-style representation of packing being useful here, as doing what Jeremy suggested (expanding the iteration space), while probably not super efficient, gives rise to a really straightforward lowering. However, I'm not sure the same approach would work for the analysis phase, where we might need to search a potentially massive design space, and having an efficient way to express packings and conversions between them might be crucial to the efficiency of that search/optimization.
For what it's worth, I don't think Fhelipe does any sophisticated analysis here. They pick a layout for inputs, propagate forward through ops, insert layout conversion ops as needed, and greedily hoist the conversions through ops to try to eliminate them.
Perhaps the hard part of using affine_map is modifying the map as it "passes through" an op? I.e. forward/backward propagating the layout change from the affine map description alone? I'd be curious to work through that problem in more detail.
One thing to consider when using the affine_map style representation is how to derive the output packing from a given (edit: tensor) operation. Here is one example where I got stuck before.
For example, let's reuse the matrix-vector multiplication example with tensors M[4,4] and V[4].
```python
M = [4,4]
V = [4]
V2 = [4]  # result vector
for i in range(4):
    for j in range(4):
        V2[i] += M[i][j] * V[j]
```
```
# assign M to be row-major packed
M[i][j]
V[4][j]
=> V2[i][4]

# assign M to be col-major packed
M[j][i]
V[j][4]
=> V2[4][i]

# assign M to be diagonal-major packed
M[j][(i+j)%4]
V[4][(i+j)%4]
=> V2[4][i]
```
In the row- and col-major packings, we can clearly see the sum dimension is j, and likewise i added to an empty dimension, 4, propagates i to the output layout. However, these rules seem to break when looking at the diagonal-major packing... 🤔
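One way to see the difference: in the diagonal method the j-summation happens across rotations rather than along a slot dimension, so the output layout isn't derived by "cancelling" a summed index. A concrete sketch, using the standard diagonal order `diag_k[i] = M[i][(i+k) % d]` (which indexes slightly differently from the map above, but shows the same effect):

```python
# Diagonal (Halevi-Shoup-style) matrix-vector product on concrete 4x4
# data, to see where the j-summation "goes" in slot space.
d = 4
M = [[r * d + c for c in range(d)] for r in range(d)]
V = [1, 2, 3, 4]

def rot(v, k):
    # cyclic left rotation by k
    return [v[(i + k) % d] for i in range(d)]

# diag_k[i] = M[i][(i+k) % d]: the k-th generalized diagonal of M.
diags = [[M[i][(i + k) % d] for i in range(d)] for k in range(d)]

# result[i] = sum_k diag_k[i] * V[(i+k) % d] = sum_j M[i][j] * V[j]:
# each k contributes one (i, j) pair per slot, and summing over the d
# rotations covers all j — no slot dimension is ever "summed away".
result = [sum(diags[k][i] * rot(V, k)[i] for k in range(d))
          for i in range(d)]

expected = [sum(M[i][j] * V[j] for j in range(d)) for i in range(d)]
```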