gramineproject / graphene

Graphene / Graphene-SGX - a library OS for Linux multi-process applications, with Intel SGX support

Home Page:https://grapheneproject.io

SGX startup is slow due to quadratic TOML processing

pwmarcz opened this issue

Description of the problem

Graphene-SGX startup is slow for manifests that have a lot of trusted_files.

Steps to reproduce

On an Ubuntu 18.04 machine with the current master branch (c602e56), run the Python example with graphene-sgx python -c "print('hello')".

Expected results

This should be relatively quick.

Actual results

The command takes 10 seconds:

real	0m10.816s
user	0m9.311s
sys	0m1.473s

It looks like python.manifest.sgx contains a lot of files (all of /usr/lib/python3, /lib/x86_64-linux-gnu, /usr/lib/x86_64-linux-gnu):

$ wc -l python.manifest.sgx
30049 python.manifest.sgx
$ grep /usr/lib/python3/ python.manifest.sgx | wc -l
10263
$ grep x86_64-linux-gnu python.manifest.sgx | wc -l
3621

Stopping in GDB shows that time is spent in this loop:

    for (ssize_t i = 0; i < toml_trusted_files_cnt; i++) {
        const char* toml_trusted_file_key = toml_key_in(toml_trusted_files, i);
        assert(toml_trusted_file_key);
        toml_raw_t toml_trusted_file_raw = toml_raw_in(toml_trusted_files, toml_trusted_file_key);
        // ...
    }

It looks like toml_raw_in does a linear traversal of the whole trusted_files table.
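Why this makes the loop quadratic can be shown with a toy model. The sketch below is not tomlc99 code: it assumes, as suspected above, that a by-key lookup scans the table entries from the start, and ToyTable, key_in and raw_in are hypothetical names that only mirror the calls in the loop.

```python
class ToyTable:
    """Toy stand-in for a TOML table stored as an unsorted list of pairs."""

    def __init__(self, pairs):
        self.pairs = list(pairs)
        self.comparisons = 0  # key comparisons performed by raw_in()

    def key_in(self, index):
        # Index-based access: constant time.
        return self.pairs[index][0]

    def raw_in(self, key):
        # Key-based access: linear scan from the start (the assumption here).
        for k, v in self.pairs:
            self.comparisons += 1
            if k == key:
                return v
        return None


def count_comparisons(n):
    """Run the same loop shape as above over n entries and count the work."""
    table = ToyTable((f"file{i}", f"file:/lib/{i}") for i in range(n))
    for i in range(n):
        table.raw_in(table.key_in(i))
    return table.comparisons
```

Under this assumption the loop performs n(n+1)/2 comparisons for n entries, so a table with 10,000 entries needs about 50 million of them.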

Yup, that's because we're using the wrong TOML constructs for this: we should use arrays, not dictionaries (which are slow, and the keys make no sense here). But this will be resolved when we completely fix #2076.

Using TOML tables instead of TOML arrays also blocks my other PR: #2484

I started working on this transition. Here is the idea:

  • Refactor all relevant places in code (#2607)
  • First try to parse the legacy TOML-table syntax sgx.allowed_files.bla = "file"; if it is not found, try the new TOML-array syntax sgx.allowed_files = ["file1", ...]
    • We keep legacy TOML-table syntax purely for compatibility reasons; we deprecate it and at some point we may drop it
  • Add a better name sgx.passthrough_files to the legacy unclear sgx.allowed_files
    • We keep legacy sgx.allowed_files name purely for compatibility reasons; we deprecate it and at some point we may drop it
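The fallback between the two syntaxes can be sketched in Python on an already-parsed manifest. This is a minimal illustration, not the code from #2607: read_file_list is a hypothetical helper, and it assumes the sgx section has been loaded into a plain dict.

```python
import warnings


def read_file_list(sgx, key):
    """Return the URIs stored under sgx[key], accepting both syntaxes."""
    value = sgx.get(key)
    if value is None:
        return []
    if isinstance(value, dict):
        # Legacy TOML-table syntax: sgx.allowed_files.bla = "file"
        warnings.warn(
            f"sgx.{key}: TOML-table syntax is deprecated, use a TOML array",
            DeprecationWarning,
        )
        return list(value.values())
    if isinstance(value, list):
        # New TOML-array syntax: sgx.allowed_files = ["file1", ...]
        return list(value)
    raise TypeError(f"sgx.{key} must be a TOML table or a TOML array")
```

The keys of the legacy table are dropped on purpose, since they carry no meaning for the file list.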

@dimakuv What about sgx.trusted_checksum though? Will it remain a table (with the same quadratic-lookup problem), or will it be an array (and the parsing code will need to "zip" both arrays), or...?

sgx.trusted_checksum will be an array. Effectively, sgx.trusted_files[index] = "file:bla" has a corresponding item sgx.trusted_checksum[index] = "12345...".
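A minimal sketch of how the two parallel arrays could be paired up, assuming both have already been read into Python lists; pair_files_with_checksums is a hypothetical name, and the length check is an added safeguard rather than something described in this thread.

```python
def pair_files_with_checksums(trusted_files, trusted_checksum):
    """Pair trusted_files[i] with trusted_checksum[i] for every index i."""
    if len(trusted_files) != len(trusted_checksum):
        raise ValueError(
            f"{len(trusted_files)} trusted files but "
            f"{len(trusted_checksum)} checksums"
        )
    return list(zip(trusted_files, trusted_checksum))
```

Each lookup is then by index, so walking the whole list is linear in the number of files.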

I have a branch in my local repo, I'll publish it after #2607 is merged.

There are also those other efforts related to partial manifests and HSM signing, and I'm not sure what the manifest structure should look like. In the case of partial manifests (i.e. the situation where you don't have all the trusted/protected files on your machine and you rely on externally provided hashes), don't you want something like:

sgx.trusted_files = [
 { 'path' = '/q/werty', 'sha256' = 'deadbeef' },
]

# or maybe
[[sgx.trusted_files]]
path = '/asdf/zxcv'
sha256 = 'abcd'

?

Because managing parallel arrays, while certainly possible to get right, might be more error-prone.

sgx.trusted_files = [
  { 'path' = '/q/werty', 'sha256' = 'deadbeef' },
]

Definitely doable, though I wouldn't consider it important. sgx.trusted_checksum is a Graphene-SGX-internal feature which users never use or even know about. How exactly this is implemented in the final .manifest.sgx and in Graphene-SGX code should be irrelevant to the users/developers.

Forcing users to use an "array of two-field tables" sounds much more complicated than my current "array of file paths":

sgx.trusted_files = [
  "file:{{ graphene.runtimedir() }}/",
  "file:{{ entrypoint }}",
]

sgx.allowed_files = [
  "file:tmp/",
  "file:root", # for getdents test
  "file:testfile" # for mmap_file test
]

Anyway, my points are:

  1. I want to have the new syntax for sgx.{allowed/trusted/protected}_files as above, just a TOML array
  2. The part with SHA256 hashes (historically called sgx.trusted_checksum) is Graphene-internal and it doesn't matter much how it is implemented; we can change it later without anyone noticing

In the case of partial manifests (i.e. the situation where you don't have all the trusted/protected files on your machine and you rely on externally provided hashes)

I am not aware of such scenarios. Can this really happen for sgx.trusted_files? (Please note that sgx.protected_files works in a completely different way, there is no SHA256 hash associated with them.)

Yes, there are at least two scenarios for trusted_files:

  • the file is confidential (e.g. ML weights) and we don't want to keep it around;
  • the file is very big and we don't want to keep a copy on the build server for no other reason than to recalculate its hash.

So we need to allow for "partially finalised" manifests and for merging several manifests in various stages of finalisation. From this POV it's not internal anymore, unless you want some manifests that look like manifests, are still unsigned, but that you'd better not touch by hand.

If you'd like to preserve simplicity of an array of strings, trusted_files could be an array of (string or two-key hash), if that's not too much work.

I like this idea, it preserves simplicity for usual use-cases, but doesn't block more complicated ones.
And I think we already had someone asking for support for providing hashes for some of the trusted files without the corresponding data.
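One way to handle such a mixed array is to normalize every entry into the same shape right after parsing. The sketch below is only an illustration: normalize_trusted_files is a hypothetical helper, and uri and hash are assumed key names (path and sha256 were also floated above).

```python
def normalize_trusted_files(entries):
    """Turn a mixed array of strings and two-key tables into uniform dicts."""
    normalized = []
    for entry in entries:
        if isinstance(entry, str):
            # Plain string: the hash is not known yet and is computed later.
            normalized.append({"uri": entry, "hash": None})
        elif isinstance(entry, dict):
            unknown = set(entry) - {"uri", "hash"}
            if "uri" not in entry or unknown:
                raise ValueError(f"malformed trusted file entry: {entry!r}")
            normalized.append({"uri": entry["uri"], "hash": entry.get("hash")})
        else:
            raise TypeError(f"unsupported trusted file entry: {entry!r}")
    return normalized
```

Entries that already carry a hash keep it, so a partially finalised manifest can pass through the same code path as a plain list of strings.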

Ok, let me implement Woju's approach.

So I tried this:

sgx.trusted_files = [
  "file:exec_victim",
  {uri = "file:trusted_testfile", hash = "deadbeef"}
]

And got a Python TOML error:

  File "/home/dimakuv/graphene/built/bin/graphene-sgx-sign", line 5, in <module>
    sys.exit(main())
  File "/home/dimakuv/graphene/built/lib/python3.6/site-packages/graphenelibos/sgx_sign.py", line 825, in main
    manifest = read_manifest(manifest_path)
  File "/home/dimakuv/graphene/built/lib/python3.6/site-packages/graphenelibos/sgx_sign.py", line 683, in read_manifest
    manifest = toml.load(path)
  File "/home/dimakuv/.local/lib/python3.6/site-packages/toml/decoder.py", line 134, in load
    return loads(ffile.read(), _dict, decoder)
  File "/home/dimakuv/.local/lib/python3.6/site-packages/toml/decoder.py", line 512, in loads
    multibackslash)
  File "/home/dimakuv/.local/lib/python3.6/site-packages/toml/decoder.py", line 778, in load_line
    value, vtype = self.load_value(pair[1], strictly_valid)
  File "/home/dimakuv/.local/lib/python3.6/site-packages/toml/decoder.py", line 880, in load_value
    return (self.load_array(v), "array")
  File "/home/dimakuv/.local/lib/python3.6/site-packages/toml/decoder.py", line 1002, in load_array
    a[b] = a[b] + ',' + a[b + 1]
IndexError: list index out of range

So yeah, Python's TOML parser doesn't support mixed arrays: uiri/toml#270. Actually, looking at this GitHub repo, the project seems to be dying? There has been no commit activity in the last couple of months (I think since January 2021).

But this workaround works:

sgx.trusted_files = [
  "file:exec_victim",
]

[[sgx.trusted_files]]
uri = "file:trusted_testfile"
hash = "deadbeef"

Oh nice, our C TOML parser doesn't support mixed arrays either:

$ graphene-direct ./helloworld
error: PAL failed at parsing the manifest: line 35: array mismatch

Well, the latest version supports it: cktan/tomlc99#51

I will update our TOML C parser to this latest version then.

Ok, I implemented everything in my local branch.

My Python SGX manifest is similar to Pawel's in terms of the number of Python-internal files:

$ wc -l python.manifest.sgx
33359 python.manifest.sgx

Old times:

$ time graphene-sgx python -c "print('hello')"
hello

real    0m6.026s
user    0m4.254s
sys     0m1.737s

New times:

$ time graphene-sgx python -c "print('hello')"
hello

real    0m3.007s
user    0m0.873s
sys     0m2.098s

About a 5x improvement (looking at user time).
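For reference, the ratios behind that figure can be recomputed from the two time outputs above; the dictionary names are just for illustration.

```python
# Timings copied from the two `time` runs above, in seconds.
old = {"real": 6.026, "user": 4.254, "sys": 1.737}
new = {"real": 3.007, "user": 0.873, "sys": 2.098}

# Speedup per category: values above 1 mean the new code is faster.
speedup = {name: old[name] / new[name] for name in old}

for name, ratio in speedup.items():
    print(f"{name}: {ratio:.2f}x")
```

User time improves by a bit under 5x and wall-clock time by about 2x, while sys time is slightly higher in the new run.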