microsoft / pai

Resource scheduling and cluster management for AI

Home Page: https://openpai.readthedocs.io

Samba / NFS integration via storage-manager: documentation needed / behavior introduces a security problem

lkaupp opened this issue · comments

Organization Name: Research Center Applied Informatics

Short summary about the issue/question:

NFS / Samba integration works only partially. Is it possible to configure the storage-manager further? NFS access is not restricted to the IPs of the cluster, so it opens a security hole in your infrastructure for everyone. Tested from another server in our organization: it was possible to mount the NFS endpoint. Next, NFS works, PVC+PV as well, but the started Samba instance with the user/password configured in service-configuration is not connectable (I followed the tutorial one-by-one, except for using a custom user/password instead of the standard smbuser/smbpwd).

Every time I try to log in from another Unix system with smbclient -U myOwnUser -L //endpointip/data I get:

session setup failed: NT_STATUS_LOGON_FAILURE

Brief what process you are following:
Followed the tutorial of adding a new node: https://openpai.readthedocs.io/en/latest/manual/cluster-admin/how-to-add-and-remove-nodes.html
Followed the "How to Set Up Storage" example: https://openpai.readthedocs.io/en/latest/manual/cluster-admin/how-to-set-up-storage.html#example-use-storage-manager-to-create-an-nfs-samba-server

NFS works (too well, in fact: it is open to everyone). PVC+PV work like a charm and jobs can create files, but Samba with a custom user/password does not work. Is more documentation available on how to restrict NFS access to the cluster nodes only, and on how to configure and test the integrated Samba service?

  • Operating type: Initial deployment / operating

How to reproduce it: Setup and use the integration storage-manager

OpenPAI Environment:

  • OpenPAI version: v1.8.0
  • Cloud provider or hardware configuration: Self-hosted
  • OS (e.g. from /etc/os-release): All Systems - Ubuntu 20.04 LTS
  • Kernel (e.g. uname -a): 5.4.0-47-generic Ubuntu SMP Fri Sep 4 19:50:52 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux
  • Hardware (e.g. core number, memory size, storage size, GPU type etc.):

master: 10c/40gb/2TB
cpu-w1: 20c/20gb/1TB
cpu-w2: 20c/20gb/1TB
cpu-w3: 20c/20gb/1TB
gpu-w1: 1gpu/6c/20gb/1TB
storage-w: 2c/2gb/4TB <-- added after deployment (everything works: NFS is running and mountable, the node shows up in the web portal; it was added to the cluster but not to a VC, and tagged with pai-storage: "true")

  • Others:

Anything else we need to know:
Tested again: with smbuser/smbpwd it was possible to connect... so despite what is written in the config, it uses the default user/password.
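A quick way to confirm which credentials the Samba service actually accepts is to try both sets with smbclient. A minimal sketch, assuming the endpoint IP and share name from this thread are placeholders for your storage node:

```shell
# Hypothetical endpoint; replace with your storage node's IP.
ENDPOINT=endpointip

# Try the custom credentials from service-configuration
# (in this case: fails with NT_STATUS_LOGON_FAILURE).
smbclient -U myOwnUser "//$ENDPOINT/data" -c 'ls'

# Try the storage-manager defaults
# (in this case: connects after entering smbpwd).
smbclient -U smbuser "//$ENDPOINT/data" -c 'ls'
```

If the default account works while the configured one does not, the configured credentials were never applied to the Samba instance.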

Thanks @lkaupp. The storage-manager we provide is just for internal use. Its NFS setup is simple and not recommended for production environments.
For production environments, we encourage you to use cloud storage such as AzureBlob...

We also allow customers to set up their own storage and integrate it with OpenPAI. So you can set up your NFS server with a more robust configuration, then use the PAI API to make the storage available to certain groups.
You can refer to this doc for more detail: https://openpai.readthedocs.io/en/latest/manual/cluster-admin/how-to-set-up-storage.html

@Binyang2014 Thanks for the clarification. Is there another way to get a hold of the training results and models besides the NFS solution? Sure, you could use sleep (same as for tensorboard) and use ssh to connect to the container, but this is not usable in our restricted environment. Any other way? As a federal organization, Azure or any other cloud service is not an option due to IP, privacy, and other concerns.

If you want to restrict the IPs that can access the NFS server, you can refer to this doc: https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/5/html/deployment_guide/s1-nfs-server-config-exports. Then change the NFS exports file in PAI to set the restriction:

/share/pai *(rw,sync,fsid=0,all_squash,no_subtree_check,anonuid=1000,anongid=1000)
/share/pai/data *(rw,sync,all_squash,no_subtree_check,anonuid=1000,anongid=1000)
/share/pai/users *(rw,sync,all_squash,no_subtree_check,anonuid=1000,anongid=1000)
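As a sketch of the restriction, the same exports can be limited to the cluster's address range instead of the `*` wildcard. The 192.168.1.0/24 subnet below is an assumption; substitute your actual node IPs or range:

```shell
# /etc/exports - same options as above, but scoped to an assumed
# cluster subnet (192.168.1.0/24) instead of the wildcard '*'.
/share/pai       192.168.1.0/24(rw,sync,fsid=0,all_squash,no_subtree_check,anonuid=1000,anongid=1000)
/share/pai/data  192.168.1.0/24(rw,sync,all_squash,no_subtree_check,anonuid=1000,anongid=1000)
/share/pai/users 192.168.1.0/24(rw,sync,all_squash,no_subtree_check,anonuid=1000,anongid=1000)

# Apply the new exports without restarting the NFS server:
sudo exportfs -ra
```

Individual node IPs or multiple subnets can also be listed, each with its own option block, if the cluster does not sit in one contiguous range.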

In our practice, if you cannot use a cloud service, NFS is a good choice. Jobs can use NFS directly via PV/PVC. If customers want to download/upload data, we let them use a Samba client (in our case the customers use Windows machines, and the NFS server uses an internal IP that cannot be accessed from outside the cluster).

Our storage-manager is a simple implementation. It has a single-point-of-failure issue (we don't enable HA for the NFS server) and is missing many security features. You can use NFS in your environment, but I recommend you deploy the NFS server on your own; storage-manager is not good for production use.

Thanks for the response, I have done it similarly. I disabled the storage-manager again and configured another NFS storage (+share: false), where I restricted access to the node IPs only. I also added an NFS user/group that can write to the folders (anonuid/anongid). The Samba server's users belong to the same group, which makes result sharing between pods and Samba users possible. Each researcher now has their own directory which they can fill with data. Thanks for the support.
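The shared-group arrangement described above could be sketched roughly like this. All names, UIDs/GIDs, and the subnet are assumptions for illustration, not the exact setup:

```shell
# Assumed shared group; the NFS anonuid/anongid options map all
# squashed job writes to this identity.
sudo groupadd -g 1000 paidata
sudo useradd -u 1000 -g paidata -M -s /usr/sbin/nologin nfsanon

# One Samba account per researcher, placed in the same group so they
# can read and write what the jobs produce ("alice" is a placeholder).
sudo useradd -g paidata -M -s /usr/sbin/nologin alice
sudo smbpasswd -a alice

# Per-researcher directory, group-writable with setgid so new files
# inherit the shared group.
sudo mkdir -p /share/pai/users/alice
sudo chown alice:paidata /share/pai/users/alice
sudo chmod 2770 /share/pai/users/alice

# Matching /etc/exports entry, restricted to an assumed cluster subnet:
# /share/pai/users 192.168.1.0/24(rw,sync,all_squash,no_subtree_check,anonuid=1000,anongid=1000)
```

With this layout, files written by jobs via NFS and files uploaded by researchers via Samba end up owned by the same group, so neither side hits permission errors on the other's output.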