piskvorky / smart_open

Utils for streaming large files (S3, HDFS, gzip, bz2...)

No way to have SFTP connections `load_host_keys()` via `transport_params`

Kache opened this issue

Problem description

System host keys are currently loaded here: https://github.com/RaRe-Technologies/smart_open/blob/v5.2.1/smart_open/ssh.py#L91

However, there's currently no way to call load_host_keys() to verify the host against a specific known_hosts file. Support could be added via transport_params when the paramiko client is instantiated. For example, something like:

        ssh = _SSH[key] = paramiko.client.SSHClient()
        ssh.load_system_host_keys()
        if 'load_host_keys' in transport_params:
            ssh.load_host_keys(transport_params['load_host_keys'])
        ssh.set_missing_host_key_policy(paramiko.AutoAddPolicy())

Steps/code to reproduce the problem

With smart_open 5.2.1, it is not possible to verify an SFTP host key against a local known_hosts file without touching smart_open internals.

Versions

Darwin-20.6.0-x86_64-i386-64bit
Python 3.7.6 (default, Nov 24 2021, 00:59:23)
[Clang 13.0.0 (clang-1300.0.29.3)]
smart_open 5.2.1

Checklist

Before you create the issue, please make sure you have:

  • Described the problem clearly
  • Provided a minimal reproducible example, including any required data
  • Provided the version numbers of the relevant software

If you're open to accepting a PR, I'd be willing to create a PR based on the example above.

if 'load_host_keys' in transport_params:
    ssh.load_host_keys(transport_params['load_host_keys'])

Can't the user do this prior to the smart_open.open call?

Yes, but notice the first line in my example: that solution involves reaching into smart_open's private cache, _SSH, to grab the paramiko client instance, which is not good software engineering practice.

smart_open does not normally (and rightfully, IMO) expose the underlying paramiko client to the user.

I wonder if there's a better way to do this without having smart_open know all these paramiko details. I don't want to handle more transport parameters than absolutely necessary.

How about:

import paramiko
import smart_open

def ssh_client_init():  # user's code
    client = paramiko.client.SSHClient()
    # additional ssh config goes here
    return client

# 'ssh_client_init' is the key proposed here; smart_open does not handle it yet
transport_params = {'ssh_client_init': ssh_client_init}
with smart_open.open(url, 'rb', transport_params=transport_params) as fin:
    ...
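
To make the original request concrete, here is a rough sketch of what the user-side factory might look like when the goal is to verify hosts against a specific known_hosts file. The helper name make_ssh_client_init and the 'ssh_client_init' key are hypothetical (the key is only the proposal from this thread). The paramiko calls themselves (load_system_host_keys, load_host_keys, set_missing_host_key_policy, RejectPolicy) are existing paramiko API.

```python
import os


def make_ssh_client_init(known_hosts_path):
    """Return a zero-argument factory for the proposed 'ssh_client_init' key.

    Hypothetical helper: neither this function nor the transport_params key
    is part of smart_open.
    """
    known_hosts_path = os.path.expanduser(known_hosts_path)

    def ssh_client_init():
        # Lazy import: paramiko is only needed once a client is actually built.
        import paramiko

        client = paramiko.client.SSHClient()
        client.load_system_host_keys()
        # Also trust keys from the caller's own known_hosts file.
        client.load_host_keys(known_hosts_path)
        # Refuse hosts whose key is in neither file, instead of auto-adding them.
        client.set_missing_host_key_policy(paramiko.RejectPolicy())
        return client

    return ssh_client_init


# 'ssh_client_init' is the key proposed in this thread, not an existing option.
transport_params = {'ssh_client_init': make_ssh_client_init('~/.ssh/known_hosts')}
```

Returning a closure keeps the factory argument-free, so whatever calls it does not need to know which known_hosts file is being used.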

I think it's better to pass a callable instead of the client itself because we can use the callable to create a new client whenever we get disconnected.

If there is no callable passed, then we can use the default client settings, e.g. what is currently being done.
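
On the smart_open side, the fallback described above could look roughly like the following. This is only a sketch under assumptions: the function names new_ssh_client and _default_client_init are made up for illustration, and the default branch simply mirrors the three client-setup lines quoted in the issue description.

```python
def _default_client_init():
    """Build a client with the default settings quoted earlier in this issue."""
    # Lazy import so this module can be loaded without paramiko installed.
    import paramiko

    client = paramiko.client.SSHClient()
    client.load_system_host_keys()
    client.set_missing_host_key_policy(paramiko.AutoAddPolicy())
    return client


def new_ssh_client(transport_params=None):
    """Return a fresh SSH client, preferring a user-supplied factory.

    Hypothetical helper: 'ssh_client_init' is the key proposed in this thread.
    """
    params = transport_params or {}
    client_init = params.get('ssh_client_init', _default_client_init)
    if not callable(client_init):
        raise TypeError("transport_params['ssh_client_init'] must be callable")
    # Calling the factory each time yields a new client, e.g. after a disconnect.
    return client_init()
```

Because the factory is invoked on every call, reconnecting after a dropped connection just means calling new_ssh_client again with the same transport_params.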

Yet another way is to expose the underlying client. I'm not opposed to that idea, either. Hiding implementation details is a good thing in general, but here it's getting in the way of the user achieving what they want, so it isn't something we have to strictly stick to.