nodejs / undici

An HTTP/1.1 client, written from scratch for Node.js

Home Page:https://nodejs.github.io/undici

Geek Repo:Geek Repo

Github PK Tool:Github PK Tool

Support HTTP_PROXY and NO_PROXY env var

shttrstckak opened this issue · comments

A somewhat standard env var used in proxying applications is the HTTP_PROXY and NO_PROXY variables.

Undici should support automatically setting a global dispatcher if HTTP_PROXY env var is used. Currently Undici supports using setGlobalDispatcher() method to set a specific proxy endpoint.

However there is a sister environment variable to HTTP_PROXY that supports contextually non-proxying domains: NO_PROXY

NO_PROXY requires a list of domain names to be NOT proxied and should make a request normally.

If any of the above cannot be supported, then Undici (and ultimately Node's Fetch-Undici) should allow per-request proxy settings or function hook to allow anyone to develop the same effect as HTTP_PROXY and NO_PROXY env var in their application that uses Undici/Node Fetch.

Thoughts?

I don't know if we should support HTTP_PROXY. How is widespread the use of HTTP_PROXY in the ecosystem.

The "ecosystem" as in the JS ecosystem or the larger software ecosystem?

For the latter, HTTP_PROXY is used everywhere.

For JS it's a quite heavily used I imagine for the same reasons. Popular frameworks tend to implement it.

Node.http module already supports it out of the box.

Got, Axios, Node-fetch (npm package) require a module global-agent to utilize. Global-agent does not appear to have any plans to support Undici.

Popular JS based testing framework Cypress supports it.

On the other side, browsers tend to support some sort of proxy configuration.

Overall I think if there is no interest in supporting HTTP_PROXY in Undici, then it would be better to support to being able to implement these env vars. HTTP_PROXY can already be supported easily however NO_PROXY doesn't seem currently possible.

Use of these environment variables is in very widespread, see number of upvotes on nodejs/node#8381. Many high-security environments restrict outbound internet access via a allowlist of domains configured on the proxy.

I do maintain a fetch wrapper specifically for the purpose of automatic proxy discovery from environment. The crucial part is that if such proxy discovery is performed, there must be an opt-out option because sometimes one wants to fetch without proxy, irregardless of environment variables.

Parsing of these env vars is tricky, I recommend https://github.com/Rob--W/proxy-from-env for it.

Then let's do it.

I offer myself as sacrifice and will probably get started this weekend.

Does anyone have any good starters on developing with this repository? For example, What would be a good place in code to start investigating for this feature?

The check for automatically supporting them should likely be in https://github.com/nodejs/undici/blob/main/lib/global.js.

We probably should cleanly re-implement https://github.com/Rob--W/proxy-from-env because that module includes support for npm_config_* variables which I think has no place in module that is supposed to only support the standard environment variables. See discussion in https://github.com/Rob--W/proxy-from-env/issues/13.

Also read https://about.gitlab.com/blog/2021/01/27/we-need-to-talk-no-proxy/. It gives an overview of the different implementations and the last two paragraphs give good advice on how to implement.

I plan to create a module that exactly implements such a parser.

There is my PoC:

import {URL} from 'node:url';
import {getGlobalDispatcher, setGlobalDispatcher, Dispatcher, ProxyAgent} from 'undici';

const proxyAgents = Object.fromEntries(
  ['http', 'https'].map( protocol => {
    const uri = process.env[ `${protocol}_proxy` ];
    if ( uri ){
      return [`${protocol}:`, new ProxyAgent(uri)];
    }
    return [];
  })
);

const noProxyRules = (process.env['no_proxy'] ?? '').split(',').map( rule => rule.trim() );

const defaultDispatcher = getGlobalDispatcher();

setGlobalDispatcher(new class extends Dispatcher {

  dispatch(options, handler) {
    if (options.origin ){
      const {host, protocol} = typeof options.origin === 'string' ? new URL(options.origin) : options.origin;
      if (! noProxyRules.some( rule => rule.startsWith('.') ? host.endsWith(rule) : host === rule )){
        const proxyAgent = proxyAgents[ protocol ];
        if ( proxyAgent ) {
          proxyAgent.dispatch( options, handler );
        }
      }
    }
    return defaultDispatcher.dispatch(options, handler);
  }

});

I think that this feature should be treated as plugin and loaded by node "preload" argument:

node -r undici/register-http-proxy ./script.js

Seems like a start, but it's definitely missing some features like wildcard (*) support in no_proxy.

I think that this feature should be treated as plugin and loaded by node "preload"

I disagree. Numerous environments support these variables by default, like Deno, Python, Golang and many more. It should of course be considered a breaking change, but I don't see why it should not be the default. Some kind of opt-out would be good to have, but I guess users could always overwrite the env with no_proxy=* I guess.

Maybe the implementation here could be "inspired" by Deno's 😉.

@sla100 my 2 cents on this line:

 return [`${protocol}:`, new ProxyAgent(uri)];
 ...
 const proxyAgent = proxyAgents[ protocol ];

you can also do http over https, that's why most frameworks specify the https option as optional. Meaning that if you haven't specified https_proxy it would do HTTP over the http-proxy :)

that's why I propose smth like this:

const getProxyAgent = (proto: 'http' | 'https') => {
  let agent: ProxyAgent | undefined;
  if (proto === 'http') {
    agent = process.env['https_proxy'] ? new ProxyAgent(process.env['https_proxy']) : undefined;
  }
  if (!agent) {
    agent = process.env['http_proxy'] ? new ProxyAgent(process.env['http_proxy']) : undefined;
  }
  return agent;
};

const noProxyRules = (process.env['no_proxy'] ?? '').split(',').map(rule => rule.trim());

const defaultDispatcher = getGlobalDispatcher();

setGlobalDispatcher(new class extends Dispatcher {

  dispatch(options, handler) {
    if (options.origin) {
      const {
        host,
        protocol,
      } = typeof options.origin === 'string' ? new URL(options.origin) : options.origin;
      if (!noProxyRules.some(rule => rule.startsWith('.') ? host.endsWith(rule) : host === rule)) {
        const proxyAgent = getProxyAgent(protocol);
        if (proxyAgent) {
          proxyAgent.dispatch(options, handler);
        }
      }
    }
    return defaultDispatcher.dispatch(options, handler);
  }
});

I'd be ok to support:

import { setGlobalDispatcher, EnvHttpProxyAgent } from 'undici'

setGlobalDispatcher(new EnvHttpProxyAgent())

Here is an additional data point in favor of this: Deno supports it by default.

https://deno.land/manual@v1.29.4/getting_started/setup_your_environment#environment-variables

We should check if Python, Ruby and other runtimes supports this by default.

As per nodejs/node#8381, the following environments support these variables by default:

  • Python
  • Go
  • Ruby
  • R

curl and wget also use them by default and I'm sure there are many more examples of applications supporting them.

@silverwind would you like to send a PR to make this the default here? I would just recommend we create a new Agent and set it as default.

Possibly, but first I want to create the parser for the env vars. I have a WIP repo for that, will update here once it's in a usable state. I guess we could then incorporate this parser code here, e.g. without a dependency.

Just adding another data point, bun also now supports these variables out of the box.

I'll assign this to you, ok?

Okay

Is it okay to cache ProxyAgents for each environment variable in-memory? We shouldn't instantiate new agents every request, and I was thinking of doing a LRU cache that holds the ProxyAgent instances, keyed on the proxy server URL.

This above approach has the downside that the cache will become stale if the environment variables change during process runtime, so a completely robust solution could likely not cache the agents at all as long as the env vars are mutable.

In https://github.com/silverwind/fetch-enhanced, I had exposed a method to clear the cache, but that seems like a it too much API surface.

Actually, thinking again, if we have a agent cache like

{
  "https://proxy1:3128": ProxyAgent,
  "https://proxy2:3128": ProxyAgent,
}

It should not be neccessary to clear this cache ever because even with changing environment variables, the agents will still be valid. The only issue is that when env vars change, unused ProxyAgent instances will not be cleaned by the garbage collector as there are still references to it, but I take that as acceptable.

That's acceptable.

Yeah, in a general use case, there would only be two cache entries, one for HTTP and one for HTTPS. If a user wants more dynamic behaviour, they can just deal with ProxyAgent themselves.

I'll unassign myself for now, I can't find the motivation to complete this, and I don't want to block anyone else from working in it.

commented

Have you committed your progress anywhere? I'll take a shot at this eventually and I'm kinda hoping I don't need to start from scratch if I don't need to

@KhafraDev invited you to my private repo for the env parsing. It's really not much so you could instead just fresh-start with the env parsing based on proxy-from-env. The tasks I see are:

  1. Start with proxy-from-env and remove support for the npm-related vars, ensure their tests pass.
  2. Double check https://about.gitlab.com/blog/2021/01/27/we-need-to-talk-no-proxy/ is honored, or whether there is room for improvement based on it. Would not support CIDR. Would not support any env var not mentioned in this post.
  3. Work out how to cache ProxyAgent per-origin.
  4. (Optionally) provide an opt-out flag, mostly for debug only I suppose as users could just set no_proxy=* as an effective opt-out.

Just for reference for anybody else who might be in a similar situation: I ran into this issue when using SvelteKit and was able to temporarily workaround by adding this to src/hooks.server.ts:

import type { Handle } from '@sveltejs/kit';
import { env } from '$env/dynamic/private';
import { dev } from '$app/environment';
import { setGlobalDispatcher, ProxyAgent } from 'undici';

if (dev && env.http_proxy) {
	const proxyUrl = new URL(env.http_proxy);
	const token = `Basic ${btoa(`${proxyUrl.username}:${proxyUrl.password}`)}`;

	const proxyAgent = new ProxyAgent({
		uri: proxyUrl.protocol + proxyUrl.host,
		token
	});

	setGlobalDispatcher(proxyAgent);
}

export const handle = (async ({ event, resolve }) => {
	return resolve(event);
}) satisfies Handle;

Since I use deno for production, which supports the proxy out of the box, I only needed to mess with the agent during development.

Hi there, what's the latest on this capability? Is there a branch somewhere with this work in progress?

thanks for all your work!

@aarlaud ProxyAgent can be configured for each request like this
The disadvantage is that you need to install and introduce the undici package. This package is large and slow to load, taking 100ms (first time).

const { ProxyAgent } = await import ('undici')

let proxy_agents: Record<string, ProxyAgent> = { }

let proxy = 'http://localhost:10080'

let options: RequestInit = {
    dispatcher: (() => {
        if (proxy)
            return proxy_agents[proxy] ??= new ProxyAgent({ uri: proxy })
    })(),

thanks. In my current case, I'm more interested in a global, which I suppose is more this way, like @zicklag mention above roughly (https://undici.nodejs.org/#/docs/api/ProxyAgent?id=example-basic-proxy-request-with-global-agent-dispatcher)

I was hoping to avoid dealing with HTTP_PROXY vs HTTPS_PROXY, auth, and NO_PROXY, maybe helping to finish something around supporting these, but i'm not clear whether there is something in flight at this point?

I think there's nothing in-progress already. But any contribution is welcome. You can open a fresh PR and kick-off the discussion from there 🙂

Here's another hackish workaround that will do just enough honor to the no_proxy environment variable for my latest use-case: https://gist.github.com/zicklag/1bb50db6c5138de347c224fda14286da. It's not properly parsing the variable, but it does make sure that all requests except those that have an origin with one of the no_proxy entries as a substring, will go through the proxy globally.