shadow / shadow-plugin-tor

A Shadow plug-in that runs the Tor anonymity software

Home Page: https://shadow.github.io


parsealexa.py and CDNs

frochet opened this issue

Hey,

I was redoing some experiments with Shadow recently, and the parsealexa.py step made me realize that I was probably receiving IP addresses close to my location (Belgium) for services hosted on CDNs. I suspect most top sites do this. Some questions arise (see the quick check sketched after the list):

  • How different is my parsealexa.py output from what real Tor gets when resolving hostnames from exit relay locations?
  • How does this impact performance results? Intuitively, Shadow should experience higher latency than vanilla Tor (e.g., google.com resolved from BE would impact circuits using US exits).
  • Do we care?
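
For context, here is a minimal sketch (Python 3, standard library only; the hostnames are placeholders, not the actual Alexa list) of the kind of local lookup that appears to produce this bias. For CDN-hosted names, repeating it from another country, or against a different recursive resolver, will typically return different edge IPs.

```python
#!/usr/bin/env python3
# Minimal sketch: resolve a few placeholder hostnames with the local resolver.
# For CDN-hosted sites the answers usually point at an edge near the machine
# running the script, which is exactly the bias discussed above.
import socket

HOSTNAMES = ["google.com", "www.wikipedia.org"]  # stand-ins for the top-sites list

for name in HOSTNAMES:
    infos = socket.getaddrinfo(name, 443, proto=socket.IPPROTO_TCP)
    addrs = sorted({info[4][0] for info in infos})
    print(f"{name}: {', '.join(addrs)}")
```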

Hmm, yes, good point and good questions.

I think in most cases the latency from client-entry-middle-exit will be far greater than the latency from exit-server, especially since relays will add additional processing delays that probably wouldn't be experienced at the server. So I think the bottleneck here likely isn't the exit-server link.
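
As a rough back-of-the-envelope sketch of that intuition (every number below is an assumption chosen for illustration, not a measurement):

```python
# Back-of-the-envelope sketch with made-up one-way latencies in milliseconds,
# just to compare the circuit path against the exit-server hop.
client_entry = 50
entry_middle = 40
middle_exit  = 45
relay_processing = 3 * 20        # assumed extra queueing/processing per relay

circuit = client_entry + entry_middle + middle_exit + relay_processing  # 195 ms

exit_server_near = 10            # server/CDN edge close to the exit
exit_server_far  = 90            # server resolved on another continent

print(circuit + exit_server_near)   # 205 ms
print(circuit + exit_server_far)    # 285 ms: the hop is 9x worse, end-to-end ~1.4x
```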

Flow/congestion control will limit the speed at which the exit can read from the socket buffer representing the exit side of the exit-server TCP connection. My guess is that the exit doesn't really spend much, if any, time waiting for data from the server; rather, the data received from the server spends time waiting for Tor to read it from the kernel buffer and forward it to the next relay. If this is the case, then the latency from the exit to the server will have very little effect.
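
A toy model of that queueing argument, again with made-up numbers; it ignores TCP slow start and cell scheduling and just assumes the exit's forwarding rate is the bottleneck:

```python
# Toy model (all numbers are assumptions): the server fills the exit's kernel
# buffer faster than Tor drains and forwards it, so completion time is dominated
# by the drain rate, not by the exit<->server propagation delay.
transfer_bytes  = 5 * 1024 * 1024    # 5 MiB download
drain_rate      = 2 * 1024 * 1024    # assume Tor forwards ~2 MiB/s to the middle
prop_delay_near = 0.010              # 10 ms exit->server one-way delay
prop_delay_far  = 0.090              # 90 ms exit->server one-way delay

def completion(prop_delay):
    # one propagation delay to get the flow started, then drain-rate limited
    return prop_delay + transfer_bytes / drain_rate

print(completion(prop_delay_near))   # ~2.51 s
print(completion(prop_delay_far))    # ~2.59 s -> only ~3% longer
```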

We could do better by doing lookups for the Alexa top sites from all of the exit relays in Tor and using the results to form a larger set of possible server IP addresses and locations. We could also get more realistic by having the exits find the servers closest to them, via DNS or otherwise. But I'm not sure these changes would be worth the added complexity unless we have evidence that our current approach does in fact cause performance artifacts.
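
For reference, a sketch of what per-exit lookups could look like, assuming a locally running Tor client with its SOCKS port on 127.0.0.1:9050 and the tor-resolve tool that ships with Tor. Pinning specific exits (e.g., via an ExitNodes line in torrc or a controller library) is left out here, and the hostnames are placeholders:

```python
#!/usr/bin/env python3
# Sketch: resolve hostnames through the Tor client so the answers reflect the
# exit's vantage point rather than the local machine's. Assumes tor is running
# with SOCKSPort 9050 and tor-resolve is installed.
import subprocess

HOSTNAMES = ["google.com", "www.wikipedia.org"]  # stand-ins for the top-sites list

def resolve_via_tor(hostname, socks="127.0.0.1:9050"):
    # tor-resolve issues a SOCKS RESOLVE request through the local Tor client
    out = subprocess.run(["tor-resolve", hostname, socks],
                         capture_output=True, text=True, check=True)
    return out.stdout.strip()

if __name__ == "__main__":
    for name in HOSTNAMES:
        print(name, resolve_via_tor(name))
```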

Great, thanks for the insights! Right, as you said, this might not be worth the added complexity.

Maybe a simpler solution would be to make this problem the same for everyone: force a single resolution location by using Tor to resolve the hostnames, so that people experimenting with Shadow from some edge of the Internet would not get a bias in their results (if it turns out that the current approach does cause performance artifacts).

It does seem reasonable for us to just generate a large list of servers once, and then allow people to download that list so that they never have to run the parsealexa script. That way everyone selects their server locations from the same list.
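
One possible shape for such a downloadable list, purely as an illustration (the file name, columns, and addresses below are hypothetical, not an existing Shadow artifact):

```python
# Hypothetical servers.csv with one line per server:
#   hostname,ip,country_code
#   example.com,203.0.113.7,US
#   example.be,198.51.100.12,BE
import csv

def load_servers(path="servers.csv"):
    # Returns a list of dicts a config generator could sample server locations from.
    with open(path, newline="") as f:
        return list(csv.DictReader(f, fieldnames=["hostname", "ip", "country_code"]))
```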

In any case, as long as researchers post their Shadow configs, we should at the very least be able to reproduce their results. But it would be nice if we can help them get this part of their setup correct.

Looks like a good idea. We would probably need to update that list quite frequently though?

It would be better if we could avoid maintaining such a list.

In general, we should provide a reasonable default location assignment for servers, and then anyone who needs more accuracy can choose how much accuracy they need and adjust the server assignment accordingly.

Along these lines, an even simpler idea is to just assign the servers to the various cities in the world uniformly at random. We could use a city map such as the one I created for my CCS 2018 paper; it has a node for every city in the world that contains a RIPE Atlas probe. We would just assign servers to those cities at random, and every user would get the same server location distribution.
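
A minimal sketch of that uniform assignment, with placeholder city names standing in for a RIPE-Atlas-derived city map; the fixed seed is what makes every user end up with the same assignment:

```python
import random

# Placeholder inputs: `cities` would come from the city map (one entry per city
# containing a RIPE Atlas probe), `servers` from the top-sites list.
cities  = ["Brussels", "Ashburn", "Frankfurt", "Tokyo", "Sao Paulo"]
servers = [f"server{i}" for i in range(100)]

rng = random.Random(42)  # fixed seed so the assignment is the same for everyone
assignment = {server: rng.choice(cities) for server in servers}
```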

We lose some accuracy since servers probably are not uniformly distributed around the world in practice. If you could come up with a rough distribution of the number of servers per country (it seems like some such data must exist), we could use that as weights to the server location assignment. Such a distribution probably changes much less frequently.
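
And the weighted variant of the same idea; the per-country counts here are placeholders standing in for whatever "servers per country" dataset gets chosen:

```python
import random

# Placeholder weights: number of servers per country from some external dataset.
country_weights = {"US": 45, "DE": 10, "NL": 8, "FR": 6, "JP": 5, "other": 26}
servers = [f"server{i}" for i in range(100)]

rng = random.Random(42)
countries = list(country_weights)
weights = list(country_weights.values())
assignment = {server: rng.choices(countries, weights=weights, k=1)[0]
              for server in servers}
```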

> Along these lines, an even simpler idea is to just assign the servers to the various cities in the world uniformly at random.

OK, but then the Shadow servers no longer follow the logic of "locations that Internet users are most likely to visit" (the top-x sites).

> We lose some accuracy since servers probably are not uniformly distributed around the world in practice. If you could come up with a rough distribution of the number of servers per country (it seems like some such data must exist), we could use that as weights to the server location assignment. Such a distribution probably changes much less frequently.

Even with a weighted distribution, we may end up with something that does not match the actual distribution of top-x site locations. Anyway, I know of a project (https://netray.io/dns.html) that regularly scans 50% of the global domain name space. Their datasets are available and could allow us to derive such a distribution of servers per country. That said, I suspect their probing suffers from the same issue as here: they get IPs close to the probe for top sites.
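
If a per-domain dataset like that were used, deriving the weights would be a simple aggregation; the rows below are invented placeholders, not netray.io data:

```python
from collections import Counter

# Placeholder rows: (domain, country of the resolved IP) from some DNS scan.
rows = [("example.com", "US"), ("example.org", "US"), ("example.be", "BE")]

country_counts = Counter(country for _domain, country in rows)
total = sum(country_counts.values())
country_weights = {country: count / total for country, count in country_counts.items()}
```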

I wonder if maintaining a list would not be preferable.

Two more options:

HTTP Archive does aim to crawl the Alexa top sites. To my knowledge, neither of these provides IP addresses or country code information, however :(