shadow / tornettools

A tool to generate realistic private Tor network models, run them in Shadow, and analyze the results.

Can generate configs with multiple hosts with the same IP address

sporksmith opened this issue

We can generate multiple relays with the same network node and IP address. Example:

  relay124exit:
    network_node_id: 914
    ip_addr: 194.5.249.130
    ...
  relay125exit:
    network_node_id: 914
    ip_addr: 194.5.249.130

@stevenengler says:

Yep that looks like a bug in tornettools. I think it samples two different relays that have the same IP address and shadow doesn't support having two different hosts with the same IP address. Maybe when sampling relays if we already have a relay with that IP address, we should just increment the address (as an integer) by one until it's unused?

Okay, the sampling should be in __sample_relays(), and that eventually gets converted to the config format in __filter_nodes(). I think the fix will be somewhere in there.
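For illustration, a minimal sketch of that increment-until-unused idea (a hypothetical helper, not tornettools' actual sampling code) could treat the IPv4 address as an integer and bump it until it is free:

# Sketch only: assign_unique_ip is a made-up helper, not part of tornettools.
from ipaddress import IPv4Address

def assign_unique_ip(requested_ip, used_ips):
    """Return requested_ip if free, otherwise the next unused address."""
    candidate = IPv4Address(requested_ip)
    while str(candidate) in used_ips:
        # increment the address as an integer until it is unused
        candidate = IPv4Address(int(candidate) + 1)
    used_ips.add(str(candidate))
    return str(candidate)

used = set()
print(assign_unique_ip("194.5.249.130", used))  # 194.5.249.130
print(assign_unique_ip("194.5.249.130", used))  # 194.5.249.131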

Using the same staging data, I reran generate and simulate and ended up getting a similar conflict. Given that I've never seen this otherwise, it seems like there's something special about the generate output that is causing this to happen. In both cases, the collision happens on "adjacent" relays: before it was relay124 and relay125, this time relay1exitguard and relay2exitguard:

  relay1exitguard:
    network_node_id: 1601
    ip_addr: 185.220.100.254
    ...
  relay2exitguard:
    network_node_id: 1601
    ip_addr: 185.220.100.254
    ...

relayinfo_staging_2020-11-01--2020-11-30.json does have multiple relays with the two relevant IP addresses.

194.5.249.130:

...
    "24EB501464B0267EBD31215096C6DE4E051CAFC5": {
      "address": "194.5.249.130",
      "bandwidth_burst": 1073741824,
      "bandwidth_capacity": 5328890,
      "bandwidth_rate": 1073741824,
      "country_code": "RO",
      "exit_frequency": 1.0,
      "fingerprint": "24EB501464B0267EBD31215096C6DE4E051CAFC5",
      "guard_frequency": 0.0,
      "running_frequency": 0.22916666666666666,
      "weight": 5.606978975713741e-06
    },
...
    "643A984F26394D08912B7F2CD3A1A8B838773908": {
      "address": "194.5.249.130",
      "bandwidth_burst": 1073741824,
      "bandwidth_capacity": 6726191,
      "bandwidth_rate": 1073741824,
      "country_code": "RO",
      "exit_frequency": 1.0,
      "fingerprint": "643A984F26394D08912B7F2CD3A1A8B838773908",
      "guard_frequency": 0.0,
      "running_frequency": 0.2263888888888889,
      "weight": 6.819499955739053e-06
    },
...

185.220.100.254:

...
    "9971F51A3274758B5C59E1D6580ED2C13E13CBEC": {
      "address": "185.220.100.254",
      "bandwidth_burst": 1073741824,
      "bandwidth_capacity": 77995293,
      "bandwidth_rate": 1073741824,
      "country_code": "DE",
      "exit_frequency": 1.0,
      "fingerprint": "9971F51A3274758B5C59E1D6580ED2C13E13CBEC",
      "guard_frequency": 1.0,
      "running_frequency": 0.9944444444444445,
      "weight": 0.0013773949733679623
    },
...
    "E8C8667CAF3D5148E52ECF736A7B204982F78EAA": {
      "address": "185.220.100.254",
      "bandwidth_burst": 1073741824,
      "bandwidth_capacity": 66339803,
      "bandwidth_rate": 1073741824,
      "country_code": "DE",
      "exit_frequency": 1.0,
      "fingerprint": "E8C8667CAF3D5148E52ECF736A7B204982F78EAA",
      "guard_frequency": 1.0,
      "running_frequency": 0.9930555555555556,
      "weight": 0.0010593993210742808
    },
...

Confirmed that local output of tornettools stage for the same time period also has those relays. I've successfully generated and run simulations from that local output, so if there's a problem with the tornettools stage output of the failing sims above, I don't think it's just that these relays exist.

I'm able to replicate this bug consistently when generating a 10% network, both with the previously attached outputs of tornettools stage, and with a local output that works fine when generating a 0.1% network.

$ tornettools generate relayinfo_staging_2020-11-01--*.json userinfo_staging_2020-11-01--*.json networkinfo_staging.gml tmodel-ccs2018.github.io --network_scale 0.1 --prefix tornet-0.1
$ tornettools simulate tornet-0.1/
$ tail tornet-0.1-broken2/shadow.log
00:00:18.913042 [shadow] n/a [WARN] [n/a] [tsc.c:192] [Tsc_nativeCyclesPerSecond] Couldn't get CPU TSC frequency
00:00:18.913045 [shadow] n/a [WARN] [n/a] [host.c:178] [host_setup] Couldn't find TSC frequency. rdtsc emulation won't scale accurately wrt simulation time. For most applications this shouldn't matter.
00:00:18.913050 [shadow] n/a [INFO] [n/a] [host.c:230] [host_setup] Setup host id '393' name 'relay451middle' with seed 1016471420, ip 172.81.181.254, 7666 bwUpKiBps, 7666 bwDownKiBps, 131072 initSockSendBufSize, 174760 initSockRecvBufSize, 3700000 cpuFrequency, 0 cpuThreshold, 200 cpuPrecision
00:00:18.913120 [shadow] n/a [ERROR] [n/a] [network_graph.rs:746] [shadow_rs::routing::network_graph::export] IP 185.170.113.28 assigned to both nodes 1601 and 1601
00:00:18.913141 [shadow] n/a [ERROR] [n/a] [controller.c:308] [_controller_registerHostCallback] Could not register host relay452middle
00:00:18.913164 [shadow] n/a [ERROR] [n/a] [controller.c:405] [_controller_registerHosts] Could not register hosts with specific IP addresses
00:00:18.913181 [shadow] n/a [ERROR] [n/a] [controller.c:450] [controller_run] Unable to register hosts
00:00:18.913195 [shadow] n/a [WARN] [n/a] [controller.c:120] [controller_free] network graph was not properly freed
00:00:18.926041 [shadow] n/a [INFO] [n/a] [controller.c:136] [controller_free] simulation controller destroyed
00:00:18.927059 [shadow] n/a [ERROR] [n/a] [main.rs:249] [shadow_rs::core::main::export] Controller exited with code 1

I'm assuming determinism isn't the issue, but it's good to know about.

One way this could happen is if you are using the same seed in your generate step:

tornettools --seed=123 generate ...

Doing so will cause tornettools to produce results deterministically. If you did not specify a --seed, then tornettools should be selecting a seed at random.

# seed the pseudo-random generators
# if we don't have a seed, choose one and make sure we log it for reproducibility
if args.seed == None:
    args.seed = randint(0, 2**31)
stdseed(args.seed)
numpyseed(args.seed)
logging.info("Seeded standard and numpy PRNGs with seed={}".format(args.seed))

Right; I'm not supplying a seed in these examples. In all three examples, a different IP address was duplicated, and onto different relays. It always seems to be an "adjacent" pair of relays, though, which seems to hint at some bug other than "just got an unlucky seed".

It still seems to me like it's expected that we have duplicate IP addresses. The staging data contains 8904 relays with 8287 unique IP addresses. If you randomly sample 890 relays (10% network), I think the probability that you choose two relays with the same IP address should be pretty high.

Edit: More specifically:

import json

with open('relayinfo_staging_2020-11-01--2020-11-30.json', 'r') as f:
    data = json.load(f)
addresses = [data['relays'][x]['address'] for x in data['relays']]
len(addresses)       # -> 8904
len(set(addresses))  # -> 8287

The staging data contains 8904 relays, only 8287 of which have unique IP addresses.
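As a rough sanity check (my own sketch, not part of tornettools, and using a uniform sample rather than tornettools' weight-based sampling), you can empirically estimate how likely a 10% sample is to contain a duplicate IP address:

# Sketch: estimate the chance that a uniform sample of 890 of the 8904 relays
# contains at least one repeated IP address. Assumes the staging file above.
import json
import random

with open('relayinfo_staging_2020-11-01--2020-11-30.json', 'r') as f:
    data = json.load(f)

addresses = [data['relays'][x]['address'] for x in data['relays']]

trials = 1000
collisions = sum(
    1 for _ in range(trials)
    if len(set(random.sample(addresses, 890))) < 890
)
print(f"~{100 * collisions / trials:.0f}% of samples contain a duplicate IP")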

Oh wow, yeah I didn't realize it was that prevalent. I wonder if shadow just wasn't noticing dupe IP addresses before

Shadow only used the IP address as a hint before, so it would increment the IP address if it was already in use.

In both cases, the collision happens on "adjacent" relays. before relay124 and relay125, this time relay1exitguard and relay2exitguard:

My guess is that this occurs because tornettools sorts the relays by their consensus weight in get_relays() before assigning them nicknames, and two relays with the same IP address are usually running on the same server with the same torrc, so they typically have similar consensus weights.
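For illustration, a toy sketch (made-up fingerprints and one made-up weight, not tornettools' actual get_relays()) of why co-hosted relays with near-identical weights end up with adjacent nicknames after sorting:

# Sketch: sorting by consensus weight places co-hosted relays next to each
# other, so they receive adjacent nicknames.
relays = [
    {'fingerprint': '9971...CBEC', 'address': '185.220.100.254', 'weight': 0.001377},
    {'fingerprint': 'AAAA...0001', 'address': '10.0.0.1',        'weight': 0.005000},
    {'fingerprint': 'E8C8...8EAA', 'address': '185.220.100.254', 'weight': 0.001059},
]

# Sort descending by weight, then assign nicknames in that order.
relays.sort(key=lambda r: r['weight'], reverse=True)
for i, relay in enumerate(relays, start=1):
    relay['nickname'] = f'relay{i}exitguard'

for r in relays:
    print(r['nickname'], r['address'])
# relay1exitguard 10.0.0.1
# relay2exitguard 185.220.100.254
# relay3exitguard 185.220.100.254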

Makes sense.

In the short term, it sounds like the original plan to just reassign fresh IP addresses (e.g. by incrementing) makes sense for modeling the legacy behavior.

Given that multiple relays at the same IP address are so prevalent, in the longer term maybe we ought to consider allowing them and running multiple relays on the same host. We probably need some careful simulation + evaluation before making such behavior a default, though.