netaddr / netaddr

A network address manipulation library for Python

Home Page: https://netaddr.readthedocs.io/en/latest/

Terrible performance compared with GoLang

madhur opened this issue

I was using this library in Python and realized it performs terribly compared to Go.

Here is one example.
Suppose I generate a list of 1000 IPs:

import random
import socket
import struct
import csv
# Build 1000 random IPv4 addresses, one single-column row per address
ip_list = []

for x in range(1000):
    ip_list.append([socket.inet_ntoa(struct.pack('>I', random.randint(1, 0xffffffff)))])


# newline='' stops the csv module from writing blank rows on Windows
with open('./ip_list.csv', 'w', newline='') as f:
    # create the csv writer
    writer = csv.writer(f)
    writer.writerow(['client_ip'])
    # write a row to the csv file
    writer.writerows(ip_list)

Now, I want to determine which of these IPs lie in a CloudFront subnet:

import requests
from netaddr import all_matching_cidrs
import csv
from os.path import expanduser
home = expanduser("~")

r = requests.get("https://ip-ranges.amazonaws.com/ip-ranges.json")

r_json = r.json()
cloudfront_subnets = []

for subnet in r_json["prefixes"]:
    cloudfront_subnets.append(subnet["ip_prefix"])

print(len(cloudfront_subnets))

file_name = './ip_list.csv'
filtered_rows = []
with open('./filtered.csv', 'w', newline='') as csvfile:
     dict_writer = csv.DictWriter(csvfile, ["client_ip"])
     
     with open(file_name, mode='r') as csv_file:
        print("Starting reading csv")
        csv_reader = csv.DictReader(csv_file)
        print("Read csv")
        line_count = 0
        for row in csv_reader:
            subnets = all_matching_cidrs(row['client_ip'], cloudfront_subnets)
            if len(subnets) != 0:
                obj = {
                    "client_ip": row['client_ip'],
                }
                filtered_rows.append(obj)
                dict_writer.writerow(obj)

print("Num requests", len(filtered_rows))

The program takes ~43 seconds on my Ryzen 5600x

➜ time python check_subnet.py
7492
Starting reading csv
Read csv
Num requests 19
python check_subnet.py  43.38s user 0.02s system 99% cpu 43.607 total

Go does the same job in under a second, also on a single thread.

I am curious why the Python version performs so much worse.

@madhur - A profiler is your best friend here. Each call to all_matching_cidrs converts every entry in the subnet list to an IPNetwork and then sorts the resulting list.

It's not clear why the library calls sorted on this list, but it means your code inadvertently re-parses and re-sorts all 7492 subnets on every iteration of the loop.
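If you want to see this kind of hotspot for yourself, here is a minimal profiling sketch. It uses only the standard library: `matching_cidrs_naive` is a hypothetical stand-in that mimics the parse-and-sort-per-call pattern with the `ipaddress` module, not netaddr's actual implementation, and the sample CIDRs and IPs are made up.

```python
import cProfile
import io
import ipaddress
import pstats


def matching_cidrs_naive(ip, cidrs):
    """Hypothetical stand-in: parse and sort every CIDR on each call."""
    addr = ipaddress.ip_address(ip)
    networks = sorted(ipaddress.ip_network(c) for c in cidrs)
    return [net for net in networks if addr in net]


# Made-up sample data, not the real AWS ranges.
cidrs = [f"10.{i}.0.0/16" for i in range(200)]
ips = [f"10.{i}.1.1" for i in range(50)]

profiler = cProfile.Profile()
profiler.enable()
hits = [ip for ip in ips if matching_cidrs_naive(ip, cidrs)]
profiler.disable()

# Show the ten most expensive calls by cumulative time.
report = io.StringIO()
pstats.Stats(profiler, stream=report).sort_stats("cumulative").print_stats(10)
print(report.getvalue())
```

In the report, the parsing and sorting calls inside the helper should account for most of the cumulative time, which is the same shape of cost you would expect from calling all_matching_cidrs in a loop. For the real script, `python -m cProfile -s cumulative check_subnet.py` gives the same kind of table without changing the code.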

Given that you don't need the actual matches and are only checking membership, your code can be fixed like this:

import requests
from netaddr import IPSet
import csv
from os.path import expanduser

home = expanduser("~")

r = requests.get("https://ip-ranges.amazonaws.com/ip-ranges.json")

r_json = r.json()
cloudfront_subnets = []

for subnet in r_json["prefixes"]:
    cloudfront_subnets.append(subnet["ip_prefix"])

print(len(cloudfront_subnets))

file_name = './ip_list.csv'
filtered_rows = []

# Build the set once, outside the loop, so the subnets are parsed a single time
cloudfront_set = IPSet(cloudfront_subnets)

with open('./filtered.csv', 'w', newline='') as csvfile:
    dict_writer = csv.DictWriter(csvfile, ["client_ip"])

    with open(file_name, mode='r') as csv_file:
        print("Starting reading csv")
        csv_reader = csv.DictReader(csv_file)
        print("Read csv")
        line_count = 0
        for row in csv_reader:
            if row['client_ip'] in cloudfront_set:
                obj = {
                    "client_ip": row['client_ip'],
                }
                filtered_rows.append(obj)
                dict_writer.writerow(obj)

print("Num requests", len(filtered_rows))

This does the same job in under a second.
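The speedup comes from parsing the subnets once instead of once per IP. To illustrate that idea without netaddr, here is a rough standard-library sketch. It is not how IPSet works internally; `build_index` and `contains` are hypothetical helper names, and the sketch assumes IPv4-only input.

```python
import bisect
import ipaddress


def build_index(cidrs):
    """Parse IPv4 CIDRs once into sorted, non-overlapping integer ranges."""
    nets = ipaddress.collapse_addresses(ipaddress.ip_network(c) for c in cidrs)
    ranges = sorted(
        (int(n.network_address), int(n.broadcast_address)) for n in nets
    )
    starts = [start for start, _ in ranges]
    ends = [end for _, end in ranges]
    return starts, ends


def contains(index, ip):
    """Binary-search the range whose start is the closest one at or below ip."""
    starts, ends = index
    value = int(ipaddress.ip_address(ip))
    pos = bisect.bisect_right(starts, value) - 1
    return pos >= 0 and value <= ends[pos]


# Made-up sample subnets; the overlapping /16 is merged into the /8.
index = build_index(["10.0.0.0/8", "192.168.0.0/16", "10.1.0.0/16"])
for candidate in ["10.2.3.4", "11.0.0.0", "192.168.255.255"]:
    print(candidate, contains(index, candidate))
```

Each lookup is then a single binary search over the pre-built ranges, so the per-IP cost no longer grows with re-parsing the subnet list.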