Firewall

Update:

To verify the correctness of my implementation, I later turned the original implementation into a more naive version: after some preprocessing and conversion, the rules are simply stored in a list, and each query compares the current IP and port against every rule entry until one matches. Surprisingly, the naive version runs even faster than my previous TreeMap implementation (see the comparison in the Test Results section).

For the FirewallNaive implementation, the time complexity of a query is O(N), where N is the number of rule entries.
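
The linear scan can be sketched as follows. This is a minimal illustration, assuming each rule has already been converted to numeric IP and port bounds; the `NaiveMatcher` and `Rule` names are hypothetical and this is not the repository's actual FirewallNaive code.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration of the O(N) linear scan; not the actual FirewallNaive class.
class NaiveMatcher {
    // One preprocessed rule. A single IP or port is stored as a range with equal bounds.
    static final class Rule {
        final String direction;
        final String protocol;
        final long ipLow, ipHigh;     // inclusive IP bounds as numbers
        final int portLow, portHigh;  // inclusive port bounds

        Rule(String direction, String protocol,
             long ipLow, long ipHigh, int portLow, int portHigh) {
            this.direction = direction;
            this.protocol = protocol;
            this.ipLow = ipLow;
            this.ipHigh = ipHigh;
            this.portLow = portLow;
            this.portHigh = portHigh;
        }

        boolean matches(String dir, String proto, int port, long ip) {
            return direction.equals(dir) && protocol.equals(proto)
                    && portLow <= port && port <= portHigh
                    && ipLow <= ip && ip <= ipHigh;
        }
    }

    private final List<Rule> rules = new ArrayList<>();

    void addRule(Rule rule) {
        rules.add(rule);
    }

    // Checks every rule in turn, so a query costs up to N comparisons.
    boolean accept(String dir, String proto, int port, long ip) {
        for (Rule r : rules) {
            if (r.matches(dir, proto, port, ip)) {
                return true;
            }
        }
        return false;
    }
}
```

Each comparison is a handful of primitive checks on data stored contiguously in a list, which may be one reason the constant factor is small in practice.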

For the Firewall implementation, the worst-case time complexity of a query is O(N log N) (in the worst case, headMap() returns nearly all N ranges, which are then sorted), but I was expecting a better average running time. The test results indicate that this is not the case, and that the associated constant factor could be rather large.

Conclusion: for this case, the FirewallNaive implementation might be even better.

Usage:

To test the code, create a new instance of Firewall with:

Firewall fw = new Firewall("some_file.csv");

The following call returns a boolean value:

fw.acceptPacket("inbound", "udp", 53, "192.168.2.1");

Alternatively, you can modify Test.java to run the tests:

In Test.java, you can change the input file, and it will print the time taken to load the rules (in my tests, loading 1M rules takes about 2.5 s, and each query against those 1M rules then takes about 30-40 ms).
More rule files can be generated with rule_data_generation.py by modifying the following parameters in the Python script:
- num_of_entries: number of rule entries to be generated
- p_port_range: the probability that the port of an entry is a range
- p_ip_range: the probability that the IP of an entry is a range
- output_file: name of the output file
More query files can be generated with query_data_generation.py by modifying the following parameters in the Python script:
- num_of_queries: number of queries to be generated
- output_file: name of the output file

Tests:

test0.csv: 4 rule entries provided in the problem description.
test1.csv: 1 million rule entries generated by rule_data_generation.py
test2.csv: 4 entries that would allow any packet to go through
test3.csv: 10 million rule entries generated by rule_data_generation.py (file too large to be included on GitHub)
query1.csv: 200 queries generated by query_data_generation.py
query2.csv: 20 queries generated by query_data_generation.py

Test Results:

**With test0.csv and query1.csv:**

Time spend to load rules for Firewall: 4ms

Time spend to load rules for FirewallNaive: 1ms

Correctness test (comparing with the result from naive implementation):

Correctness test completed successfully!

Efficiency test (comparing with the naive implementation):

For FireWall, the time spend to make 200 queries: 2

For FirewallNaive, the time spend to make 200 queries: 1

**With test1.csv and query1.csv:**

Time spend to load rules for Firewall: 2479ms

Time spend to load rules for FirewallNaive: 819ms

Correctness test (comparing with the result from naive implementation):

Correctness test completed successfully!

Efficiency test (comparing with the naive implementation):

For FireWall, the time spend to make 200 queries: 7870

For FirewallNaive, the time spend to make 200 queries: 23

**With test2.csv and query1.csv:**

Time spend to load rules for Firewall: 7ms

Time spend to load rules for FirewallNaive: 1ms

Correctness test (comparing with the result from naive implementation):

Correctness test completed successfully!

Efficiency test (comparing with the naive implementation):

For FireWall, the time spend to make 200 queries: 2

For FirewallNaive, the time spend to make 200 queries: 1

**With test3.csv and query2.csv:**

Time spend to load rules for Firewall: 48544ms

Time spend to load rules for FirewallNaive: 12704ms

Correctness test (comparing with the result from naive implementation):

Correctness test completed successfully!

Efficiency test (comparing with the naive implementation):

For FireWall, the time spend to make 20 queries: 27388

For FirewallNaive, the time spend to make 20 queries: 3

Design Rationale:

Structure and Classes:

I use nested HashMaps to store the rules:

- the first layer is keyed by direction (“inbound” and “outbound”);
- the second layer is keyed by protocol (“tcp” and “udp”);
- the third layer is a TreeMap whose keys are IP ranges (IPRange) and whose values are lists of port ranges.
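
The three layers can be sketched as below. This is an illustration under assumptions, not the repository's code: `RuleStoreSketch`, its `lookup` method, and the small `IPRange` stand-in (ordered by lower bound, then upper bound) are hypothetical.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch of the nested structure: direction -> protocol -> TreeMap.
class RuleStoreSketch {
    // Minimal stand-in for the README's IPRange class (inclusive bounds as longs).
    static final class IPRange implements Comparable<IPRange> {
        final long low, high;

        IPRange(long low, long high) {
            this.low = low;
            this.high = high;
        }

        @Override
        public int compareTo(IPRange other) {
            int byLow = Long.compare(low, other.low);
            return byLow != 0 ? byLow : Long.compare(high, other.high);
        }
    }

    // Each port range is an int[] {low, high}.
    final Map<String, Map<String, TreeMap<IPRange, List<int[]>>>> rules = new HashMap<>();

    RuleStoreSketch() {
        // Pre-create all four direction/protocol combinations so lookups never return null.
        for (String direction : new String[] {"inbound", "outbound"}) {
            Map<String, TreeMap<IPRange, List<int[]>>> byProtocol = new HashMap<>();
            for (String protocol : new String[] {"tcp", "udp"}) {
                byProtocol.put(protocol, new TreeMap<>());
            }
            rules.put(direction, byProtocol);
        }
    }

    TreeMap<IPRange, List<int[]>> lookup(String direction, String protocol) {
        return rules.get(direction).get(protocol);
    }
}
```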

I implemented a RuleEntry class with 4 fields. For every rule entry, the IP address(es) and port(s) are converted to ranges:

String direction;
String protocol;
IPRange ipRange;
int[] portRange;

I also implemented an IPRange class, which stores the lower and upper bounds of the allowed IP range as long values. It implements compareTo and overrides hashCode, toString, and equals so that it can be used conveniently in different data structures.
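
A class of this kind might look like the sketch below. The class name `IPRangeSketch`, the `toLong`, `parse`, and `contains` helpers, and the assumed "low-high" text format for ranges are all hypothetical; the actual IPRange class may differ.

```java
import java.util.Objects;

// Hypothetical sketch of an IPRange-like class; names and input format are assumptions.
final class IPRangeSketch implements Comparable<IPRangeSketch> {
    final long low;   // inclusive lower bound
    final long high;  // inclusive upper bound

    IPRangeSketch(long low, long high) {
        if (low > high) {
            throw new IllegalArgumentException("low must not exceed high");
        }
        this.low = low;
        this.high = high;
    }

    // Converts dotted-quad IPv4 text to a number; a long avoids signed-int overflow.
    static long toLong(String ip) {
        String[] parts = ip.split("\\.");
        if (parts.length != 4) {
            throw new IllegalArgumentException("not an IPv4 address: " + ip);
        }
        long value = 0;
        for (String part : parts) {
            value = (value << 8) | Integer.parseInt(part);
        }
        return value;
    }

    // Parses a single address or an assumed "low-high" range into inclusive bounds.
    static IPRangeSketch parse(String text) {
        int dash = text.indexOf('-');
        if (dash < 0) {
            long ip = toLong(text);
            return new IPRangeSketch(ip, ip);
        }
        return new IPRangeSketch(toLong(text.substring(0, dash)),
                                 toLong(text.substring(dash + 1)));
    }

    boolean contains(long ip) {
        return low <= ip && ip <= high;
    }

    // Orders by lower bound first, then upper bound, so the class can key a TreeMap.
    @Override
    public int compareTo(IPRangeSketch other) {
        int byLow = Long.compare(low, other.low);
        return byLow != 0 ? byLow : Long.compare(high, other.high);
    }

    @Override
    public boolean equals(Object o) {
        if (!(o instanceof IPRangeSketch)) {
            return false;
        }
        IPRangeSketch other = (IPRangeSketch) o;
        return low == other.low && high == other.high;
    }

    @Override
    public int hashCode() {
        return Objects.hash(low, high);
    }

    @Override
    public String toString() {
        return "[" + low + ", " + high + "]";
    }
}
```

Keeping compareTo consistent with equals matters here, because TreeMap decides key identity through compareTo while HashMap uses equals and hashCode.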

Add Rules and Query:

When adding a rule to the firewall, I first get the proper TreeMap according to the direction and protocol of the rule. Then I retrieve the port ranges (a list of arrays) for the IPRange generated from the rule (this can be null if the IPRange has not been added before). Lastly, I merge the rule's port range into the existing ranges (with the int[][] addPortRange(int[][] ranges, int[] range) function and its helpers void merge(List<int[]> res, int[] range) and int lowerBound(int[] target, int[][] ranges)), and store the updated ranges as the value for the IPRange key.
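
The port-range merging step can be illustrated with a simplified sketch. It is not the repository's addPortRange: the `insertAndMerge` name, the List-based signature, and the single linear pass (instead of a lowerBound search) are assumptions made to keep the example short. It assumes the existing ranges are sorted and non-overlapping.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical, simplified merge of one inclusive port range into a sorted list.
class PortRangeMergeSketch {
    // ranges: sorted, non-overlapping {low, high} pairs. Returns a new merged list.
    static List<int[]> insertAndMerge(List<int[]> ranges, int[] added) {
        List<int[]> result = new ArrayList<>();
        int low = added[0];
        int high = added[1];
        boolean placed = false;
        for (int[] r : ranges) {
            if (r[1] < low - 1) {
                // Entirely before the new range, with a gap: keep unchanged.
                result.add(r);
            } else if (r[0] > high + 1) {
                // Entirely after the new range: emit the merged range once, then keep r.
                if (!placed) {
                    result.add(new int[] {low, high});
                    placed = true;
                }
                result.add(r);
            } else {
                // Overlapping or adjacent: absorb r into the range being built.
                low = Math.min(low, r[0]);
                high = Math.max(high, r[1]);
            }
        }
        if (!placed) {
            result.add(new int[] {low, high});
        }
        return result;
    }
}
```

Adjacent ranges such as {1, 5} and {6, 9} are merged as well, since ports are integers and no port lies between them.
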
When making a query, I likewise first get the proper TreeMap according to the direction and protocol of the packet. Then I call headMap() to retrieve all IPRanges whose lower bound is less than or equal to the queried IP, and sort them by upper bound in descending order. Lastly, I check the qualifying candidates (lower bound <= query IP <= upper bound) to see whether any of their port ranges includes the query port (with the inRange() function).
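
A rough sketch of this query path for a single direction/protocol TreeMap is shown below. The `HeadMapQuerySketch` class, its `accept` method, the `IPRange` stand-in, and the probe-key trick are hypothetical; only headMap() and inRange() are named in the description above, and the real signatures may differ.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.TreeMap;

// Hypothetical sketch of the headMap-based lookup; not the repository's code.
class HeadMapQuerySketch {
    // Stand-in for IPRange, ordered by lower bound and then upper bound.
    static final class IPRange implements Comparable<IPRange> {
        final long low, high;

        IPRange(long low, long high) {
            this.low = low;
            this.high = high;
        }

        @Override
        public int compareTo(IPRange other) {
            int byLow = Long.compare(low, other.low);
            return byLow != 0 ? byLow : Long.compare(high, other.high);
        }
    }

    static boolean inRange(int port, List<int[]> portRanges) {
        for (int[] r : portRanges) {
            if (r[0] <= port && port <= r[1]) {
                return true;
            }
        }
        return false;
    }

    static boolean accept(TreeMap<IPRange, List<int[]>> rules, long ip, int port) {
        // The probe sorts at or after every key whose lower bound is <= ip.
        IPRange probe = new IPRange(ip, Long.MAX_VALUE);
        List<IPRange> candidates = new ArrayList<>(rules.headMap(probe, true).keySet());
        // Largest upper bound first, so the scan can stop at the first range ending before ip.
        candidates.sort(Comparator.comparingLong((IPRange r) -> r.high).reversed());
        for (IPRange candidate : candidates) {
            if (candidate.high < ip) {
                break;
            }
            if (inRange(port, rules.get(candidate))) {
                return true;
            }
        }
        return false;
    }
}
```

When most rules have a lower bound below the queried IP, headMap() returns almost the whole map and the sort dominates, which is consistent with the O(N log N) worst case and the slow query times measured with 1M and 10M rules.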

Possible Improvement:

My initial idea was to store the rules in a trie or radix tree. However, that was too complicated to implement in 2 hours (the difficult part is merging and splitting overlapping IP ranges), so I decided to store the IP ranges separately, without merging or splitting. Given more time, I would look further into this to improve lookup efficiency.

Team Interests:

I am most interested in the platform team since I am very interested in building robust and scalable systems.

I am also interested in the data team, especially the data processing and analysis part.

joshua-ne / Firewall