HappyStoic / nginx


Recruitment tasks for CDN77

1. Nginx

1. Cache

By default, Nginx's cache uses the MD5 hash of $scheme$proxy_host$uri$is_args$args. This can be changed by setting the proxy_cache_key directive in the config file. The final hash value is used as the filename of the response record in the filesystem. Cache keys and metadata (such as usage timers) for each zone are also stored in a shared memory zone, so a HIT or MISS can be determined quickly without going to disk, because nginx uses multiple worker processes to make full use of multi-core, multi-CPU and hyper-threaded systems.
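Just to illustrate how the default key maps to a file name, here is a rough Go sketch (the variable values below are made up; only the "MD5 of the key, printed as hex, becomes the file name" part mirrors what nginx does):

```go
package main

import (
	"crypto/md5"
	"fmt"
)

func main() {
	// Hypothetical values of the nginx variables that make up the
	// default cache key $scheme$proxy_host$uri$is_args$args.
	scheme, proxyHost, uri, isArgs, args := "http", "upstream", "/index.html", "?", "a=1"

	key := scheme + proxyHost + uri + isArgs + args
	sum := md5.Sum([]byte(key))

	// The hex digest of the key is what nginx uses as the cache file name.
	fmt.Printf("key:  %s\nfile: %x\n", key, sum)
}
```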

From the source code's point of view, the keys are stored in the struct ngx_http_cache_s.

My approach started with reading the documentation, and that was mostly enough to answer these questions. Then I browsed nginx's source code for a while to explore the cache's data structures.

2. Add X-Cache-Key header

Look at the commit which implements the X-Cache-Key header. My approach was first to download the stable version of the nginx source code, compile it locally and set it up as a reverse proxy with a configured cache and a fake local upstream. After everything was prepared, I could implement the changes and debug them while running the code.

The key is stored in an array of bytes. Converting the bytes into a hex string can be done in many ways. Mine is definitely not the most readable one, but in my opinion it is much faster than using standard functions such as sprintf. Another way would be to use the hex value from cache.file.name.data, which stores the whole path of the cache record in the filesystem.
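The actual change lives in C inside nginx, but the nibble-lookup idea can be sketched in Go roughly like this (function and variable names are mine, not nginx's):

```go
package main

import "fmt"

const hexDigits = "0123456789abcdef"

// toHex converts raw key bytes to their hex representation by looking up
// each half-byte (nibble) directly, avoiding a per-byte formatting call.
func toHex(key []byte) []byte {
	out := make([]byte, len(key)*2)
	for i, b := range key {
		out[i*2] = hexDigits[b>>4]
		out[i*2+1] = hexDigits[b&0x0f]
	}
	return out
}

func main() {
	key := []byte{0xde, 0xad, 0xbe, 0xef}
	fmt.Println(string(toHex(key))) // deadbeef
}
```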

3. Data structure for DNS wildcards with O(1) lookup

My approach would be (and later, reading nginx's code, I found out that nginx does something very similar) to have 2 structures: one for DNS records with a wildcard at the beginning (*.fsa.dsa.com) and a second for wildcards at the top level (foo.bar.*). Both structures are just nested hashmaps where the keys are individual subdomain labels (baz.foo.bar.* goes into 3 nested hashmaps whose keys are "baz", "foo" and "bar", one by one). Then you just add a bit/flag at each level if a wildcard symbol occurs there. Hashmaps have constant lookup complexity, so even though we have 2 structures and nested hashmaps, the time complexity is still constant: the worst case depends only on the largest number of labels in a single record.
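A minimal Go sketch of the head-wildcard structure (the top-level-wildcard one would be symmetric, consuming labels left to right; all names here are illustrative and not taken from nginx):

```go
package main

import (
	"fmt"
	"strings"
)

// node is one level of the nested hashmap.
type node struct {
	children map[string]*node
	wildcard bool // a record with "*" covers everything below this level
	exact    bool // an exact record ends at this level
}

func newNode() *node { return &node{children: map[string]*node{}} }

// insert adds a record such as "*.fsa.dsa.com"; labels are consumed
// from the top-level domain downwards ("com", "dsa", "fsa", ...).
func (n *node) insert(name string) {
	labels := strings.Split(name, ".")
	cur := n
	for i := len(labels) - 1; i >= 0; i-- {
		if labels[i] == "*" {
			cur.wildcard = true
			return
		}
		next, ok := cur.children[labels[i]]
		if !ok {
			next = newNode()
			cur.children[labels[i]] = next
		}
		cur = next
	}
	cur.exact = true
}

// match walks the same path for a concrete name; the number of map lookups
// is bounded by the number of labels, so it is O(1) with respect to the
// number of stored records.
func (n *node) match(name string) bool {
	labels := strings.Split(name, ".")
	cur := n
	for i := len(labels) - 1; i >= 0; i-- {
		if cur.wildcard {
			return true
		}
		next, ok := cur.children[labels[i]]
		if !ok {
			return false
		}
		cur = next
	}
	return cur.exact
}

func main() {
	root := newNode()
	root.insert("*.fsa.dsa.com")
	fmt.Println(root.match("x.fsa.dsa.com")) // true
	fmt.Println(root.match("dsa.com"))       // false
}
```

The second structure for foo.bar.* would be a separate root of the same shape, only iterating the labels from left to right instead.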

2. CDN/GO

I will try to sequentially explain my thoughts and assumptions after reading the assignment and the ECS (EDNS Client Subnet) specification (my thoughts were changing and evolving during the process):

In my opinion, a naive solution uses one tree child to parse one bit of the address, for example the left child for a 0 bit and the right child for a 1 bit. However, I would need a tree of depth 128 to parse a 128-bit prefix, which means the tree would have about 2^129 nodes. That does not sound feasible.

Anyway, I am allowed to go "only" to a depth of 56. But the routing data might contain prefixes of length 8 up to 128, so I cannot use one tree child to parse one bit of the ECS address, because that would be suboptimal regarding the accuracy of the response (small subnets would never be used, and we want them to be used). That means passing through one tree node should strip 1 up to n bits of the ECS address.

Maybe I could use something similar to radix trees: the left (right) child represents 1 up to n consecutive 0 (1) bits, respectively. The idea is that representing the bit sequence "000000" literally requires 6 bits, but I can instead say "6 times zero", which requires just 3 bits (110 is 6 in binary), so we save some bits. The disadvantage of this approach shows when a routing data entry contains a lot of alternating bits (~01010101), because then each bit has to be represented by its own node. But the worst case has the same performance as the naive approach, so we can only get better results this way.
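A tiny, purely illustrative sketch of that run-length idea: each tree edge consumes a run of identical bits, so a prefix becomes a list of (bit, length) pairs instead of one node per bit.

```go
package main

import "fmt"

// run is one tree edge in the radix-like scheme: `length` copies of `bit`.
type run struct {
	bit    byte
	length int
}

// toRuns splits a textual bit prefix into runs of identical bits.
func toRuns(bits string) []run {
	var runs []run
	for i := 0; i < len(bits); {
		j := i
		for j < len(bits) && bits[j] == bits[i] {
			j++
		}
		runs = append(runs, run{bit: bits[i] - '0', length: j - i})
		i = j
	}
	return runs
}

func main() {
	fmt.Println(toRuns("0000001101")) // [{0 6} {1 2} {0 1} {1 1}]
	// "000000" collapses into a single edge, while alternating bits
	// ("0101...") degrade to one edge per bit, i.e. the naive worst case.
}
```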

The hard part is how to represent this properly in memory to get a memory-optimal solution. The binary tree could be represented either with Go structs and pointers, or in a simple array.

Well, I think an array representation of the binary tree is suboptimal, because I have no guarantee that the tree would be complete (all nodes used); the assignment says there aren't that many subnets in our routing data. Even a single long leaf can make the size of this array explode, because children's indices are calculated as (i*2)+{1,2}, so we would still have to allocate tons of memory, which makes this solution suboptimal memory-wise. Also, if we do not keep the whole array allocated the whole time, inserting a new element might be very expensive, because we might need to reallocate memory and copy the whole array. I am very doubtful the DNS server would have enough disk space for this anyway :D
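To make the blow-up concrete, here is a quick check (my own toy calculation, not from the assignment) of how far a single depth-56 path pushes the required array length under the usual (i*2)+{1,2} indexing:

```go
package main

import "fmt"

func main() {
	// Implicit binary-heap layout: children of node i live at 2*i+1 and 2*i+2.
	// Follow a single all-ones path down to depth 56.
	idx := 0
	for depth := 0; depth < 56; depth++ {
		idx = 2*idx + 2 // always take the "right" child
	}
	// Even with just this one leaf populated, the backing array must be
	// at least idx+1 elements long.
	fmt.Printf("index of the lone depth-56 leaf: %d (~2^57)\n", idx)
}
```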

Also, I don't think I can use any fancy structures relying on pointers, since loading the data should be done in one mmap step without further deserialisation, and I am a bit sceptical that tree structs with pointers would load properly via mmap out of the box.

My guess is that structs and pointers would be the better solution if mmap actually allowed it. However, for a production solution I would implement both versions and run benchmarks.

Another thing is that I think we should consider the size of our available virtual memory when deciding how much information we would like to store in every node. The allowed tree depth is 56, so the maximum number of tree nodes would be about 2**57. For a 64-bit program the address space is normally huge, though exactly how huge depends on the architecture and OS. However, with a tree this big we are approaching the limit, IMHO. Maybe I am overthinking this assignment, and in reality there will never be enough routing data to fill our whole address space, but I think it is still something to consider, and at least the probability of filling the whole space should perhaps be calculated. And we still have not even talked about storing CDN addresses for routing.

Storing routing CDN IP addresses could be done in every tree node: each tree node would keep the information about where its prefix should be routed. That would mean another (NUMBER_OF_NODES * 4) bytes used, and a lot of CDN addresses would be duplicated this way. A better solution is to keep a separate array of CDN addresses to avoid duplicates; each tree node then stores only the index of the proper CDN server. I guess this solution is better only up to some number n of CDN servers, but even though CDN77 is big, I think the number of CDN servers still favours this solution. If I'm not mistaken, in the last CDN77 episode of CZPodcast it was said that CDN77 has about 50 CDN locations. I know we also have to consider some middle-layer caches, but for simplicity let's assume 1 byte is enough to index the CDN servers.
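A sketch of the deduplication idea in Go (types and names are mine): CDN IPv4 addresses live once in a side table, and tree nodes would store only a 1-byte index into it.

```go
package main

import "fmt"

// cdnTable deduplicates CDN IPv4 addresses; tree nodes store only an index.
type cdnTable struct {
	addrs   [][4]byte
	indexOf map[[4]byte]uint8
}

func newCDNTable() *cdnTable {
	return &cdnTable{indexOf: map[[4]byte]uint8{}}
}

// index returns the 1-byte index for addr, inserting it on first use.
func (t *cdnTable) index(addr [4]byte) uint8 {
	if i, ok := t.indexOf[addr]; ok {
		return i
	}
	i := uint8(len(t.addrs)) // assumes at most 256 distinct CDN servers
	t.addrs = append(t.addrs, addr)
	t.indexOf[addr] = i
	return i
}

func main() {
	t := newCDNTable()
	a := t.index([4]byte{10, 0, 0, 1})
	b := t.index([4]byte{10, 0, 0, 2})
	c := t.index([4]byte{10, 0, 0, 1}) // duplicate, same index as `a`
	fmt.Println(a, b, c)               // 0 1 0
}
```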

With all this in mind, let's propose that every tree node stores 2 bytes of information: the first byte for the index of the CDN server where this prefix should be routed, and the second byte specifying how many bits we should strip. This could be optimized further, because we don't need 8 bits for the stripping part (we will never strip anywhere near 256 bits), but I've chosen 1 byte so it's nicely rounded. This way a full tree would require (2**57)*16 = 2**61 bits just for metadata (not considering pointers to children in the case of the struct architecture). I feel bad for the DNS server's disk :D

After a bit of research I came to the conclusion that mmapping structs with pointers is not going to work out, definitely not with just a pure mmap syscall. An array it is, then.
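For completeness, a rough sketch of how a flat node array can be mapped straight from a file with the plain mmap syscall on Linux/Unix (the file name and error handling are simplified, and this is not the exact code from the repo):

```go
package main

import (
	"fmt"
	"os"
	"syscall"
)

func main() {
	// Open the serialized tree; "tree.bin" is just a placeholder name.
	f, err := os.Open("tree.bin")
	if err != nil {
		panic(err)
	}
	defer f.Close()

	info, err := f.Stat()
	if err != nil {
		panic(err)
	}

	// Map the file read-only; the resulting []byte is the node array as-is,
	// with no further deserialisation step.
	data, err := syscall.Mmap(int(f.Fd()), 0, int(info.Size()),
		syscall.PROT_READ, syscall.MAP_SHARED)
	if err != nil {
		panic(err)
	}
	defer syscall.Munmap(data)

	fmt.Printf("mapped %d bytes; node 0 metadata: % x\n", len(data), data[:2])
}
```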

My code is not fully complete, but one can see the idea. Inserting new routing data is half implemented; saving to/loading from a file is fully implemented. The tree is represented as a 1D array where each node consists of 2 bytes of metadata: the first byte says how many bits we should strip from the prefix, the second byte specifies the index of the CDN server where this prefix should be routed. The optimal size of the array is left as homework for the reader ^^ The algorithm for the DNS server to find the proper CDN server simply traverses the tree by the bits of the ECS address, remembering the lowest (most specific) CDN server found along the way.
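A simplified, illustrative sketch of that lookup over the 2-byte-per-node array (the heap-style child addressing and the noCDN sentinel are my assumptions for the example, not necessarily what the repo does):

```go
package main

import "fmt"

const noCDN = 0xff // assumed sentinel for "no CDN assigned to this node"

// lookup walks the implicit binary tree stored in tree (2 bytes per node,
// heap layout: children of node i sit at 2*i+1 and 2*i+2). Byte 0 of a node
// is the length of the bit run its edge represents, byte 1 is the CDN index.
// It returns the CDN index of the most specific node it managed to reach.
func lookup(tree []byte, bits []byte) uint8 {
	best := uint8(noCDN)
	node, pos := 0, 0
	for pos < len(bits) {
		b := bits[pos]
		child := 2*node + 1 + int(b) // left child for a 0-run, right for a 1-run
		if (child+1)*2 > len(tree) {
			break // child falls outside the array: nothing more specific exists
		}
		strip := int(tree[child*2])
		cdn := tree[child*2+1]
		if strip == 0 {
			break // unused slot
		}
		// The edge is only taken if the next `strip` bits are all equal to b.
		for i := 0; i < strip; i++ {
			if pos+i >= len(bits) || bits[pos+i] != b {
				return best
			}
		}
		if cdn != noCDN {
			best = cdn // remember the lowest (most specific) CDN found so far
		}
		pos += strip
		node = child
	}
	return best
}

func main() {
	// Hand-built toy tree: node 0 is the root, node 1 covers the run "00"
	// (CDN 3), node 2 covers the run "1" (CDN 7).
	tree := []byte{
		0, noCDN, // node 0 (root)
		2, 3, // node 1
		1, 7, // node 2
	}
	fmt.Println(lookup(tree, []byte{0, 0, 1, 0})) // 3
	fmt.Println(lookup(tree, []byte{1, 1, 0, 0})) // 7
}
```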

I hope I did not bore you to death

Time consumption

  • I think I've spent several hours on the first task; it was kinda straightforward...
  • The second task was a bigger brainteaser, though; in my opinion it took me at least a man-day, including thinking AFK.

About

License: BSD 2-Clause "Simplified" License

