Count URLs and report the 100 most frequent ones. The input file is expected to be 100GB and process memory is limited to 1GB. It takes ~40 minutes using a single disk on an Ubuntu machine.
Prepare your expected answer, e.g. gen-data/my_top10_ans. Then use gen-data/gen_file_from_ans.py to generate the input file:
./gen_file_from_ans.py ans_file file_size_in_gb target_dir
The file_size_in_gb can be a float. For example, to generate a 100MB test file:
cd gen-data
./gen_file_from_ans.py my_top10_ans 0.1 .
This project can be built on macOS or Ubuntu 14.04 with a 64-bit CPU.
./build.sh
The default build mode is DEBUG. For a release build, use
./build.sh -release
We also integrated AddressSanitizer. To enable an AddressSanitizer build, use
./build.sh -asan
./topN_freq_urls file_path num_results mem_limit_in_mb
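For example, to compute the top 100 URLs with a 1GB memory limit (the input path below is just a placeholder):
./topN_freq_urls ./gen-data/input_file 100 1024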
The basic framework is external sorting, which is divided into two stages:
The first stage loads URLs into a memory buffer (limited by mem_limit), sorts them, and spills the sorted URLs to a file. For a 100GB input file and a 1GB mem_limit, we generate 100GB/1GB = 100 spilled files, each of which holds URLs in sorted order. Pre-Aggregation is also performed in this stage: the format of each row in a spilled file is (URL, count), separated by '\t'.
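A minimal sketch of this stage, assuming one URL per line in the input; SpillRun, the spill_N file names, and the per-entry memory estimate are illustrative, not the project's actual MemPool/FileManager API:

```cpp
#include <algorithm>
#include <cstddef>
#include <fstream>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

// Hypothetical helper: write one sorted run of (URL, count) rows to `path`.
static void SpillRun(const std::unordered_map<std::string, size_t>& counts,
                     const std::string& path) {
    std::vector<std::pair<std::string, size_t>> rows(counts.begin(), counts.end());
    std::sort(rows.begin(), rows.end());              // sorted by URL
    std::ofstream out(path);
    for (const auto& r : rows) out << r.first << '\t' << r.second << '\n';
}

int main(int argc, char** argv) {
    std::ifstream in(argv[1]);                        // one URL per line
    const size_t kMemLimit = 1UL << 30;               // ~1GB budget
    std::unordered_map<std::string, size_t> counts;   // Pre-Aggregation
    size_t used = 0, run = 0;
    std::string url;
    while (std::getline(in, url)) {
        if (!counts.count(url)) used += url.size() + 64;  // crude per-entry estimate
        ++counts[url];
        if (used >= kMemLimit) {                      // budget reached: spill a run
            SpillRun(counts, "spill_" + std::to_string(run++));
            counts.clear();
            used = 0;
        }
    }
    if (!counts.empty()) SpillRun(counts, "spill_" + std::to_string(run));
}
```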
The second stage merges all the spilled files and performs Final-Aggregation. A heap is used to merge the spilled files, so (URL, count) pairs with the same URL are popped consecutively and can be merged into a global count.
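A minimal sketch of the heap-based merge, assuming the spill-file row format described above; Head, ReadRow, and the command-line handling are illustrative names:

```cpp
#include <cstddef>
#include <fstream>
#include <functional>
#include <queue>
#include <string>
#include <vector>

// Head row of one spill file; the heap orders heads by URL so equal URLs
// from different files come out consecutively.
struct Head {
    std::string url;
    size_t count;
    size_t file;                                      // index of the source spill file
    bool operator>(const Head& o) const { return url > o.url; }
};

// Parse one "URL\tcount" row; returns false at end of file.
static bool ReadRow(std::ifstream& in, std::string& url, size_t& count) {
    if (!std::getline(in, url, '\t')) return false;
    std::string c;
    std::getline(in, c);
    count = std::stoull(c);
    return true;
}

int main(int argc, char** argv) {
    std::vector<std::ifstream> files;                 // spill files passed on the command line
    for (int i = 1; i < argc; ++i) files.emplace_back(argv[i]);
    std::priority_queue<Head, std::vector<Head>, std::greater<Head>> heap;
    for (size_t i = 0; i < files.size(); ++i) {
        Head h{"", 0, i};
        if (ReadRow(files[i], h.url, h.count)) heap.push(h);
    }
    std::string cur;
    size_t total = 0;
    while (!heap.empty()) {
        Head h = heap.top();
        heap.pop();
        if (h.url != cur) {
            if (total) { /* feed (cur, total) to the TopN heap (next sketch) */ }
            cur = h.url;
            total = 0;
        }
        total += h.count;                             // Final-Aggregation: global count
        Head next{"", 0, h.file};
        if (ReadRow(files[h.file], next.url, next.count)) heap.push(next);
    }
    if (total) { /* feed the last (cur, total) as well */ }
}
```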
Finally, we maintain a TopN heap to keep the current TopN most frequent URLs as (URL, count) pairs.
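A minimal sketch of such a TopN heap as a size-bounded min-heap keyed on count (the TopN class name is illustrative):

```cpp
#include <cstddef>
#include <functional>
#include <queue>
#include <string>
#include <utility>
#include <vector>

// (count, URL) ordered by count; std::greater makes the priority_queue a
// min-heap, so the smallest count of the current candidates sits at the top.
using UrlCount = std::pair<size_t, std::string>;

class TopN {
public:
    explicit TopN(size_t n) : n_(n) {}
    void Add(const std::string& url, size_t count) {
        if (heap_.size() < n_) {
            heap_.emplace(count, url);
        } else if (count > heap_.top().first) {
            heap_.pop();                              // evict the current minimum
            heap_.emplace(count, url);
        }
    }
    std::vector<UrlCount> Result() {                  // ascending by count
        std::vector<UrlCount> r;
        while (!heap_.empty()) {
            r.push_back(heap_.top());
            heap_.pop();
        }
        return r;
    }
private:
    size_t n_;
    std::priority_queue<UrlCount, std::vector<UrlCount>, std::greater<UrlCount>> heap_;
};
```

Each merged (URL, count) pair is either rejected in O(1) or inserted in O(log N), so this step adds little to the merge cost.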
There are still some opportunities for optimization. Here is a profile from a recent test:
4879 25.5% 25.5% 4879 25.5% std::istream::get
2781 14.5% 40.0% 2781 14.5% __write_nocancel
1409 7.4% 47.4% 1409 7.4% __memcmp_sse4_1
767 4.0% 51.4% 1021 5.3% __gnu_cxx::__normal_iterator::operator+
759 4.0% 55.3% 759 4.0% topN_freq::StringVal::StrCompare
596 3.1% 58.4% 1122 5.9% std::_Hashtable::_M_find_before_node
468 2.4% 60.9% 468 2.4% __read_nocancel
366 1.9% 62.8% 366 1.9% __memcpy_sse2_unaligned
338 1.8% 64.6% 338 1.8% __gnu_cxx::__normal_iterator::__normal_iterator
332 1.7% 66.3% 332 1.7% std::__detail::_Mod_range_hashing::operator
326 1.7% 68.0% 1799 9.4% std::__adjust_heap
302 1.6% 69.6% 302 1.6% __gnu_cxx::__normal_iterator::operator*
287 1.5% 71.1% 287 1.5% malloc_consolidate
279 1.5% 72.5% 779 4.1% std::__detail::_Hash_code_base::_M_bucket_index
252 1.3% 73.9% 998 5.2% std::__push_heap
199 1.0% 74.9% 199 1.0% std::equal_to::operator
170 0.9% 75.8% 170 0.9% _int_malloc
157 0.8% 76.6% 157 0.8% __GI___libc_free
149 0.8% 77.4% 149 0.8% _int_free
146 0.8% 78.1% 146 0.8% std::move
115 0.6% 78.7% 185 1.0% topN_freq::ResultTuple::operator<
105 0.5% 79.3% 784 4.1% std::_Hashtable::_M_erase
103 0.5% 79.8% 103 0.5% __unlink
102 0.5% 80.4% 131 0.7% std::__detail::_Hash_code_base::_M_h2
99 0.5% 80.9% 99 0.5% std::istream::sentry::sentry
98 0.5% 81.4% 98 0.5% std::__detail::_Hashtable_ebo_helper::_S_cget
97 0.5% 81.9% 97 0.5% std::forward
92 0.5% 82.4% 1452 7.6% std::__unguarded_partition
80 0.4% 82.8% 1950 10.2% topN_freq::MemPool::Free
78 0.4% 83.2% 485 2.5% topN_freq::StringLess::operator
77 0.4% 83.6% 77 0.4% __gnu_cxx::__normal_iterator::operator--
77 0.4% 84.0% 1484 7.8% topN_freq::StringVal::operator<
76 0.4% 84.4% 1576 8.2% std::__detail::_Map_base::operator[]
73 0.4% 84.8% 80 0.4% __gnu_cxx::operator-
69 0.4% 85.1% 69 0.4% __gnu_cxx::__normal_iterator::operator++
69 0.4% 85.5% 931 4.9% std::_Hashtable::_M_bucket_index
63 0.3% 85.8% 1041 5.4% std::_Hashtable::_M_find_node
59 0.3% 86.1% 85 0.4% std::__detail::_Hash_code_base::_M_h1
57 0.3% 86.4% 57 0.3% std::num_get::_M_extract_int
57 0.3% 86.7% 186 1.0% std::vector::push_back
56 0.3% 87.0% 113 0.6% topN_freq::FileManager::ReadLine@405694
52 0.3% 87.3% 114 0.6% std::get
52 0.3% 87.6% 224 1.2% std::less::operator
52 0.3% 87.8% 81 0.4% std::vector::begin
51 0.3% 88.1% 63 0.3% __gnu_cxx::operator<
51 0.3% 88.4% 189 1.0% std::_Hashtable::_M_insert_bucket_begin
49 0.3% 88.6% 49 0.3% __gnu_cxx::__normal_iterator::base
49 0.3% 88.9% 2389 12.5% topN_freq::FileMerger::NextResultTuple
48 0.3% 89.1% 117 0.6% std::__detail::_Hash_code_base::_M_hash_code
43 0.2% 89.4% 121 0.6% std::__detail::_Select1st::operator
42 0.2% 89.6% 42 0.2% std::__detail::_Node_iterator_base::_Node_iterator_base
36 0.2% 89.8% 358 1.9% std::__unguarded_linear_insert
35 0.2% 90.0% 43 0.2% _ZN9__gnu_cxx13new_allocatorIN9topN_freq9StringValEE9constructIS2_IRKS2_EEEvPT_DpOT0_
35 0.2% 90.1% 73 0.4% std::__detail::_Node_iterator::_Node_iterator
34 0.2% 90.3% 231 1.2% _ZN9__gnu_cxx13new_allocatorINSt8__detail10_Hash_nodeISt4pairIKPcmELb0EEEE9constructIS7_IRKSt21piecewise_construct_tSt5tupleIIRS5_EESD_IIEEEEEvPT_DpOT0_
34 0.2% 90.5% 311 1.6% std::_Hashtable::_M_allocate_node
34 0.2% 90.7% 291 1.5% std::__detail::_Equal_helper::_S_equals
Possible optimization points:
- Refactor the MemPool. Load 1GB of URLs directly into memory instead of reading URLs line by line, and after sorting and spilling, free the MemPool all at once instead of freeing URLs one by one. This removes the time spent in std::istream::get, and the hash-map operations of the current MemPool (e.g. std::_Hashtable::_M_find_before_node) can also be saved. A rough sketch follows the list.
- Don't spill the last batch; aggregate it in memory together with the other spilled files. This saves about 1% of the IO (since we have 100 spilled files).
- Try mmap for reading files (see the mmap sketch after the list).
- Spill to different disks. For example, if we can leverage 4 disks, divide the memory pool into 4 buckets, sort the URLs in each bucket in parallel, and spill the buckets to different disks in parallel. This also speeds up the Final-Aggregation stage, since we get 4x the read bandwidth.
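A rough sketch of the first point (bulk-load the buffer and release it at once); the partial-line handling at chunk boundaries is omitted and the Slice type is illustrative, not the project's MemPool API:

```cpp
#include <algorithm>
#include <cstdio>
#include <cstring>
#include <vector>

// Non-owning view into the big read buffer; sorting moves small slices
// around instead of whole URL strings.
struct Slice {
    const char* data;
    size_t len;
    bool operator<(const Slice& o) const {
        int c = std::memcmp(data, o.data, std::min(len, o.len));
        return c < 0 || (c == 0 && len < o.len);
    }
};

int main(int argc, char** argv) {
    const size_t kChunk = 1UL << 30;                  // ~1GB memory budget
    std::vector<char> buf(kChunk);
    FILE* in = std::fopen(argv[1], "rb");
    size_t got;
    while ((got = std::fread(buf.data(), 1, kChunk, in)) > 0) {
        // NOTE: a real implementation must carry the bytes after the last
        // '\n' (a partial URL) over into the next chunk; omitted for brevity.
        std::vector<Slice> urls;
        const char* p = buf.data();
        const char* end = buf.data() + got;
        while (p < end) {
            const char* nl = static_cast<const char*>(std::memchr(p, '\n', end - p));
            if (!nl) break;
            urls.push_back({p, static_cast<size_t>(nl - p)});
            p = nl + 1;
        }
        std::sort(urls.begin(), urls.end());          // sort slices, not copies
        // ... spill the sorted run here; "freeing" the run is just reusing
        // `buf` for the next fread, with no per-URL frees.
    }
    std::fclose(in);
}
```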
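And a rough sketch of the mmap idea: map the input read-only and scan it with memchr, so there are no per-character stream reads (error handling omitted):

```cpp
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>
#include <cstdio>
#include <cstring>

int main(int argc, char** argv) {
    int fd = open(argv[1], O_RDONLY);                 // error checks omitted for brevity
    struct stat st;
    fstat(fd, &st);
    void* base = mmap(nullptr, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    madvise(base, st.st_size, MADV_SEQUENTIAL);       // hint: one sequential scan
    const char* p = static_cast<const char*>(base);
    const char* end = p + st.st_size;
    size_t urls = 0;
    while (p < end) {
        const char* nl = static_cast<const char*>(memchr(p, '\n', end - p));
        if (!nl) break;
        // [p, nl) is one URL; hand it to the counting / sorting stage here.
        ++urls;
        p = nl + 1;
    }
    std::printf("%zu urls\n", urls);
    munmap(base, st.st_size);
    close(fd);
}
```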