This project suggests split points for Accumulo tablets based on the average number of entries per tablet. The goal is to avoid uneven processing times across tablets.
- Start an Accumulo cluster using https://github.com/medined/Accumulo_1_5_0_By_Vagrant.
- Download this project (I used the host machine, not the virtual one).
- Run 'mvn package' to generate a jar file.
- Copy the jar file to the Accumulo_1_5_0_By_Vagrant directory.
- Switch to the Accumulo_1_5_0_By_Vagrant directory.
- Run 'vagrant ssh master' to log in to the master node.
- Run the following command, making sure to change the table name to a table that exists.
tool.sh \
/vagrant/SplitLargeTablets-1.0-SNAPSHOT.jar \
com.affy.SuggestSplitPoints \
-i instance \
-u root \
-p secret \
-z affy-master:2181 \
--output ./suggest_split_points \
--min_entries 1000000 \
-t tableA
Once the job finishes, you can list the output files with this command:
hadoop fs -ls ./suggest_split_points
The split point suggestions are in any file with non-zero length. It's fairly easy to read the set of part-m-XXXX files and build a SortedSet that can be passed to the addSplits method of TableOperations.
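
Reading the part files into a sorted set could look roughly like the sketch below. It rests on assumptions that are not confirmed by this project: the part-m-XXXX files have first been copied to a local directory (for example with 'hadoop fs -get'), and each non-empty line holds exactly one suggested split point. The class name SplitPointReader and the method readSplitPoints are made up for illustration. Only the Java standard library is used so the sketch runs without a cluster.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.*;
import java.util.*;

// Hypothetical helper, not part of this project.
public class SplitPointReader {

    // Collects split points from every non-empty part-m-* file in a local directory.
    // Assumption: one split point per line; the real output format may differ.
    public static SortedSet<String> readSplitPoints(Path dir) throws IOException {
        SortedSet<String> splits = new TreeSet<>();
        try (DirectoryStream<Path> parts = Files.newDirectoryStream(dir, "part-m-*")) {
            for (Path part : parts) {
                if (Files.size(part) == 0) {
                    continue; // zero-length files carry no suggestions
                }
                for (String line : Files.readAllLines(part, StandardCharsets.UTF_8)) {
                    String trimmed = line.trim();
                    if (!trimmed.isEmpty()) {
                        splits.add(trimmed); // the set removes duplicates and keeps order
                    }
                }
            }
        }
        return splits;
    }

    public static void main(String[] args) throws IOException {
        // Defaults to a local copy of the output directory used above.
        Path dir = Paths.get(args.length > 0 ? args[0] : "suggest_split_points");
        for (String split : readSplitPoints(dir)) {
            System.out.println(split);
        }
    }
}
```

To apply the suggestions, each string would still need to be wrapped in an org.apache.hadoop.io.Text and collected into a SortedSet of Text values (a TreeSet works), since that is the type addSplits expects along with the table name. Reading directly from HDFS instead of a local copy would require the Hadoop FileSystem API, which this sketch leaves out.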