protectwise / cassandra-util

Cassandra Utilities from ProtectWise


Backup option causes error if there are overlapping keys in source SSTables

ayazahmad786 opened this issue

I was experimenting with the DeletingCompactionStrategy on one of our tables.
The details are below:

ALTER TABLE selenium.orders WITH compaction = {
  'max_threshold': '8',
  'min_threshold': '2',
  'rules_select_statement': 'SELECT rulename, column, range_lower, range_upper, ttl FROM simility.deletion_rules_ttl WHERE ks=''selenium'' AND tbl=''orders''',
  'dcs_underlying_compactor': 'org.apache.cassandra.db.compaction.SizeTieredCompactionStrategy',
  'dcs_convictor': 'com.protectwise.cassandra.retrospect.deletion.RuleBasedLateTTLConvictor',
  'dcs_backup_dir': '/mnt/disk2/cassandra/deletion_backups',
  'dcs_status_report_ms': '60000',
  'class': 'com.protectwise.cassandra.db.compaction.DeletingCompactionStrategy'
};

CREATE TABLE simility.deletion_rules_ttl (
    ks text,
    tbl text,
    rulename text,
    column text,
    range_lower text,
    range_upper text,
    ttl bigint,
    PRIMARY KEY ((ks, tbl), rulename)
) WITH CLUSTERING ORDER BY (rulename ASC)
    AND bloom_filter_fp_chance = 0.01
    AND caching = '{"keys":"ALL", "rows_per_partition":"NONE"}'
    AND comment = ''
    AND compaction = {'class': 'org.apache.cassandra.db.compaction.SizeTieredCompactionStrategy'}
    AND compression = {'sstable_compression': 'org.apache.cassandra.io.compress.LZ4Compressor'}
    AND dclocal_read_repair_chance = 0.1
    AND default_time_to_live = 0
    AND gc_grace_seconds = 864000
    AND max_index_interval = 2048
    AND memtable_flush_period_in_ms = 0
    AND min_index_interval = 128
    AND read_repair_chance = 0.0
    AND speculative_retry = '99.0PERCENTILE';
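
For reference, a rule row in this table might look something like the following insert. Only the schema above comes from this issue; the rule name, target column, and range values are made up for illustration.

-- Hypothetical rule: apply a 7-day late TTL to selenium.orders rows whose
-- order_id sorts between '1000' and '2000' (the bounds are compared as text here).
INSERT INTO simility.deletion_rules_ttl (ks, tbl, rulename, column, range_lower, range_upper, ttl)
VALUES ('selenium', 'orders', 'expire_old_orders', 'order_id', '1000', '2000', 604800);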

I have modified the RuleBasedLateTTLConvictor.java file to remove the range tuple<text, text>, since our Spark SQL queries were failing with a 'frozen tuple<text, text> field not found' exception. Instead of the tuple, I am now using range_lower (type text) and range_upper (type text).

After applying these changes, when we run the nodetool compact selenium orders command, the compaction fails to run.

Sometimes it throws this exception:
ERROR [CompactionExecutor:3] 2016-12-29 07:49:29,635 CassandraDaemon.java:223 - Exception in thread Thread[CompactionExecutor:3,1,main]
java.lang.RuntimeException: Last written key DecoratedKey(-9214344598665125698, 373439303632) >= current key DecoratedKey(-9214344598665125698, 373439303632) writing into /mnt/disk2/cassandra/deletion_backups/selenium-orders-tmp-ka-2421-Data.db
at org.apache.cassandra.io.sstable.SSTableWriter.beforeAppend(SSTableWriter.java:164) ~[cassandra-all-2.1.8.621.jar:2.1.8.621]
at org.apache.cassandra.io.sstable.SSTableWriter.append(SSTableWriter.java:218) ~[cassandra-all-2.1.8.621.jar:2.1.8.621]
at com.protectwise.cassandra.db.compaction.BackupSinkForDeletingCompaction.flush(BackupSinkForDeletingCompaction.java:52) ~[deleting-compaction-strategy-0.24-SNAPSHOT.jar:0.24-SNAPSHOT]
at com.protectwise.cassandra.db.compaction.BackupSinkForDeletingCompaction.accept(BackupSinkForDeletingCompaction.java:74) ~[deleting-compaction-strategy-0.24-SNAPSHOT.jar:0.24-SNAPSHOT]
at com.protectwise.cassandra.db.columniterator.FilteringOnDiskAtomIterator$1.apply(FilteringOnDiskAtomIterator.java:78) ~[deleting-compaction-strategy-0.24-SNAPSHOT.jar:0.24-SNAPSHOT]
at com.protectwise.cassandra.db.columniterator.FilteringOnDiskAtomIterator$1.apply(FilteringOnDiskAtomIterator.java:59) ~[deleting-compaction-strategy-0.24-SNAPSHOT.jar:0.24-SNAPSHOT]
at com.google.common.collect.Iterators$7.computeNext(Iterators.java:647) ~[guava-16.0.1.jar:na]
at com.google.common.collect.AbstractIterator.tryToComputeNext(AbstractIterator.java:143) ~[guava-16.0.1.jar:na]
at com.google.common.collect.AbstractIterator.hasNext(AbstractIterator.java:138) ~[guava-16.0.1.jar:na]
at com.protectwise.cassandra.db.columniterator.FilteringOnDiskAtomIterator.hasNext(FilteringOnDiskAtomIterator.java:137) ~[deleting-compaction-strategy-0.24-SNAPSHOT.jar:0.24-SNAPSHOT]
at org.apache.cassandra.utils.MergeIterator$OneToOne.computeNext(MergeIterator.java:202) ~[cassandra-all-2.1.8.621.jar:2.1.8.621]
at com.google.common.collect.AbstractIterator.tryToComputeNext(AbstractIterator.java:143) ~[guava-16.0.1.jar:na]
at com.google.common.collect.AbstractIterator.hasNext(AbstractIterator.java:138) ~[guava-16.0.1.jar:na]
at com.google.common.collect.Iterators$7.computeNext(Iterators.java:645) ~[guava-16.0.1.jar:na]
at com.google.common.collect.AbstractIterator.tryToComputeNext(AbstractIterator.java:143) ~[guava-16.0.1.jar:na]
at com.google.common.collect.AbstractIterator.hasNext(AbstractIterator.java:138) ~[guava-16.0.1.jar:na]
at org.apache.cassandra.db.ColumnIndex$Builder.buildForCompaction(ColumnIndex.java:165) ~[cassandra-all-2.1.8.621.jar:2.1.8.621]
at org.apache.cassandra.db.compaction.LazilyCompactedRow.write(LazilyCompactedRow.java:121) ~[cassandra-all-2.1.8.621.jar:2.1.8.621]
at org.apache.cassandra.io.sstable.SSTableWriter.append(SSTableWriter.java:192) ~[cassandra-all-2.1.8.621.jar:2.1.8.621]
at org.apache.cassandra.io.sstable.SSTableRewriter.append(SSTableRewriter.java:127) ~[cassandra-all-2.1.8.621.jar:2.1.8.621]
at org.apache.cassandra.db.compaction.CompactionTask.runMayThrow(CompactionTask.java:197) ~[cassandra-all-2.1.8.621.jar:2.1.8.621]
at org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:28) ~[cassandra-all-2.1.8.621.jar:2.1.8.621]
at org.apache.cassandra.db.compaction.CompactionTask.executeInternal(CompactionTask.java:73) ~[cassandra-all-2.1.8.621.jar:2.1.8.621]
at org.apache.cassandra.db.compaction.AbstractCompactionTask.execute(AbstractCompactionTask.java:59) ~[cassandra-all-2.1.8.621.jar:2.1.8.621]
at org.apache.cassandra.db.compaction.CompactionManager$BackgroundCompactionTask.run(CompactionManager.java:244) ~[cassandra-all-2.1.8.621.jar:2.1.8.621]
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) ~[na:1.8.0_77]
at java.util.concurrent.FutureTask.run(FutureTask.java:266) ~[na:1.8.0_77]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) ~[na:1.8.0_77]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [na:1.8.0_77]

When we remove the DeletingCompactionStrategy from the table, the nodetool compact command works as expected.

It's probably not very likely that your changes caused this, since they would all have been in the convictor and not in the compaction strategy itself, but could you fork this project and post the changes you made to support the non-tupled version of that rules table? Out of curiosity, why do you need to use Spark on the rules table?

The error indicates that records are being written out of order to the backup directory. Since we write records in the same order they are considered for compaction, it's surprising that they could end up out of order, given that the source data is always in order.

But I have a hunch about what might be going on here: the check in SSTableWriter is >=, so maybe it's overlapping records from different source SSTables. I'll get working on a test to prove it. Unfortunately, there will need to be a little refactoring of the test code, because right now we're doing user-defined compaction one table at a time, and this would never appear under that circumstance.

This is related to the data backup operation. If you're OK with skipping the backups (for example, if you take a snapshot first), you could use that as a workaround in the meantime.
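
Roughly, that workaround would look like the statement below, assuming the backup step is disabled simply by leaving dcs_backup_dir out of the compaction options (and taking a nodetool snapshot first if you still want a copy of the data):

-- Same options as before, minus dcs_backup_dir (assumed to be what disables backups).
ALTER TABLE selenium.orders WITH compaction = {
  'class': 'com.protectwise.cassandra.db.compaction.DeletingCompactionStrategy',
  'dcs_underlying_compactor': 'org.apache.cassandra.db.compaction.SizeTieredCompactionStrategy',
  'dcs_convictor': 'com.protectwise.cassandra.retrospect.deletion.RuleBasedLateTTLConvictor',
  'rules_select_statement': 'SELECT rulename, column, range_lower, range_upper, ttl FROM simility.deletion_rules_ttl WHERE ks=''selenium'' AND tbl=''orders''',
  'dcs_status_report_ms': '60000',
  'max_threshold': '8',
  'min_threshold': '2'
};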

Hi,
First of all, thanks for the quick reply. Here is the link to the code:
ayazahmad786@a540272

Even though the Spark SQL queries are not run against the rule tables, the error still comes up.

The problem with the suggested approach is that the RuleBasedLateTTLConvictor only deletes data from the SSTables that are in the process of being compacted, so if a single row also occurs in SSTables outside the set being compacted, it will give a wrong picture.

Yeah, there's a fundamental problem of consistency past the deletion boundary, which is intrinsic to this approach. You definitely have no guarantee that for a given primary key (PK), all SSTables containing data for that PK will get compacted together. It could definitely be the case that you do writes A, then B to a PK (sufficiently far apart that they are in different SSTables), and B deletes first, thus resurrecting the value A for a while until it also gets deleted. You will also have consistency boundary issues where data for a PK will not have been consistently deleted on various nodes, so multiple consecutive reads of the same record may give you different results.

Your application needs to be able to handle consistency issues past the deletion boundary - in our case we know what the retention window is for each of our tenants, and we limit their queries to that window so that they aren't ever reading records from the time period where the consistency issues begin.
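
As a sketch of what that query-side guard looks like (the selenium.orders columns here are hypothetical, since its schema isn't shown in this issue):

-- Hypothetical: the application computes the tenant's retention boundary and never
-- queries past it, so records in the inconsistent window behind that boundary are never read.
SELECT * FROM selenium.orders
WHERE customer_id = 'tenant-123'               -- hypothetical partition key
  AND order_time >= '2016-12-01 00:00:00+0000' -- retention boundary computed by the application
  AND order_time <  '2016-12-29 00:00:00+0000';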

I'm catching up from the holidays, I'll do my best to get at least a test case to reproduce this error this week, which will give me a better sense of what to expect WRT how difficult it is to fix. I don't think it'll be too bad though.

I should have a fix for this in bug/issue-6. Do you want to give that a shot and let me know if it addresses your problem? I'm pretty confident it should.

I'm not as happy with this solution as I'd like to be, as detailed in 790d688: basically, compaction merging needs to happen for backups, but all the existing compaction merging classes assume they are being run inside the compaction controller, which is something we don't have access to during the backup operation.