spotify / cstar

Apache Cassandra cluster orchestration tool for the command line

cstar failing on "Failed getting data from cache" during "continue"

yakirgb opened this issue

Hi,
I'm using cstar 0.8.0.
I executed cstar and then tried to resume/continue the job, but the command failed with "Failed getting data from cache".
First run:

/usr/local/bin/cstar run --command=/home/dba/cassandra/scripts/fstrim_run.sh --seed-host=reco001.tab.com --ignore-down-nodes --topology-per-dc --max-concurrency=14 --ssh-lib=ssh2 --ssh-identity-file=/var/lib/jenkins/.ssh/id_rsa  --ssh-username=root -v
Job id is e8507178-a99e-4765-a967-458e5f3d8892

I can see that the cache file was created:

/var/lib/jenkins/.cstar/cache/endpoint_mapping-a566699b-a182-3247-a9e6-95dba8308b58-b164c8933dc0fdc44c4f030609bc25f6

As I understand it, the cache file name is derived from schema_versions and status_topology_hash:

[root@dba e8507178-a99e-4765-a967-458e5f3d8892]# grep schema_versions -A1 job.json
    "schema_versions": [
        "a566699b-a182-3247-a9e6-95dba8308b58"
[root@dba e8507178-a99e-4765-a967-458e5f3d8892]# grep status_topology_hash -A1 job.json
    "status_topology_hash": [
        "b164c8933dc0fdc44c4f030609bc25f6"

And "continue" failed:

/usr/local/bin/cstar continue e8507178-a99e-4765-a967-458e5f3d8892 -v

The error:

08:30:22 Failed getting data from cache : <traceback object at 0x7f3eb3bd3688>
08:35:22 Traceback (most recent call last):
08:35:22   File "/usr/local/bin/cstar", line 8, in <module>
08:35:22     sys.exit(main())
08:35:22   File "/usr/local/lib/python3.6/site-packages/cstar/cstarcli.py", line 214, in main
08:35:22     namespace.func(namespace)
08:35:22   File "/usr/local/lib/python3.6/site-packages/cstar/cstarcli.py", line 57, in execute_continue
08:35:22     output_directory=args.output_directory, retry=args.retry_failed)
08:35:22   File "/usr/local/lib/python3.6/site-packages/cstar/jobreader.py", line 35, in read
08:35:22     return _parse(f.read(), file, output_directory, job, job_id, stop_after, max_days, endpoint_mapper, retry)
08:35:22   File "/usr/local/lib/python3.6/site-packages/cstar/jobreader.py", line 92, in _parse
08:35:22     endpoint_mapping = endpoint_mapper(original_topology)
08:35:22   File "/usr/local/lib/python3.6/site-packages/cstar/job.py", line 222, in get_endpoint_mapping
08:35:22     pickle.dump(dict(endpoint_mappings), open(self.get_cache_file_path("endpoint_mapping"), 'wb'))
08:35:22   File "/usr/local/lib/python3.6/site-packages/cstar/job.py", line 130, in get_cache_file_path
08:35:22     return os.path.join(self.cache_directory, "{}-{}-{}".format(cache_type, "-".join(sorted(self.schema_versions)), "-".join(sorted(self.status_topology_hash))))
08:35:22 AttributeError: 'Job' object has no attribute 'cache_directory'
...
08:35:22 Summary:
08:35:22   File "/usr/local/lib/python3.6/site-packages/cstar/jobreader.py", line 35, in read
08:35:22     return _parse(f.read(), file, output_directory, job, job_id, stop_after, max_days, endpoint_mapper, retry)
08:35:22   File "/usr/local/lib/python3.6/site-packages/cstar/jobreader.py", line 92, in _parse
08:35:22   File "/usr/local/lib/python3.6/site-packages/cstar/job.py", line 222, in get_endpoint_mapping
08:35:22   File "/usr/local/lib/python3.6/site-packages/cstar/job.py", line 130, in get_cache_file_path
08:35:22 AttributeError: 'Job' object has no attribute 'cache_directory'
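
If I read the traceback correctly, the resume path calls the endpoint mapper, which tries to write its result to the cache via get_cache_file_path, before cache_directory has been set on the restored Job. A minimal illustration of that ordering problem, with a made-up class structure rather than cstar's actual code:

    import os

    class Job:
        # Illustrative only, not cstar's actual class layout.
        def setup(self, cache_directory):
            # On a fresh run this is called before anything touches the cache.
            self.cache_directory = cache_directory

        def get_cache_file_path(self, cache_type):
            # Same idea as job.py line 130: needs cache_directory to exist.
            return os.path.join(self.cache_directory, cache_type)

    restored = Job()  # resumed job: setup() has not been called yet
    try:
        restored.get_cache_file_path("endpoint_mapping")
    except AttributeError as err:
        print(err)  # 'Job' object has no attribute 'cache_directory'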

Thank you,
Yakir Gibraltar

Hi @yakirgb,

I've created #63 to fix this issue. Would you mind testing this branch to see if it fixes your problem?
You can install it using: sudo pip3 install git+https://github.com/spotify/cstar.git@alex/fix-continue-cache

Thanks!

Thank you @adejanovski, I'm testing your PR and will update.

Hi @adejanovski, I installed your branch and the job failed with:

12:21:55 unbuffer /usr/local/bin/cstar continue d8e7e11e-665d-4e97-a432-eb5d31ba9e8e -v 2>&1 | tee -a /var/log/cstar/cstar_RESUME_20200929-122154.log
12:21:55 Retry :  False
12:22:07 Resuming job d8e7e11e-665d-4e97-a432-eb5d31ba9e8e
12:22:07 Running  /usr/local/lib/python3.6/site-packages/cstar/resources/commands/run.sh
12:22:26 Traceback (most recent call last):
12:22:26   File "/usr/local/bin/cstar", line 11, in <module>
12:22:26     load_entry_point('cstar==0.8.1.dev0', 'console_scripts', 'cstar')()
12:22:26   File "/usr/local/lib/python3.6/site-packages/cstar/cstarcli.py", line 225, in main
12:22:26     namespace.func(namespace)
12:22:26   File "/usr/local/lib/python3.6/site-packages/cstar/cstarcli.py", line 70, in execute_continue
12:22:26     job.resume()
12:22:26   File "/usr/local/lib/python3.6/site-packages/cstar/job.py", line 331, in resume
12:22:26     self.resume_on_running_hosts()
12:22:26   File "/usr/local/lib/python3.6/site-packages/cstar/job.py", line 360, in resume_on_running_hosts
12:22:26     threading.Thread(target=self.job_runner(self, host, self.ssh_username, self.ssh_password, self.ssh_identity_file, self.ssh_lib),
12:22:26 TypeError: __init__() missing 1 required positional argument: 'host_variables'
12:22:26 

Maybe we need to add self.get_host_variables(host) in https://github.com/spotify/cstar/blob/master/cstar/job.py#L360 ?
It should be:

    def resume_on_running_hosts(self):
        for host in self.state.progress.running:
            debug("Resume on host", host.fqdn)
            threading.Thread(target=self.job_runner(self, host, self.ssh_username, self.ssh_password, self.ssh_identity_file, self.ssh_lib, self.get_host_variables(host)),
                             name="cstar %s" % host.fqdn).start()
            time.sleep(self.sleep_on_new_runner)

Instead of:

    def resume_on_running_hosts(self):
        for host in self.state.progress.running:
            debug("Resume on host", host.fqdn)
            threading.Thread(target=self.job_runner(self, host, self.ssh_username, self.ssh_password, self.ssh_identity_file, self.ssh_lib),
                             name="cstar %s" % host.fqdn).start()
            time.sleep(self.sleep_on_new_runner)

Thank you, Yakir.

Hi @adejanovski, can you please take a look at #62 (comment)?

I checked on my environment, and adding self.get_host_variables(host) to resume_on_running_hosts solved the issue.

Hi @adejanovski, can you please check?

@yakirgb,

I've pushed another commit with the actual fix.
Let me know if this works for you.

Thanks

Hi @adejanovski,
The fix is working.
Can you please bump the version?

Now I'm getting new errors:

[jenkins@dba ~]$ /usr/bin/python3.6 /usr/local/bin/cstar continue d416447c-a1d6-4b3a-9649-67b48967fa64 -v
Retry :  False
Resuming job d416447c-a1d6-4b3a-9649-67b48967fa64
Running  /usr/local/lib/python3.6/site-packages/cstar/resources/commands/run.sh
Traceback (most recent call last):
  File "/usr/local/bin/cstar", line 11, in <module>
    load_entry_point('cstar==0.8.1.dev0', 'console_scripts', 'cstar')()
  File "/usr/local/lib/python3.6/site-packages/cstar/cstarcli.py", line 225, in main
    namespace.func(namespace)
  File "/usr/local/lib/python3.6/site-packages/cstar/cstarcli.py", line 70, in execute_continue
    job.resume()
  File "/usr/local/lib/python3.6/site-packages/cstar/job.py", line 332, in resume
    self.run()
  File "/usr/local/lib/python3.6/site-packages/cstar/job.py", line 344, in run
    self.schedule_all_runnable_jobs()
  File "/usr/local/lib/python3.6/site-packages/cstar/job.py", line 433, in schedule_all_runnable_jobs
    next_host = self.state.find_next_host()
  File "/usr/local/lib/python3.6/site-packages/cstar/state.py", line 75, in find_next_host
    ignore_down_nodes=self.ignore_down_nodes)
  File "/usr/local/lib/python3.6/site-packages/cstar/strategy.py", line 69, in find_next_host
    return _strategy_mapping[strategy](remaining, endpoint_mapping, progress.running)
  File "/usr/local/lib/python3.6/site-packages/cstar/strategy.py", line 83, in _topology_find_next_host
    for next in endpoint_mapping[h]:
KeyError: Host(fqdn='1.2.3.4', ip='1.2.3.4', dc='Ud', cluster='Ud Cluster', rack='RAC1', is_up=True, host_id='488a32ea-4b75-458d-b34b-a43b92075894')
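
It looks like the endpoint mapping loaded for the resumed job has no entry for that host (maybe the topology changed between the original run and the continue). Just to illustrate the failure mode, not the actual fix: a lookup along these lines would skip hosts missing from the mapping instead of raising (the function name and shape are made up for the example):

    # Hypothetical guard around the lookup that raised above; cstar's real fix may differ.
    def replicas_of_running_hosts(endpoint_mapping, running_hosts):
        blocked = set()
        for h in running_hosts:
            # endpoint_mapping[h] raised KeyError above; .get() tolerates unknown hosts.
            blocked.update(endpoint_mapping.get(h, set()))
        return blocked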

Thanks a lot, Yakir.

@adejanovski, can you please merge your fix?

Thank you @adejanovski.
Any chance of bumping the minor version?

@yakirgb, I've released 0.8.1 with the fixes.

Thank you!