cstar failing on "Failed getting data from cache" during "continue"
yakirgb opened this issue
Hi,
I'm using cstar 0.8.0.
I executed cstar and tried to resume/continue but the command failed with "Failed getting data from cache".
First run:
/usr/local/bin/cstar run --command=/home/dba/cassandra/scripts/fstrim_run.sh --seed-host=reco001.tab.com --ignore-down-nodes --topology-per-dc --max-concurrency=14 --ssh-lib=ssh2 --ssh-identity-file=/var/lib/jenkins/.ssh/id_rsa --ssh-username=root -v
Job id is e8507178-a99e-4765-a967-458e5f3d8892
I see that the cache file was created:
/var/lib/jenkins/.cstar/cache/endpoint_mapping-a566699b-a182-3247-a9e6-95dba8308b58-b164c8933dc0fdc44c4f030609bc25f6
As I understand it, the cache file name is built from schema_versions and status_topology_hash:
[root@dba e8507178-a99e-4765-a967-458e5f3d8892]# grep schema_versions -A1 job.json
"schema_versions": [
"a566699b-a182-3247-a9e6-95dba8308b58"
[root@dba e8507178-a99e-4765-a967-458e5f3d8892]# grep status_topology_hash -A1 job.json
"status_topology_hash": [
"b164c8933dc0fdc44c4f030609bc25f6"
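For reference, the cache file name can be rebuilt from those two job.json fields. The following is only a minimal sketch based on the format string visible in the traceback further down (get_cache_file_path in job.py); the standalone helper name cache_file_path and its parameters are assumptions for illustration, not part of cstar's API.

```python
import os


def cache_file_path(cache_directory, cache_type, schema_versions, status_topology_hash):
    """Hypothetical helper mirroring the format string shown in the traceback."""
    name = "{}-{}-{}".format(
        cache_type,
        "-".join(sorted(schema_versions)),
        "-".join(sorted(status_topology_hash)),
    )
    return os.path.join(cache_directory, name)


# Values copied from the job.json excerpts above.
path = cache_file_path(
    "/var/lib/jenkins/.cstar/cache",
    "endpoint_mapping",
    ["a566699b-a182-3247-a9e6-95dba8308b58"],
    ["b164c8933dc0fdc44c4f030609bc25f6"],
)
print(path)
```

On a POSIX system this prints the same path as the cache file listed above, which suggests the cache entry itself was present and named as expected.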
And "continue" failed:
/usr/local/bin/cstar continue e8507178-a99e-4765-a967-458e5f3d8892 -v
The error:
08:30:22 Failed getting data from cache : <traceback object at 0x7f3eb3bd3688>
08:35:22 Traceback (most recent call last):
08:35:22 File "/usr/local/bin/cstar", line 8, in <module>
08:35:22 sys.exit(main())
08:35:22 File "/usr/local/lib/python3.6/site-packages/cstar/cstarcli.py", line 214, in main
08:35:22 namespace.func(namespace)
08:35:22 File "/usr/local/lib/python3.6/site-packages/cstar/cstarcli.py", line 57, in execute_continue
08:35:22 output_directory=args.output_directory, retry=args.retry_failed)
08:35:22 File "/usr/local/lib/python3.6/site-packages/cstar/jobreader.py", line 35, in read
08:35:22 return _parse(f.read(), file, output_directory, job, job_id, stop_after, max_days, endpoint_mapper, retry)
08:35:22 File "/usr/local/lib/python3.6/site-packages/cstar/jobreader.py", line 92, in _parse
08:35:22 endpoint_mapping = endpoint_mapper(original_topology)
08:35:22 File "/usr/local/lib/python3.6/site-packages/cstar/job.py", line 222, in get_endpoint_mapping
08:35:22 pickle.dump(dict(endpoint_mappings), open(self.get_cache_file_path("endpoint_mapping"), 'wb'))
08:35:22 File "/usr/local/lib/python3.6/site-packages/cstar/job.py", line 130, in get_cache_file_path
08:35:22 return os.path.join(self.cache_directory, "{}-{}-{}".format(cache_type, "-".join(sorted(self.schema_versions)), "-".join(sorted(self.status_topology_hash))))
08:35:22 AttributeError: 'Job' object has no attribute 'cache_directory'
...
08:35:22 Summary:
08:35:22 File "/usr/local/lib/python3.6/site-packages/cstar/jobreader.py", line 35, in read
08:35:22 return _parse(f.read(), file, output_directory, job, job_id, stop_after, max_days, endpoint_mapper, retry)
08:35:22 File "/usr/local/lib/python3.6/site-packages/cstar/jobreader.py", line 92, in _parse
08:35:22 File "/usr/local/lib/python3.6/site-packages/cstar/job.py", line 222, in get_endpoint_mapping
08:35:22 File "/usr/local/lib/python3.6/site-packages/cstar/job.py", line 130, in get_cache_file_path
08:35:22 AttributeError: 'Job' object has no attribute 'cache_directory'
Thank you,
Yakir Gibraltar
Thank you @adejanovski, I'm testing your PR and will update.
Hi @adejanovski, I installed your branch, and the job failed with:
12:21:55 unbuffer /usr/local/bin/cstar continue d8e7e11e-665d-4e97-a432-eb5d31ba9e8e -v 2>&1 | tee -a /var/log/cstar/cstar_RESUME_20200929-122154.log
12:21:55 Retry : False
12:22:07 Resuming job d8e7e11e-665d-4e97-a432-eb5d31ba9e8e
12:22:07 Running /usr/local/lib/python3.6/site-packages/cstar/resources/commands/run.sh
12:22:26 Traceback (most recent call last):
12:22:26 File "/usr/local/bin/cstar", line 11, in <module>
12:22:26 load_entry_point('cstar==0.8.1.dev0', 'console_scripts', 'cstar')()
12:22:26 File "/usr/local/lib/python3.6/site-packages/cstar/cstarcli.py", line 225, in main
12:22:26 namespace.func(namespace)
12:22:26 File "/usr/local/lib/python3.6/site-packages/cstar/cstarcli.py", line 70, in execute_continue
12:22:26 job.resume()
12:22:26 File "/usr/local/lib/python3.6/site-packages/cstar/job.py", line 331, in resume
12:22:26 self.resume_on_running_hosts()
12:22:26 File "/usr/local/lib/python3.6/site-packages/cstar/job.py", line 360, in resume_on_running_hosts
12:22:26 threading.Thread(target=self.job_runner(self, host, self.ssh_username, self.ssh_password, self.ssh_identity_file, self.ssh_lib),
12:22:26 TypeError: __init__() missing 1 required positional argument: 'host_variables'
12:22:26
Maybe self.get_host_variables(host) needs to be added in https://github.com/spotify/cstar/blob/master/cstar/job.py#L360 ?
It should be:
def resume_on_running_hosts(self):
    for host in self.state.progress.running:
        debug("Resume on host", host.fqdn)
        threading.Thread(target=self.job_runner(self, host, self.ssh_username, self.ssh_password, self.ssh_identity_file, self.ssh_lib, self.get_host_variables(host)),
                         name="cstar %s" % host.fqdn).start()
        time.sleep(self.sleep_on_new_runner)
Instead of:
def resume_on_running_hosts(self):
    for host in self.state.progress.running:
        debug("Resume on host", host.fqdn)
        threading.Thread(target=self.job_runner(self, host, self.ssh_username, self.ssh_password, self.ssh_identity_file, self.ssh_lib),
                         name="cstar %s" % host.fqdn).start()
        time.sleep(self.sleep_on_new_runner)
Thank you, Yakir.
Hi @adejanovski, could you please take a look at #62 (comment)?
I checked in my environment, and adding self.get_host_variables(host) to resume_on_running_hosts solved the issue.
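To illustrate why the extra argument matters, here is a small self-contained reproduction of the same kind of TypeError. FakeRunner and make_runner are hypothetical stand-ins, not cstar classes; the only thing taken from the thread is the constructor's argument order and the name host_variables, and the dict passed for it is a placeholder.

```python
class FakeRunner:
    """Hypothetical stand-in for the job runner; only the signature matters."""

    def __init__(self, job, host, ssh_username, ssh_password,
                 ssh_identity_file, ssh_lib, host_variables):
        self.host = host
        self.host_variables = host_variables

    def __call__(self):
        # A callable instance can be used as a threading.Thread target.
        return "running on {} with {}".format(self.host, self.host_variables)


def make_runner(with_host_variables):
    # Placeholder values in the same positional order as the call in the thread.
    args = ["job", "host1", "root", None, "/path/to/id_rsa", "ssh2"]
    if with_host_variables:
        args.append({"example_var": "value"})  # placeholder host variables
    return FakeRunner(*args)


# Six arguments: construction fails before any thread is started.
try:
    make_runner(with_host_variables=False)
    error = None
except TypeError as exc:
    error = str(exc)
print(error)

# Seven arguments: the runner is built and can be invoked.
runner = make_runner(with_host_variables=True)
result = runner()
print(result)
```

The failure happens while the runner object is being constructed, so no thread is ever created for the host, which matches the traceback ending inside resume_on_running_hosts.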
Hi @adejanovski, could you please check?
Hi @adejanovski,
The fix is working.
Could you please bump the version?
Now I'm getting a new error:
[jenkins@dba ~]$ /usr/bin/python3.6 /usr/local/bin/cstar continue d416447c-a1d6-4b3a-9649-67b48967fa64 -v
Retry : False
Resuming job d416447c-a1d6-4b3a-9649-67b48967fa64
Running /usr/local/lib/python3.6/site-packages/cstar/resources/commands/run.sh
Traceback (most recent call last):
File "/usr/local/bin/cstar", line 11, in <module>
load_entry_point('cstar==0.8.1.dev0', 'console_scripts', 'cstar')()
File "/usr/local/lib/python3.6/site-packages/cstar/cstarcli.py", line 225, in main
namespace.func(namespace)
File "/usr/local/lib/python3.6/site-packages/cstar/cstarcli.py", line 70, in execute_continue
job.resume()
File "/usr/local/lib/python3.6/site-packages/cstar/job.py", line 332, in resume
self.run()
File "/usr/local/lib/python3.6/site-packages/cstar/job.py", line 344, in run
self.schedule_all_runnable_jobs()
File "/usr/local/lib/python3.6/site-packages/cstar/job.py", line 433, in schedule_all_runnable_jobs
next_host = self.state.find_next_host()
File "/usr/local/lib/python3.6/site-packages/cstar/state.py", line 75, in find_next_host
ignore_down_nodes=self.ignore_down_nodes)
File "/usr/local/lib/python3.6/site-packages/cstar/strategy.py", line 69, in find_next_host
return _strategy_mapping[strategy](remaining, endpoint_mapping, progress.running)
File "/usr/local/lib/python3.6/site-packages/cstar/strategy.py", line 83, in _topology_find_next_host
for next in endpoint_mapping[h]:
KeyError: Host(fqdn='1.2.3.4', ip='1.2.3.4', dc='Ud', cluster='Ud Cluster', rack='RAC1', is_up=True, host_id='488a32ea-4b75-458d-b34b-a43b92075894')
Thanks a lot, Yakir.
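The thread does not say what caused this KeyError, so the following is only a general illustration, not a diagnosis. It assumes Host behaves like a namedtuple with the fields shown in the repr above; under that assumption, a dict lookup such as endpoint_mapping[h] compares every field, so a host whose is_up flag (or any other field) differs from the cached key will not be found. The by_id lookup at the end is a hypothetical debugging aid, not something cstar does.

```python
from collections import namedtuple

# Field names taken from the Host(...) repr in the KeyError above.
# Treating Host as a plain namedtuple is an assumption.
Host = namedtuple("Host", "fqdn ip dc cluster rack is_up host_id")

# Hypothetical cached entry recorded while the node was marked down.
cached = Host("1.2.3.4", "1.2.3.4", "Ud", "Ud Cluster", "RAC1",
              False, "488a32ea-4b75-458d-b34b-a43b92075894")

# The same node as seen now, differing only in is_up.
current = cached._replace(is_up=True)

endpoint_mapping = {cached: [cached]}

# Tuple equality and hashing cover all fields, so the lookup misses.
print(current in endpoint_mapping)  # False

# Keying by host_id alone finds the entry regardless of is_up.
by_id = {h.host_id: replicas for h, replicas in endpoint_mapping.items()}
print(current.host_id in by_id)  # True
```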
@adejanovski can you merge your fix please?
Hi @yakirgb,
done 👍
Thank you @adejanovski
Is there any chance of a minor version bump?
@yakirgb, I've released 0.8.1 with the fixes.
Thank you!