HPO cannot access logger.report_single_value metrics
ilouzl opened this issue
Describe the bug
HPO does not include metrics generated by logger.report_single_value
when searching for the configured objective_metric.
To reproduce
- Define a base task which logs its result metric using
task.get_logger().report_single_value(some_value, 'accuracy')
(a minimal sketch of such a base task follows these steps)
- Define an HPO task that tries to optimize that metric:
...
task = Task.init(project_name="examples", task_name="HP optimizer", task_type=Task.TaskTypes.optimizer)
task.execute_remotely(queue_name="services")
an_optimizer = HyperParameterOptimizer(
    base_task_id=...,
    hyper_parameters=...,
    objective_metric_title="Summary",
    objective_metric_series="accuracy",
)
...
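For completeness, a minimal sketch of the base task from the first step above (the project/task names and the accuracy value are illustrative placeholders):

from clearml import Task

task = Task.init(project_name="examples", task_name="base task")
accuracy = 0.95  # stand-in for a real evaluation result
# Single values appear under the "Summary" title, with the metric name as the series
task.get_logger().report_single_value("accuracy", accuracy)
task.close()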
The HPO task fails with an error of this form:
Traceback (most recent call last):
  File "/home/miniconda3/envs/clearml/lib/python3.9/threading.py", line 980, in _bootstrap_inner
    self.run()
  File "/home/miniconda3/envs/clearml/lib/python3.9/threading.py", line 917, in run
    self._target(*self._args, **self._kwargs)
  File "/home/.clearml/venvs-builds.2.3/3.9/lib/python3.9/site-packages/clearml/automation/optimization.py", line 1997, in report_daemon
    self.report_completed_status(completed_jobs, cur_completed_jobs, task_logger, title)
  File "/home/.clearml/venvs-builds.2.3/3.9/lib/python3.9/site-packages/clearml/automation/optimization.py", line 2052, in report_completed_status
    iteration = [it[0] if it else -1 for it in iteration_value]
TypeError: 'NoneType' object is not iterable
Expected behaviour
The metrics should be discovered and optimized like regular (i.e. TensorBoard-like) metrics.
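For reference, single values are reported under the "Summary" title with the metric name as the series, which is why the optimizer above uses objective_metric_title="Summary". A sketch for inspecting what the base task actually reported, assuming Task.get_reported_single_values() is available in this SDK version (the task ID is a placeholder):

from clearml import Task

# Placeholder ID of the base task that called report_single_value
base_task = Task.get_task(task_id="<base_task_id>")
# Expected to return a dict of reported single values, e.g. {"accuracy": 0.95}
print(base_task.get_reported_single_values())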
Environment
- Server type - app.clear.ml
- ClearML SDK Version - clearml==1.14.4
- Python Version - 3.9.16
- Dockerized worker
Related Discussion
https://clearml.slack.com/archives/CTK20V944/p1709203560313889
Thanks for reporting @ilouzl.
We'll update when a fix is available.
Hey @ilouzl, you seem to have a small issue with the way you report the single value. As per the documentation, you first need to provide the name of the single value, and then its value.
Could you please re-run your code with this change and report whether the issue still persists? I wasn't able to reproduce the problem.
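For reference, a minimal sketch of the documented call order (project/task names are illustrative):

from clearml import Task

task = Task.init(project_name="examples", task_name="report demo")
logger = task.get_logger()
logger.report_single_value("accuracy", 0.95)  # documented order: name, then value
# not: logger.report_single_value(0.95, "accuracy")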
Hi @AlexandruBurlacu, it's just a typo in the issue description.
The actual implementation is correct: first the name, then the value.
Can you please provide a full example? I couldn't reproduce it on my side using this code:
from clearml import Task
from clearml.automation import (
    DiscreteParameterRange,
    HyperParameterOptimizer,
    RandomSearch
)

aSearchStrategy = RandomSearch


def job_complete_callback(
    job_id,                 # type: str
    objective_value,        # type: float
    objective_iteration,    # type: int
    job_parameters,         # type: dict
    top_performance_job_id  # type: str
):
    print('Job completed!', job_id, objective_value, objective_iteration, job_parameters)
    if job_id == top_performance_job_id:
        print('WOOT WOOT we broke the record! Objective reached {}'.format(objective_value))


task = Task.init(project_name='Hyper-Parameter Optimization',
                 task_name='Automatic Hyper-Parameter Optimization',
                 task_type=Task.TaskTypes.optimizer,
                 reuse_last_task_id=False)


def objective_function():
    import time
    task = Task.current_task()
    epochs = task.get_parameter("General/epochs", cast=True)
    for ep in range(epochs):
        # Regular scalar, reported every iteration
        task.get_logger().report_scalar(title="epoch_accuracy", series="epoch_accuracy", iteration=ep, value=ep ** 2)
        # Single value, reported under the "Summary" title
        task.get_logger().report_single_value("Final value", ep ** 2)
        time.sleep(1)
    return ep ** 2


# experiment template to optimize in the hyper-parameter optimization
objective_task = task.create_function_task(objective_function)
print(">>>>>>>", objective_task.id)

args = {'template_task_id': objective_task.id}
args = task.connect(args)

execution_queue = 'queue-7'
an_optimizer = HyperParameterOptimizer(
    base_task_id=args['template_task_id'],
    hyper_parameters=[
        DiscreteParameterRange('General/epochs', values=list(range(10, 30))),
    ],
    objective_metric_title='Summary',
    objective_metric_series='Final value',
    objective_metric_sign='max',
    max_number_of_concurrent_tasks=3,
    optimizer_class=aSearchStrategy,
    execution_queue=execution_queue,
    spawn_project=None,
    time_limit_per_job=10.,
    pool_period_min=0.2,
    total_max_jobs=10,
    max_iteration_per_job=30,
)

an_optimizer.start(job_complete_callback=job_complete_callback)
an_optimizer.set_time_limit(in_minutes=120.0)
an_optimizer.wait()
top_exp = an_optimizer.get_top_experiments(top_k=3)
print([t.id for t in top_exp])
Well @AlexandruBurlacu, your code does work for me.
I probably had a different error, but it seems to be fine now.
Thanks!