Automattic / Cron-Control

A fresh take on running WordPress's cron system, allowing parallel processing

Home Page: https://wordpress.org/plugins/wp-cron-control/


Go scheduler behaves really slowly

pySilver opened this issue

Hi @ethitter,

Thanks for your work. I'm curious how to deal properly with the following case: I have 10,000 products to be updated, and the single-product update task is written so that it can safely be executed in parallel. I have successfully changed concurrency with the a8c_cron_control_concurrent_event_whitelist filter, giving the task name 10-20 parallel executions. I've tried a number of different setups for the Go runner, but without much success. The latest one is just a simple $ ./cron-control-runner -wp /srv/www/demo.test/current/web/wp -cli /usr/bin/wp -debug -workers-run 10, which is damn slow: it just sits there doing nothing for ~15 seconds and then, boom, it executes these 10 product tasks at once.
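
For reference, the concurrency change looks roughly like this (the action name is a placeholder for my real task, and the per-action value is the allowed concurrency):

// Allow up to 10 events with this action to run concurrently.
add_filter( 'a8c_cron_control_concurrent_event_whitelist', function ( $whitelist ) {
    $whitelist['update_single_product'] = 10; // placeholder action name
    return $whitelist;
} );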

DEBUG: 2018/10/25 23:05:13 runner.go:375: runEvents-8 finished job 1540508562|cedd59e6b30268afe084b397b17c583c|f7864b1434785d9fb4cc1b12cdd7a788 for http://domain.test/fo/
DEBUG: 2018/10/25 23:05:13 runner.go:375: runEvents-7 finished job 1540508562|cedd59e6b30268afe084b397b17c583c|9cade429bfd00716fdf9ca4886840495 for http://domain.test/fo/
DEBUG: 2018/10/25 23:05:13 runner.go:375: runEvents-6 finished job 1540508562|cedd59e6b30268afe084b397b17c583c|0c0c2ac066fca18f73e157be45f40379 for http://domain.test/fo/
DEBUG: 2018/10/25 23:05:14 runner.go:375: runEvents-1 finished job 1540508562|cedd59e6b30268afe084b397b17c583c|53c318d53aa2db8abd22018475b729be for http://domain.test/fo/
DEBUG: 2018/10/25 23:05:15 runner.go:375: runEvents-10 finished job 1540508562|cedd59e6b30268afe084b397b17c583c|b3b76fe4079e2255db145895c747b6c5 for http://domain.test/fo/
DEBUG: 2018/10/25 23:05:15 runner.go:375: runEvents-3 finished job 1540508562|cedd59e6b30268afe084b397b17c583c|02f5c0600626f409e7e3b2726936b64b for http://domain.test/fo/
DEBUG: 2018/10/25 23:05:15 runner.go:375: runEvents-2 finished job 1540508562|cedd59e6b30268afe084b397b17c583c|44d5c3c6d445622b00b27ade79cfbd54 for http://domain.test/fo/
DEBUG: 2018/10/25 23:05:15 runner.go:375: runEvents-9 finished job 1540508562|cedd59e6b30268afe084b397b17c583c|4205e26bedbf4cd5a2b23570e357f3f1 for http://domain.test/fo/
DEBUG: 2018/10/25 23:05:15 runner.go:375: runEvents-4 finished job 1540508562|cedd59e6b30268afe084b397b17c583c|038737393b45664206a9f67e4d95d2fb for http://domain.test/fo/
DEBUG: 2018/10/25 23:05:32 runner.go:168: <heartbeat eventsSucceededSinceLast=27 eventsErroredSinceLast=0>
DEBUG: 2018/10/25 23:05:34 runner.go:375: runEvents-6 finished job 1540508563|cedd59e6b30268afe084b397b17c583c|3df9c58a35264f1f6c2bcb50f6981c55 for http://domain.test/fo/
DEBUG: 2018/10/25 23:05:34 runner.go:375: runEvents-10 finished job 1540508562|cedd59e6b30268afe084b397b17c583c|dd9255b48124aefb52db1effd32c5e4c for http://domain.test/fo/
DEBUG: 2018/10/25 23:05:35 runner.go:375: runEvents-3 finished job 1540508562|cedd59e6b30268afe084b397b17c583c|9cef01964fc0732afaeb0f7eadf72b4f for http://domain.test/fo/
DEBUG: 2018/10/25 23:05:35 runner.go:375: runEvents-9 finished job 1540508562|cedd59e6b30268afe084b397b17c583c|e7b82e4760de09b0f890300da277838f for http://domain.test/fo/
DEBUG: 2018/10/25 23:05:35 runner.go:375: runEvents-8 finished job 1540508563|cedd59e6b30268afe084b397b17c583c|dd608cf1e25e46658f2d35d5562296b4 for http://domain.test/fo/
DEBUG: 2018/10/25 23:05:35 runner.go:375: runEvents-2 finished job 1540508563|cedd59e6b30268afe084b397b17c583c|51a669691fee0c1765a641e8e9595b51 for http://domain.test/fo/
DEBUG: 2018/10/25 23:05:36 runner.go:375: runEvents-7 finished job 1540508563|cedd59e6b30268afe084b397b17c583c|7cff37f1b3c4ae79de9abc187654992e for http://domain.test/fo/
DEBUG: 2018/10/25 23:05:36 runner.go:375: runEvents-1 finished job 1540508563|cedd59e6b30268afe084b397b17c583c|5d96369c053d3eb243594e6c74c15bff for http://domain.test/fo/
DEBUG: 2018/10/25 23:05:36 runner.go:375: runEvents-4 finished job 1540508562|cedd59e6b30268afe084b397b17c583c|4bd657ad2670fce64207ea51634e7c05 for http://domain.test/fo/

(As you can see, there is a gap of ~15 seconds when the runner simply is not doing anything.)

Is that intentional behaviour? I expected it to behave more like a parallel queue, where tasks are executed as soon as a free worker is available.

Quick note: this problem arises even with 100-200 products. The tasks are set up this way:

<parent_update_tasks> runs as a periodic task that pulls specific IDs from the database and schedules N events (let's say 100) with wp_schedule_single_event(time(), ...., product_id), and then waits with while(true) { check queue; } until all items are processed. So this one is the long-running task. However, it makes no difference if I skip the while(true) and simply exit once all subtasks are scheduled.
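
In sketch form, the fan-out is roughly this (get_pending_product_ids() and the update_single_product hook are simplified placeholder names):

// Periodic parent task: schedule one single event per product ID.
add_action( 'parent_update_tasks', function () {
    foreach ( get_pending_product_ids() as $product_id ) {
        // Distinct args make each event unique, so every product gets its own job.
        wp_schedule_single_event( time(), 'update_single_product', array( $product_id ) );
    }
} );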

What am I doing wrong?

P.S. I'm also having trouble exiting the Go runner. Most of the time it hangs like this, and the only thing that helps is kill -9:

DEBUG: 2018/10/25 22:18:42 runner.go:383: exiting event worker ID 10
DEBUG: 2018/10/25 22:18:43 runner.go:383: exiting event worker ID 8
DEBUG: 2018/10/25 22:18:43 runner.go:383: exiting event worker ID 7
DEBUG: 2018/10/25 22:18:43 runner.go:383: exiting event worker ID 6
DEBUG: 2018/10/25 22:18:43 runner.go:383: exiting event worker ID 9
DEBUG: 2018/10/25 22:18:43 runner.go:383: exiting event worker ID 4
DEBUG: 2018/10/25 22:18:45 runner.go:162: exiting heartbeat routine
DEBUG: 2018/10/25 22:18:45 runner.go:176: event retriever ID 1 still running
DEBUG: 2018/10/25 22:18:45 runner.go:177: sending empty site object for worker 1
^CDEBUG: 2018/10/25 22:18:51 runner.go:505: caught termination signal interrupt, scheduling shutdown
^CDEBUG: 2018/10/25 22:18:51 runner.go:505: caught termination signal interrupt, scheduling shutdown
^CDEBUG: 2018/10/25 22:18:52 runner.go:505: caught termination signal interrupt, scheduling shutdown
DEBUG: 2018/10/25 22:18:58 runner.go:505: caught termination signal terminated, scheduling shutdown

@pySilver Have you adjusted the get-events-interval setting for the runner? How about the CRON_CONTROL_JOB_QUEUE_SIZE constant?

I suspect that what you're seeing is a combination of the runner's default retrieval interval of 60 seconds and the queue curation that Cron Control provides; in other words, retrieval and execution are occurring as designed. Rather than loading all pending events into the runner queue, a selection of those events is queued per retrieval interval, allowing things like scheduled posts to occur on time even with a large queue of other jobs.

To start, I'd suggest increasing the queue size. If that doesn't help, decreasing the retrieval interval may help, but there's an artificial floor to that value; if the runners are busy, the retrieval process will keep returning the same set of pending events already awaiting runner availability.
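
For example (the values here are illustrative, not recommendations), in wp-config.php:

// Hand the runner a larger batch of due events per retrieval.
define( 'CRON_CONTROL_JOB_QUEUE_SIZE', 100 );

and, on the runner side, pass a shorter retrieval interval such as -get-events-interval 10.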

P.S. I'm also having trouble exiting the Go runner

This is a known issue (it arose from #154 but was difficult to reproduce in production at that time). kill -9 is the only remedy at this point.

Yes, I've tried almost all the options: CRON_CONTROL_JOB_QUEUE_SIZE = 250 (or greater), get-events-interval = 1, 3, or 5 (small values).

It looks like the queue only gets fed once the previous batch is fully processed (at least that's how it appears when watching the processes).

At the moment I have almost rewritten the Go runner as a fully compatible Python version based on asyncio and process pools; it spawns a new job as soon as the previous one is done, and it's free of the quit problem. If anyone is interested, I can share the code here (or as a PR).

Python version of the runner; no need to compile, and no dependencies other than Python 3:

# -*- coding: utf-8 -*-
import argparse
import json
import os
import logging
import time
import signal
import asyncio
import functools
from random import shuffle
from subprocess import Popen, PIPE
from collections import OrderedDict
from concurrent.futures import ProcessPoolExecutor


class StatefulUniqueQueue(asyncio.Queue):
    """
    Queue that contains unique task signatures and supports simple
    task execution state: a task id that is currently running cannot
    be re-queued until task_done() releases it.
    """

    def _init(self, maxsize):
        # Pending unique task ids, kept in insertion order.
        self._queue = OrderedDict()
        # Ids currently executing; _put() refuses to re-queue these.
        self._running_tasks = set()

    def _put(self, item):
        # Drop duplicates of tasks that are already executing.
        if item not in self._running_tasks:
            self._queue[item] = None

    def _get(self):
        # Pop the oldest pending task id and mark it as running.
        task_id, _ = self._queue.popitem(last=False)
        self._running_tasks.add(task_id)
        return task_id

    def task_done(self, task_id=None):
        try:
            self._running_tasks.remove(task_id)
        except KeyError:
            pass

        if self._unfinished_tasks <= 0:
            raise ValueError('task_done() called too many times')
        self._unfinished_tasks -= 1
        if self._unfinished_tasks == 0:
            self._finished.set()


def human_readable_timedelta(sec_elapsed):
    h = int(sec_elapsed / (60 * 60))
    m = int((sec_elapsed % (60 * 60)) / 60)
    s = sec_elapsed % 60.
    return "{}:{:>02}:{:>05.2f}".format(h, m, s)


async def site_retriever(executor, loop, get_events_interval=60):
    """Retrieves list of known sites"""
    last_called_at = 0
    while True:

        if last_called_at and time.time() - last_called_at <= get_events_interval:
            # Not due yet: yield to the event loop and re-check.
            await asyncio.sleep(0, loop=loop)
            continue

        task = loop.run_in_executor(executor, wp_get_sites)

        last_called_at = time.time()
        for f in asyncio.as_completed([task], loop=loop):
            try:
                results = await f
                logger.info("Completed periodic sites list retrieval. {:d} sites found.".format(len(results)))
                sites.clear()
                if results:
                    sites.update(results)

            except BaseException as exc:
                logger.error('[PERIODIC SITES RETRIEVAL] {}'.format(exc))


async def event_producer(queue, executor, loop, get_events_interval=60):
    """Retrieves events for each known site"""
    last_called_at = 0
    while True:

        if not sites or (last_called_at and time.time() - last_called_at <= get_events_interval):
            await asyncio.sleep(0, loop=loop)
            continue

        # Shuffle site order so that none are favored
        _sites = list(sites)
        shuffle(_sites)

        tasks = []
        for site in _sites:
            tasks.append(loop.run_in_executor(executor, wp_get_site_events, site))

        last_called_at = time.time()
        for f in asyncio.as_completed(tasks, loop=loop):
            try:
                _site, events = await f
                logger.info("{:d} events retrieved for {}".format(len(events), _site))
                for event in events:
                    event['site'] = _site
                    await queue.put('{timestamp}|{action}|{instance}|{site}'.format(**event))
            except BaseException as exc:
                logger.error('[PERIODIC EVENTS RETRIEVAL] {}'.format(exc))


async def event_consumer(queue, executor, loop, heartbeat_interval=60):
    """Process events for known sites"""
    futures = dict()
    events_succeed = 0
    events_failed = 0
    last_heartbeat_at = 0
    while True:
        if heartbeat_interval and time.time() - last_heartbeat_at >= heartbeat_interval:
            if not last_heartbeat_at:
                last_heartbeat_at = time.time()
            else:
                logger.info('<heartbeat eventsSucceededSinceLast={:d} eventsErroredSinceLast={:d}>'.format(
                    events_succeed,
                    events_failed
                ))
                last_heartbeat_at = time.time()
                events_succeed = 0
                events_failed = 0

        # Drain everything currently queued, dispatching each due event to the process pool.
        while not queue.empty():
            event_id = await queue.get()
            event = event_id.split('|')

            if int(event[0]) > time.time():
                logger.debug('Skipping premature event {}|{}|{} for {}.'.format(*event))
                # Release the id so the event can be re-queued by a later retrieval.
                queue.task_done(event_id)
                continue

            futures[loop.run_in_executor(executor, wp_run_event, *event)] = event_id

        if not futures:
            await asyncio.sleep(0, loop=loop)
            continue

        # Wake when the first future completes, or after 100 ms, so newly queued events keep flowing.
        done, not_done = await asyncio.wait(futures, loop=loop, timeout=.1, return_when=asyncio.FIRST_COMPLETED)
        for future in done:
            _event_id = futures[future]
            event = _event_id.split('|')
            try:
                event.append(future.result())
                logger.info('Completed event {}|{}|{} for {} in {}.'.format(*event))
                events_succeed += 1
            except BaseException as exc:
                logger.error('[EVENT CONSUMER] {}'.format(exc))
                events_failed += 1
            finally:
                queue.task_done(_event_id)

            # remove the now completed future
            del futures[future]


def shutdown(sig, loop, executors):
    """Graceful shutdown handler"""
    logging.warning('Caught termination signal {}. Shutting down...'.format(sig.name))

    tasks = asyncio.Task.all_tasks()
    for t in [t for t in tasks if not (t.done() or t.cancelled() or t is asyncio.tasks.Task.current_task())]:
        t.cancel()

    loop.stop()

    for ex in executors:
        ex.shutdown(wait=True)


def get_logger(file_path, verbosity):
    """Log events to file or stdout"""
    log_formatter = logging.Formatter("%(asctime)s [%(process)d] [%(levelname)s]  %(message)s")
    root_logger = logging.getLogger()

    if file_path is not None:
        file_handler = logging.FileHandler(file_path)
        file_handler.setFormatter(log_formatter)
        root_logger.addHandler(file_handler)
    else:
        console_handler = logging.StreamHandler()
        console_handler.setFormatter(log_formatter)
        root_logger.addHandler(console_handler)

    if verbosity is not None and verbosity >= 1:
        root_logger.setLevel(logging.DEBUG)
    else:
        root_logger.setLevel(logging.INFO)

    return root_logger


def validate_path(value):
    """Validates path existence"""
    if not value:
        raise argparse.ArgumentTypeError("Invalid empty path supplied")
    if not os.path.exists(value):
        raise argparse.ArgumentTypeError("{} does not exist".format(value))
    return value


def validate_wp_cli(value):
    """Validates path to WP-CLI installation"""
    return validate_path(value)


def validate_wp_path(value):
    """Validates path to WP installation"""
    return validate_path(value)


def wp_run_event(timestamp, action, instance, site_url):
    """Execute WP Event"""
    started = time.time()
    ret_code, output, err = wp_cli([
        "cron-control",
        "orchestrate",
        "runner-only",
        "run",
        "--timestamp={}".format(timestamp),
        "--action={}".format(action),
        "--instance={}".format(instance),
        "--url={}".format(site_url)
    ])

    if ret_code:
        raise RuntimeError("Failed to execute event: {}".format(err.decode('utf-8')))

    return human_readable_timedelta(time.time() - started)


def wp_get_sites():
    """Retrieves list of sites to ask for events"""
    site_info = wp_get_instance_info()[0]

    if site_info['disabled']:
        return []

    if site_info['multisite']:
        return [s['url'] for s in wp_get_multisite_sites()]

    return [site_info['siteurl']]


def wp_get_instance_info():
    """Retrieve WP instance"""
    ret_code, output, err = wp_cli(["cron-control", "orchestrate", "runner-only", "get-info", "--format=json"])
    if ret_code:
        raise RuntimeError("Failed to retrieve instance info: {}".format(err.decode('utf-8')))

    return json.loads(output.decode('utf8'))


def wp_get_multisite_sites():
    """Retrieve WP sites in multisite environment"""
    ret_code, output, err = wp_cli([
        "site", "list", "--fields=url", "--archived=false", "--deleted=false", "--spam=false", "--format=json"
    ])
    if ret_code:
        raise RuntimeError("Failed to retrieve multi sites: {}".format(err.decode('utf-8')))

    return json.loads(output.decode('utf8'))


def wp_get_site_events(site_url):
    """Retrieve WP events for particular site"""
    ret_code, output, err = wp_cli([
        "cron-control", "orchestrate", "runner-only", "list-due-batch", "--url={}".format(site_url), "--format=json"
    ])
    if ret_code:
        raise RuntimeError("Failed to retrieve sites events for {}: {}".format(site_url, err.decode('utf-8')))

    return site_url, json.loads(output.decode('utf8'))


def wp_cli(command):
    """Call WP-CLI"""
    command.extend(["--allow-root", "--quiet", "--path={}".format(args.wp)])
    if args.network > 0:
        command.append("--network={}".format(args.network))

    command.insert(0, args.cli)

    p = Popen(command, stdin=PIPE, stdout=PIPE, stderr=PIPE)
    output, err = p.communicate()

    return p.returncode, output, err


if __name__ == "__main__":
    parser = argparse.ArgumentParser(
        prog='cron-control-runner.py',
        description='Execute WP cron events in parallel using process pools.',
        formatter_class=argparse.ArgumentDefaultsHelpFormatter
    )

    parser.add_argument('--cli', help='Path to WP-CLI binary', default='/usr/local/bin/wp', type=validate_wp_cli)
    parser.add_argument('--log', help='Log path, omit to log to stdout')
    parser.add_argument('--network', help='WordPress network ID, 0 to disable', default=0, type=int)
    parser.add_argument('--workers-get',
                        help='Number of workers to retrieve events. '
                             'Increase for multisite instances so that sites are retrieved in a timely manner',
                        default=1, type=int)
    parser.add_argument('--get-events-interval', help='Seconds between event retrieval', default=60, type=int)
    parser.add_argument('--workers-run',
                        help='Number of workers to run events. '
                             'Increase for cron-heavy sites and multisite instances so '
                             'that events are run in a timely manner',
                        default=5, type=int)
    parser.add_argument('--wp', help='Path to WordPress installation', default='/var/www/html', type=validate_wp_path)
    parser.add_argument('--heartbeat', help='Heartbeat interval in seconds', default=60, type=int)
    parser.add_argument('-v', '--verbosity', help='Increase verbosity', action="count")
    args = parser.parse_args()

    # Setup logging
    logger = get_logger(args.log, args.verbosity)
    logger.info('Starting with {} event-retrieval worker(s) and {} event worker(s)'.format(
        args.workers_get, args.workers_run)
    )
    logger.info('Retrieving events every {} seconds'.format(args.get_events_interval))

    # Setup queue
    sites = set()
    event_queue = StatefulUniqueQueue()

    # Setup event loop & executors
    event_loop = asyncio.get_event_loop()
    producer_executor = ProcessPoolExecutor(args.workers_get)
    consumer_executor = ProcessPoolExecutor(args.workers_run)

    asyncio.ensure_future(site_retriever(producer_executor, event_loop, args.get_events_interval), loop=event_loop)
    asyncio.ensure_future(event_producer(event_queue, producer_executor, event_loop, args.get_events_interval),
                          loop=event_loop)
    asyncio.ensure_future(event_consumer(event_queue, consumer_executor, event_loop, args.heartbeat), loop=event_loop)

    # Shutdown handlers
    for code in [signal.SIGTERM, signal.SIGINT]:
        event_loop.add_signal_handler(
            code,
            functools.partial(shutdown, code, event_loop, [producer_executor, consumer_executor])
        )

    try:
        event_loop.run_forever()

        # Let's also finish all running tasks:
        pending = asyncio.Task.all_tasks()
        event_loop.run_until_complete(asyncio.gather(*pending))
    except BaseException:
        pass
    finally:
        event_loop.close()
        logger.info(".:sayonara:.")
        exit(0)
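
To mirror the Go invocation above, the script can be started with, for example, $ python3 cron-control-runner.py --wp /srv/www/demo.test/current/web/wp --cli /usr/bin/wp --workers-run 10 -v (the flags are those defined in the argparse configuration above).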

BTW, I came across this comment:

Busy sites may have several Cron runners in separate containers all processing the queue simultaneously. Our VIP Cron infrastructure takes particular care to orchestrate the activity of the event workers in the different containers, to avoid clashes with two workers processing the same event.
https://vip.wordpress.com/2017/11/15/a-vip-cron/

Is that something supported by the plugin or not? If not, how do I accomplish that? We need to process a long queue on a farm of servers.

Is that something supported by the plugin or not?

This is supported by the plugin here.

Thanks for the Python runner!

Closing this issue.