tk0miya / testing.postgresql

Cache result of initdb

rutsky opened this issue · comments

On Ubuntu 14.04 with PostgreSQL 9.3, database initialization with initdb is the slowest part of testing.postgresql: it takes around 2.5 seconds to create a data directory with the default contents.

While it's possible to create a PG data directory outside of testing.postgresql, cache it, copy it for each test, and pass it to testing.postgresql.Postgresql via the copy_data_from argument, that is a lot of work and requires reimplementing some of the testing.postgresql functionality (e.g. locating the initdb utility).
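
For reference, the existing workaround looks roughly like the sketch below; the cache path and the initdb flags are only illustrative (the flags must match whatever testing.postgresql itself passes to initdb):

# Rough sketch of the manual workaround (hypothetical path and flags):
# 1. initialize a data directory once, outside of testing.postgresql, e.g.
#      initdb -D /tmp/pg-initdb-cache -U postgres -A trust
# 2. reuse it for every test via copy_data_from:
import testing.postgresql

postgresql = testing.postgresql.Postgresql(copy_data_from='/tmp/pg-initdb-cache')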

I propose implementing caching of the initdb result inside testing.postgresql and enabling it by default.

This can be done pretty straightforwardly, and I can prepare a PR for this issue if you think my approach is satisfactory:

  1. Add a dependency on a library that provides an API for the user cache directory (e.g. appdirs). The cache directory will be used for caching the initdb result. The cache should be versioned by the initdb version (initdb -V) and by the testing.postgresql version.
  2. Add an option to testing.postgresql.Postgresql to disable the cache, e.g. cache=False.
    In testing.postgresql.Postgresql, if the cache is enabled: check whether a cache for the current initdb + testing.postgresql version exists; if not, create it and fill it with initdb; then copy the cached directory contents to a temporary PG data directory (a rough sketch of this flow follows the list).
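
A rough sketch of how that flow could look is below. The function name cached_initdb, the version placeholder, and the parameter layout are hypothetical, not part of the current testing.postgresql API; only appdirs.user_cache_dir comes from the proposal above.

import os
import shutil
import subprocess

import appdirs  # proposed dependency for locating the per-user cache directory

TESTING_POSTGRESQL_VERSION = '1.0'  # placeholder for the library version


def cached_initdb(initdb, data_dir, initdb_args):
    """Run initdb once per (initdb version, testing.postgresql version) and reuse the result."""
    # version the cache by `initdb -V` output and by the testing.postgresql version
    initdb_version = subprocess.check_output([initdb, '-V']).decode().strip().replace(' ', '_')
    cache_root = appdirs.user_cache_dir('testing.postgresql', 'Takeshi KOMIYA')
    cache_dir = os.path.join(cache_root, '%s_%s' % (initdb_version, TESTING_POSTGRESQL_VERSION))

    if not os.path.exists(cache_dir):
        # cache miss: run initdb once into the cache directory
        if not os.path.exists(cache_root):
            os.makedirs(cache_root)
        subprocess.check_call([initdb, '-D', cache_dir] + list(initdb_args))

    # copy the cached contents into the temporary PG data directory
    shutil.copytree(cache_dir, data_dir)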

Thank you for the suggestion. I'm very interested in your idea.
Indeed, running initdb for every test case is too wasteful.
It is polite but unwelcome behavior.

My short investigation: the following code speeds up the test cases by about 2.3x.

import testing.postgresql

pgsql = None


def setUpModule():
    global pgsql
    # initialize a database once (without starting a server) and reuse its data directory
    pgsql = testing.postgresql.Postgresql(auto_start=0)
    pgsql.setup()
    testing.postgresql.DEFAULT_SETTINGS['copy_data_from'] = pgsql.base_dir + '/data'  # cache empty database


def tearDownModule():
    testing.postgresql.DEFAULT_SETTINGS['copy_data_from'] = None  # reset
    pgsql.stop()

Result:

(before)$ nosetests
....................
----------------------------------------------------------------------
Ran 20 tests in 59.478s

OK

(after)$ nosetests
....................
----------------------------------------------------------------------
Ran 20 tests in 26.017s

OK

I will add this feature in the next version.
Let me think about the API to do that.

If you have any ideas about the API, please let me know.

About the API: I suggest adding the following keyword arguments to testing.postgresql.Postgresql:

  1. use_initdb_cache=None.
  2. initdb_cache_dir=None.

If use_initdb_cache is specified and copy_data_from is not, caching of the initdb result is enabled.

If initdb_cache_dir is specified, use it as the cache location. Otherwise, use something like the result of appdirs.user_cache_dir('testing.postgresql', 'Takeshi KOMIYA') as the cache directory.
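
For illustration, usage of the proposed keyword arguments could look like this (neither argument exists in the current API, and the cache path is only an example):

import testing.postgresql

# cache the initdb result in the default per-user cache directory
postgresql = testing.postgresql.Postgresql(use_initdb_cache=True)

# ...or keep the cache in an explicitly chosen directory
postgresql = testing.postgresql.Postgresql(use_initdb_cache=True,
                                           initdb_cache_dir='/path/to/cache')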

In response to your proposal, I'll add a factory class named testing.postgresql.PostgresqlFactory.
For example:

import unittest
import testing.postgresql

# Generate Postgresql class which caches the generated database
Postgresql = testing.postgresql.PostgresqlFactory(use_initdb_cache=True)


def tearDownModule():
    # clear cached database at end of tests
    Postgresql.clear_cache()


class MyTestCase(unittest.TestCase):
    def setUp(self):
        # Use the generated Postgresql class instead of testing.postgresql.Postgresql
        self.postgresql = Postgresql()

    def tearDown(self):
        self.postgresql.stop()

It makes test cases more efficient.

The factory class supports all options of testing.postgresql.Postgresql, so you can use the copy_data_from option:

Postgresql = testing.postgresql.PostgresqlFactory(copy_data_from='/path/to/your/appdir')
with Postgresql() as pgsql:
    # ...

Probably this is not a substitute for the initdb_cache_dir you asked for,
but I do not want to add any extra dependencies or rules.

Does this help you?

Yes, your solution will be useful, but can it be used without Postgresql.clear_cache()?

If the cache is cleared on each test suite run, the tests will still run a few seconds slower than they could if the initdb result were cached "permanently" for a specific version of testing.postgresql/PostgreSQL.

Yes, your solution will be useful, but can it be used without Postgresql.clear_cache()?

The answer is yes and no.
If you use the use_initdb_cache=True option, clear_cache() is required (or the cache is cleared automatically by the GC).
On the other hand, you do not have to call it if you use other options, because no cache is generated in that case.

I think a cache generated by a library should be removed by that library; it's a basic principle.
So I do not want to keep the initdb cache around after the script has ended.

If you want to keep the cache beyond a single script run, you should generate the cached database manually and then use the copy_data_from option in your test script.
Fortunately, PostgresqlFactory takes the copy_data_from option and passes it through to each testing.postgresql.Postgresql object.
It might help you when refactoring test cases.
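
A minimal sketch of that manual approach, assuming a cache directory of your own choosing; it reuses the auto_start=0 / setup() trick shown earlier in this thread:

import shutil
import testing.postgresql

CACHE_DIR = '/tmp/testing.postgresql.cache'  # hypothetical location you manage yourself

# one-off script: run initdb (without starting a server) and keep the result
pgsql = testing.postgresql.Postgresql(auto_start=0)
pgsql.setup()
shutil.copytree(pgsql.base_dir + '/data', CACHE_DIR)
pgsql.stop()

# test script: every instance copies the cached directory instead of running initdb
Postgresql = testing.postgresql.PostgresqlFactory(copy_data_from=CACHE_DIR)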

Thanks,

Finally, I renamed the options to cache_initialized_db and on_initialized,
because the parameter names should also be usable by the other testing.* packages to keep compatibility, and initdb is specific to PostgreSQL.
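
With the renamed options, the factory from the earlier example would be created roughly as below; handler is a hypothetical callback used to prepare the database that gets cached:

import testing.postgresql


def handler(postgresql):
    # e.g. create tables or load fixtures into the database before it is cached
    pass


# cache the initialized database and run handler() when it is first created
Postgresql = testing.postgresql.PostgresqlFactory(cache_initialized_db=True,
                                                  on_initialized=handler)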

The feature will be released soon.
Thank you for the great suggestion. :-)