django-nonrel / djangotoolbox

Django tools for building nonrel backends

Geek Repo:Geek Repo

Github PK Tool:Github PK Tool

Type conversions refactor bis

wrwrwr opened this issue · comments

Type conversions refactor split into more logical parts, plus a complete PEPification and style normalization and some additional cleanup. The actual changes are described in the original issue -- all is still relevant including the warning at the end. The new branches are here:

https://github.com/django-nonrel/djangotoolbox/compare/feature/type-conversions-refactor-2
https://github.com/django-nonrel/djangoappengine/compare/feature/type-conversions-refactor-2
django-nonrel/mongodb-engine@develop...feature/type-conversions-refactor-2

Hopefully you'll find reviewing them commit-wise doable, unlike before. The additional nonrel branch is optional, it depends on the refactor, but not vice versa [update: not any longer, after renaming nonrel_db_type to db_type it is necessary].

First commit on each branch contains all the pedantic stuff; there are no actual changes in these commits -- other than reflowing, capitalizing & dotting, blank adding there is some method reordering, variable renaming, additional comments, one unused internal method dropped and some small blocks of code rewritten for better readability. If anything in these first commits raises any doubts I'm going to separate it.

The code style chosen is not really any better than what it was, but it is consistent across toolbox and both engines (and dbindexer is to follow). I've based it on PEP8, PEP257, Django & Google code styles (generally trying to stick with the first one unless there was some good reason not to). There are some some pros of cleaning up and normalizing things this way:

  • following commits should be more readable, as the temptation to include stylistic changes should be smaller;
  • you may run the pep8 verifier on . to check that your code is fine and dandy (use --exclude="docs", and --exclude="multiple_database" for Mongo; ignore W602);
  • and also you may comfortably place a traceback and an editor side-by-side, even on a small monitor, and maybe you'll find yourself scrolling a bit too much and reading another class than you though you were a bit less often ;-)

The additional cleanup includes:

  • separating parts that are not relevant to the issue (I've committed the set serialization test and EmbeddedModelField.update fixes to the develop branch, tests for the natural usage of ListField with ForeignKeys are on a separate branch);
  • some further simplifying of nonrel fields code, namely this commit: 5fce56f;
  • and of GAE code: django-nonrel/djangoappengine@8f81ef7;
  • I've renamed DatabaseCreation.nonrel_db_type to just db_type after realizing that the parent method has been added by nonrel, moreover back-ends can now actually work without the related_db_type changes. Probably I'll propose letting back-ends override Field.db_type in this way to Django (replacing the #17336 ticket).
  • Removing or explaining some TODOs, rewritting a few comments.

tests for the natural usage of ListField with ForeignKeys are on a separate branch);

Interested in this. Does it accomplish the same thing I tried with this pull request django-nonrel/mongodb-engine#83 and my custom ForeignKey field https://github.com/emperorcezar/Django-utils/blob/master/fields.py ?

You'll still have to use model_instance.pk rather than simply model_instance (but other than that it should now work with Mongo, not just GAE). I did some research on the issue, but so far this only ended with some tests and a broken, one-sided solution:
https://github.com/django-nonrel/djangotoolbox/compare/feature/collections-with-foreignkeys-tests

The hard part seems to be making ForeignKeys in a ListField fetched from the database appear like model instances -- Django uses descriptors / managers for that, so it would be nice to do something along this line for collection fields.

In your tests you seem to be pushing, querying, and returning model instances. In what circumstances do you need to use model_instance.pk?

Thanks so much for the amazing work!

In my opinion, however, a big part of the style changes are just wrong.

I really really appreciate your hard work but you need to stop thinking about PEP8 (and friends) as hard rules. They're just guidelines and it's okay to think on your own sometimes.

Returning model instances is not solved -- the test will just fail there; the solution on the foreignkey branch is just a (bad) idea. Take
a look at this test: https://github.com/django-nonrel/djangotoolbox/blob/feature/type-conversions-refactor-2/djangotoolbox/tests.py#L230, this is what works after refactor.

Update: due to the renaming of DatabaseCreation.nonrel_db_type the nonrel branch:
https://github.com/django-nonrel/django-nonrel/compare/feature/drop-related-db-type
is required for the -2 branches to work (not having it results in an infinite recursion).

Woa GitHub's formatting of email replies is really horrible. Can you just re-post the whole thing using the webinterface?

Returning model instances is not solved -- the test will just fail there; the solution on the foreignkey branch is just a (bad) idea. Take a look at this test: https://github.com/django-nonrel/djangotoolbox/blob/feature/type-conversions-refactor-2/djangotoolbox/tests.py#L230, this is what works after refactor.

The only way I found to solve that was overriding the to_python on the ForeignKey field and calling rel.to.objects.get which worked well in my case, but my bet is you're not overriding the ForeignKey field.

https://github.com/emperorcezar/Django-utils/blob/master/fields.py#L18

In my opinion, however, a big part of the style changes are just wrong.

Probably you're simply right, nevertheless I'll try to explain myself.

I'd say 120 is always OK. There may be a few that did decrease readability, but in most cases there's no harm done. And the time spent thinking if 80 / 90 / 100 is fine here or there can be better used.

OK, but lets use the same thing in toolbox and GAE too. Fixed.

  • Not all comments need to end in a dot, nor do exception messages need to end in a dot.

Yup, this is my personal bias. Can't be helped, sorry :-)

  • XXX is not the same thing as TODO

Undone. Maybe lets also explain XXX in docs somewhere -- I can think of several different meanings of it (like FIXME, like HACK, "had no time to properly check", "a critical problem" or any of these); some more TODOs should probably be turned to XXXs then.

Quite right, some mindless automation must have been in work here. Fixed.

It (sometimes) does help maintainability though (you don't have to reindent after adding another case) and (often) does not harm readability significantly. Some guides recommend this explicitly -- the Google one above, but also some for different languages (surely you'll find some that say differently); I'd say the recommendation is sound. Of course, there are exceptions, but I don't think this specific case deserves to be one.

I really really appreciate your hard work but you need to stop thinking about PEP8 (and friends) as hard rules. They're just guidelines and it's okay to think on your own sometimes.

Maybe you'd be surprised to see the styling I tend to use in private (with a lot of camelCase, whole paragraphs on a single line, and even most inline comments lowercase ;-) However, I believe it's good to stick to something for a project that many are working on -- this doesn't have to be PEP or anything other, just something.

I wouldn't like to push these style changes against anyone, so if you'd rather have just the refactor that's OK (some renaming and merging will have to be done, but otherwise it shouldn't be very hard to just drop these first commits).

@emperorcezar Yes, I had the ambition of making it work with the standard ForeignKey :-) I've also given a short attempt at the descriptor approach (something like defining __get__ and __set__ for AbstractIterableField) -- didn't work, but should be possible.

You should be able to drop the db_type method in your class and do just:

def get_db_prep_value(self, value, connection, prepared=False):
    if isinstance(type(value), type(self.rel.to)):
        return value.pk
    else:
        return value

instead of both get_db_prep_save and get_db_prep_lookup with the refactor.

Okay just did the review. This looks really really great. I added a few comments here and there. I didn't review the App Engine part because I'm not familiar with (nor interested in) it.

Thanks thanks thanks!

ah btw I did some MongoDB Engine compiler refactoring a while ago, if you like you can incorporate what's left after your refactoring. https://github.com/django-nonrel/mongodb-engine/tree/feature/compiler-cleanup

looks really great now. RFC from my point of view for djangotoolbox and MongoDB Engine.

@jonashaag Thanks for the review!

Second update:

  • Fixed in-memory filtering (matches_filters only used by GAE). It was comparing non-primary key model values with database values, which doesn't work in a few cases.
  • Fixed decimal sorting for Mongo (by pulling GAE solution to toolbox).
  • Did some more compiler refactoring:
    • incorporating Jonas's branch;
    • making toolbox pass converted values to filter processing (thus mostly letting back-ends only worry about writing a proper value_for/from_db and otherwise not having to deal with conversions at all; toolbox now also knows not to convert isnull arguments and can handle lists of arguments for in, range or year lookups properly, so writing those conversion methods should be simple);
    • passing related model primary key field instead of an original related field to conversions (as generally you should convert a foreign key value in the same way as a "parent" primary key value), passing fields instead of columns to order_by and lookup name to value_for_db.
  • Coded an alternative solution for the AutoField "just ints" problem -- getting rid of the value_to_db_auto (we're blanking rest of the value_to_db_* methods as most conversions needed are not tied to a specific field kind) and also giving AutoField's validation its strength back. There are different solutions possible (global setting / descriptor instead of the metaclass / just some public methods...), please take a closer look at this one -- I've tried a couple of different things, but nothing seems perfect.
  • Retested for GAE 1.6.3 and MongoDB 2.0.3.

The Django branch now uses the same name as the rest:
https://github.com/django-nonrel/django-nonrel/compare/feature/type-conversions-refactor-2

Hey Wojtek, are you going to merge your branches soon?

Anytime we decide that's the time :-) I could find some more things to improve or fix, but it would be better done on some new branches, these are already big enough.

If you think they're ready for merging, go ahead.

@wrwrwr would you be kind enough to wait a couple of days before merging this? I'd like to take a look at the patch and issue a review. I agree that it might be hard if not impossible to stick with the django's value_to_db_* idiom but to me it's a big step backward to make DateField.get_db_prep_(value|save) return a value that's not ready for database insertion. I mean... all the other backend work this way, how is someone porting code from django to django-nonrel is supposed to know that? What happens with people writing new field subclasses, do they have to write their custom backend to wrap value_to_db to return the correct type for their field? How is someone calling SetField.value_to_string is supposed to know that it will return a list?

@charettes Certainly, but please compare to what nonrel currently has to offer.

DateField.get_db_prep_* -- you already have to use the nonrel conversions for many (all) fields (nonrel fields, AutoFields, any "long string" fields with GAE etc.), so I find it more consistent this way (toolbox blanks most of the vanilla database conversions; additonal conversions are currently defined on compilers; at least you'll learn quicker that you should also pass values through them ;-) Potentially you could call value_for_db from Field.get_db_prep_* -- see a consideration of two future options: b4315fc#L0R109.

Field subclasses -- they don't need to do anything differently than they'd do with vanilla, the framework is due to back-ends needing to do additional value preparation.

value_to_string -- it's the same on develop... Nonrel fields serialization is still broken anyway, I believe the method should really look more or less like this (for List/Set/DictField, EmbeddedModelField needs a bit more):

return smart_unicode(self._map(self.item_field.value_to_string, self._get_val_from_obj(obj)))

However, I remember someone else was working on the serialization, and this is not the place to fix it.

Third update:

  • Yet another AufoField type solution -- a bit nonstandard usage of Settings instead of the static variable, in case someone needs to know / set the type in a context in which fields can't be imported. Also turned the type mismatch exception into a warning -- there are some circumstances that back-ends with different AutoField types may still be used (e.g. without using AutoField at all).
  • Introduced a MONGO_INT_BASED_AUTOFIELDS setting (defaults to False) -- ObjectIds are the natural, scalable choice, but this seems the only way to run Django test suite (at the moment ~25% tests fail mostly due to ManyToManyFields / abstract models). The code is nearly a copy of the MongoDB docs one (an additional counters collection + findAndModify).
  • Moved remaining parts of the value conversion framework to DatabaseOperations, not only compilers may need to use them.
  • Did some small refactoring to _get_ordering that lets back-ends use query.standard_ordering directly when there is no field ordering defined.

I don't like the MONGO_INT_BASED_AUTOFIELDS thing. It doesn't help at all to make the Django test suite pass with integer IDs because all real-live use cases use ObjectIds.

I understand your desire to run the Django test suite against MongoDB Engine but please have the hacks required for that in a separate branch. (That of course may be on this GitHub repo!)

I have to disagree. If we want our patches accepted into Django core, then we're going to have to deal with the tests. While this may not be the way, the issue needs to be tackled.

The "cleaner" way is to open tickets against the Django tests and patch those.

I totally agree Adam! But I don't think the Django folks care about our backend being testing well, only the changes we make to Django.

Good point. I don't know how they test things. Do they have a test server that runs against each of the backends, or do they just run it against sqlite and call it good?

Alright, I'll move the MONGO_INT_BASED_AUTOFIELD to a separate branch. The Django suite is quite slow anyway, hard to use in practice, but even with the AutoField substitute it can detect a lot of other problems (with the additional inconvenience of having to filter out all those "This query is not supported(...)" / "Multi-table inheritance(...)" / "Cursors (...)" failures). I had a slight hope you won't consider this a hack, but an alternative solution :-)

https://code.djangoproject.com/changeset/17641
https://code.djangoproject.com/changeset/17698

probably fixes the same thing you did with explicit lazy object handling?

Almost looks so, thanks for the pointers, should I rather remove the checks already or mark them for consideration in 1.4?

Nevertheless, these changes make me wonder about proxies in updates / lookups (toolbox currently only does the lazy evaluation for lookups) and wrappers around objects that shouldn't be cast to a string... Take a look at this comment too: https://code.djangoproject.com/ticket/10498#comment:12.

I'd keep the test, skipping until 1.4, and remove the fixing code.

Hmmm, the test will likely fail with 1.4 too (it has cases for lookups with lazy translations and it doesn't seem Django will get any further refactoring for 1.4 -- unless I've missed some big ticket; currently it doesn't need to handle this case because lookup values are always cast to strings for SQL back-ends).

The issue is not significant enough to really care about it and I'm not sure what will actually happen with 1.4, but I guess I'll leave the code on another branch; we might want to bring it back (at least to some future GAE database-adapter or just for lookups).

@emperorcezar here's the CI they use to test against each backend. Looks like they've broken something recently... I should have time to review the pull request today, sorry for the delay.

Hmm. There's not oracle on there so they don't seem to run it against all official backends.

@emperorcezar It used to be there, they removed it because they were having trouble setting it up on the ci box.

@jonashaag I've checked that the Django changes don't fix lookups (nor some potential cases with using conversions directly, not through the sql.Query) and confirmed that's also a problem for MongoDB (fails with InvalidDocument: cannot convert value of type <class 'django.utils.functional.__proxy__'> to bson), so I'm only marking this for reconsideration for now (testing if it's a lookup or not just adds an unnecessary condition).

right

Merged into develop. Thanks to everyone involved!

There's a flaw in your AutoField handling implementation -- it doesn't not allow for using e.g. SQL and MongoDB at the same time (because the AutoField type is hardcoded).

The AutoField type should depend on the backend used for each model.

Yes, that's true, this was by design.

Models/fields are not tied to back-ends, they only get a database connection when preparing a value for the database, however some AutoField (type-dependent) value processing may happen without any database interaction (value validation, serialization / deserialization).

If we want to support multi-database settings with differing AutoField types there won't be effective field-level value validation, what can potentially result in problems in non-database related areas (which may expect AutoField values to be properly validated even though they have not passed through the database layer). It's more or less what we had previously, so maybe that's acceptable?

I've repushed the branches with two more solutions (two commits for the nonrel repo: moving to a wrong place or dropping the validation). I can also bring back the original check:

if not (value is None or isinstance(value, (basestring, int, long))):
    raise exceptions.ValidationError(self.error_messages['invalid'])

but that doesn't seem much better than doing no validation at all (and doesn't work for things that could be cast to unicode or int, but are not instances of their subclasses).

It's critical to support multi-database settings. That's been touted as a benefit of using non-rel for a very long time. To break it would break a lot of installations.

I guess I'll merge one of the "database setting" solutions tomorrow then. I'm a bit worried about the ineffective validation, but seems there is no way around it. A careful analysis on how AutoField values are used would be nice -- with nonrel changes (master or develop-to-be) something like this: mark_safe("The id is <em>%s</em>." % field.pk) may become a script-injection vector where it previously wasn't and things like log spoofing may be a bit easier to carry out (but the lack of the (field-layer) validation shouldn't make any vulnerability on its own).

Do you want the old check back (so I'm not the one to blame ;-) or should I rather drop the validation completely?

These questions can get pretty philosophic btw ;-)

I guess I prefer the solution that does validation based on the backend's AutoField type (i.e. PKs in forms for a MongoDB model must be ObjectIds). So the backend-setting approach probably makes most sense. However why don't we just use connection.ops.convert-autofield-value-to-what-the-backend-understands()? So that to-autopk-conversions are nicely abstracted away from the model code.

Back-end setting makes a lot of sense, but it's not possible to validate using it (field can validate a value that later may be used by various back-ends or no back-end; technically it's something separate from the database layer, you don't have the connection or anything like that available). I did an attempt of pushing the validation for AutoField to the database layer, but that turns rather ugly -- Django doesn't expect it there, and it's not effective until the field is stored in the database.

The database casts / conversions are another thing, you're right: a method rather than a type-setting is better here, and that's actually what we had before :-)

So the current situation is (on tcr-2 branches):

  • value_to_db_auto is back (with a minor cleanup I did some time in the past, but generally that's the original solution; the casts could be moved to value_for_db if it makes it into Django -- but that should probably happen together with a deeper abstraction refactor that's described on the SoC page);
  • AutoField's field-layer validation is dropped -- there is no way of doing it properly and I believe not keeping the instance check makes the situation clearer (you should expect AutoFields to hold values of various types, including arbitrary strings, and pass them through output filtering suitable for context).

I don't get why you think this isn't something that can be validated properly -- after all, this is code of the database abstraction layer, so we can safely assume we have a database connection (more specifically a DatabaseWrapper instance). Maybe you're confusing form validation and model validation here?

This is all about field/form validation (done in to_python where there's no connection and for this purpose it can't be easily made available), this is the one that have been changed/dropped -- vanilla Django assumes that AutoField values will be ints and enforces this "on input". There's no impact on the database layer (it does it's own "validation", quoting or casts; everything looks good on this level), but the int-type may have, potentially, been assumed elsewhere.

Agreed. The problem is that you're not thrown a validation error if you input invalid values into a FK form field but a DatabaseError which results in a HTTP 500.

But that's kind of a no-realworld-relevance usecase and the fix promises to be complex, so I'm fine with no validation at all.

Feel free to merge the branches if you think they're ready.

Well, seems like we've got this beast done :-) Thank you so much for the hard work!

Remerged, thanks again for all the reviews and discussions, wouldn't get anywhere without that.