Crell / Serde

Robust Serde (serialization/deserialization) library for PHP 8.

Geek Repo:Geek Repo

Github PK Tool:Github PK Tool

Serde should enforce SequenceField declaration

pqt-edannenberg opened this issue · comments

When marking a property as a SequenceField Serde currently just validates via array_is_list on deserialization, possibly throwing an exception if the validation fails. I would like to propose that Serde enforces a list via array_values if the validation fails instead.

Detailed description

When deserializing a Laravel request array I ran into a Serde\InvalidArrayKeyType exception, which proved fairly tricky to debug. In the end the problem turned out to be rather silly as Laravel's request validation may reorder request array elements, effectively turning the array into an associative one:

final class Bar
{
    public function __construct(
        public string $baz,
        #[SequenceField(arrayType: Foo::class)]
        public array $fooArray = [],
    ) {
        //
    }    
}

final class Foo
{
    public function __construct(
        public string $name,
    ) {
        //
    }        
}

$serde = new SerdeCommon();

$ok = ['baz' => 'bla', 'fooArray' => [0 => ['name' => 'meh'], 1 => ['name' => '123']]];

$error = ['baz' => 'bla', 'fooArray' => [1 => ['name' => '123'], 0 => ['name' => 'meh']]];

$serde->deserialize($ok, 'array', Bar::class);
$serde->deserialize($error, 'array', Bar::class);

Context

Enforcing a list when validation fails would save other users the pain of debugging this when major frameworks like Laravel might reorder elements unexpectedly. Also having to manually enforce lists via array_values is rather cumbersome for bigger DTOs.

Your environment

$ composer show -t crell/serde
crell/serde 0.6.0 A general purpose serialization and deserialization library
├──crell/attributeutils ~0.8.2
│  ├──crell/fp ~0.4.0
│  │  └──php ~8.1
│  └──php ~8.1
├──crell/fp >= 0.3.3
│  └──php ~8.1
└──php ~8.1

Hm. Well, here's the challenge: Since PHP arrays don't distinguish between a sequence and a dictionary, the best we can do is guess. array_is_list() checks if the array is sequentially keyed, which is a good indication that it is a Sequence. However, if it's keyed by integers not in order, how can we differentiate between "this is a sequence, PHP just screwed up the key orders" and "this is a map that happens to have integer keys that make sense in the domain"?

We could type check all the keys and ensure they're all integers, but that will be slower than array_is_list and more likely to have false positives (whereas the current code has a false negative, as you show. FFS, Laravel...)

Note that this would need to be addressed in 2 places: Once in SequenceField, and once in SequenceExporter, both of which check against array_is_list.

I'm not sure whether it's better to err on the false positive or false negative side. Thoughts?

However, if it's keyed by integers not in order, how can we differentiate between "this is a sequence, PHP just screwed up the key orders" and "this is a map that happens to have integer keys that make sense in the domain"?

I think we shouldn't, if a user explicitly marks a property as SequenceField it should be clear that array keys don't matter at all and it shouldn't be an issue if Serde enforces it via array_values() for convenience. If array keys do matter it should be a DictionaryField instead.

I may be biased though, as I'm sitting on the long end of the false negative stick. :)

We could type check all the keys and ensure they're all integers, but that will be slower than array_is_list and more likely to have false positives (whereas the current code has a false negative, as you show. FFS, Laravel...)

The primary reason we choose Serde was the performance edge it had over it's competitors and given my reasoning above I think checking each key is unnecessary. But of course I don't have the full picture.

Hm. So your argument is that instead of "SequenceField will verify that it's only used on the correct data type", "SequenceField means it will treat the value as a sequence, and if it's not and you lose key data when serializing/deserializing, that's your own damned fault?"

I can see the argument for that; I'll have to sleep on it, but I'm open to that argument.

The primary reason we choose Serde was the performance edge it had over it's competitors

Good to know. 😸 If you spot any obvious places for performance improvements let me know, as keeping the process fast is one of the main design goals. (Though feature-completeness has to take first priority.)

So, I dug into this a bit more. Removing the current validator is trivial: Just replace array_is_list with is_array, and remove two test cases that were checking that behavior.

However, there's no obvious place to put an array_values() call to guarantee what you get is a valid sequence. validate is called from this block in ObjectImporter:

$value = $dict[$propField->serializedName];
if ($value !== SerdeError::Missing && !$propField->validate($value)) {
    throw InvalidArrayKeyType::create($propField, 'invalid');
}

Which... is not actually the right error to use there because it's not just for arrays. But that aside, what we're looking for is some kind of type-specific normalization, not validation. validate returns a bool, not a modified value. So to make this work, we'd need to add another method, normalize() to TypeField that is used... pretty much only by SequenceField. That then begs the question, validate first then normalize, or normalize then validate? Are there other use cases for such inbound-value-normalization that would justify adding the method?

At the moment the only TypeFields are sequence and dictionary, but in concept there could be any number. I don't know if normalize makes any sense elsewhere.

Thoughts?

Hmm, I don't have the full picture but maybe it would be enough to simply make validate less restrictive for SequenceFields? Just by changing array_is_list to is_array the issue above was resolved and Serde deserialized the request as expected. Are there cases where an associative array marked as SequenceField would cause problems if it wasn't normalized?

What should happen in this case:

class Foo {
  #[SequenceField]
  public array $seq;
}
{
  "seq": {"a": "A", "b": "B"}
}

Without some validation on deserialization, deserializing that JSON to that class would result in $foo->seq being an associative array, not a sequential array. That would at least be unexpected, and may or may not lead to buggy behavior depending on the application.

Currently, I believe the validate() method is blocking that JSON from being deserialized, which I'd argue is good. (The error message is bad, but the feature is good.) However, it is also over-blocking; the proposal above is to change it from "reject" to "actively discard keys." Which is a reasonable softening (maybe based on the strict flag?), but it's not clear to me how to do that cleanly. That's the issue.

Thinking aloud... Maybe if TypeField was broken up into a base interface, a TypeValidator interface, and a TypeNormalizer interface? Then a TypeField could opt-in to validating its incoming data, and to normalizing it. That wouldn't be all that helpful for Sequence and Dictionary, but it would let Sequence use normalizing without imposing that requirement on all other TypeFields, and the no-op function call that would result. That... would also give more power to writing custom TypeFields, as you could include your own entirely custom validation or normalization for your specific field/object.

However, to really do that properly will likely require at least a small API change to TypeField, at least. We're still technically in 0.x land so that's fine, but I don't want to break too many things now that people are actually using the library. 😄 (Expect 1.0 sometime in the next month, because I'm presenting on Serde at Longhorn PHP.)

Does any of that make sense?

Although at that point, I wonder if that should be even a THIRD type of sub-attribute, so you aren't forced to choose between SequenceField and an entirely custom type field. Dur.

Note: As a workaround, you can subclass SequenceField and override the validate() method yourself. I won't guarantee that doesn't break before 1.0, but I left classes non-final so people would still have that kind of "unsupported escape hatch".

OK, I'm glad I slept on this for a while. Turns out, we were thinking about this problem all wrong. The issue has nothing to do with the Importer. It has to do with the Deformatter. That's where strict mode gets enforced. Fixing that is actually pretty trivial, but doing so uncovered a bunch of other subtle issues. They should all be fixed in #34.

Please give that branch a try using strict: false for the fields that Laravel is screwing up. That should give you a clean deserialized result.

Sorry for the late reply, had last week off. I just gave that branch a spin and with strict: false the test case above works now. 🎉 That particular case might still be hard to debug though when encountered by unsuspecting users, maybe add a hint for strict mode to the TypeMismatch exception message?

Error message updated.