altoxml / schema

ALTO XML schema - latest and all former versions

Restrict PointsType to a well defined format

cipriandinu opened this issue

This topic is derived from #49. In the previous issue we focused on changing the documentation and announcing the PointsType restrictions; this topic is for discussing how to implement those restrictions in version 5.0.

The most common form of PointsType, and the one used in all of the examples, seems to be x1 y1 x2 y2 ... x_n y_n. As such, I'd suggest standardizing on this format and explicitly defining it as a string containing an even number of floats, with a regex like this (non-functional, just a sketch):

([0-9]*\.?[0-9]+([eE][-+]?[0-9]+)?\s+[0-9]*\.?[0-9]+([eE][-+]?[0-9]+)?)+
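As a quick check of the sketch above, the following Python snippet runs it against a few inputs. Since the sketch puts no separator between consecutive pairs, it behaves unexpectedly on some strings. `REPAIRED` is one possible fix, shown only as an assumption and not as the format agreed in this thread; the names `SKETCH`, `REPAIRED`, `NUM`, `PAIR` and `matches` are hypothetical.

```python
import re

# Pattern copied verbatim from the sketch above.
SKETCH = re.compile(
    r"([0-9]*\.?[0-9]+([eE][-+]?[0-9]+)?\s+[0-9]*\.?[0-9]+([eE][-+]?[0-9]+)?)+"
)

# Hypothetical repair (an assumption, not part of the proposal):
# one unsigned number token, pairs joined by whitespace.
NUM = r"[0-9]*\.?[0-9]+(?:[eE][-+]?[0-9]+)?"
PAIR = rf"{NUM}\s+{NUM}"
REPAIRED = re.compile(rf"{PAIR}(?:\s+{PAIR})*")


def matches(pattern, text):
    """Return True if the whole string matches the pattern."""
    return pattern.fullmatch(text) is not None


# The sketch has no whitespace between repeated pairs, so a plain
# two-point string is rejected ...
print(matches(SKETCH, "1 2 3 4"))      # False
# ... while three values are accepted, read as the pairs (10, 2) and (0, 30).
print(matches(SKETCH, "10 20 30"))     # True

# The repaired variant requires whitespace between pairs.
print(matches(REPAIRED, "1 2 3 4"))    # True
print(matches(REPAIRED, "10 20 30"))   # False
print(matches(REPAIRED, "1.5e3 .5"))   # True
```

The repaired variant still rejects negative numbers and leading or trailing whitespace; whether those should be allowed is left open here.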

It might also be advantageous to split PointsType into two different types: one for simple point sequences such as BASELINE, which require at least two points, and one for Polygon, which requires at least three points.
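The two-type idea can be illustrated outside the schema with a small Python sketch. It assumes whitespace-separated coordinates and uses the hypothetical helpers `parse_points`, `is_valid_baseline` and `is_valid_polygon`; none of these names come from ALTO. It relies on Python's `float()`, which is more permissive than the regex under discussion.

```python
from typing import List, Tuple

Point = Tuple[float, float]


def parse_points(value: str) -> List[Point]:
    """Split a whitespace-separated coordinate string into (x, y) pairs.

    Assumption: float() decides what counts as a number, so signed values
    and spellings such as "inf" are accepted here, unlike in the regex.
    """
    tokens = value.split()
    if len(tokens) % 2 != 0:
        raise ValueError("odd number of coordinates")
    numbers = [float(token) for token in tokens]
    return list(zip(numbers[0::2], numbers[1::2]))


def _has_min_points(value: str, minimum: int) -> bool:
    """True if the string parses and holds at least `minimum` points."""
    try:
        return len(parse_points(value)) >= minimum
    except ValueError:
        return False


def is_valid_baseline(value: str) -> bool:
    """A baseline needs at least two points."""
    return _has_min_points(value, 2)


def is_valid_polygon(value: str) -> bool:
    """A polygon needs at least three points."""
    return _has_min_points(value, 3)


print(is_valid_baseline("0 0 10 0"))        # True
print(is_valid_polygon("0 0 10 0"))         # False
print(is_valid_polygon("0 0 10 0 10 10"))   # True
```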

I propose the following. It is a bit more complex, but it appears to work: it should handle numbers written as integers or as floating point values such as 2.4, .5, 1.832e-5, 1.322E-5, 3.4e+3, 2.3E+3, 2.4E5 or 8.9e2, separated by a comma or by whitespace inside a pair, with at least two pairs (this is useful for a basic baseline, not for a polygon shape). This would match what we stated in version 4.4.

(\d?\d*(\.\d+([eE][-+]?\d+)?)?\s*[,\s]\s*\d?\d*(\.\d+([eE][-+]?\d+)?)?\s+)+\d?\d*(\.\d+([eE][-+]?\d+)?)?\s*[,\s]\s*\d?\d*(\.\d+([eE][-+]?\d+)?)?
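To try the proposal, here is a minimal Python sketch that rebuilds the pattern from a single number fragment and runs it on a few strings. The decimal point is written as `\.` so that it matches only a literal dot; that is an assumption about the intent. The names `NUM`, `PAIR`, `POINTS` and `matches` are hypothetical.

```python
import re

# One number: optional integer digits, optional fraction, optional exponent.
# Assumption: the decimal point is meant literally, hence "\.".
NUM = r"\d?\d*(\.\d+([eE][-+]?\d+)?)?"

# One pair: two numbers separated by a comma or whitespace.
PAIR = NUM + r"\s*[,\s]\s*" + NUM

# At least two pairs, with whitespace between pairs.
POINTS = re.compile("(" + PAIR + r"\s+)+" + PAIR)


def matches(text):
    """Return True if the whole string matches the proposed pattern."""
    return POINTS.fullmatch(text) is not None


print(matches("1,2 3,4"))                  # True
print(matches("1 2 3 4"))                  # True
print(matches("2.4 .5 1.832e-5 3.4e+3"))   # True
print(matches("1 2"))                      # False: only one pair
print(matches("1x5 2 3 4"))                # False: "x" is not a decimal point

# Every part of the number is optional, so a number may be empty and
# strings with missing coordinates still match.
print(matches(", ,"))                      # True
```

If empty coordinates should be rejected, the number fragment would need at least one mandatory digit; that change is not part of the proposal as written.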