Cells with custom date format not converted correctly or throw error
gorj-tessella opened this issue · comments
When attempting to process a sheet with a date in the custom format "d-mmm-yyyy", it failed with a "ValueError: Invalid length for field 'mmm'". The error was thrown by babel.
Upon further inspection I found that pyxl was using the cell's number_format property, which was "d-mmm-yyyy". When it ran the date conversion function, the format came back unchanged, but babel needs it converted to "d-MMM-yyyy". The normalize_date_format function was not doing this; it was only making the following changes:
xlsx2html/xlsx2html/format/dt.py
Lines 8 to 12 in bbd3843
The root issues are as follows:
- dt.py expects date tokens to be written in upper case for some reason. Maybe that is true for internal formats, but it isn't for custom formats. The "mmm" needed to be converted to "MMM".
- Custom date/time formats in Excel are frustratingly ambiguous. Specifically, both month and minute are specified with lowercase "m". "m" is only interpreted as minute if it follows "hh" or precedes "ss".
- Openpyxl doesn't provide any way to get a cell's value as formatted, though it really should.
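The month/minute rule from the second bullet can be sketched as a small helper that upper-cases only the month tokens, which is the change babel needed for "d-mmm-yyyy". This is a minimal sketch, not code from xlsx2html or openpyxl: the name `uppercase_month_tokens` is hypothetical, it uses a strict adjacency rule ("m"/"mm" directly after an hour token or directly before a seconds token), and it assumes the format contains no quoted literal text.

```python
import re

# Hypothetical helper for illustration; not part of xlsx2html or openpyxl.
# Lowercase date/time tokens only, so "AM/PM" and "A/P" are left alone.
TOKEN = re.compile(r"y+|m+|d+|h+|s+")


def uppercase_month_tokens(fmt):
    """Upper-case 'm' tokens that mean month; leave minute tokens as-is.

    Assumption: 'm' or 'mm' is a minute only when the previous token is an
    hour token or the next token is a seconds token. Quoted literals and
    escaped characters in the format are not handled.
    """
    tokens = list(TOKEN.finditer(fmt))
    out, pos = [], 0
    for i, match in enumerate(tokens):
        tok = match.group(0)
        if tok[0] == "m":
            prev_tok = tokens[i - 1].group(0) if i > 0 else ""
            next_tok = tokens[i + 1].group(0) if i + 1 < len(tokens) else ""
            is_minute = len(tok) <= 2 and (
                prev_tok[:1] == "h" or next_tok[:1] == "s"
            )
            if not is_minute:
                tok = tok.upper()
        out.append(fmt[pos:match.start()])  # literal text between tokens
        out.append(tok)
        pos = match.end()
    out.append(fmt[pos:])  # trailing literal text
    return "".join(out)


print(uppercase_month_tokens("d-mmm-yyyy"))
print(uppercase_month_tokens("mm/dd/yyyy hh:mm"))
```

With this rule, "mmm" and longer are always months, while "hh:mm:ss" and "mm:ss" are left untouched because their "mm" sits next to an hour or seconds token.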
Reference:
The following code should be sufficient to format an openpyxl datetime object using an Excel format, based on Excel's definitions.
import re

# Tokens recognised in an Excel custom date/time format.
RE_DATE_TOK = re.compile(r"y+|m+|d+|h+|s+|\.0+|AM/PM|A/P")
# Indexed by datetime.weekday(), where Monday is 0.
WEEK_DAYS = [
    'Monday',
    'Tuesday',
    'Wednesday',
    'Thursday',
    'Friday',
    'Saturday',
    'Sunday'
]
# Indexed by month - 1.
YEAR_MONTHS = [
    'January',
    'February',
    'March',
    'April',
    'May',
    'June',
    'July',
    'August',
    'September',
    'October',
    'November',
    'December'
]
def excel_format_custom_datetime(fmt, value):
    """Format a datetime with an Excel custom format. Only works for US."""
    def zpad(v, tok):
        # Left-pad v with zeros to the width of the token.
        v = str(v)
        i = len(tok) - len(v)
        if i > 0:
            v = ('0' * i) + v
        return v

    # First pass: decide which 'm'/'mm' tokens are minutes rather than months.
    has_ap = False
    is_minute = set()
    must_minute = False
    ms = list(RE_DATE_TOK.finditer(fmt))
    for i, m in enumerate(ms):
        tok = m.group(0)
        if tok in ['AM/PM', 'A/P']:
            has_ap = True
        elif tok[0] == 'h':
            # First m after h is always minute
            must_minute = True
        elif must_minute and tok in ['m', 'mm']:
            is_minute.add(i)
            must_minute = False
        elif tok[0] == 's':
            last_i = i - 1
            if last_i < 0:
                must_minute = True
            elif last_i not in is_minute:
                if ms[last_i].group(0) in ['m', 'mm']:
                    # m right before s is always minute
                    is_minute.add(last_i)
                elif not len(is_minute):
                    # if no previous m, first m after s is always minute
                    must_minute = True

    # Second pass: replace each token with its formatted value.
    parts = []
    pos = 0
    for i, m in enumerate(ms):
        tok = m.group(0)
        start, end = m.span(0)
        parts.append(fmt[pos:start])
        if tok[0] == 'h':
            tok = tok[:2]
            v = value.hour
            if has_ap:
                # 12-hour clock: 0 and 12 both display as 12
                v = v % 12
                if v == 0:
                    v = 12
            tok = zpad(v, tok)
        elif tok[0] == 'm':
            if len(tok) > 5:
                tok = tok[:4]  # Defaults to MMMM
            if tok == 'mmm':
                tok = YEAR_MONTHS[value.month - 1][:3]
            elif tok == 'mmmm':
                tok = YEAR_MONTHS[value.month - 1]
            elif tok == 'mmmmm':
                tok = YEAR_MONTHS[value.month - 1][0]
            elif i in is_minute:
                tok = zpad(value.minute, tok)
            else:
                tok = zpad(value.month, tok)
        elif tok[0] == 's':
            tok = tok[:2]
            tok = zpad(value.second, tok)
        elif tok[:2] == '.0':
            # Fractional seconds: keep the '.' and one digit per '0'
            digits = len(tok) - 1
            v = value.microsecond / 1000000.0
            v = ("{:." + str(digits) + "f}").format(v)[1:]
            tok = v
        elif tok == 'AM/PM':
            tok = 'AM' if (value.hour < 12) else 'PM'
        elif tok == 'A/P':
            tok = 'A' if (value.hour < 12) else 'P'
        elif tok[0] == 'y':
            if len(tok) <= 2:
                tok = str(value.year)[-2:]
            else:
                tok = str(value.year)
        elif tok[0] == 'd':
            if len(tok) <= 2:
                tok = zpad(value.day, tok)
            elif tok == 'ddd':
                tok = WEEK_DAYS[value.weekday()][:3]
            else:
                tok = WEEK_DAYS[value.weekday()]
        else:
            raise ValueError(f'Unhandled datetime token {tok}')
        parts.append(tok)
        pos = end
    parts.append(fmt[pos:])
    return ''.join(parts)
Great job! I will review this soon and add your solution as soon as I have time.
The current implementation is incomplete and only covers my own use cases.