Pure-Python date.fromisoformat silently mis-parses malformed basic-format dates #152204

Description

@tonghuaroot

Bug description

_pydatetime.date.fromisoformat (the pure-Python implementation, used when the
C accelerator is unavailable or when _pydatetime is imported directly) returns a
wrong-but-plausible date for several strings that are not valid ISO-8601
dates. The C accelerator raises ValueError for every one of them. Because the
result is a silently incorrect date rather than an error, malformed input
becomes valid-looking data with no signal that anything went wrong.

There are two surface forms of the same underlying defect in
_parse_isoformat_date, which slices fixed-width substrings and calls int()
on them without checking that each slice is exactly N ASCII digits:

(1) int() tolerates a leading +, - or space in a fixed-width field.

>>> import _datetime, _pydatetime
>>> _pydatetime.date.fromisoformat('2020+12')
datetime.date(2020, 1, 2)
>>> _datetime.date.fromisoformat('2020+12')
Traceback (most recent call last):
  ...
ValueError: Invalid isoformat string: '2020+12'
>>> _pydatetime.date.fromisoformat('+020-06-15')
datetime.date(20, 6, 15)
>>> _pydatetime.date.fromisoformat('2020-W 5')
datetime.date(2020, 1, 27)
>>> _pydatetime.date.fromisoformat('202012+9')
datetime.date(2020, 12, 9)
>>> _pydatetime.date.fromisoformat('2020 12')
datetime.date(2020, 1, 2)

Here int('+020') == 20, int('+1') == 1, int(' 1') == 1, int(' 5') == 5 and
int('+9') == 9, so the year / month / day / week fields accept a sign or space
that is not part of any ISO-8601 date.
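
The leniency is a property of int() itself and can be checked without the
datetime module. A minimal sketch; the helper name is_exact_ascii_digits is
hypothetical and only illustrates what a stricter check could look like:

```python
# int() strips surrounding whitespace and accepts one leading sign,
# so a fixed-width slice does not have to consist of digits only.
lenient = {s: int(s) for s in ("+1", " 1", "+9", " 5", "+020")}
print(lenient)

def is_exact_ascii_digits(field, width):
    # Hypothetical helper: True only for exactly `width` ASCII digits.
    return len(field) == width and all(c in "0123456789" for c in field)

# Every string that int() accepted above fails the stricter check.
for s in lenient:
    print(repr(s), is_exact_ascii_digits(s, len(s)))
```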

(2) The length gate admits 7-character strings, and a fixed-width slice then
reads a 1-character tail.

>>> _pydatetime.date.fromisoformat('2020061')
datetime.date(2020, 6, 1)
>>> _datetime.date.fromisoformat('2020061')
Traceback (most recent call last):
  ...
ValueError: Invalid isoformat string: '2020061'
>>> _pydatetime.date.fromisoformat('2020123')
datetime.date(2020, 12, 3)
>>> _pydatetime.date.fromisoformat('2020-W2')
datetime.date(2020, 1, 6)
>>> _pydatetime.date.fromisoformat('9999121')
datetime.date(9999, 12, 1)

'2020061' is 7 characters, so the length gate (which accepts lengths 7, 8 and
10) lets it through. The month slice reads '06' and the day slice dtstr[6:8]
reads the 1-character tail '1', giving date(2020, 6, 1). '2020-W2' likewise
reads a 1-digit week, int('2') == 2. The C parse_digits(p, ..., 2) requires
exactly two digits, so C rejects all of these.
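
The short read follows from Python's slice semantics: a slice that extends
past the end of a string is truncated rather than raising. A minimal
illustration, independent of the datetime module:

```python
s = "2020061"            # 7 characters, one short of YYYYMMDD

month_field = s[4:6]     # two characters, as expected
day_field = s[6:8]       # only one character is left; the slice is truncated

print(repr(month_field), repr(day_field))
print(int(month_field), int(day_field))

# Comparing the slice length with the expected width exposes the truncation.
print(len(day_field) == 2)
```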

datetime.fromisoformat inherits the same defect via the date branch, e.g.
_pydatetime.datetime.fromisoformat('2020061') returns
datetime.datetime(2020, 6, 1, 0, 0) while the C path raises.

C vs pure-Python

input                               C _datetime   pure-Python _pydatetime
date.fromisoformat('2020+12')       ValueError    date(2020, 1, 2)
date.fromisoformat('+020-06-15')    ValueError    date(20, 6, 15)
date.fromisoformat('2020-W 5')      ValueError    date(2020, 1, 27)
date.fromisoformat('202012+9')      ValueError    date(2020, 12, 9)
date.fromisoformat('2020061')       ValueError    date(2020, 6, 1)
date.fromisoformat('2020123')       ValueError    date(2020, 12, 3)
date.fromisoformat('2020-W2')       ValueError    date(2020, 1, 6)
date.fromisoformat('9999121')       ValueError    date(9999, 12, 1)
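
A comparison like the one in this table can be automated. The following is a
minimal sketch of a differential check, not the harness used for this report;
the function names outcome and divergences are hypothetical. Results are
compared as strings because date objects from the two implementations are
different types and may not compare equal to each other.

```python
def outcome(parse, text):
    """Return str(parsed value), or 'ValueError' if parsing fails."""
    try:
        return str(parse(text))
    except ValueError:
        return "ValueError"

def divergences(parse_a, parse_b, inputs):
    """List (text, outcome_a, outcome_b) for every input where the two differ."""
    rows = []
    for text in inputs:
        a, b = outcome(parse_a, text), outcome(parse_b, text)
        if a != b:
            rows.append((text, a, b))
    return rows

if __name__ == "__main__":
    try:
        # Assumption: both modules are importable on this interpreter.
        import _datetime
        import _pydatetime
    except ImportError:
        print("both implementations are needed for the comparison")
    else:
        samples = ["2020+12", "2020061", "2020-W2", "2020-06-15"]
        for row in divergences(_datetime.date.fromisoformat,
                               _pydatetime.date.fromisoformat, samples):
            print(row)
```

On an interpreter where the defect has been fixed, the loop prints nothing.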

Root cause

Lib/_pydatetime.py, _parse_isoformat_date (the function's own comment notes
it "assumes an ASCII-only string of lengths 7, 8 or 10"). On current main the
function body is:

def _parse_isoformat_date(dtstr):
    # It is assumed that this is an ASCII-only string of lengths 7, 8 or 10,
    # see the comment on Modules/_datetimemodule.c:_find_isoformat_datetime_separator
    if len(dtstr) not in (7, 8, 10):           # line 361
        raise ValueError("Invalid isoformat string")
    year = int(dtstr[0:4])                     # line 363
    ...
        weekno = int(dtstr[pos:pos + 2])       # line 370  (week field)
        ...
        dayno = int(dtstr[pos:pos + 1])        # line 380  (week day field)
    ...
        month = int(dtstr[pos:pos + 2])        # line 384  (month field)
        ...
        day = int(dtstr[pos:pos + 2])          # line 390  (day field)

The if len(dtstr) not in (7, 8, 10) gate at line 361 only bounds the total
length; date.fromisoformat (the caller, lines 1059-1060) applies the same
length gate before calling in. Neither gate checks the content of the
fixed-width fields. Each field is read with int(dtstr[pos:pos+N]). int()
accepts a leading +/-/whitespace and a short string, so:

  • a +/-/space that lands in a year/month/day/week field is silently
    consumed (form 1), and
  • on a 7-char string the day/week slice runs off the end and int() happily
    parses the 1-character remainder (form 2).

The pure-Python side is the one at fault: it over-accepts. ISO-8601 calendar
dates contain no sign or space inside the date, and have no 1-digit day of
month or 1-digit week; date.fromisoformat's docstring promises a string "in
the format emitted by date.isoformat()". The C accelerator's parse_digits
rejects any non-digit byte and requires the exact field width, then verifies
the whole string was consumed, so C is correct here.

Suggested fix

Validate each slice in _parse_isoformat_date before converting: require that
the year/month/day/week/weekday slice is exactly N ASCII digits (mirroring the
C parse_digits). The module already has an _is_ascii_digit helper used by
the fraction path (_parse_hh_mm_ss_ff does
all(map(_is_ascii_digit, tstr[pos:]))), so reusing it keeps the check
consistent, e.g. raise ValueError unless
len(s) == N and all(map(_is_ascii_digit, s)) before calling int(s). That
makes the malformed basic-format strings above raise ValueError on the
pure-Python path exactly as the C accelerator does, and closes the 7-char
short-slice hole at the same time. (The length gate (7, 8, 10) can stay; the
per-field width check is what rejects '2020061', since its day slice is then a
1-char string.)
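
A minimal sketch of the per-field check described above. The helper name
_parse_fixed_digits and the function parse_basic_date are hypothetical and not
part of _pydatetime; _is_ascii_digit is re-declared here only so the example
runs on its own, and an actual patch may be structured differently. Only the
basic YYYYMMDD form is handled, to keep the illustration short.

```python
def _is_ascii_digit(c):
    # Stand-in for the helper of the same name in Lib/_pydatetime.py.
    return c in "0123456789"

def _parse_fixed_digits(dtstr, pos, width):
    """Hypothetical helper: read exactly `width` ASCII digits at `pos`."""
    field = dtstr[pos:pos + width]
    if len(field) != width or not all(map(_is_ascii_digit, field)):
        raise ValueError(f"Invalid isoformat string: {dtstr!r}")
    return int(field)

def parse_basic_date(dtstr):
    """Parse YYYYMMDD into (year, month, day), rejecting malformed fields."""
    year = _parse_fixed_digits(dtstr, 0, 4)
    month = _parse_fixed_digits(dtstr, 4, 2)
    day = _parse_fixed_digits(dtstr, 6, 2)
    if len(dtstr) != 8:
        # Trailing characters after the day field are not allowed.
        raise ValueError(f"Invalid isoformat string: {dtstr!r}")
    return year, month, day

print(parse_basic_date("20200615"))

for bad in ("2020061", "2020+12", "202012+9", "2020 12"):
    try:
        parse_basic_date(bad)
    except ValueError as exc:
        print("rejected:", exc)
```

The width test catches the short 7-character inputs and the digit test catches
signs and spaces, so both forms are rejected by the same helper.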

Environment

Relation to existing issues

This is distinct from the known nearby issues:

Found with a differential C-vs-pure-Python fromisoformat testing harness
(AI-assisted, each case hand-verified).
