81 changes: 81 additions & 0 deletions docs/temporal-covering.md
@@ -0,0 +1,81 @@
# Temporal-covering descriptor

`meta/temporal-covering.json` is the **single codegen source of truth**
(RFC #870 TemporalParquet / #913 Temporal Data Lake) for projecting a MEOS
temporal column into Parquet/Iceberg **covering columns**. The pipeline
folds it into `meos-idl.json` as `temporalCovering`. Every binding/engine
(PyMEOS, JMEOS, MobilityDuck, MobilitySpark, …) generates the **identical**
covering schema from this one mapping, so a temporal table prunes the same
way on every platform — no per-engine covering code to maintain.

## What it is

A temporal value is stored on disk as a canonical MEOS-WKB `BLOB`. Iceberg
and Parquet cannot prune on a `BLOB`. The covering descriptor names, per
temporal-type **class**, the primitive columns to *materialise alongside*
the value — the bounding box and SRID — which Iceberg collects as manifest
statistics and Parquet as row-group min/max. A bbox/time predicate then
prunes whole files and row groups with **no spatial-aware engine**
(GeoParquet 1.1 `covering.bbox`; MVB v3 measured this as ~10× faster than
the `ST_Intersects` path).

The mapping is keyed by **class**, not by type — adding a type is one entry
in its class:

| Class | Box | Types | Covering columns |
|---|---|---|---|
| `spatial` | `STBOX` via `tspatial_to_stbox` | tgeompoint, tgeogpoint, tgeometry, tgeography, tcbuffer, tnpoint, tpose, trgeometry | `xmin xmax ymin ymax [zmin zmax] tmin tmax srid` |
| `number` | `TBOX` via `tnumber_to_tbox` | tint, tfloat, tbigint | `vmin vmax tmin tmax` |

The canonical value column is unchanged and lossless; covering columns are
denormalised derivations of the value's box. `zmin`/`zmax` are emitted only
for 3D values (`when: hasZ`).
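
A minimal sketch of why primitive covering columns are enough for pruning, assuming per-row-group min/max statistics are available as plain values. The `RowGroupStats` shape and the `may_overlap` helper are hypothetical illustrations, not part of MEOS or of any Parquet library, and timestamps are shown as integers only to keep the example short:

```python
from dataclasses import dataclass


@dataclass
class RowGroupStats:
    """Hypothetical min/max statistics collected for one row group."""
    xmin_min: float  # smallest xmin stored in the group
    xmax_max: float  # largest xmax stored in the group
    ymin_min: float
    ymax_max: float
    tmin_min: int    # epoch seconds, for illustration only
    tmax_max: int


def may_overlap(rg, qx0, qx1, qy0, qy1, qt0, qt1):
    """Return False only when no row in the group can intersect the query box."""
    return (rg.xmin_min <= qx1 and rg.xmax_max >= qx0 and
            rg.ymin_min <= qy1 and rg.ymax_max >= qy0 and
            rg.tmin_min <= qt1 and rg.tmax_max >= qt0)


groups = [
    RowGroupStats(0.0, 10.0, 0.0, 10.0, 0, 100),      # overlaps the query
    RowGroupStats(50.0, 60.0, 50.0, 60.0, 0, 100),    # disjoint in x and y
    RowGroupStats(0.0, 10.0, 0.0, 10.0, 500, 600),    # disjoint in time
]

# Query box: x in [5, 8], y in [5, 8], t in [50, 80].
kept = [i for i, rg in enumerate(groups)
        if may_overlap(rg, 5.0, 8.0, 5.0, 8.0, 50, 80)]
```

Only the first group survives; the other two are skipped using scalar comparisons alone, without decoding the WKB value.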

## In the catalog

`temporalCovering` carries the verbatim `classes`, plus derived lookups for
codegen:

```json
"temporalCovering": {
  "valueCodec": { "asHexWkb": "temporal_as_hexwkb",
                  "fromHexWkb": "temporal_from_hexwkb" },
  "byType": { "tgeompoint": { "class": "spatial", "box": {...},
                              "srid": "tspatial_srid", "columns": [...] }, ... },
  "symbols": ["stbox_xmin", "tbox_xmin", "tspatial_to_stbox", ...],
  "count": 11
}
```

- `byType` — `"tgeompoint"` → its class, box converter, SRID accessor, and
covering columns (each with its MEOS bbox accessor and SQL type). A
generator reads this directly; it never re-derives the mapping.
- `symbols` — every MEOS C symbol the descriptor depends on. The covering
parity audit (`tools/covering_parity.py`) checks each is exported by the
catalog and each covered type is a real `MeosType` — a miss is reported as
a worklist (add/export the accessor in MEOS), never a fabricated pass.
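
The symbol check described above can be sketched as follows. This is not the code of `tools/covering_parity.py`; it assumes each entry of `idl["functions"]` is a dict with a `"name"` key, which the text here does not confirm, and `missing_symbols` is a hypothetical helper name:

```python
def missing_symbols(idl):
    """Return the covering symbols that the catalog does not export, sorted."""
    # Assumption: every function entry is a dict carrying a "name" key.
    exported = {fn["name"] for fn in idl.get("functions", [])}
    wanted = idl.get("temporalCovering", {}).get("symbols", [])
    return sorted(s for s in wanted if s not in exported)


# Tiny stand-in catalog: two accessors exported, one missing.
idl = {
    "functions": [{"name": "stbox_xmin"}, {"name": "tspatial_to_stbox"}],
    "temporalCovering": {
        "symbols": ["stbox_xmin", "tbox_xmin", "tspatial_to_stbox"],
    },
}
worklist = missing_symbols(idl)
```

A non-empty `worklist` names the accessors that still need to be added or exported in MEOS, so a gap is reported rather than silently passed.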

## How a generator uses it

For a column `traj TGEOMPOINT`, emit alongside the WKB value column:

```sql
xmin = stbox_xmin(tspatial_to_stbox(traj)), xmax = stbox_xmax(...),
ymin = stbox_ymin(...), ymax = stbox_ymax(...),
tmin = stbox_tmin(...), tmax = stbox_tmax(...),
srid = tspatial_srid(traj)
```

(each engine in its own idiom — DuckDB generated columns, a Spark UDF
projection, a PyMEOS writer), plus the `temporal` and GeoParquet `geo` /
`covering.bbox` file metadata keys from `metadataKeys`.
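
A minimal sketch of how a generator might turn one `byType` entry into the expressions above. The `covering_exprs` function is a hypothetical helper, not part of any binding; it only relies on the `box`, `columns`, `accessor`, `source`, and `when` fields shown in the descriptor:

```python
def covering_exprs(column, entry, has_z=False):
    """Build (covering column name, expression) pairs for one temporal column."""
    box = entry.get("box")
    box_expr = f'{box["from"]}({column})' if box else None
    out = []
    for col in entry["columns"]:
        # zmin/zmax are emitted only for 3D values.
        if col.get("when") == "hasZ" and not has_z:
            continue
        if col["source"] == "box":
            if box_expr is None:
                raise ValueError(f'{col["name"]}: box accessor without a box')
            arg = box_expr
        else:
            # source == "value": the accessor takes the temporal value itself.
            arg = column
        out.append((col["name"], f'{col["accessor"]}({arg})'))
    return out


# A trimmed byType entry for tgeompoint (three of its nine columns).
entry = {
    "class": "spatial",
    "box": {"type": "STBOX", "from": "tspatial_to_stbox"},
    "srid": "tspatial_srid",
    "columns": [
        {"name": "xmin", "sqlType": "double", "accessor": "stbox_xmin", "source": "box"},
        {"name": "zmin", "sqlType": "double", "accessor": "stbox_zmin", "source": "box", "when": "hasZ"},
        {"name": "srid", "sqlType": "int", "accessor": "tspatial_srid", "source": "value"},
    ],
}
exprs_2d = covering_exprs("traj", entry)
exprs_3d = covering_exprs("traj", entry, has_z=True)
```

Each engine would wrap these pairs in its own idiom; the sketch only shows that the mapping is read from the descriptor and not re-derived.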

## Not yet covered

- **Time-only** (`tbool`, `ttext`): a `tmin`/`tmax` covering needs a span
lower/upper bound accessor; `temporal_to_tstzspan` is exported but a span
bound accessor is not. Surfaced as a MEOS export gap (close in MEOS C),
not filled binding-side.
- **Point-cloud / cell-index** (`tpcpoint`, `tpcpatch`, `th3index`,
`tquadbin`): fold into the `spatial` class once the catalog confirms a
uniform temporal→`STBOX` converter for these families.
74 changes: 74 additions & 0 deletions meta/temporal-covering.json
@@ -0,0 +1,74 @@
{
  "_comment": "Temporal-covering descriptor — the single codegen source of truth for projecting a MEOS temporal column into Parquet/Iceberg covering columns (GeoParquet 1.1 `covering.bbox`). Every binding/engine generates the IDENTICAL covering schema from this mapping, so a temporal table prunes the same way on every platform (Iceberg manifest pruning + Parquet row-group min/max) with no spatial-aware engine. Curated canonical data keyed by temporal-type FAMILY (a `class`), not per type — adding a type is one entry in its class. The canonical MEOS-WKB value column is unchanged and lossless; the covering columns are denormalised derivations of the value's bounding box. RFC #870 (TemporalParquet) / #913 (Temporal Data Lake).",
  "provenance": {
    "rfc": "MobilityDB RFC #870 (TemporalParquet) + #913 (Temporal Data Lake)",
    "discussion": "MobilityDB#861 (edge-to-cloud SQL portability: one query, three platforms)",
    "geoParquet": "GeoParquet 1.1 covering.bbox (geoparquet.org/releases/v1.1.0)",
    "benchmark": "MVB v3 — the scalar AND-chain on materialised covering columns prunes row groups identically to the spatial-aware path and ~10x faster, with no DuckDB spatial extension"
  },
  "version": "1.0.0",
  "valueCodec": {
    "asHexWkb": "temporal_as_hexwkb",
    "fromHexWkb": "temporal_from_hexwkb",
    "note": "The canonical MEOS-WKB stays the lossless value column (BLOB); covering columns are denormalised and never the source of truth."
  },
  "metadataKeys": {
    "temporal": "temporal",
    "geo": "geo",
    "covering": "bbox"
  },
  "classes": {
    "spatial": {
      "doc": "Spatial temporal types — STBOX covering (x/y[/z] extent + time extent + SRID).",
      "box": {"type": "STBOX", "from": "tspatial_to_stbox"},
      "srid": "tspatial_srid",
      "types": ["tgeompoint", "tgeogpoint", "tgeometry", "tgeography", "tcbuffer", "tnpoint", "tpose", "trgeometry"],
      "columns": [
        {"name": "xmin", "sqlType": "double", "accessor": "stbox_xmin", "source": "box"},
        {"name": "xmax", "sqlType": "double", "accessor": "stbox_xmax", "source": "box"},
        {"name": "ymin", "sqlType": "double", "accessor": "stbox_ymin", "source": "box"},
        {"name": "ymax", "sqlType": "double", "accessor": "stbox_ymax", "source": "box"},
        {"name": "zmin", "sqlType": "double", "accessor": "stbox_zmin", "source": "box", "when": "hasZ"},
        {"name": "zmax", "sqlType": "double", "accessor": "stbox_zmax", "source": "box", "when": "hasZ"},
        {"name": "tmin", "sqlType": "timestamptz", "accessor": "stbox_tmin", "source": "box"},
        {"name": "tmax", "sqlType": "timestamptz", "accessor": "stbox_tmax", "source": "box"},
        {"name": "srid", "sqlType": "int", "accessor": "tspatial_srid", "source": "value"}
      ]
    },
    "number": {
      "doc": "Numeric temporal types — TBOX covering (value range + time extent).",
      "box": {"type": "TBOX", "from": "tnumber_to_tbox"},
      "srid": null,
      "types": ["tint", "tfloat", "tbigint"],
      "columns": [
        {"name": "vmin", "sqlType": "double", "accessor": "tbox_xmin", "source": "box"},
        {"name": "vmax", "sqlType": "double", "accessor": "tbox_xmax", "source": "box"},
        {"name": "tmin", "sqlType": "timestamptz", "accessor": "tbox_tmin", "source": "box"},
        {"name": "tmax", "sqlType": "timestamptz", "accessor": "tbox_tmax", "source": "box"}
      ]
    },
    "timeOnly": {
      "doc": "Time-only temporal types — no spatial box; time extent only.",
      "box": null,
      "srid": null,
      "types": ["tbool", "ttext"],
      "columns": [
        {"name": "tmin", "sqlType": "timestamptz", "accessor": "temporal_start_timestamptz", "source": "value"},
        {"name": "tmax", "sqlType": "timestamptz", "accessor": "temporal_end_timestamptz", "source": "value"}
      ]
    }
  },
  "deferred": {
    "pointcloudCellIndex": {
      "types": ["tpcpoint", "tpcpatch", "th3index", "tquadbin"],
      "reason": "STBOX covering via a type-specific box path (e.g. tpcbox_to_stbox); fold into the `spatial` class once the catalog confirms a uniform temporal->STBOX converter for these families."
    }
  },
  "notes": [
    "The covering columns are a denormalisation of the value's bounding box; the canonical MEOS-WKB BLOB remains the lossless source of truth.",
    "Materialising the covering columns as primitive Parquet columns gives Iceberg manifest-level file pruning and Parquet row-group min/max pruning, with no spatial-aware engine.",
    "zmin/zmax are emitted only for 3D values (`when: hasZ`); 2D values omit them or store null.",
    "`source: box` accessors take the box returned by `class.box.from(value)`; `source: value` accessors take the temporal value directly.",
    "This descriptor is type-agnostic per class exactly as `portable-aliases.json` is type-agnostic per operator family — codegen consumes it identically across every binding."
  ]
}
88 changes: 88 additions & 0 deletions meta/temporal-covering.schema.json
@@ -0,0 +1,88 @@
{
  "$schema": "https://json-schema.org/draft/2020-12/schema",
  "$id": "https://github.com/MobilityDB/MEOS-API/blob/main/meta/temporal-covering.schema.json",
  "title": "Temporal-covering descriptor — canonical SoT",
  "description": "Schema for `meta/temporal-covering.json` (RFC #870/#913). Catches shape regressions earlier than the unit tests; validated as a test step in `tests/test_covering.py`.",
  "type": "object",
  "additionalProperties": true,
  "required": ["provenance", "version", "valueCodec", "metadataKeys", "classes", "notes"],
  "properties": {
    "_comment": {"type": "string"},
    "provenance": {
      "type": "object",
      "additionalProperties": true,
      "required": ["rfc"],
      "properties": {
        "rfc": {"type": "string"},
        "discussion": {"type": "string"},
        "geoParquet": {"type": "string"},
        "benchmark": {"type": "string"}
      }
    },
    "version": {"type": "string"},
    "valueCodec": {
      "type": "object",
      "additionalProperties": true,
      "required": ["asHexWkb", "fromHexWkb"],
      "properties": {
        "asHexWkb": {"type": "string"},
        "fromHexWkb": {"type": "string"},
        "note": {"type": "string"}
      }
    },
    "metadataKeys": {
      "type": "object",
      "additionalProperties": true,
      "required": ["temporal", "covering"],
      "properties": {
        "temporal": {"type": "string"},
        "geo": {"type": "string"},
        "covering": {"type": "string"}
      }
    },
    "classes": {
      "type": "object",
      "minProperties": 1,
      "additionalProperties": {
        "type": "object",
        "additionalProperties": true,
        "required": ["types", "columns"],
        "properties": {
          "doc": {"type": "string"},
          "srid": {"type": ["string", "null"]},
          "box": {
            "type": ["object", "null"],
            "required": ["type", "from"],
            "properties": {
              "type": {"type": "string"},
              "from": {"type": "string"}
            }
          },
          "types": {
            "type": "array",
            "minItems": 1,
            "items": {"type": "string", "pattern": "^t[a-z0-9]+$"}
          },
          "columns": {
            "type": "array",
            "minItems": 1,
            "items": {
              "type": "object",
              "additionalProperties": false,
              "required": ["name", "sqlType", "accessor", "source"],
              "properties": {
                "name": {"type": "string", "pattern": "^[a-z][a-z0-9]*$"},
                "sqlType": {"enum": ["double", "int", "timestamptz"]},
                "accessor": {"type": "string"},
                "source": {"enum": ["box", "value"]},
                "when": {"enum": ["hasZ"]}
              }
            }
          }
        }
      }
    },
    "deferred": {"type": "object"},
    "notes": {"type": "array", "items": {"type": "string"}}
  }
}
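
A standard-library sketch of a few of the per-class constraints this schema encodes (type-name pattern, column-name pattern, `sqlType` enum, required column keys). It is an illustration only, not the check in `tests/test_covering.py`, and it covers a subset of the schema; a full validator such as the `jsonschema` package would be the usual choice. The `shape_errors` name is hypothetical:

```python
import re

# Patterns and enum copied from the schema; fullmatch makes the anchors implicit.
TYPE_NAME = re.compile(r"t[a-z0-9]+")
COLUMN_NAME = re.compile(r"[a-z][a-z0-9]*")
SQL_TYPES = {"double", "int", "timestamptz"}
REQUIRED_COLUMN_KEYS = {"name", "sqlType", "accessor", "source"}


def shape_errors(classes):
    """Return human-readable shape problems; an empty list means none were found."""
    errors = []
    for class_name, spec in classes.items():
        for t in spec.get("types", []):
            if not TYPE_NAME.fullmatch(t):
                errors.append(f"{class_name}: bad type name {t!r}")
        for col in spec.get("columns", []):
            missing = REQUIRED_COLUMN_KEYS - col.keys()
            if missing:
                errors.append(f"{class_name}: column missing {sorted(missing)}")
                continue
            if not COLUMN_NAME.fullmatch(col["name"]):
                errors.append(f"{class_name}: bad column name {col['name']!r}")
            if col["sqlType"] not in SQL_TYPES:
                errors.append(f"{class_name}: bad sqlType {col['sqlType']!r}")
    return errors


good = {"number": {"types": ["tint"], "columns": [
    {"name": "vmin", "sqlType": "double", "accessor": "tbox_xmin", "source": "box"}]}}
bad = {"number": {"types": ["Tint"], "columns": [
    {"name": "vmin", "sqlType": "float", "accessor": "tbox_xmin", "source": "box"}]}}

good_errors = shape_errors(good)
bad_errors = shape_errors(bad)
```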
71 changes: 71 additions & 0 deletions parser/covering.py
@@ -0,0 +1,71 @@
"""Temporal-covering descriptor — the single codegen source of truth for
projecting a MEOS temporal column into Parquet/Iceberg covering columns.

`meta/temporal-covering.json` is the curated, authoritative mapping (RFC
#870 TemporalParquet / #913 Temporal Data Lake): per temporal-type *class*
(spatial → STBOX, number → TBOX) it names the box converter, the SRID
accessor, and the covering columns with their MEOS bbox accessors. Folding
it into the catalog means every binding/engine generates the *identical*
covering schema, so a temporal table prunes the same way on every platform
(Iceberg manifest pruning + Parquet row-group min/max) with no spatial-aware
engine.

This is curated canonical data, not a heuristic — it is preserved verbatim
and only *derived* lookups are added (a flat `byType` index and the set of
referenced C symbols), so a generator never has to re-derive the mapping.
Pure dict → dict; no libclang.
"""

import json
from pathlib import Path


def attach_temporal_covering(idl: dict, path: Path) -> dict:
    """Attach ``idl["temporalCovering"]`` from the canonical mapping file."""
    if not Path(path).exists():
        return idl
    data = json.loads(Path(path).read_text())

    classes = data["classes"]

    # Integrity: a temporal type may belong to at most one covering class —
    # two classes claiming the same type would make codegen ambiguous.
    by_type = {}
    for class_name, spec in classes.items():
        for t in spec["types"]:
            if t in by_type:
                raise ValueError(
                    f"temporal-covering: type {t!r} in two classes "
                    f"({by_type[t]['class']!r} and {class_name!r})")
            by_type[t] = {
                "class": class_name,
                "box": spec.get("box"),
                "srid": spec.get("srid"),
                "columns": spec["columns"],
            }

    # The complete set of MEOS C symbols this descriptor depends on — the
    # covering parity audit checks every one is actually in the catalog.
    symbols = {data["valueCodec"]["asHexWkb"], data["valueCodec"]["fromHexWkb"]}
    for spec in classes.values():
        if spec.get("box"):
            symbols.add(spec["box"]["from"])
        if spec.get("srid"):
            symbols.add(spec["srid"])
        for col in spec["columns"]:
            symbols.add(col["accessor"])

    idl["temporalCovering"] = {
        "provenance": data["provenance"],
        "version": data["version"],
        "valueCodec": data["valueCodec"],
        "metadataKeys": data["metadataKeys"],
        "classes": classes,
        "deferred": data.get("deferred", {}),
        "notes": data["notes"],
        "byType": by_type,            # "tgeompoint" -> class + columns
        "types": sorted(by_type),
        "symbols": sorted(symbols),   # referenced C symbols (audit set)
        "count": len(by_type),
    }
    return idl
13 changes: 11 additions & 2 deletions run.py
@@ -4,12 +4,14 @@

 from parser.parser import parse_all_headers, merge_meta
 from parser.portable import attach_portable_aliases
+from parser.covering import attach_temporal_covering
 from parser.typerecover import recover_collapsed_types


 HEADERS_DIR = Path(sys.argv[1]) if len(sys.argv) > 1 else Path("./meos/include")
 META_PATH = Path("./meta/meos-meta.json")
 PORTABLE_PATH = Path("./meta/portable-aliases.json")
+COVERING_PATH = Path("./meta/temporal-covering.json")
 OUTPUT_DIR = Path("./output")


@@ -36,20 +38,27 @@ def main():
         print(f"[2/3] No meta found at {META_PATH}, skipping.", file=sys.stderr)

     # 3. Attach the canonical portable bare-name mapping (codegen truth)
-    print(f"[3/3] Attaching portable aliases from {PORTABLE_PATH}...",
+    print(f"[3/4] Attaching portable aliases from {PORTABLE_PATH}...",
           file=sys.stderr)
     idl = attach_portable_aliases(idl, PORTABLE_PATH)

+    # 4. Attach the temporal-covering descriptor (Parquet/Iceberg projection)
+    print(f"[4/4] Attaching temporal covering from {COVERING_PATH}...",
+          file=sys.stderr)
+    idl = attach_temporal_covering(idl, COVERING_PATH)
+
     idl_path = OUTPUT_DIR / "meos-idl.json"
     with open(idl_path, "w") as f:
         json.dump(idl, f, indent=2)
     print(f" → {idl_path} written", file=sys.stderr)

     pa = idl.get("portableAliases", {}).get("count", 0)
+    cov = idl.get("temporalCovering", {}).get("count", 0)
     print(f"\nDone: {len(idl['functions'])} functions, "
           f"{len(idl['structs'])} structs, "
           f"{len(idl['enums'])} enums, "
-          f"{pa} portable bare-name aliases", file=sys.stderr)
+          f"{pa} portable bare-name aliases, "
+          f"{cov} temporal covering types", file=sys.stderr)


 if __name__ == "__main__":